CAREER: Visual Question Answering (VQA)

$104,201 · FY2016 · CSE · NSF

Virginia Polytechnic Institute And State University, Blacksburg VA

Investigators

Abstract

This project addresses the problem of Visual Question Answering (VQA). Given an image and a free-form natural language question about the image (e.g., "What kind of store is this?", "How many people are waiting in the queue?", "Is it safe to cross the street?"), the machine's task is to automatically produce a concise, accurate, free-form, natural language answer ("bakery", "5", "Yes"). VQA is directly applicable to a variety of high-societal-impact applications in which humans elicit situationally relevant information from visual data and in which humans and machines must collaborate to extract information from pictures. Examples include helping visually impaired users understand their surroundings, helping analysts make decisions based on large quantities of surveillance data, and supporting interaction with a robot. This project has the potential to fundamentally improve the way visually impaired users live their daily lives and to revolutionize how society at large interacts with visual data. This research builds on the observation that VQA represents not a single narrowly defined problem (e.g., image classification) but rather a rich spectrum of semantic scene understanding problems and associated research directions. Each question in VQA may lie at a different point on this spectrum: from questions that directly map to existing well-studied computer-vision problems ("What is this room called?" = indoor scene recognition) all the way to questions that require an integrated approach of vision (scene), language (semantics), and reasoning (understanding) over a knowledge base ("Does the pizza in the back row next to the bottle of Coke seem vegetarian?"). Consequently, this work is organized as a sequence of waypoints along this spectrum.

To address VQA from a variety of perspectives, this research program is generating new datasets, knowledge, and techniques in (i) pure computer vision, (ii) integrating vision + language, (iii) integrating vision + language + common sense, (iv) building interpretable models, and (v) combining a portfolio of methods. In addition, novel contributions are being made to (a) training the machine to be curious and actively ask questions to learn, (b) using VQA as a modality to learn more about the visual world than what existing annotation modalities allow, and (c) training the machine to know what it knows and what it does not.
