RI: Small: Visual How: Task Understanding and Description in the Real World

$262,237 | FY2022 | CSE | NSF

University Of Minnesota-Twin Cities, Minneapolis MN

Investigators

Abstract

Problem solving is an innate capability that humans develop through evolution and experience. Compared to human intelligence, which can solve general and complex problems, current AI systems perform well only on narrow, structured tasks. With the overarching goal of bridging this gap, this project develops AI systems that can understand general real-world tasks (e.g., How to set up a tent? How to teach kids to garden? How to travel in London?) and produce solutions with step-by-step language and visual guidance. This will allow real-world tasks to be solved even in general and complex circumstances, resulting in more human-like AI. Ultimately, the project will take a step toward artificial general intelligence. The project will provide a publicly available dataset, a framework of computational models, and a mobile application prototype. Furthermore, this project will support integrated research and education with a focus on increasing minority participation through K-12 outreach, mentoring of underrepresented and undergraduate students, and curriculum development.

This project proposes a VisualHow problem that represents a rich spectrum of real-world tasks. The generality and complexity of the problem call for capabilities to understand the visual and textual contents of the task, reason with knowledge relevant to the task, and generate step-by-step multimodal descriptions of how the task can be completed. This project aims to achieve these goals through three tasks. First, it will generate a new dataset of diverse real-world tasks and solutions, with rich annotations of key semantics and task structures to guide the multimodal attention and structural reasoning. Second, it will develop a novel framework in which a series of models are derived for explainable VisualHow learning to understand the visual-textual contents and generate steps to complete real-world tasks. Third, it will develop novel methods to generalize the models with knowledge and validate them on mobile platforms to assist people in real-world applications. Achieving these goals will not only lead to new vision-language tasks and computational methods for real-world problem solving, but also spur innovations in the development of explainable and generalizable AI models and systems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
