RI: Small: Learning to Read, Ground, and Reason in Multimodal Text
University Of Washington, Seattle WA
Investigators
Abstract
Web data, news, and textbooks offer informative but unstructured multimodal text. The ability to translate multimodal text into a semantic representation that is amenable to further reasoning is a key step toward taming information overload, one of the fundamental problems in modern AI. Designing systems that can understand and use multimodal text requires multiple interconnected components: semantic interpretation, multimodal alignment, knowledge acquisition, and reasoning. Most previous work has focused on a single component in isolation and ignored the crucial high-order interdependencies between these tasks. This proposal aims to build a unified framework for learning to read, ground, and reason in multimodal textbooks. The framework will include three interconnected components: context-aware visual and textual interpretation, knowledge acquisition and representation, and reasoning. This work is designed for significant social impact through a broad range of applications, including education and accessibility. Advances in textbook understanding and question answering could help in designing an automatic personalized tutoring system that teaches students algebra, geometry, and science topics. Advances in visual interpretation and multimodal knowledge could make diagrammatic information accessible to visually impaired individuals. The project will also provide education, research, and collaborative experience for undergraduate and graduate students, including those from under-represented and minority groups.
The proposed framework is designed to iteratively read multimodal textbooks in context, acquire knowledge, interpret data, update and prune the acquired knowledge, and finally reason about queries. A core challenge is performing robust, scalable, context-aware semantic analysis and reasoning over multimodal text. The proposal is organized in three main thrusts that build upon each other toward the complete framework. First, the project proposes a precise reasoning algorithm for narratives, developed in the setting of learning to solve algebra word problems. The proposed algorithm will learn to combine local contextual cues into a novel semantic structure using the global context of the narrative. Second, it proposes to build an automated system for interpreting and reasoning in multimodal text by learning to ground text and diagrams in a formal representation and by developing a new reasoning algorithm that solves problems posed in that representation. Finally, it will construct a novel, principled machine learning framework for knowledge acquisition, interpretation, and reasoning in multimodal texts, specifically science textbooks. The proposed framework will be applied in conversational dialogs and personalized tutoring systems.
The key contributions will include a unified framework for learning to read, ground, and reason in multimodal textbooks, along with new algorithms for joint multimodal text and diagram interpretation, precise understanding of narratives, gradual knowledge acquisition, and reasoning.