CAREER: Geometry, Physics and Semantics from Motion: Learning Expressive and Space-Aware Video Representations
Carnegie Mellon University, Pittsburgh PA
Investigators
Abstract
This project develops view-invariant 3D visual representations for visual recognition, robot control and language grounding that support scene understanding. The project minimizes the human annotation effort required for effective 3D visual recognition. The project will inject common sense and affordability reasoning into vision, language and control. It will also introduce learning paradigms for visuomotor representations supervised by embodiment, interaction, and human demonstrations and narrations, just as humans learn. The project will be instrumental in controlling vision-enabled mobile agents, such as ground vehicles and drones, to bring AI systems closer to human-level performance in visual reasoning. It will further establish connections between AI research and computational neuroscience and cognitive psychology by suggesting learning paradigms similar to those of humans, powered by embodiment and prediction, and by exploring inductive biases, such as motion/appearance disentanglement, that need to be integrated into current computational models to enable the type of reasoning humans are capable of with an appropriate amount of training. The research of this project will be integrated with the educational program of the investigator, and the results of this research will be disseminated to research communities.
This research introduces visual feature representations that decompose RGB and RGB-D streams into scene appearance and motion for the camera and the objects. Appearance encodes properties that persist over time, such as semantics, material properties and shape, and motion encodes properties that vary quickly over time, such as camera motion, object locations and poses, and non-rigid object deformations.
The project envisions embodied agents, equipped with cameras to observe the world and end-effectors to interact with it, that learn to distill their visuomotor experiences into 3D feature representations of scene appearance and of its temporal, action-conditioned dynamics. The new video representations learn to encode object properties and spatial common sense, such as world object size, 3D extent, shape, semantics, material properties and object permanence, by optimizing self-supervised objectives of view prediction, time frame prediction, and action-conditioned prediction. The representations enable processing a video stream in terms of objects and their temporal pose and deformation trajectories in 3D, without cross-object interference during occlusions.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.