CAREER: Representing, Discovering, and Assembling Motifs for Video Understanding

$471,359 | FY2023 | CSE | NSF

University of Maryland, College Park, College Park, MD

Investigators

Abstract

This project will build innovative technology to allow computers to understand temporal phenomena, such as human actions in videos, with the potential to transform applications across security, health, and robotics. Temporal phenomena exhibit structure at various time scales: long events (e.g., breakfast) are composed of multiple long-term activities (e.g., cooking an omelet), which in turn are composed of various atomic actions (e.g., cutting onions). This project will develop computational representations of temporal phenomena that capture how concepts evolve over time by accentuating temporal cues in videos. Leveraging these representations, this work will create software that can discover distinctive recurring atomic actions from video collections and learn to compose these atomic actions to understand and explain long-term, complex temporal phenomena. The outcomes of this project will enable computers to be significantly better at recognizing complex activities, detecting anomalies, and forecasting future actions. The developed technologies have the potential to address challenges in several areas, such as analyzing weather patterns and forecasting extreme events, provenance search for malicious content on the internet, and indexing and searching internet-scale video collections using natural activity-level queries. Integrated with the research is a comprehensive plan for education, mentoring, and outreach, including training students in research at multiple levels, contributing to curriculum development for undergraduate and graduate courses, and designing outreach programs to attract diverse students at multiple levels.

At a technical level, this project will address fundamental challenges in understanding long-term temporal phenomena. At the core of this project is the notion of ‘motifs’, distinctive repeating temporal patterns that can be assembled into long-term narratives, such as activities. The research program seeks advances in three key areas: (a) unsupervised temporal representation learning, where this work will develop disentangled temporal representations that can better model temporal phenomena; (b) discovering motifs, which will develop a large-scale framework for discovering distinctive repeating temporal patterns, treated as atomic actions, from unlabeled videos, which can help understand long-term activities; and (c) learning to assemble motifs to decompose actions. The project presents scalable strategies to learn implicit and explicit stochastic grammars of actions from unlabeled videos, revisiting a long-standing problem using contemporary data-driven methods. This research effort provides a roadmap for exciting new research directions in video understanding.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
