EAGER: Spatiotemporal Transformer for Activity Recognition

$280,754 | FY2023 | CSE | NSF

University Of Virginia Main Campus, Charlottesville VA

Investigators

Abstract

Understanding human activity from video is important to several applications in security, defense, medicine, robotics, manufacturing, and education. The field of computer vision explores the use of cameras and computers to automate tasks such as object recognition and activity recognition. Traditionally, researchers have developed computer vision systems by extracting the constituent features in an image or video and matching those features to models of more complex objects. More recently, machine learning methods have been applied that train a computer to perform such a recognition task from data rather than from a physical model. This project explores a learning-based object recognition approach based on learning semantic relationships between objects and people observed in video. Specifically, the project attempts to design computing methods that will automatically derive relationships between people and objects in digital video and then exploit those correlative relationships in classifying a human action (e.g., kicking a ball or shaking hands). Unlike machine learning methods developed for understanding language, the proposed solution will use elements of the video specific to understanding human action, such as detection of imaged objects, the motion of objects, and the spatial and temporal position in the video. Successful implementation of the computer vision solution will allow human activities in video to be automatically analyzed. The analysis will benefit critical tasks such as learning to perform a surgery or understanding the actions taken in an effective classroom.

Transformers are a type of neural network that uses attention to compute relationships between words in a sentence or series of sentences. The advantages of the transformer model include the ability to assess these relationships over long sequences of words and the ability to automatically process all words simultaneously via a positional encoding. Instead of taking the transformer developed for natural language and fitting it to a video problem, this project seeks to develop a video transformer from first principles. The realization of this system involves three distinct advances in the machine learning design. First, the proposed approach brings the concept of motion as a feature to the transformer by way of optical flow information encoded with time. Second, the proposed method allows interactions between geometric and motion features of action semantics to be exploited in a transformer framework. Last, the distribution-based attention model goes beyond the traditional correlative notion of attention. The proposed attention model captures significant correlations in action sequences. Together, the three theoretical contributions have the potential to significantly advance video understanding.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
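
For readers unfamiliar with the traditional, correlative attention that the abstract takes as its starting point, the following is a minimal sketch of scaled dot-product self-attention with a sinusoidal positional encoding, applied to per-frame tokens. It is not the project's method: the distribution-based attention model is not specified in the abstract and is not reproduced here. The token layout (two appearance values followed by two optical-flow values per frame), the numeric values, the identity query/key/value projections, and all function names are assumptions made for illustration only.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding of an integer position (here, a frame index)."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    dim = len(tokens[0])
    weights, outputs = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append([
            sum(row[j] * tokens[j][d] for j in range(len(tokens)))
            for d in range(dim)
        ])
    return outputs, weights


# Hypothetical per-frame features: [appearance_0, appearance_1, flow_x, flow_y].
frames = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.5, 0.1],
    [0.1, 0.9, 0.6, 0.2],
]

# Add the temporal position to each frame's features, then attend across frames.
tokens = [
    [f + p for f, p in zip(frame, positional_encoding(t, len(frame)))]
    for t, frame in enumerate(frames)
]
outputs, weights = self_attention(tokens)

for t, row in enumerate(weights):
    print(f"frame {t} attends to frames with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` sums to one and indicates how strongly one frame's token correlates with every other frame's token, which is the correlative notion of attention that the proposed model aims to go beyond.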
