CAREER: A Compression-Based Approach to Learning Video Representations

$497,466 · FY2019 · CSE · NSF

University of Texas at Austin, Austin, TX

Investigators

Abstract

An ever-increasing amount of our digital communication, media consumption, and content creation revolves around videos. We share, watch, and archive many aspects of our lives through them. However, designing and learning representations to understand these videos has proven challenging. Direct extensions of sequence or image-based convolutional neural networks to videos have yielded only moderate success. The goal of this project is to develop efficient, robust, and compact video representations. Every percent increase in the compression rate from this project translates into decreased internet traffic and more storage efficiency, reducing the massive economic and environmental costs of modern digital infrastructure. Any increase in recognition accuracy results in safer autonomous agents, more responsive surveillance and assistive technologies for the elderly, and a deeper understanding of video dynamics in sports and entertainment. Furthermore, this research will translate to the classroom through updated and new undergraduate and graduate-level courses on video recognition and compression.

The technical aim of this project is divided into four thrusts. The first thrust develops video recognition models inspired by video compression. The video compression community developed sophisticated, compact, and efficient representations for video, used to store the bulk of digital media. The project will study what video compression can teach us about video representations, and how modern codec design can drive the structure of deep video models. The second thrust brings concepts from video recognition back to compression. The interplay between compression and recognition is not a one-way street. The project will investigate how video compression can be learned directly from data, side-stepping many of the manual design choices, and how video compression can learn to be robust to missing or corrupted information. The research team will develop a novel interpretation of video compression as repeated image interpolation. This interpretation opens the door to learned deep video compression algorithms. The third thrust studies the optical representation of motion for both recognition and compression tasks. At the core of both video compression and recognition lies a good representation of motion. The motion fields will be represented in a compact, compressible, temporally consistent, and easy-to-understand manner. Finally, the fourth thrust finds new supervisory signals, evaluation tasks, and their associated data.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
