CAREER: Toward Spatial-Temporal Architectures with Deformable and Interpretable Convolutions

$521,726 · FY2018 · CSE · NSF

Oregon State University, Corvallis OR

Investigators

Abstract

Artificial neural networks have been applied successfully to the analysis of visual imagery. The goal of this project is to build a convolutional neural network (CNN) that scales and deforms automatically so that it is invariant to object size and pose. Currently, CNNs do not perform well even on an image rescaled to twice or half its original size unless they were trained on the rescaled image. This leads to considerable redundancy in the model and unnecessary complication of the architecture. This project explores approaches to automatically determine the correct scaling, as well as other transformations, from visual objects in images and videos. The proposed methods will also make convolutional neural networks easier to interpret and reduce the amount of data needed to train a network. Besides standard computer vision benchmarks, the research team evaluates the approach through collaborations that apply the technologies to different applications, such as forestry and tumor-cell morphology. The educational goal of this project involves developing a new "what-you-see-is-what-you-get" (WYSIWYG) deep learning toolbox that enables people without much programming or mathematical skill to use deep learning for data analysis. The research team also plans outreach to high schools and community colleges to introduce more than 100 students to deep learning and visual object recognition.

This research develops spatial-temporal CNNs that scale and deform automatically and can therefore concisely represent object recognition models that generalize better under invariant and equivariant transformations unseen in the training set. The project explores novel auto-scaling and multi-deformable convolutional network architectures that use parametric motion fields to automatically locate the correct deformations of a visual object for each convolutional filter. To learn the motion fields from video, the research team uses a Siamese convolutional-deconvolutional network that predicts boundaries in two consecutive frames and uses an output-to-output feedback loop to deduce boundary motion. The research team applies this approach to video segmentation and uses it to generate annotations for weakly supervised learning of the motion fields. The approach is evaluated on several tasks with limited annotations, such as video segmentation, multi-target tracking, and object classification and detection in videos under unseen deformations.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
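
As an illustration of the parametric motion fields mentioned in the abstract, the following is a minimal sketch of how a filter's sampling grid could be deformed before the filter is applied. The abstract does not specify the parametric family, so the affine form (matrix `A`, translation `t`) is an assumption made here for illustration, and the function names `affine_sampling_grid`, `bilinear_sample`, and `deformed_filter_response` are hypothetical rather than taken from the project.

```python
import numpy as np


def affine_sampling_grid(kernel_size, A, t):
    """Return (k*k, 2) sampling offsets (dy, dx) for a k x k filter.

    The regular grid is warped by an assumed affine motion field p -> A @ p + t.
    """
    r = kernel_size // 2
    ys, xs = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    return grid @ np.asarray(A, dtype=float).T + np.asarray(t, dtype=float)


def bilinear_sample(image, y, x):
    """Bilinearly interpolate a 2-D image at a real-valued location (y, x)."""
    h, w = image.shape
    y = float(np.clip(y, 0, h - 1))  # clamp to the image border
    x = float(np.clip(x, 0, w - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def deformed_filter_response(image, kernel, center, A, t):
    """Apply a square, odd-sized filter at `center` on the warped sampling grid."""
    offsets = affine_sampling_grid(kernel.shape[0], A, t)
    cy, cx = center
    samples = np.array(
        [bilinear_sample(image, cy + dy, cx + dx) for dy, dx in offsets]
    )
    # Offsets are generated in row-major order, matching kernel.ravel().
    return float(samples @ kernel.ravel())
```

With `A` set to the identity and `t` to zero, this reduces to an ordinary filter response at `center`; with `A = 2 * np.eye(2)`, the same 3 x 3 filter samples a 5 x 5 neighborhood at every other pixel, which is one simple way a single filter could cover an object at twice the scale. In the project's setting the parameters would presumably be predicted by the network rather than fixed by hand.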
