RI: Small: Dynamic Networks for Efficient, Adaptive, and Multimodal Vision

$600,000 | FY2023 | CSE | NSF

University Of California-San Diego, La Jolla CA

Investigators

Abstract

The recent introduction of a class of artificial intelligence (AI) methods known as deep learning has enabled transformational advances in computer vision. It is now possible to imagine AI systems capable of understanding and responding to a visual scene in a manner similar to people. However, much research is still needed to enable effective solutions for perceptual tasks combining multiple modalities (vision, audio, language), with great robustness to domain variability, and high efficiency in terms of computation and energy consumption. While unified solutions to these problems are critically important for applications such as robotics or the deployment of AI on edge devices, such as cellphones, the problems are usually addressed independently, leading to disconnected solutions. Hence, there is a need for basic research in neural network architectures that not only advance each of the problems per se but also contribute to the integrated solution of all problems. Drawing inspiration from the plasticity of biological brains, the project will explore the use of dynamic networks, which vary their computations depending on their input, to achieve this goal. The challenge is that, because it is impossible to dynamically predict the massive number of parameters of a modern network, this requires careful selection of the parameters to endow with dynamics. The project aims to develop algorithms for such parameter selection and investigate their benefits for a range of applications in computer vision. To achieve its objectives, the project will leverage the well-known importance of feature mixing in deep learning, implementing network dynamics efficiently via feature-based attention mechanisms that perform mixing through dynamic matrix factorizations. These are factorizations of layer weight matrices that enable the restriction of dynamic parameters to a small latent kernel per layer. 
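As a rough illustration of the idea of restricting dynamics to a small latent kernel, the sketch below models one possible factorization of a layer weight matrix, W(x) = W0 + P Φ(x) Qᵀ, in which only the k × k kernel Φ(x) depends on the input. This is an assumption made for illustration, not the project's actual formulation: the names `DynamicLinear`, `W0`, `P`, `Q`, `G`, and the tanh-based kernel predictor are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicLinear:
    """Hypothetical layer computing y = (W0 + P @ Phi(x) @ Q^T) x.

    W0, P, Q and G are static (learned once); only the small k x k
    kernel Phi(x) changes with the input.
    """

    def __init__(self, d_in, d_out, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.W0 = rand_matrix(d_out, d_in, rng)  # static weight matrix
        self.P = rand_matrix(d_out, k, rng)      # static map: latent -> output
        self.Q = rand_matrix(d_in, k, rng)       # static map: input -> latent
        self.G = rand_matrix(k * k, d_in, rng)   # assumed kernel predictor

    def kernel(self, x):
        """Predict the k x k dynamic kernel Phi(x) from the input vector."""
        k = self.k
        flat = [math.tanh(sum(g * v for g, v in zip(row, x))) for row in self.G]
        return [flat[i * k:(i + 1) * k] for i in range(k)]

    def forward(self, x):
        col = [[v] for v in x]                   # input as a column vector
        static = matmul(self.W0, col)            # W0 x
        latent = matmul(transpose(self.Q), col)  # project into the latent space
        mixed = matmul(self.kernel(x), latent)   # input-dependent mixing
        dynamic = matmul(self.P, mixed)          # project back to the output
        return [s[0] + d[0] for s, d in zip(static, dynamic)]


layer = DynamicLinear(d_in=8, d_out=6, k=2)
y = layer.forward([0.5] * 8)
print(len(y), "outputs;", layer.k ** 2, "dynamic parameters vs", 6 * 8, "in the full matrix")
```

In this toy setting, only k² = 4 values have to be predicted per input, rather than all 48 entries of the weight matrix, which is the kind of saving the factorization is meant to provide.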
Two classes of factorization are proposed, enabling the implementation of dynamic attention by parameter synthesis or non-linear transformations. These factorizations are then proposed as a unified substrate for the design of perception algorithms with state-of-the-art performance for multimodal perceptual tasks, such as visual grounding or audio-visual spatial localization. These algorithms will be implementable with extremely lightweight deep learning networks to enable high computation and energy efficiency, and are applicable across architectures that range from convolutional neural networks (CNNs) to transformers. Finally, they will have strong robustness to variability of operating environments and so can support challenging applications such as home robotics or perception on edge devices. The research has applicability in areas of societal relevance, such as manufacturing, self-driving vehicles, intelligent systems, assisted living, and homeland security. Educationally, the project will provide exciting opportunities for both graduate and undergraduate research. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
