SHF: Small: Improving Efficiency of Vision Transformers via Software-Hardware Co-Design and Acceleration
University Of California-Davis, Davis CA
Investigators
Abstract
Transformer models are a relatively recent breakthrough in machine learning that has revolutionized natural language processing and improved the generalization of computer vision models. However, wide adoption of transformer models requires making them significantly more energy efficient: the models are highly complex, and existing hardware is not optimized to execute them efficiently. In this project, researchers are exploring two interrelated research problems to tackle transformer efficiency: 1) creating new transformer models that can be dynamically pruned to improve efficiency without sacrificing accuracy, and 2) designing specialized hardware to make the execution of transformers more efficient. The impact of this project is significant in several ways. First, it promotes scientific progress by advancing the research community's understanding of attention as the primary mechanism in transformer models and of how it can be used for context-aware pruning of complex learning models. Second, it extends the hardware community's knowledge of designing advanced solutions for dynamic precision tuning and scheduling of complex and tunable hardware systems. Additionally, the project supports diversity and higher education at the University of California, Davis (UC Davis) while improving education by integrating research into teaching in UC Davis classes. Ultimately, the project's success will benefit society by making superior transformer models more accessible across various applications.
In their new transformer model, researchers explore an incremental sampling approach that processes input images across encoder layers, gaining contextual awareness progressively. They aim to leverage this incremental contextual awareness to remove unattended tokens and mask unimportant input patches in new samples.
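To make the idea of removing unattended tokens concrete, the following is a minimal sketch of attention-based token pruning. It is not the project's method: the function name `prune_tokens`, the `keep_ratio` parameter, and the use of a single per-token attention score (for example, the attention a class token pays to each patch) are all assumptions made for illustration.

```python
import math


def prune_tokens(tokens, attention_scores, keep_ratio=0.5):
    """Keep the most-attended tokens, preserving their original order.

    Hypothetical helper: `attention_scores[i]` is assumed to be one
    importance score for `tokens[i]`, e.g. class-token attention.
    """
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must have equal length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of tokens that survive this pruning step (at least one).
    num_keep = max(1, math.ceil(len(tokens) * keep_ratio))

    # Rank token indices from most to least attended.
    ranked = sorted(
        range(len(tokens)), key=lambda i: attention_scores[i], reverse=True
    )

    # Restore positional order among the survivors.
    kept = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept], kept


patches = ["p0", "p1", "p2", "p3"]
scores = [0.05, 0.40, 0.10, 0.45]
kept_patches, kept_indices = prune_tokens(patches, scores, keep_ratio=0.5)
```

With the scores above, the two most-attended patches (`p1` and `p3`) survive. In a progressive scheme such as the one the abstract describes, a step like this could in principle be applied after successive encoder layers, so that later layers process fewer tokens; how the project actually selects tokens is not specified here.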
Additionally, researchers explore learning-based and context-aware attention-head dropping, encoder-layer skipping, and early termination for coarse-grained model pruning. To improve the transformer model's inference efficiency, the researchers explore architecting a stochastic pre-processing unit that approximates matrix-matrix multiplication in support of attention-based model-pruning classifiers for patch, token, attention-head, and encoder elimination. To build the hardware accelerator's multiplication and accumulation (MAC) units, researchers explore a novel solution for temporal carry-bit deferment that eliminates carry-bit propagation in the MAC. This solution simplifies the MAC logic, enhancing stream-processing speed and efficiency. Furthermore, researchers aim to leverage the shallow logic depth of the new MAC to design highly efficient diffusible MACs, enabling dynamic precision trade-offs in the processing array. Researchers also investigate developing a scheduler that balances workload and minimizes memory accesses when operating on sparse attention graphs. To optimize performance, the scheduler supports out-of-order token processing, processing-element clustering, in-order completion, and precision-aware scheduling.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
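As background for the MAC design mentioned above, the general idea of deferring carries during accumulation can be illustrated with classic carry-save addition. This is a well-known stand-in technique, not the project's novel temporal carry-bit deferment scheme, whose details the abstract does not give. The function names are hypothetical, operands are assumed to be non-negative integers, and the multiplication itself is left as an ordinary product.

```python
def carry_save_add(sum_bits, carry_bits, operand):
    """Fold one operand into a (sum, carry) pair with no carry propagation.

    Each output bit depends only on the three input bits at that position
    (a 3:2 compressor), so the logic depth does not grow with word width.
    """
    partial_sum = sum_bits ^ carry_bits ^ operand
    deferred_carry = (
        (sum_bits & carry_bits) | (sum_bits & operand) | (carry_bits & operand)
    ) << 1
    return partial_sum, deferred_carry


def mac_deferred_carry(weights, activations):
    """Accumulate products while keeping carries deferred until the end.

    Assumes non-negative integer inputs (illustrative only).
    """
    sum_bits, carry_bits = 0, 0
    for w, a in zip(weights, activations):
        sum_bits, carry_bits = carry_save_add(sum_bits, carry_bits, w * a)
    # A single carry-propagating addition resolves the deferred carries.
    return sum_bits + carry_bits


result = mac_deferred_carry([3, 5, 2], [4, 1, 7])
```

Here `result` equals the ordinary dot product 3*4 + 5*1 + 2*7 = 31. The point of the sketch is that every accumulation step uses only bitwise operations, and the one carry-propagating addition is paid once per dot product instead of once per term.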