SHF: Small: Sparsity-Aware Hardware Accelerators for Natural Language Processing with Transformers

$499,999FY2020CSENSF

Suny At Stony Brook, Stony Brook NY

Investigators

Peter Mildercontact Hansen Schwartz Michael Ferdman Niranjan Balasubramanian

Abstract

Natural Language Processing (NLP) enables people to interact with machines in the same manner as with each other. More importantly, it provides machines with the ability to access the information and knowledge that are readily available in books, articles, and various unstructured documents. Because the quality and usability of NLP-powered services depends primarily on the quantity of text the system is able to process, the computational demands of advanced NLP applications far exceed the capabilities of general-purpose computers and continue to grow. This project aims to greatly improve the performance of NLP applications based on transformers, a class of neural networks used in most state-of-the-art NLP technology. This project will significantly improve performance and efficiency for NLP applications, enabling their widespread deployment in emerging datacenters and thus enhancing the quality of human interactions with machines and each other. This project advances the state of the art of accelerators (hardware and compilers) for natural language processing, focusing primarily on sparsity-aware inference in large multi-layered self-attention based models, which have so far received limited attention from the architecture community. The project also advances NLP knowledge of sparse attention functions, studies design techniques that allow for repurposing pre-trained models to run faster, and improves the effectiveness in applications which diverge from its training setting. The investigation focuses on the key observation that the massive growth in computational complexity can be mitigated by dynamically identifying inherent sparsity and ineffectual computation in models, refitting the model to induce sparsity with the goal of either approximating or entirely avoiding parts of the computation that have limited impact on the model results. This investigation will demonstrate the performance improvement obtained by these techniques, leveraging sparsity and dynamic predictions within a novel sparsity-aware hardware acceleration framework, implemented on a field-programmable gate array (FPGA). This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →