CRII: SHF: Optimizing Deep Learning Training through Modeling and Scheduling Support

$174,990FY2018CSENSF

Board Of Regents, Nshe, Obo University Of Nevada, Reno, Reno NV

Investigators

Abstract

Deep learning models trained on large amounts of data using lots of computing resources have recently achieved state-of-the-art training performance on important yet challenging artificial intelligence tasks. The success of deep learning has attracted significant research interest from hardware and software communities to improve training speed and efficiency. Despite the great efforts and rapid progress made, one important bridge to connect software and hardware support with deep learning domain knowledge is still missing: efficient configuration exploration and runtime scheduling. Both the quality of deep learning models and the training time are very sensitive to many adjustable parameters that are set before and during the training process, including the hyperparameter configurations (such as learning rate, momentum, number and size of hidden layers) and system configurations (such as thread parallelism, model parallelism, and data parallelism). Efficient exploration of hyperparameter configurations and judicious selection of system configurations is of great importance to find high-quality models with affordable time and cost. This is however a challenging problem due to a huge search space, expensive training runtime, sparsity of good configurations, and scarcity of time and resources. The objective of this research work is to systematically study the unique properties of deep learning systems and workloads, and establish new modeling and scheduling methodologies for improving deep learning training. The PI aims to improve the efficiency of discovering high performing models through a dynamic scheduling methodology driven by a novel hyperparameter configuration classification approach. The PI aims at developing an accuracy- and efficiency-aware hybrid scheduling methodology that makes judicious scheduling decisions based on a global view of both the time dimension (accuracy potential) and spatial dimension (efficiency potential) information. This research work integrates techniques in workload characterization, performance modeling, resource management, and scheduling to dramatically speedup the training process while significantly reducing the cost in time and resources. More broadly, this project will gain foundational knowledge about the interaction between software-hardware support and deep learning domain knowledge. This knowledge can help design next generation deep learning systems and frameworks, making deep learning training handy for researchers and practitioners with limited system and machine learning domain expertise. This research will help enhance curriculum and provide research topics for both undergraduate and graduate students, especially students from underrepresented groups. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →