EAPSI: Applying Machine Learning Techniques to Predict Task High Performance Computing Performance on a Variety of Execution Platforms

$5,400 · FY2016 · O/D · NSF

Grossman Jonathan M, Houston TX

Investigators

Abstract

Ten years ago, the highest performing computing systems in the world were homogeneous and specialized: they were composed of a single processor architecture and they executed primarily scientific workloads. Today, we see much more diversity in both the hardware platforms and applications. Many of the world's largest computing clusters contain multiple processor architectures and execute a wider variety of application workloads in science, data analytics, genomics, and medical imaging. The choice of software execution platforms has also expanded. This increased complexity in multiple dimensions makes efficiently scheduling workloads on high-performance computing (HPC) systems more challenging. Even worse, our ability to reason about the behavior of automated scheduling systems usually diminishes as system complexity increases. This project takes steps to address these problems by experimenting with machine learning techniques for predicting task performance on a variety of execution platforms, including the Java Virtual Machine (JVM), native CPU threads, and native GPU threads. This research will be conducted under the mentorship of Professor Hironori Kasahara, Director of the Advanced Multicore Processor Research Institute at Waseda University in Tokyo, Japan. The work has the potential to positively impact present and future high-performance computing applications in both industry and research labs that run on heterogeneous platforms. The project will focus on the development of a novel framework for automatic platform selection on heterogeneous systems, as well as on understanding how efficient, accurate, and analyzable modern techniques are for offline and online performance model training. Our proposed approach is hybrid, combining one-time offline training of performance models with continuous online training to produce a predictive performance model.
We will construct and open source a general-purpose automatic platform selection framework to validate our approach. Rather than focusing on evaluating a single technique, we will use this framework to perform a comprehensive survey of proposed machine learning techniques for automatic platform selection to understand the tradeoffs of each in terms of performance, accuracy, and analyzability. This award under the East Asia and Pacific Summer Institutes program supports summer research by a U.S. graduate student and is jointly funded by NSF and the Japan Society for the Promotion of Science.
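
The hybrid offline/online idea described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the project: the abstract does not name a model, so a per-platform linear runtime model trained by stochastic gradient descent stands in for whichever techniques the survey evaluates, and the names `LinearPerfModel`, `PlatformSelector`, `train_offline`, `observe`, and the task feature vector are hypothetical.

```python
# Illustrative sketch only: a linear model and SGD are assumed stand-ins,
# not the techniques or the framework described in the award.

PLATFORMS = ("jvm", "native_cpu", "native_gpu")  # platforms named in the abstract


class LinearPerfModel:
    """Predicts task runtime as w . x + b from a task feature vector x."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr  # step size, assumes features are scaled to roughly [0, 1]

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, runtime):
        # One SGD step on the squared error between prediction and measurement.
        err = self.predict(x) - runtime
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


class PlatformSelector:
    """Keeps one performance model per platform and picks the fastest prediction."""

    def __init__(self, n_features):
        self.models = {p: LinearPerfModel(n_features) for p in PLATFORMS}

    def train_offline(self, samples, epochs=500):
        # One-time offline phase: repeated passes over profiled runs.
        # samples: iterable of (platform, features, measured_runtime).
        samples = list(samples)
        for _ in range(epochs):
            for platform, x, runtime in samples:
                self.models[platform].update(x, runtime)

    def select(self, x):
        # Choose the platform with the lowest predicted runtime.
        return min(PLATFORMS, key=lambda p: self.models[p].predict(x))

    def observe(self, platform, x, runtime):
        # Continuous online phase: refine the model after each real execution.
        self.models[platform].update(x, runtime)
```

In this sketch a scheduler would call `train_offline` once on profiling data, call `select` for each incoming task, and feed the measured runtime back through `observe`, so the model can adapt if a platform's behavior drifts after deployment.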
