Collaborative Research: SHF: SMALL: Compile-Parallelize-Schedule-Retarget-Repeat (EASER) Paradigm for Dealing with Extreme Heterogeneity

$250,000 | FY2022 | CSE | NSF

College Of William And Mary, Williamsburg VA

Abstract

Heterogeneity in computing refers to the presence of a variety of devices within one computing system, or even within one node of a cluster. Several technological trends are making a high degree of heterogeneity inevitable in High Performance Computing (HPC), which has prompted research in many directions. The traditional scheduling problem, in which a set of programs to be executed is mapped to the available resources, becomes more complicated in the presence of such heterogeneity because the scheduler must also interact with the compiler. The goal of this project is to consider new paradigms for application execution in view of these developments and to conduct research on execution-time prediction, compilation, parallelization, and scheduling.
Traditionally, three steps are carried out sequentially and independently: deciding (likely manually) how an application is to be parallelized, compiling it, and scheduling it at the cluster level. The investigators posit that treating these steps in isolation will not be acceptable when optimizing for multi-tenant heterogeneous clusters. Instead, they envision a paradigm referred to as EASER (compilE-pArallelize-Schedule-rEtarget-Repeat). In the EASER paradigm, the compiler first maps the core functions to a specific device and generates predictions of execution time. These predictions are input to the parallelization approach selection module, and together the two produce a final executable. This binary is then presented to the scheduler, which assesses the job queue and might suggest alternative configuration(s) or device(s). If so, a retargeting module is invoked, potentially leading to a repetition of the above steps. This project develops, supports, and evaluates the EASER framework in the context of a cluster that executes emerging machine learning (ML) workloads.
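The compile, parallelize, schedule, retarget, repeat cycle described above can be pictured with a toy control loop. The sketch below is only an illustration and is not the project's framework: the device names, the strategy names, the predicted times in `PREDICTIONS`, the queue-wait model, and the functions `compile_and_parallelize`, `schedule`, and `easer` are all hypothetical and chosen for readability.

```python
from dataclasses import dataclass

# Hypothetical predicted execution times in seconds, per device and
# parallelization strategy. A real system would derive these from
# compiler-driven performance models; here they are made-up constants.
PREDICTIONS = {
    "gpu": {"data_parallel": 10.0, "model_parallel": 14.0},
    "cpu": {"data_parallel": 40.0, "model_parallel": 55.0},
}


@dataclass(frozen=True)
class Build:
    """Stand-in for a final executable targeted at one device."""
    device: str
    strategy: str
    predicted_seconds: float


def compile_and_parallelize(device: str) -> Build:
    """Compile for `device` and select the strategy with the lowest prediction."""
    options = PREDICTIONS[device]
    strategy = min(options, key=options.get)
    return Build(device, strategy, options[strategy])


def schedule(build: Build, queue_wait: dict) -> "str | None":
    """Suggest another device if the job would finish strictly sooner there.

    `queue_wait` maps each device to an assumed wait time in seconds.
    Returns None when the current target is already the best choice.
    """
    def finish(device: str) -> float:
        return queue_wait[device] + min(PREDICTIONS[device].values())

    best = min(queue_wait, key=finish)
    return best if finish(best) < finish(build.device) else None


def easer(initial_device: str, queue_wait: dict, max_rounds: int = 3) -> Build:
    """Repeat compile/parallelize -> schedule -> retarget until no change is suggested."""
    device = initial_device
    for _ in range(max_rounds):
        build = compile_and_parallelize(device)
        suggestion = schedule(build, queue_wait)
        if suggestion is None:
            return build
        device = suggestion  # retarget, then repeat the steps above
    return compile_and_parallelize(device)


if __name__ == "__main__":
    # The GPU queue is assumed to be long, so the toy scheduler retargets to the CPU.
    print(easer("gpu", {"gpu": 100.0, "cpu": 5.0}))
```

In this toy setting the loop stops as soon as the scheduler has no better suggestion, and `max_rounds` bounds the number of retargeting passes; both choices are assumptions made for the sketch rather than properties stated in the abstract.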
Research is proposed in the following areas:
1) Compiler-Driven Performance Prediction -- This area includes a novel strategy comprising a general model for predicting SIMD/VLIW performance and an operator-classification-based approach to developing a memory hierarchy performance model.
2) Integrated Job Scheduling and Parallelization Strategy Selection -- Building on the performance prediction models, these two conventionally independent modules are integrated by including parameterized and incremental parallelization strategy selection methods and by aggressively reducing the search space in scheduling methods.
3) Retargeting Compiler -- By classifying optimizations as either architecture-dependent or architecture-independent, a retargeting compiler for ML workloads will be developed.
This project will also make several contributions to education and human resource development. Both investigators will introduce courses or course material at the intersection of computer systems and machine learning, bringing attention to ML-related workloads in computer systems education. A majority of funds at each university will be used to support Ph.D. students in their research, who will be trained to work across traditional (sub-)areas. Both investigators are strongly committed to increasing diversity in computing fields and have a strong record of supervising members of underrepresented groups in their research programs. Building on their universities' existing connections, they will work further on improving diversity at all levels.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
