Elements: Development and Dissemination of a Slurm Simulator

$590,305 | FY2020 | CSE | NSF

SUNY at Buffalo, Amherst, NY

Abstract

Slurm is an open-source resource management and job scheduling system that is widely used on small and large high-performance computing (HPC) systems. Slurm is highly tunable, with many settings that can significantly influence job throughput, overall system utilization, and job wait times. Unfortunately, in many cases it is difficult to judge how modifying these settings will affect the overall performance of the HPC resource. This project develops a prototype Slurm simulator that allows HPC personnel to tune Slurm parameters, either to optimize throughput or to meet specific workload objectives, without impacting an HPC system in production.

The proposed Slurm simulator could have impacts far beyond computer science and the field of scheduling. The ability to optimize scheduler parameters on production HPC resources, most of which are significantly oversubscribed, has the potential to dramatically improve job throughput for researchers and reduce the ‘time to science’. The simulator would allow many different job scheduling schemes to be rapidly evaluated, then implemented if they are useful or rejected if they are not. Furthermore, the effect of various factors that influence job execution time (such as network traffic, parallel file system loads, and node sharing) can be modeled and explored to determine whether they merit additional scheduler development work. The developed framework would allow researchers in other science and engineering fields to incorporate their own models and study a range of problems affecting the performance and execution of users' jobs on HPC systems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
