A Framework for Next-Generation Scheduling and Task Management for Extreme-Scale Computing

$503,150 · FY2004 · CSE · NSF

Regents Of The University Of Michigan - Ann Arbor, Ann Arbor MI

Investigators

Abstract

Kang G. Shin and Abhijit Bose, The University of Michigan
The objective of this research project is to develop efficient algorithms and robust software for scheduling and resource management in support of extreme-scale computing. Such computations may involve tens of thousands of processors configured as a high-end computing (HEC) system. Systems this large need more efficient scheduling and resource management than is currently available in order to address scalability and fault tolerance. The same holds for the increasingly complex requirements of many emerging applications.
Current-generation schedulers primarily manage a cluster of CPUs that are configured statically for an application. An application is allocated a fixed number of processors (or nodes, for SMP systems) and is not expected to modify its resource requirements during execution, which can lower resource utilization. Furthermore, most current schedulers do not provide transparent fault tolerance to the running workload. One or more processor or node failures often terminate the task currently executing on those processors, resulting in wasted cycles. Fault-tolerant scheduling will be critical to scaling future HEC systems to an extremely large number of processors.
The increasing use of high-end systems for both computation- and data-intensive applications requires future scheduling systems to co-schedule CPUs, I/O, and network resources when mapping tasks to appropriate resources. In some HEC architectures, the placement of processes affects the overall performance of the application because of the underlying interconnection topology. Both co-scheduling and workload-aware scheduling will be important for future high-end computing systems.
An integrated software framework will be developed for scheduling and resource management in extreme-scale computing systems. It will provide the following capabilities: (i) on-line workload characterization, (ii) predictive scheduling based on time-series modeling and forecasting of resource utilization and queued workload, (iii) transparent fault-tolerant scheduling of applications that are interrupted by one or more faults, and (iv) efficient heuristic and evolutionary algorithms that take workload and resource characteristics and forecasts into account in the scheduling decision. This work combines scheduling theory with robust software development. These algorithms and the software framework will be implemented and deployed in a production HEC environment at the Center for Advanced Computing (CAC) at the University of Michigan. The CAC HEC facility provides a testbed of over 1400 CPUs representing multiple processor families (AMD Athlons, Opterons, Apple Xserve/G5) and several interconnection systems, such as Gigabit Ethernet and Myrinet.
The intellectual merit of this research is that it will advance the state of the art in scheduling and resource management for HEC systems. By implementing and deploying the proposed framework at CAC, we will be able to collect realistic workload traces from a diverse array of end users and applications. This research will serve as a catalyst for the development of robust fault-tolerant scheduling algorithms, and of the software that implements them, for such systems. We specifically address fault-tolerance mechanisms, such as checkpoint/restart and I/O, that can scale across thousands of processors. Furthermore, our proposed integration of research, outreach, and education activities will have a broader impact on other HEC centers and the scheduling research community.
The framework developed as part of this project and our research results will be disseminated to industry and academia through open-source software and high-quality publications.
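Capability (ii) in the abstract, predictive scheduling from forecasts of resource utilization, can be illustrated with a minimal sketch. The abstract does not say which time-series model the project uses, so simple exponential smoothing stands in here as an assumption. The function names `forecast_utilization` and `can_start`, the smoothing factor `alpha`, and the sample values are all hypothetical and are not taken from the project's software.

```python
def forecast_utilization(history, alpha=0.5):
    """One-step-ahead forecast of CPU utilization (a fraction in [0, 1]).

    Uses simple exponential smoothing, chosen only for illustration;
    the project's actual forecasting model is not described in the abstract.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    level = history[0]
    for sample in history[1:]:
        # Blend the newest observation with the running estimate.
        level = alpha * sample + (1 - alpha) * level
    return level


def can_start(job_cpus, total_cpus, history, alpha=0.5):
    """Return True if the forecast leaves enough free CPUs for the job."""
    predicted_busy = forecast_utilization(history, alpha) * total_cpus
    return job_cpus <= total_cpus - predicted_busy


if __name__ == "__main__":
    # Hypothetical utilization samples for a 1400-CPU testbed.
    samples = [0.5, 0.5, 0.5]
    print(can_start(600, 1400, samples))
    print(can_start(800, 1400, samples))
```

A real scheduler would also forecast the queued workload and account for job runtimes; this sketch only shows how a utilization forecast can feed an admission decision.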
