CSR: EAGER: An Integrated Framework for Performance and Reliability in Large-scaled Computing Systems
Northeastern University, Boston MA
Investigators
Abstract
Large-scale computing environments such as data centers and cloud platforms are becoming the core computing infrastructure, which makes the availability of their services critical. However, these environments are increasingly vulnerable to both hardware and software failures. This project designs failure-aware techniques for modeling, prediction, and resource management in large-scale computing environments in the presence of hardware and software failures at various levels. Intellectually, this project develops a fundamental understanding of workload and reliability characteristics, and investigates how improved capacity planning models and prediction techniques can yield useful information for system design and maintenance. The project further provides insight into the impact of software and hardware component failures on resource management. The results of this project will include new capacity planning models that evaluate both the reliability and the performance of a given system, and new prediction techniques that forecast future failure occurrences by taking advantage of temporal dependence in failure events. Building on these modeling and prediction techniques, the project will develop new failure-aware runtime strategies for job scheduling, node allocation, and system maintenance, aiming to achieve high system performance and reliability in complex large-scale systems.