EAGER: Resilient, Energy Efficient HPC System Configuration
University of Iowa, Iowa City, IA
Investigators
Abstract
High-performance computing systems (supercomputers) continue to grow in size and complexity. Today's leading-edge systems contain tens of thousands of server nodes, and proposed next-generation systems are likely to contain hundreds of thousands of nodes. At this scale, maintaining system operation when hardware components may fail every few minutes or hours is increasingly difficult. Increasing system size brings a complementary challenge of energy availability and cost, with projected systems expected to consume ten or more megawatts of power. For future high-performance computing systems to be usable and cost effective, we must develop new design methodologies and operating principles that embody two important realities of large-scale systems: (a) frequent hardware component failures are a part of normal operation, and (b) energy consumption and power costs must be managed as carefully as performance and resilience. As part of this research, the principal investigator will apply new ideas from commercial cloud computing to HPC systems, focusing on reliability and energy efficiency. This includes models of high-performance computing system design based on right-sizing hardware building blocks to balance operating costs for component replacement and repair against capital costs for over-provisioning. It also includes incorporating energy costs and constraints into scheduling systems and resource allocations, making computing costs visible to researchers. The deployment of very large-scale computing systems, which target science, engineering, and defense problems of critical national interest, is currently limited by both system reliability and energy consumption. New design and operating approaches for reliability and energy management can both reduce costs and increase access, allowing computer companies to design larger systems, research institutions to deploy systems more widely, and researchers to better manage computational resources.