GGrantIndex
← Search

CSR: Small: Automatic Storage and Network Contention Management for Large-scale High-performance Computing Systems

$450,000FY2015CSENSF

University Of California-Santa Cruz, Santa Cruz CA

Investigators

Abstract

High performance computing is essential to science, industry, and the environment, from resource exploration to the design of the next generation of consumer electronics. These high performance computer systems are among the most complex and expensive computer systems and require that their resources be used in the most efficient manner. Many of the applications that utilize high performance computing are data-intensive, and storage system performance is a crucial aspect of system performance. However, storage systems are notoriously sensitive to contention caused by competition among storage clients for limited bandwidth and disk access. This is a significant problem for shared storage systems. This project provides an automatic storage contention alleviation and reduction system (ASCAR) for large-scale high-performance storage to increase bandwidth utilization and fairness of resource allocation. ASCAR uses machine learning methods combined with several heuristics to discover the fittest control strategy. It is a highly scalable and fully automatic storage contention and congestion management system, which can improve the efficiency of both legacy and new systems, with no need to change either server hardware/software or existing applications. ASCAR regulates I/O traffic from the client side using a rule based algorithm. It employs a shared-nothing design and requires no runtime coordination between clients or with a central coordinator whatsoever, because runtime coordination is slow and unscalable. The effectiveness of ASCAR relies on the quality of traffic control. The research team has designed a prototype algorithm, the SHAred-nothing Rule Producer (SHARP), which produces rules in an unsupervised manner by systematically exploring the solution space of possible designs. Starting from one initial rule, SHARP uses heuristics similar to random-restart hill climbing to find the optimal parameters without the need for an exhaustive search. ASCAR monitors the workloads running on the system and uses several heuristics to pick up the fittest rules. It is clear that computer systems are getting ever more sophisticated, and human-lead empirical-based approach towards system optimization is not the most efficient way to realize the full potential of these modern, complex, high performance computing systems. This research brings machine learning, artificial intelligence, and big data methods to systems research and could lead to a very low cost I/O performance increase for a wide range of systems.

View original record on NSF Award Search →