CSR: Small: Diagnosing Performance and Correctness Errors in Parallel Applications at Large Scales

$466,000FY2015CSENSF

Purdue University, West Lafayette IN

Investigators

Abstract

Dependability has become a critically necessary property for many of the computer systems that surround us or work behind the scenes supporting our personal and professional lives. We have made great strides in our ability to design, implement, and operate dependable systems. However, dependability solutions are increasingly being stressed due to rapid increases in the scale of the computing systems. Computer applications used in areas such as computational genomics, data mining, and prediction of natural phenomena tackle extremely complex problems which generate vast amounts of sensory data; thus, the inputs to these applications is tremendous. Computing is rapidly becoming more dependent on parallelism - where many calculations are carried out simultaneously. This means increasing core counts for servers, more servers and racks for data centers, and a dramatic increase in the number of computing cores that these applications must run. The traditional dependability solutions just are not working. When an application does not complete or completes with incorrect results, the developer must identify the offending parallel task and then the portion of the code in that task that caused the error. This is hard enough for parallel applications at small to moderate sizes. These issues get exacerbated at large scales. Dealing with tens of processes is within reach of mere mortal developers, a few hundreds of processes is within reach of heroic developers, but on machines of petascale and beyond, this requires sophisticated support. This project will create design principles for debugging tools that can operate at large scales of data and process count and a practical instantiation of these principles in a system called LANCET. The methodology will be based on the insight that the numbers of equivalence classes of processes in an application do not grow even as the number of processes grows. Analysis will mostly deal with equivalence classes. Resilience runtime will have elements that operate on individual processes in a completely distributed manner. Where non-local knowledge is needed, the techniques will operate in a sampling mode. Finally, the project will develop solutions for data-dependent errors that have resisted convincing widely applicable solutions, i.e., errors of the kind that manifest themselves for specific input datasets or specific input parameter combinations.

View original record on NSF Award Search →