
CSR: Small: Monitoring for Error Detection in Today's High Throughput Applications

$275,000 | FY2009 | CSE | NSF

Purdue University, West Lafayette IN


Abstract

Much of our critical infrastructure is formed by distributed systems with real-time requirements. Downtime of a system providing critical services in power systems, air traffic control, banking, or railway signaling could be catastrophic. Errors may come from individual software components, from interactions between multiple components, or from misconfiguration of these components. It is therefore imperative to build low-latency detection systems that can subsequently trigger the diagnosis and recovery phases, leading to systems that are robust to failures. A powerful approach to error detection is the stateful approach, in which the error detection system builds up state related to the application by aggregating multiple messages. The detection rules are then based on this state, and thus on aggregated rather than instantaneous information. Though the merits of stateful detection seem to be well accepted, it is difficult to scale stateful detection with an increasing number of application components or an increasing data rate, because of the increased processing load of tracking application state and matching rules against that state. In this project, we address this issue by designing a runtime monitoring system focused on high-throughput distributed applications. Our solution is based on intelligent sampling, probabilistic reasoning on the application state, and opportunistic monitoring of the heavy-duty rules. A successful solution will allow reliable operation of high-bandwidth distributed applications and those with a large number of consumers. We will also achieve broader impact through an innovative service-learning program at Purdue called EPICS and a new course.
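The stateful approach described in the abstract can be illustrated with a minimal sketch. It is not the project's monitoring system: the class name `StatefulDetector`, the error-ratio rule, the window size, and the uniform random sampling (a simple stand-in for the "intelligent sampling" the abstract mentions) are all assumptions made here for illustration.

```python
import random
from collections import defaultdict, deque


class StatefulDetector:
    """Hypothetical sketch: build per-component state from messages,
    then match a rule against that aggregated state."""

    def __init__(self, window=10, max_error_ratio=0.5, sample_rate=1.0, seed=0):
        self.window = window                    # messages kept per component
        self.max_error_ratio = max_error_ratio  # rule threshold (assumed)
        self.sample_rate = sample_rate          # fraction of messages inspected
        self.rng = random.Random(seed)
        # State: the most recent sampled outcomes for each component.
        self.state = defaultdict(lambda: deque(maxlen=window))

    def observe(self, component, is_error):
        """Process one message; return True if the rule fires."""
        # Uniform sampling: skip some messages to reduce processing load.
        if self.rng.random() >= self.sample_rate:
            return False
        history = self.state[component]
        history.append(bool(is_error))
        # The rule uses aggregated state, so wait until the window is full.
        if len(history) < self.window:
            return False
        return sum(history) / len(history) > self.max_error_ratio


detector = StatefulDetector(window=4, max_error_ratio=0.5)
alerts = [detector.observe("a", e) for e in (False, True, True, True)]
```

In this sketch a single erroneous message never raises an alert on its own; the rule fires only once the aggregated window for a component shows an error ratio above the threshold. Lowering `sample_rate` trades detection coverage for less per-message work, which is the scaling tension the abstract points to.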
