CAREER: Lineage-driven Fault Injection
University Of California-Santa Cruz, Santa Cruz CA
Investigators
Abstract
With each passing day, Americans become more dependent on large-scale, cloud-based distributed systems for everything from commerce (e.g., shopping for groceries) to transportation (e.g. booking a cab) to communication (e.g., making plans via a chat app) to storage (e.g., archiving baby photos). These systems operate at a scale where machine crashes and network interruptions are the rule rather than the exception. Unfortunately, writing fault tolerant software -- programs that attempt to detect and mitigate such failures, in order to prevent data loss or system unavailability -- remains an art form. Application programmers, data analysts and mobile developers have few tools to help them implement, maintain and debug fault-tolerant systems. To address these needs, we introduce lineage-driven fault injection (LDFI), a novel bug-finding methodology that uses data lineage to reason backwards (from effects to causes) about combinations of faults could prevented desired system outcomes. If the project succeeds, it will improve the overall availability and resiliency of these increasingly critical systems. It will also provide a new class of tools for the distributed programmers who must build and maintain cloud systems, identifying and explaining bugs at all stages of the development process, from specification to post-production. LDFI combines ideas from database theory, distributed systems, formal methods, optimization, reliability and data visualization. By exercising only the faults that it can prove could have affected a known good outcome, LDFI can identify invariant violations and user-visible failures using orders of magnitude fewer executions than conventional techniques such as random or heuristic fault injection. LDFI generalizes the notion of data lineage beyond data management , using it to explain large-scale system outcomes and to identify redundant computations that can mask faults. Given a model of a system's built-in redundancy, LDFI can use state-of-the-art solvers to generate hypotheses to exercise via fault injection infrastructures. This project engenders new research in areas such as query explanations, low-overhead system tracing, and how-to analysis.
View original record on NSF Award Search →