GGrantIndex
← Search

Reducing Latency by Replicating Jobs

$299,711FY2015ENGNSF

Carnegie Mellon University, Pittsburgh PA

Investigators

Abstract

A new notion of redundancy in computer systems design is the act of purposely replicating a job, dispatching each replica to a different queue, and only waiting for the first completion (all remaining replicas are discarded). While still nascent, recent studies at Google, Microsoft, and U.C. Berkeley have shown that redundancy can reduce mean latency in data centers and cloud services by up to 50 percent. Unfortunately, there are almost no analytical studies of redundancy. The research team will provide the first exact analysis of redundancy, understanding both the pros and cons of redundancy. The results of this research will impact several diverse research communities including: operations research, queueing theory, coding theory, cloud computing, data center optimization, and healthcare, and will also increase K-12 STEM opportunities for students. Redundancy requires a new queueing paradigm: there is no longer a single copy of each job, and redundant copies disappear as soon as one copy completes. Different classes of jobs may have different levels of redundancy, necessitating a complex state space where the exact ordering of all copies of all jobs in all queues must be tracked. Systems with redundancy bear some resemblance to other classically hard operations research systems, like fork-join systems, coupled-processor systems, and flexible server systems. The research team aims to explicitly characterize the latency benefit to jobs that are replicated as well as the pain inflicted (increase in latency) on those jobs that cannot be replicated. The research will also investigate how the benefit of redundancy is affected by real-world externalities, like cancellation cost, reneging of users, and variability of service times. Redundancy will also be compared with other dispatching policies both via analysis and experimentation in cloud systems.

View original record on NSF Award Search →