CAREER: Architecting Datacenters for Optimized Tail Latency at Scale

$535,474 · FY2023 · CSE · NSF

Georgia Tech Research Corporation, Atlanta GA

Investigators

Abstract

Modern societies rely heavily on cloud computing and a wide variety of online services, such as social media, web search, and digital content delivery. The multi-billion-dollar market of these utilities is enabled by massive deployments of servers (i.e., enterprise-grade computers), known as datacenters. It is imperative for such datacenter-deployed services to reliably generate high-quality, low-latency responses to every user request, as even infrequent performance or availability hiccups incur significant revenue loss. Therefore, service providers set strict Service Level Objectives (SLOs) that define the acceptable behavior of the tail end of each service component's response latency distribution. Abiding by such SLOs is challenging, as computing systems have conventionally been optimized to meet average rather than tail performance goals. A second key challenge stems from the need to keep up with the growing demands of next-generation services. Servers constantly communicate over the datacenter's internal network to collaboratively deliver an online service at the required scale, in terms of features, users, and datasets. With demand relentlessly growing in each of these three dimensions, inter-server data movement within the datacenter is reportedly doubling every 12-15 months, thus becoming a major performance determinant. This project aims to inform the design of future datacenters that will be able to keep up with the growing demands of next-generation online services powering modern digital economies. The two primary approaches to achieving that goal are (i) drastic improvement of intra-datacenter data movement efficiency via holistic cross-component (compute, memory, and network) design optimizations, and (ii) development of SLO-aware mechanisms embedded within each key system component to natively cater to the performance metrics of interest that are unique to datacenter environments.
This project pursues two main avenues to advance the efficiency of next-generation datacenters in their crucial role of delivering high-quality online services. The first avenue investigates the potential of promoting SLO-awareness from a retrospective evaluation metric to a pervasive, integral optimization knob from cluster scale down to microarchitecture. A holistic SLO-aware optimization framework will be developed to enable enforcement of cluster-wide dynamic request prioritization policies. In turn, these policies will drive a set of SLO-aware mechanisms implemented in the underlying hardware of performance-critical system components, including compute, network, and memory resources. Because data movement predominantly dictates computation efficiency, the second research avenue proposes techniques for reducing data movement both at the cluster scale and at each individual endpoint (i.e., server). As networking capabilities grow, on-server data movement must be intelligently orchestrated via judicious co-design of each server's network interface and memory hierarchy. Combined, the two research avenues introduce a new perspective on large-scale system design, in which a holistic approach enables new inter-component synergies that promise drastic improvements to end-to-end performance and efficiency. Finally, a third objective of the planned research, integral to the two main research avenues, is the development of new simulation methodologies and tools that enable evaluation of datacenter-scale techniques using the limited compute resources attainable in typical academic environments, while striking a balance between simulated system scale, speed, and accuracy. Synergistic datacenter-themed educational activities will also be undertaken at the graduate, undergraduate, and K-12 levels.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
