MLWiNS: Optimization and Coding Theory for Fast and Robust Wireless Distributed Learning

$300,000 | FY2020 | CSE | NSF

University Of California-Santa Barbara, Santa Barbara CA

Investigators

Abstract

Wireless distributed learning systems can enable a variety of new applications, including industrial automation, semantic learning, autonomous driving, and health care. While wireless distributed learning brings new opportunities, it faces two major challenges that severely limit its efficiency, reliability, and scalability: (1) network heterogeneity, which stems from the varying computational capabilities of edge devices and, as the so-called straggler bottleneck, incurs large delays and failures because some computing nodes are significantly slower than the rest; and (2) the communication bottleneck, which stems from the massive amounts of raw or processed data that must be moved around the network. To tackle these bottlenecks, this project applies techniques from coding theory and optimization theory to develop distributed learning algorithms with strong theoretical guarantees and empirical performance.

Wireless distributed learning systems are driven by scaling out computations across many wireless edge nodes. Two major system bottlenecks arise, however: (1) the straggler delay bottleneck, caused by the latency of waiting for the slowest nodes to finish their tasks; and (2) the data shuffling bottleneck, caused by the massive amounts of data that must be moved among nodes. Moreover, there are privacy concerns about sharing sensitive local data, as well as vulnerabilities to adversarial attacks. This proposal aims to develop novel techniques from coding theory and optimization theory to tackle these bottlenecks and concerns. The project develops new "coded computing" algorithms for robust gradient aggregation, as well as new optimization algorithms for distributed learning. These algorithms are then used in two network settings to develop communication-efficient, straggler-resilient, and robust distributed learning frameworks: (i) a collaborative setting, where a learning task is allocated to multiple edge nodes of the network and data points can be encoded and offloaded to the edge nodes to provide resiliency against system bottlenecks; and (ii) a federated setting, where data points are gathered locally at edge devices and must remain local due to privacy concerns.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
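
The abstract does not specify which codes the project uses. As a rough illustration of how coded gradient aggregation can tolerate a straggler in the collaborative setting, the sketch below uses a small, well-known three-worker gradient code. It is not the project's own algorithm. The gradient values and the names `ENCODE`, `DECODE`, `worker_message`, and `aggregate` are hypothetical and chosen only for this example.

```python
from fractions import Fraction

# Hypothetical partial gradients, one per data partition (illustrative values only).
g1 = [Fraction(2), Fraction(-4)]
g2 = [Fraction(1), Fraction(3)]
g3 = [Fraction(-5), Fraction(6)]


def combine(coeffs, vectors):
    """Return the coordinate-wise linear combination sum_i coeffs[i] * vectors[i]."""
    dim = len(vectors[0])
    return [sum(c * v[k] for c, v in zip(coeffs, vectors)) for k in range(dim)]


# Encoding: each worker holds two of the three partitions and sends one coded vector.
ENCODE = {
    1: (Fraction(1, 2), 1, 0),   # worker 1 sends g1/2 + g2
    2: (0, 1, -1),               # worker 2 sends g2 - g3
    3: (Fraction(1, 2), 0, 1),   # worker 3 sends g1/2 + g3
}

# Decoding: coefficients that recover g1 + g2 + g3 from each pair of responders.
DECODE = {
    (1, 2): (2, -1),   # 2*(g1/2 + g2) - (g2 - g3)
    (1, 3): (1, 1),    # (g1/2 + g2) + (g1/2 + g3)
    (2, 3): (1, 2),    # (g2 - g3) + 2*(g1/2 + g3)
}


def worker_message(worker_id):
    """Coded vector that the given worker would send to the aggregator."""
    return combine(ENCODE[worker_id], [g1, g2, g3])


def aggregate(responders):
    """Recover the full gradient sum from any two workers that responded."""
    pair = tuple(sorted(responders))
    messages = [worker_message(w) for w in pair]
    return combine(DECODE[pair], messages)


# Reference value: the uncoded sum of all partial gradients.
full_gradient = combine((1, 1, 1), [g1, g2, g3])
```

In this sketch the aggregator can ignore whichever single worker is slowest, since `aggregate` returns the same vector as `full_gradient` for every pair of responders. The price is that each worker processes two partitions instead of one.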
