CNS Core: Small: Online Safe Reinforcement Learning for Wireless Resource Allocation
University of Texas at Austin, Austin, TX
Abstract
Next-generation wireless networks are being engineered to meet a complex mix of application requirements, from traditional mobile broadband (e.g., web browsing, video streaming) to emerging applications (e.g., augmented reality, self-driving cars, industrial automation, robotics) with heterogeneous and much more stringent reliability and latency requirements. The ability to support these requirements drives the potential to deliver the new business models and revenue streams that would enable deployments of the new technology. Thus, wireless scheduling and resource allocation will take center stage among the enabling technologies for such networks. Meanwhile, reinforcement learning (RL) using deep networks has emerged as a powerful framework for devising policies that optimize the performance of complex systems (including wireless systems); however, these policies usually do not come with any formal guarantees. The central thesis of this research is that RL-based resource allocation policies without operational guarantees, e.g., throughput-optimality/stability, are unlikely to be accepted or deployed; a key requirement for making these techniques usable is therefore to develop approaches that ensure safety guarantees. The research advances the state of the art in safe reinforcement learning, with specific applications to wireless systems, and is also expected to benefit other application domains, as well as society more broadly, through planned efforts in education, innovation, achieving diversity, engaging the community and industry, and disseminating results to a wider public.

This research centers on the development and analysis of a safe reinforcement learning (Safe-RL) framework that optimizes rewards over short time scales while also providing theoretically strong long-term throughput-optimality guarantees for state-of-the-art wireless scheduling algorithms.
The key underlying observation is that many of today's scheduling algorithms derive their performance guarantees from Lyapunov analysis. The project leverages the innovative concept of guardrails (constraints on the state-dependent actions of Safe-RL), which guarantee that the wireless system's Lyapunov evolution stays within a bounded perturbation of that of classical algorithms. This guarantee, in turn, ensures that Safe-RL retains the safety/stability properties of state-of-the-art schedulers while leveraging RL to realize complex performance tradeoffs.

The research consists of three interrelated thrusts. Thrust 1 develops the foundations and representations for Safe-RL at the core of this project, along with a theoretical basis for safety guarantees and new classes of efficient learning for wireless system network abstractions. Thrust 2 focuses on the application of Safe-RL theory to wireless resource allocation, including addressing challenges associated with joint scheduling of real-time and broadband traffic, learning and exploiting traffic patterns, and exploring the degree to which a policy hits guardrails as an indication of system anomalies or of the need for re-optimization. Thrust 3 centers on the challenging but necessary task of validating the Safe-RL framework using an industrial-strength multi-cell simulator.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
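The guardrail idea described in the abstract can be illustrated with a minimal sketch. It is not the project's actual algorithm, which the abstract does not specify. The sketch assumes a single server choosing one of several queues per slot, a quadratic Lyapunov function, and MaxWeight (serve the queue with the largest backlog-times-rate product) as the classical reference scheduler. The names `guardrail_set`, `safe_action`, `rl_scores`, and the slack parameter `delta` are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence


def maxweight_action(queues: Sequence[float], rates: Sequence[float]) -> int:
    """Classical MaxWeight: serve the queue with the largest backlog * rate."""
    weights = [q * r for q, r in zip(queues, rates)]
    return max(range(len(weights)), key=weights.__getitem__)


def guardrail_set(queues: Sequence[float], rates: Sequence[float],
                  delta: float) -> List[int]:
    """Hypothetical guardrail: allow only actions whose weight is within
    `delta` of the MaxWeight weight in the current state."""
    weights = [q * r for q, r in zip(queues, rates)]
    best = max(weights)
    return [i for i, w in enumerate(weights) if w >= best - delta]


def safe_action(queues: Sequence[float], rates: Sequence[float],
                rl_scores: Sequence[float], delta: float) -> int:
    """Pick the action the (assumed) RL policy scores highest,
    restricted to the guardrail set."""
    allowed = guardrail_set(queues, rates, delta)
    return max(allowed, key=lambda i: rl_scores[i])


# Illustrative state: three queues, per-slot service rates, RL preferences.
queues = [10.0, 2.0, 9.0]
rates = [1.0, 3.0, 1.0]
rl_scores = [0.1, 0.9, 0.5]  # the RL policy would prefer queue 1

chosen = safe_action(queues, rates, rl_scores, delta=2.0)
```

In this example the weights are 10, 6, and 9, so with `delta=2.0` the guardrail admits queues 0 and 2. The RL preference for queue 1 is overridden, and queue 2 is served instead. Setting `delta=0` recovers MaxWeight exactly, while a larger `delta` gives the learned policy more freedom at the cost of a looser bound relative to the reference scheduler.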