RI: Small: Combining Reinforcement Learning and Deep Learning Methods to Address High-Dimensional Perception, Partial Observability and Delayed Reward

$499,886 · FY2015 · CSE · NSF

Regents Of The University Of Michigan - Ann Arbor, Ann Arbor MI

Investigators

Satinder S Baveja (contact), Honglak Lee, Richard L Lewis

Abstract

Consider the problem faced by a machine agent that has to interact with a dynamical environment to achieve some goals. Concretely, imagine an agent engaged in a virtual competition as a human would be. It can see the screen, which is composed of many moving objects. At any time, it can choose one of a dozen or so actions. Its action controls one of the objects on the screen, but it is often not clear which one. Every so often, an evaluation of the competition is given. At some point the competition ends. How should such an agent choose actions, or more importantly, how can we build agents that learn to compete, i.e., achieve high scores, through trial and error? In this project, methods will be developed and evaluated to build such agents. The above problem is an instance of what is called a reinforcement learning (RL) problem. Such problems abound in sequential decision-making settings. Applications in industry include factory optimization, robotics, and chronic disease management (to list but three diverse domains of interest). Like many of these RL problems, Atari games (used as a testbed here to evaluate learning strategies) have three characteristics of interest to this project. First, they generate high-dimensional images, so the agent faces a difficult perception problem. Second, they often have deeply delayed rewards; i.e., actions have long-term consequences. For example, losing a resource may not incur a cost at the moment of loss, but could lead to very high losses much later when that resource is critically necessary. Third, they have deep partial observability; i.e., to compete effectively one often has to remember the distant past. For example, a location encountered far back in the past may become valuable much later because a critical resource becomes available at that time, and the agent would have to find its way back to that location to use the resource.
It is proposed to address these three challenges, respectively, with new neural network architectures for predicting the consequences of actions, new methods for intrinsically motivating agents even when reward is delayed, and new recurrent neural network architectures to remember the past effectively. Success of the proposed work is expected to significantly expand the scope of application of reinforcement learning. Finally, Atari games will be used as an evaluation domain instead of, say, factory optimization because they are readily available. They will be used to draw the interest of high-school and under-represented undergraduate students to the complex ideas underlying the proposed work; their fun visualizations will allow them to be integrated into teaching in the PIs' classes; and the games vary in the degree of difficulty along the three challenge dimensions, allowing more effective control of the evaluations.
