# From the Inside: Theory of Reinforcement Learning

by Vidya Muthukumar (Google Research Fellow, Simons Institute)

The roots of most, if not all, ideas in reinforcement learning (RL) can be traced back to the classical problem of multistage decision-making to maximize overall *reward*, introduced by Richard Bellman.^{1} The problem is called *multistage decision-making* because decisions made in the present impact the evolution of the state of the environment and rewards gleaned in the future. This problem has wide-ranging applications — spanning automatic aircraft control, robotics and large-scale inventory scheduling — but assumes that the impact of present decisions on the future is (at least approximately) known.
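When the effect of each decision is known, the multistage problem can be solved by dynamic programming. The sketch below runs value iteration, a standard dynamic programming method, on a hypothetical two-state problem. The states, actions, rewards and the discount factor of 0.9 are illustrative assumptions, not taken from Bellman's work or from the program.

```python
# Hypothetical two-state, two-action problem; every number here is illustrative.
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = {
    0: {0: [(1.0, 0, 0.0)],     # stay in state 0 and earn nothing
        1: [(1.0, 1, -1.0)]},   # pay 1 now to move to state 1
    1: {0: [(1.0, 1, 2.0)],     # stay in state 1 and earn 2 per stage
        1: [(1.0, 0, 0.0)]},    # move back to state 0
}


def action_value(outcomes, values, discount):
    """Expected reward now plus discounted value of the next state."""
    return sum(p * (r + discount * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions, discount=0.9, tol=1e-10, max_iters=10_000):
    """Repeat the Bellman optimality update until the values stop changing."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {
            s: max(action_value(o, values, discount) for o in actions.values())
            for s, actions in transitions.items()
        }
        gap = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if gap < tol:
            break
    return values


def greedy_policy(transitions, values, discount=0.9):
    """Pick, in each state, the action with the best one-step lookahead."""
    return {
        s: max(actions, key=lambda a: action_value(actions[a], values, discount))
        for s, actions in transitions.items()
    }


values = value_iteration(TRANSITIONS)
policy = greedy_policy(TRANSITIONS, values)
print(values, policy)
```

In this toy problem the best decision in state 0 is to accept a loss now in order to reach the state that pays off later, which is the sense in which present decisions shape future rewards.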

Reinforcement learning removes this fundamental premise by asking whether we can *learn* to make decisions optimally from observed reward feedback. On reflection, we are likely to find a variant of RL in our daily lives, whether in learning to drive our new car up a steep mountain or to invest our savings to maximize long-term profit. Indeed, RL has close historical ties to artificial intelligence. These ties to AI could explain the unique way in which RL successes have captured the imagination of popular culture: the unveiling of a chess program that learns to beat a grandmaster, or of robots that learn to do your dishes, brings excitement unmatched for now by other branches of ML.

Understanding the inner workings of these inventions is an entirely different matter. Classically, several ideas in the theory of RL arise organically out of approaches to *approximate* large-scale dynamic programming, and form a bedrock of ingredients for successful RL. The modern “success stories,” however, remain largely unexplained, even at an intuitive level. The celebrated network scientist Steven Strogatz perfectly articulates a sentiment about RL shared by the layman and the expert alike:

> What is frustrating[…] is that the algorithms can’t articulate what they’re thinking. We don’t know why they work, so we don’t know if they can be trusted. AlphaZero gives every appearance of having discovered some important principles of chess, but it can’t share that understanding with us[…] As human beings, we want more than answers. We want insight. This is going to be a source of tension in our interactions with computers from now on.^{2}

This tension was alluded to by none other than Bellman himself, who recognized that the tremendous empirical potential of RL and optimal control is part of what makes obtaining this insight so difficult.^{3} And as we seek to move RL from the intellectual domain of gaming into high-stakes applications like autonomous driving and healthcare, this insight is increasingly important.

The Fall 2020 Simons Institute program on Theory of Reinforcement Learning brings together experts from control theory, online learning, operations research, optimization and statistics to provide this insight. The diversity of expertise was on full display during the program’s boot camp, where each day featured an introduction to RL from the perspective of one of these fields. In addition, each day featured a talk by a practitioner in an application area, such as robotics, clinical trials, macroeconomics, or power grid optimization. These talks spurred lively discussions around several open-ended topics, such as the importance of safety and the philosophical meaning of specifying an external reward function.

One of the critical ingredients in RL’s empirical success is *function approximation* in extremely large environments. For example, the game of Go has a notoriously complex environment that takes on 10^{360} possible values. The famous AlphaGo algorithm derives much of its power by approximating both the value function and policy class of this game by cleverly chosen neural networks. Function approximation is only the tip of the iceberg, however: several mysterious heuristics, like *self-play*, *experience replay*, and *hierarchical RL*, are ubiquitous in today’s blogosphere. The first workshop of the program, on Deep Reinforcement Learning, featured successful developments in deep RL, from exploration heuristics and optimization tricks to assistance from the sister areas of computer vision, natural language processing and verification. This workshop was as much a demystification of deep RL as a survey of recent progress.

Traditional applications of RL, like gaming and robotics, assume practically unrestricted access to a simulator for policy evaluation and improvement. Some of the most exciting potential applications of RL (healthcare, education and criminal justice) are so high stakes that most learning must take place from *batched* data collected from prior deployments. The statistical questions that arise in *batch RL* have a decidedly modern flavor: how do we reliably evaluate policies that have not yet been deployed? Should prior deployments be designed with an “exploration” component? Is reliable learning even possible when data are at such a premium? The final workshop of the semester, Reinforcement Learning from Batch Data and Simulation, will address these questions. Two fields have been identified as groundswells of theory to address this challenge: *causal inference* to estimate treatment effects, and *online decision-making*, which trades off exploring and exploiting the environment. Indeed, RL is considered a final frontier in online decision-making, which was the central focus of the second workshop of the semester, Mathematics of Online Decision-Making.
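To make the first of these batch questions concrete, here is a minimal sketch of importance weighting, one standard way to estimate the value of an undeployed policy from logged data. It assumes a single-step setting in which each logged record also stores the probability with which the deployed policy chose its action. The function names, the record layout and the toy log are all hypothetical, and this is not presented as the approach taken in the workshop.

```python
def importance_sampling_value(logged, target_policy):
    """Estimate the average reward the target policy would have earned.

    logged: list of (context, action, reward, behavior_prob) records, where
        behavior_prob is the probability the deployed policy gave that action.
    target_policy: function (context, action) -> probability under the
        policy we want to evaluate without deploying it.
    """
    total = 0.0
    for context, action, reward, behavior_prob in logged:
        # Reweight each record by how much more (or less) likely the
        # target policy is to take the logged action.
        weight = target_policy(context, action) / behavior_prob
        total += weight * reward
    return total / len(logged)


# Toy log (illustrative): the deployed policy chose between two actions
# uniformly at random; action 0 always paid 1.0 and action 1 paid 0.0.
LOGGED = [
    ("x", 0, 1.0, 0.5),
    ("x", 1, 0.0, 0.5),
    ("x", 0, 1.0, 0.5),
    ("x", 1, 0.0, 0.5),
]


def always_action_zero(context, action):
    """Hypothetical target policy that deterministically picks action 0."""
    return 1.0 if action == 0 else 0.0


estimate = importance_sampling_value(LOGGED, always_action_zero)
print(estimate)
```

The estimator only works when the deployed policy gave nonzero probability to every action the target policy might take, which is one reason the question of designing prior deployments with an exploration component arises.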

In some ways, even one frenetically paced semester is an incredibly short time to address the full scope of RL. In several interdisciplinary topics like safety, RL with multiple agents, fairness-utility tradeoffs and societal implications, formulating the questions is proving to be as pressing a challenge as answering them. Internal program activities so far include reading groups on causality, deep RL and function approximation; a weekly open problems session; and several lively discussions on the program’s Discord channel. While the pace of an online program is somewhat slower in this extraordinary year, the energy remains intact.

Journalist in residence Brian Christian put his finger on why RL generates such unique excitement: “In gaining insight into its inner workings, we ultimately hope to learn more about ourselves.”

**References**

1. Computer science undergraduates might recognize Bellman’s name from the Bellman-Ford algorithm.

2. Steven Strogatz, “One Giant Step for a Chess-Playing Machine,” *The New York Times*, December 26, 2018.

3. For more on Bellman’s articulation of this tension, see the preface to Dimitri Bertsekas’s textbook, *Reinforcement Learning and Optimal Control* (Athena Scientific, Belmont, Massachusetts, 2019).