Watch and Learn: Offline Reinforcement Learning

by Brian Christian (Science Communicator in Residence, Simons Institute)

In 1963, cognitive scientists Richard Held and Alan Hein published the results of a rather unusual experiment that involved putting twin kittens inside cylindrical “carousels.” The kittens were placed into a harness set up so that one kitten was free to move while the other was unable to move but would be subjected to all the same visual stimuli as its sibling. Held and Hein’s question was straightforward: Did visual perception develop equally in both kittens as a function purely of the sensory input, or was agency somehow integral?

The answer was unambiguous. The kitten that could move — and therefore control its visual stimuli — developed normal visual discrimination. But its sibling, which had been unable to control its movement within the carousel, did not. It failed to blink when an object rapidly approached, to avoid steep cliffs, or to follow moving objects with its eyes, suggesting deficits in both depth perception and image segmentation.

As Held and Hein’s colleague, neurophysiologist Marc Jeannerod, would put it: “Perception is constructed by action.”

A number of studies have pursued similar questions in both humans and robots. In 2005, roboticists at the EPFL in Switzerland and the University of Sussex replicated the basic findings of Held and Hein in robots controlled by simple neural networks. Being able to link actions with one’s visual input seemed crucial in developing useful behavior. The results suggested a passive system couldn’t differentiate between visual features that were relevant to action and those that were not. “In other words,” the authors wrote, “freely behaving systems select a subset of stimuli that coherently support the generation of behavior itself.” More recently, in 2020 a team of French and British cognitive scientists showed that humans learn differently from choices they make freely than from choices that are forced or merely observed.

Taken together, this research over the better part of a century illustrates the difficulties of learning in its more passive mode, offering a kind of tantalizing “negative result.”

In fact, reinforcement learning’s roots run back through the behaviorism of the early 20th century to behaviorism’s own roots in the late 19th: Edward Thorndike’s “law of effect,” like Alexander Bain’s “trial and error” (Bain coined the phrase) and “groping experiment” (his other catchphrase didn’t quite stick), assumes that the agent goes forth into the world to learn for itself by interacting with the environment.

And yet, one of the most intriguing and dynamic frontiers of reinforcement learning is precisely this: how to do reinforcement learning passively, that is, by simply observing the behavior of another agent.

In 2019, Stanford’s Emma Brunskill was at the Simons Institute’s workshop on Emerging Challenges in Deep Learning, talking about her experience confronting a problem in educational technology. She had been working on a data set from the game Refraction, which teaches students about fractions. Might there be a way, she wondered, to encourage students to persist in the game — playing, and therefore learning, just a bit more — through better sequencing the game’s levels?

In a normal reinforcement-learning context, this would be done by exploration: by trying out different sequences and iterating based on the resulting feedback. But Brunskill didn’t want to risk giving the human learners a bad experience and driving them away. So instead she wanted to see how much she could infer without exploration, based only on the status quo data lying around.

“I’d been doing reinforcement learning for a while before this,” she recalls, “but this was the first time I really thought about this issue of sort of counterfactual, or batch off-policy, reinforcement learning.”

Traditionally, reinforcement learning was divided into “on-policy” and “off-policy” techniques. On-policy algorithms would interact with the environment, making adjustments as they went, learning only from their own current behavior. Off-policy algorithms would also interact with the environment, but they could learn from data generated by a different policy: typically a stored record of their own past behavior, which they continued to draw on as they interacted further. In both cases, though, this direct interaction was essential. But what about something able to learn without any direct environmental interaction at all? In a testament to the recent activity in this area, it goes by several names: “counterfactual” RL, “batch” or “batch off-policy” RL, “data-driven” RL, or simply “offline” RL.
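The offline setting can be sketched concretely. The toy problem and dataset below are hypothetical, but the mechanics are the essence of the idea: tabular Q-learning applied to a fixed batch of logged transitions, with no environment in the loop at all.

```python
import numpy as np

def offline_q_learning(dataset, n_states, n_actions, gamma=0.9, lr=0.1, epochs=200):
    """Fit Q-values from a fixed batch of transitions -- no environment interaction.

    dataset: list of (state, action, reward, next_state, done) tuples,
    as logged by some other (behavior) policy.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += lr * (target - Q[s, a])  # standard TD(0) update
    return Q

# Hypothetical two-state toy problem: in state 0, action 1 leads to a
# rewarding terminal state; action 0 leads to an unrewarding dead end.
batch = [
    (0, 1, 1.0, 1, True),
    (0, 0, 0.0, 1, True),
    (0, 1, 1.0, 1, True),
]
Q = offline_q_learning(batch, n_states=2, n_actions=2)
greedy_action = int(Q[0].argmax())  # the policy recovered purely from the log
```

The catch, of course, is that the learner can only evaluate actions the logged policy actually tried; states and actions missing from the batch remain invisible, which is exactly what makes offline RL hard.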

There are a number of domains in which this approach makes sense. In medicine, for instance, we hardly want to explore random policies when human lives are at stake. And yet making the most of the preexisting data, to the point of being able to confidently suggest improved treatment policies, is a significant challenge.
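One standard tool for this kind of counterfactual question is inverse propensity scoring: reweighting logged outcomes by how much more (or less) likely the proposed policy would have been to take each logged action. The sketch below is a minimal, single-step illustration with made-up data, not any specific medical or educational system.

```python
import numpy as np

def ips_value_estimate(logs, target_policy):
    """Inverse-propensity-scoring estimate of a new policy's value,
    computed only from data logged under a different (behavior) policy.

    logs: list of (context, action, reward, behavior_prob) tuples.
    target_policy: function (context, action) -> probability under the new policy.
    """
    weighted = []
    for context, action, reward, p_behavior in logs:
        w = target_policy(context, action) / p_behavior  # importance weight
        weighted.append(w * reward)
    return float(np.mean(weighted))

# Hypothetical logs: a uniform-random behavior policy (prob 0.5 per action)
# in a single context, where action 1 pays 1.0 and action 0 pays nothing.
logs = [
    (0, 1, 1.0, 0.5),
    (0, 0, 0.0, 0.5),
    (0, 1, 1.0, 0.5),
    (0, 0, 0.0, 0.5),
]
# Proposed policy: always pick action 1.
always_one = lambda context, action: 1.0 if action == 1 else 0.0
estimate = ips_value_estimate(logs, always_one)  # -> 1.0 for this batch
```

The estimator is unbiased but can have enormous variance when the proposed policy favors actions the logged policy rarely took, which is one reason error bounds of the kind Brunskill describes matter so much before deployment.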

“I think that there’s a lot of really exciting things to be done in the context of policy optimization, policy evaluation, and error bounds,” she says. “We’re also thinking about this from some other perspectives, too, of how do we get error bounds that allow us to have guarantees on things before we deploy them, to try to provide more safe AI, as well as trustworthy AI.”

Brunskill and her collaborators were able not only to identify what seemed like a more promising sequence for the game, but also to estimate how much better it was: namely, that it would likely increase student persistence by 30%. Later, they were able to run an experiment with an additional 2,000 students. Their persistence indeed went up — by 30%.

“I thought this was a really exciting result,” she says. One, machine learning offered a sizable improvement over the status quo. Two, it came along with a reliable prediction of its own impact. Finally, beyond the immediate success, it hinted at a new frontier for the field as a whole.

A separate drama of rethinking the assumptions of RL was playing out in parallel. The University of Florida’s Sean Meyn had been at the Simons Institute’s Real-Time Decision Making program the year before when he asked Associate Director Peter Bartlett if he could give a three-hour tutorial on his view of reinforcement learning. Meyn wanted to see if he could stir up some trouble. “I gave examples,” he says, “where, if you really try to do things online, by the time it converged, you’d all be dead — everyone would be dead.”

For Meyn, high-stakes domains like autonomous vehicles necessitate a more offline approach to RL: “Of course we’re not going to just search that needle in a haystack as we drive.” He thinks back to the X-15 aircraft of the 1950s and '60s, which was retired after a fatal crash. “That shook up the community,” he says. “And the era of robust control came right on the heels of that.”

Reflecting on that history, Meyn had become convinced that a fresh approach was necessary to RL — and a new mathematics. Reinforcement learning had leaped dramatically forward in recent years, but the field had become dogmatic. “There’s only one way you’re allowed to analyze these algorithms,” he argued. “You have to have a finite-n bound. And I’d love that!” he exclaimed. “I’d love to be able to do that! And the thing is, though: forget it. You’re not gonna get it. And it’s not gonna be informative.”

The finite-n bound puts a direct cap on the error in our model after some concrete number of data points or time steps. Such a bound is often surprisingly elusive, even in systems that are otherwise simple, well-behaved, or well understood: for instance, in the analysis of finite Markov chains, of the behavior of simple single-server queues, and of stochastic simulation — all mature fields that can be viewed as attacking special cases of reinforcement learning problems.
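Schematically, a finite-n bound has a shape like the following (the symbols and constants here are illustrative, not drawn from any particular result):

```latex
% With probability at least 1 - \delta, after n samples the estimate
% \hat{\theta}_n lies within an explicit distance of the truth \theta^{\star}:
\[
  \bigl\| \hat{\theta}_n - \theta^{\star} \bigr\|
  \;\le\; C \sqrt{\frac{\log(1/\delta)}{n}},
\]
% where C is a problem-dependent constant. Producing a usable C -- and
% proving the 1/\sqrt{n} rate actually holds -- is the hard part.
```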

These negative results offer a sobering perspective from neighboring areas that share much of the same technical machinery; they show just how subtle and difficult these problems are. If the finite-n bound is indeed too much to hope for in most kinds of RL problems, Meyn wanted to think about what might serve in its place. The field of stochastic approximation has long explored alternative desiderata, for example asymptotic covariance, which characterizes equilibrium behavior rather than error at any discrete intermediate point.
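Asymptotic covariance is usually stated through a central limit theorem for the stochastic-approximation iterates; again, the statement below is schematic rather than a precise theorem:

```latex
% For iterates \theta_n of a stochastic-approximation scheme converging
% to \theta^{\star}, under suitable conditions a central limit theorem holds:
\[
  \sqrt{n}\,\bigl(\theta_n - \theta^{\star}\bigr)
  \;\xrightarrow{\;d\;}\; \mathcal{N}(0, \Sigma).
\]
% The asymptotic covariance \Sigma measures the algorithm's equilibrium
% efficiency, rather than capping its error at any fixed n.
```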

It began to be appreciated, circa the mid-1990s, that there were intimate connections between reinforcement learning and stochastic approximation, which estimates the long-term behavior of phenomena that are unknown or otherwise too complex to analyze directly. Meyn started to think there was something crucial here, something underexplored between the two fields. Something about his 2018 critique at the Real-Time Decision Making program seemed to have resonated, and not just with the program participants. The video of his lecture had gone some approximation of viral. 

“Ten hours later, there were a thousand views or something,” he says. “That’s the fun thing about Simons.”

In the fall of 2020, at Bartlett’s urging, Brunskill and Meyn joined with Princeton’s Mengdi Wang to organize a full workshop on the topic of reinforcement learning from batch data and simulation. Over the week, participants offered a host of ideas about how to expand the frontiers of RL: to push into new problem spaces, to strive for new definitions. By the end, there was a sense of promise, of possibility.

“The RL tent has become much broader than it was a few years ago,” says Meyn. “The marriage of data science and control systems is singularly exciting.” The field is not only diversifying but approaching a kind of phase transition. “I am hopeful,” he says, “for big progress in the next decade.”

“I think it’s really exciting to see the number of people that are starting to talk about counterfactual RL,” says Brunskill. “I think we’re only at the very beginning.”