Sheila McIlraith (University of Toronto)
A standard assumption in reinforcement learning (RL) is that the agent does not have access to a faithful model of the world. As such, to learn optimal behaviour, an RL agent must interact with the environment and learn from its experience. While it seems reasonable to assume that the transition probabilities relating to the agent's actions are unknown, there is less reason to hide the reward function from the agent. Artificial agents cannot inherently perceive reward from the environment; someone must program those reward functions (even if the agent is interacting with the real world). Two challenges that face RL are reward specification and sample complexity. Specification of a reward function -- a mapping from state to numeric value -- can be challenging, particularly when reward-worthy behaviour is complex and temporally extended. Further, when reward is sparse, it can require millions of exploratory episodes for an RL agent to converge to a reasonable quality policy. In this talk, I present the notion of a Reward Machine, an automata-based structure that provides a normal form representation for reward functions. Reward Machines can be used natively to specify complex, possibly non-Markovian, reward-worthy behavior. Alternatively, because of their automata-based structure, a variety of compelling human-friendly formal languages can be used as reward specification languages and straightforwardly translated into Reward Machines, including variants of Linear Temporal Logic (LTL), and a variety of regular languages. Furthermore, Reward Machines expose reward function structure in a normal form. Our Q-Learning for Reward Machines (QRM), Counterfactual Reward Machine (CRM), and Reward Shaping algorithms exploit Reward Machine structure in their learning, while preserving optimality guarantees. Experiments show that these algorithms significantly outperform state-of-the-art (deep) RL algorithms, solving problems that otherwise can't reasonably be solved and critically reducing the sample complexity.