### Monday, August 31st, 2020

We review foundational aspects of planning in Markov decision processes (MDPs). The goal is to build up the knowledge needed to discuss planning with generative models in large-scale MDPs. For this, we will focus on finite (but perhaps large-scale) MDPs, and leave extensions beyond the finite case to the discussions. We start with basic definitions (what is an MDP, various criteria of optimality) and review fundamentals of dynamic programming (value functions, optimal value functions, Bellman's equations). After reviewing basic properties of the fundamental approaches to finding an optimal policy (value iteration, policy iteration, linear programming, and policy gradient), we wrap up exact planning by discussing known computational complexity and query complexity results. We next consider approximate planning with generative models and how one can save computation and improve generalization by moving away from exact global planning. We discuss differences between local and global planning in MDPs, and finish with a discussion of what is known about using (mainly linear) function approximation in approximate planning.
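As a concrete (and heavily simplified) illustration of the dynamic programming ideas above, here is a minimal value iteration sketch on a toy two-state, two-action MDP; all transition probabilities and rewards are invented for illustration, not taken from the tutorial:

```python
import numpy as np

# Toy MDP: P[a, s, s'] = transition probability, r[s, a] = expected reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate the Bellman optimality operator until (approximate) convergence."""
    v = np.zeros(r.shape[0])
    while True:
        # q[s, a] = r[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']
        q = r + gamma * (P @ v).T
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            # Optimal value function and a greedy (optimal) policy.
            return v_new, q.argmax(axis=1)
        v = v_new

v_star, pi_star = value_iteration(P, r, gamma)
```

Since the Bellman operator is a gamma-contraction in the max norm, the loop terminates with `v_star` close to the fixed point and `pi_star` greedy with respect to it.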

Online learning provides a method to study repeated decision making; the objective is to derive algorithms with small regret, defined to be the difference between the performance of the algorithm and the performance of the best policy or action in some comparator class. Unlike the MDP model, long-term reasoning is not explicitly built into the model but is instead injected implicitly through the regret term. In effect, we need to be mindful of which comparators have performed well in the past and which may perform best in the future. This general framework allows us to handle stochastic and adversarial data as well as full and partial information. We will begin by presenting well-known algorithms for the full-information case, such as exponential weights (a.k.a. Hedge), follow-the-regularized-leader, online gradient descent, and mirror descent, and derive bounds on their worst-case regret. We will look in detail at linear and quadratic losses, to which the convex and curved convex cases reduce. We will discuss an application to learning saddle points in two-player zero-sum games by repeatedly playing a pair of online learners against each other (a.k.a. "self-play").
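To make the full-information setting concrete, here is a minimal sketch of the exponential weights (Hedge) algorithm on a synthetic loss sequence; the learning-rate tuning and the random loss data are illustrative choices, not from the tutorial:

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Exponentially weighted forecaster on a fixed loss sequence.

    loss_matrix[t, i] = loss of expert i at round t, assumed in [0, 1].
    Returns the total expected loss of the algorithm.
    """
    T, K = loss_matrix.shape
    log_w = np.zeros(K)                  # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                     # play distribution ∝ exp(-eta * cumulative loss)
        total += p @ loss_matrix[t]
        log_w -= eta * loss_matrix[t]    # multiplicative-weights update
    return total

# Regret against the best single expert in hindsight:
rng = np.random.default_rng(0)
L = rng.random((1000, 5))
T, K = L.shape
eta = np.sqrt(2 * np.log(K) / T)         # a standard O(sqrt(T log K)) tuning
regret = hedge(L, eta) - L.sum(axis=0).min()
```

With this tuning the worst-case regret is bounded by a constant times sqrt(T log K), which is the kind of bound derived in the tutorial.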

In the second part, we will study the "multi-armed bandit" problem, where only the reward of the chosen action is revealed to the algorithm, and show that we may formalize this problem either as a modified full-information game or as an MDP with a trivial state space. When the goal is to minimize regret, we may apply full-information algorithms by constructing estimates of the full loss (as is done by the EXP3 algorithm) or apply the "optimism in the face of uncertainty" (OFU) principle to choose the action with the highest plausible reward (as is done by the UCB algorithm). We will then discuss the "pure exploration" problem (a.k.a. the best-arm identification problem), where we wish to maximize our chances of finding a (nearly) optimal arm, and introduce "action elimination" algorithms that do well in this setting. Finally, we discuss the setting where the learner also receives some contextual information before having to select an action: the contextual bandit problem requires the learner to compete with a class of mappings from contexts to actions. Throughout, we will highlight important lower bounds on regret which, when combined with the upper bounds for specific algorithms, often fully characterize the difficulty of these settings.
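As a concrete instance of the OFU principle mentioned above, here is a minimal UCB1 sketch on synthetic Bernoulli arms; the arm means, horizon, and bonus constant are illustrative choices:

```python
import math
import random

def ucb1(arm_means, T, seed=0):
    """UCB1 on Bernoulli arms: play the arm with the highest upper confidence bound."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts = [0] * K
    sums = [0.0] * K
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                     # play each arm once to initialize
        else:
            # Empirical mean plus an exploration bonus that shrinks as counts[a] grows.
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        total_reward += reward
    return total_reward, counts

reward, counts = ucb1([0.2, 0.5, 0.8], T=5000)
# Suboptimal arms are pulled only O(log T / gap^2) times, so the best arm
# (mean 0.8) ends up pulled far more often than the others.
```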

AI work tends to focus on how to optimize a specified reward function, but rewards that consistently lead to the desired behavior are not so easy to specify. Rather than optimizing a specified reward, which is already hard, robots have the much harder job of optimizing the intended reward. While the specified reward does not carry as much information as we make our robots pretend, the good news is that humans constantly leak information about what the robot should optimize. In this talk, we will explore how to read the right amount of information from different types of human behavior -- and even the lack thereof.

### Tuesday, September 1st, 2020

This tutorial will cover the basics of regret analysis in (tabular) Markov Decision Processes (MDPs). The first part will focus on learning in an unknown but fixed MDP. Two families of algorithms will be covered: those based on optimism in the face of uncertainty (OFU) and those based on posterior (or Thompson) sampling. The second part will focus on learning in a harder setting where the MDP transition function and/or the reward function can change adversarially at each time step. Techniques from online learning such as exponential weights, online mirror descent (OMD), and follow-the-regularized-leader (FTRL) will be used to study this challenging problem. Algorithms and their regret analyses establish upper bounds on the minimax regret in various settings. To complement these upper bounds, we will also include a discussion of hardness results, especially in the form of regret lower bounds.
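As a minimal illustration of the posterior-sampling idea, here is Thompson sampling in its simplest setting, a Bernoulli bandit (i.e., an MDP with a trivial state space); the arm means and horizon are invented for illustration:

```python
import random

def thompson_bernoulli(arm_means, T, seed=0):
    """Thompson sampling with Beta(1, 1) priors on Bernoulli arms."""
    rng = random.Random(seed)
    K = len(arm_means)
    # Beta posterior parameters per arm: successes + 1, failures + 1.
    alpha = [1] * K
    beta = [1] * K
    counts = [0] * K
    for _ in range(T):
        # Sample a plausible mean for each arm from its posterior, play the argmax.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(K)]
        a = max(range(K), key=lambda i: samples[i])
        r = 1 if rng.random() < arm_means[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        counts[a] += 1
    return counts

counts = thompson_bernoulli([0.3, 0.7], T=3000)
```

The posterior concentrates on the true means, so play concentrates on the better arm; the MDP versions covered in the tutorial sample a full model (transitions and rewards) from the posterior instead.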

This tutorial will cover the basics of performing offline / batch reinforcement learning. There is significant potential to better leverage existing datasets about decisions made and their outcomes in many applications, including commerce and healthcare. The basic problems may be described as batch policy evaluation -- how to estimate the performance of a policy given old data -- and batch policy optimization -- how to find the best policy to deploy in the future. I will discuss common assumptions underlying estimators and optimizers, recent progress in this area, and ideas for relaxing some of the common assumptions, such as that the data collection policy sufficiently covers the space of states and actions, and the absence of confounding variables that might have influenced prior data. These topics are also of significant interest in epidemiology, statistics, and economics; here I will focus particularly on the sequential decision process (such as Markov decision process) setting with more than two actions, which is of particular interest in RL and has been much less well studied than the binary-action, single-time-step setting.
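To make batch policy evaluation concrete, here is a minimal sketch of the ordinary importance-sampling estimator; the trajectory format and policy representation are illustrative choices, and the coverage assumption mentioned above appears explicitly as a requirement:

```python
import numpy as np

def is_policy_value(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance-sampling estimate of a target policy's value.

    trajectories: list of [(s, a, r), ...] collected under behavior policy pi_b.
    pi_e[s][a], pi_b[s][a]: action probabilities of target / behavior policies.
    Requires pi_b[s][a] > 0 wherever pi_e[s][a] > 0 (coverage).
    """
    estimates = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            weight *= pi_e[s][a] / pi_b[s][a]   # cumulative likelihood ratio
            ret += discount * r
            discount *= gamma
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Tiny synthetic check: one state, two actions, length-1 trajectories.
pi_b = [[0.5, 0.5]]           # behavior: uniform over the two actions
pi_e = [[0.0, 1.0]]           # target: always play action 1
trajs = [[(0, 0, 0.0)], [(0, 1, 1.0)]]
v_hat = is_policy_value(trajs, pi_e, pi_b)
# v_hat = (0*0 + 2*1) / 2 = 1.0, the target policy's true value here.
```

The estimator is unbiased under coverage, but its variance grows quickly with horizon, which motivates the more refined estimators discussed in the tutorial.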

https://blog.einstein.ai/the-ai-economist/

Tackling real-world socio-economic challenges requires designing and testing economic policies. However, this is hard in practice, due to a lack of appropriate (micro-level) economic data and limited opportunity to experiment. In Zheng et al., 2020, we propose a two-level deep reinforcement learning approach to learn dynamic tax policies, based on principled economic simulations in which both agents and a social planner (government) learn and adapt. AI social planners can discover tax policies that improve the equality and productivity trade-off by at least 16%, compared to the prominent Saez tax model, US Federal tax, and the free market. The learned tax policies are qualitatively different from the baselines, and certain model instances are effective in human studies as well.

This talk will present three topics: 1) economic policy design in the context of multi-agent RL, 2) our two-level RL approach to economic policy design, and 3) open research problems towards an AI Economist for the real world. These include key methodological challenges in two-level RL and data-driven economic modeling, multi-agent RL, mechanism design, convergence guarantees, robustness, explainability, and others.

### Wednesday, September 2nd, 2020

Many estimands of interest in sequential decision problems are non-smooth functionals of the data-generating distribution. Examples include the marginal mean outcome under an optimal policy and coefficients indexing regression models in approximate dynamic programming. We review how this non-regularity induces instability, which in turn can cause standard inference procedures based on asymptotic approximations to perform poorly. We derive inference procedures based on bounding a non-regular functional between two smooth functionals and show that the resulting inference is valid under fixed- and moving-parameter asymptotic frameworks. We also show that in some cases the resulting procedures are universal (i.e., provide consistent coverage uniformly over a large class of distributions) and also provide conditional coverage. Having derived confidence intervals and tests for the estimands of interest, we then consider power and sample size calculations based on these estimands.

In the second half of this tutorial, we introduce a number of open problems in statistical reinforcement learning. These include testing a parsimonious model against a nonparametric alternative; spatio-temporal decision problems; and doubly-robust methods for combining model-based and model-free methods. We introduce each problem and, in most cases, present partial solutions while identifying remaining technical and conceptual challenges.

Observational learning, or learning how to solve tasks by observing others perform them, is a key component of human development. Notably, we are capable of learning behaviors through only observations of state trajectories without direct access to the underlying actions (e.g., the exact kinematic forces, torques on joints, etc.) or intentions that yielded them. In order to be general, artificial agents should also be equipped with the ability to quickly solve problems after observing the solution. In this talk, I will discuss two techniques in depth for enabling observational learning in agents. First, I will describe an approach that trains a latent policy directly from state observations, which can then be quickly mapped to real actions in the agent’s environment. Then I will describe how we can train a novel value function, Q(s,s’), to learn off-policy from observations. Unlike previous imitation from observation approaches, this formulation goes beyond simply imitating and rather enables learning from potentially suboptimal observations.

### Thursday, September 3rd, 2020

This tutorial is designed to present RL theory and design techniques with an emphasis on control foundations, and without any probability prerequisites. Motivation comes in part from deep-rooted control philosophy (e.g., do you think we modeled the probability of hitting a seagull when we designed a control system to go to the moon?). The models used in traditional control design are often absurdly simple, but good enough to gain insight into how to control a rocket or a pancreas. Beyond this, once you understand RL techniques in this simplified framework, it does not take much work to extend the ideas to more complex probabilistic settings. The tutorial is organized as follows:

1. Control crash course: control objectives, state space philosophy and modeling, and the dynamic programming equations

2. Convex formulations of Q-learning and their relationship to deep Q-learning.

3. Extremum-seeking control and gradient-free approaches to RL

4. Applications to power systems. In particular, online optimization with measurement feedback (both gradient-based and gradient-free) and an application to optimal power flow (presented by Andrey Bernstein of NREL, a former student of Nahum Shimkin at the Technion)

Resources: Extensive lecture notes will be provided prior to the bootcamp (link below). In addition, students are strongly encouraged to view Prof. Richard Murray's tutorial on control fundamentals: https://simons.berkeley.edu/talks/murray-control-1
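As a small taste of the "dynamic programming equations" in a control setting, here is a sketch of the backward Riccati recursion for a finite-horizon linear-quadratic regulator (LQR); the double-integrator model and cost weights below are illustrative, not from the tutorial:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, N):
    """Backward dynamic programming for the finite-horizon LQR problem.

    Minimizes sum_{t=0}^{N-1} (x'Qx + u'Ru) + x_N' Q x_N subject to
    x_{t+1} = A x_t + B u_t.  Returns feedback gains K_t with u_t = -K_t x_t,
    plus the quadratic cost-to-go matrix at time 0.
    """
    P = Q.copy()                       # cost-to-go matrix at the terminal time
    gains = []
    for _ in range(N):
        # One step of the Riccati (dynamic programming) recursion.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        gains.append(K)
    return list(reversed(gains)), P

# Double-integrator example (illustrative numbers).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
gains, P0 = lqr_finite_horizon(A, B, Q, R, N=50)
```

For a long enough horizon the initial gain approaches the stationary (infinite-horizon) gain, and the closed-loop matrix A - B K_0 is stable; no probability model is needed anywhere in the derivation, in keeping with the tutorial's philosophy.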

We examine the problem of real-time optimization of networked systems and develop online algorithms that steer the system toward the optimal system trajectory. The problem is modeled as a dynamic optimization problem with time-varying performance objectives and engineering constraints. The design of the algorithms leverages the online primal-dual projected-gradient method. Both first-order and zero-order algorithms are considered. For zero-order algorithms, the primal step that involves the gradient of the objective function (and hence requires a model of the networked system) is replaced by a zero-order approximation with two function evaluations using a deterministic perturbation signal. The evaluations are performed using measurements of the system output, giving rise to a feedback interconnection in which the optimization algorithm serves as a feedback controller. We provide insights on the stability and tracking properties of this interconnection. Finally, we apply this methodology to a real-time optimal power flow problem in power systems, for reference power tracking and voltage regulation.
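A simplified sketch of the zero-order idea: replace the gradient with a finite-difference approximation built only from function evaluations. The coordinate-wise central difference below uses two evaluations per coordinate (the talk's method uses a single deterministic perturbation signal, so this is a cruder stand-in), and the toy objective and box constraint are invented for illustration:

```python
import numpy as np

def two_point_gradient(f, x, delta):
    """Central-difference gradient estimate using only function evaluations."""
    n = x.size
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = delta
        g[i] = (f(x + e) - f(x - e)) / (2 * delta)  # two evaluations per coordinate
    return g

# Zero-order projected gradient descent on a toy objective with a box constraint.
f = lambda x: float(x @ x - 2 * x[0])            # unconstrained minimizer at (1, 0)
x = np.array([3.0, 3.0])
for _ in range(200):
    x = x - 0.1 * two_point_gradient(f, x, delta=1e-4)
    x = np.clip(x, -2.0, 2.0)                    # projection onto the box [-2, 2]^2
```

In the feedback-interconnection view, `f` would be replaced by measurements of the system output rather than an explicit model.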

### Friday, September 4th, 2020

In the first part of the tutorial we discuss theoretical and modeling issues involved in making "optimal" decisions under uncertainty. The traditional approach is to model the underlying data process as random (stochastic) and to optimize a specified objective function on average. This raises the questions of how to control risk and how to handle uncertainty about the probability distributions themselves. In the second part of the tutorial we discuss numerical methods for solving stochastic programming problems. I will try to emphasize the difficulties and limitations of different approaches and suggest open questions that are, from my point of view, important.
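As a minimal example of "optimizing on average", here is a sample average approximation (SAA) sketch for a classical newsvendor problem; the demand distribution, prices, and grid search are all illustrative choices:

```python
import numpy as np

# SAA for a newsvendor problem: choose the order quantity x to maximize
# E[ p * min(x, D) - c * x ], replacing the expectation by a scenario average.
rng = np.random.default_rng(0)
demand = rng.exponential(scale=100.0, size=10_000)   # sampled demand scenarios D
p, c = 2.0, 1.0                                       # sell price and unit cost

def saa_objective(x):
    """Average profit over the sampled scenarios (the 'on average' criterion)."""
    return float(np.mean(p * np.minimum(x, demand) - c * x))

xs = np.linspace(0.0, 400.0, 401)
x_star = float(xs[np.argmax([saa_objective(x) for x in xs])])
# The true optimum satisfies the critical fractile P(D <= x*) = (p - c) / p = 0.5,
# i.e. x* is the median of D (about 100 * ln 2 ≈ 69.3 for Exp(100)), and the SAA
# solution converges to it as the number of scenarios grows.
```

How fast the SAA solution converges, and how to quantify the risk left out by averaging, are exactly the kinds of questions the tutorial takes up.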

In this tutorial, we will discuss the basics of simulation. We will start with a quick review of what makes sampling-based methods attractive in many computational settings, and then talk about simulation output analysis. Output analysis is concerned with questions of determining how long a simulation must be run to obtain a solution of prescribed accuracy. In the second half of the tutorial, we will discuss several variance reduction methods that have broad applicability, including control variates and importance sampling. We will conclude with a discussion of gradient estimation.
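To make the variance-reduction discussion concrete, here is a minimal control-variates sketch; the integrand E[e^U] and the control g(U) = U are textbook illustrative choices:

```python
import numpy as np

def control_variate_estimate(samples_f, samples_g, g_mean):
    """Monte Carlo estimate of E[f] using g (with known mean) as a control variate.

    Uses the estimated optimal coefficient b* = Cov(f, g) / Var(g); the variance
    of f - b*(g - E[g]) is (1 - corr(f, g)^2) times that of f alone.
    """
    cov = np.cov(samples_f, samples_g)
    b = cov[0, 1] / cov[1, 1]
    return float(np.mean(samples_f - b * (samples_g - g_mean)))

# Estimate E[e^U] for U ~ Uniform(0, 1), using g(U) = U with known mean 1/2.
rng = np.random.default_rng(0)
u = rng.random(10_000)
est = control_variate_estimate(np.exp(u), u, 0.5)
# The true value is e - 1 ≈ 1.71828; since e^U and U are highly correlated,
# the control variate sharply reduces the estimator's variance.
```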

Robotics is a fantastic testbed for both RL theory and RL practice. We have some problems (e.g., a humanoid doing a backflip) that look very hard but where model-based control performs quite well. We have other problems that really shouldn't be that hard (e.g., spreading peanut butter on toast) where model-based control falls short. I'll discuss key theoretical sticking points; progress here might unlock practical success. I'm excited to see what you come up with this fall!