No abstract available.

### Monday, August 5th, 2019

In the last 18 months, Natural Language Processing has been transformed by the success of using deep contextual word representations, that is, the output of ELMo, BERT, and their rapidly growing circle of friends, as a universal representation suitable as a source for fine-tuning for any language task. The simple training loss of predicting a word in context played out over mountains of text has been so successful that regardless of whether you're doing parsing, sentiment analysis or question answering, you now cannot win on a benchmark test without them. In some sense these models are full of knowledge. In another sense these models still just seem to reflect local text statistics. At any rate, can these models reason? Against this backdrop, what are more successful ways to build neural networks that can reason? And can we use them to tackle more of the problems of old-fashioned AI?

With the advent of higher-throughput and more accurate technologies to measure protein properties of interest, such as target binding to a drug, the time for machine learning to act synergistically with protein design is here. The obvious first place to do so is to replace the lab measurements with, for example, a deep neural network based predictive model. Then, one can ask how to invert that model to find desired protein sequences. Naively, inverting this model could be viewed as combinatorial optimization. However, one must take into account the possibly heteroscedastic uncertainty of the predictive model. Calibrating these uncertainties, even in the region of the training data, has been tackled, but could be improved. Moreover, "further away" from the training data, the uncertainties are arbitrarily bad. How can we tackle the general design problem when the functions we are optimizing cannot even be trusted?
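The worry about untrustworthy uncertainties can be made concrete with a toy sketch (everything here is an illustrative stand-in, not the speaker's method): a bootstrap ensemble of simple regressors provides an input-dependent uncertainty estimate via member disagreement, and a design score penalizes the predicted property value by that uncertainty so an optimizer is not rewarded for wandering far from the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a learned property predictor: a bootstrap ensemble of
# linear models fit to the same data. Disagreement across ensemble
# members serves as a crude, input-dependent uncertainty estimate.
X = rng.normal(size=(50, 5))          # featurized "sequences"
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

ensemble = []
for _ in range(10):
    idx = rng.integers(0, 50, size=50)           # bootstrap resample
    w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    ensemble.append(w)

def predict_with_uncertainty(x):
    preds = np.array([x @ w for w in ensemble])
    return preds.mean(), preds.std()

def design_score(x, penalty=2.0):
    # Penalize the predicted value by the uncertainty so the design
    # optimizer is discouraged from leaving the training distribution.
    mu, sigma = predict_with_uncertainty(x)
    return mu - penalty * sigma

# A candidate far outside the training distribution carries larger
# uncertainty than an in-distribution point.
x_in, x_out = X[0], 10.0 * np.ones(5)
_, sig_in = predict_with_uncertainty(x_in)
_, sig_out = predict_with_uncertainty(x_out)
```

The penalty term is one simple way to hedge against the "arbitrarily bad" uncertainties far from the data; as the abstract notes, even this requires the ensemble's disagreement to be calibrated, which in general it is not.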

While deep learning produces supervised models with unprecedented predictive performance on many tasks, under typical training procedures, advantages over classical methods emerge only with large datasets. The extreme data-dependence of reinforcement learners may be even more problematic. Millions of experiences sampled from video-games come cheaply, but human-interacting systems can’t afford to waste so much labor. In this talk I will discuss several efforts to increase the labor-efficiency of learning from human interactions. Specifically, I will cover work on learning dialogue policies, deep active learning for natural language processing, learning from noisy singly-labeled data, and active learning with partial feedback.

For open-ended tasks it is often difficult to measure or even define performance. For example, it's unclear what objective I should optimize in order to better understand how I should spend my time or which laws we should pass. I'll describe this problem in the language of learning theory, lay out the approaches that seem most promising to me, and overview some current work.

Understanding the power of depth in feed-forward neural networks is an ongoing challenge in the field of deep learning theory. While current works account for the importance of depth for the expressive power of neural networks, it remains an open question whether these benefits are exploited during a gradient-based optimization process. In this work we explore the relation between expressivity properties of deep networks and the ability to train them efficiently using gradient-based algorithms. We give a depth separation argument for distributions with fractal structure, showing that they can be expressed efficiently by deep networks, but not with shallow ones. These distributions have a natural coarse-to-fine structure, and we show that the balance between the coarse and fine details has a crucial effect on whether the optimization process is likely to succeed. We prove that when the distribution is concentrated on the fine details, gradient-based algorithms are likely to fail. Using this result we prove that, at least in some distributions, the success of learning deep networks depends on whether the distribution can be well approximated by shallower networks, and we conjecture that this property holds in general. Joint work with Eran Malach.

Deep learning is a powerful tool for learning to interface with open-world unstructured environments, enabling machines to parse images and other sensory inputs, perform flexible control, and reason about complex situations. However, deep models depend on large amounts of data or experience for effective generalization, which means that each skill or concept takes time and often human labeling labor to acquire. In this talk, I will discuss how meta-learning, or learning to learn, can lift this burden, by learning how to learn quickly and efficiently from past experience on related but distinct tasks. In particular, I will discuss the frontier of what meta-learning algorithms can accomplish today and what open challenges remain to make these algorithms more practical and universally applicable. These challenges include the online meta-learning problem, where the algorithm must become faster at learning as it learns, the problem of constructing task distributions without human supervision, and what happens when these algorithms are applied on very broad task distributions.

### Tuesday, August 6th, 2019

Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods---which provide the theoretical backbone for empirical ("deep") RL today---crucially rely on strong representation assumptions, e.g., that the function class is closed under Bellman update. Given that such assumptions are much stronger and less desirable than the ones needed for supervised learning (e.g., realizability), it is important to confirm the hardness of learning in their absence. Such a hardness result would also be a crucial piece of a bigger picture on the tractability of various RL settings. Unfortunately, while algorithm-specific lower bounds have existed for decades, the information-theoretic hardness remains a mystery. In this talk I will introduce the mathematical setup for studying value-function approximation, introduce our findings in the investigation of the hardness conjecture, and discuss connections to related results/open problems and their implications. Part of the talk is based on work with my student Jinglin Chen at ICML-19.
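As a concrete anchor for the setup, here is a minimal batch fitted Q-iteration sketch on a toy chain MDP (the MDP and the one-hot feature class are illustrative assumptions, not from the talk). With one-hot features the function class is trivially closed under the Bellman update, which is exactly the strong representation condition the talk discusses, and the iterates converge to the optimal Q-function.

```python
import numpy as np

# Minimal batch fitted Q-iteration on a toy 3-state chain MDP.
# With one-hot state-action features, "regression" onto the Bellman
# targets is exact, so the iterates converge to Q*.
n_states, n_actions, gamma = 3, 2, 0.9

def step(s, a):
    # Action 0 moves left, action 1 moves right; reward 1 for reaching
    # (or staying at) the rightmost state.
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return r, s2

# Batch of all transitions: an exhaustive "dataset" for the toy MDP.
batch = [(s, a, *step(s, a)) for s in range(n_states) for a in range(n_actions)]

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Compute all Bellman targets from the current Q (synchronously),
    # then fit; with one-hot features the fit is a table assignment.
    target = {(s, a): r + gamma * Q[s2].max() for (s, a, r, s2) in batch}
    for (s, a), t in target.items():
        Q[s, a] = t

greedy = Q.argmax(axis=1)   # optimal policy: always move right
```

When the function class is not Bellman-closed, the regression step incurs an approximation error at every iteration; whether that error can be controlled information-theoretically is precisely the hardness question above.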

Recent years have witnessed increasing empirical successes in reinforcement learning (RL). However, many theoretical questions about RL are not well understood even in the most basic setting. For example, how many observations are necessary and sufficient for learning a good policy? How can we learn to control an unstructured Markov decision process with provable regret? In this talk, we study the statistical efficiency of reinforcement learning in feature space and show how to learn the optimal policy algorithmically and efficiently. We will introduce feature-based reinforcement learning algorithms with minimax-optimal sample complexity and near-optimal regret. We will also discuss a state embedding learning method that is able to automatically learn state features from state trajectories.

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: 1) if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); 2) how they cope with approximation error due to using a restricted class of parametric policies; or 3) their finite sample behavior. In this talk, we will study all these issues, and provide a broad understanding of when first-order approaches to direct policy optimization in RL succeed. We will also identify the relevant notions of policy class expressivity underlying these guarantees in the approximate setting. Throughout, we will also highlight the interplay of exploration with policy optimization, both in our upper bounds and illustrative lower bounds. This talk is based on joint work with Sham Kakade, Jason Lee and Gaurav Mahajan.
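The simplest instance of direct policy optimization is tabular softmax policy gradient on a two-armed bandit, where convergence to the globally optimal policy can be watched directly (a toy illustration of the setting, not the talk's analysis; here the exact gradient is available in closed form).

```python
import numpy as np

# Tabular softmax policy gradient on a two-armed bandit: the simplest
# setting in which direct policy optimization provably reaches the
# globally optimal (deterministic) policy.
rewards = np.array([1.0, 3.0])      # expected reward of each arm
theta = np.zeros(2)                 # softmax policy parameters

def policy(theta):
    e = np.exp(theta - theta.max())  # stable softmax
    return e / e.sum()

lr = 0.5
for _ in range(500):
    pi = policy(theta)
    J = pi @ rewards                 # expected return under pi
    # Exact policy gradient: dJ/dtheta_a = pi_a * (r_a - J).
    theta += lr * pi * (rewards - J)

pi = policy(theta)                   # concentrates on the better arm
```

Even in this tiny example the gradient vanishes as the policy becomes deterministic, a first hint at the convergence-rate and exploration issues the talk addresses in general state and action spaces.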

Off-policy optimization seeks to overcome the data inefficiency of on-policy learning by leveraging arbitrarily collected domain observations. Although compensating for data provenance is often considered to be the central challenge in off-policy policy optimization, the choice of training loss and the mechanism used for missing data inference also play a critical role. I will discuss some recent progress in developing alternative loss functions for off-policy optimization that are compatible with principled forms of missing data inference. By leveraging standard concepts from other areas of machine learning, such as calibrated surrogate losses and empirical Bayes estimation, simple policy optimization techniques can be derived that are theoretically sound and empirically effective in simple scenarios. I will discuss prospects for scaling these approaches up to large problems.
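The basic correction for data provenance that off-policy methods build on is inverse propensity scoring (IPS); here is a minimal sketch on a two-armed bandit (the policies and reward probabilities are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Logged bandit data collected by a known behavior policy mu; we
# estimate the value of a different target policy pi via inverse
# propensity scoring, the basic correction for data provenance.
mu = np.array([0.8, 0.2])            # behavior policy over 2 actions
pi = np.array([0.1, 0.9])            # target policy to evaluate
true_reward = np.array([0.2, 0.7])   # expected reward per action

n = 20000
actions = rng.choice(2, size=n, p=mu)
rewards = rng.binomial(1, true_reward[actions]).astype(float)

# IPS: reweight each logged reward by pi(a) / mu(a).
weights = pi[actions] / mu[actions]
v_ips = np.mean(weights * rewards)

v_true = pi @ true_reward            # ground truth: 0.65
```

The estimator is unbiased but its variance blows up when the target policy takes actions the behavior policy rarely took, which is one reason the choice of training loss and missing-data inference mechanism matters beyond the reweighting itself.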

I'll present a brief overview of some recent work on reinforcement learning motivated by practical issues that arise in the application of RL to online, user-facing applications like recommender systems. These include (a) stochastic action sets; (b) long-term cumulative effects; and (c) combinatorial action spaces. With respect to (c) I will discuss SlateQ, a novel decomposition technique that allows value-based RL (e.g., Q-learning) in slate-based recommender systems to scale to commercial production systems, and briefly describe both small-scale simulation and a large-scale experiment with YouTube. With respect to (b), I will briefly discuss Advantage Amplification, a temporal aggregation technique that allows for more effective RL in partially observable domains with low SNR, as often arise in recommender systems.

**Panelists**: Lihong Li (Google Brain), Jennifer Listgarten (UC Berkeley), Elchanan Mossel (Massachusetts Institute of Technology), Shai Shalev-Shwartz (The Hebrew University of Jerusalem), Mengdi Wang (Princeton University)

**Moderators**: Po-Ling Loh (University of Wisconsin, Madison), Matus Telgarsky (University of Illinois, Urbana-Champaign)


### Wednesday, August 7th, 2019

Value function approximation lies at the heart of almost all reinforcement learning algorithms. Dominant approaches in the literature are based on dynamic programming, which apply various forms of the Bellman operator to iteratively update the value function estimate, hoping that it converges to the true value function. While successful in several prominent cases, these methods often do not have convergence guarantees and are hard to analyze, except in rather restricted cases.

In this talk, we will focus on a different approach that has recently received fast-growing interest. The key is to frame value function approximation as a more standard optimization problem with an easy-to-optimize objective function. By doing so, one can often develop better and provably convergent algorithms, whose theoretical properties can be more conveniently analyzed using existing techniques from statistical machine learning.
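A minimal sketch of this framing (the chain MDP and tabular features are illustrative choices): policy evaluation is posed as minimizing the mean squared Bellman residual, an explicit objective attacked with plain gradient descent rather than with iterated Bellman backups.

```python
import numpy as np

# Policy evaluation cast as optimization: minimize the mean squared
# Bellman residual of a linear value function on a small deterministic
# chain, using gradient descent instead of dynamic-programming updates.
n, gamma = 4, 0.9
P = np.eye(n, k=1)                 # deterministic right-moving chain
P[-1, -1] = 1.0                    # rightmost state is absorbing
r = np.zeros(n)
r[-1] = 1.0                        # reward 1 in the absorbing state

Phi = np.eye(n)                    # one-hot (tabular) features
w = np.zeros(n)

# Bellman residual is linear in w: residual(w) = A w - r.
A = Phi - gamma * P @ Phi

lr = 2.0
for _ in range(20000):
    g = A.T @ (A @ w - r) / n      # gradient of 0.5 * ||A w - r||^2 / n
    w -= lr * g

V = Phi @ w                        # converges to the true values
```

Because the objective is an ordinary least-squares problem, convergence follows from standard optimization arguments; with stochastic transitions the naive residual objective needs care (the double-sampling issue), which is part of what makes this line of work interesting.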

I will consider the goal of designing deep learning systems that have strong, verified assurances of correctness with respect to mathematically-specified requirements. I will describe some challenges for achieving verified deep learning, and propose a few principles for addressing these challenges, with a special focus on techniques based on formal methods. I will illustrate the ideas with examples and sample results from the domain of intelligent cyber-physical systems, with a particular focus on the use of deep learning in autonomous vehicles.

The typical view of deep learning architectures is that they consist of stacked linear operators followed by (typically elementwise) non-linear operators, with minor additions such as residual connections, attention units, etc. However, a great deal of recent work has looked at integrating substantially more structured layers into deep architectures, such as explicit optimization solvers, ODE solvers, physical simulations, and many other examples. In these examples, any differentiable program can serve as a layer in a deep network, and by properly structuring these problems we can encode a great deal of prior knowledge into the system. In this talk, I will outline the basic approach behind these structured layers, and highlight some recent advances in the area related to incorporating discrete optimization solvers as layers in deep networks. However, despite their potential advantages, these structured layers raise a number of challenges, especially regarding gradient-based training of the systems. I will discuss these challenges and potential ways forward.
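A small sketch of the key mechanism behind such optimization layers (the quadratic inner problem is an illustrative choice): the layer's output is the solution of an inner optimization problem, and its Jacobian is obtained by implicitly differentiating the optimality conditions rather than by unrolling the solver.

```python
import numpy as np

# A differentiable "optimization layer": the forward pass solves an
# inner quadratic problem, and the backward pass differentiates the
# optimality conditions implicitly.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # positive definite

def layer(x):
    # Forward: z*(x) = argmin_z 0.5 z'Qz - x'z, solved exactly.
    return np.linalg.solve(Q, x)

def layer_jacobian(x):
    # Stationarity: Q z* - x = 0. Differentiating both sides gives
    # Q dz* = dx, so dz*/dx = Q^{-1}, regardless of how the inner
    # solver actually ran.
    return np.linalg.inv(Q)

# Sanity check against central finite differences.
x = np.array([1.0, -2.0])
J = layer_jacobian(x)
eps = 1e-6
J_fd = np.column_stack([
    (layer(x + eps * e) - layer(x - eps * e)) / (2 * eps)
    for e in np.eye(2)
])
```

The appeal is that backpropagation needs only the optimality conditions, not the solver's internals; the difficulty the talk raises is that for discrete solvers those conditions are not differentiable in the first place.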

The nascent field of fair machine learning aims to ensure that decisions guided by algorithms are equitable. Over the last several years, three formal definitions of fairness have gained prominence: (1) anti-classification, meaning that protected attributes -- like race, gender, and their proxies -- are not explicitly used to make decisions; (2) classification parity, meaning that common measures of predictive performance (e.g., false positive and false negative rates) are equal across groups defined by the protected attributes; and (3) calibration, meaning that conditional on risk estimates, outcomes are independent of protected attributes. In this talk, I'll show that all three of these fairness definitions suffer from significant statistical limitations. Requiring anti-classification or classification parity can, perversely, harm the very groups they were designed to protect; and calibration, though generally desirable, provides little guarantee that decisions are equitable. In contrast to these formal fairness criteria, I'll argue that it is often preferable to treat similarly risky people similarly, based on the most statistically accurate estimates of risk that one can produce. Such a strategy, while not universally applicable, often aligns well with policy objectives; notably, this strategy will typically violate both anti-classification and classification parity. In practice, it requires significant effort to construct suitable risk estimates. One must carefully define and measure the targets of prediction to avoid retrenching biases in the data. But, importantly, one cannot generally address these difficulties by requiring that algorithms satisfy popular mathematical formalizations of fairness. By highlighting these challenges in the foundation of fair machine learning, we hope to help researchers and practitioners productively advance the area.
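Definitions (2) and (3) can be audited directly on data. Here is a toy sketch on synthetic risk scores (the score distributions and threshold are invented for illustration): scores that are calibrated by construction can still produce unequal false positive rates across groups, illustrating the tension between the criteria.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy audit of two formal fairness criteria on synthetic scores:
# classification parity (equal false positive rates across groups)
# and calibration (outcome rates match the risk score).
n = 50000
group = rng.integers(0, 2, size=n)
risk = np.where(group == 0,          # group 0 has lower base risk
                rng.beta(2, 5, size=n),
                rng.beta(4, 4, size=n))
outcome = rng.binomial(1, risk)      # scores are calibrated by design

pred = (risk > 0.5).astype(int)      # threshold classifier on the score

def fpr(g):
    # False positive rate within group g.
    mask = (group == g) & (outcome == 0)
    return pred[mask].mean()

def calib(g, lo=0.4, hi=0.6):
    # Observed outcome rate within a risk-score bin, per group.
    mask = (group == g) & (risk >= lo) & (risk < hi)
    return outcome[mask].mean()

fpr0, fpr1 = fpr(0), fpr(1)          # unequal: parity is violated
cal0, cal1 = calib(0), calib(1)      # approximately equal: calibrated
```

Because the groups have different risk distributions, the calibrated score yields different false positive rates, exactly the kind of statistical incompatibility the talk examines.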

At every level of government, officials contract for technical systems that employ machine learning -- systems that perform tasks without using explicit instructions, relying on patterns and inference instead. These systems frequently displace discretion previously exercised by policymakers or individual front-end government employees with an opaque logic that bears no resemblance to the reasoning processes of agency personnel. However, because agencies acquire these systems through government procurement processes, they, and the public, have little input into -- or even knowledge about -- their design, or how well that design aligns with public goals and values. In this talk I explain the ways that the decisions about goals, values, risk, certainty, and the elimination of case-by-case discretion inherent in machine-learning system design make policy; how the use of procurement to manage their adoption undermines appropriate attention to these embedded policies; and, drawing on administrative law, argue that when system design embeds policies the government must use processes that address technocratic concerns about the informed application of expertise, and democratic concerns about political accountability. Specifically, I describe ways that the policy choices embedded in machine learning system design today fail the prohibition against arbitrary and capricious agency action absent a reasoned decision-making process that both enlists the expertise necessary for reasoned deliberation about such choices and makes visible the political choices being made. I conclude by sketching out options for bringing the necessary technical expertise and political visibility into government processes for adopting machine learning systems through a mix of institutional and engineering design solutions.

**Panelists**: Sharad Goel (Stanford University), Deirdre Mulligan (UC Berkeley), Emily Witt (Salesforce), Alice Xiang (Partnership on AI)

**Moderators**: Deirdre Mulligan (UC Berkeley), Matus Telgarsky (University of Illinois, Urbana-Champaign)


### Thursday, August 8th, 2019

As deep learning models are being used in different tasks and settings, there has been increased interest in ways to 'interpret', 'debug', and 'understand' these models. Consequently, there has been a wave of post-hoc, sensitivity-based methods for interpreting DNNs. These methods typically provide a local 'explanation' around single input examples. Given this wave of proposed methods, it is currently difficult for a practitioner to select one for use.

In this talk, we will look at potential benefits and limitations of the local explanations paradigm. First, we will consider ways to assess these interpretation methods. In particular, we will try to get at whether these methods can help debug models, i.e., help identify model mistakes prior to deployment. In addition, we will consider privacy trade-offs. Recent work has shown that it is easy to recover a model with modest access to local explanations for a few data points, raising privacy concerns. We will look at recent results in this line of work, and end with some interesting research directions.
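To make "local, sensitivity-based explanation" concrete, and to hint at why a few explanations can leak the model, here is a toy sketch (finite-difference sensitivities of a hypothetical linear model, not any specific published method): for a linear model the explanation recovers the weights exactly.

```python
import numpy as np

# A minimal local, sensitivity-based "explanation": the sensitivity of
# the model output to each input feature at a single example. For a
# linear model this recovers the weights exactly, which is also why a
# handful of such explanations can leak the model itself.
w = np.array([2.0, -1.0, 0.5])       # hidden model parameters

def model(x):
    return float(w @ x)

def local_explanation(x, eps=1e-5):
    # Central finite differences: one sensitivity per input feature.
    return np.array([
        (model(x + eps * e) - model(x - eps * e)) / (2 * eps)
        for e in np.eye(len(x))
    ])

x = np.array([1.0, 2.0, 3.0])
expl = local_explanation(x)          # equals w for a linear model
```

For nonlinear models the explanation varies per example, but the same query access that makes explanations cheap is what enables the model-recovery attacks discussed above.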

Stoic philosophers practiced a method called “premeditation of evils” to help think about how to prepare today for potential failures. This powerful idea is simple: think in reverse. Instead of figuring out how to succeed, think about how to fail, then try to avoid those mistakes. Interpretability has been well recognized as an important problem in ML, but it is a complex problem with many potential failure modes. In this talk, I'll share a few failure modes in 1) setting expectations, 2) making an interpretability method, 3) evaluating it, and 4) interpreting an explanation. To put things in context, I'll share statistics on how many papers are making these mistakes, drawn from last year's publications on the topic. I will also share some open theoretical questions that may help us move forward. I hope this talk will offer a new angle on ways to make progress in this field.

A key question is how to better leverage the data generated (from RL agents or humans) to get better control policies. We would like to be able to combine the power of extremely expressive function approximators like deep learning with rigorous statistical guarantees to ensure data efficiency and strong guarantees on resulting performance. In this talk I will outline some of our recent work in this space and potential future directions.

Data can be corrupted in many ways: via outliers, measurement errors, failed sensors, batch effects, and so on. Standard maximum likelihood learning will either reproduce these errors or fail to converge entirely. Given this, what learning objectives should we use instead? I will present a general framework for studying robustness to different families of errors in the data, and use this framework to provide guidance on designing error-robust estimators.
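A toy instance of the corruption problem (the numbers are invented for illustration): a small fraction of gross outliers ruins the maximum likelihood estimate of a Gaussian mean (the sample mean), while simple robust alternatives barely move.

```python
import numpy as np

rng = np.random.default_rng(3)

# Mean estimation under corruption: 5% of the measurements are replaced
# by a gross error. The MLE (sample mean) is dragged far from the truth;
# the median and a trimmed mean are not.
n = 1000
data = rng.normal(loc=5.0, scale=1.0, size=n)
data[:50] = 1000.0                  # 5% corrupted measurements

mle = data.mean()                   # badly biased by the corruption
robust = np.median(data)            # barely moves

# Trimmed mean: drop the most extreme 10% on each side before averaging.
trimmed = np.sort(data)[100:-100].mean()
```

Which robust objective is appropriate depends on the error family (outliers versus batch effects versus sensor failures), which is exactly the question the framework in the talk is designed to study.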