# Abstracts

### Monday, March 27th, 2017

9:30 am – 10:10 am
What is it that enables learning with multi-layer networks? What makes it possible to optimize the error, despite the problem being hard in the worst case? What causes the network to generalize well despite the model class having extremely high capacity? In this talk I will explore these questions through experimentation, analogy to matrix factorization (including some new results on the energy landscape and implicit regularization in matrix factorization), and study of alternate geometries and optimization approaches.
10:10 am – 10:50 am
We'll describe a novel theoretical framework for unsupervised learning which is not based on generative assumptions. It is comparative, and allows us to avoid known computational hardness results and improper algorithms based on convex relaxations. We show how several families of unsupervised learning models, which were previously only analyzed under probabilistic assumptions and are otherwise provably intractable, can be efficiently learned in our framework by convex optimization. These include dictionary learning and learning of algebraic manifolds.

Joint work with Tengyu Ma.
11:20 am – 12:20 pm

A basic – yet very successful – tool for modeling human language has been a new generation of distributed word representations: neural word embeddings. However, beyond just word meanings, we need to understand the meanings of larger pieces of text and the relationships between pieces of text, like questions and answers. Two requirements for that are good ways to understand the structure of human language utterances and ways to compose their meanings. Deep learning methods can help with both tasks. I will then look at methods for understanding the relationships between pieces of text, for tasks such as natural language inference, question answering, and machine translation. A key, still open, question raised by recent deep learning work for NLP is to what extent we need explicit language and knowledge representations versus everything being latent in distributed representations. Put most controversially, that is the question of whether a bidirectional LSTM with attention is the answer for all language processing needs.

2:15 pm – 3:15 pm
We know how to spot objects in images, but we have to learn on more images than a human can see in a lifetime. We know how to translate text (somehow), but we have to learn it on more text than a human can read in a lifetime. We know how to learn to play Atari games, but we have to learn it by playing more games than any teenager can endure. The list is long.

We can of course try to pin this inefficiency on some properties of our algorithms. However, we can also take the point of view that there is possibly a lot of signal in natural data that we simply do not exploit. I will report on two works in this direction. The first one establishes that something as simple as a collection of static images contains nontrivial information about the causal relations between the objects they represent. The second one shows how an attempt to discover structure in observational data led to a clear improvement of Generative Adversarial Networks.
3:45 pm – 4:25 pm

I'll describe recent work on modeling complex relationships in a neural setting, based primarily on a combination of topic-model style dictionary learning (for interpretability) and recurrent neural networks (to capture the flow of time). This all ties in to the question of how to learn common-sense knowledge; for this, I'll talk first about understanding how relationships between humans evolve, learned from text alone; and then how this can be extended to multimodal (image and text) settings. Joint with many people, especially Snigdha Chaturvedi, Mohit Iyyer and Jordan Boyd-Graber.

4:30 pm – 4:45 pm
We will introduce a formal framework for thinking about representation learning, endeavoring to capture its power in settings like semi-supervised and transfer learning. The framework involves modeling how the data was generated, and thus is related to previous Bayesian notions with some new twists.

In simple settings where it is possible to learn representations from unlabeled data, we show:
(i) it can greatly reduce the need for labeled data (semi-supervised learning)
(ii) it allows solving classification tasks when previous notions such as nearest neighbors or manifold learning either don't work or require too much data.

We also clarify two important settings, linear mixture models and log-linear models, where representation learning can be done under plausible assumptions (despite being NP-hard in the worst case).

Joint work with Sanjeev Arora.
4:45 pm – 5:00 pm

Training neural networks is a difficult non-convex optimization problem with possibly numerous local optima and saddle points. However, empirical evidence seems to suggest the effectiveness of simple gradient-based algorithms. In this work, we analyze the properties of stationary points for training one-hidden-layer neural networks with ReLU activation functions and show that a stationary point implies a global optimum with high probability under some conditions on the neural weights. Moreover, we introduce semi-random units, where the activation pattern is determined by a random projection of the input, and show that networks with these units are guaranteed to converge to a global optimum with high probability.
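
The difference between a ReLU unit and a semi-random unit can be sketched in a few lines. This is only an illustration of the idea as stated in the abstract, not the authors' code: the function names, the toy vectors, and the strict `> 0` gate are assumptions, and the projection `r` is fixed by hand here where it would normally be drawn at random.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def relu_unit(w, x):
    """Standard ReLU unit: the trained weights w decide whether the unit is active."""
    s = dot(w, x)
    return s if s > 0 else 0.0


def semi_random_unit(w, r, x):
    """Semi-random unit: a fixed projection r decides the on/off pattern,
    while the trained weights w only set the value when the unit is on."""
    gate = 1.0 if dot(r, x) > 0 else 0.0
    return gate * dot(w, x)


# Toy example (all values hypothetical). r is fixed for reproducibility.
r = [1.0, 0.0]
x = [2.0, 1.0]
w1, w2 = [1.0, 2.0], [3.0, -1.0]
w_sum = [a + b for a, b in zip(w1, w2)]

# For fixed r and x the semi-random unit is linear in the trained weights.
lhs = semi_random_unit(w_sum, r, x)
rhs = semi_random_unit(w1, r, x) + semi_random_unit(w2, r, x)
print(lhs, rhs)
```

Because the gate does not depend on `w`, the output is linear in the trained weights for any fixed input, which is not true of the ReLU unit.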

### Tuesday, March 28th, 2017

9:30 am – 10:10 am
TBD
10:10 am – 10:50 am
The status quo in visual recognition is to learn from batches of unrelated Web photos labeled by human annotators. Yet cognitive science tells us that perception develops in the context of acting and moving in the world, and without intensive supervision. How can unlabeled video augment computational visual learning? I’ll describe our recent work exploring how a system can learn effective representations by watching unlabeled video. First we consider how the ego-motion signals accompanying a video provide a valuable cue during learning, allowing the system to internalize the link between “how I move” and “what I see”. Building on this link, we explore end-to-end learning for active recognition: an agent learns how its motions will affect its recognition, and moves accordingly. Next, we explore how the temporal coherence of video permits new forms of invariant feature learning. Incorporating these ideas into various recognition tasks, we demonstrate the power of learning from ongoing, unlabeled visual observations, even overtaking traditional heavily supervised approaches in some cases.
11:20 am – 12:20 pm
Option 1: Tutorial on Deep RL

Option 2: Recent Research on Deep RL for Robotics
2:15 pm – 2:55 pm

In this talk, I will discuss how to learn representations for perception and action without using any manual supervision. First, I am going to discuss how we can learn ConvNets for vision in a completely unsupervised manner using auxiliary tasks. Specifically, I am going to demonstrate how spatial context in images and viewpoint changes in videos can be used to train visual representations. Next, I am going to talk about how we can use a robot to physically explore the world and learn visual representations for classification/recognition tasks. Finally, I am going to talk about how we can perform end-to-end learning for actions using self-supervision.

2:55 pm – 3:35 pm
The empirical success of Deep Learning is stunning, and every day we hear new success stories. However, for both theoreticians and practitioners, it is important to understand the limitations. We describe three families of problems for which existing deep learning algorithms fail. We illustrate practical cases in which these failures apply and provide a theoretical insight explaining the source of difficulty.

Joint work with Ohad Shamir and Shaked Shammah.

### Wednesday, March 29th, 2017

9:30 am – 10:10 am
Linguistic structure prediction infers abstract representations of text, like syntax trees and semantic graphs, enabling interpretation in applications like question answering, information extraction, and opinion analysis.  This talk is about the latest family of methods for linguistic structure prediction, which make heavy use of representation learning via neural networks.  I'll present these new methods as continuous generalizations of state machines and probabilistic grammars.  I'll show how they've led to fast and accurate performance on several syntactic and semantic parsing problems.
10:10 am – 10:50 am
Learning of layered or "deep" representations has provided significant advances in computer vision in recent years, but has traditionally been limited to fully supervised settings with very large amounts of training data.  New results in adversarial adaptive representation learning show how such methods can also excel when learning in sparse/weakly labeled settings across modalities and domains. I'll review state-of-the-art models for fully convolutional pixel-dense segmentation from weakly labeled input, and will discuss new methods for adapting models to new domains with few or no target labels for categories of interest.  As time permits, I'll present recent long-term recurrent network models that learn cross-modal description and explanation, visuomotor robotic policies that adapt to new domains, and deep autonomous driving policies that can be learned from heterogeneous large-scale dashcam video datasets.
2:15 pm – 2:55 pm

In this talk I will focus on discussing deep learning models that can find semantically meaningful representations of words, learn to read documents and answer questions about their content. First, I will introduce the Gated-Attention (GA) Reader model, which integrates a multi-hop architecture with a novel attention mechanism based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. Second, I will introduce a two-step learning system for question answering from unstructured text, consisting of a retrieval step and a reading comprehension step. Finally, I will discuss a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words. I will show that on several tasks, these models significantly improve upon many of the existing techniques.

Joint work with Bhuwan Dhingra, Zhilin Yang, Yusuke Watanabe, Hanxiao Liu, Ye Yuan, Junjie Hu, and William W. Cohen.

2:55 pm – 3:35 pm

Over the past few decades, various approaches have been introduced for learning probabilistic models, depending on whether the examples are labeled or unlabeled, and whether they are complete or incomplete. In this talk, I will introduce an orthogonal class of machine learning problems, which have not been treated as systematically before. In these problems, one has access to Boolean constraints that characterize examples which are known to be impossible (e.g., due to known domain physics). The task is then to learn a tractable probabilistic model over a structured space defined by the constraints. I will describe a new class of Arithmetic Circuits, the PSDD, for addressing this class of learning problems. The PSDD is based on advances from both machine learning and logical reasoning and can be learned under Boolean constraints. I will also provide a number of results on learning PSDDs. First, I will contrast PSDD learning with approaches that ignore known constraints, showing how it can learn more accurate models. Second, I will show that PSDDs can be utilized to learn, in a domain-independent manner, distributions over combinatorial objects, such as rankings, game traces and routes on a map. Third, I will show how PSDDs can be learned from a new type of dataset, in which examples are specified using arbitrary Boolean expressions. A number of case studies will be illustrated throughout the talk, including the unsupervised learning of preference rankings and the supervised learning of classifiers for routes and game traces.

4:10 pm – 4:30 pm

Tensor methods have emerged as a powerful paradigm for consistent learning of many latent variable models such as topic models, independent component analysis and dictionary learning. Model parameters are estimated via CP decomposition of the observed higher order input moments. We extend the tensor decomposition framework to models with invariances, such as convolutional dictionary models. Our tensor decomposition algorithm is based on the popular alternating least squares (ALS) method, but with additional shift invariance constraints on the factors. We demonstrate that each ALS update can be computed efficiently using simple operations such as fast Fourier transforms and matrix multiplications. Our algorithm converges to models with better reconstruction error and is much faster, compared to the popular alternating minimization heuristic, where the filters and activation maps are alternately updated.

4:35 pm – 4:55 pm

We show that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is "almost dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free.
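
To make the idea concrete, here is a minimal sketch of gradient descent with occasional random perturbations on a toy two-dimensional function. It is a simplified illustration, not the procedure analyzed in the talk: the test function, step size, gradient threshold, perturbation radius, waiting period, and fixed iteration budget are all assumptions chosen for readability.

```python
import math
import random


def grad(x, y):
    """Gradient of f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2.
    f has a saddle point at (0, 0) and local minima at (0, 1) and (0, -1)."""
    return x, y ** 3 - y


def descend(x, y, perturb, step=0.1, iters=500,
            g_thresh=1e-3, radius=1e-2, wait=20, seed=0):
    """Gradient descent; if perturb is True, add a small random kick
    whenever the gradient is tiny and no kick was added recently."""
    rng = random.Random(seed)
    last_kick = -wait
    for t in range(iters):
        gx, gy = grad(x, y)
        if perturb and math.hypot(gx, gy) < g_thresh and t - last_kick >= wait:
            # Sample a point uniformly from a disk of the given radius.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            dist = radius * math.sqrt(rng.random())
            x += dist * math.cos(angle)
            y += dist * math.sin(angle)
            last_kick = t
            gx, gy = grad(x, y)
        x -= step * gx
        y -= step * gy
    return x, y


# Starting on the x-axis, plain gradient descent slides into the saddle.
plain = descend(1.0, 0.0, perturb=False)
# The perturbed variant leaves the saddle and settles near a local minimum.
kicked = descend(1.0, 0.0, perturb=True)
print(plain, kicked)
```

Started from (1, 0), the unperturbed run never leaves the line y = 0 and converges to the saddle, while the perturbed run ends close to one of the two minima.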

### Thursday, March 30th, 2017

9:30 am – 10:10 am

TBD

10:10 am – 10:50 am
Representation learning through an end-to-end learning framework has shown increasingly strong results across many applications in NLP and computer vision, gradually removing the reliance on human engineered features. However, end-to-end learning is feasible only if high quality training data is available at scale, learning from scratch for each task is rather wasteful, and learned representation tends to focus on task-specific or dataset-specific patterns, rather than producing interpretable and generalizable knowledge about the world.

In contrast, humans learn a great deal about the world with and without end-to-end learning, and it is this rich background knowledge about the world that enables humans to navigate through complex unstructured environments and learn new tasks efficiently from only a handful of examples.

In this talk, I will present our recent efforts that investigate the feasibility of acquiring and representing trivial everyday knowledge about the world. In the first part, I will present our work that focuses on procedural language and knowledge in the cooking recipe domain, where procedural knowledge, e.g., "how to bake blueberry muffins", is implicit in the learned neural representation. In the second part, I will present a complementary approach that attempts to reverse engineer even such knowledge not explicit in language due to reporting bias (people rarely state the obvious, e.g., "my house is bigger than me") by jointly reasoning about multiple related types of knowledge about actions and objects in the world.
11:20 am – 12:20 pm
TBD
2:15 pm – 2:55 pm

This paper makes progress on several open theoretical issues related to Generative Adversarial Networks. A definition is provided for what it means for the training to generalize, and it is shown that generalization is not guaranteed for the popular distances between distributions such as Jensen-Shannon or Wasserstein. We introduce a new metric called neural net distance for which generalization does occur. We also show that an approximate pure equilibrium in the 2-player game exists for a natural training objective (Wasserstein). Showing such a result has been an open problem (for any training objective).

Finally, the above theoretical ideas lead us to propose a new training protocol, MIX+GAN, which can be combined with any existing method. We present experiments showing that it stabilizes and improves some existing methods.

Joint work with Rong Ge, Yingyu Liang, Tengyu Ma, Yi Zhang.

2:55 pm – 3:35 pm

In this talk we discuss recent works on learning the single-layer noisy-or network, which is a textbook example of a Bayes net and is used, for example, in the classic QMR-DT software for diagnosing which disease(s) a patient may have by observing the symptoms he/she exhibits. These networks are highly non-linear; as a result, previous works on matrix/tensor decomposition cannot be applied directly. In this talk we show that matrix/tensor decomposition techniques can still be adapted to give strong theoretical guarantees even for these nonlinear models.
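
For readers unfamiliar with the model, the standard noisy-or conditional probability can be written out directly. The sketch below uses the textbook definition; the function name, the optional leak term, and the numbers are illustrative assumptions and are not taken from the talk or from QMR-DT.

```python
def noisy_or_prob(present, activation, leak=0.0):
    """P(symptom = 1 | diseases) under a noisy-or model.

    present[i]    -- 1 if disease i is present, else 0
    activation[i] -- probability that disease i alone triggers the symptom
    leak          -- probability the symptom appears with no disease present
    """
    p_off = 1.0 - leak
    for d, a in zip(present, activation):
        if d:
            # Each present disease independently fails to trigger the symptom.
            p_off *= 1.0 - a
    return 1.0 - p_off


# Hypothetical numbers: two diseases, each triggering the symptom half the time.
p_one = noisy_or_prob([1, 0], [0.5, 0.5])
p_both = noisy_or_prob([1, 1], [0.5, 0.5])
p_none = noisy_or_prob([0, 0], [0.5, 0.5], leak=0.1)
print(p_one, p_both, p_none)
```

The probability with both diseases present is not the sum of the single-disease probabilities, which is the non-linearity the abstract refers to.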

### Friday, March 31st, 2017

9:30 am – 10:10 am
I will discuss our experience in learning vector representations of sentences that reflect paraphrastic similarity. That is, if two sentences have similar meanings their vectors should have high cosine similarity. Our goal is to produce a function that can be used to embed any sentence for use in downstream tasks, analogous to how pretrained word embeddings are currently used by practitioners in a broad range of applications. I'll describe our experiments in which we train on large datasets of noisy paraphrase pairs and test our models on standard semantic similarity benchmarks. We consider a variety of functional architectures, including those based on averaging, long short-term memory, and convolutional networks. We find that simple architectures are easier to train and exhibit more stability when transferring to new domains, though they can be beaten by more powerful architectures when sufficient tuning and aggressive regularization are used.
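
The simplest architecture mentioned above, word averaging scored by cosine similarity, can be sketched as follows. The three-dimensional word vectors are invented for illustration only; a real system would use trained embeddings, and the function names are assumptions.

```python
import math

# Hypothetical toy word vectors, chosen by hand for this example.
WORD_VECS = {
    "cat": [1.0, 0.0, 0.0],
    "feline": [0.9, 0.1, 0.0],
    "sleeps": [0.0, 1.0, 0.0],
    "naps": [0.1, 0.9, 0.0],
    "stocks": [0.0, 0.0, 1.0],
    "fell": [0.0, 0.1, 0.9],
}


def embed(sentence):
    """Embed a sentence as the average of its word vectors."""
    vecs = [WORD_VECS[w] for w in sentence.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_paraphrase = cosine(embed("cat sleeps"), embed("feline naps"))
sim_unrelated = cosine(embed("cat sleeps"), embed("stocks fell"))
print(sim_paraphrase, sim_unrelated)
```

With these toy vectors the paraphrase pair scores far higher than the unrelated pair, which is the behavior the training objective is meant to encourage.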
11:20 am – 12:20 pm
The human visual system does not passively view the world, but actively moves its sensor array through eye, head and body movements.  Here we explore the consequences of the active perception setting for learning efficient visual representations.  This work focuses on two specific questions: 1) what is the optimal spatial layout of the image sampling lattice for visual search via eye movements?  and 2) how should information be assimilated from multiple fixations in order to form a holistic scene representation that allows for visual reasoning about compositional structure of the scene?   We answer these questions through the framework of end-to-end learning in a structured neural network trained to perform search and reasoning tasks.  The derived models provide new insight into the neural representations necessary for an efficient, functional active perception system.
2:15 pm – 2:55 pm

In this talk, I'll present the challenges in today's deep learning approach for learning representations resilient against attacks. I will also explore the question of providing provable guarantees of generalization of a learned model. As a concrete example, I will present our recent work on using recursion to enable provably perfect generalization in the domain of neural program architectures.

2:55 pm – 3:35 pm

Languages synthesize, borrow, and coin new words. This observation is so uncontroversially robust that it is characterized by empirical laws (Zipf's and Heaps' Laws) about the distributions of words and word frequencies rather than by appeal to any particular linguistic theory. However, the first assumption made in most work on word representation learning and language modeling is that a language's vocabulary is fixed, with the (interesting!) long tail of forms replaced with an out-of-vocabulary token, <unk>. In this talk, I discuss the challenges of modeling the statistical facts of language more accurately, rather than the simplifying caricature of linguistic distributions that receives so much attention in the literature. I discuss existing models that relax the closed vocabulary assumption, how these perform, and how they still might be improved.

4:10 pm – 4:30 pm

A popular machine learning strategy is the transfer of a representation (i.e. a feature extraction function) learned on a source task to a target task. Examples include the re-use of neural network weights or word embeddings. Our work proposes sufficient conditions for the success of this approach. If the representation learned from the source task is fixed, we identify conditions on how the tasks relate to obtain an upper bound on target task risk via a VC dimension-based argument. We then consider using the representation from the source task to construct a prior, which is fine-tuned using target task data. We give a PAC-Bayes target task risk bound in this setting under suitable conditions. We show examples of our bounds using feedforward neural networks. Our results motivate a practical approach to weight transfer, which we validate with experiments.