Abstract

Finding similar items in a dataset is a basic problem in many fields, including information retrieval, natural language processing, and machine learning. In theoretical computer science, much algorithmic work assumes a distance function on instances, or even that the instances are vectors in R^d and distance is defined using some l_p metric.

But in most settings the data rarely arrives as a vector in R^d: for example, treating an image as a vector of pixels is not a very natural representation. Instead, much applied work is devoted to finding a suitable vector representation that captures the semantic "content." For instance, Hinton and Salakhutdinov (2007) show how to compute semantic hashing of text using Restricted Boltzmann Machines (RBMs). More recent work uses neural nets and matrix factorization to define sentence and paragraph embeddings.

The talk will cover recent work that gives a theoretical analysis of such nonlinear, nonconvex methods and of the kind of semantics they uncover, with new insight into low-dimensional word embeddings, sentence/paragraph embeddings, and content recommendation.

(Joint works: [A., Li, Liang, Ma, Risteski], [A., Li, and Ma], and [A. and Risteski])