by Ahmed El Alaoui (Richard M. Karp Research Fellow, Simons Institute)
Computation involving large amounts of noisy, complex data is a core primitive of modern science and technology. For instance, experimental scientists in many fields gather large amounts of data and wish to test increasingly complex hypotheses about what the data entail and what hidden patterns underlie them, all in a time-efficient way.
Theoreticians and practitioners alike might ask the following questions: How can we understand the structure of a particular data set? How can we model the various sources of noise that corrupt it? How can we design robust algorithms to process these data and return accurate answers within an allowed budget of resources, be it runtime, memory, communication, or privacy? And finally, can we identify and then rigorously analyze all the trade-offs involved, and build a mathematical theory for understanding this complex puzzle?
A great deal of effort has been put into the mathematical understanding of each piece of the puzzle, and more often than not, progress has been made in traditionally separate areas. An oversimplified view of this classical separation is as follows.
Understanding the mechanisms generating the data and modeling the noise that corrupts the data are the realm of statistics and probability theory. The design of algorithms processing the data and the analysis of their runtime are the specialty of computer science. And quantifying the information content of a particular data set, or, conversely, determining how noisy a data set can be before it turns into complete junk, is studied within information theory.
In the past 20 years, as data sets have grown and our computers have become more powerful, an exciting trend has emerged in which the integration of these areas is pushed forward by the need to explore the precise trade-offs between the aforementioned desiderata and constraints. Two important discoveries emerging from this integration are:
- The existence of sudden phase transitions affecting the solvability of a given computational task as a key parameter of the problem (say, the strength of the signal) is varied.
- The ubiquity of statistical-computational gaps, where a data set contains a faint amount of information that can be extracted in principle if one is allowed unbounded computational power, but all known time-efficient algorithms fail miserably at the task.
Such phenomena are inherently high dimensional in that they appear only when the number of variables involved in the problem is large and, in the limit, grows without bound.
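To make the first phenomenon concrete, a classical example (my own illustration, not one drawn from the program; the function names and parameter choices below are hypothetical) is the connectivity of the Erdős–Rényi random graph G(n, p), which switches abruptly from disconnected to connected as the edge probability p crosses (log n)/n. A short simulation exhibits the sharp threshold:

```python
import math
import random


def is_connected(n, p, rng):
    """Sample an Erdos-Renyi graph G(n, p) and test connectivity via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # include edge (i, j) independently
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj
                    components -= 1
    return components == 1


def connectivity_fraction(n, c, trials=30, seed=0):
    """Fraction of G(n, p) samples that are connected at p = c * ln(n) / n."""
    rng = random.Random(seed)
    p = c * math.log(n) / n
    return sum(is_connected(n, p, rng) for _ in range(trials)) / trials
```

Comparing `connectivity_fraction(300, 0.5)` with `connectivity_fraction(300, 2.0)` shows the fraction of connected samples jumping from near 0 to near 1 as the constant c crosses 1: a sudden change in solvability driven by a single parameter of the problem.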
High dimensionality is traditionally associated with computational intractability: more variables mean exponentially more possible combinations, and this is typically seen as a curse. High dimensions also come with an unexpected blessing called concentration of measure: a surprising geometric phenomenon restoring smoothness and regularity in the unruly combinatorial mess and having rich probabilistic interpretations. Perhaps the most concise description of concentration of measure is the following one-liner by Michel Talagrand (1996): “A random variable that depends (in a ‘smooth’ way) on the influence of many independent random variables (but not too much on any of them) is essentially constant.”
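Talagrand's one-liner can be checked numerically. In this toy sketch (again my own illustration; the helper name is hypothetical), the average of n independent Uniform(0, 1) variables depends smoothly on its inputs, each of which has influence only 1/n, and its empirical spread shrinks like 1/√n:

```python
import math
import random


def empirical_spread(n, samples=1000, seed=0):
    """Empirical standard deviation of the mean of n i.i.d. Uniform(0, 1) variables.

    The mean is a smooth function of many independent inputs, none of which
    dominates, so concentration predicts a spread of order 1/sqrt(n).
    """
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(samples)]
    mu = sum(means) / samples
    return math.sqrt(sum((m - mu) ** 2 for m in means) / samples)
```

For n = 100 the spread is about 0.03 and for n = 2500 about 0.006, matching the theoretical value 1/√(12n): the random variable is, in Talagrand's words, "essentially constant" once n is large.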
The objective of the current Simons Institute program on Probability, Geometry, and Computation in High Dimensions is the precise study of these high-dimensional phenomena and the ways in which they manifest themselves geometrically, probabilistically, and algorithmically. The program brings together mathematicians, statisticians, computer scientists, and physicists to exchange ideas and methods developed in their respective fields and facilitates collaborations on this broad theme.
The program started with a boot camp of five mini-courses bringing everyone up to speed on a handful of topics that are currently the subject of active research.
The first workshop, Computational Phase Transitions, dove into the connection between the properties of the geometry, or landscape, of the set of solutions to randomized computational problems (e.g., combinatorial optimization on random graphs, random constraint satisfaction problems, and high-dimensional parameter estimation problems) and the performance of known classes of algorithms aimed at solving them. These algorithms notably include the “sum-of-squares” hierarchy; message-passing algorithms; and Langevin and Glauber dynamics, two classes of Markov chains. The results presented during this workshop featured a number of impressive “positive” results where new algorithms were shown to push the boundary of what is solvable in polynomial time, as well as a number of elegant and robust techniques for proving lower bounds on the performance of large classes of algorithms for certain tasks.
The second workshop, Concentration of Measure Phenomena, focused on an in-depth study of the various incarnations of concentration and the regularity it brings, together with its algorithmic consequences. Several talks covered functional and isoperimetric inequalities and their use in proving moment bounds, central limit theorems, and concentration of various observables of large interacting particle systems. A central property, which appeared as a key ingredient in several results, was the log-concavity of probability measures, from which fast convergence to equilibrium of Markov chains can be deduced in various settings.
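As a minimal illustration of that last point (my own sketch, not code from the workshop), the unadjusted Langevin dynamics x_{k+1} = x_k − h∇U(x_k) + √(2h)·ξ_k targets a density proportional to exp(−U(x)); when U is convex, i.e., the target is log-concave, the chain equilibrates quickly. Here it is run on the standard Gaussian, U(x) = x²/2:

```python
import math
import random


def langevin_sample(grad_U, h=0.01, steps=50000, burn_in=5000, seed=0):
    """Unadjusted Langevin dynamics targeting (approximately) exp(-U(x)).

    For log-concave targets (U convex) the drift -grad_U is contractive and
    the chain mixes rapidly; the discretization introduces an O(h) bias.
    """
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for k in range(steps):
        x = x - h * grad_U(x) + math.sqrt(2 * h) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(x)
    return samples
```

With `grad_U = lambda x: x` (the standard Gaussian), the empirical mean and variance of the trajectory settle near 0 and 1, up to the O(h) discretization bias. The contraction supplied by convexity of U is precisely the mechanism behind the fast-convergence results discussed at the workshop.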
The third and final workshop, Learning and Testing in High Dimensions, will close the loop by studying statistical testing and estimation problems where large statistical-computational gaps appear in high dimensions. The workshop aims to bring learning theorists together with probabilists and geometers, and to facilitate a transfer of methods in order to make progress on some of the outstanding problems in this area.
In addition to the workshops, the research fellows organized three reading groups with weekly meetings, one on stochastic localization and its applications to geometry and some statistical physics problems, one on statistical-computational gaps and average-case reductions between statistical problems, and one on information-theoretic and computational aspects of learning and testing problems.
Amid the pandemic, the whole program is running entirely online, and the organizers put out a call for six “polymath projects” aimed at energizing online collaboration among long-term participants. All projects have seen sustained participation throughout the semester and appear to be making steady progress.