Data arising from experimental, observational, and simulational processes in the natural and social sciences, as well as in industrial applications and other domains, have created enormous opportunities for understanding the world we live in. The pursuit of such understanding requires the development of systems and techniques for processing and analyzing data, falling under the general term "data science." Data science is a blend of old and new. Much of the old involves ideas and techniques developed in existing methodological and application domains, and much of the new is being developed in response to new technologies that generate vast quantities of data.

This program brought together researchers working on algorithmic, mathematical, and statistical aspects of modern data science, with the aim of identifying a set of core techniques and principles that form a foundation for the subject. While the foundations of data science lie at the intersection of computer science, statistics, and applied mathematics, each of these disciplines in turn developed in response to particular long-standing problems. Building a foundation for modern data science requires rethinking not only how those three research areas interact with data, implementations, and applications, but also how each of the areas interacts with the others. For example, differing applications in computer science and scientific computing have led to different formalizations of appropriate models, questions to consider, computational environments (such as single machine vs. distributed data centers vs. supercomputers), and so on. Similarly, business, Internet, and social media applications tend to have certain design requirements and generate certain types of questions, and these tend to be very different from those that arise in scientific and medical applications. Alongside these differences, however, there are also many similarities among these areas. Developing the theoretical foundations of data science requires paying appropriate attention to the questions and issues of the domain scientists who generate and use the data, as well as to the computational environments and platforms supporting this work.
Our emphasis was on such topics as dimensionality reduction, randomized numerical linear algebra, optimization, probability in high dimensions, sparse recovery, statistics (including inference and causality), and streaming and sublinear algorithms, as well as a variety of application areas that can benefit from these fields and other techniques for processing massive data sets. Each of these related areas has received attention from a diverse set of research communities, and an important goal for us was to explore and strengthen connections between methods and problems in these areas, discover new perspectives on old problems, and foster interactions among different research communities that address similar problems from quite different perspectives.

This program was supported in part by the Kavli Foundation and the Patrick J. McGovern Foundation.



Michael Mahoney (International Computer Science Institute and UC Berkeley)

Long-Term Participants (including Organizers)

Michael Mahoney (International Computer Science Institute and UC Berkeley)
Mike Luby (International Computer Science Institute)
Eric Price (University of Texas at Austin; Google Research Fellow)

Research Fellows

Gautam Kamath (Massachusetts Institute of Technology; Microsoft Research Fellow)
Rajiv Khanna (University of Texas at Austin; Patrick J. McGovern Research Fellow)
Jerry Li (Microsoft Research; VMware Research Fellow)
Marco Mondelli (Stanford University; Patrick J. McGovern Research Fellow)
Yan Shuo Tan (University of Michigan; Patrick J. McGovern Research Fellow)

Visiting Graduate Students and Postdocs

Jason Li (Simons Institute, UC Berkeley)
Fei Shi (Carnegie Mellon University)