Thursday, April 18th, 2019

From the Inside: Foundations of Data Science

by Shai Ben-David (University of Waterloo)

The endless generation of huge amounts of data has been changing the face of science in recent decades. While classical scientists devoted many of their resources to data collection, in more and more areas the data is readily available to be explored, and the focus of scientists is on how to best take advantage of this abundance.

Having innumerable practical applications, data science has given rise to many tools and heuristics. The focus of the Fall 2018 Simons Institute program on Foundations of Data Science was to examine and solidify its core foundations. We aimed to develop and analyze theoretical underpinnings and to expand the toolbox of reliable data science techniques.

Over the past couple of decades, data science has matured into a discipline of its own. It incorporates at least three traditional scientific areas: statistics, addressing the generality and validity of patterns detected in data samples; key mathematical tools, like linear algebra, which underlie many of the working paradigms of data science; and computer science — given the immense size of input data sets, one has to rely on computer algorithms to carry out the modeling, analysis and prediction. The development of efficient algorithmic tools for those tasks therefore plays a key role.

Accordingly, the Fall 2018 Simons Institute program on Foundations of Data Science brought together researchers from statistics, mathematics and computer science departments from top universities around the world. Not only was there a mix of different areas of expertise, but there was also great variety in terms of seniority, from world-class senior professors to young, energetic junior faculty, postdocs and PhD students.

The Simons Institute program was successfully designed to foster interactions and collaborations, and to forge new scientific connections. The researchers were assigned to small shared offices, all around a pleasantly inviting central lounge with chairs, desks, couches and whiteboards. There was a daily social afternoon tea bringing everybody together, and there were many talks in smaller research-topic-centered reading groups and seminars led by program participants and short-term visitors.

The program included three week-long workshops consisting of talks by program participants and invited speakers. Those workshops also included open problem sessions and concluding panel discussions. The first workshop was a "boot camp" on data science in which the program organizers, as well as some prominent visitors, provided extensive tutorials addressing the various technical issues that were at the center of the scientific program.

The second workshop focused on randomized numerical linear algebra. Linear algebra tools are essential to data modeling and analysis, but carrying out algebraic manipulations on large data sets is computationally expensive. Randomization and sampling help manage such computational challenges. The workshop highlighted recent state-of-the-art developments of such tools.
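To convey the flavor of these techniques, here is a small NumPy sketch of a randomized low-rank factorization: the matrix is first compressed with a random Gaussian test matrix, and the expensive decomposition is then performed on the much smaller sketch. This is an illustrative toy in the spirit of randomized range finders, not material from the workshop; the function name and parameter choices are my own.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=10, seed=0):
    """Sketch-based rank-k approximation of A (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # A random Gaussian test matrix compresses the column space of A.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega                       # tall, thin sketch of range(A)
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the sketch
    B = Q.T @ A                         # small matrix; its SVD is cheap
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_b
    return U[:, :k], s[:k], Vt[:k, :]

# An exactly rank-5 matrix: the sketch recovers it almost perfectly.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 300))
U, s, Vt = randomized_low_rank(A, k=5)
err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)
print(err)  # essentially zero for an exactly low-rank input
```

The point of the sketch is that the only operations touching the full matrix are two matrix products; the decomposition itself runs on matrices whose small dimension is `k + oversample`, independent of the input size.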

The third workshop focused on robust high-dimensional statistics. Many of the data sets encountered in applications consist of records with many attributes — in other words, high-dimensional vectors. Such data has special statistical characteristics that require techniques different from those that suffice for low-dimensional data. Furthermore, real-life data is often contaminated with noise and outliers. This workshop discussed recent mathematical innovations that allow successful handling of such challenges.
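To see why robustness matters even in the simplest setting, consider estimating the mean of a data set in which a tenth of the records are grossly corrupted. The NumPy toy below (my own illustration, not taken from the workshop) contrasts the sample mean, which the outliers drag far from the truth, with the coordinate-wise median, which barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
clean = rng.normal(loc=1.0, scale=1.0, size=(900, d))  # inliers around mean 1
outliers = np.full((100, d), 50.0)                     # 10% gross corruption
data = np.vstack([clean, outliers])

mean_est = data.mean(axis=0)          # pulled far toward the outliers
median_est = np.median(data, axis=0)  # resists the corruption

print(np.abs(mean_est - 1.0).max())    # large: the mean is badly biased
print(np.abs(median_est - 1.0).max())  # small: the median stays near 1
```

The coordinate-wise median is only the simplest robust estimator — in high dimensions its error can still grow with the dimension, which is precisely the kind of gap the recent work discussed at the workshop addresses.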

The last workshop addressed the issues of sublinear algorithms and nearest-neighbor search. With very large data sets, it is sometimes the case that even just reading the full data set requires unrealistically large computing resources. Sublinear algorithms address this hurdle by instead processing only a sample of the input data, chosen in a carefully designed way so as to guarantee the reliability of the conclusions drawn. Nearest-neighbor algorithms are another common set of tools that allow fast and efficient extraction of conclusions from formidably large data sets.

This program offered an exciting semester in which researchers from many different countries and institutions came to learn from each other and work together, furthering our understanding of data science. Without a doubt, collaborations and ideas that were initiated during that term will continue to bear fruit for some time to come.
