Abstract

Query evaluation is a fundamental problem in database systems: how to efficiently compute answers to questions specified in SQL. The join is the core operator, since many real-world analytical queries rely on it to link information stored in different tables. The need to process and analyze big data has invigorated this long-running research area with new challenges. For example, massively parallel data systems, such as MapReduce and Spark, have become effective tools for handling large volumes of data, and query evaluation algorithms in these systems must be designed to scale to thousands of machines running in parallel.

In this talk, I will focus on algorithm design in massively parallel systems for join queries, the most fundamental and practically important class of queries. I will discuss the intrinsic relationship between the structure of a join and its parallel computational cost. Beyond the homogeneous parallel model, I will also raise some new challenges that arise when the underlying communication model is heterogeneous. Lastly, I will briefly explore the area where massively parallel query processing intersects with dynamic query processing.

Video Recording