Playlist: 21 videos

Statistics in the Big Data Era

0:15:31
Jennifer Chayes (UC Berkeley) & Sandrine Dudoit (UC Berkeley)
https://simons.berkeley.edu/talks/opening-remarks-1
1:02:15
Ya’acov Ritov (University of Michigan)
https://simons.berkeley.edu/talks/testing-variations-theme-solo-and-choir

In this talk, we will try to present some aspects of Peter Bickel's contributions to statistics, spanning much of the era of modern statistics.
0:40:10
Harrison Zhou (Yale University)
https://simons.berkeley.edu/talks/bickels-influence-my-career

In this talk, I will discuss Professor Peter Bickel's influence on my career in three research areas: estimation of large covariance matrices, statistical network analysis, and variational Bayes.
0:53:56
Jasjeet Sekhon (Yale University)
https://simons.berkeley.edu/talks/debiasing-random-forests-treatment-effect-estimation
0:40:11
Jeff Wu (Georgia Institute of Technology)
https://simons.berkeley.edu/talks/bdrygp-boundary-integrated-gaussian-process-model-computer-code-emulation

With advances in mathematical modeling and computational methods, complex phenomena (e.g., universe formation, rocket propulsion) can now be reliably simulated via computer code. This code solves a complicated system of equations representing the underlying science of the problem. Such simulations can be very time-intensive, requiring months of computation for a single run. Gaussian processes (GPs) are widely used as predictive models for “emulating” this expensive computer code. Yet with limited training data on a high-dimensional parameter space, such models can suffer from poor predictive performance and poor interpretability. Fortunately, in many physical applications, there is additional boundary information on the code beforehand, either from governing physics or scientific knowledge. We propose a new BdryGP model which incorporates such boundary information for prediction. We show that BdryGP enjoys improved convergence rates over standard GP models which do not incorporate boundaries. We then demonstrate the improved predictive performance and posterior contraction of the BdryGP model on a suite of numerical experiments and a real-world application.
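The abstract's starting point, emulating an expensive simulator with a GP, can be sketched in a few lines. This is a minimal plain-GP emulator on a toy one-dimensional problem, not the BdryGP model itself; the kernel choice, lengthscale, and toy "simulator" are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(X_train, y_train, X_test, jitter=1e-6):
    """Posterior mean and variance of a zero-mean GP at test inputs."""
    K = rbf_kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    Ks = rbf_kernel(X_test, X_train)
    Kss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

# Toy "expensive simulator": in practice each run may take months,
# so only a handful of training evaluations are available.
f = lambda x: np.sin(2 * np.pi * x)
X_train = np.linspace(0, 1, 8)
y_train = f(X_train)
X_test = np.array([0.33, 0.71])
mean, var = gp_predict(X_train, y_train, X_test)
```

BdryGP would additionally constrain this posterior with known boundary behavior of the simulator; the sketch above is the unconstrained baseline it improves on.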
0:44:10
Dominik Rothenhaeusler (Stanford University)
https://simons.berkeley.edu/talks/calibrated-inference-statistical-inference-accounts-both-sampling-uncertainty-and

How can we draw trustworthy scientific conclusions? One criterion is that a study can be replicated by independent teams. While replication is critically important, it is arguably insufficient. If a study is biased for some reason and other studies recapitulate the approach, then findings might be consistently incorrect. It has been argued that trustworthy scientific conclusions require disparate sources of evidence. However, different methods might have shared biases, making it difficult to judge the trustworthiness of a result. We formalize this issue by introducing a "distributional uncertainty model", which captures biases in the data collection process. Distributional uncertainty is related to other concepts in statistics, ranging from correlated data to selection bias and confounding. We show that a stability analysis on a single data set allows one to construct confidence intervals that account for both sampling uncertainty and distributional uncertainty. We introduce an R package that allows one to draw data under distributional uncertainty and calibrate inference in (generalized) linear models. This is joint work with Yujin Jeong.
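The flavor of a stability analysis on a single data set can be crudely illustrated by refitting an estimator under random reweightings of the observations, a Bayesian-bootstrap-style perturbation of the empirical distribution. This sketch is an illustrative stand-in, not the calibrated procedure of the talk or its R package; the simulated model, the exponential weights, and the quantile interval are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a simple linear model: y = 2*x + noise.
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def ols_slope(x, y, w=None):
    """Weighted least-squares slope for a one-covariate, no-intercept model."""
    if w is None:
        w = np.ones_like(x)
    return np.sum(w * x * y) / np.sum(w * x * x)

# Stability analysis: refit under random exponential reweightings of the
# observations, a crude stand-in for perturbing the data distribution.
slopes = [ols_slope(x, y, rng.exponential(size=n)) for _ in range(200)]
lo, hi = np.quantile(slopes, [0.025, 0.975])
```

The spread of `slopes` reflects how sensitive the fit is to plausible perturbations of the sampling distribution, on top of ordinary sampling noise.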
0:40:01
Boaz Nadler (Weizmann Institute of Science)
https://simons.berkeley.edu/talks/trimmed-lasso-sparse-recovery-guarantees-and-practical-optimization

Consider the sparse approximation or best subset selection problem: Given a vector y and a matrix A, find a k-sparse vector x that minimizes the residual ||Ax-y||. This sparse linear regression problem, and related variants, plays a key role in high dimensional statistics, compressed sensing, and more. In this talk we focus on the trimmed lasso penalty, defined as the L_1 norm of x minus the L_1 norm of its top k entries in absolute value. We advocate using this penalty by deriving sparse recovery guarantees for it, and by presenting a practical approach to optimize it. Our computational approach is based on the generalized soft-min penalty, a smooth surrogate that takes into account all possible k-sparse patterns. We derive a polynomial time algorithm to compute it, which in turn yields a novel method for the best subset selection problem. Numerical simulations illustrate its competitive performance compared to current state of the art.
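The trimmed lasso penalty as defined above — the L_1 norm of x minus the L_1 norm of its k largest-magnitude entries — is just the sum of the remaining absolute entries, and vanishes exactly when x is k-sparse. A minimal sketch (the function name is ours):

```python
import numpy as np

def trimmed_lasso(x, k):
    """Trimmed lasso penalty: ||x||_1 minus the L1 norm of the k
    largest-magnitude entries of x, i.e. the sum of all remaining |x_i|.
    It equals zero if and only if x has at most k nonzero entries."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))  # ascending by magnitude
    return a[:-k].sum() if k > 0 else a.sum()

print(trimmed_lasso([3.0, -0.5, 0.0, 2.0], k=2))  # 0.5: only -0.5 lies outside the top 2
```

Optimizing this penalty directly is hard because the top-k set changes with x; the generalized soft-min surrogate described in the abstract smooths over all k-sparse patterns instead.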
0:42:30
Rachel Wang (University of Sydney)
https://simons.berkeley.edu/talks/global-community-detection-using-individual-centered-partial-information-networks

In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length $L$ and gives rise to a partial adjacency matrix. Under $L=2$, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive general properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual's partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure.
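A toy version of the partial adjacency matrix can be sketched as follows, under one plausible reading of the $L=2$ view: the centered individual observes exactly the edges incident to itself or to one of its neighbors. The function name and example graph are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def partial_adjacency(A, center):
    """Zero out every edge not visible from `center` under an L = 2 view.

    Assumption: the individual observes all edges incident to itself
    or to one of its immediate neighbors (its own ties, plus its
    neighbors' ties), and nothing beyond that.
    """
    A = np.asarray(A)
    n = A.shape[0]
    visible = {center} | {j for j in range(n) if A[center, j]}
    P = np.zeros_like(A)
    for i in visible:  # keep rows/columns of the center and its neighbors
        P[i, :] = A[i, :]
        P[:, i] = A[:, i]
    return P

# 5-node path graph 0-1-2-3-4, viewed from node 0: edges (2,3) and (3,4)
# lie beyond the L = 2 horizon and disappear from the partial matrix.
A = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1
P = partial_adjacency(A, center=0)
```

Spectral community detection would then be run on `P` (or its signal part) rather than on the full adjacency matrix `A`.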
0:38:25
Jing Lei (Carnegie Mellon University)
https://simons.berkeley.edu/talks/bias-adjusted-spectral-clustering-multi-layer-stochastic-block-models
0:42:16
Nancy Zhang (University of Pennsylvania)
https://simons.berkeley.edu/talks/dna-copy-number-profiling-bulk-tissue-single-cells

Cancer evolution is driven by both somatic copy number aberrations (CNAs) and chromatin remodeling, yet little is known about the interplay between these two classes of events in shaping the clonal diversity of cancers. In this talk I will describe how single-cell DNA and ATAC sequencing can contribute to our understanding of cancer evolution. I will describe Alleloscope, a method for allele-specific copy number estimation that can be applied to single-cell DNA and/or ATAC sequencing data, enabling combined analysis of allele-specific copy number and chromatin accessibility. On scDNA-seq data from gastric, colorectal, and breast cancer samples, with validation using matched linked-read sequencing, Alleloscope finds pervasive occurrence of highly complex, multi-allelic copy number aberrations, where cells carrying different allelic configurations that add up to the same total copy number co-evolve within a tumor. These types of haplotype-specific “mirrored” subclonal events have been underreported due to a lack of methods for their detection, and their role in cancer evolution is not yet clear. On scATAC-seq data from two basal cell carcinoma samples and a gastric cancer cell line, Alleloscope detects multi-allelic copy number events and copy-neutral loss of heterozygosity, enabling dissection of the contributions of chromosomal instability and chromatin remodeling to tumor evolution.