0:27:36
Steven Brenner (UC Berkeley)
https://simons.berkeley.edu/talks/biological-discovery-and-consumer-genomics-databases-activate-latent-privacy-risk-functional
Computational Challenges in Very Large-Scale 'Omics'

The privacy risks from individuals’ genomes have garnered increasing attention. Recent research studies and forensics have underscored the ability to re-identify a person using genomically identified relatives and quasi-identifiers, such as sex, birthdate, and zip code. However, summary omics data, such as gene expression values and DNA methylation sites, are generally treated as safe to share, with low privacy risks, even though research studies have indicated that they could be linked to existing genomes. We have demonstrated that some types of summary omics data can be accurately linked to a unique genome. We developed methods to match such data against genotypes in consumer genealogy databases using only the restricted tools those databases provide. Thus, the theoretical privacy concerns regarding summary omics data are now practically relevant. The ability to link sets of quasi-identifiers can reveal a research participant’s identity and protected health information. Most importantly, such risks increase over time, activated by new techniques, new knowledge, and new databases. Thus, public omics data may become privacy time bombs: safe at the time of distribution, but increasingly likely to compromise personal information. The need to preserve individuals’ genomic privacy for their lifetime and beyond (for descendants and relatives) poses unique challenges to the effective sharing of high-throughput molecular data.
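To make the quasi-identifier risk concrete, the following is a minimal Python sketch that measures how many records in a toy dataset are unique on the combination of sex, birthdate, and zip code. The records, field names, and the function `unique_fraction` are hypothetical and invented for illustration; this is not the linking method described in the talk.

```python
from collections import Counter

def unique_fraction(records, keys=("sex", "birthdate", "zip")):
    """Return the fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    # Build one tuple of quasi-identifier values per record.
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    # A record is re-identifiable on these keys if no other record shares its tuple.
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records)

# Hypothetical toy records (not real data).
records = [
    {"sex": "F", "birthdate": "1980-01-02", "zip": "94720"},
    {"sex": "F", "birthdate": "1980-01-02", "zip": "94720"},
    {"sex": "M", "birthdate": "1975-06-30", "zip": "94720"},
    {"sex": "F", "birthdate": "1991-11-11", "zip": "94110"},
]
print(unique_fraction(records))  # two of the four records are unique: 0.5
```

Each additional linked attribute can only keep this fraction the same or raise it, which is one way to read the claim that risk grows as new databases appear.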
0:32:15
Carlos Bustamante (Galatea Bio, Stanford, University of Miami)
https://simons.berkeley.edu/talks/how-do-we-deliver-precision-health-scale-all
0:34:25
Søren Brunak (Novo Nordisk Foundation Center for Protein Research)
https://simons.berkeley.edu/talks/tbd-458

Multi-step disease and prescription trajectories are key to understanding human disease progression patterns and their underlying molecular-level etiologies. The number of human protein-coding genes is small, and many genes presumably affect more than one disease, a fact that complicates the process of identifying actionable variation for use in precision medicine efforts. We present approaches to the identification of frequent disease and prescription trajectories from population-wide healthcare data comprising millions of patients, and corresponding strategies for linking disease co-occurrences to genomic individuality. In this work, we carry out temporal analysis of clinical data in a life-course-oriented fashion. We use data covering 7-10 million patients from Denmark collected over a 20-40 year period and use them to “condense” millions of individual trajectories into a smaller set of recurrent ones. Such sets represent patient subgroups sharing longitudinal phenotypes that could form a basis for differential treatment designs of relevance to individual patients. Individual disease and prescription trajectories can also be used as input to machine learning approaches for risk and time-to-event prediction.
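As a rough illustration of what "condensing" individual trajectories into recurrent ones can mean, the Python sketch below counts ordered diagnosis pairs (A first diagnosed before B) across patients and keeps the pairs seen in at least a minimum number of patients. This is a simplified stand-in under stated assumptions, not the method used in the talk: the function name, the `min_patients` threshold, and the toy patient records are all hypothetical, and a real analysis would also need statistical testing of each pair.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(trajectories, min_patients=2):
    """Count ordered diagnosis pairs per patient and keep the frequent ones.

    trajectories: one list per patient of (iso_date, diagnosis_code) tuples.
    """
    pair_counts = Counter()
    for diagnoses in trajectories:
        # Keep only the first date on which each code appears.
        first_seen = {}
        for date, code in sorted(diagnoses):
            first_seen.setdefault(code, date)
        ordered = sorted(first_seen, key=first_seen.get)
        # Each ordered pair is counted at most once per patient.
        seen = set()
        for a, b in combinations(ordered, 2):
            if first_seen[a] < first_seen[b]:  # skip same-day ties
                seen.add((a, b))
        pair_counts.update(seen)
    return {pair: n for pair, n in pair_counts.items() if n >= min_patients}

# Hypothetical toy patients (ISO dates sort correctly as strings).
patients = [
    [("2001-03-01", "E11"), ("2005-07-15", "I10"), ("2010-01-20", "N18")],
    [("1999-05-05", "E11"), ("2003-02-02", "I10")],
    [("2002-08-08", "I10"), ("2004-04-04", "E11")],
]
print(frequent_ordered_pairs(patients))  # {('E11', 'I10'): 2}
```

Longer trajectories could be assembled by chaining frequent pairs that share a diagnosis, which is left out here to keep the sketch short.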
0:31:16
Timothy Reddy (Duke University)
https://simons.berkeley.edu/talks/towards-making-identification-noncoding-causes-human-disease-routine

Genetic variation outside of protein-coding genes is a major driver of human disease. Revealing the underlying mechanisms through which such non-coding variants act has enormous potential to benefit human health. The long-term goal of my research is to make identifying those mechanisms routine through the development and application of new technologies and analyses to study the function of non-coding variants. I will present a combination of case studies that highlight the promise of identifying non-coding genetic contributions to human disease, and new high-throughput technologies that can rapidly advance progress on other diseases in the future.
0:32:08
Tandy Warnow (University of Illinois Urbana-Champaign)
https://simons.berkeley.edu/talks/new-approaches-phylogenetic-species-tree-estimation
0:28:01
Ian Holmes (UC Berkeley)
https://simons.berkeley.edu/talks/nanopore-basecalling-directed-evolution

Directed evolution offers a way to measure the "fitness" of large numbers of synthetic sequences in parallel. Nanopore and other long-read sequencing technologies have the potential to probe distant epistatic interactions in such datasets, as well as to resolve other genomic questions of interest in cancer, evolution, and other areas of biology. I will describe our group's work in basecalling, analysis, and visualization of nanopore sequencing data, with an emphasis on synthetic biology applications.
0:28:46
Jingshen Wang (UC Berkeley)
https://simons.berkeley.edu/talks/breaking-winners-curse-mendelian-randomization-rerandomized-inverse-variance-weighted

Developments in genome-wide association studies and the increasing availability of summary genetic association data have made the application of two-sample Mendelian Randomization (MR) with summary data increasingly popular. Conventional two-sample MR methods often employ the same sample for selecting relevant genetic variants and for constructing final causal estimates. Such a practice often leads to biased causal effect estimates due to the well-known "winner's curse" phenomenon. To address this fundamental challenge, we first examine the consequence of the winner's curse on causal effect estimation both theoretically and empirically. We then propose a novel framework that systematically breaks the winner's curse, leading to unbiased association effect estimates for the selected genetic variants. Building upon the proposed framework, we introduce a novel rerandomized inverse variance weighted (RIVW) estimator that is consistent when selection and parameter estimation are conducted on the same sample. Under appropriate conditions, we show that the proposed RIVW estimator for the causal effect converges to a normal distribution asymptotically and that its variance can be well estimated. We illustrate the finite-sample performance of our approach through Monte Carlo experiments and two empirical examples.
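For orientation, the Python sketch below shows the conventional inverse variance weighted (IVW) estimator that two-sample summary-data MR commonly starts from. It is not the RIVW estimator proposed in the talk and includes no correction for the winner's curse. The function name and the toy summary statistics are hypothetical; the inputs are assumed to be per-variant SNP-exposure effects, SNP-outcome effects, and outcome standard errors.

```python
def ivw_estimate(exposure_effects, outcome_effects, outcome_ses):
    """Conventional IVW causal effect estimate from summary statistics.

    Computes sum(g * G / s^2) / sum(g^2 / s^2), where g is the SNP-exposure
    effect, G the SNP-outcome effect, and s the outcome standard error.
    """
    numerator = sum(
        g * G / s**2
        for g, G, s in zip(exposure_effects, outcome_effects, outcome_ses)
    )
    denominator = sum(
        g**2 / s**2 for g, s in zip(exposure_effects, outcome_ses)
    )
    return numerator / denominator

# Hypothetical toy summary statistics in which every outcome effect is
# exactly half of the corresponding exposure effect.
exposure = [0.10, 0.20, -0.15]
outcome = [0.05, 0.10, -0.075]
ses = [0.01, 0.02, 0.015]
print(ivw_estimate(exposure, outcome, ses))  # approximately 0.5
```

When the variants passed to such an estimator are chosen by thresholding the same exposure estimates, the selected effects are distorted by the selection step, which is the bias the abstract sets out to remove.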
0:43:30
Katie Pollard (Gladstone Institute of Data Science & Biotechnology)
Computational Challenges in a Densely Sequenced Tree of Life

Genome sequencing and assembly have exploded since 2015. Today, many lineages contain closely related species, as well as species with multiple diverse genome sequences. Having more genomes seems like a good thing for studying ecology and evolution across the tree of life. However, the workhorse algorithm for genomic studies, sequence alignment, is breaking down in terms of both computational efficiency and accuracy. We explore these issues using metagenomic applications in which microbial communities are sequenced as a pool and alignment is used to map reads to the correct species and genomic site before downstream bioinformatics applications such as abundance estimation and genotyping. We quantify alignment errors and computational barriers across a broad range of scenarios, including lineages in which a commonly used operational definition of the species boundary (greater than 95% average nucleotide identity) is blurred. Then, we propose several actionable and aspirational solutions to problems such as genome redundancy, reference bias, and cross-mapping. This work demonstrates that efficient algorithms and data structures are essential to maintain access to genomic and metagenomic data science for researchers without massive high-performance computing resources and to ensure read mapping is accurate on a densely sequenced tree of life.
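The operational species boundary mentioned above can be sketched in a few lines of Python. This is a simplified illustration that assumes the genome fragments have already been paired and aligned to equal length; real ANI tools perform the fragment matching and alignment themselves. The function names and toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of ungapped aligned columns at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where either sequence has a gap character.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

def average_nucleotide_identity(fragment_pairs):
    """Mean percent identity over pre-aligned fragment pairs."""
    identities = [percent_identity(a, b) for a, b in fragment_pairs]
    return sum(identities) / len(identities)

def same_species(fragment_pairs, threshold=95.0):
    """Apply the 'greater than 95% ANI' operational species definition."""
    return average_nucleotide_identity(fragment_pairs) > threshold

# Hypothetical toy fragments: one identical pair, one pair with a mismatch.
pairs = [
    ("ACGTACGTAC", "ACGTACGTAC"),
    ("ACGTACGTAC", "ACGTACGTAA"),
]
print(average_nucleotide_identity(pairs))  # 95.0
print(same_species(pairs))  # False, since 95.0 is not greater than 95.0
```

The toy pair sits exactly on the threshold, which illustrates how a hard cutoff becomes ambiguous for lineages near the boundary.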