The rapid progress in technologies to automatically collect genetic and phenotypic information on living systems at all scales (from molecules to cells, to organisms, to ecosystems) offers a great opportunity to understand life at an unprecedented level of detail. Extracting meaningful and reliable biological information from the resulting datasets, which keep growing in both size and complexity (e.g., dependence structure, technical noise, sparsity), poses great computational and statistical challenges. Some of these challenges arise from:

  1. The increasing capacity, throughput, and read length of deep sequencing technologies (e.g., Illumina, Nanopore, 10x Genomics, Pacific Biosciences) for bulk and single-cell DNA and RNA.
  2. The launching of very large-scale projects to describe the many dimensions of biological diversity at the molecular level. These include, among others:

    1. The Human Cell Atlas (HCA), aiming to monitor the RNA content of all cells in the human body (estimated at about 10^13) and to identify all distinct cell types.
    2. The Earth BioGenome Project (EBP), aiming to sequence the genomes of all eukaryotic species living on Earth (estimated to be about 1.5 million).
    3. Large scale metagenomics projects, monitoring microbial diversity in natural (Tara Oceans) and urban (MetaSub) environments, as well as in the human body (Human Microbiome Project, HMP).

  3. Functional genomics, which uses deep sequencing to annotate genomic regions with their biological function (methylation, histone modifications, transcription factor binding, etc.).
  4. Other -omics technologies, such as proteomics, metabolomics, lipidomics, etc.
  5. Genome editing technologies (CRISPR screens), which allow for large-scale genomic perturbations, followed by phenotypic or molecular assays of the cellular response.
  6. Advanced genomic imaging, allowing in vivo monitoring of genome activity (transcription, translation, etc.) within the cells, as well as the 3D organization of individual cells in organs (spatial transcriptomics).
  7. Cohort-based studies, which aim to analyze -omics data together with phenotypic information, either from medical records or (dynamically and continuously) collected through electronic recording devices, for thousands to millions of individuals (from GTEx and UK Biobank to national precision medicine projects). Data may include, among others, medical annotations, high-resolution imaging (neuroimaging, X-ray imaging), histopathologies, and longitudinal measurements of physiological variables (heart rate, body temperature, physical activity), many of which may be collected autonomously.

Data produced by these projects call for methods that have been studied extensively in the bioinformatics literature, such as (multiple) sequence alignment, motif (gene) finding, (meta)genome assembly, and phylogenetic reconstruction. However, existing methods are unlikely to scale properly, and fundamental computational problems persist in adapting the methods and algorithms to unprecedented data volumes. New data types (imaging, longitudinal recordings, etc.) and subject-matter questions will require the development of novel methods and algorithms. Many of these will build on sequence analysis, classical statistics, and, in particular, the recent success of machine learning and deep learning methods.

At the workshop, we will discuss these algorithms and methods, as well as new ways to work with the data, and applications to specific domains. Finally, we will deliberate on the ethical issues involved in generating and working with such data; in particular, how these data can be used in a nondiscriminatory fashion and for the benefit of all.

Invited Participants

Steven Brenner (UCB), Soren Brunak (University of Copenhagen), Carlos Bustamante (Stanford University), Rayan Chikhi (Institut Pasteur), Ana Conesa (University of Florida), Sandrine Dudoit (UC Berkeley), Jasmin Fisher (UCL), Roderic Guigó (CRG - Center for Genomic Regulation), Eran Halperin (University of California, Los Angeles), Ian Holmes (UC Berkeley), Haiyan Huang (UC Berkeley), Kazutaka Katoh (Osaka University), Sunduz Keles (University of Wisconsin Madison), Harris Lewin (UC Davis), Jingyi Jessica Li (University of California, Los Angeles), Alejandra Medina-Rivera (International Laboratory for Human Genome Research), Priya Moorjani (UC Berkeley), Ali Mortazavi (University of Victoria), Lior Pachter (California Institute of Technology), Katherine Pollard (UCSF), Elizabeth Purdom (UC Berkeley), Tim Reddy (Duke), Davide Risso (UCB), Yana Safonova (Johns Hopkins University), Ron Shamir (Tel-Aviv University), Jay Shin (RIKEN), Meromit Singer (Harvard University), Oliver Stegle (EMBL Heidelberg), Hagen Tilgner (Weill Cornell Medicine), Jingshen Wang (UC Berkeley), Tandy Warnow (University of Illinois at Urbana–Champaign)