Summer 2022

Computational Challenges in Very Large-Scale "Omics"

Jul 18, 2022 to Jul 21, 2022 


Organizers: Roderic Guigo Serra (Center for Genomic Regulation; chair), Steven Brenner (UC Berkeley & UCSF), Sandrine Dudoit (UC Berkeley), Tal Pupko (Tel Aviv University), Ron Shamir (Tel Aviv University)

List of participants (tentative list, including organizers):
Brenda Andrews (University of Toronto), Elhanan Borenstein (Tel Aviv University), Steven Brenner (UC Berkeley & UCSF), Soren Brunak (University of Copenhagen), Ana Conesa (University of Florida), Alexis Dobin (Cold Spring Harbor Laboratory), Sandrine Dudoit (UC Berkeley), Guillaume Filion (CRG - Center for Genomic Regulation), Jasmin Fisher (UCL), Roderic Guigo Serra (Center for Genomic Regulation), David Haussler (UC Santa Cruz), Desmond Higgins (Conway Institute), Ian Holmes (UC Berkeley), Haiyan Huang (UC Berkeley), John Huelsenbeck (UC Berkeley), Rachel Karchin (Johns Hopkins), Kazutaka Katoh (Osaka University), Harris Lewin (UC Davis), Priya Moorjani (UC Berkeley), Rasmus Nielsen (UC Berkeley), Cedric Notredame (CRG Barcelona), Lior Pachter (California Institute of Technology), Pavel Pevzner (UC San Diego), Katherine Pollard (UCSF), Jonathan Pritchard (Stanford University), Tal Pupko (Tel Aviv University), Elizabeth Purdom (UC Berkeley), Ron Shamir (Tel Aviv University), Meromit Singer (Harvard University), Virginie Uhlmann (EMBL - EBI), Alfonso Valencia Herrera (Barcelona Supercomputing Center), Martin Vingron (Max Planck Institute for Molecular Genetics), Tandy Warnow (University of Illinois at Urbana–Champaign), Nir Yosef (UC Berkeley)

Rapid progress in technologies that automatically collect genetic and phenotypic information on living systems at all scales (from molecules to cells, to organisms, to ecosystems) offers a great opportunity to understand life at an unprecedented level of detail. Extracting meaningful and reliable biological information from the resulting datasets, which keep growing in both size and complexity (e.g., dependence structure, technical noise, sparsity), poses great computational and statistical challenges. Some of these challenges arise from:

  1. The increasing capacity, throughput, and read length of deep sequencing technologies (e.g., Illumina, Nanopore, 10x Genomics, Pacific Biosciences) for bulk and single-cell DNA and RNA. 
  2. The launching of very large-scale projects to describe the many dimensions of biological diversity at the molecular level. These include, among others:
    • The Human Cell Atlas (HCA), aiming to monitor the RNA content of all cells in the human body (estimated at about 10^13) and to identify all distinct cell types;
    • The Earth BioGenome Project (EBP), aiming to sequence the genomes of all eukaryotic species living on Earth (estimated at about 1.5 million);
    • Large-scale metagenomics projects, monitoring microbial diversity in natural (Tara Oceans) and urban (MetaSUB) environments, as well as in the human body (Human Microbiome Project, HMP).
  3. Functional genomics, which uses deep sequencing to annotate genomic regions with their biological function (methylation, histone modifications, transcription factor binding, etc.).
  4. Other -omics technologies, such as proteomics, metabolomics, lipidomics, etc. 
  5. Genome editing technologies (CRISPR screens), which allow for large-scale genomic perturbations, followed by phenotypic or molecular assays of the cellular response.
  6. Advanced genomic imaging, allowing in vivo monitoring of genome activity (transcription, translation, etc.) within cells, as well as the 3D organization of individual cells in organs (spatial transcriptomics).
  7. Cohort-based studies, which aim to analyze -omics data together with phenotypic information, either from medical records or (dynamically and continuously) collected through electronic recording devices, for thousands to millions of individuals (from GTEx and UK Biobank to national precision medicine projects).  Data may include, among others, medical annotations, high-resolution imaging (neuroimaging, X-ray imaging), histopathologies, and longitudinal measurements of physiological variables (heart rate, body temperature, physical activity), many of which may be collected autonomously. 

Data produced by these projects call for methods that have been studied extensively in the bioinformatics literature, such as (multiple) sequence alignment, motif (gene) finding, (meta)genome assembly, and phylogenetic reconstruction. However, existing methods are unlikely to scale properly, and fundamental computational problems persist in adapting these methods and algorithms to unprecedented data volumes. Other new data types (imaging, longitudinal recordings, etc.) and subject-matter questions will require the development of novel methods and algorithms. Many of these will build on sequence analysis, classical statistics, and recent machine learning and deep learning methods.

This workshop will discuss these algorithms and methods, as well as new ways to work with the data, and applications to specific domains. It will also deliberate over the ethical issues involved in generating and working with such data; in particular, how these data can be used in a nondiscriminatory fashion, and for the benefit of all. 

Further details about this workshop will be posted in due course. 

If you are interested in joining this workshop, please see the Participate page.

Registration is required to attend this workshop. Space may be limited, and you are advised to register early. The link to the registration form will appear on this page approximately 10 weeks before the workshop. To submit your name for consideration, please register and await confirmation of your acceptance before booking your travel.