From the Inside: Algorithmic Challenges in Genomics

by Cenk Sahinalp

Since Descartes, and especially in the contributions of Kepler, Galileo and Newton, the main goal of the physical sciences has been to develop universal laws expressed as mathematical formulae, to which one can "input" measured conditions, i.e., known variables, to produce an "output" representing certain aspects of the unknown. Biology as a scientific discipline did not follow this reductionist path until the revolutionary works of Darwin and Mendel, which demonstrated that biological systems share common traits and patterns that can be expressed as "models". By the early 20th century, mathematical methods, and statistical methods in particular, were already in use for the analysis of quantitative data in the biological sciences. The roots of computational biology, or bioinformatics, can be traced to that time, although, as in other sciences, biological data analyses were initially performed by humans. This changed after the advent of electronic computers, which were developed while pioneering physicists like Delbruck and Crick were establishing that hereditary information is carried by DNA, in long double-helical chains encoding digital information – just like the tape of a Turing machine. Although the first computers, such as ENIAC, were initially employed in science for physical (thermonuclear) modeling by von Neumann and others, it soon became natural for the likes of Gamow and Ulam to consider using ENIAC for biomolecular modeling and comparison. Meanwhile, Fisher began to use EDSAC in Britain to assess statistical significance through his notion of the p-value. All of this happened while Turing and Shannon, the respective founders of theoretical computer science and information theory, were themselves working on problems in biological computing.

By the late 1950s, computers were already used in population studies, species classification, and taxonomy construction. However, all of these applications essentially used the computer as a fancy calculator. While the work of Wiener and others on cybernetics laid the foundations of systems biology, much of the groundwork in that direction was speculative and not very algorithmic. Arguably, computational biology only became a distinct discipline in the 1970s, with the biomolecular sequence comparison algorithms of Needleman and Wunsch, and later of Smith and Waterman, developed in response to the growth of molecular biology. The 1970s also witnessed the development of the first protein structure prediction algorithms to aid crystallography and molecular dynamics simulations.

The first bioinformatics algorithms were developed alongside the young field of theoretical computer science. After the introduction of Sanger sequencing, the first computational methods for genomic and proteomic sequence similarity search, as well as for RNA secondary structure prediction, benefited from new ideas in algorithm design and analysis, such as dynamic programming (e.g., Levenshtein's work on edit distance computation) and data structures for string matching (e.g., the suffix trees of Weiner and McCreight). With the kick-off of the Human Genome Project, the 1990s witnessed a significant increase in algorithmic activity around biomolecular problems. Novel methods for multiple sequence alignment, motif finding, haplotyping, phylogenetic tree construction, and protein structure analysis and prediction emerged at this time, all preparing the field for the eventual completion of the human genome sequence. The late 1990s were especially exciting, with both public and private efforts aiming to assemble the human genome from Sanger sequencing reads. The simultaneous development of microarray technologies resulted in numerous statistical and combinatorial techniques for clustering, classification and regression. Additional advances in co-expression analysis and in protein interaction prediction through the yeast two-hybrid system expanded systems biology into the realm of network science.
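To give a concrete flavor of the dynamic programming ideas mentioned above, here is a minimal sketch of unit-cost Levenshtein edit distance between two sequences. The function name and the toy DNA strings are illustrative only, and the recurrence is the textbook unit-cost one rather than the scoring scheme of any particular alignment tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance via the classic O(len(a) * len(b)) dynamic program."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances between the empty prefix of a and prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # curr[j] = distance between a[:i] and b[:j]
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return prev[n]

# Toy example: the second fragment has one extra base, so the distance is 1.
print(edit_distance("GATTACA", "GCATTACA"))
```

Needleman-Wunsch and Smith-Waterman generalize this recurrence with substitution scores and gap penalties (global and local alignment, respectively), but they share the same quadratic table-filling structure.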

The completion of the Human Genome Project and the emergence of next-generation sequencing technologies moved bioinformatics into a new era of Big Data science. Read mapping and variant detection algorithms, gene annotation and functional element discovery methods, and gene expression and alternative splicing analysis tools that use RNA-Seq data became central to international scientific projects of unprecedented scale, such as the 1000 Genomes Project, ENCODE and modENCODE, The Cancer Genome Atlas, and the International Cancer Genome Consortium.

As the size and variety of biomolecular data grow, interactions among molecular biology, theoretical computer science, statistics and statistical machine learning will need to deepen. Although algorithmic technology from the '80s and '90s, such as suffix arrays, locality sensitive hashing and color coding, as well as general techniques in linear and non-linear optimization and approximation algorithms for NP-hard problems, has already found profound applications in bioinformatics, newer techniques in streaming, sketching, metric embeddings, compressed data structures, differential privacy, homomorphic encryption and others have yet to attract significant interest from the computational biology community. Similarly, emerging problems in big data genomics, transcriptomics and proteomics have hardly attracted any attention in the theoretical computer science community. The objective of the Spring 2016 Simons Institute program on Algorithmic Challenges in Genomics is thus to rekindle collaboration between the two communities, increasing the awareness within each community of the latest developments in the other.
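As one illustration of the kind of sketching technique referred to above, the following toy example estimates the Jaccard similarity of two sequences' k-mer sets from small bottom-s MinHash sketches (the idea underlying tools such as Mash). The function names, the choice of blake2b as the hash, and the k-mer length, sketch size and sequences are all arbitrary choices made for this sketch, not taken from any specific tool.

```python
import hashlib

def kmer_set(seq: str, k: int):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bottom_sketch(kmers, s: int):
    """Bottom-s MinHash sketch: the s smallest 64-bit hash values of the k-mer set."""
    hashes = sorted(int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
                    for km in kmers)
    return hashes[:s]

def jaccard_estimate(sk_a, sk_b, s: int):
    """Estimate Jaccard similarity from the s smallest hashes of the merged sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:s]
    shared = sum(1 for h in merged if h in set(sk_a) and h in set(sk_b))
    return shared / len(merged)

# Toy sequences differing in a single base; the sketches here contain every k-mer hash,
# so the estimate equals the exact Jaccard index. Real applications would use k around 21
# and sketch sizes in the hundreds or thousands, far smaller than the full k-mer sets.
a = "ACGTACGTGGCTAGCTAGGATCCGATCGATCGTACGATCG"
b = "ACGTACGTGGCTAGCTAGGATCCGATCGATCGTACGATGG"
s = 64
sk_a, sk_b = bottom_sketch(kmer_set(a, 8), s), bottom_sketch(kmer_set(b, 8), s)
print(jaccard_estimate(sk_a, sk_b, s))
```

Because the sketches are tiny compared to the full k-mer sets, such estimates can be computed and compared across thousands of genomes in memory, which is what makes sketching attractive at the data scales discussed above.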

Our program thus started with a boot camp consisting of several mini-courses on basic string algorithms and their applications to next-generation sequencing technologies, regulatory genomics and epigenomics, cancer genomics, and network biology. During the semester, the Institute hosted three workshops, each focusing on a distinct application area of algorithmics in bioinformatics.

The first of these meetings focused on cancer biology, covering a diverse set of computational problems. The meeting featured talks on (i) identifying molecular mechanisms responsible for large-scale genomic structural alterations and alternative splicing, (ii) improving reference genome representation and alignment, (iii) resolving the origins of tumor heterogeneity and inferring tumor phylogenies through deep sequencing, (iv) coupling single-cell sequencing and long-read technologies with signal deconvolution algorithms, (v) using population-scale mutational profiles and the integration of molecular and imaging data for improved phenotyping, (vi) using network traversal and propagation models to identify tumor drivers and potential drug targets, and (vii) detecting mutual exclusivity and other patterns of coordinated genetic activity for improved mechanistic characterization of tumor progression.

The second workshop focused on how gene expression is controlled in cells, through an understanding of the sequence elements and nuclear organization (including the three-dimensional structure of chromosomes) that govern this expression. Techniques for analyzing new data stemming from massively parallel sequencing technologies were of particular interest, both from a classical computational biology perspective and in response to the emerging need to analyze new data types in an integrative manner using machine learning approaches, with the goal of understanding how the same DNA sequence is read out differently under different circumstances.

The third workshop, on network biology, aimed to understand the web of interactions among cellular components, which affects virtually all cellular activities and disease processes. By understanding interactions among genes, proteins, RNAs, complexes and other molecules, it may be possible to reach a higher level of understanding of organismal function, dynamics and development. Molecular networks also provide a holistic framework within which to interpret other high-throughput, heterogeneous measurements (e.g., expression levels, mutations, chemical modifications) that are being collected across organisms, individuals, cell types and conditions, and increasingly at the level of individual cells. This workshop brought together a diverse set of scientists focused on inferring and analyzing cellular networks, using a wide range of techniques including wet-lab experiments, graph algorithms, combinatorial optimization, machine learning and statistics.

