How the splicing machinery defines exons or introns as the spliced unit has remained a puzzle for 30 years. Here, we demonstrate that peripheral and central regions of the nucleus harbor genes with two distinct exon-intron GC content architectures that differ in the splicing outcome. Genes with low GC content exons, flanked by long introns with lower GC content, are localized in the periphery, and the exons are defined as the spliced unit. Alternative splicing of these genes results in exon skipping. In contrast, the nuclear center contains genes with a high GC content in the exons and short flanking introns. Most splicing of these genes occurs via intron definition, and aberrant splicing leads to intron retention. We demonstrate that the nuclear periphery and center generate different environments for the regulation of alternative splicing and that two sets of splicing factors form discrete regulatory subnetworks for the two gene architectures. Our study connects 3D genome organization and splicing, thus demonstrating that exon and intron definition modes of splicing occur in different nuclear regions.
Tuesday, July 5th, 2022
Each cell type in a solid tissue has a characteristic transcriptome and spatial arrangement, both of which are observable using modern spatial omics assays. Surprisingly, however, spatial information is frequently ignored when clustering cells to identify cell types and states. In fact, spatial location is typically considered only when solving the related, but distinct, problem of demarcating tissue domains (which could include multiple cell types). We present BANKSY, an algorithm that unifies cell type clustering and domain segmentation by constructing a product space of cell and neighborhood transcriptomes, representing cell state and microenvironment, respectively. BANKSY's spatial kernel-based feature augmentation strategy improves performance on both tasks when tested on diverse FISH- and sequencing-based spatial omics datasets. Uniquely, BANKSY identified hitherto undetected niche-dependent cell states in mouse brain. We also show that quality control of spatial omics data can be formulated as a domain identification problem and solved using BANKSY. Lastly, BANKSY is orders of magnitude faster and more scalable than existing spatial clustering methods, and thus capable of processing the large datasets generated by emerging spatial technologies. In summary, BANKSY represents an accurate, biologically motivated, scalable, and versatile framework for analyzing spatial omics data.
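The idea of a product space of cell and neighborhood transcriptomes can be illustrated with a minimal sketch. This is not the BANKSY implementation: the function name `augment_with_neighborhood`, the Gaussian kernel, the k-nearest-neighbor cutoff, and the mixing parameter `lam` with its square-root weighting are all assumptions made for illustration. Each cell's own expression vector is concatenated with a kernel-weighted average of its spatial neighbors' expression, so that a downstream clustering step sees both cell state and microenvironment.

```python
import math

def augment_with_neighborhood(expr, coords, k=2, sigma=1.0, lam=0.5):
    """Concatenate each cell's expression with its neighborhood average.

    expr   : list of per-cell expression vectors (one list of floats per cell)
    coords : list of per-cell spatial coordinates, same order as expr
    k      : number of nearest neighbors to average over (hypothetical default)
    sigma  : width of the Gaussian spatial kernel (hypothetical default)
    lam    : mixing weight in [0, 1]; 0 ignores neighbors, 1 ignores the cell
    """
    n_cells = len(expr)
    augmented = []
    for i in range(n_cells):
        # Find the k spatially closest other cells.
        nearest = sorted(
            (math.dist(coords[i], coords[j]), j)
            for j in range(n_cells) if j != i
        )[:k]
        # Gaussian kernel: closer neighbors contribute more.
        weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d, _ in nearest]
        total = sum(weights)
        n_genes = len(expr[i])
        neighborhood = [
            sum(w * expr[j][g] for w, (_, j) in zip(weights, nearest)) / total
            for g in range(n_genes)
        ]
        # Scale the two blocks so that lam trades off cell vs. neighborhood.
        own_block = [math.sqrt(1 - lam) * x for x in expr[i]]
        neighbor_block = [math.sqrt(lam) * x for x in neighborhood]
        augmented.append(own_block + neighbor_block)
    return augmented

# Toy data: three cells, two genes, positions along a line.
toy_expr = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
toy_augmented = augment_with_neighborhood(toy_expr, toy_coords, k=1, lam=0.5)
```

In this toy example, the third cell differs from the first two in its own expression but shares their neighborhood block, which is the kind of information a clustering step could use to separate cell type from tissue context.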
Wednesday, July 6th, 2022
Metastasis is the primary cause of mortality in cancer. In our studies, we show that breast cancer metastasis can be prevented by limiting cell movement using small RNA treatment. We compare experimental models that reveal the pathways involved in cancer aggressiveness. We pinpoint potential diagnostic markers that dictate the course of the disease. Overall, large data analysis and experimental studies assist in better understanding cancer genomics.
The chromosomes of the human genome are organized in three dimensions through compartmentalization of the cell nucleus, and different genomic loci also interact with each other. However, the principles underlying such 3D genome organization and its functional impact remain poorly understood. In this talk, I will introduce some of our recent work in developing representation learning methods to study single-cell 3D genome organization. Our methods reveal the single-cell chromatin interactome patterns in different cellular conditions and at different scales. We hope that these algorithms will provide new insights into the structure and function of nuclear organization in health and disease.
I will discuss a unifying statistical formulation for many fundamental problems in genome science and develop a reference-free, highly efficient algorithm that solves it. This formulation allows us to construct an algorithm that performs inference on raw reads, avoiding references completely. We illustrate the power of our approach for new data-driven biological discovery with examples of novel single-cell resolved, cell-type-specific isoform expression, including splicing, expression in the major histocompatibility complex, and de novo prediction of viral protein adaptation including in SARS-CoV-2.
Thursday, July 7th, 2022
To understand how the microbiome and viral infections contribute to non-communicable human diseases, it is important to understand the network perturbations effected by these microbial agents. At the Institute of Network Biology (INET), we aim to understand the principles of protein interaction network function, define patterns of network perturbation by disease genetics, and determine how microbial and viral perturbations interact with host genetics to modulate genetic disease risk. We use an integrated approach consisting of high-throughput network mapping, bioinformatic and deep-learning analyses, and targeted validation of specific hypotheses. I will present unpublished data on systematic interactome maps of coronaviral and microbiome-encoded proteins in the human host network, their relation to human genetics, and extensive functional validation data.
Complex traits are established through the joint influences of multiple genetic and environmental perturbations. There is a shortage of generalizable principles explaining how molecular networks integrate genetic and environmental effects, ultimately leading to complex cellular and organismal traits. In particular, it is poorly understood when and how genetic perturbations lead to molecular changes that are confined to small parts of a network versus when they lead to large-scale adaptations of global network states. Here, we present a concept classifying genetic effects as local, regional or global depending on what fraction of a molecular network they affect. We exemplify this notion using transcriptome, proteome and phospho-proteome profiling of genetically heterogeneous populations of yeast strains, which we integrate with an array of cellular traits. Our analysis identified a central gauge of the yeast molecular network that is related to PKA and TOR (PT) signaling. The resulting ‘PT state’ could be summarized in a single value that explained large parts of the molecular configuration of the strains. This PT state was associated with a specific balance between cellular processes spanning energy and amino acid metabolism, transcription, translation, cell cycle control and cellular stress response. Carbon source quality, oxidative stress, and gene-environment interactions caused monotonic shifts of the molecular network state along the same axis. We further show that complex traits like heat stress resistance and longevity (stationary phase viability) result from the synthesis of genetic effects modulating this PT state with global network effects, plus much more trait-specific effects modulating only small parts of the network. Our work provides a rationale for the conditions under which genetic effects propagate through molecular networks with pleiotropic consequences.
Complementary methods are required to fully characterize multiprotein complexes in vitro and in vivo. Affinity purification coupled to mass spectrometry (MS) can identify the composition of protein complexes at scale. However, information on direct contacts between subunits is often lacking. In contrast, solving the 3D structure of protein complexes by X-ray diffraction or cryo-electron microscopy can provide this information, but is not yet scalable for proteome-wide efforts. We have developed quantitative bioluminescence-based methods that facilitate binary interaction mapping in mammalian cells with high sensitivity and specificity. We have applied these technologies to study the associations of huntingtin (HTT), a protein of unknown function at the root of Huntington’s disease. We found that HTT controls the abundance of its partner HAP40 in mammalian cells, suggesting that it functions as a scaffold that prevents the degradation of partner proteins. In another systematic screen, we identified high-confidence binary interactions for proteins of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which were subsequently used in an in silico compound screen. We discovered a new chemical compound that directly targets the interaction between NSP10 and NSP16, which is critical for virus replication. Finally, we defined partners for the AAA ATPase p97, which interacts with many proteins and plays a functional role in various subcellular processes. We found that p97 associates with splicing regulators in an ASPL-dependent manner, suggesting a functional link between the p97:ASPL complex and mRNA processing. Overall, systematic mapping of direct interactions between proteins in higher-order protein assemblies facilitates a better understanding of cellular and disease processes. Also, high-confidence binary interactions are important drug targets with a high potential for innovation in therapy development.
Cancer genomes accumulate many somatic mutations resulting from imperfections in DNA processing during the normal cell cycle as well as from carcinogenic exposures or cancer-related aberrations of the DNA maintenance machinery. These processes often lead to distinctive patterns of mutations, called mutational signatures. Considering these signatures as quantitative traits, we can leverage them for studies of the interactions between mutagenic processes, other cellular processes, and the environment. Untangling these interactions is critical for understanding the processes underlying mutational signatures and their impact on the organism. I will discuss several computational approaches, including a method for the deconvolution of the contributions of DNA damage and repair to the mutational landscape of cancer.
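To make the notion of a signature as a quantitative trait concrete, the sketch below estimates how many mutations in one sample are attributable to each of a set of known signatures. It is a generic non-negative least-squares refit using multiplicative updates, not the damage-and-repair deconvolution method mentioned in the abstract. The two four-channel signatures, the counts, and the function name `fit_exposures` are hypothetical; real signatures are usually defined over many more mutation channels.

```python
def fit_exposures(signatures, counts, n_iter=500):
    """Estimate non-negative exposures e so that sum_j e[j] * signatures[j] ~ counts.

    signatures : list of signatures, each a list of per-channel probabilities
    counts     : observed mutation counts per channel for one sample
    n_iter     : number of multiplicative update steps (hypothetical default)
    """
    k = len(signatures)
    channels = range(len(counts))
    # Precompute S^T m and S^T S for the update rule.
    st_m = [sum(signatures[j][c] * counts[c] for c in channels) for j in range(k)]
    st_s = [
        [sum(signatures[i][c] * signatures[j][c] for c in channels) for j in range(k)]
        for i in range(k)
    ]
    exposures = [1.0] * k  # strictly positive start keeps all updates non-negative
    for _ in range(n_iter):
        denom = [sum(st_s[j][i] * exposures[i] for i in range(k)) for j in range(k)]
        exposures = [exposures[j] * st_m[j] / denom[j] for j in range(k)]
    return exposures

# Toy example: two made-up signatures over four mutation channels.
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.1, 0.1, 0.7]
# Counts built as 30 mutations from sig_a plus 10 from sig_b.
toy_counts = [30 * a + 10 * b for a, b in zip(sig_a, sig_b)]
toy_exposures = fit_exposures([sig_a, sig_b], toy_counts)
```

The per-sample exposures obtained this way are the values that could then be treated as quantitative traits and related to other cellular or environmental variables.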
A powerful method to study the genotype-to-phenotype relationship is the systematic assessment of mutant phenotypes using genetically accessible model systems. We have developed and applied methods for quantitative analysis of genetic interactions in double mutants using yeast colony size as a proxy for cell fitness. Our global digenic interaction network reveals a hierarchy of functional modules, including pathways and complexes, bioprocesses and cell compartments. We have also expanded our systematic genetics pipeline to include single-cell image-based readouts and arrays of yeast strains expressing GFP-tagged proteins for exploration of proteome dynamics and the effects of genetic perturbations on subcellular compartment morphology. Recently, we have leveraged the principles of genetic networks that we discovered in yeast to map genetic interactions in human HAP1 cells using genome-wide CRISPR/Cas9 screens. Our yeast work guided our selection of query genes to screen and provided a roadmap for extracting functional information from the resulting data. The interactions screened to date include more than 85% of the genes in the human genome that are expressed in HAP1 cells, and, as was observed in yeast, interaction profile similarity is highly predictive of gene function. I will describe our results in the context of our ongoing efforts to discover the principles of genetic networks in yeast and apply what we learn to understand the functional organization of the human genome.
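As a rough illustration of the two quantities discussed above, the sketch below scores a digenic interaction and compares interaction profiles. It assumes a multiplicative null model (expected double-mutant fitness equals the product of the single-mutant fitnesses) and Pearson correlation as the similarity measure; the abstract names neither, and the function names and toy numbers are hypothetical.

```python
import math

def interaction_score(fitness_a, fitness_b, fitness_ab):
    """Deviation of double-mutant fitness from the multiplicative expectation.

    Negative values suggest a synthetic sick/lethal interaction,
    positive values suggest suppression or masking.
    """
    return fitness_ab - fitness_a * fitness_b

def pearson(x, y):
    """Pearson correlation between two equally long interaction profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Toy interaction profiles: scores of three genes against the same four queries.
profile_x = [-0.30, 0.00, 0.20, -0.10]
profile_y = [-0.25, 0.05, 0.15, -0.10]
profile_z = [0.20, -0.10, -0.30, 0.00]

similarity_xy = pearson(profile_x, profile_y)  # similar profiles
similarity_xz = pearson(profile_x, profile_z)  # opposing profiles
```

Genes whose profiles correlate strongly, like the first two toy genes here, would be candidates for sharing a pathway or complex.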
Friday, July 8th, 2022
Approaches for the identification of disease-causal mutations are widely applied in research and clinical settings, but interpretation and ranking of the resulting variants remain challenging. Combined Annotation Dependent Depletion (CADD, https://cadd-sv.bihealth.org/) integrates annotations by contrasting variants that survived purifying selection along the human lineage with simulated mutations to score short sequence variants (SNVs, InDels, multi-allelic substitutions). Since its publication (Kircher, Witten et al. Nat Genet. 2014), CADD has been widely adopted by the community, and minor adjustments and fixes have been released, including native support of both the GRCh37 and GRCh38 assemblies (Rentzsch et al. NAR 2019). Recently, we assessed existing deep neural network (DNN) models for splice effects with the Multiplexed Functional Assay of Splicing using Sort-seq dataset (MFASS, Cheung et al. Mol Cell. 2019). For integration into CADD, we selected the two best-performing DNN models that are based only on genomic sequence, MMSplice and SpliceAI (Rentzsch et al. Genome Med. 2021). The DNN scores boosted CADD's predictions for splice effects, and we noted that while the DNN scores have superior performance on splice variants, they fail to account for nonsense and missense effects of the same variants. This suggests that variant prioritization will improve with more domain-specific information and underlines the importance of identifying additional such features, e.g. for regulatory sequences. With rapid advances in the identification of structural variants (SVs), we decided to apply the general concept of CADD to score them (CADD-SV, https://cadd-sv.bihealth.org/). While methods utilizing individual mechanistic principles like the deletion of coding sequence or 3D architecture disruptions were available, a comprehensive tool that uses the broad spectrum of available SV annotations was missing.
We show that CADD-SV scores are predictive of pathogenicity and population frequency and that CADD-SV's ability to prioritize pathogenic variants exceeds that of existing methods like SVScore and AnnotSV (Kleinert & Kircher, Genome Res. 2022). Our results highlight advantages of the CADD approach, such as profiting from a large training data set covering diverse and rare feature annotations without major ascertainment effects from historical and ongoing variant collections.
I will describe two projects that aim to better dissect the causal chain from functional genetic variant through molecular intermediates and finally to organismal trait or disease risk. In the first, we are using pooled profiling of the binding of RNA-binding proteins (RBPs, splice factors) across individuals to measure and then computationally model genetic effects on both binding and RNA splicing. In the second, we have developed a causal network inference method that scales to hundreds of nodes by leveraging convex optimization.
The host defense against invading pathogens consists of pathogen elimination (‘resistance’) and the limitation of tissue damage resulting from host-pathogen interactions (‘disease tolerance’). As disease tolerance is a critical component of the host defense, it has become of particular interest in the treatment of infectious diseases, such as influenza virus and SARS-CoV-2 infections. However, the distinct molecular programs underpinning disease tolerance and resistance have remained obscure. The lack of such molecular understanding has been a barrier to developing therapies that specifically target the disease tolerance (or resistance) components of the host defense. In this talk, I will present our identification of two distinct gene programs that are co-activated during in vivo influenza A virus (IAV) infection in lungs. We show that one program is specific to disease-tolerance phenotypes while the other program is specific to resistance phenotypes. We developed and validated the programs using in vivo IAV infection across 33 mouse strains that differ in their physiological ability to resist and tolerate infection, and by integrating transcription profiles of isolated cell types across several human cohorts. The identified decoupling between disease tolerance and resistance allowed us to reveal novel organizational principles, markers and regulators of the host defense.