Abstract
Approaches for the identification of disease-causal mutations are widely applied in research and clinical settings, but interpretation and ranking of the resulting variants remain challenging. Combined Annotation Dependent Depletion (CADD, https://cadd-sv.bihealth.org/) integrates annotations by contrasting variants that survived purifying selection along the human lineage with simulated mutations to score short sequence variants (SNVs, InDels, multi-allelic substitutions). Since its publication (Kircher, Witten et al. Nat Genet. 2014), CADD has been widely adopted by the community, and minor adjustments and fixes have been released, including native support for both the GRCh37 and GRCh38 assemblies (Rentzsch et al. NAR 2019). Recently, we assessed existing deep neural network (DNN) models for splice effects with the Multiplexed Functional Assay of Splicing using Sort-seq dataset (MFASS, Cheung et al. Mol Cell. 2019). Among the models based only on genomic sequence, we selected the two that performed best, MMSplice and SpliceAI, for integration into CADD (Rentzsch et al. Genome Med. 2021). The DNN scores improved CADD's predictions of splice effects. We also noted that, while the DNN scores perform better on splice variants, they fail to account for nonsense and missense effects of the same variants. This suggests that variant prioritization will improve with more domain-specific information and underlines the importance of identifying additional such features, e.g. for regulatory sequences. With rapid advances in the identification of structural variants (SVs), we decided to apply the general concept of CADD to score them (CADD-SV, https://cadd-sv.bihealth.org/). While methods using individual mechanistic principles, such as the deletion of coding sequence or the disruption of 3D architecture, were available, a comprehensive tool that uses the broad spectrum of available SV annotations was missing.
We show that CADD-SV scores are predictive of pathogenicity and population frequency and that CADD-SV's ability to prioritize pathogenic variants exceeds that of existing methods such as SVScore and AnnotSV (Kleinert & Kircher, Genome Res. 2022). Our results highlight advantages of the CADD approach, such as profiting from a large training data set that covers diverse and rare feature annotations without major ascertainment effects from historic and ongoing variant collections.