Abstract
Realistic synthetic data are essential for benchmarking the many computational tools developed for single-cell and spatial omics data. Here we propose scDesign3, a unified statistical framework that generates single-cell and spatial omics data from discrete cell types and continuous cell trajectories. Notably, scDesign3 uses a single probabilistic model with an accessible likelihood. This probabilistic formulation is advantageous because it enables inference of the cell heterogeneity structure that best fits a dataset, by leveraging the statistical model selection principle. Moreover, scDesign3 has interpretable parameters that can be adjusted to generate in silico negative and positive controls, providing the basis for false discovery rate control and power evaluation. Finally, scDesign3 coupled with scReadSim can generate sequence reads as well as read counts, allowing the benchmarking of low-level computational tools.