About DeltaTopic
Single-cell RNA-seq technology has been successfully applied to profile regulatory-genomic changes in studies of many human disease mechanisms. Our capability to measure mRNA molecules at the single-cell level has dramatically changed the research paradigm in genomics and translational medicine. A typical single-cell study implicitly treats observed transcript levels as static values, as if every cell were fixed at a particular state. Recently, researchers have developed a complementary method to measure gene expression dynamics (the speed of splicing) by taking the difference between the spliced and unspliced counts in scRNA-seq profiles.
Why is it difficult to estimate full-scale dynamics in data sets with limited snapshots?
Probabilistic inference of full-scale dynamics often poses a substantial challenge, and the inferred rate parameters may vary greatly depending on the normalization and embedding methods. Existing velocity analysis methods rely on a critical assumption that most single-cell data sets fail to meet at the study-design level. Most single-cell data sets, especially those collected from patient-derived cancer samples, span only a few snapshots of the full developmental, evolutionary, or disease progression process. In human case-control studies, cells may not have reached steady states in the disease progression process and are likely to provide too little information for most genes and pathways. Such discontinuity and sparsity in data collection force statistical inference algorithms to rely on an unrealistic steady-state assumption and on interpolated data points with high uncertainty.
Why do we need a topic model for transcription dynamics?
Despite these difficulties, the gene expression dynamics implied by the transcript-level difference between the spliced and unspliced counts provide a valuable perspective that goes beyond conventional static analysis. To overcome the limitations of poor and incomplete data in single-cell RNA velocity analysis, we propose a new modelling framework, DeltaTopic, short for Dynamically-Encoded Latent Transcriptomic pattern Analysis by Topic Modelling.
DeltaTopic combines two ideas:
Latent topic analysis that will guide unsupervised machine learning for discovering new dynamic cell states.
Application of first-order approximation to learn robust relationships between the spliced and unspliced counts instead of estimating a full trajectory of ODE models.
For a latent topic model, we view each cell as a document and each gene as a word to make model parameters directly interpretable while keeping the Bayesian model’s capability to impute missing information. The simplified dynamic model also permits an intuitive interpretation of spliced-unspliced differences as multiplicative “delta” parameters in the model.
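To make the cell-as-document view and the multiplicative "delta" concrete, here is a minimal NumPy sketch. It is not the DeltaTopic implementation. It assumes, purely for illustration, that the expected unspliced rate of a cell is a mixture of topic-specific gene frequencies and that the expected spliced rate is the same mixture with each topic-gene entry scaled by a delta value. The names `theta`, `beta`, `delta`, `depth`, and `expected_rates` are hypothetical and do not come from the DeltaTopic code; see Zhang et al. (2023) for the actual model.

```python
import numpy as np


def expected_rates(theta, beta, delta, depth=1.0):
    """Return expected (unspliced, spliced) rates under the assumed toy model.

    theta: (cells x topics) topic proportions, one row per cell ("document").
    beta:  (topics x genes) gene frequencies, one row per topic ("words").
    delta: (topics x genes) multiplicative factors linking the two count types.
    depth: assumed expected total unspliced count per cell.
    """
    rate_unspliced = depth * (theta @ beta)
    # Assumption: spliced rate = unspliced rate scaled by delta, topic by topic.
    rate_spliced = depth * (theta @ (beta * delta))
    return rate_unspliced, rate_spliced


rng = np.random.default_rng(0)
n_cells, n_genes, n_topics = 200, 50, 3

# Each cell mixes topics; each topic is a distribution over genes.
theta = rng.dirichlet(np.ones(n_topics), size=n_cells)
beta = rng.dirichlet(np.ones(n_genes), size=n_topics)

# Hypothetical deltas, log-normally scattered around 1 (1 means "no change").
delta = np.exp(rng.normal(loc=0.0, scale=0.5, size=(n_topics, n_genes)))

rate_unspliced, rate_spliced = expected_rates(theta, beta, delta, depth=5000.0)

# Draw count matrices from the expected rates (Poisson noise is an assumption).
unspliced = rng.poisson(rate_unspliced)
spliced = rng.poisson(rate_spliced)
```

In this toy setting, a cell that belongs entirely to one topic has a spliced-to-unspliced rate ratio equal to that topic's delta for every gene, which is the sense in which the delta parameters can be read directly as spliced-unspliced differences.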
We developed DeltaTopic and applied it to single-cell data sets on pancreatic ductal adenocarcinoma (PDAC), one of the most challenging cancer types with a poor prognosis. In the latent space, our model identified cancer survival-specific topics marked by a unique set of gene expression dynamics. We also found that DeltaTopic further dissected sub-topics that traditional clustering methods clump together, implicating novel gene modules and cell states that are dynamically controlled along with cancer progression. With synthetic data sets, we demonstrate the effectiveness of DeltaTopic and BALSAM in cell label prediction and gene activity identification. Both methods significantly outperformed conventional principal component analysis (PCA), with DeltaTopic showing particular strength in recovering both static and dynamic gene activities.
See Zhang et al. (2023) for a detailed exposition of the methods.