Single-cell Analysis of Normal and Perturbed Dynamic Biological Systems
National Institute Of Environmental Health Sciences
Abstract
Cell lineage mapping and state transition modeling during normal or disease progression using time-course single-cell data sets is still a challenge. First, current models generally assume continuity of state space and use single graphical models for lineage reconstruction. For example, our recent publication in Nature used a single-tree model to help visualize and characterize progenitor cell populations in normal and SARS-CoV-2-infected human distal lung organoids. However, a biological process like spermatogenesis has been shown to comprise an initial discrete process followed by continuous transitions. Secondly, most trajectory models rely on prior knowledge in the form of progenitor differentiation markers and do not account for the dynamic heterogeneity of progenitor cell differentiation. Thirdly, most single-cell time-dependent data are collected from developmental staging studies. Studying developmental stages ignores variation associated with phenotypic plasticity. A very important developmental process that has been shown to exhibit such a complex phenotype, characterized by reversible biological processes, is the epithelial-mesenchymal transition (EMT). To overcome the above limitations, we propose a new computational framework called Dynamic Spanning Forests Mixtures (DSFMix). DSFMix is a fast and scalable computational modeling framework that takes as input temporal or staging single-cell data collected at discrete time points and outputs a mixture of dependent trees. DSFMix uses binary decision-tree models to select statistically significant features associated with multimodality and skewness of the marginal distributions as well as with the underlying dynamic cell-to-cell variation. The selected features are then used to connect all cells with a minimum spanning tree, which is then broken up into a minimum spanning forest based on the tree motifs.
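The step of connecting cells with a minimum spanning tree and then splitting it into a forest can be illustrated with a minimal sketch. This is not the DSFMix implementation: the cut rule used here (dropping the heaviest edges) is a simple stand-in for the motif-based cut described in the abstract, and the function names, the toy `cells` coordinates and the `n_trees` parameter are all hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (weight, i, j) edges. Quadratic in the number of
    points, which is fine for a toy example only.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def split_into_forest(n_nodes, mst_edges, n_trees):
    """Drop the (n_trees - 1) heaviest MST edges and label the subtrees.

    ASSUMPTION: heaviest-edge cutting replaces the motif-based cut of the
    original method, purely for illustration.
    """
    kept = sorted(mst_edges)[: len(mst_edges) - (n_trees - 1)]
    root = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    labels = [find(i) for i in range(n_nodes)]
    return kept, labels


# Hypothetical toy data: two well-separated groups of "cells" in 2-D feature space.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
mst = minimum_spanning_tree(cells)
forest_edges, tree_labels = split_into_forest(len(cells), mst, n_trees=2)
```

In this toy case the single long edge bridging the two groups is removed, so each group becomes its own tree in the forest.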
The minimum spanning forest subtrees are derived by combining tree agglomerative hierarchical clustering (TAHC) with a dynamic branch cutting method based on the shape of the underlying dendrogram. We also show how the DSFMix algorithm can be combined with correlation search engines for new scientific discoveries. We benchmark real time-dependent trajectory models against traditional single-graph pseudotime approaches such as Spanning-tree Progression Analysis of Density-normalized Events (SPADE), Monocle3, tSpace, Wanderlust, PAGA and Slingshot using time correlation analysis. Our results show that DSFMix predictions correlate more strongly with observed time trends than those of the above pseudotime algorithms. Furthermore, we observed high variability in the correlations between the methods and observed time, demonstrating strong cell-to-cell and time-dependent variation introduced by the assumptions underlying each method. Our results indicate that gene expression during normal development exhibits a high proportion of non-uniformly distributed profiles that are mostly right-skewed and multimodal, the latter being a characteristic of major steady states during development. This finding challenges the inferences from most current statistical methods used in single-cell analysis, which are driven by averages and assume unimodality. Furthermore, DSFMix can visualize and characterize complex relationships between spanning trees or forests and the underlying unknown clusters in weighted directed biological networks derived from longitudinal or staging data. No other current approach properly leverages the power of such study designs. Our work so far motivates the continued need for better clustering and ordering algorithms for biological systems that take into account heterogeneous and dynamic tree topologies.
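The time correlation benchmark mentioned above can be sketched as follows. The abstract does not state which correlation coefficient was used, so Spearman rank correlation is an assumption here, and the `observed_time` and `pseudotime` values are made-up toy numbers rather than results from any of the named methods.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: two cells sampled at each of four collection times,
# and a made-up pseudotime ordering that agrees with the sampling order.
observed_time = [0, 0, 1, 1, 2, 2, 3, 3]
pseudotime = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
rho = spearman(pseudotime, observed_time)
```

Because several cells share each collection time, the observed times are heavily tied; a rank-based coefficient with tie handling is one reasonable way to compare a continuous ordering against discrete time points.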
We applied our DSFMix framework to visualize, test and characterize nested, intermediate and simultaneous dynamic lineages in a number of important real biological applications, including EMT, spermatogenesis, induced pluripotent stem cell (iPSC) reprogramming, early hormonal transcriptional response and COVID-19 immune response. For example, in the EMT study, DSFMix identified additional time-dependent markers, e.g. phosphorylated retinoblastoma (pRb), FAP, TROP2, Keratin-7 and CD45, that were not included in the original EMT map by Karacosta et al. 2019, and automatically identified a forest with EMT and MET trees. Furthermore, forcing several asynchronous cell types into one continuous tree made it challenging to identify, visualize and characterize individual early, intermediate and late spermatogenesis lineages. DSFMix identified a complex microenvironment involving communication between different germ and somatic cell types associated with diverse and nested differentiation trajectories during spermatogenesis. DSFMix was further able to identify a global dynamic branching trajectory during chemically induced pluripotent stem cell (CiPSC) reprogramming with 3 terminal branching differentiation lineages instead of the 2 captured by Monocle, the gold standard for scRNA-seq data, in Zhao et al. 2018. The additional terminal branch is enriched with genes such as CrxOS, which has been shown to maintain the self-renewal capacity of murine embryonic stem cells. Finally, an application of DSFMix to study the immune response to coronavirus disease (COVID-19) showed a clear visualization of patient-specific healthy-to-disease cellular progression as individual trees in a forest.
This project involves research on human coronavirus, novel coronavirus, COVID-19, Severe Acute Respiratory Syndrome coronavirus disease, SARS coronavirus, SARS-coronavirus-2, SARS-cov-2, SARS-cov2, SARS-related coronavirus 2, Severe acute respiratory syndrome coronavirus 2, SARS-Associated Coronavirus, SARS-cov, or SARS-Related Coronavirus.