Denoising
Removing noise in sparse single-cell RNA-sequencing count data
3 datasets · 11 methods · 2 control methods · 2 metrics
v1.0.0
MIT
Single-cell RNA-Seq protocols only detect a fraction of the mRNA molecules present in each cell. As a result, the measurements (UMI counts) observed for each gene and each cell are associated with generally high levels of technical noise (Grün et al., 2014). Denoising describes the task of estimating the true expression level of each gene in each cell. In the single-cell literature, this task is also referred to as imputation, a term which is typically used for missing data problems in statistics. Similar to the use of the terms “dropout”, “missing data”, and “technical zeros”, this terminology can create confusion about the underlying measurement process (Sarkar and Stephens, 2021).
A key challenge in evaluating denoising methods is the general lack of a ground truth. A recent benchmark study (Hou et al., 2020) relied on flow-sorted datasets, mixture control experiments (Tian et al., 2019), and comparisons with bulk RNA-Seq data. Since each of these approaches suffers from specific limitations, it is difficult to combine these different approaches into a single quantitative measure of denoising accuracy. Here, we instead rely on an approach termed molecular cross-validation (MCV), which was specifically developed to quantify denoising accuracy in the absence of a ground truth (Batson et al., 2019). In MCV, the observed molecules in a given scRNA-Seq dataset are first partitioned between a training and a test dataset. Next, a denoising method is applied to the training dataset. Finally, denoising accuracy is measured by comparing the result to the test dataset. The authors show that both in theory and in practice, the measured denoising accuracy is representative of the accuracy that would be obtained on a ground truth dataset.
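The partitioning step of MCV can be sketched as follows. This is a minimal illustration using only the Python standard library, not the benchmark's implementation: the function name `split_molecules`, the `train_frac` parameter, and its value of 0.9 are assumptions made for the example.

```python
import random

def split_molecules(counts, train_frac=0.9, seed=0):
    """Partition each UMI count into a training part and a test part.

    Hypothetical helper: every observed molecule is assigned independently
    to the training set with probability train_frac (an assumed value).
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        train_row, test_row = [], []
        for n in row:
            # Count how many of the n molecules land in the training set.
            k = sum(rng.random() < train_frac for _ in range(n))
            train_row.append(k)
            test_row.append(n - k)
        train.append(train_row)
        test.append(test_row)
    return train, test

counts = [[5, 0, 12], [1, 3, 0]]  # toy cells x genes UMI matrix
train, test = split_molecules(counts)
```

A denoising method would then be run on `train` only, and its output compared against `test`, which the method never saw.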
Results
Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.
Dataset info
Pancreas (inDrop)
Human pancreatic islet scRNA-seq data from 6 datasets across technologies (CEL-seq, CEL-seq2, Smart-seq2, inDrop, Fluidigm C1, and SMARTER-seq). Here we use only the inDrop1 batch, which includes 1937 cells × 15502 genes (Luecken et al. 2021).
1k Peripheral blood mononuclear cells
1k peripheral blood mononuclear cells (PBMCs) from a healthy donor, sequenced with 10x v3 chemistry in November 2018 by 10x Genomics (10x Genomics 2018).
Tabula Muris Senis Lung
All lung cells from Tabula Muris Senis, a 500k-cell atlas from 18 organs and tissues across the mouse lifespan. Here we use only the 10x data from lung: 24540 cells × 16160 genes across 3 time points (Tabula Muris Consortium 2020).
Method info
ALRA (log norm)
Repository · Source Code · Container · v1.0.0
ALRA (Adaptively-thresholded Low Rank Approximation) is a method for imputation of missing values in single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first imputes values using a rank-k approximation computed by singular value decomposition. Next, a symmetric distribution is fitted to the near-zero imputed values for each gene (row) of the matrix. The right “tail” of this distribution is then used to threshold the accepted nonzero entries. This same threshold is then used to rescale the matrix, once the “biological zeros” have been removed (Linderman, Zhao, and Kluger 2018).
ALRA (log norm, reversed normalization)
Repository · Source Code · Container · v1.0.0
ALRA (Adaptively-thresholded Low Rank Approximation) is a method for imputation of missing values in single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first imputes values using a rank-k approximation computed by singular value decomposition. Next, a symmetric distribution is fitted to the near-zero imputed values for each gene (row) of the matrix. The right “tail” of this distribution is then used to threshold the accepted nonzero entries. This same threshold is then used to rescale the matrix, once the “biological zeros” have been removed (Linderman, Zhao, and Kluger 2018).
ALRA (sqrt norm)
Repository · Source Code · Container · v1.0.0
ALRA (Adaptively-thresholded Low Rank Approximation) is a method for imputation of missing values in single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first imputes values using a rank-k approximation computed by singular value decomposition. Next, a symmetric distribution is fitted to the near-zero imputed values for each gene (row) of the matrix. The right “tail” of this distribution is then used to threshold the accepted nonzero entries. This same threshold is then used to rescale the matrix, once the “biological zeros” have been removed (Linderman, Zhao, and Kluger 2018).
ALRA (sqrt norm, reversed normalization)
Repository · Source Code · Container · v1.0.0
ALRA (Adaptively-thresholded Low Rank Approximation) is a method for imputation of missing values in single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first imputes values using a rank-k approximation computed by singular value decomposition. Next, a symmetric distribution is fitted to the near-zero imputed values for each gene (row) of the matrix. The right “tail” of this distribution is then used to threshold the accepted nonzero entries. This same threshold is then used to rescale the matrix, once the “biological zeros” have been removed (Linderman, Zhao, and Kluger 2018).
DCA
Repository · Source Code · Container · v1.0.0
DCA (Deep Count Autoencoder) is a method to remove the effect of dropout in scRNA-seq data. DCA takes into account the count structure, overdispersed nature and sparsity of scRNA-seq data using a deep autoencoder with a zero-inflated negative binomial (ZINB) loss. The trained autoencoder is then applied to the dataset, and the mean of the fitted negative binomial distribution is used to fill each entry of the imputed matrix (Eraslan et al. 2019).
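To make the ZINB loss concrete, the sketch below evaluates the negative log-likelihood of a single count under a zero-inflated negative binomial. It covers only the per-entry loss term, not DCA's network or training procedure. The function names and the parameter names `mu` (mean), `theta` (dispersion) and `pi` (dropout probability) are chosen for this example, and `mu > 0`, `theta > 0` and `0 <= pi < 1` are assumed.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu, dispersion theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a ZINB with dropout probability pi."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        # A zero comes either from the dropout component or from the NB itself.
        return -math.log(pi + (1.0 - pi) * nb)
    # A nonzero count can only come from the NB component.
    return -math.log((1.0 - pi) * nb)
```

With `pi = 0` the loss reduces to the plain negative binomial negative log-likelihood, and the fitted `mu` plays the role of the denoised value described above.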
KNN smoothing
Repository · Source Code · Container · v1.0.0
KNN-smoothing is a method for denoising data based on k-nearest neighbours. Given a normalised scRNA-seq matrix, KNN-smoothing calculates a k-nearest neighbour matrix using Euclidean distances between cell pairs. Each cell’s denoised expression is then defined as the average expression of its neighbours (Open Problems for Single Cell Analysis Consortium 2022).
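A minimal sketch of this averaging, using only the standard library, might look as follows. The function name `knn_smooth` is hypothetical, and counting each cell as its own nearest neighbour is an assumption of this example rather than something the description specifies.

```python
import math

def knn_smooth(matrix, k):
    """Replace each cell (row) by the mean of its k nearest cells (Euclidean distance)."""
    smoothed = []
    for cell in matrix:
        # Rank all cells by distance to this one; the cell itself is included
        # at distance 0 (an assumption of this sketch).
        nearest = sorted(matrix, key=lambda other: math.dist(cell, other))[:k]
        smoothed.append([sum(col) / len(nearest) for col in zip(*nearest)])
    return smoothed

# Toy normalised matrix: two pairs of similar cells, two genes each.
cells = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
smoothed = knn_smooth(cells, k=2)
```

On this toy input, each cell is averaged with its closest partner, so the two cells of each pair end up with identical profiles.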
Iterative KNN smoothing
Repository · Source Code · Container · v1.0.0
Iterative KNN-smoothing is a method to repair or denoise noisy scRNA-seq expression matrices. Given an scRNA-seq expression matrix, iterative KNN-smoothing first applies initial normalisation and smoothing. Then, a chosen number of principal components is used to calculate Euclidean distances between cells. Minimally sized neighbourhoods are initially determined from these Euclidean distances, and expression profiles are shared between neighbouring cells. The resultant smoothed matrix is then used as input to the next step of smoothing, where the size (k) of the considered neighbourhoods is increased, leading to greater smoothing. This process continues until a chosen maximum k value has been reached, at which point the iteratively smoothed matrix is optionally scaled to yield the final result (Wagner, Yan, and Yanai 2018).
MAGIC
Repository · Source Code · Container · v1.0.0
MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputation and denoising of noisy or dropout-prone single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first calculates Euclidean distances between each pair of cells in the dataset; these distances are then passed through a Gaussian kernel (function) and row-normalised to give a normalised affinity matrix. A t-step Markov process is then calculated by powering this affinity matrix t times. Finally, the powered affinity matrix is right-multiplied by the normalised data, so that each imputed value is a per-gene average weighted by the affinities of cells. The resultant imputed matrix is then rescaled to more closely match the magnitude of measurements in the normalised (input) matrix (Dijk et al. 2018).
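The diffusion steps described above can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the MAGIC package's implementation: it uses a single fixed kernel bandwidth `sigma`, assumes `t >= 1`, and omits the final rescaling step. The names `matmul` and `magic_sketch` are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def magic_sketch(data, sigma=1.0, t=3):
    """Diffuse expression values across similar cells (rows of data)."""
    # Gaussian kernel on pairwise Euclidean distances between cells.
    affinity = [[math.exp(-(math.dist(u, v) / sigma) ** 2) for v in data] for u in data]
    # Row-normalise so each row is a probability distribution over cells.
    markov = []
    for row in affinity:
        total = sum(row)
        markov.append([a / total for a in row])
    # t-step Markov process: raise the transition matrix to the power t.
    powered = markov
    for _ in range(t - 1):
        powered = matmul(powered, markov)
    # Each imputed value is an affinity-weighted average over cells, per gene.
    return matmul(powered, data)

data = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]  # toy normalised cells x genes matrix
imputed = magic_sketch(data)
```

Because every row of the powered matrix sums to one, each imputed value is a weighted average of the input values for that gene and stays within their range.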
MAGIC (approximate)
Repository · Source Code · Container · v1.0.0
MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputation and denoising of noisy or dropout-prone single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first calculates Euclidean distances between each pair of cells in the dataset; these distances are then passed through a Gaussian kernel (function) and row-normalised to give a normalised affinity matrix. A t-step Markov process is then calculated by powering this affinity matrix t times. Finally, the powered affinity matrix is right-multiplied by the normalised data, so that each imputed value is a per-gene average weighted by the affinities of cells. The resultant imputed matrix is then rescaled to more closely match the magnitude of measurements in the normalised (input) matrix (Dijk et al. 2018).
MAGIC (approximate, reversed normalization)
Repository · Source Code · Container · v1.0.0
MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputation and denoising of noisy or dropout-prone single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first calculates Euclidean distances between each pair of cells in the dataset; these distances are then passed through a Gaussian kernel (function) and row-normalised to give a normalised affinity matrix. A t-step Markov process is then calculated by powering this affinity matrix t times. Finally, the powered affinity matrix is right-multiplied by the normalised data, so that each imputed value is a per-gene average weighted by the affinities of cells. The resultant imputed matrix is then rescaled to more closely match the magnitude of measurements in the normalised (input) matrix (Dijk et al. 2018).
MAGIC (reversed normalization)
Repository · Source Code · Container · v1.0.0
MAGIC (Markov Affinity-based Graph Imputation of Cells) is a method for imputation and denoising of noisy or dropout-prone single-cell RNA-sequencing data. Given a normalised scRNA-seq expression matrix, it first calculates Euclidean distances between each pair of cells in the dataset; these distances are then passed through a Gaussian kernel (function) and row-normalised to give a normalised affinity matrix. A t-step Markov process is then calculated by powering this affinity matrix t times. Finally, the powered affinity matrix is right-multiplied by the normalised data, so that each imputed value is a per-gene average weighted by the affinities of cells. The resultant imputed matrix is then rescaled to more closely match the magnitude of measurements in the normalised (input) matrix (Dijk et al. 2018).
Control method info
No denoising
Repository · Source Code · Container · v1.0.0
Denoised outputs are defined from the unmodified input data (Open Problems for Single Cell Analysis Consortium 2022).
Perfect denoising
Repository · Source Code · Container · v1.0.0
Denoised outputs are defined from the target data (Open Problems for Single Cell Analysis Consortium 2022).
Metric info
Mean-squared error
The mean squared error between the denoised counts of the training dataset and the true counts of the test dataset after reweighting by the train/test ratio (Batson, Royer, and Webber 2019).
Poisson loss
The Poisson log likelihood of observing the true counts of the test dataset under the distribution specified by the denoised dataset (Batson, Royer, and Webber 2019).
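Both metrics can be illustrated with a short standard-library sketch. Several details here are assumptions rather than the benchmark's exact procedure: the reweighting is taken to be a multiplication by (1 - train_frac) / train_frac, the Poisson loss is reported as a mean negative log-likelihood, and no further normalisation is applied before the MSE. The function names are hypothetical.

```python
import math

def rescale(denoised, train_frac):
    """Scale training-depth predictions to the expected depth of the test split."""
    ratio = (1.0 - train_frac) / train_frac  # assumed form of the train/test reweighting
    return [[v * ratio for v in row] for row in denoised]

def mse(pred, test):
    """Mean squared error between predicted and observed test counts."""
    pairs = [(p, y) for p_row, y_row in zip(pred, test) for p, y in zip(p_row, y_row)]
    return sum((p - y) ** 2 for p, y in pairs) / len(pairs)

def poisson_nll(pred, test, eps=1e-10):
    """Mean negative Poisson log-likelihood of test counts given predicted rates."""
    # Rates are clipped at eps so that log(0) never occurs.
    pairs = [(max(p, eps), y) for p_row, y_row in zip(pred, test) for p, y in zip(p_row, y_row)]
    return sum(lam - y * math.log(lam) + math.lgamma(y + 1) for lam, y in pairs) / len(pairs)

denoised_train = [[9.0, 18.0]]  # toy denoised output at training depth
test_counts = [[1, 2]]          # held-out test counts
pred = rescale(denoised_train, train_frac=0.9)
```

For both functions lower is better. Here `pred` matches the test counts up to floating-point rounding, which gives an MSE of essentially zero and the lowest Poisson loss any prediction can achieve on these test counts.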
Quality control results
| Category | Name | Value | Condition | Severity |
|---|---|---|---|---|
| Scaling | Worst score knn_smoothing poisson | -10.298315 | worst_score >= -1 | ✗✗✗ |
| Scaling | Worst score alra_sqrt poisson | -2.301203 | worst_score >= -1 | ✗✗ |
References
10x Genomics. 2018. “1k PBMCs from a Healthy Donor (V3 Chemistry).” https://www.10xgenomics.com/resources/datasets/1-k-pbm-cs-from-a-healthy-donor-v-3-chemistry-3-standard-3-0-0.
Batson, Joshua, Loïc Royer, and James Webber. 2019. “Molecular Cross-Validation for Single-Cell RNA-Seq.” bioRxiv. https://doi.org/10.1101/786269.
Dijk, David van, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Ambrose J. Carr, Cassandra Burdziak, et al. 2018. “Recovering Gene Interactions from Single-Cell Data Using Data Diffusion.” Cell 174 (3): 716–729.e27. https://doi.org/10.1016/j.cell.2018.05.061.
Eraslan, Gökcen, Lukas M. Simon, Maria Mircea, Nikola S. Mueller, and Fabian J. Theis. 2019. “Single-Cell RNA-Seq Denoising Using a Deep Count Autoencoder.” Nature Communications 10 (1). https://doi.org/10.1038/s41467-018-07931-2.
Linderman, George C., Jun Zhao, and Yuval Kluger. 2018. “Zero-Preserving Imputation of scRNA-Seq Data Using Low-Rank Approximation.” bioRxiv. https://doi.org/10.1101/397588.
Luecken, Malte D., M. Büttner, K. Chaichoompu, A. Danese, M. Interlandi, M. F. Mueller, D. C. Strobl, et al. 2021. “Benchmarking Atlas-Level Data Integration in Single-Cell Genomics.” Nature Methods 19 (1): 41–50. https://doi.org/10.1038/s41592-021-01336-8.
Open Problems for Single Cell Analysis Consortium. 2022. “Open Problems.” https://openproblems.bio.
Tabula Muris Consortium. 2020. “A Single-Cell Transcriptomic Atlas Characterizes Ageing Tissues in the Mouse.” Nature 583 (7817): 590–95. https://doi.org/10.1038/s41586-020-2496-1.
Wagner, Florian, Yun Yan, and Itai Yanai. 2018. “K-Nearest Neighbor Smoothing for High-Throughput Single-Cell RNA-Seq Data.” bioRxiv. https://doi.org/10.1101/217737.