OpenProblems NeurIPS2021 CITE-Seq – Open Problems in Single Cell Analysis

Info

openproblems_neurips2021/bmmc_cite
Luecken et al. (2021)
2.2 GiB
14-02-2024
90261 cells × 13953 genes

Used in

No related benchmarks found.

Description

Single-cell CITE-Seq data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X 3 prime Single-Cell Gene Expression kit with Feature Barcoding in combination with the BioLegend TotalSeq B Universal Human Panel v1.0. The dataset was generated to support the Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data were then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout: some donors were measured at multiple sites, while others were measured at a single site.
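
The nested batch layout can be inspected by grouping batch labels by donor. The sketch below assumes that each `batch` label has the form `s<site>d<donor>` (for example `s1d2`); this page does not confirm that pattern, so check the actual values of `obs['batch']` first. The function name `sites_per_donor` and the toy labels are hypothetical.

```python
import re
from collections import defaultdict

# Assumed label pattern "s<site>d<donor>" (e.g. "s1d2"); not confirmed by this page.
BATCH_PATTERN = re.compile(r"^s(\d+)d(\d+)$")


def sites_per_donor(batch_labels):
    """Map each donor number to the sorted list of sites that measured it."""
    donor_sites = defaultdict(set)
    for label in set(batch_labels):
        match = BATCH_PATTERN.match(label)
        if match is None:
            raise ValueError(f"unexpected batch label: {label!r}")
        site, donor = (int(group) for group in match.groups())
        donor_sites[donor].add(site)
    return {donor: sorted(sites) for donor, sites in sorted(donor_sites.items())}


# Toy labels for illustration only; real values would come from obs['batch'].
toy_batches = ["s1d1", "s2d1", "s3d1", "s1d2", "s2d4", "s2d4"]
layout = sites_per_donor(toy_batches)

# Donors measured at more than one site form the shared part of the nested design.
multi_site = [donor for donor, sites in layout.items() if len(sites) > 1]
print(layout)      # {1: [1, 2, 3], 2: [1], 4: [2]}
print(multi_site)  # [1]
```

Donors that appear at several sites make it possible to separate site effects from donor effects, which is the point of a nested design.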

Preview

dataset_mod1 is an AnnData object with n_obs × n_vars = 90261 × 13953 with slots:

obs: size_factors, cell_type, batch
var: feature_name, feature_id, hvg, hvg_score
obsm: X_svd
layers: counts, normalized
uns: dataset_description, dataset_id, dataset_name, dataset_organism, dataset_reference, dataset_summary, dataset_url, normalization_id

dataset_mod2 is an AnnData object with n_obs × n_vars = 90261 × 134 with slots:

obs: cell_type, batch, size_factors
var: feature_name, feature_id, hvg, hvg_score
obsm: X_svd
layers: counts, normalized
uns: dataset_description, dataset_id, dataset_name, dataset_organism, dataset_reference, dataset_summary, dataset_url, normalization_id
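
After loading either object, it can be useful to confirm that the slots listed in the preview are present. The following is a minimal sketch: `missing_slots` is a hypothetical helper, the `uns` list is abbreviated, and the stand-in object exists only so that the example runs without the 2.2 GiB files. With the real data, an object loaded through `anndata.read_h5ad` can be passed instead (the file path is up to you).

```python
from types import SimpleNamespace

# Slot names copied from the preview listing on this page (uns abbreviated).
EXPECTED_SLOTS = {
    "obs": ["size_factors", "cell_type", "batch"],
    "var": ["feature_name", "feature_id", "hvg", "hvg_score"],
    "obsm": ["X_svd"],
    "layers": ["counts", "normalized"],
    "uns": ["dataset_id", "normalization_id"],
}


def missing_slots(adata, expected=EXPECTED_SLOTS):
    """Return 'slot.key' strings for every expected key absent from adata."""
    missing = []
    for slot, keys in expected.items():
        container = getattr(adata, slot)
        # `in` tests column names for a DataFrame and keys for a mapping.
        missing.extend(f"{slot}.{key}" for key in keys if key not in container)
    return missing


# Stand-in object with plain dicts, used here instead of a real AnnData.
toy = SimpleNamespace(
    obs={"size_factors": [], "cell_type": [], "batch": []},
    var={"feature_name": [], "feature_id": [], "hvg": []},  # hvg_score left out
    obsm={"X_svd": []},
    layers={"counts": []},                                   # normalized left out
    uns={"dataset_id": "toy", "normalization_id": "toy"},
)
print(missing_slots(toy))  # ['var.hvg_score', 'layers.normalized']
```

An empty list means every expected slot was found.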

Reference

Dataset mod1

Name	Description	Type	Data type	Size
obs
`batch`	A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.	`vector`	`category`	90261
`cell_type`	Classification of the cell type based on its characteristics and function within the tissue or organism.	`vector`	`category`	90261
`size_factors`	The size factors created by the normalisation method, if any.	`vector`	`float32`	90261
var
`feature_id`	Unique identifier for the feature, usually an Ensembl gene ID.	`vector`	`object`	13953
`feature_name`	A human-readable name for the feature, usually a gene symbol.	`vector`	`object`	13953
`hvg`	Whether or not the feature is considered to be a ‘highly variable gene’.	`vector`	`bool`	13953
`hvg_score`	A score used to rank the features by variability.	`vector`	`float64`	13953
obsm
`X_svd`	The resulting SVD embedding.	`densematrix`	`float32`	90261 × 100
layers
`counts`	Raw counts	`sparsematrix`	`float32`	90261 × 13953
`normalized`	Normalised expression values	`sparsematrix`	`float32`	90261 × 13953
uns
`dataset_description`	Long description of the dataset.	`atomic`	`str`	1
`dataset_id`	A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.	`atomic`	`str`	1
`dataset_name`	A human-readable name for the dataset.	`atomic`	`str`	1
`dataset_organism`	The organism of the sample in the dataset.	`atomic`	`str`	1
`dataset_reference`	Bibtex reference of the paper in which the dataset was published.	`atomic`	`str`	1
`dataset_summary`	Short description of the dataset.	`atomic`	`str`	1
`dataset_url`	Link to the original source of the dataset.	`atomic`	`str`	1
`normalization_id`	Which normalization was used	`atomic`	`str`	1
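
The `hvg` flag and `hvg_score` columns can be combined to pick a ranked subset of features. The sketch below works on plain arrays and assumes that a larger `hvg_score` means a more variable feature, which this page does not state. The function name `top_hvg_indices` and the toy values are hypothetical.

```python
import numpy as np


def top_hvg_indices(hvg, hvg_score, n_top=None):
    """Return column indices of flagged features, ordered by descending score."""
    hvg = np.asarray(hvg, dtype=bool)
    hvg_score = np.asarray(hvg_score, dtype=float)
    flagged = np.flatnonzero(hvg)
    # Assumption: a larger hvg_score means a more variable feature.
    order = np.argsort(-hvg_score[flagged], kind="stable")
    return flagged[order][:n_top]


# Toy values for illustration; real ones would come from var['hvg'] and var['hvg_score'].
hvg = [True, False, True, True, False]
score = [0.2, 0.9, 1.5, 0.7, 0.1]
top = top_hvg_indices(hvg, score, n_top=2)
print(top.tolist())  # [2, 3]
```

The returned indices can then be used to subset columns of a layer, for example `layers['normalized'][:, top]`.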

Dataset mod2

Name	Description	Type	Data type	Size
obs
`batch`	A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.	`vector`	`category`	90261
`cell_type`	Classification of the cell type based on its characteristics and function within the tissue or organism.	`vector`	`category`	90261
`size_factors`	The size factors created by the normalisation method, if any.	`vector`	`float32`	90261
var
`feature_id`	Unique identifier for the feature, usually an Ensembl gene ID.	`vector`	`object`	134
`feature_name`	A human-readable name for the feature, usually a gene symbol.	`vector`	`object`	134
`hvg`	Whether or not the feature is considered to be a ‘highly variable gene’.	`vector`	`bool`	134
`hvg_score`	A score used to rank the features by variability.	`vector`	`float64`	134
obsm
`X_svd`	The resulting SVD embedding.	`densematrix`	`float32`	90261 × 100
layers
`counts`	Raw counts	`sparsematrix`	`float32`	90261 × 134
`normalized`	Normalised expression values	`sparsematrix`	`float32`	90261 × 134
uns
`dataset_description`	Long description of the dataset.	`atomic`	`str`	1
`dataset_id`	A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.	`atomic`	`str`	1
`dataset_name`	A human-readable name for the dataset.	`atomic`	`str`	1
`dataset_organism`	The organism of the sample in the dataset.	`atomic`	`str`	1
`dataset_reference`	Bibtex reference of the paper in which the dataset was published.	`atomic`	`str`	1
`dataset_summary`	Short description of the dataset.	`atomic`	`str`	1
`dataset_url`	Link to the original source of the dataset.	`atomic`	`str`	1
`normalization_id`	Which normalization was used	`atomic`	`str`	1

Slot crossref data

`dataset_mod1.layers['counts']`

In R: dataset_mod1$layers[["counts"]]

Type: sparsematrix, data type: float32, shape: 90261 × 13953

Raw counts

`dataset_mod1.layers['normalized']`

In R: dataset_mod1$layers[["normalized"]]

Type: sparsematrix, data type: float32, shape: 90261 × 13953

Normalised expression values

`dataset_mod1.obs['size_factors']`

In R: dataset_mod1$obs[["size_factors"]]

Type: vector, data type: float32, shape: 90261

The size factors created by the normalisation method, if any.

`dataset_mod1.obs['cell_type']`

In R: dataset_mod1$obs[["cell_type"]]

Type: vector, data type: category, shape: 90261

Classification of the cell type based on its characteristics and function within the tissue or organism.

`dataset_mod1.obs['batch']`

In R: dataset_mod1$obs[["batch"]]

Type: vector, data type: category, shape: 90261

A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.

`dataset_mod1.obsm['X_svd']`

In R: dataset_mod1$obsm[["X_svd"]]

Type: densematrix, data type: float32, shape: 90261 × 100

The resulting SVD embedding.

`dataset_mod1.uns['dataset_description']`

In R: dataset_mod1$uns[["dataset_description"]]

Type: atomic, data type: str, shape: 1

Long description of the dataset.

`dataset_mod1.uns['dataset_id']`

In R: dataset_mod1$uns[["dataset_id"]]

Type: atomic, data type: str, shape: 1

A unique identifier for the dataset. This is different from the obs.dataset_id field, which is the identifier for the dataset from which the cell data is derived.

`dataset_mod1.uns['dataset_name']`

In R: dataset_mod1$uns[["dataset_name"]]

Type: atomic, data type: str, shape: 1

A human-readable name for the dataset.

`dataset_mod1.uns['dataset_organism']`

In R: dataset_mod1$uns[["dataset_organism"]]

Type: atomic, data type: str, shape: 1

The organism of the sample in the dataset.

`dataset_mod1.uns['dataset_reference']`

In R: dataset_mod1$uns[["dataset_reference"]]

Type: atomic, data type: str, shape: 1

Bibtex reference of the paper in which the dataset was published.

`dataset_mod1.uns['dataset_summary']`

In R: dataset_mod1$uns[["dataset_summary"]]

Type: atomic, data type: str, shape: 1

Short description of the dataset.

`dataset_mod1.uns['dataset_url']`

In R: dataset_mod1$uns[["dataset_url"]]

Type: atomic, data type: str, shape: 1

Link to the original source of the dataset.

`dataset_mod1.uns['normalization_id']`

In R: dataset_mod1$uns[["normalization_id"]]

Type: atomic, data type: str, shape: 1

Which normalization was used

`dataset_mod1.var['feature_name']`

In R: dataset_mod1$var[["feature_name"]]

Type: vector, data type: object, shape: 13953

A human-readable name for the feature, usually a gene symbol.

`dataset_mod1.var['feature_id']`

In R: dataset_mod1$var[["feature_id"]]

Type: vector, data type: object, shape: 13953

Unique identifier for the feature, usually an Ensembl gene ID.

`dataset_mod1.var['hvg']`

In R: dataset_mod1$var[["hvg"]]

Type: vector, data type: bool, shape: 13953

Whether or not the feature is considered to be a ‘highly variable gene’.

`dataset_mod1.var['hvg_score']`

In R: dataset_mod1$var[["hvg_score"]]

Type: vector, data type: float64, shape: 13953

A score used to rank the features by variability.

`dataset_mod2.layers['counts']`

In R: dataset_mod2$layers[["counts"]]

Type: sparsematrix, data type: float32, shape: 90261 × 134

Raw counts

`dataset_mod2.layers['normalized']`

In R: dataset_mod2$layers[["normalized"]]

Type: sparsematrix, data type: float32, shape: 90261 × 134

Normalised expression values

`dataset_mod2.obs['cell_type']`

In R: dataset_mod2$obs[["cell_type"]]

Type: vector, data type: category, shape: 90261

Classification of the cell type based on its characteristics and function within the tissue or organism.

`dataset_mod2.obs['batch']`

In R: dataset_mod2$obs[["batch"]]

Type: vector, data type: category, shape: 90261

A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.

`dataset_mod2.obs['size_factors']`

In R: dataset_mod2$obs[["size_factors"]]

Type: vector, data type: float32, shape: 90261

The size factors created by the normalisation method, if any.

`dataset_mod2.obsm['X_svd']`

In R: dataset_mod2$obsm[["X_svd"]]

Type: densematrix, data type: float32, shape: 90261 × 100

The resulting SVD embedding.

`dataset_mod2.uns['dataset_description']`

In R: dataset_mod2$uns[["dataset_description"]]

Type: atomic, data type: str, shape: 1

Long description of the dataset.

`dataset_mod2.uns['dataset_id']`

In R: dataset_mod2$uns[["dataset_id"]]

Type: atomic, data type: str, shape: 1

A unique identifier for the dataset. This is different from the obs.dataset_id field, which is the identifier for the dataset from which the cell data is derived.

`dataset_mod2.uns['dataset_name']`

In R: dataset_mod2$uns[["dataset_name"]]

Type: atomic, data type: str, shape: 1

A human-readable name for the dataset.

`dataset_mod2.uns['dataset_organism']`

In R: dataset_mod2$uns[["dataset_organism"]]

Type: atomic, data type: str, shape: 1

The organism of the sample in the dataset.

`dataset_mod2.uns['dataset_reference']`

In R: dataset_mod2$uns[["dataset_reference"]]

Type: atomic, data type: str, shape: 1

Bibtex reference of the paper in which the dataset was published.

`dataset_mod2.uns['dataset_summary']`

In R: dataset_mod2$uns[["dataset_summary"]]

Type: atomic, data type: str, shape: 1

Short description of the dataset.

`dataset_mod2.uns['dataset_url']`

In R: dataset_mod2$uns[["dataset_url"]]

Type: atomic, data type: str, shape: 1

Link to the original source of the dataset.

`dataset_mod2.uns['normalization_id']`

In R: dataset_mod2$uns[["normalization_id"]]

Type: atomic, data type: str, shape: 1

Which normalization was used

`dataset_mod2.var['feature_name']`

In R: dataset_mod2$var[["feature_name"]]

Type: vector, data type: object, shape: 134

A human-readable name for the feature, usually a gene symbol.

`dataset_mod2.var['feature_id']`

In R: dataset_mod2$var[["feature_id"]]

Type: vector, data type: object, shape: 134

Unique identifier for the feature, usually an Ensembl gene ID.

`dataset_mod2.var['hvg']`

In R: dataset_mod2$var[["hvg"]]

Type: vector, data type: bool, shape: 134

Whether or not the feature is considered to be a ‘highly variable gene’.

`dataset_mod2.var['hvg_score']`

In R: dataset_mod2$var[["hvg_score"]]

Type: vector, data type: float64, shape: 134

A score used to rank the features by variability.

References

Luecken, Malte, Daniel Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann Chen, et al. 2021. “A Sandbox for Prediction and Integration of DNA, RNA, and Proteins in Single Cells.” In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, edited by J. Vanschoren and S. Yeung. Vol. 1. Curran. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/158f3069a435b314a80bdcb024f8e422-Paper-round2.pdf.