OpenProblems NeurIPS2021 Multiome – Open Problems in Single Cell Analysis

Info

openproblems_neurips2021/bmmc_multiome
Luecken et al. (2021)
7.78 GiB
14-02-2024
69249 cells × 13431 genes

Used in

No related benchmarks found.

Description

Single-cell Multiome data collected from bone marrow mononuclear cells of 12 healthy human donors using the 10X Multiome Gene Expression and Chromatin Accessibility kit. The dataset was generated to support the Multimodal Single-Cell Data Integration Challenge at NeurIPS 2021. Samples were prepared using a standard protocol at four sites. The resulting data were then annotated to identify cell types and remove doublets. The dataset was designed with a nested batch layout: some donors were measured at multiple sites, while others were measured at a single site.
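Because of the nested layout, one donor can appear under more than one batch label. The sketch below groups batch labels by donor to show which donors were measured at several sites. It assumes, purely for illustration, that each `batch` value has the form `s<site>d<donor>` (for example `s1d2`). This page does not document the label format, so inspect `obs['batch']` before relying on it. The labels in `example` are made up.

```python
import re
from collections import defaultdict

# Hypothetical label format (an assumption, not documented on this page):
# "s<site>d<donor>", e.g. "s1d2" = site 1, donor 2.
_LABEL = re.compile(r"^s(\d+)d(\d+)$")


def sites_per_donor(batch_labels):
    """Map each donor number to the sorted list of sites where it was measured."""
    sites = defaultdict(set)
    for label in set(batch_labels):
        match = _LABEL.match(label)
        if match is None:
            raise ValueError(f"unexpected batch label: {label!r}")
        site, donor = match.groups()
        sites[int(donor)].add(int(site))
    return {donor: sorted(found) for donor, found in sorted(sites.items())}


def multi_site_donors(batch_labels):
    """Return the donors that were measured at more than one site."""
    return [d for d, s in sites_per_donor(batch_labels).items() if len(s) > 1]


# Made-up per-cell batch labels, for illustration only.
example = ["s1d1", "s1d1", "s2d1", "s1d2", "s3d3", "s2d1"]
print(sites_per_donor(example))
print(multi_site_donors(example))
```

Donors measured at several sites let site effects be separated from donor effects, which is one reason to look at this grouping before choosing a train/test split.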

Preview

dataset_mod1 is an AnnData object with n_obs × n_vars = 69249 × 13431 with slots:

obs: size_factors, cell_type, batch
var: feature_name, feature_id, hvg, hvg_score
obsm: X_svd
layers: counts, normalized
uns: dataset_description, dataset_id, dataset_name, dataset_organism, dataset_reference, dataset_summary, dataset_url, normalization_id

dataset_mod2 is an AnnData object with n_obs × n_vars = 69249 × 116490 with slots:

obs: cell_type, batch, size_factors
var: feature_name, feature_id, hvg, hvg_score
obsm: X_svd
layers: counts, normalized
uns: dataset_description, dataset_id, dataset_name, dataset_organism, dataset_reference, dataset_summary, dataset_url, normalization_id
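The slot lists above can be turned into a quick sanity check after download. The following is a minimal sketch, not an official loader: `missing_slots` only relies on each slot exposing `.keys()`, which holds for AnnData objects and for the plain-dict stand-in used here. `load_and_check` needs the third-party `anndata` package, and the file name `dataset_mod1.h5ad` in its docstring is a hypothetical path.

```python
from types import SimpleNamespace

# Slot names copied from the preview above (identical for mod1 and mod2).
EXPECTED_SLOTS = {
    "obs": {"size_factors", "cell_type", "batch"},
    "var": {"feature_name", "feature_id", "hvg", "hvg_score"},
    "obsm": {"X_svd"},
    "layers": {"counts", "normalized"},
    "uns": {
        "dataset_description", "dataset_id", "dataset_name",
        "dataset_organism", "dataset_reference", "dataset_summary",
        "dataset_url", "normalization_id",
    },
}


def missing_slots(adata, expected=EXPECTED_SLOTS):
    """Return {slot: sorted missing keys} for every slot lacking expected keys."""
    missing = {}
    for slot, keys in expected.items():
        present = set(getattr(adata, slot).keys())
        absent = sorted(keys - present)
        if absent:
            missing[slot] = absent
    return missing


def load_and_check(path):
    """Load an .h5ad file, e.g. the hypothetical 'dataset_mod1.h5ad', and check it."""
    import anndata as ad  # imported lazily; only needed for real files

    adata = ad.read_h5ad(path)
    return adata, missing_slots(adata)


# Stand-in object with two slots deliberately left out, to show the report format.
toy = SimpleNamespace(
    obs={"size_factors": [], "cell_type": [], "batch": []},
    var={"feature_name": [], "feature_id": [], "hvg": []},
    obsm={},
    layers={"counts": None, "normalized": None},
    uns=dict.fromkeys(EXPECTED_SLOTS["uns"]),
)
print(missing_slots(toy))
```

An empty dictionary means every slot listed in the preview is present.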

Reference

Dataset mod1

Name	Description	Type	Data type	Size
obs
`batch`	A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.	`vector`	`category`	69249
`cell_type`	Classification of the cell type based on its characteristics and function within the tissue or organism.	`vector`	`category`	69249
`size_factors`	The size factors created by the normalisation method, if any.	`vector`	`float32`	69249
var
`feature_id`	Unique identifier for the feature, usually an ENSEMBL gene ID.	`vector`	`object`	13431
`feature_name`	A human-readable name for the feature, usually a gene symbol.	`vector`	`object`	13431
`hvg`	Whether the feature is considered a ‘highly variable gene’.	`vector`	`bool`	13431
`hvg_score`	A score used to rank the features by how highly variable they are.	`vector`	`float64`	13431
obsm
`X_svd`	The resulting SVD embedding.	`densematrix`	`float32`	69249 × 100
layers
`counts`	Raw counts	`sparsematrix`	`float32`	69249 × 13431
`normalized`	Normalised expression values	`sparsematrix`	`float32`	69249 × 13431
uns
`dataset_description`	Long description of the dataset.	`atomic`	`str`	1
`dataset_id`	A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.	`atomic`	`str`	1
`dataset_name`	A human-readable name for the dataset.	`atomic`	`str`	1
`dataset_organism`	The organism of the sample in the dataset.	`atomic`	`str`	1
`dataset_reference`	Bibtex reference of the paper in which the dataset was published.	`atomic`	`str`	1
`dataset_summary`	Short description of the dataset.	`atomic`	`str`	1
`dataset_url`	Link to the original source of the dataset.	`atomic`	`str`	1
`normalization_id`	Which normalization method was used.	`atomic`	`str`	1

Dataset mod2

Name	Description	Type	Data type	Size
obs
`batch`	A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.	`vector`	`category`	69249
`cell_type`	Classification of the cell type based on its characteristics and function within the tissue or organism.	`vector`	`category`	69249
`size_factors`	The size factors created by the normalisation method, if any.	`vector`	`float32`	69249
var
`feature_id`	Unique identifier for the feature, usually an ENSEMBL gene ID.	`vector`	`object`	116490
`feature_name`	A human-readable name for the feature, usually a gene symbol.	`vector`	`object`	116490
`hvg`	Whether the feature is considered a ‘highly variable gene’.	`vector`	`bool`	116490
`hvg_score`	A score used to rank the features by how highly variable they are.	`vector`	`float64`	116490
obsm
`X_svd`	The resulting SVD embedding.	`densematrix`	`float32`	69249 × 100
layers
`counts`	Raw counts	`sparsematrix`	`float32`	69249 × 116490
`normalized`	Normalised expression values	`sparsematrix`	`float32`	69249 × 116490
uns
`dataset_description`	Long description of the dataset.	`atomic`	`str`	1
`dataset_id`	A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.	`atomic`	`str`	1
`dataset_name`	A human-readable name for the dataset.	`atomic`	`str`	1
`dataset_organism`	The organism of the sample in the dataset.	`atomic`	`str`	1
`dataset_reference`	Bibtex reference of the paper in which the dataset was published.	`atomic`	`str`	1
`dataset_summary`	Short description of the dataset.	`atomic`	`str`	1
`dataset_url`	Link to the original source of the dataset.	`atomic`	`str`	1
`normalization_id`	Which normalization method was used.	`atomic`	`str`	1
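The tables list both layers as `sparsematrix` with `float32` values. A short calculation from the shapes above shows why dense storage would be impractical for mod2: one dense `float32` layer of 69249 × 116490 would take roughly 30 GiB, compared with the 7.78 GiB listed for the dataset in the Info section. The helper names below are illustrative, and the 4-byte item size simply follows from `float32`.

```python
def dense_nbytes(n_rows, n_cols, itemsize=4):
    """Bytes needed to store an n_rows x n_cols matrix densely (float32 = 4 bytes)."""
    return n_rows * n_cols * itemsize


def to_gib(nbytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return nbytes / 2**30


# Shapes taken from the reference tables above.
shapes = {
    "mod1 layer (counts or normalized)": (69249, 13431),
    "mod2 layer (counts or normalized)": (69249, 116490),
    "obsm X_svd": (69249, 100),
}

for name, (n_rows, n_cols) in shapes.items():
    size = to_gib(dense_nbytes(n_rows, n_cols))
    print(f"{name}: {size:.2f} GiB if stored densely")
```

The `X_svd` embedding is small enough to be stored densely, which matches its `densematrix` type in the tables.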

Slot crossref data

`dataset_mod1.layers['counts']`

In R: dataset_mod1$layers[["counts"]]

Type: sparsematrix, data type: float32, shape: 69249 × 13431

Raw counts

`dataset_mod1.layers['normalized']`

In R: dataset_mod1$layers[["normalized"]]

Type: sparsematrix, data type: float32, shape: 69249 × 13431

Normalised expression values

`dataset_mod1.obs['size_factors']`

In R: dataset_mod1$obs[["size_factors"]]

Type: vector, data type: float32, shape: 69249

The size factors created by the normalisation method, if any.

`dataset_mod1.obs['cell_type']`

In R: dataset_mod1$obs[["cell_type"]]

Type: vector, data type: category, shape: 69249

Classification of the cell type based on its characteristics and function within the tissue or organism.

`dataset_mod1.obs['batch']`

In R: dataset_mod1$obs[["batch"]]

Type: vector, data type: category, shape: 69249

A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.

`dataset_mod1.obsm['X_svd']`

In R: dataset_mod1$obsm[["X_svd"]]

Type: densematrix, data type: float32, shape: 69249 × 100

The resulting SVD embedding.

`dataset_mod1.uns['dataset_description']`

In R: dataset_mod1$uns[["dataset_description"]]

Type: atomic, data type: str, shape: 1

Long description of the dataset.

`dataset_mod1.uns['dataset_id']`

In R: dataset_mod1$uns[["dataset_id"]]

Type: atomic, data type: str, shape: 1

A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.

`dataset_mod1.uns['dataset_name']`

In R: dataset_mod1$uns[["dataset_name"]]

Type: atomic, data type: str, shape: 1

A human-readable name for the dataset.

`dataset_mod1.uns['dataset_organism']`

In R: dataset_mod1$uns[["dataset_organism"]]

Type: atomic, data type: str, shape: 1

The organism of the sample in the dataset.

`dataset_mod1.uns['dataset_reference']`

In R: dataset_mod1$uns[["dataset_reference"]]

Type: atomic, data type: str, shape: 1

Bibtex reference of the paper in which the dataset was published.

`dataset_mod1.uns['dataset_summary']`

In R: dataset_mod1$uns[["dataset_summary"]]

Type: atomic, data type: str, shape: 1

Short description of the dataset.

`dataset_mod1.uns['dataset_url']`

In R: dataset_mod1$uns[["dataset_url"]]

Type: atomic, data type: str, shape: 1

Link to the original source of the dataset.

`dataset_mod1.uns['normalization_id']`

In R: dataset_mod1$uns[["normalization_id"]]

Type: atomic, data type: str, shape: 1

Which normalization method was used.

`dataset_mod1.var['feature_name']`

In R: dataset_mod1$var[["feature_name"]]

Type: vector, data type: object, shape: 13431

A human-readable name for the feature, usually a gene symbol.

`dataset_mod1.var['feature_id']`

In R: dataset_mod1$var[["feature_id"]]

Type: vector, data type: object, shape: 13431

Unique identifier for the feature, usually an ENSEMBL gene ID.

`dataset_mod1.var['hvg']`

In R: dataset_mod1$var[["hvg"]]

Type: vector, data type: bool, shape: 13431

Whether the feature is considered a ‘highly variable gene’.

`dataset_mod1.var['hvg_score']`

In R: dataset_mod1$var[["hvg_score"]]

Type: vector, data type: float64, shape: 13431

A score used to rank the features by how highly variable they are.

`dataset_mod2.layers['counts']`

In R: dataset_mod2$layers[["counts"]]

Type: sparsematrix, data type: float32, shape: 69249 × 116490

Raw counts

`dataset_mod2.layers['normalized']`

In R: dataset_mod2$layers[["normalized"]]

Type: sparsematrix, data type: float32, shape: 69249 × 116490

Normalised expression values

`dataset_mod2.obs['cell_type']`

In R: dataset_mod2$obs[["cell_type"]]

Type: vector, data type: category, shape: 69249

Classification of the cell type based on its characteristics and function within the tissue or organism.

`dataset_mod2.obs['batch']`

In R: dataset_mod2$obs[["batch"]]

Type: vector, data type: category, shape: 69249

A batch identifier. This label is very context-dependent and may be a combination of the tissue, assay, donor, etc.

`dataset_mod2.obs['size_factors']`

In R: dataset_mod2$obs[["size_factors"]]

Type: vector, data type: float32, shape: 69249

The size factors created by the normalisation method, if any.

`dataset_mod2.obsm['X_svd']`

In R: dataset_mod2$obsm[["X_svd"]]

Type: densematrix, data type: float32, shape: 69249 × 100

The resulting SVD embedding.

`dataset_mod2.uns['dataset_description']`

In R: dataset_mod2$uns[["dataset_description"]]

Type: atomic, data type: str, shape: 1

Long description of the dataset.

`dataset_mod2.uns['dataset_id']`

In R: dataset_mod2$uns[["dataset_id"]]

Type: atomic, data type: str, shape: 1

A unique identifier for the dataset. This is different from the `obs.dataset_id` field, which is the identifier for the dataset from which the cell data is derived.

`dataset_mod2.uns['dataset_name']`

In R: dataset_mod2$uns[["dataset_name"]]

Type: atomic, data type: str, shape: 1

A human-readable name for the dataset.

`dataset_mod2.uns['dataset_organism']`

In R: dataset_mod2$uns[["dataset_organism"]]

Type: atomic, data type: str, shape: 1

The organism of the sample in the dataset.

`dataset_mod2.uns['dataset_reference']`

In R: dataset_mod2$uns[["dataset_reference"]]

Type: atomic, data type: str, shape: 1

Bibtex reference of the paper in which the dataset was published.

`dataset_mod2.uns['dataset_summary']`

In R: dataset_mod2$uns[["dataset_summary"]]

Type: atomic, data type: str, shape: 1

Short description of the dataset.

`dataset_mod2.uns['dataset_url']`

In R: dataset_mod2$uns[["dataset_url"]]

Type: atomic, data type: str, shape: 1

Link to the original source of the dataset.

`dataset_mod2.uns['normalization_id']`

In R: dataset_mod2$uns[["normalization_id"]]

Type: atomic, data type: str, shape: 1

Which normalization method was used.

`dataset_mod2.var['feature_name']`

In R: dataset_mod2$var[["feature_name"]]

Type: vector, data type: object, shape: 116490

A human-readable name for the feature, usually a gene symbol.

`dataset_mod2.var['feature_id']`

In R: dataset_mod2$var[["feature_id"]]

Type: vector, data type: object, shape: 116490

Unique identifier for the feature, usually an ENSEMBL gene ID.

`dataset_mod2.var['hvg']`

In R: dataset_mod2$var[["hvg"]]

Type: vector, data type: bool, shape: 116490

Whether the feature is considered a ‘highly variable gene’.

`dataset_mod2.var['hvg_score']`

In R: dataset_mod2$var[["hvg_score"]]

Type: vector, data type: float64, shape: 116490

A score used to rank the features by how highly variable they are.

References

Luecken, Malte, Daniel Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann Chen, et al. 2021. “A Sandbox for Prediction and Integration of DNA, RNA, and Proteins in Single Cells.” In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, edited by J. Vanschoren and S. Yeung. Vol. 1. Curran. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/158f3069a435b314a80bdcb024f8e422-Paper-round2.pdf.