The CEM Dataset

Overview

The CEM dataset is an unlabeled collection of 2D cellular EM images designed for self-supervised learning algorithms. Gathered from over 2 PB of data, it is heterogeneous enough to capture a significant variety of organisms, tissues, and imaging methods.

Resources

CEM1.5M: The newest release of the dataset with 1.5 million images.
CEM500K: The first release of the dataset with 500 thousand images.
CEM1.5M Pre-trained Weights: PyTorch weights for a ResNet50 model pre-trained on CEM1.5M using the SwAV algorithm.
CEM500K Pre-trained Weights: PyTorch weights for a ResNet50 model pre-trained on CEM500K using the MoCoV2 algorithm.
CEM Patch Filtering Weights: PyTorch weights for a ResNet34 model trained on 12,000 EM images that were labeled as “informative” or “uninformative”. Used to curate patches in the CEM dataset.
cem-dataset: Source code to reproduce the results of our paper; scripts to preprocess, standardize, and curate 2D and 3D EM datasets; scripts to download and prepare the EMOrganelles benchmark datasets (including the All Mitochondria benchmark established in the CEM500K paper) and SnakeMake files to evaluate pre-trained models on the benchmarks. Plus, explanatory Jupyter Notebooks.

Citing this work

If you find any of these resources useful in your work, please cite:

@article {Conrad2021,
	author = {Conrad, Ryan and Narayan, Kedar},
	doi = {10.7554/eLife.65894},
	issn = {2050-084X},
	journal = {eLife},
	month = {apr},
	title = ,
	url = {https://elifesciences.org/articles/65894},
	volume = {10},
	year = {2021}
}