Study Guide to Virtual Cell
The idea of a Virtual Cell has been gaining momentum: build foundation models trained on massive single-cell atlases so we can simulate cellular behavior in silico. It’s ambitious, and the literature is growing fast. This post is my attempt to organize the key papers and concepts into a coherent study path.
Regulatory Genomics Papers
Before diving into foundation models, it helps to understand the regulatory genomics landscape these models are trying to capture. These papers establish core concepts around how genetic variation shapes gene expression and disease.
- The role of regulatory variation in complex traits and disease — Albert & Kruglyak (2015), Nature Reviews Genetics. Comprehensive review linking regulatory variants to phenotypic variation and disease risk.
- Effects of cis and trans Genetic Ancestry on Gene Expression in African Americans — Price et al. (2008), PLOS Genetics. Demonstrates that ~12% of heritable variation in gene expression is due to cis variants, using admixture mapping in African Americans.
- Impact of regulatory variation from RNA to protein — Battle et al. (2015), Science. Shows that eQTL effects are attenuated at the protein level, revealing post-transcriptional buffering of genetic variation.
- The GTEx Consortium atlas of genetic regulatory effects across human tissues — GTEx Consortium (2020), Science. The definitive multi-tissue eQTL atlas from 49 tissues and 838 donors, characterizing tissue specificity of regulatory effects.
- RNA splicing is a primary link between genetic variation and disease — Li et al. (2016), Science. Identifies splicing QTLs as major contributors to complex traits, on par with expression QTLs. Uses the LeafCutter method to quantify splicing.
- Long-range enhancer–promoter contacts in gene expression control — Schoenfelder & Fraser (2019), Nature Reviews Genetics. Reviews how 3D genome architecture facilitates enhancer–promoter communication over large genomic distances.
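Several of the papers above revolve around eQTLs, so a toy version of the underlying test may help anchor the term. The sketch below is my own illustration, not the pipeline of any paper in the list: it regresses one gene's expression on genotype dosage (0, 1, or 2 copies of an allele) and reports the slope and the fraction of expression variance the variant explains. The data are synthetic, and the names `eqtl_slope`, `true_effect`, and `n_donors` are made up for this example. Real analyses also adjust for covariates such as ancestry and batch, which is omitted here.

```python
import numpy as np

def eqtl_slope(genotype, expression):
    """Least-squares slope of expression on genotype dosage (0, 1, 2)."""
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(expression, dtype=float)
    g_centered = g - g.mean()
    return float(g_centered @ (y - y.mean()) / (g_centered @ g_centered))

# Synthetic example (assumed values, not taken from any study).
rng = np.random.default_rng(0)
n_donors = 500
true_effect = 0.5                                # assumed per-allele effect
genotype = rng.binomial(2, 0.3, size=n_donors)   # allele dosage per donor
expression = true_effect * genotype + rng.normal(0.0, 1.0, size=n_donors)

beta = eqtl_slope(genotype, expression)
# Squared correlation = share of expression variance explained by the variant.
variance_explained = np.corrcoef(genotype, expression)[0, 1] ** 2
print(f"estimated effect: {beta:.2f}, variance explained: {variance_explained:.2f}")
```

A cis-eQTL scan repeats this test for every variant near a gene, while a trans scan tests distant variants; the regression itself is the same in both cases.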
Early Generative & Perturbation Attempts (2019 – 2021)
A key step toward virtual cells was learning to predict how cells respond to perturbations, well before large foundation models existed. These early works showed that generative models, especially variational autoencoders, could capture meaningful biological variation in latent space.
- scGen predicts single-cell perturbation responses — Lotfollahi, Wolf & Theis (2019), Nature Methods. Pioneering work that combines a variational autoencoder (VAE) with latent-space vector arithmetic to predict single-cell perturbation responses. By learning a shared latent representation of cells, scGen can extrapolate how unseen cell types would respond to a perturbation, without requiring matched perturbed/unperturbed data for every cell type. Demonstrated out-of-sample prediction across cell types, studies, and species.
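The latent-space arithmetic in the scGen entry is easier to see in code. Here is a minimal sketch under heavy assumptions: a PCA projection stands in for the trained VAE encoder and decoder, the data are synthetic, and every name (`fit_linear_autoencoder`, `predict_perturbed`, `shift`, and so on) is hypothetical rather than part of the scGen package. The idea it shows is the one described above: estimate a perturbation vector as the difference of latent means in a cell type where both conditions were observed, then add it to the latent codes of control cells from a cell type never seen perturbed.

```python
import numpy as np

def fit_linear_autoencoder(X, n_latent):
    """PCA stand-in for a trained VAE; returns (encode, decode) functions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_latent]                   # shape: (n_latent, n_genes)

    def encode(x):
        return (x - mean) @ components.T

    def decode(z):
        return z @ components + mean

    return encode, decode

def predict_perturbed(encode, decode, ctrl_train, pert_train, ctrl_query):
    """Shift query control cells by the latent perturbation vector."""
    delta = encode(pert_train).mean(axis=0) - encode(ctrl_train).mean(axis=0)
    return decode(encode(ctrl_query) + delta)

# Synthetic data: cell types A and B share one perturbation shift (assumption).
rng = np.random.default_rng(0)
n_genes, n_cells = 20, 200
baseline_a = rng.normal(0.0, 1.0, n_genes)
baseline_b = rng.normal(0.0, 1.0, n_genes)
shift = rng.normal(0.0, 1.0, n_genes)

def sample(center):
    return center + rng.normal(0.0, 0.1, (n_cells, n_genes))

ctrl_a, pert_a, ctrl_b = sample(baseline_a), sample(baseline_a + shift), sample(baseline_b)

# Perturbed B cells are never shown to the model.
encode, decode = fit_linear_autoencoder(np.vstack([ctrl_a, pert_a, ctrl_b]), n_latent=5)
pred_b = predict_perturbed(encode, decode, ctrl_a, pert_a, ctrl_b)

truth_b = baseline_b + shift
error_pred = np.abs(pred_b.mean(axis=0) - truth_b).mean()
error_ctrl = np.abs(ctrl_b.mean(axis=0) - truth_b).mean()
print(f"mean abs error, prediction: {error_pred:.3f}, unperturbed control: {error_ctrl:.3f}")
```

The toy works this cleanly because the simulated perturbation is the same additive shift in both cell types and the model is linear. The motivation for a VAE is that real responses are not additive in gene space, with the hope that they become closer to additive in a learned nonlinear latent space.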
Under construction…