An Intuitive Guide

The Virtual Cell

Everything you need to understand AI virtual cell simulation — from first principles to state of the art — with no biology background required.

Updated Feb 2026
Based on Arc Virtual Cell Challenge 2025
Research through NeurIPS 2025
01 — Foundation

What Is a Cell Actually Doing?

Your body has ~37 trillion cells. Every single one contains the same DNA — the same complete instruction manual for building a human. A liver cell and a neuron have identical DNA. What makes them different is which instructions are currently being read.

This reading process is called gene expression. At any moment, a cell is reading some genes loudly, some quietly, and most not at all. The result — the complete snapshot of which genes are active and how much — is called the cell's transcriptome.

Key Analogy

Think of DNA as a massive cookbook with 20,000 recipes. The transcriptome is the list of which recipes are being cooked right now, and how many portions of each. The cookbook never changes. The cooking does — constantly.

Measuring this snapshot is now possible with a technology called scRNA-seq (single-cell RNA sequencing). It reads the transcriptome of a single cell and gives you a vector of ~20,000 numbers — one per gene — representing activity levels.
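To make the "vector of ~20,000 numbers" concrete, here is a minimal sketch in Python (synthetic counts, not real data) of a single cell's measurement and the preprocessing most pipelines apply: depth normalization to counts-per-10k followed by log1p.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "transcriptome": raw scRNA-seq counts for one cell across 20,000 genes.
# Real data is extremely sparse: most genes are not detected at all.
n_genes = 20_000
counts = rng.poisson(lam=0.05, size=n_genes)        # hypothetical raw UMI counts
active = rng.choice(n_genes, 200, replace=False)    # a few genes being read loudly
counts[active] += rng.poisson(30, 200)

# Standard preprocessing: normalize away sequencing depth, then compress with log1p.
cp10k = counts / counts.sum() * 10_000
log_expr = np.log1p(cp10k)

print(log_expr.shape)        # (20000,) -- one activity number per gene
print((counts == 0).mean())  # the vast majority of genes read zero in any one cell
```

The takeaways survive in real data: the vector is long, mostly zeros, and the raw numbers only become comparable across cells after normalizing away sequencing depth.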

The Fundamental Problem

To measure a cell's transcriptome, you have to destroy the cell. The measurement kills it. You can never watch the same cell before and after something happens — you can only compare populations of cells. This is a core source of noise in everything that follows.

The Central Concept: Genome vs. Transcriptome
```mermaid
graph LR
    A["🧬 Genome (DNA)\n~20,000 genes\nSame in every cell\nFixed"] --> B["📊 Transcriptome (RNA)\n~20,000 activity numbers\nDifferent per cell type\nChanges constantly"]
    B --> C["🔬 Cell Identity\nLiver / Neuron / Stem cell\nDefined by transcriptome pattern"]
    style A fill:#e4e0d8,stroke:#ccc9c0,color:#3a3631
    style B fill:#e4e0d8,stroke:#16803c,color:#1a1815
    style C fill:#e4e0d8,stroke:#0369a1,color:#3a3631
```

Three Distinct Concepts

| Term | What It Is | Stability |
|---|---|---|
| Genome | Full DNA — all ~20,000 genes. The complete instruction manual. | Fixed. Same in every cell of your body. |
| Transcriptome | Which genes are being read right now, and at what level. A snapshot vector of ~20,000 numbers. | Dynamic. Changes with cell type, environment, signals, time. |
| Proteome | Which proteins are currently built from those transcripts. | Dynamic. Lags behind the transcriptome by hours. |
02 — Foundation

What Is a Perturbation?

A perturbation is simply: you interfere with the cell and see what happens.

The most common type used in this field is CRISPRi — a molecular tool that physically clamps onto a specific gene in the DNA, blocking the cell's reading machinery from transcribing it. You're saying "gene X, shut up" at the source.

Important Clarification

CRISPRi acts at the DNA level — it prevents the gene from ever being read. No RNA gets produced. No protein gets built. The effect is upstream of everything. (A different technique, RNAi, cuts RNA after it's produced — but CRISPRi is the dominant tool in this field.)

Why does silencing one gene matter? Because genes don't operate in isolation. They form networks. Each gene's protein product can regulate other genes. Silencing gene X might:

Compensation

Gene Y goes up

Another gene activates to compensate for the loss of X's function. The cell tries to maintain homeostasis.

Disinhibition

Gene Z explodes

Gene X was suppressing Z. Remove X, and Z goes unbraked — potentially overactivating an entire pathway.

Cascade

50 genes shift

Z's protein activates A and B, which suppress C, which releases D... the ripple propagates across the network.

The Observable

The transcriptome snapshot

After all the cascades settle, you measure the new transcriptome. That's the "perturbation response" — the end state of the whole network reaction.

The Core Insight

The perturbation response reveals the hidden wiring of the cell. By watching what cascades when you knock out gene X, you learn what X was connected to, what it controlled, and how the network compensates. This is one of the primary ways we understand gene function.

03 — Motivation

Why Predict It?

Right now, running a perturbation experiment means: silencing a gene, growing cells for days, killing them to measure the transcriptome, analyzing the data. Per experiment: days of work, thousands of dollars, cells destroyed.

The scale problem is staggering:

| Experiment Type | Number of Possibilities | Status |
|---|---|---|
| Single-gene knockouts | ~20,000 | Partially covered by the Replogle dataset |
| Two-gene combinations | ~200,000,000 | Tiny fraction measured |
| Three-gene combinations | ~1,300,000,000,000 | Essentially unmeasured |
| Single gene × cell type × condition | Astronomical | Unmeasured |
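The scale numbers above are just binomial coefficients; a quick check:

```python
from math import comb

n_genes = 20_000
print(f"{comb(n_genes, 1):,}")   # 20,000 single knockouts
print(f"{comb(n_genes, 2):,}")   # 199,990,000 two-gene combinations
print(f"{comb(n_genes, 3):,}")   # 1,333,133,340,000 three-gene combinations
```

Measuring even the two-gene row exhaustively at, say, 100 cells per perturbation would require tens of billions of cells, which is why prediction is so attractive.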
The Dream

Train a model on a fraction of these experiments. Have it accurately predict the rest. A virtual cell you can query like a database — "what happens if I knock out genes X and Y together in a stem cell under oxidative stress?" — answered in milliseconds, for free, without destroying a single real cell.

This is the complete value proposition. Not replacing experiments, but making each experiment do the work of thousands — multiplying scientific reach by orders of magnitude.

04 — The Full Loop

Gene → RNA → Protein → Drug

This is the chain that connects all the concepts. Understanding it is key to understanding why the transcriptome is the right thing to measure and predict.

The Central Dogma

Information in a cell flows in one direction:

The Full Cell Loop: From DNA to Drug Effect
```mermaid
flowchart TD
    subgraph EXT["🌐 External Cell (Neighbor / Organ)"]
        GA["Gene in Cell A"] -->|transcription| RNAA["RNA"]
        RNAA -->|translation| SIG["Secreted Protein\n(hormone, cytokine,\ngrowth factor)"]
    end
    subgraph CELL["🔬 Your Cell"]
        direction TB
        REC["Receptor\non cell surface"] -->|"signal received"| TF
        subgraph DOGMA["Central Dogma"]
            DNA["Gene (DNA)"] -->|"transcription\n(CRISPRi blocks here)"| RNA["RNA\n(measured in scRNA-seq)"]
            RNA -->|"translation"| PROT["Protein"]
        end
        TF["Transcription Factor\n(a special protein)"] -->|"activates / suppresses"| DNA
        PROT -->|"if it's a TF"| TF
        DRUG["💊 Drug"] -.->|"binds & blocks\nthe protein"| PROT
    end
    SIG -->|"travels to\nCell B"| REC
    TRANSCRIPTOME["📊 Transcriptome\n= all RNA levels right now\nThis is what models predict"]
    RNA --- TRANSCRIPTOME
    style EXT fill:#f1efe9,stroke:#ccc9c0,color:#3a3631
    style CELL fill:#f1efe9,stroke:#ccc9c0,color:#3a3631
    style DOGMA fill:#f8f7f4,stroke:#e4e0d8,color:#3a3631
    style TRANSCRIPTOME fill:#dcfce7,stroke:#16803c,color:#14532d
    style DRUG fill:#fce7f3,stroke:#be185d,color:#9d174d
    style DNA fill:#dbeafe,stroke:#0369a1,color:#1e3a5f
    style RNA fill:#dcfce7,stroke:#16803c,color:#14532d
    style PROT fill:#ffedd5,stroke:#c2410c,color:#7c2d12
    style TF fill:#ede9fe,stroke:#6d28d9,color:#4c1d95
```

How Drugs Connect to Gene Expression

Most drugs work by binding to a protein — not a gene. Ibuprofen blocks COX proteins. Statins block cholesterol-synthesis enzymes. Cancer drugs block proteins that drive cell division.

But proteins regulate genes. Many proteins are transcription factors — their job is to bind DNA and switch genes on or off. So when a drug blocks protein X:

The Drug → Transcriptome Chain

Drug blocks protein X → X can no longer activate genes A, B, C (those go quiet) → X can no longer suppress genes D, E, F (those become active) → D and E's proteins now regulate more genes → cascade propagates...

The transcriptome after the drug is the full footprint of everything that happened. Predicting it means predicting the drug's complete downstream effect — including intended effects and side effects.

Why Signals From Other Cells Matter

Your liver cells are constantly receiving molecular messages from neighboring cells, the bloodstream, and distant organs — hormones, cytokines, growth factors. Each of these lands on a receptor, triggers a transcription factor, and reshapes your transcriptome.

This means the transcriptome is not just internally determined — it's a continuous conversation between the cell and its entire environment. This makes prediction harder: you need to know the cell's internal state and the context it's sitting in.

05 — Deep Biology

Cell Identity, Attractors & Why Cells Don't Randomly Drift

If the transcriptome is constantly fluctuating, why doesn't your liver cell randomly become a neuron? This question leads to one of the most important concepts in cell biology.

The Epigenetic Landscape (Waddington, 1957)

Think of all possible cell states as a landscape with hills and valleys. A liver cell sits in the "liver valley." Fluctuations happen constantly, but the valley's shape pulls it back. Getting to the "neuron valley" would require climbing over a huge hill.

[Figure: Waddington's Epigenetic Landscape. A pluripotent stem cell sits at the top; it rolls down into valleys such as "liver cell" ("you are here") and "neuron," separated by transition barriers. Cancer appears as a broken attractor, and Yamanaka factors (reprogramming) push a cell back up over the barrier.]

What Creates the Valleys — Gene Regulatory Networks

Liver cells stay liver cells because of self-reinforcing gene circuits. Master regulator genes (like FOXA2 for liver) activate each other and simultaneously suppress the master regulators of other cell types. It's a self-locking system.

Self-Reinforcing Identity Lock
```mermaid
graph TD
    FOXA2["FOXA2\n(liver master regulator)"] -->|activates| LIVER["Albumin, liver enzymes\n(liver identity genes)"]
    LIVER -->|feedback activates| FOXA2
    FOXA2 -->|suppresses| NEURO["Neuron master regulators"]
    FOXA2 -->|suppresses| MUSCLE["Muscle master regulators"]
    NEURO -->|would suppress| FOXA2
    MUSCLE -->|would suppress| FOXA2
    style FOXA2 fill:#dcfce7,stroke:#16803c,color:#14532d
    style LIVER fill:#dbeafe,stroke:#0369a1,color:#1e3a5f
    style NEURO fill:#f1efe9,stroke:#ccc9c0,color:#8c8780
    style MUSCLE fill:#f1efe9,stroke:#ccc9c0,color:#8c8780
```
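The self-locking behavior can be simulated with the classic two-gene toggle switch, a standard textbook model (an illustrative circuit, not FOXA2's real network). Two genes repress each other; the dynamics then have two stable attractors, small fluctuations decay, and only a large forced push flips the identity:

```python
import numpy as np

# Classic mutual-repression toggle switch: gene u represses gene v and vice
# versa. Hill-function repression plus degradation creates two attractors,
# "u high" and "v high". Parameters are illustrative, not measured.
def simulate(u, v, steps=5000, dt=0.01, alpha=10.0, n=2.0):
    for _ in range(steps):
        du = alpha / (1 + v**n) - u   # u is produced unless v represses it
        dv = alpha / (1 + u**n) - v   # v is produced unless u represses it
        u, v = u + du * dt, v + dv * dt
    return u, v

# Start in the "u high" state (think: liver identity).
u, v = simulate(9.0, 0.1)
print(round(u, 1), round(v, 1))        # settles at the u-high attractor

# A small fluctuation gets absorbed: the valley pulls the cell back.
u2, v2 = simulate(u - 1.0, v + 1.0)
print(round(u2, 1), round(v2, 1))      # same attractor as before

# A big forced push (Yamanaka-style overexpression) flips the attractor.
u3, v3 = simulate(0.1, 9.0)
print(round(u3, 1), round(v3, 1))      # now v-high: a different identity
```

The design choice worth noticing: nothing in the equations stores "identity" explicitly; the stable state is a property of the feedback loop, which is exactly the valley picture above.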

Three Layers of Protection Against Drift

Layer 1

Gene Regulatory Networks

Master regulators activate each other and suppress competing identity programs. Small fluctuations get absorbed and corrected.

Layer 2

Epigenetics

Neuron genes in a liver cell are physically wrapped tight around protein spools (histones). The transcription machinery literally can't reach them. This constrains the transcriptome's range of fluctuation.

Layer 3

Neighbor Signals

Surrounding liver cells constantly send "stay liver" signals. Remove a cell from its tissue context and identity starts to drift — this is why cells in a dish behave differently than in an organ.

When It Breaks

Cancer

Cancer is not a cell becoming another normal cell type — it's a cell whose regulatory network is corrupted enough to lock into a pathological attractor. One with no brakes on division.

Nobel Prize Insight — Yamanaka 2012

Shinya Yamanaka proved cell identity is not permanent. By forcing expression of just 4 genes (Oct4, Sox2, Klf4, c-Myc) in a skin cell, he kicked it over the valley wall into a pluripotent stem cell state — capable of becoming almost any cell type. This was shocking: it proved cell identity is a stable attractor, not a locked destiny.

Why This Matters for the Virtual Cell

The attractor landscape picture is exactly what models are trying to learn. Not just "what gene expression level does gene X land at after a perturbation" — but the shape of the landscape itself. Where are the valleys? How deep? What perturbations tip a cell from one to another?

This is why perturbation prediction is not just a regression problem. It's learning a dynamical system refined by billions of years of evolution.


06 — Data

Key Datasets — The "ImageNet" of This Field

Just as computer vision built on ImageNet, cell biology perturbation prediction has a few canonical datasets that everyone benchmarks on. These datasets are the shared currency of the field.

| Dataset | Technique | Scale | Significance |
|---|---|---|---|
| Adamson et al. 2016 | CRISPRi knockdown of ~87 genes during the ER stress response | ~65K cells | Small but widely used early baseline. One of the first at this scale. |
| Norman et al. 2019 | CRISPRa activation of 105 genes; 131 two-gene combinations | ~100K cells | First large-scale combinatorial perturbation dataset. Opened the question of multi-gene interaction prediction. |
| Replogle et al. 2022 | Genome-wide CRISPRi covering nearly 10,000 genes | ~2.5M cells | The current gold standard. Large enough to reveal that many "impressive" models were memorizing patterns. Changed what was possible to claim. |
Why Replogle Changed Everything

Before Replogle, models trained on small datasets could appear to work well — but were actually fitting to statistical noise. At genome-wide scale, these artifacts disappear and only genuine signal survives. Replogle is why the field knows deep learning doesn't yet decisively beat simple baselines: there's enough data to see through overfitting.

07 — Competition

The Virtual Cell Challenge 2025

Organized by the Arc Institute (a research institute partnered with Stanford, UCSF, and UC Berkeley), with winners announced at NeurIPS 2025. This is the field's flagship competition — the closest thing to a shared Turing test for virtual cells.

Participants

5,000+ people

Across 114 countries. 1,200+ teams submitted results. 300+ final submissions.

Prize

$250K total

$100K first place, $50K second, $25K third, $100K generalist prize.

Data

~300K cells

H1 human embryonic stem cells. 300 CRISPRi perturbations. Purpose-built for the challenge.

Link

Arc Institute


The Task — Why It's Hard

The challenge used H1 human embryonic stem cells — a cell type no model had ever seen during training. Competitors received the list of 300 genes to perturb and had to predict the resulting transcriptome for each.

The Hard Part: Generalization

A model trained on cancer cell lines (K562, etc.) must transfer its "understanding of biology" to stem cells — where the same gene plays a completely different role, embedded in a completely different regulatory network. It's like asking someone who learned chess to play Go. The pieces move differently. Does your model understand biology, or has it memorized cancer cell patterns?

Scoring Metrics

The challenge used three metrics to capture different aspects of prediction quality:

| Metric | What It Measures | Why It Matters |
|---|---|---|
| PDS — Perturbation Discrimination Score | Can you tell which gene was knocked out from the predicted expression profile alone? Are predictions specific enough to be distinguishable? | Tests whether the model captures perturbation-specific biology rather than a generic "cells look different when perturbed." |
| DES — Differential Expression Score | Do you identify the correct set of upregulated and downregulated genes? | Tests whether the model gets the direction of change right for the genes that matter. |
| MAE — Mean Absolute Error | Raw numeric accuracy across all ~20,000 genes. | Sounds like the right metric, but is deeply misleading (see the Metric Crisis section). |
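A toy discrimination score makes the PDS idea concrete (the official PDS definition differs in detail; this is an illustrative sketch on synthetic profiles):

```python
import numpy as np

rng = np.random.default_rng(0)
n_perts, n_genes = 50, 200

# Synthetic ground truth: each perturbation shifts a random handful of genes.
truth = np.zeros((n_perts, n_genes))
for i in range(n_perts):
    truth[i, rng.choice(n_genes, 10, replace=False)] = rng.normal(0, 2, 10)

def discrimination(pred, truth):
    """Toy discrimination score: for each predicted profile, rank every true
    profile by distance. 1.0 = the matching perturbation is always closest;
    0.5 = chance level."""
    d = np.linalg.norm(pred[:, None, :] - truth[None, :, :], axis=2)
    closer = (d < np.diag(d)[:, None]).sum(axis=1)   # wrong profiles beating the match
    return 1 - closer.mean() / (len(truth) - 1)

good_model = truth + rng.normal(0, 0.3, truth.shape)   # noisy but specific
lazy_model = np.zeros_like(truth)                      # predicts "nothing changes"

print(round(discrimination(good_model, truth), 2))  # near 1.0
print(round(discrimination(lazy_model, truth), 2))  # near 0.5, chance level
```

A model that predicts "nothing changes" collapses to chance here — exactly the failure that raw error metrics fail to punish.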
Why MAE Is Misleading

Of 20,000 genes, maybe 50–200 actually change after a perturbation. The other 19,800+ stay roughly the same. If your model predicts "nothing changes" for everything, your MAE is still very low — you're right about 19,800 genes. The model gets rewarded for ignoring the biology. This is why most models underperformed baselines on MAE, and why the Generalist Prize used 7 metrics instead.
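The arithmetic of that failure mode is easy to reproduce (synthetic numbers: 100 responsive genes out of 20,000, effect sizes invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 20_000

control = rng.normal(5.0, 1.0, n_genes)          # baseline expression
true_after = control.copy()
hits = rng.choice(n_genes, 100, replace=False)   # only ~100 genes actually respond
true_after[hits] += rng.normal(0, 3.0, 100)

# Model A captures the real biology, but with a little noise on every gene.
model_a = true_after + rng.normal(0, 0.5, n_genes)
# Model B lazily predicts "nothing changes".
model_b = control

mae = lambda pred: np.abs(pred - true_after).mean()
print(round(mae(model_a), 3))   # ~0.4
print(round(mae(model_b), 3))   # ~0.01 -- the lazy model "wins" on MAE
```

The lazy model is wrong about every gene that matters and still scores roughly 30x better, because the 19,900 unchanged genes dominate the average.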

08 — Results

Winners & What Worked

1
Team BM_xTVC — BioMap Research
$100,000 · Model: xTrimoSCPerturb

Hybrid deep learning + classical statistics. Combined an improved scFoundation transformer architecture with protein language model embeddings (so the model chemically "knows" what each protein does, not just its expression statistics), plus classical statistical features as explicit biological hints. Used pseudo-bulk aggregation to denoise signal before prediction.

2
Team XLearning Lab — Sichuan University
$50,000 · Model: X

Metric-driven strategy. Pseudo-bulk representation with protein embeddings. Heavily optimized for the specific scoring metrics rather than general biological accuracy.

3
Team Outlier — Multiple Institutions
$25,000 · Model: TransPert

Statistical framework using only summary-level data with similarity-aware aggregation. No deep learning at all — pure statistics.

Team Altos Labs — Generalist Prize
$100,000 · Model: go-with-the-flow

Flow matching generative model. Instead of predicting a single "answer," it predicts the full distribution of cell responses — capturing that different cells respond differently to the same perturbation. Pretrained on ~7M single cells from public datasets plus internal perturbation screens. Won the prize for most robust performance across all 7 evaluation metrics.

The Uncomfortable Takeaway

All top teams used pseudo-bulk aggregation — essentially averaging over cells before prediction as a denoising step. This is classical statistics saving neural networks from noise. The winning approach combined billion-parameter transformers with statistical cheat sheets. Pure end-to-end deep learning did not win.
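Pseudo-bulk aggregation itself is almost embarrassingly simple, which is part of the point. A sketch with synthetic cells (labels and effect sizes invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-cell matrix: 300 cells x 5 genes, three perturbation labels,
# heavy per-cell noise on top of a clean per-perturbation signal.
perts = np.repeat(["KO_A", "KO_B", "control"], 100)
signal = {"KO_A": [0, 3, 0, 0, 0], "KO_B": [0, 0, -2, 1, 0], "control": [0, 0, 0, 0, 0]}
cells = np.array([signal[p] for p in perts], dtype=float) + rng.normal(0, 2.0, (300, 5))

# Pseudo-bulk: average all cells sharing a perturbation label before any
# modeling. Per-gene noise shrinks by ~1/sqrt(n_cells); the signal survives.
pseudobulk = {p: cells[perts == p].mean(axis=0) for p in ["KO_A", "KO_B", "control"]}
print(np.round(pseudobulk["KO_A"], 1))  # close to the clean signal [0, 3, 0, 0, 0]
```

Averaging 100 cells cuts per-gene noise by a factor of 10, often the difference between a learnable signal and garbage.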

09 — Technical

The Four Dominant Technical Approaches

Approach 1

Foundation Models (the GPT analogy)

Treat genes like words. A cell's expression profile is a "sentence." Pretrain a transformer on millions of cells, then fine-tune for perturbation prediction.

Approach 2

Hybrid: Statistics + Deep Learning

Neural networks augmented with biological knowledge — protein embeddings, pseudo-bulk denoising, explicit gene feature engineering. The winning approach.

Approach 3

Generative / Flow Matching

Predicts the full distribution of cell states, not just the mean. Models cellular heterogeneity — different cells respond differently to the same knock-out.

Approach 4

Mechanistic / White-Box

Encodes known biological networks (KEGG pathways, protein interactions) as structure. Uses LLM reasoning to trace cascades step by step. Interpretable.

Approach 1: Foundation Models

Models: scGPT (pretrained on 33M cells, masked gene prediction like BERT), Geneformer (ranks genes by expression level, uses rank order as input to remove technical noise from absolute counts), scFoundation (continuous expression encoding + read-depth recovery task).

The bet: If the model "understands" gene expression deeply enough through pretraining, it can generalize to unseen perturbations and cell types.
The problem: Often can't beat a linear model on perturbation prediction specifically. The "language" analogy breaks down — genes don't have grammar, and the statistical structure of scRNA-seq data is different from text.
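Geneformer's rank-order trick can be illustrated in a few lines (a sketch of the idea, not Geneformer's actual tokenizer; the rates and depths are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# The same underlying cell "sequenced" at two depths: absolute counts
# disagree everywhere, but the rank ordering of genes is largely shared.
true_rates = rng.lognormal(1.0, 1.0, 1000)   # hypothetical expression rates
shallow = rng.poisson(true_rates)            # low read depth
deep = rng.poisson(true_rates * 20)          # 20x more reads

def rank_encode(counts, k=50):
    """Rank-order idea: represent a cell by its genes sorted from most to
    least expressed, discarding the absolute counts."""
    return np.argsort(-counts, kind="stable")[:k]

overlap = len(set(rank_encode(shallow)) & set(rank_encode(deep)))
print(overlap)  # most of the top-50 genes agree despite the 20x depth gap
```

Raw counts differ by roughly 20x between the two measurements, yet the top of the ranking is largely stable; that gap between absolute values and relative order is the technical noise the rank encoding discards.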

Approach 2: Hybrid (the winner)

The key innovation is not replacing deep learning but augmenting it with structured biological knowledge: protein language model embeddings (so the model knows what each protein does chemically), classical statistical features as explicit biological hints, and pseudo-bulk aggregation to denoise the signal before prediction.

Approach 3: Generative / Flow Matching

Traditional models predict: "after knocking out gene X, gene Y will be at level 4.2." But that ignores a biological reality: different cells respond differently. The population of perturbed cells has a distribution, not a single point.

Flow matching learns to transform one distribution into another. Input: the distribution of unperturbed cells. Output: the distribution of perturbed cells. This is related to diffusion models but faster and more stable to train.
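Stripped to one gene and the simplest possible velocity model, the flow-matching recipe looks like this (a pedagogical sketch with an invented shift; real models train a neural network v(x_t, t)):

```python
import numpy as np

rng = np.random.default_rng(0)

# One gene, toy numbers: the perturbed population is a distribution, not a point.
x0 = rng.normal(0.0, 1.0, 5000)   # control expression
x1 = rng.normal(3.0, 1.0, 5000)   # after knock-out (assumed shift)

# Flow matching training data: pick random pairs and times t, form the
# interpolant x_t = (1-t)*x0 + t*x1; the regression target for the velocity
# field at (x_t, t) is simply x1 - x0.
t = rng.uniform(0, 1, 5000)
x_t = (1 - t) * x0 + t * x1       # what a real model would take as input
v_target = x1 - x0

# Tiniest possible "model": a constant velocity, i.e. the mean displacement.
# (It ignores x_t and t entirely; a neural net would condition on both.)
v_hat = v_target.mean()

# Sampling = integrating the learned velocity from t=0 to t=1.
samples = x0 + v_hat * 1.0
print(round(samples.mean(), 1))   # control mean transported onto ~3.0
```

A constant velocity can only move the mean; the point of learning v(x_t, t) as a function is that it can also reshape the spread and multimodality of the response distribution.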

Approach 4: Mechanistic (VCWorld)

Instead of learning correlations from data, VCWorld encodes known biology — signaling pathways, protein interaction networks — as explicit structure, then uses an LLM to reason through the cascade step by step. Each prediction comes with a traceable mechanistic explanation: "Gene X was silenced → TF Y lost its activator → Genes A, B, C went down → which caused..." — verifiable against known biology.

Trade-off: More interpretable and generalizes better to out-of-distribution cases, but limited by what's already known biologically. Can't discover truly novel mechanisms.

10 — The Problem

The Metric Crisis

This is the central intellectual drama of the field right now. The debate has evolved into a precise diagnosis — which is itself progress.

Act 1 — Nature Methods, 2025

"Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines" — Seven foundation models (scGPT, scFoundation, Geneformer, etc.) benchmarked against simple baselines. None beat linear regression or even just predicting the training mean. Widely circulated. Caused alarm.

Act 2 — bioRxiv, October 2025

Rebuttal: "Deep Learning-Based Genetic Perturbation Models Do Outperform Uninformative Baselines on Well-Calibrated Metrics" — Deep learning does outperform baselines, but only with better metrics. The commonly used metrics are broken:

| Broken Metric | Why It's Broken | What It Rewards |
|---|---|---|
| MSE (Mean Squared Error) | Dominated by the ~19,800 genes that barely change. A model predicting "nothing changes" gets low MSE. | Predicting the mean. Ignoring perturbation signal. |
| Pearson(Δctrl) — correlation of change-from-control | When the control population has systematic bias, predicting "no change from control" still scores well. | Mode collapse — every perturbation predicted identically. |

The rebuttal paper introduces the Dynamic Range Fraction — a calibration test that measures whether a metric can even distinguish a perfect model from a useless one. Conventional metrics often score near zero on this test.

Act 3 — Systema, Nature Biotechnology, 2025

Systema identifies the root cause: perturbation experiments mix two signals that current metrics can't separate:

Signal 1 — What We Want

Perturbation-specific signal

The actual biological effect of knocking out gene X. What we want models to learn and predict.

Signal 2 — The Problem

Systematic variation

Batch effects, selection bias, technical confounders that make all perturbed cells look different from controls — regardless of which gene was perturbed.

Most of what looks like model "performance" is actually learning Signal 2 — systematic variation. The model learned "perturbed cells look different from controls in this dataset," not "this specific gene knock-out causes this specific cascade."

Systema strips out systematic variation before scoring. Available at github.com/mlbio-epfl/systema.
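A small simulation shows why stripping shared variation matters (synthetic numbers; the mean-centering here is only the crudest stand-in for Systema's actual correction):

```python
import numpy as np

rng = np.random.default_rng(0)
n_perts, n_genes = 40, 500

# Simulated "measured effects" (perturbed minus control, per perturbation):
# a shared batch shift (Signal 2) hits every perturbation, on top of a
# smaller perturbation-specific effect (Signal 1) that we actually want.
batch_shift = rng.normal(0, 1.0, n_genes)
specific = rng.normal(0, 0.3, (n_perts, n_genes))
measured = specific + batch_shift

# A model that learned ONLY the shared shift still correlates strongly...
lazy_pred = batch_shift + rng.normal(0, 0.05, (n_perts, n_genes))
corr = lambda a, b: np.corrcoef(a.ravel(), b.ravel())[0, 1]
print(round(corr(lazy_pred, measured), 2))   # high -- looks like skill

# ...until the component shared across perturbations is removed before scoring.
strip = lambda m: m - m.mean(axis=0)
print(round(corr(strip(lazy_pred), strip(measured)), 2))  # ~0 -- no skill left
```

Almost all of the apparent correlation came from Signal 2; once it is removed, the "model" is exposed as having learned nothing perturbation-specific.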

Where the Field Is

Old narrative: "Deep learning is useless for perturbation prediction."
Current narrative: "Deep learning is probably useful — but we can't tell yet because our measuring tools are broken." The metric debate is turning a vague failure into a specific, fixable problem. That's real progress.

Three Competing Evaluation Frameworks

No consensus exists yet on which is canonical:

- The Arc Challenge metrics (PDS, DES, MAE): practical and widely adopted, but carrying MAE's known blind spot.
- The calibrated metrics of the October 2025 rebuttal, validated by the Dynamic Range Fraction test.
- Systema's confound-corrected evaluation, which strips systematic variation before scoring.


11 — Landscape

SOTA Models Landscape 2025–2026

| Model | Origin | Approach | Key Claim |
|---|---|---|---|
| VCWorld | Shanghai Jiao Tong + NeoLife AI, Nov 2025 | Mechanistic white-box + LLM reasoning | SOTA on drug perturbation benchmarks, with step-by-step interpretable predictions and verifiable hypotheses |
| C2S-Scale | Google, 2025 | Cell2Sentence scaled to 27B parameters | Treats scRNA-seq as text. Outperforms GPT-4o on single-cell biology QA. Major step toward "virtual cells as model systems." |
| CellForge | 2025 | Agentic: AI agents design and iterate cell models | First attempt at autonomous virtual cell model construction. Very early stage but conceptually significant. |
| OCTO | Noetik, 2025 | 1.5B-parameter multimodal spatial model | Trained on a proprietary multimodal spatial dataset. Commercially deployed for precision oncology. Bridges spatial biology and gene expression. |
| Alpha Cell | SciLifeLab Sweden, early 2026 | Foundation model for stem cell behavior | New program directly inspired by AlphaFold. Predicting stem cell behavior with ML. Just launched. |
12 — Context

The AlphaFold Comparison

Everyone in this field invokes AlphaFold. Understanding why requires understanding what AlphaFold actually did — and why the virtual cell problem is harder.

AlphaFold (DeepMind, 2020) solved protein structure prediction: given an amino acid sequence, predict the 3D shape of the folded protein. The problem had been open for 50 years. AlphaFold reached near-experimental accuracy, and the impact was transformative: drug discovery, disease understanding, and basic science all accelerated overnight.

| | AlphaFold | Virtual Cell |
|---|---|---|
| Core question | What 3D shape does this protein fold into? | What does a cell do when you perturb it? |
| Ground truth | Clean — a single deterministic 3D structure | Noisy — a stochastic distribution across a heterogeneous population |
| Variables | One molecule, hundreds of amino acids | ~20,000 interacting genes, embedded in tissue context |
| Training signal | Tens of thousands of clean protein structures from the PDB | Noisy single-cell measurements that destroy the cell being measured |
| Generalization | Sequence → structure is consistent across organisms | The same gene does different things in different cell types |
| Status | Solved (2020) | Unsolved (2026) |
The Difference in Kind

Protein folding has a single right answer per sequence. A cell's response to a perturbation depends on which cell type, what other genes are active, what signals it's receiving from neighbors, what time in the cell cycle it's at, and dozens of other contextual factors. The virtual cell problem is harder in almost every dimension — more variables, noisier measurements, less consistent ground truth, more context-dependence.

13 — Applications

Why Any of This Matters

Drug Discovery

Screen millions of targets cheaply

Most drugs fail in clinical trials because we didn't understand the biology. If you can simulate "what happens when you inhibit gene X in a cancer cell" without running the experiment, you can screen millions of drug targets computationally before synthesizing a single molecule.

Disease Understanding

Model what goes wrong

Most diseases involve changes in gene regulation — oncogenes overactive, tumor suppressors silenced, immune genes misfiring. A virtual cell lets you model the diseased state computationally and find where to intervene.

Personalized Medicine

Patient-specific digital twins

Take a biopsy from a patient, build a virtual model of their specific cells, simulate 50 different treatments, pick the one that works for their tumor's specific genetic profile. Not average patients — this specific person.

Replacing Animal Testing

Reduce preclinical experiments

If cellular responses can be simulated accurately, a large fraction of animal experiments during drug development become redundant. Faster, cheaper, more ethical.

14 — Honest Assessment

Where the Field Actually Is

The dream is clear. The tools are immature. The metrics are disputed. Simple baselines are embarrassingly competitive with models containing a billion parameters trained on millions of cells.

But: the data infrastructure (Perturb-seq at genome scale), the competition frameworks (Arc Challenge with $250K prizes and 5,000 participants), and the research output (multiple Nature papers in 2025 alone) are all signs of a field that is accelerating fast.

The Central Question

Can we build AlphaFold for cell biology?

The answer so far: "Not yet — but we're getting better at knowing exactly why not." The metric crisis revealed that we were measuring the wrong things. Systema revealed systematic confounds in the data. The challenge revealed that hybrid approaches beat pure neural networks.

In science, turning a vague failure into a specific, diagnosable problem is real progress. The field knows more precisely what it doesn't know — which is how breakthroughs happen.