Virtual Cell literature dump

[ single_cell  grn  stat_ml  ]

There is a lot of new work coming out relevant to predicting transcriptomic responses to perturbation. I’m way behind on reading. Since things are heating up, and I got a job doing something totally different, I will remain behind on reading for the foreseeable future. I’m sorry that I can’t go into my usual level of detail on all this newer work, but here’s my attempt to provide a curated and annotated bibliography. Some highlights include:

  • CellFlow has outstanding demos on the Saunders whole-embryo knockout zebrafish data.
  • GEM-1 from Synthesize Bio seems to have strong performance on held-out genetic and chemical perturbations.
  • There is a new competition at Broad.

New-to-me applications of perturbation prediction

  • There’s a new adipocyte fate challenge at Broad aiming to make a dent in obesity. It runs December 2025 through April 2026.
  • Unravel Bio, a Michael Levin spinout, outlines nasal swab RNA collection, network biology, and rapid transgenic frog creation for bespoke research in rare diseases. “We use the unique RNA signature of the patient to build a personalized in silico gene network – without requiring a diagnosis. These systems biology-informed networks leverage our proprietary model of human health which we developed by processing vast sets of real RNA data, unbiased by assumptions and interpretations…Based on the unique genetic signatures of disease within a patient, we can pinpoint the molecular target and match it to an existing drug.” I would not have expected this to work, but they have a pipeline, so we will see.

Commentary on what strategies will work

  • The Arc Virtual Cell challenge has wrapped up (recap). “[H]ybrid models combining deep learning with statistical features outperformed pure neural networks, and strategic loss function design mattered as much as architecture. Multi-modal features, particularly protein embeddings, added value across top teams. The evaluation sparked valuable discussion about metric design. No single metric captures ‘model quality’, and we observed clear trade-offs where optimizing one metric sometimes came at the cost of others. We are studying this year’s submissions to inform future metric refinements, potentially including norm-matched PDS variants and new biological relevance criteria.”
  • Xinru Qiu has published an intro to perturbation prediction models.
  • The folks at Relation (featuring Jake Taylor-King, substack and causal inference wizard Caroline Uhler) have put out some commentary: “[T]he dream of ‘foundation models’ in regulatory biology, those capable of robust and generalisable predictions, will remain elusive unless grounded in a biologically informed, semi-mechanistic framework.” A key point is that many experiments involve multiple stimuli, and the order matters heavily: “For example, editing out genes that prevent cellular differentiation would have no effect if the target cells have been exposed to differentiation-inducing media prior to the perturbation.” They recommend adding specific negative controls to evaluation data, but on a shallow first pass, I don’t see how that recommendation differs from common practice.

    I appreciate the broader point that dynamic views of cellular perturbation uncover some baffling points. My fave example is that the Yamanaka factors don’t actually reprogram a large majority of cells, and iPSC reprogramming leans heavily on the ability of iPSCs to self-renew and outcompete the garbage byproducts. Look at day 12-ish in Fig. 2d here or this review or Fig. 2 here.

  • Therence Bois from Valence Labs posts a comparison of STATE vs TxPert. In this post he also announces ambitions towards multimodal data (“multimodal measurements, phenomics, transcriptomics, proteomics, metabolomics”).
  • Claudia Chu and Harlan Stevens have a retrospective on the Arc Virtual Cell competition.
  • Giovanni Marco Dall’Olio has a retrospective on the Arc competition.
  • Giovanni Palla writes forward-looking commentary about virtual cell evals.
  • SciComm giant Eric Topol has an interview with Charlotte Bunne and Stephen Quake focused on virtual cell ambitions.
  • Abhi Mahajan (Owlposting) has several new write-ups about making virtual cell models useful in his new role at Noetik. Here’s part 1, with links therein.
    • Part 1: they can stratify patients into responders and non-responders with vastly improved statistical efficiency, because the relevant axes pop out of unsupervised analysis. “[V]alue here doesn’t come from the usual virtual cell trick of simulating perturbations, but instead from the far simpler act of representation.”
    • Part 2: they can broaden a clinical trial inclusion criterion. “[A] particular ‘tumor microenvironment concept’ … seems highly enriched in [responder] Subgroup Z, but also extends outside of it. … we believe that it is unlikely to be noise given how biologically relevant it is to the therapy in question.”
    • Part 3: Bona fide novel target selection via in silico expression screening followed by in silico perturbation.
  • Rood, Hupalowska, and Regev published a wide-ranging and thorough review in Aug 2024.
  • Han Chen et al. argue that observational data reveals “worker” genes and perturbation data reveals “supervisor” genes. This work analyzes 500 quantitative traits in yeast (h/t Anne Carpenter).
  • Ajay Nadig et al. run some experiments on training data composition for deep learning models. Combined with the GEM-1 results (see below), this reinforces the idea that extrapolating to novel cell types is not possible.
  • PRESAGE, from Genentech, finds that “Knowledge source selection is more critical for predictive performance than architectural complexity, with cross-system Perturb-seq data providing particularly strong predictive power. We also find that performance saturates quickly with training set size, suggesting that experimental design strategies might benefit from collecting sparse perturbation data across multiple biological systems rather than exhaustive profiling of individual systems.” The compressive sensing idea is a classic Aviv Regev beat and if I had to guess, Genentech must already be using it internally to 100x or 1000x their effective read count for mass-produced perturb-seq.

Paperzzzzz (on evals)

  • Qiyuan Liu et al. explain that retrieval scores behave very differently depending on the underlying distance metric used (preprint). “How can the very same set of predictions appear almost random under L1/L2 distances yet become highly discriminative when evaluated with cosine-based measures? … PDS defined using L1/L2 distance is highly sensitive to the scale of predicted effects, whereas cosine-based versions are not, leading to interactions between metric choice and scale that are not always intuitive… [M]ultiplying all predicted effects by a constant can substantially change the resulting PDS values, even though scaling leaves all directional information unchanged.” The magnitude of the perturbation effect is a hugely important quantity in practice, though; see below.
  • In many cases, cell count has much of the information present in higher-dimensional readouts. LinkedIn, preprint.
    • “[B]iased benchmark datasets have a high proportion of cell viability assays; a fairly easy-to-predict assay endpoint that should not be overemphasized in benchmarking studies”
    • “[M]any assays in benchmark datasets are specific to targets but the active chemicals in the dataset also impact cell viability and the actual readout is confounded by effects like cytotoxicity burst”
    • “[T]here is an absence of baseline models trained only on cell counts that can demonstrate the comparative advantage of a model trained on more complex phenotypic profiles.”
  • Jeffry Zhong et al. have a Nov 2025 update to their work on embeddings. The initial version of this was the first place AFAIK to show the value of now oft-used ESM2 gene embeddings for predicting perturbation outcomes or genetic interactions.
  • Systema evaluated virtual cell models using the gene expression log fold change over a training-data-mean baseline predictor, rather than their ability to predict log fold change over controls or just plain expression. I mentioned it already and I think it’s a valuable approach with justification even beyond what they explicitly describe. They find better performance of scGPT relative to baselines when using this new metric. I mentioned already that I think there could be a separate problem that could reverse their main conclusions – but also, it might not, and their conclusions might stand up just fine.
  • Diversity by Design and Deep Learning-Based Genetic Perturbation Models Do Outperform Uninformative Baselines on Well-Calibrated Metrics claim that, well, see title. A major contribution here is that they check many eval metrics based on the metrics’ ability to associate replicates and/or functionally related genes. This is a useful check. I mentioned already that I think these have a problem that could reverse their main conclusions – but also, it might not, and their conclusions might stand up just fine.
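
To make the scale-sensitivity point from Qiyuan Liu et al. concrete, here is a minimal sketch with made-up numbers. The `retrieval_score` function is a simplified rank-based score in the spirit of PDS, not the official Arc definition, and every name and value below is hypothetical. The predictions point in exactly the right direction for every perturbation but are shrunk by a constant factor.

```python
import math

def l1(a, b):
    """L1 (Manhattan) distance between two effect vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieval_score(pred, true, dist):
    """Mean normalized rank of the matching true effect (1.0 = always closest).

    Simplified stand-in for a PDS-style metric, assumed for illustration.
    """
    n = len(pred)
    total = 0.0
    for i in range(n):
        d_own = dist(pred[i], true[i])
        # Count other perturbations whose true effect is closer to this prediction.
        rank = sum(1 for j in range(n) if j != i and dist(pred[i], true[j]) < d_own)
        total += 1.0 - rank / (n - 1)
    return total / n

# Toy true effects (perturbations x genes) with very different magnitudes.
true_effects = [
    [1.0, 0.2, 0.0],
    [0.0, 5.0, 1.0],
    [2.0, 0.0, 10.0],
]
# Predictions with perfect direction but 10x too small.
shrunk = [[0.1 * x for x in row] for row in true_effects]

score_l1_exact = retrieval_score(true_effects, true_effects, l1)
score_l1_shrunk = retrieval_score(shrunk, true_effects, l1)
score_cos_shrunk = retrieval_score(shrunk, true_effects, cosine_distance)

print(score_l1_exact)    # 1.0
print(score_l1_shrunk)   # 0.5
print(score_cos_shrunk)  # 1.0
```

Under L1, every shrunken prediction sits closest to whichever true effect is smallest, so retrieval collapses even though the directions are perfect. Under cosine distance the shrinkage is invisible, which is also why a cosine-based score says nothing about whether predicted magnitudes are right.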

Paperzzzzz (static methods)

  • Anne Carpenter has been doing perturbation prediction with high-dimensional readouts and high-throughput screens for a lawwnnggg time. She lists relevant work.
  • scLDM is a new generative model for scRNA data that allows latent-space arithmetic. It outperforms STATE, CPA, and scGPT in a cell-type-transfer task (training data includes the test-set perturbations and the test-set cell types, but separately).
  • scAgents and CellForge, which seem to be two different names for the same project (??) use an LLM to run a whole-ass virtual cell competition under the hood, then spit out the best performing model. PEREGGRN and projects like it seemed important at the time, but they are not creative and they do not provide any sort of “moat” (protection from competition). Now it appears they have been fully automated. This study does not present the usual dumb baselines and it does not condition on targeted genes, so I wouldn’t take the results at face value, but those seem like minor fixes. This type of automation seems extremely useful as we try to scale up general-purpose virtual cell models and automatically customize them for deployment in new contexts.
  • Synthesize Bio (a Jeff Leek spinout) has a deep latent variable model “GEM-1” trained on Recount3-ish data. See their blawg post and preprint. Across three test sets containing 86 unseen genetic perturbations, 47 unseen chemicals, and 142 unseen cell lines, their model predicts outcomes with high within-gene correlation. (Jeff Leek is a statistician, so instead of using within-sample correlations, which are a weird genomics thing that behaves deceptively, he uses the usual thing that has been used to evaluate predictions for over a century and behaves the way you would want.) There’s one big caveat here: the test data contain all relevant experiments with data uploaded to SRA from July 1, 2024 through September 30, 2024, so there should be massive technical effects. GEM-1 seems to be an excellent model for technical effects, so how much of its strength on the held-out data comes from modeling perturbations well, and how much comes from modeling technical effects well? I would love to see either an ablation study omitting the perturbation latents or an eval showing within-heldout-study $r_{gene}$.
  • C2S-Scale, from the Cell2Sentence and Gemma families (blog, preprint), discovers an interaction effect using interferon + silmitasertib to increase antigen presentation far more than either stimulus alone. They also beat out several other foundation models when predicting new (cell type + perturbation combos) in CMAP and in a cytokine response dictionary (the Mingze Dong et al. CINEMA-OT data).
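
The within-gene versus within-sample distinction in the GEM-1 item is easy to see on a toy matrix. The sketch below uses made-up numbers and hypothetical names; it is not GEM-1's evaluation code. Genes have very different baseline levels, and the hypothetical predictor gets every baseline right while getting the sample-to-sample changes exactly backwards.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Rows are samples, columns are genes. Gene baselines are 1, 5, 10, 20.
observed = [
    [2.0, 4.0, 11.0, 19.0],
    [0.0, 6.0, 9.0, 21.0],
    [2.0, 6.0, 11.0, 21.0],
    [0.0, 4.0, 9.0, 19.0],
]
# Predictions: correct baselines, but each deviation has the wrong sign.
predicted = [
    [0.5, 5.5, 9.5, 20.5],
    [1.5, 4.5, 10.5, 19.5],
    [0.5, 4.5, 9.5, 19.5],
    [1.5, 5.5, 10.5, 20.5],
]

n_samples, n_genes = len(observed), len(observed[0])

# Within-sample: correlate across genes, one value per sample.
within_sample = [pearson(observed[s], predicted[s]) for s in range(n_samples)]

# Within-gene: correlate across samples, one value per gene.
within_gene = [
    pearson([observed[s][g] for s in range(n_samples)],
            [predicted[s][g] for s in range(n_samples)])
    for g in range(n_genes)
]

mean_within_sample = sum(within_sample) / n_samples
mean_within_gene = sum(within_gene) / n_genes
print(round(mean_within_sample, 3), round(mean_within_gene, 3))
```

With these numbers the mean within-sample correlation is above 0.95, because it mostly rewards knowing which genes are highly expressed, while the mean within-gene correlation is -1. Only the within-gene view asks whether the model tracks how each gene changes from sample to sample.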

Resources

  • PertPy provides a standardized way to access data.

Paperzzzzz (dynamic models)

The following tools join CellOracle, PRESCIENT, RNAForecaster, Dictys, and other dynamic models.

  • Cflows uses neural ODEs to map trajectories and predict perturbations.
  • CellFlow uses neural ODEs to map trajectories and predict perturbations. Modeling is done in a latent space learned by PCA or a VAE. Model fitting uses optimal transport to match up the control and perturbed samples (which cell type is which). The “before” and “after” perturbation states are connected with straight lines, and the neural ODEs are trained to spit out RNA velocity compatible with these straight lines (this is called “flow matching”). Mappings from RNA state to RNA velocity can include side information such as ESM2 embeddings for cytokines or perturbed TFs (these contain a lot of useful info). This model works on whatever species. Results on the Saunders whole-embryo genetic perturbation data (which is in zebrafish) and several other demos seem outstanding. This method is intended to be trained on datasets with many cell types and many genetic perturbations observed. Their remarks on scaling:
    • “We were able to observe a clear scaling relationship between CellFlow’s performance and the number of seen conditions… [P]rediction accuracy for IFN-omega strongly depends on the presence of IFN-beta in the training data, which is consistent with the similar cellular responses elicited by these two cytokines. Similarly, the majority of identified train cytokines which positively influence CellFlow’s performance could be explained by a high similarity to the test cytokine.”
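
The straight-line construction described for CellFlow can be sketched in a few lines. This is a toy illustration under loud assumptions, not CellFlow's code or API: the latent space is 2D, the optimal transport step is replaced by a brute-force minimum-cost one-to-one pairing that only works for a handful of cells, and no neural network is trained. The sketch only builds the regression targets a velocity model would be fit to. All function names are hypothetical.

```python
import itertools
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def match_cells(controls, perturbed):
    """Brute-force minimum-cost pairing (tiny stand-in for optimal transport)."""
    best = min(
        itertools.permutations(range(len(perturbed))),
        key=lambda perm: sum(sq_dist(c, perturbed[j]) for c, j in zip(controls, perm)),
    )
    return [(c, perturbed[j]) for c, j in zip(controls, best)]

def flow_matching_examples(pairs, n_per_pair, rng):
    """Build (t, x_t, target_velocity) training triples along straight lines."""
    examples = []
    for x0, x1 in pairs:
        # A straight line from x0 to x1 has constant velocity x1 - x0.
        velocity = [b - a for a, b in zip(x0, x1)]
        for _ in range(n_per_pair):
            t = rng.random()
            x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
            examples.append((t, x_t, velocity))
    return examples

# Two control cell types far apart; the perturbation shifts both upward.
controls = [[0.0, 0.0], [10.0, 0.0]]
perturbed = [[10.0, 2.0], [0.0, 2.0]]  # deliberately listed in swapped order

pairs = match_cells(controls, perturbed)
examples = flow_matching_examples(pairs, n_per_pair=3, rng=random.Random(0))

for x0, x1 in pairs:
    print(x0, "->", x1)
```

The matching step pairs each control cell with the perturbed cell of the same type, so both target velocities are `[0.0, 2.0]` rather than long diagonal jumps across cell types. A real model would regress a network `v(x_t, t, condition)` onto these targets and then integrate it from control cells to predict perturbed states.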

Stuff I have not gotten to yet

I am so sorry. It’s not even triaged.

  • Scape https://www.biorxiv.org/content/10.1101/2025.09.08.674873v1
  • Prophet from Theis
  • Noutahi et al. https://arxiv.org/abs/2505.14613
  • Salt & Peper from the Uhler group https://arxiv.org/abs/2404.16907
  • the other PerturbNet, Welch not McCarter
  • GSNN https://pubmed.ncbi.nlm.nih.gov/38464019/
  • MintFlow https://x.com/mo_lotfollahi/status/1950571770529903081
  • Disentanglement https://www.nature.com/articles/s41467-025-62008-1
  • Zitnik group “Combinatorial prediction of therapeutic perturbations using causally inspired neural networks” https://www.nature.com/articles/s41551-025-01481-x https://www.noetik.blog/p/how-do-you-use-a-virtual-cell-to
Written on January 1, 2026