Methods for expression forecasting under novel genetic perturbations

[ grn ]

Perspective: bioinformatics boom and bust

Forecasting expression in response to genetic perturbations is potentially extremely valuable for drug target discovery, stem cell protocol optimization, or developmental genetics. More and more methods papers are showing promising results. Let’s take a look.

If you get bored in the middle of the list, you can skip to the end for a spicy take on epistemic standards in bioinformatics. 🌶️

UPDATE 2025 July: I am retconning this post into a series on virtual cell modeling.

Episode 1: methods circa mid-2024 (this post)
Episode 2: benchmarks circa early 2025
Episode 3: a broader look at genomic foundation models
Episode 4: new developments circa June 2025

First I’ll recap methods that require some perturbations in the training data.

DCD-FG estimates a causal structural model where each variable is a possibly-nonlinear function of its parents. The twist is the “factor graph” structure: there’s a bottleneck where each regulator feeds into a limited number of latent master regulators, which in turn propagate effects downstream. Smooth penalties are applied to make the structure sparse and acyclic (as in NO-TEARS). Mathematical guarantees accompanying the paper show that less data is needed: even in situations where many causal DAGs fit the data, often only one of them is a factor graph. The main purpose of DCD-FG is network structure recovery, but predicting counterfactuals is another core function of causal inference, and the original paper has intriguing demos predicting unseen genetic perturbations. A lot of love and care went into this method, and I’m a big fan of it.
NODAGS-Flow uses normalizing flows, which are nonlinear dynamical models guaranteed to converge to a fixed steady state. This allows them to estimate a cyclic causal network without fear of explosive long-term implications. They show strong results predicting held-out interventions. These results are on a 61-gene subset, which is far fewer genes than the DCD-FG or CellOracle demos and indeed fewer genes than are perturbed in the melanoma data. I’m not so familiar with this method yet, and given the small example, it’s not clear whether it will scale to the full transcriptome.
Bicycle uses a cyclic causal graph powered by a linear stochastic differential equation model. A key empirical result is state of the art performance on two of the three melanoma datasets used for demos in the DCD-FG and NODAGS-Flow papers. Like NODAGS-Flow, these results are on a 61-gene subset.
GEARS is a method for predicting genetic interactions from individual genetic effects, but it also shows highly promising results for predicting responses to novel genetic perturbations. GEARS dispenses entirely with the concept of learning an explicit causal structure. Instead, it starts from per-gene embeddings, combining them and decoding them into post-perturbation gene expression. Genes embeddings are combined along a graph structure derived from the Gene Ontology, with the core insight being that functionally similar genes should have similar perturbation outcomes. GEARS showed excellent results on some unusually thorough comparisons, and it has clearly inspired a large amount of related work.
CODEX is inspired by GEARS and also predicts unseen perturbations by combining, then decoding, latent representations of functionally similar genes. They also have a nice big list of comparisons, including GEARS and including a version of CODEX with only linear effects. I’m not so familiar with this method yet but I’m really impressed with how much is packed into the paper – maybe that’s the benefit of a fairly simple network architecture.
scFoundation and related xTrimoGene are general-purpose foundation models. scFoundation is pretrained on transcriptomes from over 50 million human cells, including tumor samples. Their pretraining data draws widely from GEO, but appears to not contain any perturb-seq data. Their perturbation prediction demo repurposes GEARS, using cell-specific embeddings for a step near the end involving a gene coexpression graph.
AIDO-cell is a general-purpose foundation model pretrained on transcriptomes from 50 million cells. It follows scFoundation and GEARS with one perturbation predicion demo on the Norman genetic interaction data.
GeneFormer is a general-purpose foundation model pretrained on transcriptomes from 30 million primary human cells. (The pretraining data does not include any cell lines or perturb-seq data. There is also a recent update with a bigger model and more pretraining data, but I am only familiar with the 2023 version.) Among many other demos, GeneFormer identifies candidate drug targets for cardiomyopathy treatment. This is a descendent of coexpression analysis methods like GENIE3: after mining correlations in the training data, the model is expected to encode knowledge of context-specific network dynamics. Users can take single cell transcriptomes and increase or decrease some candidate genes’ expression. Then, latent cell embeddings from GeneFormer will shift in a way that is informative about downstream effects on the transcriptome.
scGPT is a general-purpose foundation model pretrained on transcriptomes from over 33 million human cells. Among many other demos, scGPT predicts responses to genetic perturbation. The approach is distinct from GEARS, CODEX, or GeneFormer: unlike GEARS or CODEX, there is no use of the Gene Ontology, and unlike GeneFormer, perturbation is not enacted by literally cranking the dosage of the affected gene up or down. Rather, there is a “condition embedding” added to every single position in the input embeddings, all the time, no matter what. This condition embedding can be set to one value for “knocked down” and another for “wild type”. It functions like a latent embedding for the word NOT in a language model: somehow, the model learns that “NOT” modifies nearby words (nearby gene embeddings), and “NOT” does NOT always mean the same thing.
EDIT 2024 Oct 15: People also obtain informative gene embeddings by describing the gene to a language model. These language-based embeddings can be decoded to predict gene expression effects. Here is an example that beats GEARS and a mean baseline on a couple of the recent large-scale perturb-seqs in K562 and RPE-1.
EDIT 2024 Dec 14: more models using natural-language gene embeddings: scLAMBDA, GenePT, and scGenePT
A recent benchmark study includes a peculiar linear baseline. It predicts perturbation-induced log fold change after perturbing gene p as t(G)[p,:]WG, where G are gene embeddings from 10-dimensional PCA on the training data and W is learned via ridge regression. To be clear about the dimensions: t(G)[p,:] is 1x10, W is 10x10, and G is 10 by the number of genes. In their experiments, this baseline beats GEARS, scGPT, and scFoundation almost uniformly when predicting genetic interactions on the Norman data or predicting novel perturbation outcomes on other perturb-seq datasets. This is a very troubling finding and it deserves plenty of attention.

There are also methods that do not require perturbations in the training data. These usually are trained on some type of time-series data.

CellOracle seems to have inspired a lot of people. No individual component of it is new. The causal structure is from pairing transcription factor binding motifs with target genes (GimmeMotifs and Cicero). The regression models are just linear. The plots are from Velocyto. But it’s a sensible pipeline to try, and when they nominate hypotheses, the fraction confirmed by their literature review is astounding (especially Fig 1i). If this is predictive of future results, then CellOracle really will live up to the “Oracle” in the name, and it will be an extremely important tool for small labs to prioritize expensive, time-consuming experiments such as knockout mice. CellOracle or similar linear structural equation models are frequently included as comparisons in other studies.
Dictys is like what would happen to CellOracle if the backend were upgraded by a team of Wall Street quants. Starting even from the unusual figurative use of “dissect” both titles, the projects are very similar: find motifs in ATAC data to get the causal structure. Fit linear models to estimate causal effects. Feed in new features, e.g. setting a gene to 0 to knock it out. But in Dictys, the details are hardcore: they reprocess the ATAC reads on a GPU so that the motifs land in footprints. They fit latent-space stochastic differential equations coupled with a multinomial observation model, making an explicit distinction between technical and biological randomness. They normalize predicted expression to match the compositional effects that happen when you use total-count normalization on real data. I was blown away by this level of ambition, detail, and technical knowledge in stochastic processes. Compared to CellOracle, the empirical focus is much more on network structure recovery, with only one demo on the total effect of genetic perturbations. What would happen if we deployed Dictys to predict changes in stem cell differentiation or embryo studies?
scKINETICS is presented as an RNA velocity estimation method, but they’re not fooling me: it’s a network derived from, yup, motif analysis of chromatin accessiblity data, used to determine the nonzero coefficients in, yup, a linear model for gene expression. Unlike CellOracle and Dictys, there is no steady state assumption: the model predicts future expression of each cell, not current expression. Since future expression is unknown, it is imputed via a probabilistic compromise between each cell’s predicted expression and nearby cells. The scKINETICS empirical demos use pseudotime as a ground truth, which seems circular because both scKINETICS and the “ground truth” estimate velocity via nearest-neighbor analysis. But scKINETICS is basically a reasonable combination of ingredients, highly similar to Dictys and CellOracle, and it would be interesting to see how it stacks up against them in a neutral comparison.
PRESCIENT also combines dynamic models of gene expression with in-silico perturbation, but it is a substantial departure from CellOracle, Dictys, and scKINETICS. PRESCIENT figures out how to “transport” each time-point onto its successors using a beautiful blend of forward simulation and optimal transport lineage tracing. All modeling is done in 30- to 50-dimensional PCA space, with no motif analysis at all. Claims about simulating complex genetic perturbations are central to the paper, but I don’t know how it is supposed to get the causal structure right, and frankly I am not impressed with the empirical demos of in silico perturbation, which seem to mostly present simulation results without comparing to ground truth. (The lineage tracing demos, on the other hand, are some of the best in the business, because they take advantage of clonal barcoding data.) This combination of a beautiful, distinct approach and very few checks against known perturbation outcomes makes me intensely curious: how would PRESCIENT stack up against CellOracle, Dictys, and scKINETICS in a direct comparison?
RNAForecaster is another dynamic model advertising simulated genetic perturbations. It requires RNA velocity as input, and the predictive model is a fully connected neural network. No motif analysis is used. Empirically, the perturbation predictions have only been tested on simulated data. I’m also curious to see how this would fare on real data.
OneSC was developed at the desk across from mine by a close friend and colleague, Dr. Dan Peng. (Hire Dan (LinkedIn); he’s brilliant with computers, clever with data, highly productive, and great to work with.) I can’t be neutral about OneSC. But OneSC’s distinctive advantage is boiling down network dynamics into extremely parsimonious Boolean models that nevertheless capture qualitative system behavior such as the number and profile of achievable steady states. OneSC shows promising generalization on held-out perturbations, and there is some theory stating that sparse Boolean networks in particular can be fully reconstructed from very little data. So while the restriction to Boolean models makes it hard to closely fit the data the way that something like PRESCIENT would, it has a better shot at recovering real mechanisms and getting the counterfactuals right.
It’s worth mentioning that GeneFormer can be deployed on training data with no interventions, meaning it could be directly compared to these timeseries-based models. Indeed both GeneFormer (original paper) and CellOracle (follow-up paper) were independently used to nominate candidate targets whose modulation might promote cardiac recovery.

EDIT 2024 Oct 16: there are also methods that use RNA velocity to predict perturbation responses. I have a limited understanding of this work but it’s certainly worth mentioning.

Dynamo uses kernel ridge regression to predict velocity from expression state. They predict knockout differentiation trajectories by “least action”: a path that minimizes the difference between the velocity implied by the path and the velocity predicted by the model.
The biophysical model of Chari et al. uses a bursting model to estimate post-perturbation rates of transcription, splicing, and decay rates directly from perturb-seq data. In addition to assessing which specific processes mediate differential expression, this work includes some prediction of kinetic parameters under new perturbations.
RegVelo predicts transcription rates as $a = h(Ws + b)$ where $s$ are spliced RNA counts, $W$ is a GRN weight matrix, and $h$ is a nonlinear response function. The structure of $W$ is hard-constrained, or L1-penalized towards, prior GRNs from motif analysis and other sources.

Perspective: bioinformatics boom and bust

To me, we are in the very early stages of this type of modeling. There are many interesting and diverse approaches, and depending on who does the evaluation, the results always seem to favor something different. This motivated my work on the PEREGGRN benchmarks. We’ve done a lot of work in the ~16 months since the preprint came out, and we updated our preprint on Sept 30 2024. We have a lot to learn, and I hope to discuss this topic more soon.

My fear is that new methods will emerge faster than the community is able to assess them, and many of these methods will be biologically unreliable, painful to use, and even more painful to install, ultimately benefitting their authors more than their users. I have a ton of respect for certain pioneers in this space, and I hope my enthusiasm is clear from the summaries above. But pioneers are inevitably followed by imitators. This has certainly happened with pseudotime inference and GRN structure inference, and [heroic] [attempts] at neutral evaluation seemed to sandbag the flooding only momentarily. This time, the rising tide of nonstop bioinformatics methods development is poised to spill over into not just basic-science questions like GRN structure and lineage tracing, but also translational questions like how to optimize directed differentiation of stem cells or how to select drug targets for cardiac recovery.

If you are worried about this, then I encourage you to think about how we can maintain high epistemic standards as a field. Certain other fields have practices we may be able to adopt: for example, adjustment for experimenter [degrees] [of] [freedom]; funnel plots; preregistration; blinding; and screening instead of candidate studies. Even if you are currently following a traditional format of methods development, it’s not too late to preregister an experiment or to engage with a group running systematic benchmarks.

Written on October 1, 2024