Expression forecasting benchmarks


I wrote about the boom of new perturbation prediction methods. The natural predator of the methods developer is the benchmark developer, and a population boom of methods is naturally followed by a boom of benchmarks. (The usual prediction about what happens after the booms is left as an exercise to the reader.) Here are some benchmark studies that evaluate perturbation prediction methods. For each one, I will prioritize three questions:

  • What’s the task?
  • What is the overall message of this benchmark?
  • What are this benchmark’s distinctive advantages, and if I have a new method, should I test it with this benchmark?

  • Ahlmann-Eltze et al. focus on comparing foundation models to simple baselines.
    • Task: predict gene expression for new perturbations not seen during training, or new combinations of perturbations.
    • Message: This study includes a peculiar linear baseline that beats GEARS, scGPT, and scFoundation almost uniformly when predicting genetic interactions on the Norman data or predicting novel perturbation outcomes on other perturb-seq datasets. This is an important finding that casts doubt on the value of pretrained foundation models or Gene Ontology for these tasks. (A toy sketch of a baseline in this spirit appears after the list below.)
    • Advantages/should you use the code: The clear advantage of this study is their clever, unique linear baseline. I don’t think this work is intended to be reused and extended by teams other than the authors, so my advice would be: read it; heed it; don’t need to repeat it. If I hear otherwise I’ll update this post.
  • PerturBench targets slightly different tasks but uses a lot of the same ingredients.
    • Task: They focus on transfer learning across cell types or experimental conditions: if you know how a perturbation affects gene expression in one cell type, can you predict what would happen in a different cell type? They also have a separate genetic interaction prediction task.
    • Message: They find leading performance using simple, but nonlinear, latent-space arithmetic (a generic sketch of this idea appears after the list below).
    • Advantages/should you use the code: Their work is very clearly meant to be reused, so if you’re a methods developer looking for a quick way to access a lot of results, you should take a look. Their framework seems to be highly flexible, especially the data splitting: you can manually specify what goes in the test set. Their way of incorporating new methods seems to be Python-only, but if your method is written in R, Julia, or something else, maybe you can rig it up to call your method via a subprocess (a sketch of that workaround appears after the list below). I am not sure how to add new datasets – I think that’s a work in progress. This work took thousands of GPU hours and is distinguished by the breadth of its coverage of deep-learning methods.
  • CausalBench is mostly geared towards fancy causal DAG structure inference methods and towards network structure recovery, but they do sometimes test on held-out interventions like the rest of the projects listed here.
    • Task: I don’t understand the data split or the evaluation methods well enough to comment.
    • Message: The authors’ own takeaway is that causal inference methods do not necessarily make better predictions than alternative methods with no underlying causal theory or no way of handling interventions in the training data.
    • Advantages/should you use the code: This work is from an open challenge by GSK that is now over. It is not clear to me whether it is intended to still be used, but I’m optimistic about this: the interface looks very convenient, and since it was for a competition, you can be sure the interface has been tested by several independent teams. Their way of incorporating new methods seems to be Python-only, but if your method is written in R, Julia, or something else, maybe you can rig it up to call your method via a subprocess (the same workaround sketched after the list below). This framework offers two datasets, and I am not sure how to add new datasets.
  • Edit 2024 Oct 14: PertEval-scFM compares perturbation prediction performance across a variety of foundation models.
    • Task: This set of benchmarks is based on the Norman 2019 CRISPRa data. It focuses on the information content of latent embeddings. All models are used in a way that is similar to GeneFormer’s setup: they obtain a perturbed embedding by zeroing out the targeted gene (regardless of KO vs OE), and they learn to predict training-set post-perturbation expression from the perturbed embedding (a rough sketch of this setup appears after the list below). Only methods producing embeddings are included, and the same decoder architecture is used across all models.
    • Message: They state it super clearly, so I’ll quote. “Our results show that [single-cell RNA-seq foundation model] embeddings do not provide consistent improvements over baseline models… Additionally, all models struggle with predicting strong or atypical perturbation effects.”
    • Advantages/should you use the code: This work includes an impressive variety of foundation models. I am not sure whether they intend to make this work extensible by outside developers.
  • Edit 2024 Oct 16: This brief benchmark
    • Task: … uses the same data splits as the initial scGPT demos on the Adamson 2016 perturb-seq data, with new metrics and new baselines.
    • Message: I’ll quote. “[W]e found that even the simplest baseline model - taking the mean of training examples - outperformed scGPT.” They interpret: “In the Adamson dataset, we observed high similarity between the perturbation profiles, with a median Pearson correlation of 0.662. This result is not entirely unexpected, given that the Adamson study focused on perturbations specifically targeting endoplasmic reticulum homeostasis, where similar transcriptional responses might be expected. Only a few genes exhibited anti-correlated expression profiles. Similarly, in the Norman dataset, there was a high degree of similarity between perturbation profiles, with a median Pearson correlation of 0.273.”
    • Advantages/should you use the code: The infrastructure doesn’t look like it’s meant to be extended to new methods and datasets by third parties. The advantage of this study is its careful, skeptical look at the scGPT evals and the information content of the data (the mean baseline and the profile-similarity calculation are sketched after the list below).
  • PEREGGRN (code, paper) constitutes the bulk of my PhD work, and therefore I cannot discuss it objectively. If you want my take anyway:
    • Task: predict gene expression for new perturbations not seen during training, or new combinations of perturbations.
    • Message: Judging by most evaluation metrics, on most datasets, the mean and median perform better than most methods (these are the same simple baselines sketched after the list below). Examples where published methods beat simple baselines occur more often for the specific combination of method, dataset, and eval metric that was used in the original publication.
    • Advantages/should you use the code: A distinctive advantage is that we include many different cell types and many ways of inducing GoF/LoF. Overexpressing developmentally relevant transcription factors in pluripotent stem cells is very different from knocking out heat-shock proteins in a cancer cell line, and this biological diversity is a big plus. The code is designed for extensibility: new methods can be added in Docker containers, so you can use R, Python, Julia, or anything else. We have instructions that let users add their own datasets, add their own draft causal network structures, compute new evaluation metrics, or choose among several types of data split that emphasize different tasks.
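To make a few of these entries concrete, here are some toy sketches. They are mine, not the authors’ code, and every name and file format in them is a placeholder. First, an additive baseline for double perturbations, in the general spirit of the simple linear baselines Ahlmann-Eltze et al. compare against; their actual baseline for unseen perturbations is more involved, so treat this only as an illustration of how little machinery a competitive baseline can need. It assumes pseudobulk (log-)expression vectors.

```python
import numpy as np

def additive_combo_baseline(ctrl, pert_a, pert_b):
    """Toy baseline for a double perturbation a+b: add each single
    perturbation's shift from control. All inputs are 1-D pseudobulk
    (log-)expression vectors of equal length, one entry per gene."""
    return ctrl + (pert_a - ctrl) + (pert_b - ctrl)

# Made-up numbers, just to show the shape of the calculation.
ctrl   = np.array([1.0, 2.0, 0.5])
pert_a = np.array([1.5, 2.0, 0.4])
pert_b = np.array([1.0, 3.0, 0.6])
print(additive_combo_baseline(ctrl, pert_a, pert_b))  # [1.5 3.  0.5]
```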
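For the PerturBench entry, here is a generic sketch of latent-space arithmetic for cross-cell-type transfer. The encode/decode callables stand in for some trained (nonlinear) autoencoder; this is the general recipe, not PerturBench’s actual model or API.

```python
import numpy as np

def latent_shift_transfer(encode, decode, ctrl_a, pert_a, ctrl_b):
    """Estimate a perturbation's direction in latent space using cell type A,
    then apply that shift to cell type B's control state and decode.

    encode: maps a (cells x genes) array to a (cells x latent_dim) array.
    decode: maps a latent vector back to a gene-expression vector.
    """
    delta = encode(pert_a).mean(axis=0) - encode(ctrl_a).mean(axis=0)
    return decode(encode(ctrl_b).mean(axis=0) + delta)

# Degenerate check with an identity "autoencoder": reduces to an additive shift.
identity = lambda x: x
ctrl_a = np.array([[1.0, 2.0], [1.2, 1.8]])
pert_a = np.array([[2.0, 2.0], [2.2, 1.8]])
ctrl_b = np.array([[0.5, 3.0]])
print(latent_shift_transfer(identity, identity, ctrl_a, pert_a, ctrl_b))  # [1.5 3.]
```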
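Both PerturBench and CausalBench appear to expect methods written in Python. The subprocess workaround I mentioned is nothing fancier than this: write the inputs to disk, shell out to your R (or Julia, etc.) script, and read the predictions back. The script name and file arguments below are placeholders, not part of either framework’s documented interface.

```python
import subprocess
import pandas as pd

def predict_via_rscript(train_csv, perturbations_csv,
                        out_csv="predictions.csv", script="my_method.R"):
    """Shell out to an R script that reads training data and a list of target
    perturbations, then writes its predictions to out_csv. Adapt the argument
    order to whatever command-line interface your own method expects."""
    subprocess.run(["Rscript", script, train_csv, perturbations_csv, out_csv],
                   check=True)
    return pd.read_csv(out_csv, index_col=0)
```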
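For the PertEval-scFM entry, here is a rough sketch of the “zero out the targeted gene’s embedding, then decode” setup as I understand it. The aggregation step (a plain average over gene embeddings) and the decoder (a small scikit-learn MLP) are stand-ins of my own choosing; the actual benchmark fixes its own decoder architecture across all models.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def perturbed_embedding(gene_embeddings, gene_to_row, target):
    """Zero out the targeted gene's row of a (genes x dims) embedding matrix,
    then average over genes to get one fixed-size input vector per perturbation."""
    emb = gene_embeddings.copy()
    emb[gene_to_row[target]] = 0.0
    return emb.mean(axis=0)

# Toy data: 100 genes with 16-dimensional embeddings from some foundation model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16))
gene_to_row = {f"gene{i}": i for i in range(100)}

# X: one perturbed embedding per training perturbation.
# Y: matched post-perturbation pseudobulk profiles (random here, just to run).
targets = [f"gene{i}" for i in range(20)]
X = np.stack([perturbed_embedding(embeddings, gene_to_row, t) for t in targets])
Y = rng.normal(size=(20, 100))
decoder = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X, Y)
```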
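Finally, the “simplest baseline” that keeps coming up: predict every held-out perturbation as the mean (or median) of the training perturbations’ profiles, plus the profile-similarity number quoted in the brief Adamson/Norman re-analysis. The mean and median baselines in PEREGGRN are the same idea. The sketch assumes a (perturbations x genes) matrix of pseudobulk profiles.

```python
import numpy as np

def mean_baseline(train_profiles):
    """Predict any held-out perturbation as the gene-wise mean of training profiles."""
    return train_profiles.mean(axis=0)

def median_baseline(train_profiles):
    """Same idea, but gene-wise median."""
    return np.median(train_profiles, axis=0)

def median_pairwise_pearson(profiles):
    """Median Pearson correlation over all pairs of perturbation profiles,
    i.e. the 'how similar are the perturbations to each other' number quoted above."""
    corr = np.corrcoef(profiles)                   # perturbations x perturbations
    upper = corr[np.triu_indices_from(corr, k=1)]  # each off-diagonal pair, once
    return np.median(upper)
```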
Written on October 1, 2024