Gene regulatory network inference is a venus fly trap for quants

An image of Audrey II captioned 'Why dont you try .... GENE REGULATORY NETWORK INFERENCE?'

Choosing a scientific project is hard, and people don’t spend enough time on it (see Alon (pdf) and Fischbach on this). If you are thinking of choosing a research project related to gene regulatory networks, computer models of transcription, perturbation responses, or virtual cells, then you might benefit from knowing what others have set out to do and how it has gone.

Gene regulatory network inference is a venus fly trap for quants

We learn about LASSO and regression trees and dynamic mode decomposition and “Bayesian mechanics” and Kalman filters and transformers, and we get excited to solve real problems with these models. Maybe we start thinking about assumptions – e.g. LASSO model selection consistency needs predictors that are not too correlated with each other, and oh yeah I only have mRNA data, so I hope the protein activity in my cells relies on uninterrupted mRNA production. Oh and some of the variation is measurement error, not biological variation, so let’s build in a Gaussian or binomial error model. And we have some guesses about the network structure, so let’s use those as a prior for Bayesian model averaging. Wow, this is turning into a ton of work! Throw in a demo on a real dataset. OK, great, we’re done.
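For readers who have not seen the LASSO step spelled out, here is a minimal sketch of the usual recipe: regress one target gene on candidate regulators with an L1 penalty, and treat the nonzero coefficients as candidate edges. This is not any particular published method. The function names, the penalty `lam`, and the toy data are all assumptions made for illustration, and the toy regulators are deliberately uncorrelated so the answer can be checked by hand.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything in [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X: list of n rows (cells); each row holds p candidate-regulator expression values.
    y: list of n expression values for a single target gene.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    residual = list(y)  # equals y - X b, since b starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            col_sq = sum(v * v for v in col) / n
            if col_sq == 0.0:
                continue  # a constant-zero regulator carries no information
            # Covariance of regulator j with the residual, adding back j's own contribution
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_sq * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq
            delta = new_bj - b[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                b[j] = new_bj
    return b


# Hypothetical toy data: 4 cells, 2 uncorrelated candidate regulators.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
# Target gene = 3 * regulator_0 + 0.1 * regulator_1 (strong edge plus a very weak one).
y = [3.0, 0.1, -3.0, -0.1]

coefs = lasso_coordinate_descent(X, y, lam=0.1)
candidate_regulators = [j for j, c in enumerate(coefs) if c != 0.0]
print(coefs, candidate_regulators)
```

On this toy input the weak regulator is shrunk to exactly zero and only regulator 0 survives. Real expression data have correlated regulators, which is where the consistency assumption mentioned above starts to bite.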

… I guess maybe we should check the results. Let’s (simulate) (some) (data) and maybe also compare to portions of known networks in well-studied organisms (BEELINE, DREAM5, Djordjevic, McCalla). Wow, this is even more work! And the precision-recall tradeoffs look pretty bad. Huh …
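As a rough sketch of what "checking the results" looks like in code: rank the predicted edges by score, compare them to a reference set of known edges, and summarize the precision-recall curve as average precision. The edge names, scores, and reference set below are made up, and this is not the exact scoring code of any of the benchmarks mentioned above. Ties in score are broken arbitrarily, which is a simplification.

```python
def average_precision(scores, true_edges):
    """Average precision of a ranked edge list against a reference network.

    scores: dict mapping (regulator, target) -> confidence score.
    true_edges: set of (regulator, target) pairs treated as ground truth.
    Only true edges that appear among the scored candidates are counted.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_true = sum(1 for edge in ranked if edge in true_edges)
    if n_true == 0:
        raise ValueError("no reference edges among the scored candidates")
    hits = 0
    total = 0.0
    for rank, edge in enumerate(ranked, start=1):
        if edge in true_edges:
            hits += 1
            total += hits / rank  # precision at the rank of each true edge
    return total / n_true


def random_baseline(scores, true_edges):
    """Expected precision of a random ranking: the fraction of candidates that are true."""
    n_true = sum(1 for edge in scores if edge in true_edges)
    return n_true / len(scores)


# Hypothetical predictions and reference network.
scores = {
    ("TF1", "geneA"): 0.9,
    ("TF1", "geneB"): 0.7,
    ("TF2", "geneA"): 0.6,
    ("TF2", "geneB"): 0.3,
    ("TF3", "geneA"): 0.1,
}
true_edges = {("TF1", "geneA"), ("TF2", "geneA")}

ap = average_precision(scores, true_edges)
baseline = random_baseline(scores, true_edges)
print(f"average precision = {ap:.3f}, random baseline = {baseline:.3f}")
```

The baseline matters because reference networks are sparse: an average precision that sounds respectable in isolation can be barely above the fraction of candidate edges that are true.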

… ok, maybe we can’t get the whole network. Fine. If someone can ChIP and KO a single transcription factor and publish a paper on that – which, to be clear, is common and worthwhile – then maybe I can still have a positive impact by recovering a subset of a gene regulatory network with high confidence. So let’s get some p-values and run some multiple testing correction and just skim off the top of the list. Throw in a real-data demo. OK, great, we’re done.
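The "get some p-values, correct for multiple testing, skim off the top" step can be sketched with the Benjamini-Hochberg procedure. The p-values below are invented for illustration, and the function name is my own. The procedure controls the false discovery rate only if the p-values themselves are valid and independent or positively dependent, which is exactly the assumption that tends to fail here.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected by the Benjamini-Hochberg procedure.

    Finds the largest rank k such that the k-th smallest p-value is <= (k / m) * alpha,
    then rejects the k hypotheses with the smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical p-values, one per candidate regulator-target edge.
edge_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

confident_edges = benjamini_hochberg(edge_pvalues, alpha=0.05)
print(confident_edges)
```

Here five of the ten p-values fall below 0.05 on their own, but only the first two survive the correction.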

… I guess maybe we should check the results again. Oh but these tests make all sorts of assumptions about linear/Gaussian data or extremely unstructured null models. Let’s roll our own test with more flexible assumptions and figure out a way to calibrate the error rates and OH EEUUUGGHH IT’S THAT BAD??

Fuck. Well. I guess we need cleaner data somehow. Let’s try to just regress out some batch effects and oh the AUPR is worse now and the calibration is also probably not fixed but we can’t tell because we lost 95% of our power? Huh. Ummm.

Ok fine, maybe the data are just hopelessly confounded by unmodeled post-transcriptional regulation and/or batch effects. Let’s try this again some other time with cleaner data.

… this is where I left off and where I hope to pick up again some day. If you want to work on this, I am happy to chat about it.

Virtual cells

So we tried GRN inference and the result is extremely messy. Can we still get some value out of a messy network with lots of wrong answers but also enrichment of right answers? People claim to make useful cell state predictions with a messy first-draft motif-based network (CellOracle, scKINETICS) or a curated functional similarity network (GEARS, CODEX) or no network at all (PRESCIENT, GeneFormer).

There is tons of modeling diversity here and there are lots of open questions. In fact, here’s a list of guiding questions that I maintained and extended for several years during my Ph.D. I chased after too many of these questions during the super-customizable PEREGGRN project, leaving myself a little overextended. But the main result was that simple baselines usually worked best, preventing us from obtaining meaningful answers to many of our questions. Many others are finding similar results. So this didn’t work out either.
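To show what "simple baselines usually worked best" means operationally, here is a hedged sketch of the comparison. It is not the PEREGGRN benchmarking code. The numbers are invented, and the baseline shown (predict that a perturbation changes nothing relative to control) is just one simple choice among several.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed expression."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def compare_to_baseline(model_pred, baseline_pred, observed):
    """Score a model and a baseline on the same held-out perturbation."""
    model_mae = mean_absolute_error(model_pred, observed)
    baseline_mae = mean_absolute_error(baseline_pred, observed)
    return {
        "model_mae": model_mae,
        "baseline_mae": baseline_mae,
        "model_beats_baseline": model_mae < baseline_mae,
    }


# Hypothetical expression of four genes after a held-out perturbation.
observed = [1.0, 2.0, 3.0, 4.0]
# Baseline: assume the perturbation changes nothing, so reuse control expression.
control_expression = [1.0, 2.0, 3.0, 5.0]
# Output of some hypothetical network-based model.
model_prediction = [2.0, 2.0, 2.0, 4.0]

result = compare_to_baseline(model_prediction, control_expression, observed)
print(result)
```

In this made-up case the "no change" baseline has half the error of the model. A method only tells you something about the underlying biology once it clears this kind of bar on held-out perturbations.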

Lessons

Longstanding ambitions to build GRNs and virtual cells still await richer data. For a problem this hard, modeling details will not matter until we have adequate data. I encourage analysts to redirect their enthusiasm: instead of crafting theory, investigate the information content of your data, and devise new ways to discover what other information you need. Alternatively, if you do want to focus on theory and model-building, there are other types of data and other ideas in transcriptional control that could be a better fit for you: for example, reaction rate parameter inference, kinetic control, or noise propagation.

Written on May 5, 2025