Some references on genomic foundation models

[ misc  ]

I recently offered to swap bibliographies with the folks at Tabula Bio, so I put together a haphazard list of recent work on foundation models in genomics. I always want to know what the limits of the latest data are and how far we can generalize, so this discussion is a great opportunity to explore where Tabula’s interests and mine overlap. For each of three model classes, defined by the general type of training data, here are pointers to a sampling of existing work, plus a brief comment on where these models seem to hit a wall. Read it quick before the SOTA gets up and walks away!

Seq2function

Increasingly large models have successfully predicted chromatin state or gene expression from DNA sequence.

They are good at designing sequences that will be transcribed in specific cell types. But they struggle with novel alleles, person-to-person variation, and distal enhancers.

In a recent Very Large Benchmark of enhancer-to-gene pairing, Enformer is outperformed by task-specific models.

Seq2seq

I don’t know as much about models of DNA sequence alone – mostly just what the Owl posts.

Even self-supervised, these things can clearly find interesting real biology, like RNA secondary structure.

But some work claims that for a lot of demo tasks, the pretraining isn’t helping: non-FM baselines or task-specific models often perform better than the pretrained FMs.

Transcriptome only

People also train foundation models on transcriptome data without using any DNA sequence information.

So far, simple baselines, task-specific models, or completely different strategies such as learning gene embeddings from text and scientific literature almost always perform on par with or better than these models. Some examples of studies with this type of finding:

I’m a huge Debbie Downer most of the time, so let me clarify something here. I am optimistic about transcriptome foundation models in the long run, but I think we will need richer training data with orders of magnitude more perturbations, like Tahoe-100M, CMAP, or data internal to New Limit or big biotechs. I am also optimistic about pairing perturbation transcriptomics with DNA sequence models.
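For anyone wondering what a "simple baseline" looks like for transcriptome models, here is a minimal sketch. It runs on synthetic data, every name in it is made up for illustration, and it is not the evaluation protocol of any particular study. The baseline predicts the same expression change for every held-out perturbation: the per-gene average across training perturbations. It is then scored against a "no change" prediction using mean squared error.

```python
import numpy as np


def mean_baseline(train_effects: np.ndarray) -> np.ndarray:
    """Per-gene average effect across training perturbations.

    train_effects has shape (n_perturbations, n_genes); the returned vector
    is used as the prediction for every unseen perturbation.
    """
    return train_effects.mean(axis=0)


def mean_squared_error(pred: np.ndarray, truth: np.ndarray) -> float:
    """MSE over all perturbations and genes (pred broadcasts across rows)."""
    return float(np.mean((truth - pred) ** 2))


# Synthetic log fold changes (an assumption, not real data): a response
# shared by all perturbations plus perturbation-specific noise.
rng = np.random.default_rng(0)
n_perts, n_genes = 200, 50
shared_response = rng.normal(0.0, 1.0, size=n_genes)
effects = shared_response + rng.normal(0.0, 0.5, size=(n_perts, n_genes))

# Hold out the last 50 perturbations as "unseen".
train, test = effects[:150], effects[150:]

pred = mean_baseline(train)
mse_mean = mean_squared_error(pred, test)
mse_zero = mean_squared_error(np.zeros(n_genes), test)

print(f"mean baseline MSE: {mse_mean:.3f}")
print(f"no-change MSE:     {mse_zero:.3f}")
```

Any fancier model can be scored on the same held-out perturbations with the same metric, which makes it easy to see whether it actually beats the average.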

Written on March 27, 2025