Network modeling and the Connectivity Map

[ grn  ]

Here’s a puzzle for you: why is nobody using CMAP for network modeling in stem cell biology?

For context, this post is part of a series about network modeling, which describes a rough overall scheme as well as data we need, data we have, and what our models should do (forthcoming) in order to systematically formulate and test computational models of stem cells. We are going to need a lot of data, especially if the processes that govern activity of each gene are complex. Profiling transcriptomes in every human cell type – say, 10,000 different cell types – may still not provide enough data to infer the right “rule” for each gene.

Furthermore, even if that worked, we wouldn’t trust it. Some type of perturbation is necessary: prediction of an out-of-sample cell state, preferably even a non-physiolgical one, to see whether the model has bona fide predictive power.

Enter the Connectivity Map. CMAP contains transcriptomic profiles of immortalized human cell lines under tens of thousands of perturbations. It includes both exposure to small molecule drugs (tens of thousands) and forced overexpression or knockdown of genes (thousands). Each readout describes around 1,000 genes that were directly measured using the LINCS L1000 assay, so not the full transcriptome, but some conditions also have full-transcriptome data, and there has been some effort to impute the missing values.

Here are some basic questions about this dataset. If I can’t find them in the literature, I want to try to answer them myself. I apologize because below I assume whoever’s reading has some knowledge of methods that motivate my thinking (CellOracle, SCRIBE, scGen), and I haven’t written any intro posts on those topics yet.

  • One application of CMAP in osteoblast differentiation showed that CMAP could be used to find new drugs for directed differentiation. It worked great – but that was 5 years ago and to my knowledge, nobody has done anything similar since. So was this a fluke, or could this tactic be more broadly useful? To make this concrete: if you feed CMAP some transcriptome data from a well-established protocol for directed differentiation of cardiac muscle or pancreatic beta cells, can it recover molecules that are actually used in the protocol? If you feed it embryo data, can it recover signals that are known to be involved in embryogenesis?
  • One of the core issues in stem cell biology is that different cell types respond to the same signal in different ways. When querying CMAP for useful small molecules, each cell line often gives a different answer compared to the average, and it is not known which best reflects the expected behavior of the query cell type. Can this question be settled in a way that is similar to collaborative filtering, i.e. based on a profile of the query cell type? More specifically, suppose you have RNA levels for cell types A, B, and C under a control condition, but only A and B under treatment with, say, retinoic acid. How well can you predict the response of cell type C to retinoic acid, and what method works best? I want to see a simple baseline prediction – maybe the median fold change from A and B – compared to a modern deep learning approach such as scGen, a network-based approach like CellOracle, and some typical collaborative filtering methods like you might see competing on the Netfix problem.
    • If network-based approaches are competitive or at least promising: can network data from other sources improve performance on CMAP out-of-sample predictions? For example, could CellOracle be modified to “prefer” additional connections taken from network inference on the ARCHS4 collection (or recount2, or CellNet, etc)? If you had different pre-existing networks, could you benchmark them by asking which ones lead to the greatest improvement in the CMAP collaborative filtering problem?
  • Stem cell simulation methods like SCRIBE and CellOracle would benefit greatly from an explicit time-scale. Could CMAP, whose measurements are 6h or 24h after perturbation, be used to calibrate these methods’ time-scales? For example:
    • CellOracle has a “propagation” parameter that is arbitrarily set to 3, but CellOracle is not meant to simulate prolonged intervals, and if you were to adapt it for that purpose, you would need to establish a clear relationship between that propagation parameter and actual time.
    • SCRIBE can be trained using RNA velocity data. RNA velocity seems to predict 1-2 hours out (fig 1h, 2g of the paper), but that won’t necessarily be consistent from gene to gene or cell to cell, so it is unclear how many steps to run the model in order to simulate e.g. a 5-day directed differentiation from pluripotent stem cells to endoderm.
Written on August 16, 2020