Intro to gene regulatory network post series



The mystery of life

Human cells have a remarkable ability to differentiate and respond to their environment. A single fertilized egg can give rise to placenta, germ cells, muscle, liver, skin, neurons, and all the tissues of the body. Individual cells alter their shape, stiffness, organelle content, DNA packaging, and surface receptor composition. They migrate, contract, metabolize, secrete, multiply, kill, and die, according to the sequence of stimuli they “detect” beginning at conception. Despite their common origins in a pluripotent stem cell state, these cells are dramatically different in every aspect except one: they all contain more or less the same genome. It’s a long-standing challenge to build predictive models showing how this genome and its related proteins and RNA can control cellular “behaviors” and the developmental sequences by which each cell type emerges.

The biochemical systems controlling cells have a long parts list: tens of thousands of genes and RNA transcripts, hundreds of thousands of enhancers, and also proteins and signaling molecules. (If this is starting to sound foreign to readers low on biology background, check out the bio intro post.) It is useful to simplify these biochemical systems into networks: if transcript A is translated into protein A, which binds at locus B and inhibits transcription of transcript B, then the network would have a connection from A to B. The hope is that the network won’t be fully connected: each molecule type will have a manageable number of others that it interacts with.
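As a minimal sketch of that simplification, a regulatory network can be stored as a signed, directed graph. Everything here is illustrative: the class name, the method names, and the molecule labels "A", "B", and "C" are hypothetical, and the +1/-1 encoding for activation/inhibition is just one convenient assumption.

```python
from collections import defaultdict


class RegulatoryNetwork:
    """Toy signed, directed network: regulator -> {target: sign}."""

    def __init__(self):
        # +1 means activation, -1 means inhibition (an assumed encoding).
        self.edges = defaultdict(dict)

    def add_edge(self, regulator, target, sign):
        if sign not in (1, -1):
            raise ValueError("sign must be +1 (activation) or -1 (inhibition)")
        self.edges[regulator][target] = sign

    def targets(self, regulator):
        """Molecules directly regulated by `regulator`, with edge signs."""
        return dict(self.edges.get(regulator, {}))

    def regulators(self, target):
        """Molecules that directly regulate `target`, with edge signs."""
        return {r: t[target] for r, t in self.edges.items() if target in t}

    def density(self, n_molecules):
        """Fraction of possible directed edges (self-loops excluded) present."""
        possible = n_molecules * (n_molecules - 1)
        n_edges = sum(
            1 for r, t in self.edges.items() for tgt in t if tgt != r
        )
        return n_edges / possible if possible else 0.0


# Protein A binds at locus B and inhibits transcription of transcript B.
net = RegulatoryNetwork()
net.add_edge("A", "B", -1)
# A second, hypothetical activator of B.
net.add_edge("C", "B", +1)
```

The `density` method makes the sparsity hope concrete: with tens of thousands of molecule types, a usable network would need this fraction to be very small.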

Typical approaches have used purpose-built biotech tools (e.g. genetic knockout mice, fluorescent “reporter” cell lines) to identify pieces of the network that are important for function or development. Often, there is enough information to detail entire “pathways” within the network; this has been done, for example, with programmed cell death, nucleic acid sensing, and Wnt signaling (resource site, article). But with modern technologies, we are fast gaining the ability to observe phenotypes, assess molecular states, and perturb genes on a massive scale. We may not need to reconstruct the entire human control network one element at a time using labor-intensive methods. I am interested in automated, large-scale network modeling based on screens and observational data.

Unfortunately, there are many obstacles we must overcome in order to build convincing and useful models of cellular control from modern datasets. This post series will lay out a broad strategy and some of the current problems.

A vague modeling strategy

General simulators of cell state will be built in discrete steps:

- Protein content can be naïvely predicted from RNA content through the central dogma (DNA -> RNA -> protein), but increasingly sophisticated methods will improve predictions by accounting for factors other than transcription.
- Many specific molecules have already been linked to cellular phenotypes such as apoptosis and proliferation. This work will continue as before and with the aid of new technologies such as genome-wide CRISPR screens.
- RNA content is difficult to predict, but we will see continuous improvement of mathematical models that incorporate DNA binding proteins, chromatin shape/state, external stimuli, and splicing.
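To show how the three pieces above might compose, here is a deliberately crude sketch. It is not a real model: the linear RNA predictor, the proportional RNA-to-protein step, the threshold phenotype call, and all function names, gene names, and numbers are assumptions made up for illustration.

```python
def predict_rna(regulator_protein, weights, baseline):
    """Toy linear model: RNA level of each gene from regulator protein levels."""
    rna = {}
    for gene, base in baseline.items():
        effect = sum(
            w * regulator_protein.get(reg, 0.0)
            for reg, w in weights.get(gene, {}).items()
        )
        rna[gene] = max(0.0, base + effect)  # expression cannot go negative
    return rna


def predict_protein(rna, translation_rate=1.0):
    """Naive central-dogma step: protein is proportional to RNA."""
    return {gene: translation_rate * level for gene, level in rna.items()}


def predict_phenotype(protein, marker, threshold):
    """Toy phenotype call: is a marker protein above a threshold?"""
    return protein.get(marker, 0.0) > threshold


# Hypothetical two-gene system in which protein A inhibits transcription of B.
weights = {"B": {"A": -0.5}}
baseline = {"A": 2.0, "B": 3.0}
protein_now = {"A": 2.0}

rna = predict_rna(protein_now, weights, baseline)
protein = predict_protein(rna, translation_rate=2.0)
phenotype = predict_phenotype(protein, marker="B", threshold=3.0)
```

Each function stands in for one of the tasks listed above, so any of them could be swapped for a more sophisticated method without touching the others.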

I am interested in all of these tasks (studying phenotypes, predicting protein content, and predicting RNA content), but of the three, modeling RNA content is my current focus. From a statistical perspective, modeling of transcription raises fascinating questions about what data to generate, what inferences are possible, and how to represent the remaining uncertainty. From a stem cell biology perspective, I believe that control of transcription is necessary for control of cell identity and possibly even sufficient. From a technological perspective, it seems the time is ripe for additional development of this subfield: diverse relevant datasets are available, and models derived from them are beginning to show practical utility. Here are my thoughts and my plans moving forward.

(Note: this series is a work in progress; some linked posts below are actually not up yet.)

Table of Contents

Quick & dirty biology for people from other fields: show me the basics
What types of data do we have to work with?
- The single-cell revolution in genomics
  - A list of relevant datasets
- Discovering and modeling enhancers
Is this just going to crash and burn?
- How much data will we need to “solve” human transcriptional regulation? What kind of data is best?
- What type of inference is possible, and what obstacles arise?
- How do we handle the constant threat of missing information?
What are other people doing in this field and how well does it work? (all forthcoming)
- CellNet, Mogrify, and CellOracle
- DREAM5 and various modeling techniques
What is Eric fantasizing about doing and why? (all forthcoming)
- Summary and connections among topics
- Large-scale batch-corrected GRN inference
- GRN inference with RNA velocity
- Out-of-sample estimation of CMAP perturbations
- Pathway response inference in stem cell and embryo RNA data
- Isoform estimation in single-cell data
- Reverse inference of binding motifs
- Protein content prediction from RNA measurements

Written on March 10, 2020
