What FDR control doesn't do

Working on transcriptome data, I use FDR control methods constantly, and I’ve run into a couple of unexpected situations where it’s easy to assume they will behave well… but they don’t.
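
To make “FDR control methods” concrete for readers new to the area, here is a minimal sketch of the Benjamini-Hochberg procedure, the most common such method. The post doesn’t say which methods it has in mind, so treat this as background, not as the post’s own code.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries at FDR level q."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    # Find the largest i (1-indexed) with p_(i) <= (i/m)*q ...
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()
        # ... and reject the hypotheses with the cutoff+1 smallest p-values.
        reject[order[: cutoff + 1]] = True
    return reject
```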

Read More

Neural Network Checklist

Training neural networks is hard. Plan to explore many options. Take systematic notes. Here are some things to try when it doesn’t work at first.

Read More

Five disturbing impossibility theorems

I remember reading, maybe in a decades-old issue of Scientific American, that geometry and physics professors sometimes hear from cranks. These are arrogant people, working in isolation, who earnestly believe they have discovered something important. Typical examples might be trisecting an angle with a compass and straightedge in a finite number of steps or unifying physics. I don’t personally understand the details of those examples, but there is a basic, recurring disconnect between the position of the crank – “The establishment has tried and failed; they do not understand” – and the position of the establishment – “We understand thoroughly; this task is proven to be impossible.”

Read More

A third miracle of modern frequentist statistics

Here’s post #3 of 4 on astonishingly general tools from modern frequentist statistics. This series highlights methods that can accomplish an incredible amount on the basis of very limited or seemingly wrong building blocks. It’s not just about the prior anymore: each of these methods works when you don’t even know the likelihood. I hope this makes you curious about what’s out there and excited to learn more stats!

Read More

A second miracle of modern frequentist statistics

Here’s post #2 of 4 on the theme of “Do 👏 not 👏 ignore 👏 the 👏 magnificent 👏 bounty 👏 of 👏 techniques 👏 from 👏 modern 👏 statistics 👏 just 👏 because 👏 someone 👏 taught 👏 you 👏 an 👏 ideology”. This series highlights tools from modern frequentist statistics, each with a distinctive advantage that nearly defies belief in terms of how much can be accomplished on the basis of very limited or seemingly wrong building blocks. It’s not just about the prior anymore: each of these methods works when you don’t even know the likelihood. I hope this makes you curious about what’s out there and excited to learn more stats!

Read More

A miracle of modern frequentist statistics

Over the past N years I’ve heard lots of smart people that I respect say things like “inside every frequentist there’s a Bayesian waiting to come out” and “Everybody who actually analyzes data uses Bayesian methods.” These people are not seeing what I’m seeing, and that’s a shame, because what I’m seeing is frankly astonishing.

Read More

A fourth miracle of modern frequentist statistics

Here’s post #4 of 4 on tools from modern frequentist statistics that are guaranteed to work despite their basis in very limited or seemingly wrong building blocks. Unlike the other miracles, you will need to specify the full likelihood for this one, but if you’ve only ever been exposed to Bayesian takes on variational inference, it may open your mind to interesting new possibilities, or at least alert you to an issue you might not have thought about yet.

Read More

freezr keeps results with the code that produced them

While doing single-cell RNA analysis for the Maehr Lab, I found it a constant struggle to keep everything organized. The nightmare scenario would be to get a useful or interesting result during an hour of intensive experimentation and re-writing of code, then later discover that the code to produce that result no longer exists. I wanted to interact with the data, not be interrupted constantly to commit microscopic changes, so I wrote freezr, which saves code, plots, and console output to a designated folder as the code runs. Check it out at https://github.com/ekernf01/freezr.
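
freezr itself is an R package, but the pattern it implements is language-agnostic. Below is a hypothetical Python sketch of the same idea; this is not freezr’s actual API, and the function and folder names are invented for illustration.

```python
import shutil, time
from contextlib import redirect_stdout
from pathlib import Path

def freeze(script_path, results_dir="results"):
    """Copy the running script into a timestamped folder and return
    that folder plus a context manager that captures console output
    there. A sketch of the freezr idea, not freezr's interface."""
    dest = Path(results_dir) / time.strftime("%Y%m%d_%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy(script_path, dest)  # the code travels with the results
    return dest, redirect_stdout(open(dest / "console.txt", "w"))

# Hypothetical usage at the top of an analysis script:
#   dest, capture = freeze(__file__)
#   with capture:
#       run_analysis(output_folder=dest)  # plots land next to the code
```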

Read More

Kalman Filter Cheat Sheet

To help me understand Kalman filtering while studying for quals, I wrote this cheat sheet, which condenses and complements the explanation of the Kalman filter in Bishop’s PRML (pdf), section 13.3. I wrote it as if I were about to implement the filter. (I didn’t: there’s already an open-source implementation, pykalman, including all the functionality discussed herein.) I’m having trouble with math typesetting on the web, so here are the markdown and pdf of this post.
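
For orientation before diving into the cheat sheet, here is a minimal numpy sketch of one predict/update cycle of the Kalman filter for a linear-Gaussian state space model. The variable names are mine, and pykalman remains the sensible choice for real work.

```python
import numpy as np

def kalman_step(x, P, z, F, Q, H, R):
    """One Kalman filter iteration.
    x, P : prior state mean and covariance
    z    : new observation
    F, Q : transition matrix and transition noise covariance
    H, R : observation matrix and observation noise covariance
    """
    # Predict: push the state estimate through the linear dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction using the new observation.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```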

Read More

Experiments with one control and $k$ treatments have highest power when the control arm is $\sqrt k$ times bigger than each individual treatment arm

When we screened transcription factors influencing endoderm differentiation (post, paper), we ran into an interesting design problem. The experiment had 49 different treatment conditions and one control. Treatments were not allowed to overlap. We were able to measure outcomes in a fixed number of cells – as it turned out, about 16,000. What is the optimal proportion of cells to use as controls?
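
The title’s rule answers this directly, and the arithmetic is quick to check: each treatment-vs-control contrast has variance proportional to $1/n_t + 1/n_c$, and minimizing that subject to $k \cdot n_t + n_c = N$ gives $n_c = \sqrt{k} \cdot n_t$. Here is a small sketch with the post’s numbers (49 treatments and 16,000 cells come from the post; the derivation is standard, not quoted from the paper):

```python
import math

def optimal_allocation(k, N):
    """Split N units between k equal treatment arms and one control arm,
    minimizing the variance 1/n_t + 1/n_c of each treatment-vs-control
    contrast subject to k*n_t + n_c = N. The optimum is n_c = sqrt(k)*n_t.
    """
    n_t = N / (k + math.sqrt(k))
    return n_t, math.sqrt(k) * n_t

n_t, n_c = optimal_allocation(k=49, N=16_000)
print(round(n_t), round(n_c))  # about 286 cells per treatment, 2000 controls
```

With $k = 49$, the optimal control fraction is $1/(\sqrt{49} + 1) = 1/8$ of the cells.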

Read More

Inferring developmental signals from RNA-seq data

Back in the Maehr lab, any given experiment that my coworkers ran was expensive and hard. Our most common measurements – flow cytometry, imaging, and qPCR – could take hours or a day or two. Making a batch of virus containing CRISPR guide RNAs could take days or a week or two. Running a directed differentiation could take weeks or a month or two, especially if the cells died. Making a cell line could take months. So if they were going to design a screen, they really wanted a good chance of finding something in there that worked for whatever result they were trying to obtain.

Read More

Problems in causal modeling of transcriptional regulation

Earlier, I discussed opportunities and ambitions in modeling the human transcriptome and its protein and DNA counterparts. I also promised a discussion of the many statistical issues that arise. To begin that discussion, I am trying to identify the statistical issues and organize them into some kind of structure.

Read More

Coping with missing data in biological network modeling

This is not a standalone post. Check out the intro to this series. This particular post is about the number one limitation in causal network inference: missing layers. In the most common type of experiment, we measure gene activity by sampling RNA transcripts, reading them out, and counting them. We can only guess at:

Read More

Enhancer integration for network modeling

In the intro, I described a three-part scheme to unravel the mystery of multicellular life. As part of that, I talked about how I am mostly trying to predict RNA levels these days. But we know that important parts of the human regulatory network are contained in other types of molecules: for example, of all mutations related to autoimmunity, 90% are not in a coding region. This post discusses a class of gene-like entities called enhancers that have recently emerged as an interesting and potentially useful counterpart to genes. Teams are beginning to catalog enhancers and figure out how they help control cell state. This post will survey how that’s being done and will convey one important current question: how do we best connect each enhancer with the gene(s) it helps control?

Read More

I don't want to be afraid of neural networks anymore

Neural networks are exploding through AI at the same pace as single-cell technology in genomics and GWAS consortia in genetics – that is to say, very quickly. For one example, ImageNet error rates decreased tenfold from 2010 to 2017 (source), but people are also now using neural networks to play StarCraft at a superhuman level, synthesize and manipulate photorealistic images of faces, and yes, also to analyze single-cell genomics data. This seems like a class of models I should consider learning to work with.

Read More

Definitive endoderm screen

The Maehr Lab recently published another paper! Hooray! I want to recap it with a quick summary and some technical things I learned on the project.

Read More

We want R functions for gradients and Hessians

The programming language R typically offers functions prefixed by d, p, q, and r for the pdf, cdf, inverse cdf, and sampling. These are useful for fitting statistical models – especially MCMC, since you’re usually sampling and computing densities. You can even get the log density from the d function with an extra argument.
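
For comparison, Python’s scipy.stats exposes the same quartet: pdf, cdf, ppf, and rvs mirror R’s d, p, q, and r, and logpdf mirrors the d function’s log argument. What neither ecosystem provides out of the box, and what this post asks for, is the derivative of the log density with respect to parameters; a crude finite-difference stand-in, written by me for illustration, looks like this:

```python
from scipy.stats import norm

x = 1.3
print(norm.pdf(x), norm.cdf(x), norm.ppf(0.975), norm.rvs())  # d, p, q, r
print(norm.logpdf(x))  # like dnorm(x, log = TRUE)

def dlogpdf_dmu(x, mu, sigma, eps=1e-6):
    """Finite-difference gradient of the Gaussian log density in the mean,
    the kind of function the post wishes came built in."""
    hi = norm.logpdf(x, loc=mu + eps, scale=sigma)
    lo = norm.logpdf(x, loc=mu - eps, scale=sigma)
    return (hi - lo) / (2 * eps)

print(dlogpdf_dmu(1.3, mu=0.0, sigma=1.0))  # analytic answer: (x - mu)/sigma**2 = 1.3
```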

Read More

What I wish I knew two years ago, or, best practices for command-line tools in bioinformatics

I work in a small lab. The number of bioinformaticians hovers around 1 to 1.5. We prioritize interaction with the data, so we do not spend the effort to implement things from scratch unless we absolutely need to. We start with what’s out there and adapt it as necessary. That means I have installed, used, adapted, or repurposed bioinformatics tools of many shapes and sizes. In terms of usefulness, they run the gamut from “I deeply regret installing this” to “Can I have your autograph?”. Some patterns emerge that distinguish the tools most pleasant to work with, and those patterns are the topic of today’s post.

Read More

The curious case of the missing T cell receptor transcripts

This is part 1 of a three-part post on T cell receptors & RNA data. Here are parts 2 (technical details) and 3 (bonus material). Also check out WAT3R, a custom-built analysis pipeline for TCR + scRNA data. I wasn’t involved in developing WAT3R, but it looks like a really nice tool, and much more modern than this post you’re reading.

Read More

About the Maehr lab

We’re a stem cell lab in a diabetes center. Stem cell biology is the hammer; type 1 diabetes is the nail. The connection is rather complicated.

Read More

Configuring pipelining tools for LSF

Many (all?) bioinformatics groups use cloud or cluster computing to handle grunt work such as sequence alignment. They use scheduling systems such as Sun Grid Engine and LSF to submit jobs to the cluster. But it’s becoming more common to use one of many modern pipelining tools. These tools abstract away the details of job submission, getting rid of boilerplate that would otherwise appear every time you build a pipeline.
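
As one concrete illustration, take Snakemake as a stand-in (the post doesn’t single out a tool, and the rule contents and resource numbers below are invented): the pipeline declares what each step needs, and the LSF boilerplate collapses into a single submission template.

```python
# Snakefile: the rule declares its own resource needs once...
rule align:
    input: "reads/{sample}.fastq.gz"
    output: "aligned/{sample}.bam"
    threads: 4
    resources: mem_mb=8000
    shell: "bwa mem -t {threads} ref.fa {input} | samtools sort -o {output}"

# ...and the bsub boilerplate lives in one place, e.g. with Snakemake's
# older --cluster interface:
#   snakemake --jobs 100 \
#     --cluster 'bsub -q normal -n {threads} -M {resources.mem_mb} -o logs/%J.out'
```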

Read More