What FDR control doesn't do
Working on transcriptome data, I use FDR control methods constantly, and I’ve run into a couple of unexpected types of situations where it’s easy to assume they will behave well… but they don’t.
Ph.D. student at JHU BME; formerly Bioinformatician at UMass Medical School
Training neural networks is hard. Plan to explore many options. Take systematic notes. Here are some things to try when it doesn’t work at first.
This post might be for you if:
This post will highlight Jean-Marie Dufour’s absolutely impenetrable tour de force “Some Impossibility Theorems in Econometrics With Applications to Structural and Dynamic Models”, also known by its English title, “Weak Instruments, Get Rekt.” Dufour talks about confidence intervals for a parameter $\phi$ in a situation where:
Suppose the joint distribution of $X, Y, Z$ is continuous and the max of $X$, $Y$, and $Z$ is $M$. Suppose you want to test whether $X$ and $Y$ are independent conditional on $Z$. You want to do this with a type 1 error rate (false positive rate) controlled at 5%.
Maybe this was naive of me. I thought that if you were willing to assume that the mean exists, you could probably … estimate it?
So you want to cluster your data. And you want to use a “good” algorithm for clustering. What makes a clustering algorithm “good”?
It’s natural to want and expect a one-stop shop. The best fitting model ought to give both the best predictions and the best inferences.
I remember reading, maybe in a decades-ago issue of Scientific American, that geometry and physics professors sometimes hear from cranks. These are arrogant people, working in isolation, who earnestly believe they have discovered something important. Typical examples might be trisecting an angle with a compass and straightedge in a finite number of steps or unifying physics. I have not understood the details of those examples personally, but there is a basic, recurring disconnect between the position of the crank – “The establishment has tried and failed; they do not understand” – and the position of the establishment – “We understand thoroughly; this task is proven to be impossible.”
Here’s post #3 of 4 on astonishingly general tools from modern frequentist statistics. This series highlights methods that can accomplish an incredible amount on the basis of very limited or seemingly wrong building blocks. It’s not just about the prior anymore: each of these methods works when you don’t even know the likelihood. I hope this makes you curious about what’s out there and excited to learn more stats!
Here’s post #2 of 4 on the theme of “Do 👏 not 👏 ignore 👏 the 👏 magnificent 👏 bounty 👏 of 👏 techniques 👏 from 👏 modern 👏 statistics 👏 just 👏 because 👏 someone 👏 taught 👏 you 👏 an 👏 ideology”. This series highlights tools from modern frequentist statistics, each with a distinctive advantage that nearly defies belief in terms of how much can be accomplished on the basis of very limited or seemingly wrong building blocks. It’s not just about the prior anymore: each of these methods works when you don’t even know the likelihood. I hope this makes you curious about what’s out there and excited to learn more stats!
Over the past N years I’ve heard lots of smart people that I respect say things like “inside every frequentist there’s a Bayesian waiting to come out” and “Everybody who actually analyzes data uses Bayesian methods.” These people are not seeing what I’m seeing, and that’s a shame, because what I’m seeing is frankly astonishing.
Here’s post #4 of 4 on tools from modern frequentist statistics that are guaranteed to work despite their basis in very limited or seemingly wrong building blocks. Unlike the other miracles, you will need to specify the full likelihood for this one, but if you’ve only ever been exposed to Bayesian takes on variational inference, it may open your mind to interesting new possibilities, or at least alert you to an issue you might not have thought about yet.
Warning: near-delirious rant initiating. Consider two of the major unmet needs for predicting gene function in stem cell biology:
I just finished teaching a one-credit engineering class (part of JHU’s HEART). It was an opportunity for students and me to indulge our curiosity about a really neat topic: namely, epigenetics.
This is an example of a completed project from my class on epigenetics.
This is a description of the research project from my class on epigenetics.
MatrixLazyEval is an R package for “lazy evaluation” of matrices, which can help save time and memory by being smarter about common tasks like:
For a while, I experimented with writing all of my R packages in R markdown. In order to facilitate this, I built kniterati, which knits all the Rmd files into R files prior to the usual devtools workflow. Check it out at https://github.com/ekernf01/kniterati.
Doing single-cell RNA analysis for the Maehr Lab, it was a constant struggle to keep everything organized. The nightmare scenario would be to get a useful or interesting result during an hour of intensive experimentation and re-writing of code, then later discover that the code to produce that result no longer exists. I wanted to interact with the data, not be interrupted constantly to commit microscopic changes, so I wrote freezr, which saves code, plots, and console output to a designated folder just as the same code runs. Check it out at https://github.com/ekernf01/freezr.
To help me understand Kalman filtering while studying for quals, this cheat sheet condenses and complements the explanation of the Kalman filter in Bishop PRML (pdf) section 13.3. I wrote this as if I were about to implement it. (I didn’t: there’s already an open-source implementation, pykalman, including all the functionality discussed herein.) I’m having trouble with math typesetting on the web, so here’s the markdown and pdf of this post.
When we screened transcription factors influencing endoderm differentiation (post, paper), we ran into an interesting design problem. The experiment had 49 different treatment conditions and one control. Treatments were not allowed to overlap. We were able to measure outcomes in a fixed number of cells – as it turned out, about 16,000. What is the optimal proportion of cells to use as controls?
Here’s a puzzle for you: why is nobody using CMAP for network modeling in stem cell biology?
Sorry, this one’s not ready yet!
Back in the Maehr lab, any given experiment that my coworkers ran was expensive and hard. Our most common measurements – flow cytometry, imaging, and qPCR – could take hours or a day or two. Making a batch of virus containing CRISPR guide RNAs could take days or a week or two. Running a directed differentiation could take weeks or a month or two, especially if the cells died. Making a cell line could take months. So if they are going to design a screen, they really want a good chance to find something in there that works for whatever result they are trying to obtain.
Earlier, I discussed opportunities and ambitions in modeling the human transcriptome and its protein and DNA counterparts. I also promised a discussion of the many statistical issues that arise. To begin that discussion, I am trying to identify statistical issues and position them into some type of structure.
This is not a standalone post. Check out the intro to this series. This particular post is about the number one limitation in causal network inference: missing layers. In the most common type of experiment, we measure gene activity by sampling RNA transcripts, reading them out, and counting them. We can only guess at:
In the intro, I described a three-part scheme to unravel the mystery of multicellular life. As part of that, I talked about how I mostly am trying to predict RNA levels these days. But, we know that important parts of the human regulatory network are contained in other types of molecules: for example, of all mutations related to autoimmunity, 90% are not in a coding region. This post discusses a class of gene-like entities called enhancers that have recently emerged as an interesting and potentially useful counterpart to genes. Teams are beginning to catalog enhancers and figure out how they help control cell state. This post will survey how that’s being done and will convey one important current question: how do we best connect each enhancer with the gene(s) it helps control?
This is not a standalone post – it’s just a rough list of datasets that could be useful for regulatory network models. Check out the intro to this series for more context.
This post is part of the GRN series. Check out the intro.
In the intro, I discussed the puzzle of multicellular life: many cell types, one genome. I also followed up by discussing the many statistical issues that arise. There’s a very cool paper from 1999 that brings a lot of clarity to this situation, and here I have space to dig into it a bit.
I’ve been reading recently about debugging methods, starting with this post by Julia Evans. All the links I followed from that point have something in common: the scientific method. Evans quotes @act_gardner’s summary:
In the Maehr lab’s regular journal club, we recently discussed a Cell paper from the research groups of Howard Chang, Will Greenleaf, and Paul Khavari (henceforth “Rubin et al.”).
Neural networks are exploding through AI at the same pace as single-cell technology in genomics and GWAS consortia in genetics – that is to say, very quickly. For one example, ImageNet error rates decreased tenfold from 2010 to 2017 (source), but people are also now using neural networks to play StarCraft at a superhuman level, synthesize and manipulate photorealistic images of faces, and yes, also to analyze single-cell genomics data. This seems like a class of models I should consider learning to work with.
The Maehr Lab recently published another paper! Hooray! I want to recap it with a quick summary and some technical things I learned on the project.
This is not a stand-alone post. It is a technical appendix for an upcoming post, which is (EDIT) now published here on the blog of the peer-reviewed journal Medical Care!
The programming language R typically offers functions prefixed by d, p, q, and r for pdf, cdf, inverse cdf, and sampling. These are useful for fitting statistical models – especially MCMC since you’re usually sampling and computing densities. You can even get the log density from the d function with an extra argument.
I work in a small lab. The number of bioinformaticians hovers around 1 to 1.5. We prioritize interaction with the data, so we do not spend the effort to implement things from scratch unless we absolutely need to. We start with what’s out there and adapt it as necessary. That means I have installed, used, adapted, or repurposed many shapes or sizes of bioinformatics tools. In terms of usefulness, they run the gamut from “I deeply regret installing this” to “Can I have your autograph?”. Some patterns emerge distinguishing those that are most pleasant to work with, and that’s the topic of today’s post.
Why study the thymus? The eponymous “T cells” of the immune system:
This is part 3 of a three-part post on T-cell receptors & RNA data. Here are parts 1 (intro / summary) and 2 (technical details).
This is part 2 of a three-part post on T-cell receptors & RNA data. Here are parts 1 (intro / summary) and 3 (bonus material).
This is part 1 of a three-part post on T cell receptors & RNA data. Here are parts 2 (technical details) and 3 (bonus material). Also check out WAT3R, a custom-built analysis pipeline for TCR + scRNA data. I wasn’t involved in developing WAT3R, but it looks like a really nice tool, and much more modern than this post you’re reading.
We’re a stem cell lab in a diabetes center. Stem cell biology is the hammer; Type I diabetes is the nail. The connection is rather complicated.
Many (all?) bioinformatics groups use cloud or cluster computing to handle grunt work such as sequence alignment. They use scheduling systems such as Sun Grid Engine and LSF to submit jobs to the cluster. But, it’s becoming more common to use one of many modern pipelining tools. These pipelining tools abstract away the details of job submission, getting rid of boilerplate that would otherwise appear every time you build a pipeline.
This is the fully detailed version of this post about TracerSeq. If you find it daunting, boring, or overly technical, check out the short version first.
This is a short version of the full post about TracerSeq.
Hello world. Stay tuned for initial posts, coming up!