Lazy matrix evaluation saves RAM in common analyses

[ software  single_cell  stat_ml  ]

MatrixLazyEval is an R package for “lazy evaluation” of matrices, which can help save time and memory by being smarter about common tasks like:

  • scaling rows/columns
  • shifting rows/columns
  • extracting residuals after linear regression
  • multiplying matrices
  • composing multiple such operations
  • performing approximate PCA and SVDs

See https://github.com/ekernf01/MatrixLazyEval to try it out.
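If you want to experiment with it, a typical way to install an R package straight from GitHub (assuming you have the remotes package available) is:

```r
# install.packages("remotes")   # if remotes is not already installed
remotes::install_github("ekernf01/MatrixLazyEval")
```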

How does it work?

Can you spot the difference between $(Xv) - \mathbf 1(m^Tv)$ and $(X - \mathbf 1 m^T)v$? Mathematically they give the same result, but in realistic situations their memory requirements can differ by gigabytes.

Suppose you have a 60,000 by 30,000 sparse matrix $X$. You want to center each column of $X$ to form a new matrix $Z$, and then compute $Zv$ for some vector $v$. This closely mimics what happens when you run PCA. The centering step can be written $Z = X - \mathbf 1 m^T$, where $m$ is a column vector containing the column means of $X$ and $\mathbf 1$ is a column vector of ones. If you center the columns naively, almost none of the zeroes are preserved: the matrix becomes dense and occupies over 14 GB of memory (1.8e9 doubles at 8 bytes each).

To avoid this, distribute $v$ so that $Z$ is never stored: compute $(Xv) - \mathbf 1(m^Tv)$ instead. Here $m^Tv$ is a single scalar and $Xv$ is a sparse matrix-vector product, so the whole computation consumes little memory beyond what is already used to store the data.
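Here is a minimal sketch of that trick using only the Matrix package (not MatrixLazyEval itself), on a smaller matrix so it runs quickly; the dimensions and density are made up for illustration:

```r
library(Matrix)

set.seed(1)
# A smaller matrix than the 60,000 x 30,000 one above, so the example
# runs in seconds; the shape and density here are illustrative only.
X <- rsparsematrix(nrow = 6000, ncol = 3000, density = 0.01)
v <- rnorm(ncol(X))
m <- colMeans(X)                      # column means of X

# Naive approach: explicitly center the columns, which densifies X.
Z_dense <- sweep(as.matrix(X), 2, m)  # Z = X - 1 m^T, now fully dense
naive   <- as.vector(Z_dense %*% v)

# Lazy approach: distribute v so that Z is never formed.
lazy <- as.vector(X %*% v) - sum(m * v)   # (Xv) - 1 (m^T v)

max(abs(naive - lazy))   # agrees up to floating-point error

# The dense intermediate is the memory hog:
object.size(Z_dense)     # roughly 6000 * 3000 * 8 bytes, about 144 MB
object.size(X)           # only the nonzero entries plus their indices
```

The package's contribution is to do this kind of bookkeeping for you, composing centering, scaling, regression residuals, and products so that the deferred expression is only ever applied to vectors.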

Written on November 28, 2020