Typically, Welch explains, in projects like this one, each newly submitted single-cell dataset must be re-analyzed together with all of the previous datasets, in the order they arrive. The team's new approach allows new datasets to be added to existing ones without reprocessing the older datasets. It also enables researchers to break datasets up into so-called mini-batches to reduce the amount of memory needed to process them.
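To make the idea of online, mini-batch processing concrete, here is a minimal Python sketch. It is not the team's algorithm: as a deliberately simplified stand-in, it keeps a running per-gene mean and variance that is updated one mini-batch at a time, so a dataset that arrives later can be folded in without revisiting the earlier cells. The names `RunningGeneStats` and `mini_batches`, and the tiny example matrices, are hypothetical and chosen only for illustration.

```python
class RunningGeneStats:
    """Per-gene running mean and variance, updated incrementally.

    Illustrative stand-in only; not the method described in the article.
    """

    def __init__(self, n_genes):
        self.count = 0                 # number of cells seen so far
        self.mean = [0.0] * n_genes    # running mean per gene
        self.m2 = [0.0] * n_genes      # running sum of squared deviations

    def update(self, batch):
        """Fold one mini-batch (a list of cells) into the running summary."""
        for cell in batch:
            self.count += 1
            for g, x in enumerate(cell):
                delta = x - self.mean[g]
                self.mean[g] += delta / self.count
                self.m2[g] += delta * (x - self.mean[g])

    def variance(self):
        """Population variance per gene over all cells seen so far."""
        return [m2 / self.count for m2 in self.m2]


def mini_batches(cells, size):
    """Yield successive slices so only `size` cells are handled at once."""
    for start in range(0, len(cells), size):
        yield cells[start:start + size]


# Hypothetical toy data: rows are cells, columns are genes.
dataset_a = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
dataset_b = [[7.0, 6.0], [9.0, 8.0]]  # arrives later

stats = RunningGeneStats(n_genes=2)
for batch in mini_batches(dataset_a, size=2):
    stats.update(batch)

# The new dataset is added without reprocessing dataset_a.
for batch in mini_batches(dataset_b, size=2):
    stats.update(batch)

print(stats.count, stats.mean, stats.variance())
```

Only the compact summary (`count`, `mean`, `m2`) persists between batches, which is why memory use depends on the mini-batch size rather than on the total number of cells.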
"This is crucial for the sets increasingly generated with millions of cells," Welch says. "This year, there have been five to six papers with two million cells or more and the amount of memory you need just to store the raw data is significantly more than anyone has on their computer."
Welch likens the online technique to the continuous data processing done by social media platforms like Facebook and Twitter, which must process continuously generated data from users and serve up relevant posts to people's feeds. "Here, instead of people writing tweets, we have labs around the world performing experiments and releasing their data."
The finding has the potential to greatly improve efficiency for other ambitious projects like the Human Body Map and Human Cell Atlas. Says Welch, "Understanding the normal complement of cells in the body is the first step towards understanding how they go wrong in disease."