Apache Spark™ as the New Engine of Genomics

A handful of talks at the recent Spark Summit in San Francisco neatly underscore three major trends in genomics:

  1. Faster processing of raw genomic data.
  2. New schemas and libraries for genomic analysis.
  3. Genomics as a guide to real-world treatment.

All three trends are moving the center of innovation away from the wetware sequencing process into the realm of computation — where Apache Spark™ is playing an increasingly key role.

The first trend involves reducing the time and expense required to boil raw genomic data down to something researchers can use. Doing so means exposing how the particular genome under investigation varies from the human reference genome, a process that researchers call variant calling. Variant calling turns out to be hugely compute-intensive: well over 40 hours for a single raw genome sequence, even on robust systems.
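To make the idea of variant calling concrete, here is a deliberately tiny sketch in Python. It assumes the sample has already been aligned to the reference and that both are plain strings of equal length. The function name `call_snvs` is made up for illustration. Real variant callers work statistically over millions of overlapping reads and also handle insertions and deletions, which is a large part of why the real process is so expensive.

```python
def call_snvs(reference, sample):
    """Return (position, ref_base, sample_base) for every mismatch.

    Toy illustration only: assumes two pre-aligned strings of equal
    length and reports single-base differences (0-based positions).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical eight-base "genome": the sample differs at position 3.
print(call_snvs("ACGTACGT", "ACGAACGT"))  # [(3, 'T', 'A')]
```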

In fact, as Zaid Al-Ars of Delft University of Technology pointed out in his Spark Summit talk, the cost of the computation for this process is now higher than the cost of the wetware process for acquiring the raw genomic data in the first place. (The cost of that wetware process was $100 million in 2001 and it’s been falling faster than Moore’s law. Current estimates put it at $1,000 or less.) With wetware costs so low and falling, this first new trend began to emerge: a shift in focus from the wetware process to the post-wetware computation.

Al-Ars and his team at Delft took up the challenge of reducing those computation costs by bringing Spark to bear on the traditional data processing pipeline. Spark is well suited to the challenge: a single full-genome sequence can run to hundreds of gigabytes of raw data, and the processing of that data is parallelizable by chromosome (or even by sub-segments of a chromosome).
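The "parallelizable by chromosome" idea can be sketched without Spark at all. The Python below groups toy read records by chromosome and hands each group to a separate worker. The record layout `(chromosome, position)` and the helper names are assumptions for illustration, not the Delft pipeline. A thread pool stands in for a cluster here: it shows the shape of the computation, not the speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_chromosome(reads):
    """Partition (chromosome, position) records by chromosome."""
    groups = defaultdict(list)
    for chrom, pos in reads:
        groups[chrom].append(pos)
    return dict(groups)


def summarize(item):
    """Stand-in for per-chromosome work: count reads and find their span."""
    chrom, positions = item
    return chrom, len(positions), min(positions), max(positions)


def process_in_parallel(reads, workers=4):
    """Run the per-chromosome step independently for each chromosome."""
    groups = group_by_chromosome(reads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chromosome is an independent unit of work.
        return list(pool.map(summarize, sorted(groups.items())))


reads = [("chr1", 100), ("chr2", 50), ("chr1", 300)]
print(process_in_parallel(reads))
# [('chr1', 2, 100, 300), ('chr2', 1, 50, 50)]
```

In Spark the same pattern would typically be expressed as a keyed partitioning step followed by a per-partition transformation.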

But Al-Ars didn’t just take advantage of Spark’s ability to process lots of data in parallel; he implemented dynamic load balancing as well. In the end, by running 163 hardware threads in parallel on each of 20 nodes (for a total of over 3200 parallel threads), he was able to get the compute time for a single raw genome file down from over 40 hours to a single hour — a huge reduction in time and cost. (Check out the abstract, slides, and video of his talk.)
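Why does load balancing matter so much? Chunks of a genome rarely take the same time to process, so a fixed assignment can leave some workers idle while others grind on. The Python sketch below compares a static round-robin assignment with a simple dynamic scheme in which each chunk goes to whichever worker frees up first. This is a generic illustration with made-up chunk costs. The talk summary above does not describe how Al-Ars implemented his load balancing, so none of this should be read as his method.

```python
import heapq


def static_makespan(costs, workers):
    """Finish time when chunk i is pinned to worker i % workers."""
    loads = [0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def dynamic_makespan(costs, workers):
    """Finish time when each chunk goes to the earliest-free worker."""
    free_at = [0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # worker that frees up first
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical chunk costs (arbitrary time units): two heavy, six light.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, workers=2))   # 18
print(dynamic_makespan(costs, workers=2))  # 11
```

With these numbers, the static split lands both heavy chunks on the same worker, while the dynamic split reaches the ideal of 11 units (22 units of work across 2 workers).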

A second trend in the field also takes on the challenge of reducing those initial computation costs, but takes a different approach. Rather than making the traditional pipeline more efficient, researchers like Frank Austin Nothaft at UC Berkeley’s AMPLab are focusing on the legacy BAM files that indicate how a particular genome aligns with the human reference genome. It turns out that the flat format of BAM files isn’t just computationally expensive; it also severely constrains how the data can be optimized and analyzed.

Nothaft and others are tossing out the BAM files and starting from scratch. At the center of their work is ADAM, a Spark-based, open-source library for doing genomic analysis. ADAM defines explicit schemas for individual datatypes and stores the data on disk using the Parquet format. It also lets researchers use common schemas and Spark-based primitives that vastly improve performance, especially around joins.

Nothaft didn't stop there. He also helped develop an index RDD that extends a point-optimized RDD to enable range lookups (which are especially relevant to genomic data). He even integrated Toil, a pipeline manager for massive workflows. In the end, he and his team have been able to achieve 30x - 50x performance improvements at scale — while enabling a new diversity of queries. (See the abstract, slides, and video of Nothaft's Spark Summit talk.)
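The difference between a point lookup and a range lookup is easy to see in a small, single-machine sketch. The Python class below keeps records sorted by genomic position and uses binary search to answer both kinds of query. The class name `PositionIndex` and the `(position, label)` record layout are invented for illustration. This is not the index RDD implementation, which is distributed and considerably more involved.

```python
from bisect import bisect_left, bisect_right


class PositionIndex:
    """Toy sorted index over (position, label) records."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda r: r[0])
        self._positions = [r[0] for r in self._records]

    def get(self, pos):
        """Point lookup: all records exactly at pos."""
        return self.range(pos, pos)

    def range(self, start, end):
        """Range lookup: all records with start <= position <= end."""
        lo = bisect_left(self._positions, start)
        hi = bisect_right(self._positions, end)
        return self._records[lo:hi]


idx = PositionIndex([(500, "C"), (100, "A"), (250, "B")])
print(idx.get(500))         # [(500, 'C')]
print(idx.range(100, 300))  # [(100, 'A'), (250, 'B')]
```

Range queries matter for genomic data because questions are usually framed over a region, such as a gene or an exon, rather than a single coordinate.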

As these researchers streamline the initial analytics process and reduce costs, masses of genetic data are coming online — both from patients themselves and, increasingly, from the biopsied tumors whose peculiar genetic signatures offer vital clues to treatment. That brings us to the third trend: Putting genomics data to work means fitting that data into the clinical process in a way that doctors can understand, manage, and leverage.

Oncologists and other doctors simply aren’t trained in the complex work of sifting through genomic analyses in order to guide interventions. Daniel Quest of the Mayo Clinic’s Center for Individualized Medicine predicts that in time we’ll have genomics data specialists who support doctors much the way radiologists do: by interpreting the results of sophisticated diagnostics to offer a set of predictions and suggestions for treatment.

But first the data has to be gathered and filtered. And as Quest points out in his Spark Summit talk, doing that work means more than crunching numbers on a single patient’s own raw genomic data. It means integrating genomic data from across the population (and across demographics within the population) in order to establish some version of ground truth about normal human variation. Doctors need that ground truth to distinguish expected genomic variation from actual pathology.
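One simple way to picture that "ground truth" is as a table of how often each variant shows up across a population. The Python sketch below computes such frequencies from a toy cohort and then flags the variants in one patient that are rare or unseen. The variant identifiers, the cohort, and the threshold are all made up, and the approach is a simplified assumption for illustration rather than the Mayo Clinic's workflow.

```python
from collections import Counter


def variant_frequencies(cohort):
    """Fraction of individuals in the cohort carrying each variant."""
    counts = Counter()
    for variants in cohort:
        counts.update(set(variants))  # count each person at most once
    total = len(cohort)
    return {variant: n / total for variant, n in counts.items()}


def rare_variants(patient, frequencies, threshold=0.01):
    """Patient variants seen in fewer than `threshold` of the cohort."""
    return sorted(
        v for v in set(patient) if frequencies.get(v, 0.0) < threshold
    )


# Hypothetical cohort of four people and hypothetical variant IDs.
cohort = [{"v1", "v2"}, {"v1"}, {"v1", "v3"}, {"v1"}]
freqs = variant_frequencies(cohort)
print(rare_variants({"v1", "v2", "v9"}, freqs, threshold=0.3))
# ['v2', 'v9']
```

Here `v1` is carried by everyone and is filtered out as expected variation, while `v2` (one carrier in four) and the never-seen `v9` are surfaced for a closer look.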

But even that isn't enough. Getting a truly complete picture means going a step further by combining baseline genomic data with gene expression data from RNA sequencing — and then layering in phenotype data from patient records: gender, age, ailment, behavior, and more.

It’s a dizzying set of challenges that includes annotation, filtering, cohort analysis, sparse data management, patient privacy, and ultimately the need to unify on tools that enable knowledge to be shared between institutions. And for each of those challenges, Quest sees a key role for Apache Spark. (Check out the abstract, slides, and video of his talk.)

We're seeing that advances in genomics — especially around computation and analysis — are the gateway to a revolution in medicine. By all accounts, Apache Spark is the new engine of that revolution.