Spark 2.0

Apache Spark™ 2.0: Machine Learning. Under the Hood and Over the Rainbow.

Now that the dust has settled on Apache Spark™ 2.0, the community has a chance to catch its collective breath and reflect a little on what was achieved for the largest and most complex release in the project's history.

One of the main goals of the machine learning team here at the Spark Technology Center is to continue to evolve Apache Spark as the foundation for end-to-end, continuous, intelligent enterprise applications. With that in mind, we'll briefly mention some of the major new features in the 2.0 release in Spark's machine-learning library, MLlib, as well as a few important changes beneath the surface. Finally, we'll cast our minds forward to what may lie ahead for version 2.1 and beyond.

For MLlib, there were a few major highlights in Spark 2.0:

  • The older RDD-based API in the mllib package is now in maintenance mode, and the newer DataFrame-based API (in the ml package), with its support for DataFrames and machine learning pipelines, has become the focus of future development for machine learning in Spark.
  • Full support for saving and loading pipelines in Spark's native format, across languages (with the exception of cross-validators in Python).
  • Additional algorithm support for Python and R.

While these have already been well covered elsewhere, the STC team has worked hard to help make these initiatives a reality — congratulations!

Another key focus of the team has been feature parity — both between mllib and ml, and between the Python and Scala APIs. In the 2.0 release, we're proud to have contributed significantly to both areas, in particular reaching close to full parity for PySpark in ml.

Under the Hood

Despite the understandable attention paid to major features in such a large release, what happens under the hood in terms of bug fixes and performance improvements can be equally important (if not more so!).

While the team has again been involved across the board in this area, here we'd like to highlight just one example of a small (but subtle) issue that has dramatic implications for performance.

We Need to Work on our Communication...

Linear models, such as logistic regression, are the work-horses of machine learning. They're especially useful for very large datasets, such as those found in online advertising and other web-scale predictive tasks, because they are relatively less complex than, say, deep learning, and so are easier to train and more scalable. As such, they are among the most-used algorithms around, and were among the earliest algorithms added to Spark ml.

In distributed machine learning, the bottleneck for scaling large models (that is, where there are a large number of unique variables in the model) is often not computing power, as one might think, but communication across the network. This is because these algorithms are iterative in nature, and tend to send a lot of data back and forth between nodes in a cluster in each iteration. Therefore, it pays to be as communication-efficient as possible when constructing such an algorithm.

While working on adding multi-class logistic regression to Spark ML (part of the ongoing push towards parity between ml and mllib), STC team member Seth Hendrickson realized that, due to the way that Spark automatically serializes data when inter-node communication is required (e.g. during a reduce or aggregation operation), the aggregation step of the logistic regression training algorithm resulted in 3x more data being communicated than necessary.

This is illustrated in the chart below, where we compare the amount of shuffle data per iteration as the feature dimension increases.

Shuffle size

Once fixed, this resulted in a decrease in per-iteration time of over 11% (shown in the chart below), as well as a decrease in overall execution time of over 20%, mostly due to lower shuffle read time and less data being broadcast at each iteration. We would expect the performance difference to be even larger as data and cluster size increases1.

Iteration time

Subsequently, various Spark community members rapidly addressed the same issue in linear regression and AFT survival regression (these patches will be released as part of version 2.1).

So there you have it - Spark 2.0 even improves your communication skills!

Over the Rainbow

What does it mean when we refer to Apache Spark as the "foundation for end-to-end, continuous, intelligent enterprise applications"? In the context of Spark's machine learning pipelines, we believe this means usability, scalability, streaming support, and closing the loop between data, training and deployment to enable automated, intelligent workflows - in short the "pot of gold" at the end of the rainbow!

In line with this vision, the focus areas for the team for Spark 2.1 and beyond include:

  • Achieving full feature parity between mllib and ml
  • Integrating Spark ML pipelines with the new structured streaming API to support continuous machine-learning applications
  • Exploring additional model export capabilities including standardized approaches such as PMML
  • Improving the usability and scalability of the pipeline APIs, for example in areas such as cross-validation and efficiency for datasets with many columns

We'd love to hear your feedback on these areas of interest — email me at, and we look forward to working with the Spark community to help drive these initiatives forward.

  1. Tests were run on a relatively small cluster with 4 worker nodes (each with 48 cores, 100GB memory). Input data ranged from 6GB to 200GB, with 48 partitions, and was sized to fit in cluster memory at the maximum feature size. The quoted performance improvement figures are for the maximum feature size.