Apache Spark™ 2.0: Impressive Improvements to Spark SQL

What a difference a version number makes! With the release of Spark 2.0 (with Spark SQL) on July 26th, Spark SQL capability and performance has improved impressively:

  • Now we can run all 99 TPC-DS queries serially (single-user) at 1TB, and 98 of 99 queries at 10TB, using Spark SQL 2.0 combined with deep tuning.

  • For the subset of the queries that could be run across Spark 1.5, 1.6 and 2.0, Spark SQL 2.0 performance was more than 2.5x better than Spark 1.6 and 4x better than Spark 1.5. at both 1TB and 10TB.

  • For the challenging multi-user concurrency test the percentage of queries able to complete has shot up dramatically with Spark SQL 2.0, again both for 1TB and 10TB data volumes.

Back in the spring of 2016, we wrote about the rapid evolution of Spark SQL performance and hinted that we would have more to say on an even more comprehensive test derived from TPC-DS using the upcoming Spark 2.0 release. Well, it took a little longer than expected but the results show that the wait was worthwhile.

Let’s dig into the each of these 3 highlights in more detail.

First of all, raw SQL functionality, as exercised by the TPC-DS specification has been enhanced in Spark SQL 2.0, particularly in the area of subqueries. The spec allows minor syntactic sugar variations (called Minor Query Modifcations or MQM’s) but that’s it. No manual rewriting of the query is permitted; especially not with knowledge of the content of the data set. Spark SQL 2.0 is able to parse, execute and produce the correct answer for all 99 queries against the 1GB TPC-DS qualification database. IBM’s BigInsights Big SQL is the only other SQL on Hadoop engine that has demonstrated that it can run all 99 queries in specification compliant way. (See a technical preview of IBM Open Platform with Apache Hadoop 4.3.)

TPC-DS defines a total of 5 official data set sizes, ranging from 1TB to 100TB. In our tests we select the 1TB and 10TB data set sizes. We used a 5-node configuration for the 1TB testing and a 20-node cluster for the 10TB tests. Each node in the cluster was a 2-socket x64 server with 128GB of RAM and a set of x SATA drives – a typical Hadoop cluster node.1

In a single user test using 1TB of data, comparing Spark 1.5, 1.6 and 2.0, for a subset of 55 queries that were able to run across all 3 releases, Spark SQL 2.0 was 2.8x faster than Spark 1.6 and a whopping 4.1x faster than Spark 1.5. Lest we think that we are done, I’ll mention that in the game of SQL performance "whack-a-mole" things are never simple. There is always a query that runs worse than in the previous release and Spark SQL is no different. For example (unlucky) Query 13, ran nearly 2x longer with Spark 2.0 than Spark 1.6, using the same set of tuning parameters.

Graph of 1TB Total Time

In fact, Spark SQL 2.0 was able to run all 99 TPC-DS queries in just a little more than the time it took Spark 1.5 to run the 55 queries it was capable of running.

When we expanded the data volume from 1TB to 10TB and simultaneously increased the cluster size from 5 to 20 nodes, we observed very similar ratios. The elapsed time for Spark SQL 2.0 to run all 49 comparable queries was 2.7x shorter than Spark SQL 1.6 and 4x shorter than Spark SQL 1.5.

Graph of 10TB Total Time

So what enabled this leap forward? There are three main factors:

  1. Enhanced SQL support. As has been noted in blogs by IBM and Databricks, there has been a significant infusion of SQL capability in Spark SQL 2.0, most notably the addition of enough subquery support to enable all the TPC-DS queries to run.
  2. Continued focus on an improved optimizer and an improved, optimized run-time.
  3. Our team’s growing experience in understanding, analyzing and tuning Spark with its large pool of tuning parameters. You can read more details about the tuning parameters we are using in this Slideshare. In the bar charts above the only difference between Spark 2.0 and Spark 2.0 with deep tuning results is a set of tuning parameter changes related to memory and to various timeouts. The Spark binary itself is identical.

In real customer environments, as well as what is required by the TPC-DS specification, multi-user throughput is essential. In TPC-DS there is a minimum of 4 concurrent streams of execution. In our previous testing we found it extremely challenging to be able to even run most of the queries in concurrent multi-user test environment at 1TB or above. With Spark 2.0 and our deeper understanding of all the Spark tunables we have made tremendous progress in getting most of the queries to run most of the time at the 1TB level. The following graph shows how the mix of queries succeeding and failing has evolved from Spark 1.5 to Spark 2.0 at 1TB.

1TB concurrent user throughput test

As we increase the data volume to 10TB we see a fall-off in the success rate in running the queries with 4 concurrent users, but still a significant improvement using Spark 2.0 over Spark 1.6.

10TB concurrent user throughput test

The Spark Technology Center’s Spark SQL development team, working closely with the performance team, is continuing to analyze, tune and work with JIRAs to further improve these results, since even with all the progress that has been made with Spark SQL 2.0, there is still further innovation required to be able to run all the queries, and run them well. For a typical enterprise wanting to use Spark SQL in production, the possible set of tuning parameters — especially the timeout parameters — may prove to be daunting, and it would be far better if these could be successfully automated and dynamically adjusted based on query workload requirements. Finally, for a significant set of queries there remain tremendous opportunities to further improve performance, especially when there are fact to fact table joins and when data volumes, especially of intermediate results, exceed available real memory. Together, the 10 longest queries account for over half the total elapsed time in the single-user performance test. We expect that the rapid evolution of Spark SQL will continue.

  1. Hardware specifications: 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. Software specifications: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0