Apache Spark™ 2.0: Keeping Count

Apache Spark™ 2.0 was released in July 2016 with an impressive list of new features, improvements, and bug fixes. If you use Apache Spark™, you have probably studied the release notes, and you may have looked at the long list of more than 2,500 issues addressed in this latest major release (the highest number for any Apache Spark™ release yet), resulting in more than 200,000 lines of new code. No less impressive is the list of over 300 contributors who made this release possible.

Rightfully, there has been a lot of buzz about the major accomplishments in this release, and Google will gladly serve up many blog posts and detailed presentations. However, if you like spreadsheets, data tables, and comparing numbers, and you are curious how this latest Spark release compares to previous releases in terms of issue counts and lines of code, then this blog post is for you.

Since Apache Spark™ is an open source project, its source code and its issue tracker are available online, on GitHub and Apache JIRA respectively. GitHub makes some statistics readily available, like commit frequency or lines of code added in a certain period, and JIRA provides its own query language (JQL) to search and filter issues, the results of which can be presented on dashboards. If you want to know how many developers contributed to a certain component, how long it took on average to fix an issue, or how many people were involved in the average code review, then you need to dig deeper into what the GitHub Developer API and the JIRA REST API have to offer.
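For readers who want to follow along, here is a minimal sketch of pulling the resolved Spark 2.0 issues from Apache JIRA's REST API. The search endpoint and the JQL syntax are standard JIRA; the fixVersion string "2.0.0" and the particular fields collected are our assumptions about what is useful for the analysis below.

```python
# A sketch of fetching resolved Spark 2.0 issues from Apache JIRA.
# Assumptions: fixVersion "2.0.0" matches JIRA's version naming, and
# the field list below covers what the later snippets need.
import requests

JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"
JQL = 'project = SPARK AND fixVersion = "2.0.0" AND resolution = Fixed'

def fetch_issues(jql=JQL, page_size=100):
    """Yield raw issue dicts, paging through the JIRA search results."""
    start = 0
    while True:
        resp = requests.get(JIRA_SEARCH, params={
            "jql": jql,
            "startAt": start,
            "maxResults": page_size,
            "fields": ("components,issuetype,priority,reporter,"
                       "assignee,created,resolutiondate,watches"),
        })
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["issues"]
        start += page_size
        if start >= payload["total"]:
            break

issues = list(fetch_issues())
print(f"{len(issues)} resolved issues in Spark 2.0")
```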

Let's start with a look at how the 2,522 issues break down by component and issue type to underscore the focus points of this release. Since there are about 27 components, let's limit ourselves to Spark Core, GraphX, ML/MLlib, PySpark, SparkR, Spark SQL, and Spark Streaming. Overall, we find that about 1,600 of the issues resolved in Apache Spark™ 2.0 delivered new functionality (Features and Improvements, along with their Tasks and Sub-Tasks), and about 900 were bug fixes. Looking at the component breakdown, we see that about half of the resolved issues fall in the Spark SQL component with 1,254, followed by Machine Learning with 407 and Spark Core with 258.

Issues Resolved by Component and Issue Type
Component    All  Feature  Improvement  Bug  Task  Sub-Task  Doc  Umbrella  Epic  % Bugs  % New Func.
All         2522      103          745  870    29       661   63        12     3     35%          62%
Spark Core   258        6           70  116     3        58    1         3     0     45%          54%
GraphX        20        0            4    3     1        10    0         1     0     15%          80%
ML/MLlib     407       37          155   80     4       100   23         6     0     20%          74%
PySpark      211       15           69   65     0        53    7         1     0     31%          65%
SparkR       116       15           23   37     0        31    6         3     0     32%          62%
SQL         1254       40          335  433     9       409   10         1     3     35%          63%
Streaming     97        1           24   28     1        38    3         0     0     29%          66%
Other        421       10          138  168    12        56   22         1     0     40%          52%
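As a sketch of how such a breakdown can be computed, the pandas snippet below cross-tabulates the issues fetched above by component and issue type. Note that the table groups ML and MLlib together and folds the remaining roughly 20 components into "Other"; this sketch does not attempt to reproduce that grouping.

```python
import pandas as pd

# One row per (component, issue type) pair. Issues without a component
# are counted under "Other", and multi-component issues count once per
# component. Assumes the `issues` list from the previous snippet.
rows = []
for issue in issues:
    f = issue["fields"]
    for comp in [c["name"] for c in f["components"]] or ["Other"]:
        rows.append({"component": comp, "type": f["issuetype"]["name"]})

df = pd.DataFrame(rows)
print(pd.crosstab(df["component"], df["type"]))
```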

Each issue is also categorized by severity, with the highest severity being Blocker and the lowest being Trivial. Breaking down all of the issues resolved in Apache Spark™ 2.0 by component and severity shows that more than 60% of all resolved issues were categorized as Major, Critical, or Blocker. In the PySpark and Machine Learning components, however, the majority of addressed issues were categorized as Minor or Trivial.

Issues Resolved by Component and Severity
Component    All  Blocker  Critical  Major  Minor  Trivial
All         2522       81       123   1525    630      163
Spark Core   258        7        12    136     80       23
GraphX        20        3         4      7      2        4
ML/MLlib     407       15        18    149    178       47
PySpark      211        8         8     78     84       33
SparkR       116        9         8     74     24        1
SQL         1254       44        74    920    186       30
Streaming     97        1         6     51     28       11
Other        421        8        12    215    143       43
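The severity tally follows the same recipe; JIRA exposes severity through the priority field. A short sketch, again assuming the `issues` list from the first snippet:

```python
from collections import Counter

# Severity lives in JIRA's `priority` field; guard against issues
# where the field is unset.
severity = Counter(
    (i["fields"].get("priority") or {}).get("name", "None")
    for i in issues
)
for level in ("Blocker", "Critical", "Major", "Minor", "Trivial"):
    print(f"{level:>8}: {severity.get(level, 0)}")
```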

When we compare this release to previous ones, Spark 2.0 clearly stands out with 2,522 issues addressed, compared with 1,195 in Spark 1.6 and 1,563 in Spark 1.5. Another observation is that over the past several releases the strongest focus of the Spark community has been on Spark SQL and Machine Learning, followed by Spark Core and PySpark.

Resolved Issues by FixVersion Since 1.0 (Excluding Fix-Packs)
Component    All   1.0  1.1  1.2  1.3   1.4   1.5   1.6   2.0
All         9122   454  710  837  795  1046  1563  1195  2522
Spark Core  1305   116  169  196  141   160   111   154   258
GraphX        94     6    8   24   14    12     5     5    20
ML/MLlib    1533    46   92   93  126   221   313   235   407
PySpark      787    38   62   83   59   100   116   118   211
SparkR       291     0    0    0    0    29    71    75   116
SQL         3497    38  198  222  242   316   783   444  1254
Streaming    462    17   17   37   52    86    85    71    97
Other       1860   204  186  227  207   218   198   199   421
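The release-over-release comparison only needs issue counts per fixVersion, which JIRA will return without shipping any issue bodies if you ask for zero results. A sketch; the version strings are our assumptions about JIRA's naming for each release.

```python
# Count resolved issues per release: with maxResults=0 the search
# response still carries the `total`, so each query stays cheap.
import requests

JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

for version in ("1.0.0", "1.1.0", "1.2.0", "1.3.0",
                "1.4.0", "1.5.0", "1.6.0", "2.0.0"):
    jql = (f'project = SPARK AND fixVersion = "{version}" '
           'AND resolution = Fixed')
    total = requests.get(
        JIRA_SEARCH, params={"jql": jql, "maxResults": 0}
    ).json()["total"]
    print(f"Spark {version}: {total} resolved issues")
```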

Thanks to the wealth of attributes that can be recorded with each issue in the JIRA issue tracking system, and thanks to the Apache Spark community and its committers who diligently fill them in, there are many more metrics that we can drill into. The table below contains a few of them, along with some derived and aggregated metrics: the number of reporters, watchers, and code contributors, and the average number of days it takes to resolve an issue that was reported by someone other than the implementer of the code change ("turn-around time"). Since each code contribution (via "pull request") is thoroughly reviewed by other members of the Spark community, we also added columns for the average number of code reviewers per issue and the duration of the code review process.

Community and Productivity Metrics
Component   Issues  Contributors  Reporters  Issues/Contr.  Self-Reported  Contr.-Reported  Watchers/Issue  Days/Issue  Code Reviewers  Days/Review
All           2522           301        409            8.4            66%              89%             3.9        66.1             2.6          9.3
Spark Core     258            90        101            2.9            69%              94%             4.1       113.6             2.7          6.7
GraphX          20            11         11            1.8            70%              95%             4.3       129.7             2.2         21.0
ML/MLlib       407            80         85            5.1            57%              94%             3.9        72.4             2.5         16.1
PySpark        211            59         64            3.6            57%              89%             3.7        55.9             2.6         11.9
SparkR         116            29         30            4.0            53%              92%             4.2        48.2             3.0         13.8
SQL           1254           115        169           10.9            69%              88%             4.0        44.8             2.6          6.0
Streaming       97            35         44            2.8            70%              88%             3.5        86.3             2.7         13.4
Other          421           129        149            3.3            67%              89%             3.3        81.0             2.7          9.6

There are a couple of interesting things to notice in the table above. Overall, the 2,522 issues raised by 409 reporters were addressed by 301 code contributors in Apache Spark™ 2.0, a workload of about 8.4 issues per code contributor. Two thirds of the issues were reported by the same developer who also contributed the code change. Overall, 89 percent of all 2.0 issues were raised by a Spark developer and only 11 percent by Spark users who did not also contribute code. This 8:1 ratio could indicate that Spark developers find and fix issues before they arise in the field, or that issues get raised via other channels, like mailing lists or online forums such as StackOverflow, which are monitored by Spark developers who then enter those issues into JIRA. On average, each issue is followed by 3.9 community members, and it takes about 66 days from the time an issue is raised to the time it is resolved (excluding self-reported issues, where all too often the reporting developer has a code change ready at the time the issue is created). Each proposed code change ("pull request") is reviewed by two to three community members, and the code review process takes nine to ten days on average.
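As an illustration of how the turn-around time can be derived, here is a sketch that measures days from creation to resolution while skipping self-reported issues. Treating the JIRA assignee as the implementer of the code change is our simplifying assumption; the actual analysis may have matched reporters against pull request authors instead.

```python
# Average turn-around time: days from `created` to `resolutiondate`,
# excluding issues where the reporter is also the assignee.
# Assumes the `issues` list from the first snippet.
from datetime import datetime

def jira_ts(s):
    # JIRA timestamps look like "2016-07-26T12:34:56.000+0000"
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S.%f%z")

durations = []
for i in issues:
    f = i["fields"]
    reporter = (f.get("reporter") or {}).get("name")
    assignee = (f.get("assignee") or {}).get("name")
    if reporter and assignee and reporter == assignee:
        continue  # skip self-reported issues
    if f.get("resolutiondate"):
        delta = jira_ts(f["resolutiondate"]) - jira_ts(f["created"])
        durations.append(delta.days)

print(f"avg days/issue: {sum(durations) / len(durations):.1f}")
```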

In the component breakdown, Spark SQL stands out with the most productive developers, each of whom addressed almost 11 issues. Spark SQL also shows the quickest turn-around time, less than 45 days to resolve an issue, and the swiftest code reviews, taking only 6 days on average.

A somewhat biased take on the GraphX component would be that it is the least widely used, since 19 of the 20 addressed issues were raised by Spark developers themselves. This apparent lack of outside usage seems to be reflected in the highest number of days (almost 130) it took to resolve GraphX issues. However, with an average of 4.3 watchers, issues in the GraphX component also garner the highest interest from other members of the community.

Coming soon: A follow-up post that details the lines of code added per component with Spark 2.0.
