Apache Spark™ 2.0: Line by Line

Two weeks ago we took stock of the latest release of Apache Spark™ by correlating the number of issues addressed with the number of people involved. We found that Spark 2.0 received more contributions from more contributors than any of its prior releases. In this blog post, we take a look at the impact all of those contributions had on the Apache Spark code base.

To get a better sense of the size of the code changes in Spark 2.0, let's start with a quick assessment of the Spark code base overall. As of version 2.0, the Spark project is made up of nearly 13,500 files containing source code, test data, documentation, configuration, third-party code, scripts, and various other files required for infrastructure and development. The majority of files (7,818) contain test and example data (JSON, TXT, CSV, etc.), followed by 5,251 files of source code (Scala, Java, Python, R, etc.). All of the source code adds up to more than 814,000 lines (about 73% of all lines), and the data for tests and examples adds up to more than 242,000 lines. From the initial commit to the Spark project in March 2010 until the finalization of the Spark 2.0 release, there have been close to 17,000 commits resulting in more than 99,000 file changes, with an average of 11.2 lines per change.
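For readers who want to reproduce a breakdown along these lines, here is a minimal Python sketch that walks a local Spark checkout and tallies files and lines per category. The extension-to-category mapping is a simplified assumption for illustration; the exact rules behind the numbers below may differ.

```python
import os
from collections import Counter

# Hypothetical extension-to-category mapping; the post's exact rules may differ.
CATEGORIES = {
    "Source Code": {".scala", ".java", ".py", ".r"},
    "Data": {".json", ".txt", ".csv"},
    "Documentation": {".md"},
    "Configuration": {".properties", ".conf", ".yml"},
    "Scripts": {".sh", ".cmd"},
}

def categorize(path):
    """Map a file path to one of the categories above, or 'Other'."""
    ext = os.path.splitext(path)[1].lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "Other"

files, lines = Counter(), Counter()
for root, _, names in os.walk("spark"):   # path to a local Spark clone
    if os.sep + ".git" in root:
        continue                          # skip Git's internal files
    for name in names:
        path = os.path.join(root, name)
        category = categorize(path)
        files[category] += 1
        try:
            with open(path, errors="ignore") as handle:
                lines[category] += sum(1 for _ in handle)
        except OSError:
            pass                          # unreadable file: count files only

for category, count in files.most_common():
    print(f"{category:15} {count:6,} files {lines[category]:10,} lines")
```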

Number of Files and Number of Lines by Type of File

| Category | Files | Lines | Lines/Total | Changes | Lines/Change |
| --- | ---: | ---: | ---: | ---: | ---: |
| Source Code | 5,251 | 814,112 | 73.1% | 69,903 | 11.6 |
| Data (Tests, Examples) | 7,818 | 242,481 | 21.8% | 19,898 | 12.2 |
| Documentation | 73 | 24,623 | 2.2% | 2,201 | 11.2 |
| Configuration | 75 | 9,337 | 0.8% | 2,958 | 3.2 |
| 3rd-party | 25 | 4,887 | 0.4% | 982 | 5.0 |
| Scripts | 84 | 4,741 | 0.4% | 1,225 | 3.9 |
| Other | 212 | 13,768 | 1.2% | 2,536 | 5.4 |
| All | 13,498 | 1,113,949 | 100.0% | 99,703 | 11.2 |


Most of the source code is written in Scala, with 527,557 lines of code (LOC), followed by Java with 140,547 lines and Python with 60,545 lines. The table below lists the most prominent languages, roughly in descending order of total lines, as reported by the open-source tool cloc. Test data, documentation, and 3rd-party libraries are mostly excluded.

While it's interesting to distinguish lines that contain code from blank lines and lines of comments, as in the table below, we'll use the term "lines of code" (LOC) to include blank lines and comments.

Lines of Code by Language

| Language | Files | Lines | Blank | Comment | Code | % of All Lines |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Scala | 2,428 | 527,557 | 12% | 25% | 63% | 62.24% |
| Java | 742 | 140,547 | 13% | 17% | 70% | 16.58% |
| Python | 234 | 60,545 | 16% | 38% | 46% | 7.14% |
| Hive Query | 1,560 | 52,823 | 22% | 7% | 72% | 6.23% |
| R | 52 | 21,991 | 10% | 44% | 46% | 2.59% |
| CSS | 19 | 9,192 | 13% | 3% | 85% | 1.08% |
| JSON | 33 | 7,017 | 0% | 0% | 100% | 0.83% |
| Maven | 34 | 7,359 | 3% | 11% | 86% | 0.87% |
| SQL | 208 | 5,363 | 1% | 4% | 95% | 0.63% |
| Javascript | 32 | 6,107 | 16% | 19% | 65% | 0.72% |
| Shell Script | 68 | 4,055 | 15% | 37% | 48% | 0.48% |
| Thrift | 2 | 1,234 | 16% | 39% | 45% | 0.15% |
| XML | 10 | 1,037 | 13% | 25% | 62% | 0.12% |
| ANTLR4 Grammar | 1 | 954 | 12% | 2% | 86% | 0.11% |
| HTML | 6 | 460 | 10% | 6% | 85% | 0.05% |
| DOS Batch | 16 | 687 | 14% | 38% | 48% | 0.08% |
| Ruby | 3 | 283 | 19% | 21% | 60% | 0.03% |
| make | 2 | 206 | 16% | 11% | 73% | 0.02% |
| YAML | 3 | 112 | 3% | 4% | 94% | 0.01% |
| C | 1 | 49 | 20% | 41% | 39% | 0.01% |
| TSV | 1 | 4 | 0% | 0% | 100% | 0.00% |
| All | 5,455 | 847,582 | 13% | 23% | 64% | 100.00% |
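A per-language breakdown like the table above can be produced with cloc directly; the following Python sketch wraps a cloc run and parses its CSV output. The exclude list is an assumption for illustration and does not necessarily match the filters used for this post.

```python
import csv
import io
import subprocess

# Run cloc over a local Spark clone; the exclude list is an assumption.
result = subprocess.run(
    ["cloc", "--csv", "--quiet", "--exclude-dir=data,docs", "spark"],
    capture_output=True, text=True, check=True,
)

# cloc's CSV output includes the columns: files, language, blank, comment, code.
for row in csv.DictReader(io.StringIO(result.stdout.strip())):
    total = int(row["blank"]) + int(row["comment"]) + int(row["code"])
    print(f"{row['language']:20} {row['files']:>6} files {total:>8,} lines")
```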


The Spark project is made up of various components; in fact, Spark's issue tracker on Apache JIRA counts 27. The most prominent ones are Spark Core, GraphX, MLlib, SQL, and Streaming, along with the popular language interfaces for Python (PySpark) and R (SparkR). These components are reflected in the directory structure of the Spark project, which we can leverage to analyze lines of code by component, keeping in mind that components can overlap. For example, there are close to 32,000 lines of machine learning code in PySpark (e.g. python/pyspark/mllib/classification.py), and more than 12,000 lines of structured streaming code are counted under both Spark SQL and Streaming. Most of the components have application code, test code, and example code, as well as data for tests and examples. Spark SQL clearly stands out with more than 365,000 lines of code (including tests and examples) and over 219,000 lines of test data. Across all components, about 35% of all source code is dedicated to tests. That focus on testing is vital for the quality and stability of the code base, especially for an open-source project as dynamic as Apache Spark, which over the past year averaged more than 13 code contributions (commits) per day.
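To make the overlap concrete, here is a sketch of such a directory-to-component mapping. The path prefixes are illustrative assumptions, not the exact mapping behind the tables below.

```python
# Illustrative path prefixes per component; a real mapping would be richer.
COMPONENT_PREFIXES = {
    "Spark Core": ("core/",),
    "GraphX": ("graphx/",),
    "ML/MLlib": ("mllib/", "python/pyspark/ml/", "python/pyspark/mllib/"),
    "PySpark": ("python/",),
    "Spark SQL": ("sql/",),
    "SparkR": ("R/",),
    "Streaming": ("streaming/", "python/pyspark/streaming/"),
}

def components_for(path):
    """Return every component a repository path counts toward (may overlap)."""
    matches = [component for component, prefixes in COMPONENT_PREFIXES.items()
               if path.startswith(prefixes)]
    return matches or ["Other"]

# A PySpark MLlib file counts toward both ML/MLlib and PySpark:
print(components_for("python/pyspark/mllib/classification.py"))
# ['ML/MLlib', 'PySpark']
```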

Lines of Code by Component

| Component | Lines | Main Code | Test Code | Examples | Data | Other |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Spark Core | 142,606 | 88,621 | 50,286 | 2,525 | 545 | 629 |
| GraphX | 9,381 | 6,333 | 2,153 | 755 | 31 | 109 |
| ML/MLlib | 160,102 | 83,283 | 42,108 | 23,096 | 11,439 | 176 |
| PySpark | 63,981 | 46,419 | 9,206 | 7,045 | 12 | 1,299 |
| Spark SQL | 586,396 | 200,059 | 163,700 | 1,833 | 219,460 | 1,344 |
| SparkR | 25,429 | 19,209 | 5,261 | 507 | 0 | 452 |
| Streaming | 65,732 | 32,406 | 26,553 | 4,144 | 2,477 | 152 |
| Web UI | 29,928 | 15,983 | 3,924 | 0 | 7,975 | 2,046 |
| Other | 97,927 | 35,006 | 12,530 | 156 | 531 | 49,704 |
| All | 1,113,949 | 481,835 | 302,924 | 33,347 | 242,458 | 53,385 |


Now that we have a sense of the size of the Apache Spark project, let's take a closer look at the changes contributed during this latest release. Overall, more than 203,000 lines of code were added between version 1.6.0 and version 2.0.0, the vast majority of them in Spark SQL with over 148,000 lines. While the implementation of new features often has a net positive impact on the number of LOC, replacing old code with a more effective implementation may decrease it, as can be seen for the GraphX main code (-55) and the Spark Core examples (-541).
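A minimal sketch of how such release-over-release deltas can be measured: diff the two release tags with Git's --numstat output and sum the net line changes per top-level directory. It assumes a local Spark clone and Spark's usual v1.6.0/v2.0.0 tag naming; the top-level directory is only a crude stand-in for the component mapping used above.

```python
import subprocess
from collections import Counter

# Net line changes per file between the two release tags (run in a Spark clone).
result = subprocess.run(
    ["git", "diff", "--numstat", "v1.6.0", "v2.0.0"],
    capture_output=True, text=True, check=True,
)

net = Counter()
for line in result.stdout.splitlines():
    added, removed, path = line.split("\t", 2)
    if added == "-":                       # binary files report "-" counts
        continue
    top_level = path.split("/", 1)[0]      # crude component proxy
    net[top_level] += int(added) - int(removed)

for directory, delta in net.most_common():
    print(f"{directory:12} {delta:+10,}")
```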

Lines of Code Added in Spark 2.0 (since 1.6.0)

| Component | Lines | Main Code | Test Code | Examples | Data | Other |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Spark Core | 6,004 | 2,050 | 4,515 | -541 | 0 | -20 |
| GraphX | 401 | -55 | 35 | 422 | 0 | -1 |
| ML/MLlib | 30,681 | 14,809 | 8,055 | 5,463 | 2,353 | 1 |
| PySpark | 13,758 | 9,372 | 1,903 | 2,067 | 5 | 411 |
| Spark SQL | 148,748 | 106,312 | 39,375 | 1,495 | 1,545 | 21 |
| SparkR | 6,231 | 4,639 | 1,197 | 292 | 0 | 103 |
| Streaming | 15,699 | 6,661 | 6,496 | 56 | 2,477 | 9 |
| Web UI | 5,442 | 2,046 | 530 | 0 | 2,435 | 431 |
| Other | 2,076 | 905 | 1,934 | -150 | 0 | -613 |
| All | 203,318 | 131,064 | 56,364 | 6,198 | 8,809 | 883 |

The impact of the code additions in Spark 2.0 becomes even more apparent when we express the net LOC additions as a percentage of each component's total lines. About a quarter of all lines of code in Spark SQL, SparkR, and Streaming were added during the Spark 2.0 release.

Lines of Code Added in Spark 2.0 as Percent of Total

| Component | Lines | Main Code | Test Code | Examples | Data | Other |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Spark Core | 4% | 2% | 9% | -21% | 0% | -3% |
| GraphX | 4% | -1% | 2% | 56% | 0% | -1% |
| ML/MLlib | 19% | 18% | 19% | 24% | 21% | 1% |
| PySpark | 22% | 20% | 21% | 29% | 42% | 32% |
| Spark SQL | 25% | 53% | 24% | 82% | 1% | 2% |
| SparkR | 25% | 24% | 23% | 58% | 0% | 23% |
| Streaming | 24% | 21% | 24% | 1% | 100% | 6% |
| Web UI | 18% | 13% | 14% | 0% | 0% | 21% |
| Other | 2% | 3% | 15% | -96% | 0% | -1% |
| All | 18% | 27% | 19% | 19% | 4% | 2% |


The Spark source code is managed in a Git repository, which records file changes as lines added and lines removed. We can query the Git log with the --numstat option to aggregate all the lines that were added and removed by individual code contributions. The inclusion of the Hive Thrift Service 1.2 into Spark SQL was by far the biggest single code contribution, with close to 70,000 lines of new code. More than 7,800 lines of code were removed to replace the ANTLR3-based SQL parser with a new SQL parser based on ANTLR4 (+4,061), and more than 7,700 lines were removed with the exclusion of the streaming connectors for Flume, Akka, MQTT, Twitter, and ZeroMQ. The Flume streaming connector was later added back (+3,745), and the streaming connectors for Akka, MQTT, Twitter, and ZeroMQ found a new home in the Apache Bahir project. One of the most notable enhancements in Spark SQL is the additional sub-query support that enables all of the TPC-DS 1.4 benchmark queries, contributed in more than 3,600 lines of new code. Read all about the performance improvements in Berni Schiefer's blog post, Apache Spark™ 2.0: Impressive Improvements to Spark SQL.
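Here is a minimal Python sketch of that per-commit aggregation: it walks `git log --numstat` output between the two release tags, sums the net lines per commit, and prints the ten largest changes. It is an approximation of the approach described above, assumes a local Spark clone with the v1.6.0 and v2.0.0 tags, and uses a simple parsing heuristic rather than a full Git log parser.

```python
import subprocess
from collections import Counter

# Per-commit line totals from `git log --numstat` between the release tags.
result = subprocess.run(
    ["git", "log", "--numstat", "--format=%h %s", "v1.6.0..v2.0.0"],
    capture_output=True, text=True, check=True,
)

net, subjects, current = Counter(), {}, None
for line in result.stdout.splitlines():
    parts = line.split("\t")
    if len(parts) == 3 and parts[0].isdigit():  # "added removed path" stat line
        net[current] += int(parts[0]) - int(parts[1])
    elif line.strip():                          # "hash subject" header line
        current, _, subject = line.partition(" ")
        subjects[current] = subject

# Ten commits with the largest absolute net change.
top = sorted(net.items(), key=lambda item: abs(item[1]), reverse=True)[:10]
for sha, loc in top:
    print(f"{sha} {loc:+8,} {subjects[sha][:70]}")
```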

Top 10 Code Contributions to Spark 2.0 by Lines of Code

| Hash | LOC | Summary |
| --- | ---: | --- |
| 7feeb82 | +69,895 | [SPARK-14987][SQL] inline hive-service (cli) into sql/hive-thriftserver |
| a9b93e0 | -7,872 | [SPARK-14211][SQL] Remove ANTLR3 based parser |
| 06dec37 | -7,721 | [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages |
| 970635a | +5,370 | [SPARK-12362][SQL][WIP] Inline Hive Parser |
| b28fe44 | -5,352 | [SPARK-14770][SQL] Remove unused queries in hive module test resources |
| 96534aa | +4,306 | [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local |
| 600c0b6 | +4,061 | [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4 |
| 24587ce | +3,745 | [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark |
| d7bf318 | +3,632 | [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL |
| 3134f11 | +3,353 | [SPARK-12177][STREAMING][KAFKA] Update KafkaDStreams to new Kafka 0.10 Consumer API |


The Spark Technology Center (STC) has also experienced strong growth during the Spark 2.0 release. We now have more than 50 team members worldwide, and more than 30 of our developers have contributed code to the Spark 2.0 release, with a strong focus on Spark SQL, Machine Learning, and PySpark, as Vijay Bommireddipalli described in his blog post Apache Spark™ 2.0: New and Noteworthy in the Latest Release. You can find all of our contributions to Apache Spark on our JIRA dashboard at jiras.spark.tc.


Contributors to Spark 2.0

| Component | World | STC | STC Share |
| --- | ---: | ---: | ---: |
| Spark Core | 90 | 8 | 9% |
| GraphX | 11 | 1 | 9% |
| ML/MLlib | 80 | 13 | 16% |
| PySpark | 59 | 14 | 24% |
| Spark SQL | 115 | 16 | 14% |
| SparkR | 29 | 4 | 14% |
| Streaming | 35 | 6 | 17% |
| Web UI | 29 | 3 | 10% |
| Other | 121 | 13 | 11% |
| All | 301 | 32 | 11% |

LOC Added to Spark 2.0

| Component | World | STC | STC Share |
| --- | ---: | ---: | ---: |
| Spark Core | 6,004 | 320 | 5% |
| GraphX | 401 | 18 | 4% |
| ML/MLlib | 30,681 | 8,300 | 27% |
| PySpark | 13,758 | 5,973 | 43% |
| Spark SQL | 148,748 | 10,978 | 7% |
| SparkR | 6,231 | 757 | 12% |
| Streaming | 15,699 | -150 | -1% |
| Web UI | 5,442 | 311 | 6% |
| Other | 2,076 | 544 | 26% |
| All | 203,318 | 20,988 | 10% |
