
STC Contributions to Apache Spark™

In a relatively short period of time, the IBM Spark Technology Center (STC) has made notable contributions to the greater Apache Spark ecosystem, and is energetically committed to moving Spark forward.

Since June 2015, when IBM announced the Spark Technology Center (STC), engineers at the STC have actively contributed to Spark releases: v1.4.x, v1.5.x, v1.6.0, and v1.6.1, as well as to release v2.0 (in progress).

As of this writing, the IBM STC has contributed to 279 JIRAs and counting. About 50% of them address JIRAs reported as major in Apache Spark.

Let’s take a quick peek at these 279 contributions:

  • 127 out of 279 (46%) are deliverables in the Spark SQL area
  • 71 (25%) are in the MLlib module
  • 46 (16%) are in the PySpark module

These top three focus areas account for 87% of the total contributions to date. The rest fall under documentation, Spark Core, Streaming, and other modules.

You can track the STC's cumulative progress on this live dashboard on GitHub.

For Spark 1.6.x specifically, IBM team members made over 95 commits, primarily from the STC. A total of 29 team members contributed to the release (26 of them from the STC), and each contributing engineer is credited in the Spark 1.6.x release notes.


For Spark SQL, we contributed enhancements and fixes across the new Dataset API, the DataFrame API, data types, UDFs, and SQL standard compliance, such as adding EXPLAIN and printSchema capability and support for coalesce and repartition.
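
Here is a minimal sketch of those capabilities, assuming a Spark 1.6-era spark-shell session where sc and sqlContext come predefined; the sample data is invented for illustration:

```scala
import sqlContext.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

df.printSchema()   // prints the schema as a tree: id: integer, label: string
df.explain(true)   // EXPLAIN: shows both the logical and physical query plans

// repartition shuffles to the requested number of partitions;
// coalesce narrows to fewer partitions without a full shuffle
val wide   = df.repartition(8)
val narrow = wide.coalesce(2)
println(narrow.rdd.partitions.length)  // 2
```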

In addition, STC engineers:

  • Added support for the CHAR column data type.
  • Fixed type extractor failures for complex data types.
  • Fixed DataFrame bugs in saving Parquet files partitioned on a long column, and addressed various nullability bugs and optimization issues.
  • Fixed a limitation in the ORDER BY clause to comply with the SQL standard.
  • Fixed a number of UDF code issues to complete stddev support (see the sketch after this list).
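
To show what the completed stddev support looks like in practice, here is a short sketch, reusing the sqlContext and implicits from the example above; the column name and sample values are invented:

```scala
import org.apache.spark.sql.functions.{stddev, stddev_pop, stddev_samp}

val nums = Seq(1.0, 2.0, 3.0, 4.0).map(Tuple1.apply).toDF("value")

// stddev (sample standard deviation), stddev_samp, and stddev_pop
// became built-in aggregates in Spark 1.6
nums.agg(stddev("value"), stddev_samp("value"), stddev_pop("value")).show()

// the same aggregates work from SQL
nums.registerTempTable("nums")
sqlContext.sql("SELECT stddev(value), stddev_pop(value) FROM nums").show()
```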

Machine Learning

In the area of machine learning, the STC team met with key influencers and stakeholders in the Spark community to jointly work on items on the roadmap for machine learning in Spark. Most of the roadmap items discussed went into 1.6; the implementation of the LU decomposition algorithm is slated for the upcoming release.

Beyond the roadmap work, here are some notable contributions:

  • Improved PySpark's distributed matrix algebra by enriching the matrix operations and fixing bugs.
  • Enhanced the Word2Vec algorithm.
  • Added optimized 1st- through 4th-order summary statistics for DataFrames (technically in Spark SQL, but related to machine learning; see the sketch after this list).
  • Greatly enhanced the PySpark API by adding interfaces to the Scala machine learning tools.
  • Made a performance enhancement to the Linear Data Generator, which is critical for unit testing in Spark ML.
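
Two of these items are easy to illustrate. Below is a sketch, again reusing the spark-shell sqlContext from earlier, that exercises the 1st- through 4th-order DataFrame aggregates and the spark.ml Word2Vec API; all sample data is made up:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.functions.{mean, variance, skewness, kurtosis}

// 1st- through 4th-order moments as DataFrame aggregates
// (skewness and kurtosis became built-ins in Spark 1.6)
val sample = Seq(1.0, 2.0, 2.0, 3.0, 5.0, 8.0).map(Tuple1.apply).toDF("x")
sample.agg(mean("x"), variance("x"), skewness("x"), kurtosis("x")).show()

// spark.ml Word2Vec: each row is a sentence tokenized into an array of words
val docs = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val model = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
  .fit(docs)

model.findSynonyms("Spark", 2).show()  // words nearest "Spark" in the learned embedding
```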

And more…

The team also addressed major regressions in the DataFrame API, enhanced support for Scala 2.11, made enhancements to the Spark History Server, and added a JDBC dialect for Apache Derby.
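
The JDBC dialect API is worth a quick look. Below is a toy dialect written in the spirit of the Derby dialect, not the actual contributed code: a dialect claims JDBC URLs by prefix and overrides the type mappings that the generic dialect gets wrong for a particular database:

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// a toy dialect: claims jdbc:derby URLs and overrides one type mapping
object DerbyLikeDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:derby")

  // Derby has no unbounded string type, so map Spark's StringType to CLOB
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("CLOB", Types.CLOB))
    case _          => None
  }
}

// once registered, Spark's JDBC data source consults the dialect
// for every URL it can handle
JdbcDialects.registerDialect(DerbyLikeDialect)
```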

In addition to the JIRA activity, the IBM STC added JDBC dialect support for DB2 and made the Spark Connector for Netezza v0.1.1 available to the public through Spark Packages and a developer blog on IBM's external site. Check it out on Spark Packages.
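
As a usage sketch, reading a DB2 table through Spark's JDBC data source might look like the following; the URL, credentials, driver availability, and table name are all hypothetical placeholders:

```scala
// hypothetical connection details: host, port, database, table,
// and credentials are placeholders, not real endpoints
val db2Url = "jdbc:db2://db2host:50000/SAMPLE"

val props = new java.util.Properties()
props.setProperty("user", "db2user")
props.setProperty("password", "secret")
props.setProperty("driver", "com.ibm.db2.jcc.DB2Driver")

// with a DB2 dialect registered, column type mappings are handled
// correctly by Spark's JDBC data source
val employees = sqlContext.read.jdbc(db2Url, "EMPLOYEE", props)
employees.printSchema()
```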

We’d love for you to take a look at the work we’re doing and help us with this ongoing effort to contribute in a big way to the open-source community.

Spark Technology Center




