Stocator – Fast Lane for Connecting Object Stores to Spark

Spark users need a way to interface with object stores – something fast, direct, and optimized for use with stored objects.

Now you have it.


Apache Spark can work with multiple data sources, including object stores such as Amazon S3, OpenStack Swift (for example, on IBM SoftLayer), Azure Blob Storage, and more. To access an object store, Spark uses Hadoop modules that contain drivers for the various object stores. In most flows, Spark sends a request with the data source URI to the Hadoop MapReduce Client Core and receives in response the file system driver capable of handling that URI. In some specific flows, Spark may bypass the Hadoop MapReduce Client Core and obtain the file system driver directly, via the Hadoop Path resolver.
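The lookup described above can be pictured as a registry keyed by URI scheme. The following is a minimal sketch of that idea, not Hadoop's actual code: the `DriverRegistry` object and the driver names are hypothetical and exist only for illustration.

```scala
import java.net.URI

// Toy model of scheme-based driver lookup. All names here are hypothetical;
// this is not the Hadoop FileSystem API.
object DriverRegistry {
  // Maps a URI scheme to the name of the driver that handles it.
  private val drivers: Map[String, String] = Map(
    "hdfs"  -> "HdfsDriver",
    "s3a"   -> "S3Driver",
    "swift" -> "SwiftDriver"
  )

  // Returns the driver registered for the URI's scheme, if any.
  // A URI without a scheme (for example "data.txt") yields None.
  def resolve(uri: String): Option[String] =
    Option(new URI(uri).getScheme).flatMap(s => drivers.get(s.toLowerCase))
}
```

Calling `DriverRegistry.resolve("swift://container.service/data.txt")` picks the entry for the `swift` scheme, while an unregistered scheme returns `None`.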

Spark needs only a small set of object store functions. More specifically, Spark requires object listing, object creation, object reading, and getting data partitions. Hadoop drivers, however, must be compliant with the Hadoop ecosystem. This means they support many more operations, such as shell operations on directories, including move, copy, and rename, which are not native object store operations. Moreover, Hadoop MapReduce Client Core uses temp files and folders for every write operation. Those temp files and folders are then renamed, copied, and deleted. When the target is an object store, this leads to dozens of unnecessary requests. It’s clear that Hadoop is designed to work with file systems and not object stores.
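To see why temp files are costly on an object store, consider a toy model. This is only a sketch under stated assumptions: `CountingStore` and `RenameCommit` are hypothetical names, rename is modeled as a copy followed by a delete, and the counts it produces are not measurements of any real driver.

```scala
import scala.collection.mutable

// Toy in-memory object store that counts requests. Hypothetical model:
// real request counts depend on the driver and the output committer.
class CountingStore {
  private val objects = mutable.Map.empty[String, String]
  var requests = 0

  def put(name: String, data: String): Unit = {
    requests += 1
    objects(name) = data
  }

  def copy(from: String, to: String): Unit = {
    requests += 1
    objects(to) = objects(from)
  }

  def delete(name: String): Unit = {
    requests += 1
    objects.remove(name)
    ()
  }

  // Assumption: with no native rename, a rename costs a copy plus a delete.
  def rename(from: String, to: String): Unit = {
    copy(from, to)
    delete(from)
  }

  def names: Set[String] = objects.keySet.toSet
}

object RenameCommit {
  // Writes one part under a temporary name, then renames it into place.
  def writePart(store: CountingStore, dir: String, part: Int, data: String): Unit = {
    val tmp = s"$dir/_temporary/part-$part"
    store.put(tmp, data)
    store.rename(tmp, s"$dir/part-$part")
  }
}
```

In this model, every part written through a temporary name costs three requests (put, copy, delete), where a direct `put` to the final name would cost one.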

Stocator – The New Object Store Cloud Connector

We took a new approach and developed Stocator, a unique object store connector for Apache Spark. The good news is that we just contributed Stocator to the open source community, under Apache License 2.0. It comes packaged with the OpenStack Swift driver, but contains a pluggable mechanism capable of supporting additional object store drivers.

Stocator is written in Java, and implements an HDFS interface to simplify its integration with Spark. There is no need to modify Spark’s code to use the new connector: it can be compiled with Spark via a Maven build, or provided as an input package without the need to re-compile Spark.

Stocator works smoothly with the Hadoop MapReduce Client Core module, but uses an object store approach. Apache Spark continues to work with Hadoop MapReduce Client Core and can use the various existing Hadoop output committers. Stocator intercepts internal requests and adapts them before they reach the object store.

For example, when Stocator receives a request to create a temporary object or directory, it creates the final object instead of a temporary one. Stocator also uses streaming to upload objects. Uploading an object via streaming, without knowing its content length in advance, eliminates the need to create the object locally, calculate its length, and only then upload it to the object store. Swift is unique in its ability to support streaming, as compared to other object store solutions available on the market.
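The redirection of temporary names to final ones can be sketched as a simple path rewrite. This is an illustration only: the `DirectWrite` object, the `_temporary/<attempt>/` layout, and the final naming are assumptions made for this example, not Stocator's actual scheme.

```scala
// Toy sketch of redirecting a temporary path to its final object name.
// The "_temporary/<attempt>/" layout and the final naming are assumptions
// made for illustration; they are not Stocator's actual scheme.
object DirectWrite {
  private val Temp = "(.*)/_temporary/([^/]+)/([^/]+)".r

  // Maps "<dir>/_temporary/<attempt>/<part>" to "<dir>/<part>-<attempt>".
  // Any name that does not match that layout is returned unchanged.
  def finalName(requested: String): String = requested match {
    case Temp(dir, attempt, part) => s"$dir/$part-$attempt"
    case other                    => other
  }
}
```

Keeping the attempt identifier in the final name is one possible design choice: it would let objects written by different attempts of the same task be told apart without a rename step.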

Because Stocator is explicitly designed for object stores, it has a very different architecture than the existing Hadoop driver. It doesn’t depend on robust Hadoop modules, but rather interacts directly with object stores via the JOSS package. Below is a diagram of the flow.

We demonstrate Stocator by example. Consider an array of 10 elements that we want to persist as a text object. Assume our array is equally distributed across 10 tasks in Spark.

// Ask for 10 partitions explicitly, so that each element goes to its own task
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val distData = sc.parallelize(data, 10)

The following table summarizes the overall number of HTTP requests to Swift, and compares the number of tasks involved for the existing Hadoop driver and Stocator.

Stocator is faster primarily because it doesn’t use temp files or directory structures when it persists data in Swift.

How can you use Stocator with an object store?

To use Stocator, you need access to an object store. An object store allows you to store any type of object (text data, binary data, Parquet objects, etc.). After that, it’s just a matter of using Spark to analyze the content of the objects.

For example, OpenStack Swift can be obtained via the community OpenStack code, e.g., with Swift-All-in-One, or via an object store service such as IBM SoftLayer. The detailed instructions for how to set up Spark with Swift can be found here.

What’s next?

We believe strongly in the power of the Spark-Object Store integration and will continue to support the new driver in the open source community. Our future plans include exploring more improvements to the way Spark interacts with object stores—and keeping up the open source spirit.

It’s all open source. Don’t waste time – go out and try it!

