Revving Up Performance for the Tachyon File System

Tachyon is an open source, memory-centric distributed storage system. Since its release to open source in April 2013, it has flourished into a fast-growing project with more than 150 contributors from about 50 organizations.

As a fault-tolerant storage system, Tachyon knows how to preserve data reliably. In contrast to existing solutions, Tachyon doesn’t use replication to achieve fault tolerance. Instead, it relies on recomputation: using data lineage, a well-known technique that Tachyon brings into the storage layer, it tracks the operations that produced each piece of data, so lost data can be recomputed from its inputs. While replication remains a common approach today, its drawback is that it is generally limited by network or disk bandwidth.
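To make the lineage idea concrete, here is a minimal sketch in Python. This is purely illustrative and does not reflect Tachyon’s actual API or internals; the `LineageStore` class and its methods are hypothetical names. The point is that recording *how* each dataset was produced lets the system recompute lost data instead of keeping replicas.

```python
# Illustrative sketch (not Tachyon's API): fault tolerance via lineage.
# Instead of storing replicas, we record the operation that produced each
# dataset, so lost data can be recomputed from its inputs.

class LineageStore:
    def __init__(self):
        self.data = {}      # name -> in-memory value (may be evicted/lost)
        self.lineage = {}   # name -> (function, input names)

    def put(self, name, fn, inputs=()):
        """Register how `name` is computed, then materialize it."""
        self.lineage[name] = (fn, inputs)
        self.data[name] = fn(*[self.get(i) for i in inputs])

    def get(self, name):
        if name not in self.data:                  # lost from memory?
            fn, inputs = self.lineage[name]
            self.data[name] = fn(*[self.get(i) for i in inputs])  # recompute
        return self.data[name]

store = LineageStore()
store.put("raw", lambda: [1, 2, 3])
store.put("doubled", lambda xs: [2 * x for x in xs], inputs=("raw",))

del store.data["doubled"]           # simulate a crash or cache eviction
print(store.get("doubled"))         # recomputed from lineage: [2, 4, 6]
```

Because recomputation chases lineage recursively, even a chain of lost intermediate datasets can be rebuilt from whatever inputs still exist.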

When it comes to Apache Spark™, Tachyon offers many advantages. For example, Tachyon can keep in-memory data safe even if a Spark job crashes. It also allows data to be shared at memory speed between different Spark jobs; without Tachyon, each job would need to load the data from disk into main memory, greatly slowing performance. Tachyon also goes beyond Spark and can be used with Hadoop MapReduce, Apache HBase, Apache Flink, and others.
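The data-sharing benefit above can be sketched in a few lines. This is a toy model, not Spark or Tachyon code (all names here are hypothetical): a shared in-memory store means only the first job pays the disk-load cost, and every later job reads at memory speed.

```python
# Illustrative sketch: a shared in-memory layer between jobs means the
# expensive disk load happens once, not once per job.

disk_reads = 0

def load_from_disk(path):
    global disk_reads
    disk_reads += 1          # stand-in for an expensive disk/network read
    return list(range(5))    # pretend file contents

shared_cache = {}            # plays the role of Tachyon's in-memory store

def read(path):
    if path not in shared_cache:
        shared_cache[path] = load_from_disk(path)
    return shared_cache[path]

job_a = sum(read("/data/input"))   # first job loads from disk once
job_b = sum(read("/data/input"))   # second job hits memory only
print(disk_reads)                  # 1: a single disk read served both jobs
```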

Tachyon consists of two major layers: the lineage layer and the persistent layer. In this blog, I focus on the persistent layer, which is responsible for preserving Tachyon’s checkpoint data in the underlying storage (which may be Amazon S3, HDFS, or GlusterFS).
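In Tachyon releases of this era, the under-storage used by the persistent layer was selected through configuration along these lines (the exact key name and value format may vary by version, and the hostnames here are placeholders):

```properties
# Point Tachyon's persistent layer at an HDFS under-store
tachyon.underfs.address=hdfs://namenode:9000/tachyon
```

Swapping the value for an S3, GlusterFS, or Swift URI is what redirects checkpoints to a different backing store.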

Tachyon internally implements HDFS interfaces to interact with the underlying storage. As a result, any storage system that exposes the HDFS interface can easily be plugged into Tachyon. At IBM Research, we recently extended Tachyon’s persistent layer to work with OpenStack Swift and the SoftLayer public object store. We based this integration on the Swift driver from the Hadoop OpenStack module.
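The pluggability described above comes from coding against a narrow, HDFS-style interface rather than any concrete backend. The sketch below is illustrative only; the class and method names are hypothetical, not Hadoop’s actual `FileSystem` API, but they show why any storage that implements the same small surface can slot in underneath.

```python
# Illustrative sketch: an HDFS-style interface makes the under-storage
# pluggable. The upper layer only sees the interface, never the backend.

from abc import ABC, abstractmethod

class UnderStorage(ABC):
    """Minimal HDFS-like surface the upper layer codes against."""
    @abstractmethod
    def create(self, path, data): ...
    @abstractmethod
    def open(self, path): ...
    @abstractmethod
    def exists(self, path): ...

class InMemoryStorage(UnderStorage):
    """Toy backend; a Swift or S3 driver would implement the same methods."""
    def __init__(self):
        self._blobs = {}
    def create(self, path, data):
        self._blobs[path] = data
    def open(self, path):
        return self._blobs[path]
    def exists(self, path):
        return path in self._blobs

def checkpoint(storage: UnderStorage, path, data):
    """Persistent-layer logic: works with any UnderStorage implementation."""
    storage.create(path, data)
    return storage.exists(path)

backend = InMemoryStorage()
print(checkpoint(backend, "/checkpoints/job-1", b"state"))  # True
```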


While testing the Tachyon-Swift integration, it became clear that using Hadoop modules within Tachyon is far from optimal, primarily because the default Hadoop code is not optimized for object stores. To explain the problem, consider FileOutputCommitter, which ships with Hadoop and is built to work with a file system rather than an object store. This means we must maintain a file-system structure of directories and sub-directories any time we want to work with a file. For example, to work with the file container/a/b/c/data.txt, the Swift driver has to create empty objects for container, a, a/b, and a/b/c in order to maintain the nested structure Hadoop demands. In contrast, an architecture optimized for object storage would let Swift simply create a container with a single object called a/b/c/data.txt. Because Swift supports object names with delimiters and listing by prefix, none of those placeholder structures would need to be generated. In short, by having Tachyon work directly with Swift through a different architecture, we can make things far more efficient.
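The difference between the two layouts is easy to see in a small simulation (dictionaries stand in for an object store here; the function names are hypothetical): the file-system-style path creates a placeholder object for every directory level, while the object-native path stores one object and recovers the "hierarchy" through prefix listing.

```python
# Illustrative sketch of the problem above: emulating a file-system hierarchy
# on an object store forces extra empty "directory" objects, while a native
# object-store layout needs only the object itself plus prefix listing.

def fs_style_put(store, container, key):
    """Hadoop-style: create a placeholder object per directory level."""
    parts = key.split("/")[:-1]          # a, b, c for a/b/c/data.txt
    for i in range(len(parts)):
        store.setdefault(container, {})["/".join(parts[: i + 1]) + "/"] = b""
    store.setdefault(container, {})[key] = b"contents"

def object_style_put(store, container, key):
    """Object-native: delimiters in names are just characters."""
    store.setdefault(container, {})[key] = b"contents"

def list_prefix(store, container, prefix):
    """Swift-style listing: filter object names by prefix."""
    return [k for k in store.get(container, {}) if k.startswith(prefix)]

fs, flat = {}, {}
fs_style_put(fs, "container", "a/b/c/data.txt")
object_style_put(flat, "container", "a/b/c/data.txt")

print(len(fs["container"]))                    # 4: three placeholders + data
print(len(flat["container"]))                  # 1: just the object itself
print(list_prefix(flat, "container", "a/b/"))  # ['a/b/c/data.txt']
```

Every placeholder write in the first layout is a round trip to the object store, which is exactly the overhead that direct, object-aware access avoids.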

To overcome these drawbacks, we developed a new architecture that doesn’t depend on the existing Hadoop Swift driver and helps Tachyon work more efficiently with Swift. Our approach accesses Swift directly via the JOSS library. Our early tests show a significant improvement in performance and user experience compared with the previous solution, and we are about to contribute the new architecture to the Tachyon community.

IBM Research sees Tachyon as a promising new technology and we will continue to evaluate the new architecture and advance integration between Spark and Tachyon, while looking into innovative combinations for Spark, Tachyon, and Swift. Stay tuned—Spark + Tachyon evaluation post coming soon.

Spark Technology Center




