Spark SQL

CACHE Table in Apache Spark™ SQL

For users who want to improve performance by caching table data in memory, here are some considerations.

You can either call sqlContext.cacheTable("tableName") or dataFrame.cache() in an application, or run CACHE TABLE tableName in the Spark SQL shell. Subsequent queries against the cached table will use InMemoryColumnarTableScan to scan and retrieve only the required column(s).

For example:

scala> sqlContext.cacheTable("t4")

scala> val df = sqlContext.sql("select col1 from t4")
df: org.apache.spark.sql.DataFrame = [col1: int]

scala> df.explain(true)
== Parsed Logical Plan ==
'Project ['col1]
+- 'UnresolvedRelation `t4`, None

== Analyzed Logical Plan ==
col1: int
Project [col1#103]
+- MetastoreRelation default, t4, None

== Optimized Logical Plan ==
Project [col1#103]
+- InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)

== Physical Plan ==
InMemoryColumnarTableScan [col1#103], InMemoryRelation [col1#103,col2#104,col3#105], true, 10000, StorageLevel(true, true, false, true, 1), HiveTableScan [col1#73,col2#74,col3#75], MetastoreRelation default, t4, None, Some(t4)
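
The same in-memory cache can also be populated from the DataFrame API or from SQL. A minimal sketch, reusing the t4 table from the example above (any Hive- or Parquet-backed table works the same way):

// Cache via the DataFrame API; equivalent to sqlContext.cacheTable("t4")
val t4 = sqlContext.table("t4")
t4.cache()

// Or cache via SQL, e.g. from the Spark SQL shell
sqlContext.sql("CACHE TABLE t4")

// Release the in-memory data once it is no longer needed
sqlContext.uncacheTable("t4")

Note that the SQL CACHE TABLE statement is eager, while cacheTable() and cache() are lazy: the data is materialized the first time the cached table is scanned.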

It’s worth noting that prior to Apache Spark™ 1.5.2, caching a Parquet table had an issue: a query selecting from the cached Parquet table did not actually scan through InMemoryColumnarTableScan. Instead, the physical plan scanned the ParquetRelation directly, which could degrade performance.

The problem was that the LogicalRelation wrapping the ParquetRelation carries expectedOutputAttributes, a list of resolved fields whose expression IDs (exprIds) are not guaranteed to be the same across resolutions. When a table is cached, the LogicalRelation wrapping the ParquetRelation becomes the key in the cache and the resulting InMemoryRelation is the value. When a new query comes in, the freshly resolved LogicalRelation wraps the same ParquetRelation, but its expectedOutputAttributes carry different exprIds than the cached key. As a result, the cache lookup misses and the plan falls back to scanning the ParquetRelation instead of the cached InMemoryRelation.
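
As a simplified illustration (hypothetical names, not the actual Spark source), the cache lookup amounts to comparing plan keys; when that comparison includes exprIds, two resolutions of the same Parquet table never match:

// Hypothetical, stripped-down model of the cache key comparison
case class Attr(name: String, exprId: Long)
case class CachedLogicalRelation(relation: String, output: Seq[Attr])

object CacheLookupSketch extends App {
  val parquet = "ParquetRelation(/tmp/t4)"

  // Key stored when the table was cached ...
  val cachedKey = CachedLogicalRelation(parquet, Seq(Attr("col1", 103)))
  // ... and key produced when a new query is analyzed: same relation, fresh exprIds
  val newKey = CachedLogicalRelation(parquet, Seq(Attr("col1", 118)))

  // Pre-fix behavior: equality includes the output attributes, so the lookup misses
  println(cachedKey == newKey)            // false -> scan the ParquetRelation

  // Post-fix behavior: compare only the underlying relation, so the lookup hits
  def sameResult(a: CachedLogicalRelation, b: CachedLogicalRelation): Boolean =
    a.relation == b.relation
  println(sameResult(cachedKey, newKey))  // true -> use the InMemoryRelation
}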

Instead of comparing the wrapping LogicalRelations when looking up the key in the cache, the code should compare the underlying ParquetRelations directly. This issue is fixed in 1.5.2 and 1.6.0.
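
To check whether a given release is affected, cache a Parquet-backed table and inspect the physical plan. A rough sketch, assuming a Hive-enabled sqlContext as in the example above (the table name t5 is hypothetical):

// Create and cache a hypothetical Parquet-backed table
sqlContext.sql("CREATE TABLE t5 (col1 INT, col2 INT) STORED AS PARQUET")
sqlContext.cacheTable("t5")

// Inspect the physical plan of a query against the cached table
sqlContext.sql("SELECT col1 FROM t5").explain()
// Fixed releases (1.5.2, 1.6.0): the plan contains InMemoryColumnarTableScan
// Affected releases: the plan scans the ParquetRelation directly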

Bio: Xin Wu is an active contributor to Apache Spark at the IBM Spark Technology Center (STC). Xin’s main focus is the Spark SQL component. Prior to joining STC, he was a developer on Big SQL, IBM’s SQL-on-Hadoop engine.
