
Configuring the Apache Spark™ SQL Context

The Apache Spark website documents the properties you can configure, including settings that control the Spark application and the Spark SQL Context. Let's look at some of the Spark SQL Context parameters and at a nice feature in Spark 1.6 that makes them easier to set.

The Spark SQL Context (SQLContext) is the entry point for creating DataFrames and Datasets and for running SQL queries. SQLContext accepts Spark SQL configuration properties through its setConf method, or you can specify them in the spark-defaults.conf file in the conf/ directory. For example, you can add:

spark.sql.parquet.mergeSchema=true

Or you can set the parameter using the SET key=value command in SQL.
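For instance, here is a minimal Spark 1.6-style sketch showing both the setConf method and the SET command from a Scala program (the application name and the spark.sql.shuffle.partitions value are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Build a SQLContext the Spark 1.6 way (illustrative setup).
val sc = new SparkContext(new SparkConf().setAppName("sql-conf-example"))
val sqlContext = new SQLContext(sc)

// Set a Spark SQL property programmatically through setConf...
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")

// ...or through the SET key=value command in SQL.
sqlContext.sql("SET spark.sql.shuffle.partitions=400")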

The Spark SQL, DataFrames and Datasets Guide on the Apache Spark website documents most Spark SQL Context properties, all of which start with the spark.sql. prefix. Look for these properties within the following sections of the guide:

  • Configuration
  • Caching Data In Memory
  • Other Configuration Options
  • Migration Guide

After you set a SQLContext configuration property, the property name and value are stored as a string pair in a map, and the value is looked up at SQL execution time. The value can represent a boolean, an integer, or a long.

Internally, Spark converts the string value to the corresponding boolean, integer, or long when the property's value is actually used. For some properties, the value is a memory size in bytes; for example, SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE is the target input size of a post-shuffle partition.
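To make the string-pair behavior concrete, here is a small sketch reusing the sqlContext from the example above (the property names come from the Spark SQL documentation):

// The value is always stored as a string, whatever its logical type.
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true") // boolean-valued property
sqlContext.setConf("spark.sql.shuffle.partitions", "400")   // integer-valued property

// getConf returns the stored string; Spark converts it to a boolean,
// integer, or long internally only when the property is consulted.
val merge: String = sqlContext.getConf("spark.sql.parquet.mergeSchema") // "true"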

In previous Spark releases, if the user wanted to specify a size of 1 MB, he or she had to do the calculation (1024 × 1024 = 1048576), because the string '1m' couldn't be converted to a long with the string's toLong method; it would throw a NumberFormatException. With Spark 1.6, the user can specify a value such as '1m' or '1g' with a unit suffix (k, m, g), and Spark uses an existing utility to interpret the string and convert it to the corresponding byte size. That should save Spark users a lot of time.
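As a sketch of the difference, again reusing the sqlContext from above (the key spark.sql.adaptive.shuffle.targetPostShuffleInputSize is my reading of the SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE constant; check the documentation for your Spark version):

// Before Spark 1.6: the value had to be the raw byte count.
sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864") // 64 * 1024 * 1024

// Spark 1.6: a unit suffix (k, m, g) is accepted and converted to bytes for you.
sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "64m")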

That’s just one example of how Spark is becoming easier to use and easier to configure with each new release.
