
A Survey of Books about Apache Spark™

Hundreds of authors have spent thousands of hours to make the native Apache Spark™ documentation as clear and complete as possible — including a quick start guide and programming guide. They've done an incredible job.

Even so, if you're puzzling through Spark's many complexities and capabilities, you may want to turn to books that offer a true guided tour of the material.

We've assembled a survey of the best books currently on the market — from introductions for novices to deep-dive explorations for veterans:

  • Learning Spark: Lightning-Fast Big Data Analysis — by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. This definitive guide comes from a core team of Spark insiders and is designed to get you up and running fast. Learn to quickly express parallel jobs and set up everything from simple batch jobs to stream processing and machine learning.

  • Getting Started With Apache Spark — Jim Scott. A friendly, free, online introduction for newcomers. Scott offers step-by-step instructions to take users from installation to core capabilities (RDDs, DataFrames, Spark SQL, Spark Streaming, and the machine learning library). He ends with real-world production use cases.

  • Mastering Apache Spark — Mike Frampton. Frampton lays out advanced techniques and examples for processing and storing data, including integration with key third-party applications. Other topics include clustering and classification using MLlib; Spark stream processing via Flume and HDFS; creating and populating Spark schemas; and graph processing using Spark GraphX.

  • High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark — Holden Karau, Rachel Warren. A book for those who've used Apache Spark to solve medium-sized problems, but are ready to take advantage of Spark at scale. Learn to make jobs run faster, productionize exploratory data science, handle larger data sets, and reduce pipeline running times for faster insights.

  • Advanced Analytics with Spark: Patterns for Learning from Data at Scale — Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills. Four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The book brings statistical methods (classification, collaborative filtering, anomaly detection, and more) together with real-world data sets from genomics, security, finance, and neuroimaging.

  • Machine Learning with Spark — Nick Pentreath. Pentreath walks through loading, processing, and preparing data as input to Spark’s machine learning models. Detailed examples and real-world use cases cover common models including recommender systems, classification, regression, clustering, and dimensionality reduction. Also covered: working with large-scale text data, plus methods for online machine learning and model evaluation using Spark Streaming.

  • Apache Spark Machine Learning Blueprints — Alex Liu. Liu explores connecting Spark with R to handle huge datasets at high speed. The book serves up project "blueprints" that demonstrate notebooks and machine learning capabilities for detecting fraud, analyzing financial risks, building predictive models, and setting up recommendation systems.
