James Spyker
2 months ago

Streaming Transformations as Alternatives to ETL

The strategy of extracting, transforming and then loading data (ETL) to create a version of your data optimized for analytics has been around since the 1970s, and its challenges are well understood. The time it takes to run an ETL job depends on the total data volume, so time and resource costs rise as an enterprise’s data volume grows. The requirement for analytics databases to be mo...

Gita Koblents
3 months ago

Bringing Apache Spark™ Closer to SIMD and GPU

By Gita Koblents, Kazuaki Ishizaki, and Hiroshi Inoue. Accelerating the Apache Spark™ execution engine has always been a focus of the Spark development community. As a result, significant performance improvements were delivered in Spark 2.0 compared with Spark 1.6. Most of the improvements were implemented as part of Project Tungsten. The goal of Project Tungsten is to push Spark performance closer to...

Vijay Sundaresan
5 months ago

Improvements to the SizeEstimator class in Apache™ Spark

By Vijay Sundaresan, Adam Roberts, and Andrew Craik. What's the SizeEstimator class and why should I care? org.apache.spark.util.SizeEstimator is a core class used by Apache™ Spark that walks the object graph rooted at a given object and uses knowledge of the JVM object model to arrive at an estimate for the amount of Java heap that is transitively held live by the object. Note that this means size...

Sunitha Kambhampati
5 months ago

Exploring the Apache Spark™ DataSource API

Apache Spark™ provides a pluggable mechanism to integrate with external data sources using the DataSource APIs. These APIs allow Spark to read data from external data sources and to write data that is analyzed in Spark back out to those sources. The DataSource APIs also support filter pushdowns and column pruning, which can significantly improve the performance of queries....

Steve Moore
5 months ago

Data Science Hub & the Data Science Community: Philippe Van Impe

At the recent sold-out Spark & Machine Learning Meetup in Brussels, Philippe Van Impe of the European Data Science Academy delivered a lightning talk called Data Science Hub & the Data Science Community. Philippe's talk gave an introduction to the activities of the Data Science Community, whose mission is to educate, inspire and empower scholars and professionals to apply data sciences...

Steve Moore
5 months ago

Apache Spark™ Applications the Easy Way: Pierre Borckmans

At the recent sold-out Spark & Machine Learning Meetup in Brussels, Pierre Borckmans of Real Impact Analytics delivered a lightning talk called Writing Spark applications, the easy way. As Pierre explained, even though Apache Spark™ offers intuitive and high-level APIs, writing production-ready Spark data pipelines involves non-trivial challenges for data scientists without expert backgroun...

Steve Moore
5 months ago

Hyperparameter Optimization: Sven Hafeneger

At the recent sold-out Spark & Machine Learning Meetup in Brussels, Sven Hafeneger of IBM delivered a lightning talk called Hyperparameter Optimization - when scikit-learn meets PySpark. As Sven explained, Apache Spark™ is not only useful when you have big data problems. If you have a relatively small data set you might still have a big computational problem. One problem is the search for o...

Steve Moore
5 months ago

Intro to Extending Spark ML for Custom Models: Holden Karau

At the recent sold-out Spark & Machine Learning Meetup in Brussels, Holden Karau of the Spark Technology Center delivered a lightning talk called A very brief introduction to extending Spark ML for custom models: Talk + Demo. Holden took a look at Apache SparkML™ pipelines. Inspired by scikit-learn, they have the potential to make machine learning tasks much easier. This talk looked at how...

Steve Moore
5 months ago

Data Science and Beer: Kris Peeters

At the recent sold-out Spark & Machine Learning Meetup in Brussels, Kris Peeters of Data Minded delivered a lightning talk called Data Science and Beer. As Kris explained, because of its general-purpose nature, Apache Spark™ is being used by a wide variety of data professionals, each with their own backgrounds. The data warehouse / data lake of a large organization is a spot where those 3 worl...

Steve Moore
5 months ago

Data Streams in Telecom: Koen Dejonghe

At the recent sold-out Spark & Machine Learning Meetup in Brussels, Koen Dejonghe of Eurocontrol delivered a lightning talk titled Simulation and processing of data streams in Telecom. Specific...

Jeremy Anderson
5 months ago

Open Source Design and Apache SystemML™

What is Open Source Design? When we hear “open source”, most of us think code, “source” referring to source code. So what does open source design mean? We can point to plenty of examples of open desig...

Kun Liu
5 months ago

Using Stocator to Connect Object Stores to Apache Spark™ 2.0

In a previous blog post, Gil Vernik gave a high-level overview of Stocator, a driver to access data in an object store. After the release of Apache Spark™ 2.0, we tested the process of connecting object storage and Spark using the Stocator driver, and found that it works seamlessly with Spark 2.0. This blog will cover the three basic steps to use the Stocator driver in Spark 2.0 to access obje...

Jon Alter
6 months ago

RedRock is Open Source

Putting big data analysis in your hands. RedRock on GitHub: backend at https://github.com/SparkTC/redrock and frontend at https://github.com/SparkTC/redrock-ios . We are excited to announce that the RedRock backend is now open source! That’s cool, but what is RedRock, you say? RedRock is an example application to demonstrate the power of Spark integrated with ElasticSearch and processing Twitter data. RedRoc...

Shelly Garion
6 months ago

Can Apache™ Spark reveal how people really use cloud storage?

By Shelly Garion and Hillel Kolodner. For about a year, our team has been using Apache™ Spark analytics to investigate IBM Cloud logs and understand how people really use the cloud. Spark allows us to get the answers in a relatively simple way by retroactively going over historical data collected over long periods of time (for example, years of operational data). This benefit goes beyond other existin...

Madison J Myers
8 months ago

0 to Life-Changing App: We Found Data!

Could it be? Yes! My team and I have finally found delightful data or, rather, the Goldilocks of data. Whatever you prefer. The important part is that the data is public. The data is big. The data is...

Steve Moore
8 months ago

Apache Spark™ Makers Build Hackathon

Build an Apache Spark™ application and win $50,000 and one-on-one sessions with judges from Tesla, Netflix, IBM, and more. Compete to address a real business problem or core business concern related to customer care, marketing, risk management, or operations. We live in a data-driven world. As a developer, data engineer, or data scientist, you need the right tools to access and analyze data to ge...

Steve Moore
9 months ago

An Introduction to Notebooks

When you hear the word 'notebook' maybe you think of a notepad or a laptop. Increasingly, the word brings to mind a web application that contains all of your code, text, and visualizations for a parti...

Xiao (Sean) Li
9 months ago

Deciding about De/Serialization in PySpark Storage Levels

Serialization can save substantial space at the cost of some extra CPU time, and by default, PySpark uses the cPickle serializer. (The following link explains the general internal design of PySpark: PySpark_Internals.) Prior to PySpark 2.0, the stored objects were always serialized regardless of whether you chose a serialized level. That means the flag “deserialized” had no effect (as documented...

Bo Meng
9 months ago

Enabling Apache Spark™ on HBase

Apache HBase is a distributed key-value store of data on HDFS. It’s modeled on Google’s Bigtable, and provides APIs to query the data. The data is organized, partitioned, and distributed by its “row keys”. Per partition, the data is further physically partitioned by “column families” that specify collections of “columns” of data. The data model is well-suited for wide tables where columns are dyn...

Xin Wu
10 months ago

CACHE Table in Apache Spark™ SQL

For users wanting to improve performance by caching table data into memory, we offer some considerations… You can either do sqlContext.cacheTable("tableName"), dataFrame.cache() in an application or “C...

Steve Moore
a year ago

Data By the Bay

We’re helping to promote Data By the Bay, a large-scale, data-focused gathering with 150 talks, comprising seven conferences over the course of five days. From the organizers: 50+ founders/CEOs/CTOs,...

Luciano Resende
a year ago

Mentoring and Open Source

Hear Luciano discuss mentoring at ApacheCon North America in Vancouver on Thursday, May 12. Enterprise adoption of open source software is at its peak. But while we see corporations in general consumi...

David Fallside
a year ago

Introducing EclairJS

In this post, we describe the motivation behind the EclairJS project and provide a glimpse into its capabilities. Node.js is fast becoming one of the more popular frameworks for quickly developing...

Fred Reiss
a year ago

Inside Apache SystemML

Fred Reiss presented this deep dive at Spark Summit East in NYC in February 2016. See the slides on SlideShare: Inside Apache SystemML – Spark Summit East 2016, from Fred Reiss. Fred Reiss talks ab...

Holden Karau
a year ago

Beyond Parallelize and Collect

Holden Karau presented this important work at Spark Summit East in NYC in February 2016. See the slides on SlideShare: Beyond parallelize and collect – Spark Summit East 2016, from Holden Karau. Ef...

Gino Bustelo
a year ago

Announcing Apache Toree

I’m pleased to announce that in late 2015, Apache Spark Kernel was accepted by Apache as an incubator project. As part of this transition, the Apache Spark Kernel project was renamed Toree. As an Apac...

Katharine Kearnan
a year ago

Reinvent Yourself. And the World.

A colleague of mine said something great yesterday: “IBM is a place where you can re-invent yourself.” He’s right. Many of us were hired to do one job, and were given the opportunity by IBM to define...

Spark Admin
a year ago

SystemML Webinar from Fred Reiss

To commemorate our recent release of SystemML 0.8.0, we asked SystemML guru Fred Reiss to do a webinar. Check out the replay here. Hear from one of the leading minds in machine learning, Fred Reiss,...

Joel Horwitz
a year ago

Datapalooza! Real World Data Products

Before I get started, I highly recommend you go now to http://www.spark.tc/datapalooza and buy a ticket. It may sell out before you finish reading this post. Now let's begin. Back in 2008, Apache Hado...

Paula Ta-Shma
a year ago

Channeling Oceans of IoT Data

IBM researchers in Haifa, together with partners from the COSMOS EU-funded project, are using Apache Spark™ to analyze the new wave of IoT data and solve problems in a way that is generic, integrated,...

Steve Moore
2 years ago

Datapalooza Comes to San Francisco

From November 10th to 12th, the Spark Technology Center in San Francisco hosts the first-ever Datapalooza, a deep dive with industry leaders from data science, data engineering, and app development....

Team TrAfrica
2 years ago

Africa, Leading: TrAfrica

If you live in Silicon Valley, and commute to downtown San Francisco, your trip typically takes 1.5 hours, regardless of whether you try a combination of car + train + Caltrain, car + BART, or just dr...

Chip Senkbeil
2 years ago

Apache Spark™ Kernel Architecture

In the first part of the Apache Spark™ Kernel series, we stepped through the problem with enabling interactive applications against Apache Spark and how the Spark Kernel solved this problem. This week...

Benjamin Herta
2 years ago

Using Spark's cache for correctness, not just performance

RDDs are immutable. Right? This is one of the first things we learn when we read about Apache Spark™. Here’s a little program that appears to contradict this. This Scala program creates a small RDD, performs a few simple transformations on it, and then calls RDD.count() on the same RDD twice. The values of the two calls to count are compared with an assert, and at first glance, we would think tha...

Katharine Kearnan
2 years ago

Project RedRock: Design + Data

“There’s a huge market opening up for data analytics. Whoever turns the technology into products that are simple, beautiful, and easy for anyone to use, wins.” David Townsend, IBM Designer. “Anyone...

Benjamin Herta
2 years ago

From the Driver to the Executors

I have worked with a number of people who are new to Apache Spark™, and have an existing program that they want to port to it. Spark supports a number of programming languages, is relatively easy to...

Fred Reiss
2 years ago

Welcome to our blog!

My name is Fred Reiss, and I work at IBM’s Spark Technology Center. The STC is a new part of IBM, located in downtown San Francisco. Our mission is to serve as an interface between IBM and the Apache...

Rob Thomas
2 years ago

Apache Spark™

“In 1997, IBM asked James Barry to make sense of the company’s struggling web server business. Barry found that IBM had lots of pieces of the puzzle in different parts of the company, but not an integ...