Developing Data Products with Jupyter Notebooks and Apache Spark

Note: This post draws material from a talk titled From Doodles to Dashboards: Notebooks as a Cloud Platform, presented at Defrag 2015 on November 11, 2015. All of the notebooks shown in this post are included in the associated GitHub repo.

One of the major value propositions of Apache Spark is its ability to span the many phases of the data mining process. From exploration to deployment, Spark offers a consistent API, multi-language support, and scalability with both the volume and velocity of data. With Spark, data science teams can focus on the development of their data products, and avoid the accidental complexity of switching data processing technologies along the way.

To truly capitalize on this potential, data practitioners also need a user experience that caters to their many activities and their many uses of Spark (e.g., data munging, machine learning, stream processing). We see one answer to this need in interactive notebooks—living documents of text, code, visualizations, and widgets, backed by cloud compute and data. Combining the open ecosystem of Project Jupyter notebooks with Spark, for example, can help data science teams grow their data products from back-of-the-napkin ideas to reproducible interactive reports to just-good-enough dynamic dashboards.

As a demonstration of how Spark and Jupyter Notebooks complement each other to ease the creation of data products, consider the following scenario:

My team wants to help increase attendance at IBM meetups. We know from prior research that meetup attendees are more likely to subscribe to IBM cloud services, and so we think the effort is justified. But we don’t yet know how to empower our evangelists to attract new attendees.

While the problem above is hypothetical, the approach summarized below mirrors our path in a number of customer engagements.

Doodling to understand

Starting out, my teammates and I need to learn about the relevant data available to us. Notebooks provide a natural place for us to dabble with data sources and take notes along the way. We might, for instance, poke at the Meetup API in notebooks to fetch meetup lists by topic and learn how to process the real-time RSVP stream with Spark. We might also try joining other public data sources with the meetup data to see if we can get richer detail about users and venues.
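The sketch below shows the kind of doodle we might start with: sample a handful of events from Meetup's public RSVP stream and ask Spark a first, simple question about them. It assumes a PySpark notebook kernel that provides a SparkContext as sc, that the stream endpoint is reachable, and that the event fields match the RSVP stream's published schema.

    import json
    import requests

    # Pull a small sample of RSVP events from the long-lived HTTP stream.
    sample = []
    resp = requests.get('http://stream.meetup.com/2/rsvps', stream=True)
    for line in resp.iter_lines():
        if line:
            sample.append(json.loads(line.decode('utf-8')))
        if len(sample) >= 100:
            break
    resp.close()

    # Parallelize the sample and count which topics are drawing RSVPs right now.
    rsvps = sc.parallelize(sample)
    top_topics = (rsvps
                  .flatMap(lambda r: [t['urlkey'] for t in r['group']['group_topics']])
                  .map(lambda topic: (topic, 1))
                  .reduceByKey(lambda a, b: a + b)
                  .takeOrdered(10, key=lambda kv: -kv[1]))
    top_topics

Even a crude sample like this is enough to start a conversation about which topics, groups, and venues matter to us.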


Documenting to collaborate

After exploring for a while, we start to hit upon viable models for identifying and visualizing meetup candidates. Notebooks provide a nice canvas for comparing and contrasting these approaches with reproducible results. For instance, we might evaluate the performance of our various candidate models in a notebook that we can easily extend and re-run as needed. We might also start to look at different ways of visualizing candidates from our Spark stream and reaching out to them.
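As a rough illustration, such a comparison notebook might boil down to a loop like the one below. The DataFrame named labeled is hypothetical: a table of past meetup attendees with a features vector column and a binary label marking those who later subscribed, assembled elsewhere in the notebook with Spark's ML pipeline API.

    from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    train, test = labeled.randomSplit([0.8, 0.2], seed=42)
    evaluator = BinaryClassificationEvaluator(metricName='areaUnderROC')

    candidates = {
        'logistic regression': LogisticRegression(maxIter=20),
        'random forest': RandomForestClassifier(numTrees=50),
    }

    # Fit each candidate and record its AUC so the comparison is easy to re-run
    # whenever the underlying data changes.
    for name, estimator in candidates.items():
        model = estimator.fit(train)
        auc = evaluator.evaluate(model.transform(test))
        print('{}: AUC = {:.3f}'.format(name, auc))

Because the whole comparison lives in one re-runnable document, teammates can swap in their own candidate models and see the results side by side.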


Deploying to take action

Ultimately, we want to put our work into the hands of our evangelists so they can start to take action. Here, too, notebooks shine thanks to their ability to include front-end and back-end code in an open document format that we can transform. We might, for instance, roll up our work into a notebook that uses declarative widgets to show meetups and candidates in real time, to provide a one-click way to contact a Meetup user, and to track how many candidates RSVP after we reach out. We might then lay out this notebook as a dashboard and deploy it as a web frontend for our evangelists to use.
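The snippet below is only a stand-in for that interactive layer: it uses plain ipywidgets buttons rather than the declarative widgets extension mentioned above, and both top_candidates (a ranked list produced by our Spark job) and send_invitation (a helper that messages a Meetup user) are hypothetical.

    import ipywidgets as widgets
    from IPython.display import display

    def contact(candidate, button):
        send_invitation(candidate)          # hypothetical helper that messages the user
        button.description = 'Contacted'
        button.disabled = True

    for candidate in top_candidates:        # hypothetical ranked list from our Spark job
        button = widgets.Button(description='Contact {}'.format(candidate['name']))
        # Bind this candidate via a default argument so each button keeps its own.
        button.on_click(lambda b, c=candidate: contact(c, b))
        display(button)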


Discussing new insights

After deploying the app, we start to collect both objective and subjective feedback from our evangelist users. Together, we think of improvements not only to our first-cut dashboard UI, but also to how we identify and rank candidates. These new ideas are ripe for experimentation in notebooks and redeployment as improved dashboards. We can continue to iterate in this fashion until our data product is just good enough for our evangelists, until it warrants implementation as a production-level Spark data processing pipeline and web application, until it deserves no further investment, and so on.

And along the way, we'll more than likely discover new ways to bring insights to our team. For instance, we might bridge our work in notebooks to Slack to make information about meetups readily available in our ongoing team conversation.
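A bridge like that can start out very small. The sketch below posts a notebook finding to a Slack channel through an incoming webhook; the webhook URL is a placeholder you would configure for your own team.

    import json
    import requests

    SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

    def post_to_slack(text):
        # Slack incoming webhooks accept a simple JSON payload with a `text` field.
        requests.post(SLACK_WEBHOOK_URL, data=json.dumps({'text': text}))

    post_to_slack('New meetup candidates are in the dashboard and ready for outreach.')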


Bottom line

Notebooks help data science teams realize the value of Spark throughout the evolution of their data products. Together, they form a powerful platform for performing analytics at scale and developing data products with speed.
