0 to Life-Changing Application with Apache SystemML

A “life-changing app”? You may be asking yourself who is this person and how are they so sure they are going to change lives?

Well, let me introduce myself.

Before joining the Spark Technology Center as an intern working with SystemML, I was a student and a researcher and a restaurant manager and an undergraduate admissions ambassador and a barista and... the list goes on, but my passion has always been the social sciences and social good. I studied global politics and philosophy as an undergrad at NYU, then went on to study foreign policy, focusing on South Asia, for my master's degree at King’s College London. Jumping forward a few years, several countries and several jobs, I spontaneously moved out to San Francisco to see what all the buzz was about. I worked as a journalist and as a health researcher, but I wanted something to really sink my teeth into. That’s when I discovered data science. Though I have no computer science background and am driven only by my thirst for knowledge, I have jumped head first into the world of data, programming and machine learning as a UC Berkeley data science grad student.

That brings us back to now, where IBM’s STC has given me the assignment of my dreams: learn SystemML from scratch, brainstorm a real-world problem, help build an application using SystemML, then sit back and watch lives being changed. Well, that’s the plan anyway.

As you can guess, learning SystemML from scratch and then building an application with it will be interesting, to say the least. That’s why I am going to blog about every step along the way. This way, we can build our SystemML applications together, and I can save you some troubleshooting along the way.

Why SystemML?

At UC Berkeley, we're taught R and Python. SystemML runs with R and Python. Being new to computer science and wanting to jump straight into the data doesn't allow me much time to hack into Spark and figure out how to write high-level math with big data. On SystemML you can write the math no matter how big the data is! Because I can access algorithms from files, it's easier to go from formulas and R code to big data problems.

Now let's get to my first dive into SystemML where I’ll focus on: overcoming assumptions.

While I may still be very new to the tech world and all of its wonderful tutorials, one issue I have consistently noticed is the long list of assumptions made in any step-by-step guide, particularly when setting up your environment. Many developers, data scientists and researchers are so advanced that they have forgotten what it’s like to be new! When writing tutorials, they assume that everything is set up and ready to go, but that’s not always the case. No need to worry with SystemML: I am here to help. Below is my very own step-by-step guide to running SystemML in a Jupyter notebook (with little to no assumptions).

SystemML Jupyter Tutorial

*If you are just starting out please read the following “setting up your environment” step. If you aren’t just starting out please skip to “run SystemML”, but make sure to install SystemML first!

Setting up your environment.

If you’re on a Mac, you’ll want to install Homebrew (http://brew.sh) if you haven’t already.

Copy and paste the following into your terminal.

# OS X:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
# Linux:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)"

Now install Java (you need Java 8).

brew tap caskroom/cask  
brew install Caskroom/cask/java  

To install something with Homebrew, all you need to do is type "brew install" followed by what you want to install.

Follow up by installing everything else you need.

Install Spark.

brew tap homebrew/versions  
brew install apache-spark16  

Install Python 2 or 3.

#Install Python 2 with Jupyter, Matplotlib and Numpy
brew install python  
pip install jupyter matplotlib numpy  
#Install Python 3 with Jupyter, Matplotlib and Numpy
brew install python3  
pip3 install jupyter matplotlib numpy  

Download SystemML.

Go to the Apache SystemML downloads page and download the zip file (second file).

This next step is optional, but it will make your life a lot easier.

Set SYSTEMML_HOME in your bash profile.

First, use vim to create/edit your bash profile. Not sure what vim is? Check https://www.linux.com/learn/vim-101-beginners-guide-vim.

We are going to add the path to the folder where SystemML is stored to our bash profile. This will make it easier to access. First type:

vim .bash_profile  

Now you are in vim. First, type “i” to enter insert mode.


Now insert SystemML. Note: /Documents is where I saved my SystemML. Be sure that your file path is accurate.

export SYSTEMML_HOME=/Users/stc/Documents/systemml-0.10.0-incubating  

Now press Esc to leave insert mode, then type :wq and press Enter to write the file and quit.


Open a new tab in your terminal so that the changes take effect.
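If you want to double-check that the new terminal tab picked up the variable, a small Python sketch like the one below can help. The helper name find_systemml_jar is my own invention, and the target/SystemML.jar location is an assumption taken from the launch command used later in this post, so adjust it if your download is laid out differently.

```python
import os

def find_systemml_jar(home):
    """Return the expected SystemML jar path under `home`, or None if it is missing.

    Assumes the jar lives at <home>/target/SystemML.jar, as in the
    pyspark launch command used later in this tutorial.
    """
    if not home:
        return None
    jar = os.path.join(home, "target", "SystemML.jar")
    return jar if os.path.isfile(jar) else None

if __name__ == "__main__":
    # Read the variable exported in .bash_profile (None if it is not set).
    home = os.environ.get("SYSTEMML_HOME")
    print("SYSTEMML_HOME =", home)
    print("SystemML jar  =", find_systemml_jar(home))
```

If either line prints None, the export line or the file path probably needs another look.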

Congrats! You’ve made it to the step where we run SystemML!

Run SystemML flawlessly.

In your browser, if you go to http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization you will see a long block of code under “Nonnegative Matrix Factorization”.

Take a look at this page if you want to understand the code more, but we only need to use part of it. In your terminal, type:

PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path $SYSTEMML_HOME/target/SystemML.jar --jars $SYSTEMML_HOME/target/SystemML.jar --conf "spark.driver.memory=12g" --conf spark.driver.maxResultSize=0 --conf spark.akka.frameSize=128 --conf spark.default.parallelism=100  
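That command is a lot to take in at once, so here is a rough Python sketch that assembles the same command piece by piece, with a comment on what each part does. The function build_pyspark_command and its driver_memory parameter are hypothetical names of mine, not part of Spark or SystemML; the flags themselves are copied from the command above.

```python
import shlex

def build_pyspark_command(systemml_home, driver_memory="12g"):
    """Assemble the launch command from this tutorial as a single string."""
    jar = systemml_home.rstrip("/") + "/target/SystemML.jar"
    env = {
        "PYSPARK_PYTHON": "python3",               # Python used by Spark workers
        "PYSPARK_DRIVER_PYTHON": "jupyter",        # start Jupyter instead of the plain shell
        "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",  # ... and open the notebook app
    }
    args = [
        "pyspark",
        "--master", "local[*]",                    # run locally on all available cores
        "--driver-class-path", jar,                # put SystemML on the driver classpath
        "--jars", jar,                             # ship the jar with the Spark job
        "--conf", "spark.driver.memory=" + driver_memory,
        "--conf", "spark.driver.maxResultSize=0",  # 0 means no limit on collected results
        "--conf", "spark.akka.frameSize=128",      # max message size in MB (Spark 1.x)
        "--conf", "spark.default.parallelism=100", # default number of partitions
    ]
    env_part = " ".join("{}={}".format(k, shlex.quote(v)) for k, v in env.items())
    return env_part + " " + " ".join(shlex.quote(a) for a in args)

if __name__ == "__main__":
    # The path below is only an example; use your own SYSTEMML_HOME.
    print(build_pyspark_command("/Users/stc/Documents/systemml-0.10.0-incubating"))
```

If 12g is more memory than your laptop has, passing a smaller driver_memory (for example "4g") is a reasonable thing to try.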

Jupyter should have launched, and you should now be running a Jupyter notebook with Spark and SystemML!

Now set up the notebook and download the data:

%load_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np  
import matplotlib.pyplot as plt  
plt.rcParams['figure.figsize'] = (10, 6)



%%sh
# Run this in its own notebook cell: %%sh tells Jupyter these are shell commands
curl -O http://snap.stanford.edu/data/amazon0601.txt.gz
gunzip amazon0601.txt.gz

Use PySpark to load the data into a Spark DataFrame

import pyspark.sql.functions as F  
dataPath = "amazon0601.txt"

X_train = (sc.textFile(dataPath)  
    .filter(lambda l: not l.startswith("#"))
    .map(lambda l: l.split("\t"))
    .map(lambda prods: (int(prods[0]), int(prods[1]), 1.0))
    .toDF(("prod_i", "prod_j", "x_ij"))
    .filter("prod_i < 500 AND prod_j < 500")  # keep only a small slice of the data
)

max_prod_i = X_train.select(F.max("prod_i")).first()[0]  
max_prod_j = X_train.select(F.max("prod_j")).first()[0]  
numProducts = max(max_prod_i, max_prod_j) + 1  
print("Total number of products: {}".format(numProducts))  
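If the chain of Spark calls above feels opaque, the sketch below walks through the same logic in plain Python on a handful of lines, no Spark required. The function name parse_edges and the sample lines are made up for illustration; the real file has the same tab-separated layout with "#" comment lines at the top.

```python
def parse_edges(lines, max_id=500):
    """Turn 'prod_i<TAB>prod_j' lines into (prod_i, prod_j, 1.0) triples."""
    triples = []
    for line in lines:
        if line.startswith("#"):       # skip the header comments
            continue
        prod_i, prod_j = (int(x) for x in line.split("\t"))
        if prod_i < max_id and prod_j < max_id:   # same filter as the Spark code
            triples.append((prod_i, prod_j, 1.0))
    return triples

# Made-up sample lines in the same format as amazon0601.txt
sample = ["# FromNodeId\tToNodeId", "0\t1", "0\t2", "3\t700", "4\t0"]
triples = parse_edges(sample)

# Product IDs start at 0, so the count is the largest ID plus one.
num_products = max(max(i, j) for i, j, _ in triples) + 1
print(triples)
print("Total number of products: {}".format(num_products))
```

Each triple says "product i was co-purchased with product j", and the 1.0 is simply the value stored at that position of the matrix.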

Create a SystemML Context Object

from SystemML import MLContext  
ml = MLContext(sc)  

Define a kernel for Poisson nonnegative matrix factorization (PNMF) in DML

pnmf = """
X = read($X)
X = X+1  # change product IDs to be 1-based, rather than 0-based
V = table(X[,1], X[,2])
size = ifdef($size, -1)
if(size > -1) {
    V = V[1:size,1:size]
}
max_iteration = as.integer($maxiter)
rank = as.integer($rank)

n = nrow(V)
m = ncol(V)
range = 0.01
W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform")
H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform")
losses = matrix(0, rows=max_iteration, cols=1)

i = 1
while(i <= max_iteration) {
  # Update the two factor matrices
  H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W))
  W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H))

  # Record the loss for this iteration
  losses[i,] = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H)))
  i = i + 1;
}

write(losses, $lossout)
write(W, $Wout)
write(H, $Hout)
"""

Execute the Algorithm

outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, ["W", "H", "losses"])  
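To get a feel for what the PNMF kernel is doing, here is a small NumPy sketch of the same update rules that you can run on a toy matrix without Spark. This is my own translation of the DML for learning purposes, not SystemML code: the function name pnmf_numpy and the seed parameter are assumptions, and the toy matrix is strictly positive so that no division by zero occurs.

```python
import numpy as np

def pnmf_numpy(V, rank, max_iter, seed=0):
    """Factor V (n x m) into W (n x rank) and H (rank x m) with PNMF updates."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Small random starting values, like Rand(..., min=0, max=0.01) in the DML
    W = rng.uniform(0, 0.01, size=(n, rank))
    H = rng.uniform(0, 0.01, size=(rank, m))
    losses = np.zeros(max_iter)
    for i in range(max_iter):
        # H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W))
        H = (H * (W.T @ (V / (W @ H)))) / W.sum(axis=0)[:, None]
        # W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H))
        W = (W * ((V / (W @ H)) @ H.T)) / H.sum(axis=1)[None, :]
        # Negative Poisson log-likelihood (up to a constant)
        losses[i] = -(np.sum(V * np.log(W @ H)) - W.sum(axis=0) @ H.sum(axis=1))
    return W, H, losses

if __name__ == "__main__":
    V = np.arange(1.0, 13.0).reshape(3, 4)   # toy 3 x 4 matrix, all entries > 0
    W, H, losses = pnmf_numpy(V, rank=2, max_iter=20)
    print("W shape:", W.shape, "H shape:", H.shape)
    print("loss went down:", losses[-1] <= losses[0])
```

The point of writing it in DML instead is that SystemML can run the very same math on data far too big for a single NumPy array.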

Retrieve the Losses and Plot Them

losses = outputs.getDF(sqlContext, "losses")  
xy = losses.sort(losses.ID).map(lambda r: (r[0], r[1])).collect()  
x, y = zip(*xy)  
plt.plot(x, y)  
plt.title('PNMF Training Loss')  

Congratulations! You just ran SystemML!

Thanks for reading! Stay tuned for updates on my life-changing app!

