Scoring Heart Disease with Apache Spark™ (Part 1/3)

This is part 1 of a practical guide to scoring health data with Apache Spark™. We'll post the subsequent parts of the guide in the coming weeks.

Inspired by an R tutorial here, the focus is not on the mathematical aspects of constructing a score, but rather on using Spark and exploiting the results. The guide is divided into four parts:

  1. Objective of the study and description of the data
  2. Data preparation and initial analysis
  3. Construction and validation of the score
  4. Interpretation of results

For the code I reference, visit the full GitHub repository here.

Objective of the study and description of the data


Scoring is a technique for ranking individuals: it assigns a rating or estimates the probability that an individual will respond to a solicitation or belongs to an intended target group.

The score is usually obtained from the quantitative and qualitative data available on the individual (socio-demographic data, purchasing behavior, previous responses, and so on), to which we apply a scoring model.

In general, we use logistic regression as the modeling technique. It is a supervised learning technique that determines, for example, whether an item belongs to a category, based on descriptors collected from a population sample, and then generalizes that learning to new observations.
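Concretely, logistic regression models the probability of the positive class as a logistic function of a linear combination of the descriptors (this is the standard textbook form, not something specific to this dataset):

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
$$

The coefficients $\beta_0, \ldots, \beta_p$ are estimated from the training sample, and the resulting probability is the score.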

Some examples of applications:

  • Determining the viability of a client seeking credit based on characteristics such as age, type of job, income level, other outstanding loans, and so on.
  • For a company, determining the best area in which to set up a new site based on neighborhood characteristics (socio-professional categories, number of inhabitants, life cycle, and so on).

Case Study

For our case study, we have a dataset of 462 patients for whom we want to predict exposure to a heart attack.

The data is available here under the tab Data > South African Heart Disease.

Data Description

In a retrospective sample of males in a heart-disease high-risk region of the Western Cape of South Africa, there are roughly two controls per case of coronary heart disease (CHD). Many of the CHD-positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after a CHD event. In some cases, the measurements were made after these treatments. (Note that these data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.)

Dataset from South African Medical Journal

For our case study, we are interested in the last variable, chd.

Model Definition

Our aim is to determine, for each of the 462 patients, the probability of heart disease as a function of the observed descriptors, for example chd = f(obesity, age, family history, etc ...)

Yes, but that's not big data. Why do we need Spark? Can't we use R or Pandas?

The intention is to define this analytical workflow on Spark and Zeppelin so that it can then be carried over to a genuinely big data context.

The example illustrates how to use logistic regression on health data, but the problem formulation is generally applicable. For example, it could apply to an insurance company wanting to determine risk factors, or to provisioning and pricing profiles of target customers for a new commercial offer. At the macroeconomic level, the approach could even be used to quantify national risk.

Modeling approach

Like any good modeling exercise, building a good scoring model is an iterative process that works through these questions and criteria:

  • Exploratory analysis: What does the data set look like? Are there any missing values?
  • Check the correlation between the descriptors and the variable.
  • Identify important and redundant predictors to create a parsimonious model (an especially important step when making forecasts).
  • Estimate the model on a training sample.
  • Validate the model on a test sample and build the model based on quality indicators.
  • Compare different models and retain the most suitable model according to the purpose of the study.
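The estimation and validation steps above translate directly into Spark ML code. As a preview, a minimal sketch might look like the following (the column names `features` and `chd`, the 70/30 split, and the seed are illustrative assumptions; the actual pipeline is built over the course of this series):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Split the data into a training and a test sample
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// Estimate a logistic regression on the training sample
val lr = new LogisticRegression()
  .setLabelCol("chd")
  .setFeaturesCol("features")
val model =

// Validate on the test sample using the area under the ROC curve
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("chd")
  .setRawPredictionCol("rawPrediction")
val auc = evaluator.evaluate(model.transform(test))
```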

We will follow these steps to evaluate the case study:


1. Load Raw Data

Let's first load the required libraries. We are working with Zeppelin Notebook (v.0.6.0) so we'll need the spark-csv package to read the downloaded data into a Spark DataFrame.

In Zeppelin, we edit the first cell to add the following lines to load the dependency:
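A dependency paragraph for Zeppelin 0.6.0 might look like the following (the artifact version 1.4.0 and the Scala 2.10 suffix are assumptions; match them to your Spark build):

```
%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")
```

Note that the `%dep` paragraph must run before the Spark interpreter is first used, otherwise the interpreter needs to be restarted.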



In a separate cell, we'll need to read the data raw first using spark-csv as follows:

// Update the path to point
// to your downloaded data.
val dataPath = "./heart-disease-study/data/"
val rawData =
  .option("header", "true")
  .option("inferSchema", "true")
  .load(dataPath)
  .drop("row.names")

We can now check what our data looks like in the notebook:
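For instance, with the DataFrame's `show` method, or with Zeppelin's built-in table rendering via the `z` context:

```scala
// Print the first rows to the notebook output
rawData.show(5)

// Or render an interactive table in Zeppelin
z.show(rawData)
```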

2. Exploratory Analysis

Since the data is already loaded, we can now check the type of each column by printing the schema:
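A single call prints the inferred type of every column:

```scala
// Inspect the column names and inferred types
rawData.printSchema()
```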

Note that the chd target variable was treated as a numeric variable. We'll deal with that later on.

And finally, let's run some summary statistics on the data (count, mean, standard deviation, min, and max for each column):
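The DataFrame `describe` method computes these statistics in one pass:

```scala
// Summary statistics (count, mean, stddev, min, max) for every column
```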

Categorical feature encoder

Notice that we also have a categorical feature famhist for family history. We'll need to encode it for further usage. We'll also need to convert the chd feature into a DoubleType before converting it into a categorical feature:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.functions.udf

// famhist UDF encoder: "Absent" -> 0.0, "Present" -> 1.0
val encodeFamHist = udf[Double, String] {
  case "Absent"  => 0.0
  case "Present" => 1.0
}

// Apply the UDF and cast chd to Double
val data = rawData
  .withColumn("famhist", encodeFamHist('famhist))
  .withColumn("chd", 'chd.cast("Double"))

// One-hot encode both categorical columns in a small pipeline
val chdEncoder = new OneHotEncoder().setInputCol("chd").setOutputCol("chd_categorical")
val famhistEncoder = new OneHotEncoder().setInputCol("famhist").setOutputCol("famhist_categorical")

val pipeline = new Pipeline().setStages(Array(chdEncoder, famhistEncoder))

val encoded =

Let's pause there for now. In the next part of this post, we'll talk about searching for meaningful explanatory variables, discuss outliers and missing values, and more.

