Can Apache™ Spark reveal how people really use cloud storage?

By Shelly Garion and Hillel Kolodner

For about a year, our team has been using Apache™ Spark analytics to investigate IBM Cloud logs and understand how people really use the cloud. Spark allows us to get the answers in a relatively simple way retroactively going over historical data collected over long periods of time (for example, years of operational data). This benefit goes beyond other existing tools that track use of the cloud and need to be programmed in advance.

In this post, we present the algorithmic techniques we use to analyze the huge amounts of log data produced by an operational cloud object store with Spark, and show how we use them for several interesting use-cases. The techniques include sampling, smart grouping and aggregation, and the use of Spark-MLlib. Logs contain semi-structured data, and since they come from different sources they may also contain noise, errors, broken lines, and more. Spark deals with parsing and cleaning them very efficiently.

Our work focuses on analyzing logs coming from cloud object storage, which stores objects such as text, images, and videos. Normally, it would be almost impossible to determine how many objects are new and how many are reused, rewritten, or erased — because we’re talking about billions or trillions of objects.
Logs are automatically generated by computer systems, in particular, cloud computing systems. These systems are multi-tenant and enable the resources and costs to be shared across a large pool of users. By analyzing the logs, we are able to see what individual users, groups of users, and the whole user community are doing. This kind of log analysis is useful for critical business insights, including:

  • customer characterization
  • quality of service
  • capacity planning
  • performance
  • billing and pricing
  • predictive failure analysis
  • system design
  • security and anomaly detection.

For example, we analyze logs to estimate potential for archiving: How many objects can be archived? What is the expected archive size? What criteria are optimal for archiving data? Similarly, we can analyze the potential for caching, and predict the best cache size to optimize the chance for cache hits.

To analyze the archiving potential, we estimate the probability that an object that was put in the cloud will not be touched again (at least for a very long time). A naive approach might take all the log lines during one year and perform “groupByKey” where the key is the object name. Since there are billions or trillions of distinct objects, this approach might cause the Spark cluster to run out of memory. Instead, we work on a sample of the objects, for example, all objects whose hashed name starts with “000”, which yields a random sample of 1/4096 of all objects. For more accurate prediction, we choose additional random samples.

Next, we would like to sort all the operations on an object according to their time. But this is not really feasible for each object. Our solution is to keep a list of days in which the object was active, including the last daily operation on this object, and its last daily size. This significantly reduces the amount of memory needed. Ultimately, we are able to answer the following question: What is the probability that an object will be touched again if it has not been touched for T days? We answer this for a variety of object sizes, users/accounts, and numbers of days.

Another problem is to identify time frames in which the performance of the object storage decreased, causing a subsequent increase in operation latencies. Since the latency can vary greatly according to the operation type and object size, we decided to focus on HEAD operations. A HEAD operation on an object returns the object metadata and not its content; hence the latency is independent of object size. Next, we would like to calculate the percentiles of the latencies for each time period (for example: median, 90%, 99%). Since there could be millions of operations per second, it would be impractical to collect all the latencies, sort them, and calculate the exact percentiles. We decided to estimate the percentiles by dividing all the latencies into a histogram with a fixed number of cells (for example, 1000 cells), where each cell represents a range of latencies. Then we use the "Map/Reduce" method to map each log line to its appropriate cell and to sum the number of lines in each cell. Here is a simplified version of our code. The figure below shows the results of our analysis.

// Main:
val logFiles = sc.textFile(“hdfs:///logdata/logdatafile*.gz”)

val LatencyHEADobject =“ “)).map(ProcessLogLine).filter(line => line._1 == “HEAD object”)

val LatencyHistogram =,b)=>a+b)

// Functions:
def ProcessLogLine(line: Array[String]) = {  
    val operation = .. // string, contains the fields indicating the operation type
    val time = .. // string, contains the fields indicating the time of the request (either week, day, hour, minute, second)
    val latency = .. // double, the field indicating the latency of the request
    (operation, time, latency)

def LatencytoBuckets (line: (String, String, Double) ) = {  
       val time = line._2
       val latency = line._3
       val loglatency = math.log(latency*1000) 
       val bucket = (if (loglatency > 0) loglatency.toInt else 0)
       ((time, bucket),1L)
// the graph below shows buckets 1 to 6. 

We see that toward the end of the period the latencies increase in an anomalous way; in particular, a HEAD latency of 20 msec is normally in the 80th percentile, but starting near time 140 it is often in the 40th percentile. This could indicate a problem that needs to be addressed.

Cumulative Distribution Function of Latency of HEAD Object

Our work also exploits the logs to detect security threats and anomalies by finding outliers; these are observation points that are far from other data points. We build a model of “normal” user behavior based on the logs: for each account (or container) we count how many different operations of each type and how many distinct objects are accessed each hour. After training the model, we are able to generate alerts for anomalous behavior if the numbers are significantly higher than normal. Generally, we analyze the logs for the entire cluster, and also separately per customer (that is, per account). This proves useful not only for security alerts, but also for performance considerations.

In conclusion, Spark, when used judiciously, is very efficient at analyzing huge amounts of semi-structured data retroactively — without the need to have the analytics programmed beforehand.


You Might Also Enjoy

Gidon Gershinsky
Gidon Gershinsky
19 days ago

How Alluxio is Accelerating Apache Spark Workloads

Alluxio is fast virtual storage for Big Data. Formerly known as Tachyon, it’s an open-source memory-centric virtual distributed storage system (yes, all that!), offering data access at memory speed and persistence to a reliable storage. This technology accelerates analytic workloads in certain scenarios, but doesn’t offer any performance benefits in other scenarios. The purpose of this blog is to... Read More

James Spyker
James Spyker
3 months ago

Streaming Transformations as Alternatives to ETL

The strategy of extracting, transforming and then loading data (ETL) to create a version of your data optimized for analytics has been around since the 1970s and its challenges are well understood. The time it takes to run an ETL job is dependent on the total data volume so that the time and resource costs rise as an enterprise’s data volume grows. The requirement for analytics databases to be mo... Read More