
A Universal Translator for Big Data and Machine Learning


Anybody who travels to a foreign country or reads a book or newspaper written in a language they don’t speak understands the value of a good translation. Yet, in the realm of big data, application developers face huge challenges when combining information from different sources and when deploying data-heavy applications to different types of computers. What they need is a good translator.

That’s why IBM donated SystemML to the open source community. SystemML is a universal translator for big data, and the machine learning algorithms now essential to processing it. With SystemML, developers who don’t have expertise in machine learning can embed it in their applications once, and then use it in industry-specific scenarios on a wide variety of computing platforms, from mainframes to smartphones.

SystemML was born in 2007 in a research lab in Almaden, CA, when IBM researcher Shiv Vaithyanathan and his summer intern looked at each other at the end of a long summer writing algorithms on Hadoop, and realized they could never conquer big data one algorithm at a time.

“SystemML lets the data scientist be super-creative – not encumbered by how to optimize the algorithms.” – Shiv Vaithyanathan, SystemML author

“We needed to think along the same lines as SQL, by separating the ‘what’ from the ‘how’. In other words, we had to separate the specification of the algorithm – and make the specification easy – from the way we optimized the algorithm.”
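That "what versus how" split can be illustrated with a small Python sketch. This is an analogy only, not SystemML or its DML language, and every name below is invented: the algorithm is specified once as equations over sufficient statistics, while the execution strategy, either a single local scan or a partitioned scan that mimics distributed aggregation, is chosen independently.

```python
# Illustrative analogy only (not SystemML/DML): the "what" is the
# algorithm specification; the "how" is the execution strategy.

def fit_line(stats):
    """The 'what': simple linear regression, written once as equations
    over sufficient statistics (n, sum x, sum y, sum x^2, sum x*y)."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def stats_single_pass(xs, ys):
    """One 'how': accumulate the statistics in a single local scan."""
    return (len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs),
            sum(x * y for x, y in zip(xs, ys)))

def stats_partitioned(xs, ys, parts=4):
    """Another 'how': per-partition statistics merged at the end,
    mimicking how a distributed runtime aggregates across workers."""
    partials = [stats_single_pass(xs[i::parts], ys[i::parts])
                for i in range(parts)]
    return tuple(sum(col) for col in zip(*partials))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1

# The specification never changes; only the execution plan does.
assert fit_line(stats_single_pass(xs, ys)) == fit_line(stats_partitioned(xs, ys))
```

Both strategies produce identical coefficients because the specification never changes, only the execution plan does; that is the property a system like SystemML exploits at much larger scale.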

Fast forward to 2015: The Apache Software Foundation, one of the leading open source organizations in the world, accepted SystemML as an official Apache Incubator project, giving it the name Apache SystemML.
Acceptance by the Apache Software Foundation, known for its rigor in vetting every project, is an honor, and recognition of the impact SystemML will have on data analytics.

We open sourced SystemML in June when we threw our weight behind the Apache Spark project, a fast-growing open source effort that enables developers and data scientists to more easily integrate big data analytics into applications.

We believe that Apache Spark is the most important new open source project in a decade. We’re embedding Spark into our Analytics and Commerce platforms, offering Spark as a service on IBM Cloud, and putting more than 3,500 IBM researchers and developers to work on Spark-related projects.

Apache SystemML is an essential element of the Spark ecosystem of technologies. Think of Spark as the analytics operating system for any application that taps into huge volumes of streaming data. MLlib, the machine learning library for Spark, provides developers with a rich set of machine learning algorithms. And SystemML enables developers to translate those algorithms so they can easily digest different kinds of data and run on different kinds of computers.

SystemML allows a developer to write a single machine learning algorithm and automatically scale it up using Spark or Hadoop, saving significant time on the part of highly skilled developers. While other tech companies have open sourced machine learning technologies as well, most of those are specialized tools to train neural networks. They are important, but niche, and the ability to ease the use of machine learning within Spark or Hadoop will be critical for machine learning to really become ubiquitous in the long run.

Fred Reiss, one of SystemML’s authors, explains:

“SystemML represents a new design point.” – Fred Reiss, STC Researcher

“In the big data space, there really hasn’t been any high-level language, just a collection of point solutions and some frameworks for tying them together. SystemML represents a new design point: a flexible, high-level language coupled with an optimizer and runtime that can handle big data problems.”
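The "optimizer" half of that design can be sketched with a toy cost model. This is hypothetical Python with invented names and thresholds, not SystemML's actual optimizer: a planner inspects the data's dimensions and picks an execution strategy, loosely analogous to how SystemML chooses between single-node and distributed plans.

```python
# Hypothetical sketch (invented names and threshold): a cost-based
# planner chooses a runtime strategy from the data's dimensions,
# loosely analogous to how SystemML picks single-node vs. distributed
# execution plans.

def choose_plan(num_rows, num_cols, mem_budget_cells=1_000_000):
    """If the matrix fits the (made-up) memory budget, run locally;
    otherwise fall back to a distributed plan."""
    return "single_node" if num_rows * num_cols <= mem_budget_cells else "distributed"

assert choose_plan(1_000, 100) == "single_node"         # 100k cells: in-memory
assert choose_plan(10_000_000, 1_000) == "distributed"  # 10B cells: scale out
```

The point of coupling a high-level language to such a planner is that the same script runs unchanged whether the data fits on a laptop or spans a cluster.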

In the coming years, all businesses and, indeed, society in general, will come to rely on computing systems that learn—what we call cognitive systems. This kind of computer learning is critical because the flood of Big Data makes it impossible for organizations to manually train and program computers to handle complex situations and problems—especially as they morph over time. Computing systems must learn from their interactions with data.

The Apache SystemML project has achieved a number of early milestones to date, including:

Over 320 patches, covering APIs, data ingestion, optimizations, language and runtime operators, additional algorithms, testing, and documentation.

90+ contributions to the Apache Spark project from more than 25 engineers at the IBM Spark Technology Center in San Francisco, both to make machine learning accessible to the fastest-growing community of data science professionals and to improve various other components of Apache Spark.

More than 15 contributors from a number of organizations enhancing the capabilities of the core SystemML engine.

Apache SystemML committer D.B. Tsai had this to say about it:

“It is a great extensible complement framework of Spark MLlib.” – D.B. Tsai, SystemML committer

“SystemML not only scales for big data analytics with high-performance optimizer technology, but empowers users to write customized machine learning algorithms in a simple domain-specific language without learning complicated distributed programming. It is a great extensible complement framework of Spark MLlib. I’m looking forward to seeing this become part of the Apache Spark ecosystem.”

We are excited too. We believe that open source software will be an essential element of big data analytics and cognitive computing, just as it has been vital to the advances in the Internet and cloud computing. The more tech companies and developers share resources and combine their efforts, the faster information technology will transform business and society.

