SLAM with Kafka and Spark Streaming: Q & A with J White Bear

J White Bear, an engineer at the Spark Technology Center, fell in love with machine learning as a computational biologist using predictive analytics on human proteins to find vaccines. She’s now teaching robots to see where they are in space relative to other objects (something we’ll thank her for when robots are giving each other the high sign on the road while driving us around) and building a framework for the Apache Spark community to support both robotics and computational biology.

Here, she talks about SLAM—Simultaneous Localization and Mapping—as applied to robots and human proteins, and the secret to nerd happiness.

J’s SLAM work was shown recently at Spark Summit. She also spoke on June 29th at Hadoop Summit. See below for a video of her talk.

Q. How did you get from, say, 5th grade math, to SLAM algorithms at the Spark Technology Center?

It was a zig-zaggy path! But the hook for me was computational biology. You’re actually looking at a lot of big data problems in biology. With recent advances in DNA sequencing and in studying protein-protein interactions, we have capabilities for working with cells that we didn’t have thirty years ago, and now we have the data that comes along with that and the processing of that data. Traditional biologists don't really study math. They don't really study statistics. They definitely don't study computer science. So their approaches are very limited. They have to come and work with scientists on the other side of things who can say, okay, this is how we can interpret the data that we have. Because you're looking at one cell all day long in a dish. It's a very myopic view. We look at the big picture: What are these systems in the body? What are these pathways? How can we characterize this? Those are all machine learning problems.

Q. How is that a machine learning problem?

My research in particular is on protein-protein interactions between human proteins and a parasite called Schistosoma mansoni, which causes the disease schistosomiasis. The number of proteins and the number of configurations that they can take in your body are the size of the universe, literally, meaning we don't have enough computational power to enumerate them all. We can't use brute force. You have to find a smart way to predict how proteins are going to interact in real life.

Then you can focus on finding vaccine candidates. This disease is from a parasite found in developing countries that burrows into the skin when people go into the water. The parasite uses an elastase to degrade the extracellular matrix, and then gets into their bloodstream, and then goes and infects them, and they keep getting re-infected. There's a treatment, but it's chemo, and most of the people can't afford the drugs, and even if they can, they need to keep taking them repeatedly, unless they just never go back in the water.

So that's obviously not practical. You want a vaccine. So we look for the proteins that the parasites are using to get through the skin, we see which human proteins they are specifically degrading, and then we ask whether we can find something to block that interaction.

Q. What was the transition to machine learning for robotics?

The algorithms that you use in robotics are very similar to the algorithms that you use in computational biology. When robots explore a live space, they're doing the same thing you do when you try to figure out how a protein is folding: using particle spaces and particles to explore the space.

Q. Why did you choose SLAM?

SLAM is a cool problem, robots are cute, and it's the future. Roomba, the drones Amazon is working on, Google's self-driving car: all of those use SLAM algorithms, or they will. They mount LIDAR on the vehicles and use it for real-time navigation. That’s what SLAM is all about. You want to know where you are in a space, how to get to the next space, and what’s going on in the space around you. As humans, we do that all the time, and it’s cool: we just know! But a machine can’t do that without our help, and even with our help it’s hard, because things we don’t think of as difficult are actually difficult for robots.

Q. What’s hard about navigating, in a robot's mind?

Differentiating a road from a patch of grass. To a robot, it’s the same color: the dirt is kind of brown, the road is kind of brown, so which is which? And while the robots are thinking, you have cars driving off the road. That’s a computer vision problem, and it's all machine learning based. How do we smartly predict what we're seeing?

Drones, cars—they’re all implementing SLAM variants and there are problems inherent to the space: understanding where the robot is in relation to its space (localization), mapping that new space correctly (mapping), interpreting the visual or radar data accurately (machine learning) and doing it in a fast and reliable way (simultaneous).

Q. So are you working with localization, mapping, interpretation … and how and why are you using Spark?

To start with, I’m working on the performance aspect with Spark by examining the performance of Extended Kalman Filter SLAM (EKF-SLAM) variants in real time on the cloud. EKF-SLAM represents the vehicle’s internal map and pose estimate as a single high-dimensional Gaussian, that is, a mean vector and a covariance matrix. Maintaining a multivariate Gaussian requires computational time quadratic in the number of features of the map. Then there is a constant stream of new data that grows larger in relation to the number of moves or observations the robot makes. We need to be able to perform fast computations, while making solid predictions about the robot’s state and its environment. So it’s a suitable problem for the Spark platform.
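
To make the quadratic growth concrete, here is a small arithmetic sketch. It assumes a common textbook formulation that the interview does not spell out: a 2-D robot whose pose is (x, y, heading) and point landmarks stored as (x, y). The function names are hypothetical and only count matrix entries; they do not implement the filter itself.

```python
def ekf_slam_state_size(num_landmarks, pose_dim=3, landmark_dim=2):
    """Length of the state vector: robot pose plus every landmark position.

    Assumption: 2-D pose (x, y, heading) and 2-D point landmarks.
    """
    return pose_dim + landmark_dim * num_landmarks


def covariance_entries(num_landmarks, pose_dim=3, landmark_dim=2):
    """Number of entries in the covariance matrix, which is n x n for a state of length n."""
    n = ekf_slam_state_size(num_landmarks, pose_dim, landmark_dim)
    return n * n


# Ten times more landmarks means roughly a hundred times more covariance entries.
for landmarks in (10, 100, 1000):
    print(landmarks, ekf_slam_state_size(landmarks), covariance_entries(landmarks))
```

Under these assumptions, 10 landmarks give a 23 x 23 covariance matrix (529 entries), while 1,000 landmarks give a 2003 x 2003 matrix (about 4 million entries), and each observation can touch all of them.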

The challenge was, of course, that this had not been done before by integrating Kafka, Spark Streaming, and Spark ML, which is the approach I’m taking. I’m taking measurements and running some benchmarks.

Q. Why Kafka and Spark Streaming?

Apache Kafka handles both the producer and the consumer side: it feeds control data to Spark Streaming, which feeds the data into the SLAM algorithm for processing, and the results and actions are then transmitted back to the robot. They all work together to perform this task as fast as possible and make accurate predictions, predictions that translate to reliable moves and accurate maps as the robot explores the space.
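
The data flow described here can be sketched in plain Python. This is only an in-process simulation, not Kafka or Spark code: a `deque` stands in for the Kafka topic carrying control data, a list stands in for the topic carrying actions back to the robot, and a loop stands in for Spark Streaming micro-batches. The functions `slam_step` and `choose_action` are hypothetical placeholders, and the "SLAM update" is simple dead reckoning rather than an EKF.

```python
from collections import deque


def slam_step(state, control):
    """Hypothetical stand-in for the SLAM update: dead-reckon the pose from one control."""
    x, y = state
    dx, dy = control
    return (x + dx, y + dy)


def choose_action(state):
    """Hypothetical result message: report the current pose estimate to the robot."""
    return {"pose": state}


def run_pipeline(controls, batch_size=2):
    control_topic = deque(controls)  # stands in for the Kafka topic the robot writes to
    action_topic = []                # stands in for the Kafka topic the robot reads from
    state = (0.0, 0.0)
    while control_topic:
        # One micro-batch: the records a streaming job would see in one interval.
        take = min(batch_size, len(control_topic))
        batch = [control_topic.popleft() for _ in range(take)]
        for control in batch:
            state = slam_step(state, control)
        action_topic.append(choose_action(state))
    return action_topic


actions = run_pipeline([(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
print(actions)
```

In a real deployment each stand-in would be replaced by the corresponding Kafka topic and Spark Streaming job; the sketch only shows the order in which data moves.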

It’s still a very active area of research. We’re not there yet. It’s still a hard problem. What do you see? Where are you? Where are you going?

Q. Can we look at Turtlebot?

Yes! For Turtlebot here, I write the algorithm. I base it on Spark and run it on a Spark cluster. And I have him move around in a space and he avoids things, like this ball. And he also creates a very accurate map of what he's seeing.

We can obviously see this is a wall. But a robot doesn't necessarily know that. So he has to define it: okay, this is the wall.

Turtlebot exploring

Q. How do you see what’s going on in Turtlebot’s head?

This is a very raw, 2-D vision. These dots are called landmarks and it'll put them down in front of things that it sees with its laser. And then eventually if you get enough of these dots, the robot gets a sense of what this room is like. If you get a series of dots that go together, you can say, okay, this is probably a wall and you have to navigate your robot around walls. You have to think about where you can't go.

Landmark map with SLAM
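
The idea that a series of landmark dots that "go together" is probably a wall can be illustrated with a simple collinearity check. This is a minimal sketch under stated assumptions, not the method used on Turtlebot: landmarks are assumed to be 2-D (x, y) points in meters, and the function names, the minimum point count, and the tolerance are all hypothetical.

```python
import math


def max_distance_from_line(points):
    """Largest perpendicular distance of any point from the line through the first and last points."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        raise ValueError("first and last landmark coincide")
    return max(
        abs((x1 - x0) * (y0 - y) - (x0 - x) * (y1 - y0)) / length
        for x, y in points
    )


def looks_like_wall(points, min_points=4, tolerance=0.05):
    """Treat a run of landmarks as a wall if there are enough of them and they are nearly collinear."""
    return len(points) >= min_points and max_distance_from_line(points) <= tolerance


wall = [(0.0, 1.0), (0.5, 1.01), (1.0, 0.99), (1.5, 1.0)]
scattered = [(0.0, 0.0), (1.0, 2.0), (2.0, -1.0), (3.0, 0.0)]
print(looks_like_wall(wall), looks_like_wall(scattered))
```

A production system would more likely fit lines robustly to noisy laser data, but the check captures the reasoning: enough dots in a row mark a place the robot can't go.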

And if a person walks by, that's a whole other problem, because a person isn't going to generate a dot because they're moving. So how do you keep them in this space? How do you account for that? How do you get robots to follow voice commands, or react to gestures?

Q. Where do you see this going?

I’d like to expand this. Spark isn’t supporting these features right now for the robotics community. We’ve been looking at large machine learning sets for businesses and commercial data because those are the main customers, but in the open source world, you also want to support research and science.

The drones are going to need software to run them and there are going to be startups entering into the drone world, not just the big guys like Amazon. Where are we—the Spark community—when that happens?

Everybody has machines running around. Everybody knows about the self-driving cars. So where are the frameworks? They're just not there. They're all proprietary right now: there's going to be Google proprietary; there's going to be Amazon proprietary. I’d like to see a ground floor framework.

Q. Available to the open source community, for anyone to build off of?

Yes, and then we can support it for the commercial industries.

Q. What do you want to work on next?

Point cloud operations and artificial neural networks — with Mike Dusenberry at the Spark Technology Center. I want to figure out, can we mesh the two and come up with something new?

Also more benchmarking, defining where we need to head in terms of improvement.

And IoT: real-time applications hooked up to the Spark cloud. Like your Roomba and the robot cooking in your house, all hooked up to the Spark cloud.

Q. Can I have my cooking robot now?

Soon! Right now, my focus is SLAM and building a framework for the Spark community. And a framework for computational biologists as well; the algorithms are so close. We can do amazing things and help people in amazing ways. That makes me happy. Nerd happy.

