RedRock Version 2 is Now Open Source


You asked and we listened!

RedRock-v2 is now available on GitHub.

RedRock was received with such enthusiasm that we wanted to give you more, so here it is. We are happy to introduce “RedRock Version 2”.

In RRv2 we took what we learned from our first version of RR and made it better. This example application includes Akka Actors, Redis, and Gephi. RRv2 allows you to delve deeper into Twitter communities to see how they align and what they think.

Like the first RedRock, the application is split into a back end and a front end. Both are open source today.

The Nuts and Bolts

First, an Akka Actor pulls a 10-minute chunk of Twitter Decahose data from IBM Bluemix and writes it to HDFS in a folder monitored by Apache Spark™ Streaming. Spark preprocesses the tweets: it selects the English tweets and extracts word tokens and tweet sentiment. After preprocessing, Spark writes the results to HDFS and Redis.
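The preprocessing step can be illustrated with a minimal plain-Python sketch. It is not the RedRock implementation, which runs on Spark Streaming. The tweet fields (`user`, `lang`, `text`), the tokenizer regex, and the tiny sentiment lexicon are all assumptions made for illustration.

```python
import re

# Hypothetical sentiment lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}

# Matches words, optionally prefixed by '#' or '@'.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text):
    """Lowercase the text and split it into word, hashtag, and mention tokens."""
    return TOKEN_RE.findall(text.lower())


def sentiment(tokens):
    """Return 1, -1, or 0 from a naive count of lexicon hits."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    return (score > 0) - (score < 0)


def preprocess(tweets):
    """Keep English tweets and attach tokens and a sentiment label."""
    results = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue  # drop non-English tweets
        tokens = tokenize(tweet["text"])
        results.append(
            {"user": tweet["user"], "tokens": tokens, "sentiment": sentiment(tokens)}
        )
    return results


batch = [
    {"user": "alice", "lang": "en", "text": "I love #spark streaming"},
    {"user": "bob", "lang": "es", "text": "Hola mundo"},
    {"user": "carol", "lang": "en", "text": "This outage is awful"},
]
processed = preprocess(batch)
```

In a real pipeline the same three operations (language filter, tokenization, sentiment scoring) would be applied to each micro-batch before the results are persisted.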

The front end is an iPad application that interacts with the back end via a REST API.

User Interaction

The objective of the app is to discover communities of like-minded Twitter users who are discussing a particular topic. The topic is defined by twenty related terms obtained from a Spark Word2Vec model trained on English tweets received over a period of seven days. Once a topic is selected, we keep only the retweets that include any of the twenty terms. From these filtered retweets we generate a network of users called a retweet graph. In this graph, users are the nodes, and a link is created between two nodes if the users retweet each other.
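As a rough sketch of how such a retweet graph could be assembled, the plain-Python example below keeps the retweets that mention a topic term and counts links between users. The record fields (`retweeter`, `original_author`, `text`) are hypothetical, and the sketch assumes a link is undirected and created whenever one user retweets the other.

```python
from collections import defaultdict


def build_retweet_graph(retweets, topic_terms):
    """Return undirected edge weights {(user_a, user_b): count} for on-topic retweets."""
    terms = {t.lower() for t in topic_terms}
    edges = defaultdict(int)
    for rt in retweets:
        words = set(rt["text"].lower().split())
        if not (words & terms):
            continue  # retweet does not mention any topic term
        a, b = rt["retweeter"], rt["original_author"]
        if a == b:
            continue  # ignore self-retweets
        # Sort the pair so (a, b) and (b, a) count toward the same link.
        edges[tuple(sorted((a, b)))] += 1
    return dict(edges)


retweets = [
    {"retweeter": "alice", "original_author": "bob", "text": "RT spark is fast"},
    {"retweeter": "bob", "original_author": "alice", "text": "RT love spark streaming"},
    {"retweeter": "carol", "original_author": "dave", "text": "RT nice weather today"},
    {"retweeter": "dave", "original_author": "alice", "text": "RT hadoop and hdfs"},
]
graph = build_retweet_graph(retweets, ["spark", "hdfs"])
```

The edge weight records how often two users retweeted each other on the topic, which a layout or community-detection step can then use.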


The user enters a search term, for example something current and polarizing: "#trump". On the back end, Spark Word2Vec gives us the terms most closely related to the search term. In the resulting chart, the distance from the center shows how closely each term is related to the original search term, and the size of each bubble reflects the frequency of that term.
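The mapping from word vectors to bubbles might look like the sketch below. It assumes relatedness is measured by cosine similarity between word vectors, that distance from the center is one minus that similarity, and that bubble size is the raw term count. The toy two-dimensional vectors and counts are made up; a trained Word2Vec model would supply real ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(query, vectors, counts, top_n=20):
    """Rank the other terms by similarity to the query and describe each bubble."""
    q = vectors[query]
    scored = [(term, cosine(q, vec)) for term, vec in vectors.items() if term != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [
        {"term": term, "distance": 1.0 - sim, "size": counts.get(term, 0)}
        for term, sim in scored[:top_n]
    ]


# Toy vectors and counts, for illustration only.
vectors = {
    "#spark": [1.0, 0.0],
    "streaming": [0.9, 0.1],
    "hdfs": [0.6, 0.8],
    "weather": [0.0, 1.0],
}
counts = {"streaming": 120, "hdfs": 45, "weather": 300}
bubbles = related_terms("#spark", vectors, counts, top_n=2)
```

Here "streaming" lands closest to the center because its vector points in nearly the same direction as the query, while the frequent but unrelated "weather" is cut off by `top_n`.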


At this point, the users can tune their searches by clicking on one of the closely related terms. This will cause the chart to recenter around that term, displaying the most closely related terms to the new search term.

Once the user is satisfied with the related term, they can click on the "Communities" button at the top left to have a look at the communities tweeting about those terms.


The user is presented with a retweet graph whose layout is generated using the ForceAtlas2 algorithm implemented in Gephi. Each dot is a Twitter user, and the colors represent the different communities. These communities are determined using the Parallel Louvain algorithm.
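Louvain-style methods search for a partition of the graph that maximizes modularity. The sketch below does not implement Louvain itself, parallel or otherwise. It only computes the modularity score of a given partition, to show what the algorithm is optimizing. The edge list and community labels are invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Compute Newman modularity Q for an undirected weighted edge list."""
    m = sum(w for _, _, w in edges)  # total edge weight
    if m == 0:
        return 0.0
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed node degree of each community
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += w
    # Q = sum over communities of (internal / m) - (degree / 2m)^2
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single edge.
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
    ("c", "d", 1),
]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the two triangles into separate communities scores higher than lumping all six users together, which is the kind of partition a modularity-maximizing method would prefer.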

To see more information about a community, the user clicks on one of its tweets.


Community Details displays the most commonly used term as well as the overall sentiment being expressed by that community.
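A per-community summary of this kind could be computed roughly as follows. This is a sketch under assumptions: each tweet record carries hypothetical `user`, `tokens`, and `sentiment` fields, `community_of` maps users to community labels, and overall sentiment is taken to be the mean of the per-tweet scores.

```python
from collections import Counter


def community_details(tweets, community_of):
    """Summarize each community by its most common token and mean sentiment."""
    tokens = {}
    sentiments = {}
    for tweet in tweets:
        community = community_of.get(tweet["user"])
        if community is None:
            continue  # user is not part of the graph
        tokens.setdefault(community, Counter()).update(tweet["tokens"])
        sentiments.setdefault(community, []).append(tweet["sentiment"])
    return {
        c: {
            "top_term": tokens[c].most_common(1)[0][0] if tokens[c] else None,
            "mean_sentiment": sum(sentiments[c]) / len(sentiments[c]),
        }
        for c in tokens
    }


tweets = [
    {"user": "alice", "tokens": ["spark", "fast"], "sentiment": 1},
    {"user": "bob", "tokens": ["spark", "streaming"], "sentiment": 0},
    {"user": "carol", "tokens": ["outage", "awful", "outage"], "sentiment": -1},
    {"user": "erin", "tokens": ["weather"], "sentiment": 0},
]
community_of = {"alice": 0, "bob": 0, "carol": 1}
details = community_details(tweets, community_of)
```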

As you can see, the app allows users to quickly find communities on Twitter and discover what brings them together. For more information, check out the project on GitHub.

We hope you enjoy this app as much as we enjoyed building it.
