Building a Word2Vec Model with Twitter Data

It is always amazing when someone takes a very hard, present-day problem and translates it into one that has been studied for centuries. This is the case with Word2Vec, which transforms words into vectors. Text is unstructured data and has been explored mathematically far less than vectors, both historically and today. Newton (1642-1726) may have been the first to study vectors, in the context of forces in physics, so the vector is a concept with at least 289 years of mathematical maturity. Mathematical exploration of text data, by contrast, has only a few decades of maturity. Similarly, I have worked with vectors for half of my life, but have explored text data for less than a year.

The application of mathematical thinking to text data is especially important now, at a time when the value of data is understood but not yet realized. The majority of business-relevant information originates in unstructured form, primarily text. This data is invisible to, and unusable by, business, health care, education, and government until it can be “read”. Mathematical exploration of text data can yield insights that translate into better decisions made by doctors, marketers, entrepreneurs, and teachers.

As part of my endeavor to make text data “readable”, I applied Word2Vec to generate vectors that capture word meaning and enable arithmetic operations on words. For example, vector(‘king’) + vector(‘woman’) - vector(‘man’) results in a vector that is close to vector(‘queen’). Isn’t this incredible?

The Word2Vec method was proposed by Mikolov et al. in 2013. The algorithm is based on neural networks and maps a corpus of text to a matrix in which each row is the vector for one word in the input text data (in this case, tweets). The resulting vector space can be used in a variety of ways, such as measuring the distance between words. Given a word of interest, that vector space can therefore be used to compute the top N closest words.
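To make the word arithmetic and the top-N lookup concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are invented by hand for illustration and are not the output of a trained model; real Word2Vec vectors typically have a hundred or more dimensions. Cosine similarity is assumed as the measure of closeness, and the helper names `cosine`, `closest_words`, and `analogy` are hypothetical.

```python
import math

# Toy vectors, hand-picked for illustration only (NOT learned by Word2Vec).
# The three dimensions loosely stand for (royalty, maleness, femaleness).
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_words(target, n, exclude=()):
    """Return the n words whose vectors are most similar to `target`."""
    scored = [
        (cosine(target, vec), word)
        for word, vec in VECTORS.items()
        if word not in exclude
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:n]]


def analogy(a, b, c):
    """Find the word closest to vector(a) + vector(b) - vector(c)."""
    target = [x + y - z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # The input words are excluded so the answer is a different word.
    return closest_words(target, 1, exclude={a, b, c})[0]


print(analogy("king", "woman", "man"))
print(closest_words(VECTORS["king"], 2, exclude={"king"}))
```

With these toy values, `analogy("king", "woman", "man")` lands on "queen". A trained model works the same way in principle, only with learned vectors and a much larger vocabulary.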

For example, a model that I built using 30 days of Twitter data returns the following 5 closest words to #deeplearning: 1. #machinelearning, 2. #ml, 3. #smartdata, 4. #predictiveanalytics, 5. #datascience. The Word2Vec implementation used for this session is the one from MLlib, the machine learning package that’s part of Apache Spark.
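As a rough idea of what such a pipeline can look like, the sketch below tokenizes tweets and trains a model with the DataFrame-based `Word2Vec` estimator in `pyspark.ml.feature`. This is an assumption-laden illustration, not the code used for the model above: the `tokenize` and `train_word2vec` helpers, the tokenization rules, and the parameter values are all hypothetical, and the original model may have been built with a different MLlib API.

```python
import re


def tokenize(tweet):
    """Lowercase a tweet, drop URLs, and keep words, #hashtags and @mentions."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())
    return re.findall(r"[#@]?\w+", text)


def train_word2vec(tweets, vector_size=100, min_count=5):
    """Train a Spark Word2Vec model on an iterable of tweet strings.

    vector_size and min_count are illustrative defaults, not tuned values.
    """
    # Imported here so tokenize() can be tried without a Spark installation.
    from pyspark.ml.feature import Word2Vec
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweet-word2vec").getOrCreate()

    # One row per tweet; tweets with no tokens are skipped.
    rows = [(tokens,) for tokens in map(tokenize, tweets) if tokens]
    df = spark.createDataFrame(rows, ["tokens"])

    word2vec = Word2Vec(
        vectorSize=vector_size,  # dimensionality of each word vector
        minCount=min_count,      # ignore words seen fewer times than this
        inputCol="tokens",
        outputCol="vector",
    )
    return word2vec.fit(df)


print(tokenize("Loving #DeepLearning and #MachineLearning! http://example.com"))
```

Given a list of tweet strings in a variable such as `tweets`, `model = train_word2vec(tweets)` followed by `model.findSynonyms("#deeplearning", 5).show()` would list the 5 closest words with their cosine similarities. Because the tokenizer lowercases everything, query words should be lowercase too.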

If you’re interested in learning more about building Word2Vec models, I’ll be at DataPalooza San Francisco, November 10-12. In my session, I will share my experience of choosing the key parameters for building an accurate Word2Vec model. In addition, I will show examples of how this technique is used in data products. One of these examples is the RedRock iPad application, which uses Word2Vec to generate 2 of its visualizations. My code for this tutorial is available in the following GitHub repo:

Anyone who comes to DataPalooza will be able to build a skeleton data product in 3 days. Join me at my session to understand how the Word2Vec model can be one of the key elements of your data product. See you there!

Spark Technology Center

