Blog — Virtually Abstract

Alright Y'all. This one is gonna be long and juicy. I plan on writing a TL;DR version of this post for those who want the shorthand version with less in-depth data science included. Leslie Baker (my girlfriend, also a SWE at Google) and I worked on an awesome data mining project together. The gist of it was this: We want to be able to estimate a persons income level based on what they write in their tweets. In english you can understand, we want to know what the richest and poorest words are. An ambiguous and difficult problem to solve, sure. However, we came up with some pretty sweet data and research. All in all, I think we can do a pretty dang good job of estimating someones income based on their tweets now. Sit back, grab some popcorn, and enjoy. (Or skip to the end for a list of words and the expected income of their authors.)

Relating Tweet Vocabulary to Location and Socioeconomic Status

Introduction

Twitter is an interesting source of data because it provides an unfiltered view into the thoughts and words shared by all kinds of people from all over the world. We wanted to take this data and learn more about how people from different areas and backgrounds communicate differently. To that end, we decided to collect data from Twitter as well as the US Census and, with the help of data mining algorithms we have learned over the course of the semester, discover location-based trends over those datasets. Our end goal for this project was to discover if there existed specific sets of words or phrases that originated from locations in the US that had particularly high or low socioeconomic status.

Twitter then makes tweet location data easily available by use of public APIs. There are a few projects that take advantage of this geographic data, along with other tweet data, to show information about specific users. We wanted to take the unique approach of grouping tweets together based on location and performing analysis on general location trends, as well as incorporating even more data by using the US Census.

In order to reach our goal, we employed two different data mining techniques—clustering and frequent items analysis—and came up with many clever solutions to collect, associate, and analyze our two data sets. We enjoyed the process of solving challenging problems and came out with some interesting results.

Data Collection

In order to perform our desired analysis, we had to collect data from several different sources. First, we needed a large set of tweets from Twitter. To do this, we used the Twitter v1.1 sample stream API to collect a sample set of tweets as they were being posted in real time. We filtered these tweets based on their language (English) and location (United States), trimmed the tweet objects to only contain the information we needed (text, latitude, longitude), and stored them as JSON objects in files for later use. Over the course of a few weeks, we collected almost 40 million tweets.

The second major dataset we collected was the US Census data for median household income based on block group. This set came from the American Community Survey 2013 5-year estimates and aggregated by the US Census Factfinder. We received tables of the median household incomes for every block group in the United States. Then, in our implementation, we stored this data as a hashmap with key being the block group ID and value being the median household income.

The final set of data that we needed to connect the two above was a relation between latitude/longitude coordinates and corresponding block group ID. The best source that we found for this was the FCC’s Census Block Conversion API, which could accept a latitude and longitude pair and return the ID for the corresponding block ID. We could then use the prefix of the block ID to get the block group ID. However, we could not realistically send a network request to the FCC for each of our 40 million tweets, so we had to implement a clustering system to cluster tweets to their estimated closest block group. We will cover this in more detail in the next section.

Experimentation

Our experiment required far more work than I think either of us planned for. When designing this experiment, we also had a lot of unknown factors and hurdles that we were not sure how to overcome at first. These hurdles included things like: not knowing how to translate latitude/longitude coordinate pairs into Census block groups, computer memory constraints, and having the necessary computational power to complete the experiment before our original report was due. We really had to work hard to optimize the computationally intense code in our project just to be able to physically finish it.

In an intermediate sample report, we were able to solve a lot of our concerns by producing a simpler proof of concept using very rudimentary data-mining techniques. The biggest hurdle we overcame was finding an open un-metered API that could convert our latitude/longitude coordinate pairs to a Census block group. This API was made available by the FCC and the source information can be found below. The steps we took to produce this first prototype are as follows:

For our intermediate experiment we used a subset of 4 million tweets (about 10% of our final data set)
To overcome the hurdle of calling the FCC latitude/longitude conversion API 4 million times (for many reasons this was unfeasible), we rounded the latitude/longitude pairs to 3 digits and kept a local cache of the calls we had already made. Using this simple technique, however, cost us great losses in precision.
With the block group IDs obtained, we looked up the expected income for each of those block groups using our map of Census data.
From there, we were able to map each word of each tweet to its block group and then to its expected income, keeping a table of the counts of how many times the word appeared in each block group.
After constructing the map of word to income areas it was used in, we then simply calculated an average.

Even though we lost a large amount of precision when clustering by rounding and using a small subset of our data, we obtained very distinct and convincing results showing some of the exact correlations we were hoping to find. In doing this small test, we were also able to plan ahead for some of the memory and computational constraints we were going to have to deal with in the final experiment. This led to redeveloping the entire process in a sliced fashion to allow us to focus our resources on one slice of the data at a time. After slicing the data into Cartesian chunks and processing one at a time, we were able to finish our final computations in about 3 days. Our final experiment went as follows:

Our final experiment utilized the full 40 million tweet breadth of our data. We chose to slice it based on latitude and longitude values rounded down to the nearest integer. Because our data was limited to the continental United States, this gave us 62 longitudes and 20 latitudes for 62 x 20 = 1240 slices.
For each slice, we had to cluster the data to center points in order to make an FCC API call for each center rather than each individual tweet. Based on our intermediate report, we estimated we would be able to efficiently make around 500,000 calls to the API, so this was the total number of centers we aimed to cluster to.
For each slice, using the k-means++ algorithm for finding Euclidean centers in 2 dimensions, we calculated somewhere in between 0 and 500 centers for each slice. The number of centers we chose to calculate for each slice was dependent upon the number of tweets that landed in that slice. In some cases where major metropolitan areas landed in the slice, 500 centers was simply not enough to accurately portray the data contained in the slice. For these special cases we chose to split the slice into an additional 10 sub-slices.
With the centers in hand, we were able to begin calling the FCC API to translate the latitude and longitude of each center into a block group. Using a system of network monitoring and threading, we were able to make calls as fast as possible on a series of threads. This part of the process took around 24 hours to complete using a single client computer.
As with our intermediate experiment, we used the map of Census data to determine the income level for each block group and stored that value with the corresponding center.
Once each center was associated with an income level, we began moving through our tweet data. For each tweet we had stored, we placed it in its appropriate Cartesian slice by rounding its latitude/longitude down to the nearest integer and then mapped it to its closest center based on its Euclidean distance.
Keeping track of all the words from all the tweets that mapped to each center and hence expected income, we used the same methods of calculating the average income that we did in the initial experiment.
The final product that we obtained from this experiment was a list of every word used over the course of the 40 million tweets and an average expected household income from the location where each tweet was originally generated.

Analysis

There are many ways the data we collected in our first experiment could be analyzed. Some of the interesting things we found are included here.

First, the vast majority of words are only used once in our data set. The four graphs below help to illustrate this. The first two graphs are working with our total data set sorted from 1 to about 10 million based on how many times the term was used. Words like “I”, ”the”, and “is” (common english language terms) are used around 10 million times each in our data. These words would appear at the top of the spike on the right hand side of the graph. Meanwhile in Figure 1, we see that a vast majority of our terms (about 3.1 million) are used only 1 time. Figure 3 helps to illustrate this point on a logarithmic scale showing the number of terms that were used at least X times.

This is an interesting observation. Why were so many terms only used one time? Upon looking into our data we discovered that this phenomenon is mainly due to 2 factors: many words are misspelled and hashtags are often a unique grouping of words (e.g. Heyyyyy #ThisResearchRocks).

This brings us to Figures 2 and 4. Because most of our terms are only used once in the data set, it is hard for us to be very confident in the income values we obtained for those terms. Words that were used at least 20 times and hence have 20 income values associated with them are far more trustworthy and interesting. We chose to do all of our further analysis on words that were used 20 or more times or a data set that contains about 150,000 unique terms.

So now let’s talk about incomes. Once again, in order to construct these graphs, we take all of our data points (<word, expected income>) and sort them by their expected income. The graphs then display an index into that list for the X-axis and the expected income at that index for the Y-Axis. When we include all of our ~4.5 million terms in the graph on the left, we see that the median expected income of all these words is about the same as the overall United States ($51,900) and ranges between $0 and $250,000 (the highest income bucket collected by the US Census).

Figure 5 displays all terms we found, Figure 6 is displaying the trimmed list of terms that were used 20 times or more. We can see here that Figure 5 has a curve that is far more linear than Figure 6. This is due to the fact that terms used many times are more likely to tend towards the median. Thus we see a majority of the graph hangs very close to the middle with a relatively small amount of terms on the tail ends.

The words at the tail ends described in the previous paragraph are the words we are most interested in. What we know about these words is that they were used a significant amount of times and still did not tend towards the median income value. This tells us these groupings of words are specifically used more in a location that has a low or high reported income in the Census. Having thousands of terms to choose from, it is hard to list all of our findings. Here are some of the observations we made about words close to the end of each tail of the curve.

Low end of the curve findings (terms the richest on twitter would use)

College campus terms
Drug references
Profanity
Common misspellings
#medicareforall
#vegasstrip
#farmville

High end of the curve findings (terms the poorest on twitter would use)

Metropolitan terms
Financial terms
High tech terms
Arts terms
Gender equality terms
#downtonabbeypbs
#metopera

Another observation that was clear in the data was that metropolitan areas are naturally weighted with higher incomes, thus metropolitan terms come in higher. The example below shows how a collection of terms commonly associated with Las Vegas (gambling, landmarks, etc.) shape up next to a collection of terms commonly associated with New York City (landmarks, neighborhoods, etc.). The X-axis is showing expected income where the Y-axis is showing number of matching terms at each particular income level. This graph can be used to show that many more New York City terms are used at a high income level than Las Vegas terms. In fact we see that spikes of Las Vegas terms are used well below the $50,000 income mark while many spikes in New York terms are used well above the $100,000 income mark.

Conclusion

I don't think its any surprise that the richest and poorest among us use vastly different vocabulary. I do however find it to be terrifically interesting to see just what that vocabulary is. For some final fun materials, I leave you with some snippets from the presentation poster we made for this project (note: the poster also includes a little info on our research on hashtag physical locality). Also, as a huge bonus I have included a link here to where you can see the list of a set of popular words on twitter and their expected incomes. The file is huge with around 250k lines. So you probably want to just download it and use your favorite text editor.

What words are used by rich people?by Taylor Stapleton

Relating Tweet Vocabulary to Location and Socioeconomic Status

Introduction

Data Collection

Experimentation

Analysis

Conclusion

Link to Word Data!

Thurrott's commenters are amazing...by Taylor Stapleton

Should one advertise a blog?by Taylor Stapleton