Taylor Stapleton

What words are used by rich people? by Taylor Stapleton

Alright, y'all. This one is gonna be long and juicy. I plan on writing a TL;DR version of this post for those who want the shorthand version with less in-depth data science included. Leslie Baker (my girlfriend, also a SWE at Google) and I worked on an awesome data mining project together. The gist of it was this: we want to be able to estimate a person's income level based on what they write in their tweets. In English you can understand, we want to know what the richest and poorest words are. An ambiguous and difficult problem to solve, sure. However, we came up with some pretty sweet data and research. All in all, I think we can do a pretty dang good job of estimating someone's income based on their tweets now. Sit back, grab some popcorn, and enjoy. (Or skip to the end for a list of words and the expected income of their authors.)

Relating Tweet Vocabulary to Location and Socioeconomic Status

Introduction

Twitter is an interesting source of data because it provides an unfiltered view into the thoughts and words shared by all kinds of people from all over the world. We wanted to take this data and learn more about how people from different areas and backgrounds communicate differently. To that end, we decided to collect data from Twitter as well as the US Census and, with the help of data mining algorithms we have learned over the course of the semester, discover location-based trends over those datasets. Our end goal for this project was to discover if there existed specific sets of words or phrases that originated from locations in the US that had particularly high or low socioeconomic status.

Twitter makes tweet location data easily available through public APIs. There are a few projects that take advantage of this geographic data, along with other tweet data, to show information about specific users. We wanted to take the unique approach of grouping tweets together based on location and performing analysis on general location trends, as well as incorporating even more data by using the US Census.

In order to reach our goal, we employed two different data mining techniques—clustering and frequent items analysis—and came up with many clever solutions to collect, associate, and analyze our two data sets. We enjoyed the process of solving challenging problems and came out with some interesting results.

Data Collection

In order to perform our desired analysis, we had to collect data from several different sources. First, we needed a large set of tweets from Twitter. To do this, we used the Twitter v1.1 sample stream API to collect a sample set of tweets as they were being posted in real time. We filtered these tweets based on their language (English) and location (United States), trimmed the tweet objects to only contain the information we needed (text, latitude, longitude), and stored them as JSON objects in files for later use. Over the course of a few weeks, we collected almost 40 million tweets.
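
Because that filtering step shapes everything downstream, here is a minimal sketch of how the trimming might look. This is not our production collector: the field names ("lang", "place.country_code", "coordinates") follow the v1.1 tweet format as I remember it, and nlohmann/json is used purely for illustration.

// Sketch: keep only English tweets geotagged inside the US and reduce each
// one to (text, latitude, longitude). Field names are assumptions based on
// the v1.1 tweet format; nlohmann/json is illustrative only.
#include <optional>
#include <string>
#include <nlohmann/json.hpp>

struct TrimmedTweet {
  std::string text;
  double lat = 0;
  double lon = 0;
};

std::optional<TrimmedTweet> TrimTweet(const std::string& raw) {
  nlohmann::json t =
      nlohmann::json::parse(raw, /*cb=*/nullptr, /*allow_exceptions=*/false);
  if (t.is_discarded()) return std::nullopt;

  if (t.value("lang", "") != "en") return std::nullopt;
  if (!t.contains("place") || t["place"].is_null() ||
      t["place"].value("country_code", "") != "US") {
    return std::nullopt;
  }
  if (!t.contains("coordinates") || t["coordinates"].is_null()) {
    return std::nullopt;
  }

  const auto& point = t["coordinates"]["coordinates"];  // GeoJSON: [lon, lat]
  return TrimmedTweet{t.value("text", ""),
                      point.at(1).get<double>(),
                      point.at(0).get<double>()};
}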

The second major dataset we collected was US Census data for median household income by block group. This set came from the American Community Survey 2013 5-year estimates and was aggregated by the US Census Factfinder. We received tables of the median household income for every block group in the United States. Then, in our implementation, we stored this data as a hashmap with the block group ID as the key and the median household income as the value.
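
As a rough illustration of that lookup structure, the load step boils down to something like the sketch below. The two-column CSV layout is a placeholder of mine; the real Factfinder export has more columns.

// Sketch: load (block_group_id, median_income) rows into a hash map.
// The two-column CSV layout is a stand-in for the real Factfinder export.
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, double> LoadMedianIncomes(const std::string& path) {
  std::unordered_map<std::string, double> incomes;
  std::ifstream in(path);
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream row(line);
    std::string block_group_id, income;
    if (std::getline(row, block_group_id, ',') && std::getline(row, income)) {
      try {
        incomes[block_group_id] = std::stod(income);
      } catch (...) {
        // Skip the header row and block groups with suppressed income data.
      }
    }
  }
  return incomes;
}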

The final set of data we needed to connect the two above was a relation between latitude/longitude coordinates and the corresponding block group ID. The best source we found for this was the FCC's Census Block Conversion API, which accepts a latitude and longitude pair and returns the corresponding Census block ID. We could then use the prefix of the block ID to get the block group ID. However, we could not realistically send a network request to the FCC for each of our 40 million tweets, so we had to implement a clustering system to map tweets to their estimated closest block group. We will cover this in more detail in the next section.
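
For reference, the block-to-block-group step is just a prefix: assuming the standard 15-digit block FIPS code (2 state + 3 county + 6 tract + 4 block digits), the block group ID is the first 12 digits. The example value below is made up.

// Sketch: derive a block group ID from the block FIPS code the FCC returns.
// Assumes the standard 15-digit layout; the example value is hypothetical.
#include <string>

std::string BlockGroupFromBlockFips(const std::string& block_fips) {
  return block_fips.substr(0, 12);  // state + county + tract + block group
}

// e.g. BlockGroupFromBlockFips("360610101001000") returns "360610101001".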

Experimentation

Our experiment required far more work than I think either of us planned for. When designing it, we also had a lot of unknown factors and hurdles that we were not sure how to overcome at first: not knowing how to translate latitude/longitude coordinate pairs into Census block groups, computer memory constraints, and having the necessary computational power to complete the experiment before our original report was due. We really had to work hard to optimize the computationally intensive code in our project just to be able to finish it at all.

In an intermediate sample report, we were able to resolve a lot of these concerns by producing a simpler proof of concept using very rudimentary data-mining techniques. The biggest hurdle we overcame was finding an open, unmetered API that could convert our latitude/longitude coordinate pairs to a Census block group. This API is made available by the FCC, and the source information can be found below. The steps we took to produce this first prototype are as follows:

  1. For our intermediate experiment we used a subset of 4 million tweets (about 10% of our final data set).

  2. To overcome the hurdle of calling the FCC latitude/longitude conversion API 4 million times (for many reasons this was unfeasible), we rounded the latitude/longitude pairs to 3 digits and kept a local cache of the calls we had already made. This simple technique, however, cost us a great deal of precision.

  3. With the block group IDs obtained, we looked up the expected income for each of those block groups using our map of Census data.

  4. From there, we were able to map each word of each tweet to its block group and then to its expected income, keeping a table of the counts of how many times the word appeared in each block group.

  5. After constructing the map from each word to the incomes of the areas it was used in, we then simply calculated an average (a rough sketch of this bookkeeping follows this list).
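
To make the bookkeeping in steps 2 through 5 concrete, here is a rough sketch rather than our exact code. It assumes "3 digits" means three decimal places, and the fetch_block_group callback stands in for the real FCC network call.

// Rough sketch of steps 2-5: round coordinates, cache FCC lookups, and
// accumulate each block group's income against every word used there.
#include <cmath>
#include <functional>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct WordStats {
  double income_sum = 0;
  long uses = 0;
};

// Naive whitespace tokenizer; the real experiment did a bit more cleanup.
std::vector<std::string> Tokenize(const std::string& text) {
  std::istringstream in(text);
  return std::vector<std::string>(std::istream_iterator<std::string>(in),
                                  std::istream_iterator<std::string>());
}

void ProcessTweet(
    const std::string& text, double lat, double lon,
    const std::function<std::string(double, double)>& fetch_block_group,
    const std::unordered_map<std::string, double>& median_incomes,
    std::map<std::pair<double, double>, std::string>& block_group_cache,
    std::unordered_map<std::string, WordStats>& word_stats) {
  // Step 2: round to three decimal places so nearby tweets share one lookup.
  std::pair<double, double> key(std::round(lat * 1000) / 1000,
                                std::round(lon * 1000) / 1000);
  auto cached = block_group_cache.find(key);
  if (cached == block_group_cache.end()) {
    cached = block_group_cache
                 .emplace(key, fetch_block_group(key.first, key.second))
                 .first;
  }

  // Step 3: look up the block group's median household income.
  auto income = median_incomes.find(cached->second);
  if (income == median_incomes.end()) return;

  // Steps 4-5: tally that income against every word in the tweet; the
  // expected income of a word is income_sum / uses at the end.
  for (const std::string& word : Tokenize(text)) {
    WordStats& stats = word_stats[word];
    stats.income_sum += income->second;
    ++stats.uses;
  }
}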

Even though we lost a large amount of precision when clustering by rounding and using a small subset of our data, we obtained very distinct and convincing results showing some of the exact correlations we were hoping to find. In doing this small test, we were also able to plan ahead for some of the memory and computational constraints we were going to have to deal with in the final experiment. This led to redeveloping the entire process in a sliced fashion to allow us to focus our resources on one slice of the data at a time. After slicing the data into Cartesian chunks and processing one at a time, we were able to finish our final computations in about 3 days. Our final experiment went as follows:

  1. Our final experiment utilized the full 40 million tweet breadth of our data. We chose to slice it based on latitude and longitude values rounded down to the nearest integer. Because our data was limited to the continental United States, this gave us 62 longitudes and 20 latitudes for 62 x 20 = 1240 slices.

  2. For each slice, we had to cluster the data to center points in order to make an FCC API call for each center rather than each individual tweet. Based on our intermediate report, we estimated we would be able to efficiently make around 500,000 calls to the API, so this was the total number of centers we aimed to cluster to.

  3. For each slice, using the k-means++ algorithm for finding Euclidean centers in 2 dimensions, we calculated somewhere between 0 and 500 centers (see the sketch after this list). The number of centers we chose for each slice depended on the number of tweets that landed in it. In some cases where major metropolitan areas landed in the slice, 500 centers was simply not enough to accurately portray the data it contained, so for these special cases we chose to split the slice into 10 further sub-slices.

  4. With the centers in hand, we were able to begin calling the FCC API to translate the latitude and longitude of each center into a block group. Using a system of network monitoring and a pool of worker threads, we made the calls as fast as possible. This part of the process took around 24 hours to complete on a single client computer.

  5. As with our intermediate experiment, we used the map of Census data to determine the income level for each block group and stored that value with the corresponding center.

  6. Once each center was associated with an income level, we began moving through our tweet data. For each tweet we had stored, we placed it in its appropriate Cartesian slice by rounding its latitude/longitude down to the nearest integer and then mapped it to its closest center based on its Euclidean distance.

  7. Keeping track of all the words from all the tweets mapped to each center, and hence to an expected income, we calculated the average income for each word using the same method as in the intermediate experiment.

  8. The final product that we obtained from this experiment was a list of every word used over the course of the 40 million tweets and an average expected household income from the location where each tweet was originally generated.
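
For the curious, the seeding in step 3 looks roughly like the sketch below. Lloyd refinement, the 500-call budget, and the oversized-slice handling are omitted, and the RNG setup is illustrative rather than what we actually ran.

// Minimal sketch of k-means++ seeding over one slice's tweet coordinates.
// Each remaining center is sampled with probability proportional to the
// squared distance from the nearest center already chosen.
#include <algorithm>
#include <limits>
#include <random>
#include <vector>

struct Point {
  double lat = 0;
  double lon = 0;
};

double SquaredDistance(const Point& a, const Point& b) {
  const double dlat = a.lat - b.lat;
  const double dlon = a.lon - b.lon;
  return dlat * dlat + dlon * dlon;
}

std::vector<Point> KMeansPlusPlusSeed(const std::vector<Point>& tweets, int k,
                                      std::mt19937& rng) {
  std::vector<Point> centers;
  if (tweets.empty() || k <= 0) return centers;

  // First center: a tweet chosen uniformly at random.
  std::uniform_int_distribution<std::size_t> uniform(0, tweets.size() - 1);
  centers.push_back(tweets[uniform(rng)]);

  // Distance from each tweet to its nearest chosen center so far.
  std::vector<double> dist2(tweets.size(), std::numeric_limits<double>::max());
  while (centers.size() < static_cast<std::size_t>(k)) {
    for (std::size_t i = 0; i < tweets.size(); ++i) {
      dist2[i] = std::min(dist2[i], SquaredDistance(tweets[i], centers.back()));
    }
    std::discrete_distribution<std::size_t> weighted(dist2.begin(), dist2.end());
    centers.push_back(tweets[weighted(rng)]);
  }
  return centers;
}

// After refinement, each tweet in the slice is assigned to its nearest
// center, and only the centers' coordinates are sent to the FCC API.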

Analysis

There are many ways the data we collected in our first experiment could be analyzed. Some of the interesting things we found are included here.

First, the vast majority of words are only used once in our data set. The four graphs below help to illustrate this. The first two graphs work with our total data set, sorted from 1 to about 10 million based on how many times each term was used. Words like "I", "the", and "is" (common English terms) are used around 10 million times each in our data. These words would appear at the top of the spike on the right-hand side of the graph. Meanwhile, in Figure 1, we see that a vast majority of our terms (about 3.1 million) are used only once. Figure 3 helps to illustrate this point on a logarithmic scale, showing the number of terms that were used at least X times.

This is an interesting observation. Why were so many terms only used one time? Upon looking into our data, we discovered that this phenomenon is mainly due to two factors: many words are misspelled, and hashtags are often a unique grouping of words (e.g. Heyyyyy #ThisResearchRocks).

This brings us to Figures 2 and 4. Because most of our terms are only used once in the data set, it is hard for us to be very confident in the income values we obtained for those terms. Words that were used at least 20 times, and hence have 20 income values associated with them, are far more trustworthy and interesting. We chose to do all of our further analysis on words that were used 20 or more times, a data set that contains about 150,000 unique terms.

So now let's talk about incomes. Once again, in order to construct these graphs, we take all of our data points (<word, expected income>) and sort them by expected income. The graphs then display an index into that list on the X-axis and the expected income at that index on the Y-axis. When we include all of our ~4.5 million terms in the graph on the left, we see that the median expected income across these words is about the same as the overall United States median ($51,900) and ranges between $0 and $250,000 (the highest income bucket collected by the US Census).

Figure 5 displays all the terms we found, while Figure 6 displays the trimmed list of terms that were used 20 times or more. We can see that Figure 5 has a curve that is far more linear than Figure 6. This is because terms used many times are more likely to tend towards the median, so in Figure 6 the majority of the graph hangs very close to the middle, with a relatively small number of terms at the tail ends.

The words at the tail ends described in the previous paragraph are the words we are most interested in. What we know about these words is that they were used a significant number of times and still did not tend towards the median income value. This tells us these words are used disproportionately in locations that have low or high reported incomes in the Census. With thousands of terms to choose from, it is hard to list all of our findings, but here are some of the observations we made about words close to each tail of the curve.

Low end of the curve findings (terms the poorest on Twitter would use)

  • College campus terms

  • Drug references

  • Profanity

  • Common misspellings

  • #medicareforall

  • #vegasstrip

  • #farmville

High end of the curve findings (terms the richest on Twitter would use)

  • Metropolitan terms

  • Financial terms

  • High tech terms

  • Arts terms

  • Gender equality terms

  • #downtonabbeypbs

  • #metopera

Another observation that was clear in the data is that metropolitan areas are naturally weighted with higher incomes, so metropolitan terms come in higher. The example below shows how a collection of terms commonly associated with Las Vegas (gambling, landmarks, etc.) shapes up next to a collection of terms commonly associated with New York City (landmarks, neighborhoods, etc.). The X-axis shows expected income while the Y-axis shows the number of matching terms at each particular income level. This graph shows that many more New York City terms are used at a high income level than Las Vegas terms. In fact, we see spikes of Las Vegas terms well below the $50,000 income mark while many spikes in New York terms sit well above the $100,000 income mark.

Conclusion

I don't think it's any surprise that the richest and poorest among us use vastly different vocabulary. I do, however, find it terrifically interesting to see just what that vocabulary is. For some final fun materials, I leave you with some snippets from the presentation poster we made for this project (note: the poster also includes a little info on our research on hashtag physical locality). Also, as a huge bonus, I have included a link here where you can see a list of popular words on Twitter and their expected incomes. The file is huge, around 250k lines, so you probably want to just download it and open it in your favorite text editor.

Link to Word Data!

Thurrott's commenters are amazing... by Taylor Stapleton

I'm not sure if this is worthy of a blog post or not. To be on the safe side though, I better get this down in ink. I was browsing around the internet today and stumbled upon www.thurrott.com. I don't know much about the site except that I assume the Thurrott in question is Paul Thurrott. Paul is someone I only know to be an intelligent and respected technology journalist. In my mind he is commonly associated with Microsoft and the TWIT crowd as well (not sure why, too lazy to research). Ok, enough blabber and on to the point.

So I'm reading a post on this site and then move on to the comments section for some heated juice. But heated juice is not what I found! The comments were... constructive. They were valuable. People who wrote the comments took time to think about what they were writing. They included proper spelling, punctuation, well-formed ideas, and even facts to back up arguments! I was just so astounded by this paradise of intellectual comments that I felt like I had to tell someone. In the image are some examples of one of the first well-founded, respectable comment debates I have ever seen on the internet. This will definitely be a site I pay attention to in the future.

 

Should one advertise a blog? by Taylor Stapleton

It seems like most blogs start out pretty slow. Such is naturally the case with this blog. Do I expect it will get famous anytime soon? Not in the slightest. Do I expect that someday we could have a modest regular audience? Maybe. Even though I work for Google, and specifically in an organization that specializes in the popularity of content on the internet, I really don't have a lot of clues as to how I could make my own content popular. I suppose I can hit the usual suspects like improving my SEO and self-promoting the content so that people might be tempted to share it with someone they think would be interested. I have done a little bit of this so far, and I think that my hosting provider also has a bit of this built in.

You seem to get a natural amount of SEO with a Squarespace site, and that is cool. If you were to directly search for the title of the site, you would probably find it easily on Google. This is great if you already have a following of people who are looking for you. However, if you are like me and there isn't a reason in this entire world that you should be interested in what I have to say as of this moment, you are going to have a very hard time discovering me. It's not like anyone links into my content. And the generic terms that I write my posts about are certainly not going to be searchable over content that is already popular on the internet. This has posed an interesting problem for me. Almost a bit of a chicken/egg problem.

One option I have considered is that I could seed the blog's numbers and viewership with a bit of cash. I'm not talking about something sketchy (or am I?), but I'm considering running an ad campaign for the site. I don't really care about making money on my content, but I would also care not to lose a ton of money on this site either, so I think it would be a pretty cheap and low-key ad campaign. Maybe some search advertising targeting keywords like {blog|tech|writing|nonsense}. I worry a little bit about alienating my own potential readers though. Say you have the average reader of a tech blog and they see an ad for my blog. Are they going to think it's shit because I'm advertising it? Also consider that if this person saw the ad and visited my page, they might not yet like the page because there isn't a lot of content yet. Versus the scenario where, if they found the blog organically sometime later when it's more fleshed out, they might enjoy it more and would be more likely to be a repeat reader.

Well this is an exciting post. Side note: if you came to this post through an ad, then you are from the future and already know the outcome of my musings here. Good lord do I lust for your superior knowledge.

Sorting binary outcomes using std::deque. by Taylor Stapleton

Well, at least I thought it was clever. Ok, so imagine you have a collection of things, and each of these things has a binary property. You need to take this list and order it such that all the items with their property set to 'true' come first. I'm sure there are a number of clever tricks for doing this; however, my new favorite is using std::deque. The reason std::deque works so well is that not only can it add elements to the end of its container in constant time, but it can also add elements to the front in constant time (using push_back and push_front respectively). As an example, on the left below, I have a rudimentary, inefficient method of doing this binary outcome sorting using std::list, and on the right, I showcase the std::deque method.
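
The side-by-side snippets don't come through in text form here, so this is roughly what the std::deque version looks like; the Item type and its flag are stand-ins for whatever you are actually sorting.

// Roughly the std::deque trick described above: walk the input once and
// push flagged items to the front, everything else to the back.
// The Item struct and its `passed` flag are stand-ins for your real type.
#include <deque>
#include <iostream>
#include <vector>

struct Item {
  int id;
  bool passed;  // the binary property we want to sort on
};

std::deque<Item> PartitionByFlag(const std::vector<Item>& items) {
  std::deque<Item> ordered;
  for (const Item& item : items) {
    if (item.passed) {
      ordered.push_front(item);  // O(1): flagged items end up first
    } else {
      ordered.push_back(item);   // O(1): unflagged items end up last
    }
  }
  return ordered;
}

int main() {
  std::vector<Item> items = {{1, false}, {2, true}, {3, false}, {4, true}};
  for (const Item& item : PartitionByFlag(items)) {
    std::cout << item.id << (item.passed ? " (true)\n" : " (false)\n");
  }
  // Prints 4 and 2 first (true), then 1 and 3 (false).
}

One caveat: push_front reverses the relative order of the 'true' items, so if you need a stable ordering, std::stable_partition from <algorithm> is the in-place standard-library alternative.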

Get paid to earn your master's degree? by Taylor Stapleton

A huge campus with many buildings. A terrific number of very smart people. Vague guidelines on project details. Learning new things every day. Having been thinking about these symptoms lately, I have decided that working at Google is as close as I can get to getting paid to go to school. I loved being in college for the most part. Learning was awesome, and being part of the college community was awesome. I loved my opportunities to teach other people too. For me, there were really only two drawbacks. First, general education courses are fucking useless in their current form (I'm a bit salty about this point because I still hold SO much resentment). Second, and perhaps more obvious, is that you have to pay a fantastic sum of money to attend most colleges.

So basically I'm trying to have a little bit different outlook on my job. This is not to say my outlook was very poor in the past; I have just found a way to improve it. I always wanted to get paid to go to school, and now I basically have the chance. And actually, if you think about it, this is a pretty world-class education considering some of the technologies and peers that one gets to work with here. And instead of paying to attend a university, I'm instead getting paid handsomely for doing work very similar to what I would be doing in school. So I guess I have a piece of advice for all those considering getting a master's degree in computer science.

Don't get a master's degree in computer science.

I think a lot of people would rightly freak out at anyone discouraging someone from higher education. But let's talk about just a couple of specific points to consider. Grad school can be really expensive. You could end up paying a lot of money for those couple of extra years of school; in some cases it might cost you six figures. Meanwhile, if you choose to go into industry, you are all but guaranteed to instead earn six figures over that same time frame, and in some cases you could be earning six figures per year right out of college. In terms of educational value, it depends on where you work. If you have the opportunity to work at one of the big companies like Google or Microsoft, you could be working with tools and technology that the educational realm won't get to play with for some time. But that isn't always true and should be carefully considered if it's important to you. A master's degree looks really good on a resume, but so does a couple of years at a job and a possible title change. By my estimation, someone with the title of Software Engineer II has equivalent, if not more, earning potential than someone with a master's.

I don't have any eloquent way to end this post.

Let's talk about robots... by Taylor Stapleton

Ok. So, I can't tell you how many times I have walked into a Walmart or a McDonald's and I just know that the caliber of employee that is going to help me is probably going to be terrible. In fact, I often wish that a lot of these service jobs would just be replaced by robots. Sleek, friendly, reliable robots. Well, maybe there are some of you out there who would love to have a robot as a helping hand around the office, or for testing as well. Turns out, this dream can actually be a reality (for an enormous sum of money, I'm sure). I got to see a presentation this week about one such company that makes some very specialized kinds of robots just for software testing.

OptoFidelity is a company that makes robots for the purpose of testing mobile phone applications. Essentially, besides an actual human, an OptoFidelity robot is the best you can get for running end-to-end manual tests on a device. Actually, in the presentation I saw several ways that this robot totally exceeds a human's capabilities. They seem to have many different form factors of robots with different capabilities, but among the things their robots can do are:

  • Mimic any kind of gesture, including different pressure levels and multiple fingers for pinching and zooming.
  • Hover a finger at a very accurate distance from the screen.
  • Do perfect stylus testing, getting stylus angle and pressure exactly correct.
  • Provide an array of sensors with which to measure the device's output (Bluetooth, NFC, Wi-Fi) with terrific accuracy.
  • Capture the screen with high-accuracy cameras.
  • Reboot your device manually, unlike almost any other form of test.

One of the main things I was really impressed with is that they had found things to test on a phone that were way outside the scope of what I had imagined. They have a high-speed camera watching the phone screen that is accurate enough to perfectly measure the frame rate of the screen at all times. So, for instance, the robot finger can touch the screen and the high-speed camera will measure exactly how fast and smooth the triggered animation is. So as part of your continuous integration build, you could have a step where you ship the build off to the robot-controlled phone. Once it's on the phone, the robot would do a quick check to see if your application is still as visually fast as it used to be, and if you are too slow, your code submission fails. WOW! I WANT THAT!

Apparently as one of their features, they also have the ability to do screen diff-ing and optical character recognition through the camera as well. Although, screen diff-ing through a camera sounds like an insanely hard computer science problem to overcome. So the questions I developed whilst watching this presentation are these:

  • How does one define a test suite for a robot? Can it be abstracted to the point where I can write a script for a robot that just gets called by methods like "SwipeUpFromBottom" or "PressTheGreenButton"?
  • Can you provide me an interface for which to import the video from the camera into my own software space and use my own image recognition software on the stream?

 

Making work an inviting place to be by Taylor Stapleton

I recently decided that I wanted to give myself more reasons to be at work. Google has been pretty interesting so far in the way that nobody really cares when you are at work. If I wanted, I could totally get away with coming in at ten and leaving by four every day. They call this "being results oriented", which is a fancy way of saying "as long as you get your work done, we don't care when or how." This proves to be a little difficult sometimes. I could come into work whenever I wanted, and most of the time I will be able to bust some stuff out in the hours I am there. However, if something comes up and I'm not very productive in the time I am there, I could see myself falling behind. Luckily I live with Leslie, my girlfriend with way more self-control than me, who also works at Google and will hold me accountable for work.

So, to continue on the first line of this post: there is more I can do to make my workplace somewhere I enjoy being. I did three things to work towards this goal. First, I got my shit together and requested a new office chair that isn't squeaky. Second, I ordered a really nice pair of headphones for work. Last, and certainly not the most normal, I bought some turf. It is just a small piece of artificial grass turf. It's not the stuff you would find at a mini-golf range; it's pretty nice. The individual "grass" fibers are pretty long, which makes placing your feet on it pleasant. It adds a bit of color to my space and, as previously mentioned, it's great for bare feet.

A picture of the turf in question.

In case you are strangely wondering, the turf I purchased can be found here.

A computer scientist playing a city builder, by Taylor Stapleton

So, I have recently been playing a game entitled "Cities in Motion 2". The entire premise of the game is to take a simulated city (a very well simulated city) and build a mass transportation system for it. I have always been fascinated by transport, and I love learning how things like trains work. So this game is perfect for me! I get to try insane mass transit ideas on an awesome engine and with no consequences. Inevitably, in any city you must have a central location where all the different forms of transportation come together to exchange passengers. I never realized how hard it can be to build a transit hub for millions of people to pass through. It's an optimization problem of the highest degree. Almost every design I try ends up having some kind of deadlock when there are too many vehicles passing through. The common city system will usually utilize buses, trolleys, trams, monorails, subways, and ferries. My latest design includes buses, trams, and monorails, and after many iterations it has yet to deadlock or even slow down. I'm very proud of its design, so I thought I would share it here.

So if you are some kind of engineer and you are looking for fun puzzles and problems to solve on the weekends, I would highly recommend this game. I picked it up when it was full price at $30, but it commonly goes on sale on Steam as well. Even at full price I would say that it's really quite delicious.

p.s. -- This game does not include an auto-save feature. So unless you want to spend 6 hours building your first amazing transit system only to have your computer crash and leave you cursing the god who created the people that created this game without an auto-save feature, go ahead and set reminders for yourself to save your game now and then. :-)

Why is compiling still a valid excuse for messing around? by Taylor Stapleton

Everyone knows it's OK to have a YouTube video up and playing on one monitor as long as you can visibly see something, anything, compiling on the other monitor, right? Because it seems to be a widely accepted notion that you are going to spend a certain percentage of your day just compiling your code, waiting to get all the juicy details on where you accidentally passed objects instead of pointers so you can fix them and compile again. It seems only appropriate that I kick this blog off in its early days talking about something that sincerely boggles my brain. Would someone like to explain to me why it's 2015 and there are still large companies and millions of developers waiting ages for code to compile just to get compiler-style feedback about their code?

Picture the scenario where I develop a piece of code. I'm not entirely sure if what I did is going to work correctly or not, so I have to change context and hop over to my terminal, figure out whatever build command is going to run the compiler and linker on my code, and then wait some indeterminate amount of time for it to tell me, in the most cryptic language possible, what's wrong. Though it seems impossible, there are still huge companies working in this workflow (at Google there are still tens of thousands of developers writing C++ in editors that barely do syntax highlighting). I have mainly observed this problem with languages like C, C++, and Python, but I'm sure there are many more scenarios where this is occurring.

A snippet from Visual Studio 2013 giving code feedback.


In the past I have had the pleasure of working in Visual Studio or, to a lesser degree, Eclipse. In VS I mainly worked in its primary language, C#, and in Java for Eclipse. Both of these tools did such an amazing job of giving instant code feedback. I can't even quantify how much more productive I was when working in those environments. Not only was I saving the maybe 10% of my day that I would otherwise spend compiling, but the ability to get instant feedback about the code's compilability (I'm going to pretend I just coined that term) helped me stay perfectly in context with what I was working on without ever going away and losing focus.

So think about this: software engineers create mind-blowingly cool things every day. Most people on this planet couldn't possibly fathom the work they do. So can we all take a moment to think about how nice it would be to stop wasting even a small portion of our time on something as silly as compilation? It could take years, but would someone please develop some more excellent tools for instant code feedback? And for all the vim and emacs users out there, take this moment to possibly consider that working in a modern IDE with code feedback tools is a very valid way of life. God, what a ranty first post.