In Part I of this article, I showed my analysis of the data I pulled and tagged from Ask HN: Who is Hiring? threads over the last 7 years. This second part is a walkthrough of how that data was fetched, cleaned up, and analyzed using different pieces of Python code.
My very first challenge was to scrape the data from Hacker News. All Python programmers know that there is nothing better than Beautiful Soup for scraping HTML pages. Although I did discover Hacker News' public API, I figured that it does not give me any additional information beyond what my Beautiful Soup scraper was already providing, so I stuck with raw HTML and Beautiful Soup. Here is the code to fetch and cache the HN page and then extract the job title information (written as a list of hyphen/pipe-separated sentences) and the description from the text of each posting:
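A minimal sketch of that step follows. The cache directory, the placeholder thread URL, and the selectors used to find top-level comments (tr.athing.comtr, the td.ind img whose width encodes indentation, and span.commtext) are my assumptions about HN's markup rather than code from the original post:

import os
import re
import requests
from bs4 import BeautifulSoup

CACHE_DIR = "cache"  # hypothetical local cache directory

def fetch_page(url):
    """Fetch an HN page, caching the raw HTML on disk so re-runs don't re-download."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, re.sub(r"\W+", "_", url) + ".html")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    html = requests.get(url).text
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(html)
    return html

def parse_postings(html):
    """Extract (title line, description) pairs from the thread's top-level comments."""
    soup = BeautifulSoup(html, "html.parser")
    postings = []
    for comment in soup.select("tr.athing.comtr"):
        indent = comment.select_one("td.ind img")
        if indent is None or indent.get("width") != "0":
            continue  # replies are indented; only top-level comments are job posts
        text_span = comment.select_one("span.commtext")
        if text_span is None:
            continue  # deleted or flagged comments have no text
        lines = text_span.get_text("\n").split("\n")
        postings.append((lines[0], "\n".join(lines[1:])))
    return postings

# Usage (the item id below is a placeholder for an actual "Who is hiring?" thread):
# postings = parse_postings(fetch_page("https://news.ycombinator.com/item?id=XXXXXXX"))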
Once I had my data in a clean list, the next challenge was to extract information from it in tabular form. Luckily, for the data we are mainly interested in, especially the posts after 2014, the comments have their first line separated into pipe-delimited columns like this:
Replicated | QA Automation Engineer | $100k – $130k + equity | Los Angeles | https://www.replicated.com
There is a slight issue, though: the order of these columns is not consistent across postings. The name of the company is almost always the first entity, but the rest of the entities can appear in any order. Here is another title line to show what I mean:
Thinknum | New York | Multiple Positions | On-site – Full-time | $90k-$140k + equity
As you can see, the location appears in the fourth column of the former posting but in the second column of the latter. Not only is the order of entities not guaranteed, there is no minimum number of columns either. Some companies mention a location; others don't. Some mention compensation information; others skip it.
To get around this problem, I decided to create a strategy for classifying each of the column types. Here, for example, is a simple match for finding out whether a column contains a URL:
url = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', token)
If a match is found, we can simply store the job URL in its own column. Similarly, the code for classifying some of the other columns looks like this:
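Here is a sketch of those checks; the patterns for compensation, remote/on-site, and engagement-type columns are illustrative assumptions rather than the post's original rules:

import re

def classify_token(token):
    """Best-effort classification of a single pipe-separated column."""
    token = token.strip()
    if re.search(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", token):
        return "url"
    if re.search(r"\$\s*\d+\s*k", token, re.IGNORECASE) or "equity" in token.lower():
        return "compensation"
    if re.search(r"\b(remote|on-?site)\b", token, re.IGNORECASE):
        return "remote_onsite"
    if re.search(r"\b(full-?time|part-?time|intern(ship)?|contract)\b", token, re.IGNORECASE):
        return "engagement"
    return "unknown"  # left for later passes: company name, job title, location

title_line = "Replicated | QA Automation Engineer | $100k - $130k + equity | Los Angeles | https://www.replicated.com"
print([classify_token(t) for t in title_line.split("|")])
# ['unknown', 'unknown', 'compensation', 'unknown', 'url']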
We now have a semblance of a spreadsheet emerging from our code. However, it still lacks two of the most interesting columns: location and technology.
For technology, I made a list of all of the technologies I saw appearing in the posts, with as many of their variations as possible, and then simply did an iterative word match against the title of each post. Here is the list of technologies I used:
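A trimmed, illustrative version of such a list is shown below; each group starts with the canonical name followed by common spelling variants, and the entries are examples rather than the complete set the analysis used:

TECHNOLOGIES = [
    ["python", "python3", "python 3"],
    ["javascript", "js"],
    ["node", "nodejs", "node.js"],
    ["react", "reactjs", "react.js"],
    ["golang", "go"],
    ["postgresql", "postgres"],
    ["ruby on rails", "rails", "ror"],
    ["c++", "cpp"],
]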
If any of the variations matches, the first entry in its group is used as the consistent name and added to the technology column. It is helpful to store the data in a denormalized format for easier analysis later on, so if a job post mentions multiple technologies, multiple rows are created for it.
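A sketch of that matching and denormalization step, assuming the illustrative TECHNOLOGIES groups above and postings that have already been parsed into dicts of columns (a hypothetical shape):

import re

def match_technologies(title):
    """Return the canonical names of all technology groups whose variants appear in the title."""
    # Pad with spaces and strip separators so whole-word matches become simple substring checks.
    normalized = " " + re.sub(r"[|,()/]", " ", title.lower()) + " "
    found = []
    for variants in TECHNOLOGIES:
        if any(" %s " % variant in normalized for variant in variants):
            found.append(variants[0])  # the first entry serves as the consistent name
    return found

# Denormalize: emit one row per (posting, technology) pair.
rows = []
for posting in postings:                 # hypothetical list of dicts with a "title" key
    for tech in match_technologies(posting["title"]):
        rows.append(dict(posting, technology=tech))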
Possibly the most interesting problem I ended up solving was correctly tagging the location of each job post. Since a virtually infinite number of locations and their variations can appear throughout these threads, I decided I needed a good NLP package for location tagging. I was tempted to use NLTK, which is the de facto natural language processing library among Python programmers. However, I stumbled upon spaCy, which looked marginally more modern than NLTK. My first task was to see whether a sentence or part of a sentence contains a location, and spaCy's named entity recognizer (NER) provided a neat solution:
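A minimal sketch of that check, assuming spaCy's pretrained en_core_web_sm model and keeping only the GPE/LOC entity labels:

import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English pipeline

def extract_locations(text):
    """Return the spans that spaCy's NER tags as geopolitical entities or locations."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

print(extract_locations("Thinknum | New York | Multiple Positions | On-site - Full-time"))
# e.g. ['New York']  (exact output depends on the model version)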
For a given sentence or part of a sentence, the code above is able to identify the portion that represents a location. I noticed that when a sentence or token contains multiple locations, spaCy's NER sometimes fails to correctly identify the extent of each location. For example, it can tag ‘San Francisco & Pleasanton, CA’ as a single location when in fact that string refers to two different places. Luckily, you can train spaCy to recognize such entities better. Here is how you train it:
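A minimal sketch of the training loop in the spaCy 2.x style (the training API changed in spaCy 3), with made-up annotated examples that mark the character offsets of each GPE span:

import random
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical training data: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("San Francisco & Pleasanton, CA", {"entities": [(0, 13, "GPE"), (16, 30, "GPE")]}),
    ("Remote or New York, NY", {"entities": [(10, 22, "GPE")]}),
]

# Update only the NER component; keep the rest of the pipeline frozen.
other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)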
After this training, spaCy starts recognizing locations within text quite accurately. However, only half of our problem is solved at this point.
spaCy can recognize that the strings ‘Cambridge, UK’, ‘Cambridge, MA’ and ‘Cambridge, USA’ all contain a location named Cambridge, but it cannot determine that two of those three locations are actually identical while the third one is on a different continent altogether. This meant that in addition to a named entity recognizer, I also needed a geocoder that would allow me to consistently geocode ‘Cambridge, MA’ and ‘Cambridge, USA’ to the same city. A quick Google search revealed the really awesome geopy library, which encapsulates pretty much all of the popular geocoding services, including Google Maps, Bing Maps, and MapQuest. The one I ended up selecting was Photon, a free geocoder based on OpenStreetMap. I chose Photon because a) it's free and b) you can literally download its 55 GB search index onto your own machine, which allows for the blazing-fast lookups I needed for the tens of thousands of location strings identified by spaCy.
This was the last piece of the puzzle I needed to solve:
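A sketch of that geocoding step using geopy's Photon backend; the localhost domain (pointing at a locally hosted Photon index) and the handling of the raw GeoJSON properties are assumptions, not details from the original post:

from geopy.geocoders import Photon

# Point geopy at a locally downloaded Photon index (hypothetical host/port);
# by default the Photon geocoder uses the public endpoint.
geolocator = Photon(domain="localhost:2322", scheme="http")

def geocode(location_string):
    """Resolve a location string to a consistent (city, state, country) tuple."""
    result = geolocator.geocode(location_string)
    if result is None:
        return None
    props = result.raw.get("properties", {})  # Photon returns GeoJSON features
    return (props.get("city") or props.get("name"),
            props.get("state"),
            props.get("country"))

print(geocode("Cambridge, MA"))
print(geocode("Cambridge, USA"))
# Both variants should resolve to the same city record for Cambridge, Massachusetts.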
The code above correctly maps strings referring to Cambridge, Massachusetts to the same New England city.
There it is!
All the pieces needed to correctly fetch, parse, tokenize, recognize, categorize, and geocode data from the Ask HN: Who is Hiring? threads. If you haven't read Part I of this series, it covers the results of this analysis. Do check it out!
This post was originally published on Medium.