This post is part of a series on the Providence project at Stack Exchange. The first post can be found here.
Having already built a classifier for “developer-kinds”, we next set out to determine what technologies people are using. What we considered to be distinct technologies are things like “Microsoft’s web stack” or “Oracle’s database servers,” which are coarser groupings than, say, “ASP.NET” or “Oracle 11g.”
It was tempting to consider these technologies to be more specific developer-kinds, but we found that many people are knowledgeable about several technologies even if they specialize in only one or two kinds of development. For example, many developers are at least passingly familiar with PHP web development even if they may be exclusively writing iOS apps for the time being.
Our starting point for this problem was to determine what technologies are both actually out there and used by large segments of the developer population. We got the list of technologies by looking at the different job openings listed on Stack Overflow Careers for which companies are trying to hire developers. This was a decent enough proxy for our purposes. After excluding a few labels for being too ambiguous, we settled on the following list:
- Ruby On Rails
- Microsoft’s Web Stack (ASP.NET, IIS, etc.)
- Python’s Web Stacks (Django, Flask, etc.)
- Java’s Web Stacks (JSP, Struts, etc.)
- Cloud Platforms (Azure, EC2, etc.)
- OS X
- Oracle RDBMS
- Microsoft SQL Server
Something that stands out in these labels is the prevalence of Microsoft technologies. We suspect this is a consequence of two factors: Microsoft’s reach is very broad due to its historical dominance of the industry, and Stack Overflow itself (and thus the jobs listed on Stack Overflow Careers) has a slight Microsoft bias due to early users being predominantly Coding Horror and Joel On Software readers.
Furthermore, mobile technologies are not represented in the list, likely because they are already adequately covered by the developer-kind predictions. The reality on the ground is that developing for a particular mobile platform very strongly implies the use of particular technologies. While some people do use things like Xamarin or RubyMotion to develop for iOS and Android, the vast majority do not.
We knew that many tags on Stack Overflow correlate strongly with a particular technology, but others, such as [sql], were split between several different technologies. To determine these relationships, we looked at tag co-occurrence on questions. For example, [jsp] is a Java Web Stacks tag, and the fact that it co-occurs quite a bit more with [java] than with [windows] tells us that viewing a question tagged [java] should count more towards the “Java Web Stacks” technology than the “Windows” technology. Repeatedly applying this observation allowed us to grow clouds of related tags, each with a weight proportional to its co-occurrence with tags whose technology was already known. One small tweak over this simple system was to reduce the weights of tags that are commonly found in several technologies, such as [sql], which co-occurs with many different database, application, and web technologies.
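The co-occurrence idea can be illustrated with a minimal Python sketch of a single round of weighting. Everything here is an assumption made for illustration: the function name `cooccurrence_weights`, the sample questions, and in particular the damping rule (dividing a tag's weight by the number of technologies it is associated with). It is not Providence's actual implementation.

```python
from collections import Counter, defaultdict


def cooccurrence_weights(questions, seeds):
    """Weight each tag by how often it co-occurs with a technology's seed tags.

    questions: iterable of sets of tags (one set per question).
    seeds: dict mapping technology name -> set of seed tags.
    Returns dict technology -> {tag: weight between 0 and 1}.
    """
    tag_counts = Counter()
    co_counts = defaultdict(Counter)  # technology -> tag -> co-occurrence count
    for tags in questions:
        tag_counts.update(tags)
        for tech, seed_tags in seeds.items():
            if tags & seed_tags:  # question carries at least one seed tag
                co_counts[tech].update(tags)

    # Fraction of a tag's questions that also carry one of the technology's seeds.
    raw = {
        tech: {tag: n / tag_counts[tag] for tag, n in counts.items()}
        for tech, counts in co_counts.items()
    }

    # Assumed tweak: damp tags shared by several technologies by dividing
    # by the number of technologies they are associated with.
    spread = Counter(tag for tech_weights in raw.values() for tag in tech_weights)
    return {
        tech: {tag: w / spread[tag] for tag, w in tech_weights.items()}
        for tech, tech_weights in raw.items()
    }


# Made-up sample data, for illustration only.
questions = [
    {"java", "jsp"},
    {"java", "jsp", "sql"},
    {"java", "servlets"},
    {"windows", "winapi"},
    {"windows", "winapi", "sql"},
    {"windows", "winapi", "java"},
]
seeds = {"Java Web Stacks": {"jsp", "servlets"}, "Windows": {"winapi"}}
weights = cooccurrence_weights(questions, seeds)
```

In this toy data, [java] ends up weighted more heavily towards "Java Web Stacks" than towards "Windows", while [sql] is damped equally in both. Growing the tag clouds further would mean treating highly weighted tags as additional seeds and repeating the pass.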
This approach required that we hand-select seed tags for each technology. The seeds used in the first version of Providence for each technology are:
- PHP – php, php-*, zend-framework
- Ruby On Rails – ruby-on-rails, ruby-on-rails-*, active-record
- Microsoft’s Web Stack – asp.net, asp.net-*, iis, iis-*
- Node.js – node.js, node.js-*, express, npm, meteor
- Python’s Web Stacks – django, django-*, google-app-engine, flask, flask-*
- Java’s Web Stacks – jsp, jsp-*, jsf, jsf-*, facelets, primefaces, servlets, java-ee, struts*, spring-mvc, tomcat
- WordPress – wordpress, wordpress-*
- Cloud Platforms – cloud, azure, amazon-ec2, amazon-web-services, amazon-s3
- Salesforce – salesforce, salesforce-*
- SharePoint – sharepoint, sharepoint-*, wss, wss-*
- Windows – windows, windows-*, winapi, winforms
- OS X – osx, osx-*, cocoa, applescript
- Oracle RDBMS – oracle, oracle*, plsql
- Microsoft SQL Server – sql-server, sql-server-*, tsql
- MySQL – mysql, phpmyadmin
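As a rough sketch of how such seed lists could be applied, the snippet below matches a tag against a few of the patterns above. It assumes the `*` in the seed lists behaves like a shell-style wildcard, which the post does not state explicitly; `SEED_PATTERNS` and `seed_technologies` are hypothetical names.

```python
from fnmatch import fnmatchcase

# A subset of the seed lists above, written as shell-style patterns.
# Treating "*" as a glob wildcard is an assumption.
SEED_PATTERNS = {
    "Ruby On Rails": ["ruby-on-rails", "ruby-on-rails-*", "active-record"],
    "Oracle RDBMS": ["oracle", "oracle*", "plsql"],
    "Microsoft SQL Server": ["sql-server", "sql-server-*", "tsql"],
}


def seed_technologies(tag, patterns=SEED_PATTERNS):
    """Return the technologies (sorted by name) for which `tag` is a seed tag."""
    return sorted(
        tech
        for tech, globs in patterns.items()
        if any(fnmatchcase(tag, glob) for glob in globs)
    )
```

For instance, `seed_technologies("sql-server-2008")` matches only "Microsoft SQL Server", while a shared tag like `sql` matches no seed list and has to earn its weight through co-occurrence.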
By this point, the technology classifier was functionally complete, but in order to reduce some noise in our predictions, we decided to constrain the number of tags we consider to 500 per technology and ignore any tags that are used on less than 0.05% of Stack Overflow questions.
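The two noise-reduction rules could look something like the following sketch. The function name `prune_tags`, its parameters, and the order of operations (drop rare tags first, then keep the highest-weighted ones) are assumptions; only the 500-tag cap and the 0.05% threshold come from the text.

```python
def prune_tags(weights, tag_counts, total_questions, max_tags=500, min_share=0.0005):
    """Drop rarely used tags, then keep the `max_tags` highest-weighted ones.

    weights: dict tag -> weight for one technology.
    tag_counts: dict tag -> number of questions carrying that tag.
    min_share: minimum fraction of all questions a tag must appear on (0.05%).
    """
    frequent = {
        tag: weight
        for tag, weight in weights.items()
        if tag_counts.get(tag, 0) / total_questions >= min_share
    }
    # Highest weights first; the cap is applied after the frequency filter.
    top = sorted(frequent.items(), key=lambda item: item[1], reverse=True)[:max_tags]
    return dict(top)


# Made-up numbers, for illustration only.
example_weights = {"jsp": 1.0, "obscure-tag": 0.9, "java": 0.4, "sql": 0.2}
example_counts = {"jsp": 50, "obscure-tag": 2, "java": 900, "sql": 600}
pruned = prune_tags(example_weights, example_counts, total_questions=10_000, max_tags=2)
```

Here `obscure-tag` is removed despite its high weight because it appears on only 0.02% of the sample questions, and the cap of two then keeps `jsp` and `java`.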
At Stack Exchange, we’ve historically been pretty loose with our data analysis. You can see this in the “answered questions” definition (has an accepted answer or an answer with score > 0), “question quality” (measured by ad hoc heuristics based on votes, length, and character classes), the “interesting tab” homepage algorithm (backed by a series of experimentally determined weights), and a rather naïve question search function.
This approach has worked for a long time (turns out your brain is a great tool for data analysis), but as our community grows and we tackle more difficult problems we’ve needed to become more sophisticated. For example, we didn’t have to worry about matching users to questions when we only had 30 questions a day, but 3,500 a day is a completely different story. Some of our efforts to address these problems have already shipped, such as a more sophisticated homepage algorithm, while others are still ongoing, such as improvements to our search and quality scoring.
One of our other efforts is to better understand our users, which led to the Providence project. Providence analyzes our traffic logs to predict some simple labels (like “is a web developer” or “uses the Java technology stack”) for each person who visits our site. In its early incarnation, we only have a few labels but we’re planning to continue adding new labels in order to build new features and improve old ones.
While we can’t release the Stack Overflow traffic logs for privacy reasons, we believe it’s in the best interest of the community for us to document the ways we’re using them. Accordingly, this is the first post in a series on the Providence project. We’re going to cover each of the individual predictions made, as well as architecture, testing, and all the little (and not-so-little) problems we had shipping version 1.0.
We have also added a way for any user to download their current Providence prediction data because it’s theirs and they should be able to see and use it as they like. Users can also prevent other systems (Careers, the Stack Overflow homepage, etc.) from querying their Providence data if they want to.
First up: What kind of developer are you?
One of the first questions we wanted Providence to answer was ‘What “kind” of developer are you?’. This larger question also encompassed sub-questions:
- What are the different “developer-kinds”?
- How much, if at all, do people specialize in a single “kind” of development?
- Do these different kinds of developers use Stack Overflow differently?
We answered the first sub-question by looking at lots and lots of résumés and job postings. While there is definitely a fair amount of fuzziness in job titles, there’s a loose consensus on the sorts of developers out there. After filtering out some labels for which we just didn’t have much data (more on that later), we came up with this list of developer-kinds:
- Full Stack Web Developers
- Front End Web Developers
- Back End Web Developers
- Android Developers
- iOS Developers
- Windows Phone Developers
- Database Administrators
- System Administrators
- Desktop Developers
- Math/Statistics Focused Developers
- Graphics Developers
The second sub-question we answered by looking at typical users of Stack Overflow. Our conclusion was that although many jobs are fairly specialized, few developers focus on a single role to the exclusion of all else. This matched our intuition, because it’s pretty hard to avoid exposure to at least some web technologies, not to mention developers love to tinker with new things for the heck of it.
Answering the final sub-question was nothing short of a leap of faith. We assumed that different kinds of developers viewed different sets of questions; and, as all we had to use were traffic logs, we couldn’t really test any other assumptions anyway. Having moved forward regardless, we now know that we were correct, but at the time we were taking a gamble.
A prerequisite for any useful analysis is data, and for our developer-kind predictions we needed labeled data. Seeing that Providence did not yet exist, this data had not been gathered. This is a chicken and egg problem that frequently popped up during the Providence project.
Our solution was an activity we’ve taken to calling “labeling parties.” Every developer at Stack Exchange was asked to go and categorize several randomly chosen users based on their Stack Overflow Careers profile, and we used this to build a data set. For the developer-kinds problem, our labeling party hand classified 1,237 people.
In our experience, naïvely rubbing standard machine learning algorithms against our data rarely works. The same goes for developer-kinds. We attacked this problem in three different steps: structure, features, algorithms.
Looking over the different developer-kinds, it’s readily apparent that there’s an implicit hierarchy. Many kinds are some flavor of “web developer,” while others are “mobile developer,” and the remainder are fairly niche; we’ve taken to calling “web,” “mobile,” and “other” major developer-kinds. This observation led us to first classify the major developer-kind, and then proceed to the final labels.
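The two-stage structure can be sketched as follows. The toy rules below merely stand in for trained classifiers; their names, thresholds, and the tag-view inputs are hypothetical and only the "major kind first, final label second" shape reflects the text.

```python
def predict_developer_kind(views, major_classifier, minor_classifiers):
    """Two-stage prediction: pick the major kind, then the final label within it."""
    major = major_classifier(views)
    return major, minor_classifiers[major](views)


# Toy stand-ins for trained classifiers (hypothetical rules, for illustration only).
def toy_major(views):
    web = views.get("html", 0) + views.get("css", 0)
    mobile = views.get("ios", 0) + views.get("android", 0)
    if web == 0 and mobile == 0:
        return "other"
    return "web" if web >= mobile else "mobile"


def toy_mobile(views):
    if views.get("ios", 0) >= views.get("android", 0):
        return "iOS Developer"
    return "Android Developer"


def toy_web(views):
    if views.get("css", 0) > views.get("sql", 0):
        return "Front End Web Developer"
    return "Back End Web Developer"


def toy_other(views):
    return "Database Administrator" if views.get("sql", 0) > 0 else "Desktop Developer"


minor = {"web": toy_web, "mobile": toy_mobile, "other": toy_other}
prediction = predict_developer_kind({"android": 12, "html": 3}, toy_major, minor)
```

One appeal of this layout is that each second-stage classifier only has to separate labels within its own major kind, so it can use features that would be noise for the others.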
Since we only really have question tag view data to use in the initial version of Providence, all of our features are naturally tag focused. The breakdowns of the groups of tags used in each classifier are:
- Major Developer-Kinds
  - Mobile programming languages (java, objective-c, etc.)
  - Non-web, non-mobile programming languages
  - Web technologies (html, css, etc.)
- Mobile Developer-Kinds
  - iDevice related (ios, objective-c, etc.)
  - Android related (android, listview, etc.)
  - Windows Phone related (windows-phone, etc.)
- Other Developer-Kinds
  - Each of the top 100 used tags on Stack Overflow
  - Pairs of each of the top 100 used tags on Stack Overflow
  - SQL related (sql, tsql, etc.)
  - Database related (mysql, postgresql, etc.)
  - Linux/Unix related (shell, bash, etc.)
  - Math related (matlab, numpy, etc.)
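To make the tag-group features above concrete, here is a minimal sketch that turns a user's viewed questions into per-tag, per-pair, and per-group counts. The post does not say how pair features are defined; counting questions that carry both tags is an assumption, as are the function name `tag_features` and the feature naming scheme.

```python
from itertools import combinations


def tag_features(viewed_questions, top_tags, groups):
    """Build count features from the tag sets of the questions a user viewed.

    viewed_questions: list of sets of tags, one set per viewed question.
    top_tags: tags that each get an individual feature and pairwise features.
    groups: dict feature name -> set of tags forming a related group.
    """
    features = {}
    for tag in top_tags:
        features[tag] = sum(tag in tags for tags in viewed_questions)
    # Assumed definition of a pair feature: questions carrying both tags.
    for a, b in combinations(sorted(top_tags), 2):
        features[f"{a}+{b}"] = sum(a in tags and b in tags for tags in viewed_questions)
    # A group counts a question once if it carries any tag in the group.
    for name, members in groups.items():
        features[name] = sum(bool(tags & members) for tags in viewed_questions)
    return features


# Made-up view history, for illustration only.
viewed = [{"sql", "mysql"}, {"bash", "shell"}, {"sql", "python"}]
groups = {"sql_related": {"sql", "tsql"}, "linux_related": {"shell", "bash"}}
features = tag_features(viewed, top_tags=["sql", "python"], groups=groups)
```

With 100 top tags the pairwise step yields 4,950 features, which is one reason the group-level features are a useful, more compact complement.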
For many features, rather than use the total tag views, we calculate an average and then use the deviation from that. With some features, we calculate this deviation for each developer-kind in the training set; for example, we calculate deviation from average web programming language tag views for each of the web, mobile, and other developer-kinds in the Major Developer Predictor.
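A minimal sketch of the deviation idea, assuming "deviation" means a plain difference from the per-kind training average (a standardized score would be another reasonable reading). The function name and the numbers are hypothetical.

```python
from statistics import mean


def deviation_features(user_views, training_views_by_kind):
    """Return the user's view count minus the training average for each kind.

    user_views: the user's view count for one tag group.
    training_views_by_kind: dict developer-kind -> list of view counts for
        that tag group among labeled training users of that kind.
    """
    return {
        kind: user_views - mean(views)
        for kind, views in training_views_by_kind.items()
    }


# Made-up training data for the "web programming language" tag group.
training = {"web": [40, 60], "mobile": [10, 20], "other": [5, 15]}
deviations = deviation_features(30, training)
```

A user with 30 such views sits 20 below the average web developer but 15 above the average mobile developer in this toy data, which gives the classifier one feature per major developer-kind instead of a single raw count.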
Turning these features into final predictions requires an actual machine learning algorithm, but in my opinion, this is the least interesting bit of Providence. For these predictors we found that support vector machines, with a variety of kernels, produce acceptably accurate predictions. However, the choice of algorithm mattered little: various flavors of neural networks also performed reasonably well, and the largest gains always came from introducing new features.
So how well did this classifier perform? We measured performance with a split test of job listing ads, in which the control group was served by our existing algorithm, which considered only geography. We’ll be covering our testing methodology in more depth in a future post. In the end we saw an improvement of 10-30% over the control algorithm, with the largest gains in the Mobile Developer-Kinds and the smallest in the Web Developer-Kinds.
Next up: What technologies do you know?