Providence: What technologies do you know?

This post is part of a series on the Providence project at Stack Exchange.  The first post can be found here.

Having already built a classifier for “developer-kinds”, we next set out to determine what technologies people are using.  What we considered to be distinct technologies are things like “Microsoft’s web stack” or “Oracle’s database servers,” which are coarser groupings than, say, “ASP.NET” or “Oracle 11g.”

It was tempting to consider these technologies to be more specific developer-kinds, but we found that many people are knowledgeable about several technologies even if they specialize in only one or two kinds of development.  For example, many developers are at least passingly familiar with PHP web development even if they may be exclusively writing iOS apps for the time being.

The DataStack types not considered: jenga, pancake, fat.

Our starting point for this problem was to determine what technologies are both actually out there and used by large segments of the developer population.  We got the list of technologies by looking at the different job openings listed on Stack Overflow Careers for which companies are trying to hire developers. This was a decent enough proxy for our purposes.  After excluding a few labels for being too ambiguous, we settled on the following list:

  • PHP
  • Ruby On Rails
  • Microsoft’s Web Stack (ASP.NET, IIS, etc.)
  • Node.js
  • Python’s Web Stacks (Django, Flask, etc.)
  • Java’s Web Stacks (JSP, Struts, etc.)
  • WordPress
  • Cloud Platforms (Azure, EC2, etc.)
  • Salesforce
  • SharePoint
  • Windows
  • OS X
  • Oracle RDBMS
  • Microsoft SQL Server
  • MySQL

Something that stands out in these labels is the prevalence of Microsoft technologies.  We suspect this is a consequence of two factors: Microsoft’s reach is very broad due to its historical dominance of the industry, and Stack Overflow itself (and thus the jobs listed on Stack Overflow Careers) has a slight Microsoft bias due to early users being predominantly Coding Horror and Joel On Software readers.

Furthermore, mobile technologies are unrepresented in the list, likely because they are already adequately covered by the developer-kind predictions.  The reality on the ground is that developing for a particular mobile platform very strongly implies the use of particular technologies.  While some people do use things like Xamarin or RubyMotion to develop for iOS and Android, the vast majority do not.

The Classifier

We knew that many tags on Stack Overflow correlate strongly with a particular technology, but others, such as [sql], were split between several different technologies.  To determine these relationships, we looked at tag co-occurrence on questions.  For example, [jsp], a Java Web Stacks tag, co-occurs quite a bit more with [java] than with [windows].  This tells us that viewing a question tagged [java] should count more towards the “Java Web Stacks” technology than towards the “Windows” technology.  Repeatedly applying this observation allowed us to grow clouds of related tags, each weighted in proportion to its co-occurrence with tags of a known quality.  One small tweak over this simple system was to reduce the weights of tags that are commonly found in several technologies, such as [sql], which co-occurs with many different database, application, and web technologies.
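To make the co-occurrence idea concrete, here is a minimal sketch in Python.  It is not the Providence implementation: the function names, the toy questions, the single pass outward from the seeds (rather than repeated application), and the choice to damp shared tags by dividing by the number of technologies they appear in are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(questions):
    """Count per-tag usage and how often each pair of tags shares a question."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in questions:
        unique = sorted(set(tags))
        tag_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return tag_counts, pair_counts


def tag_weights(questions, seeds):
    """Weight non-seed tags by how often they co-occur with each technology's seeds.

    `seeds` maps a technology name to a set of seed tags (hypothetical structure).
    Returns {technology: {tag: weight}}.
    """
    tag_counts, pair_counts = cooccurrence_counts(questions)
    raw = defaultdict(lambda: defaultdict(float))
    for (a, b), shared in pair_counts.items():
        for tech, seed_tags in seeds.items():
            # Credit the non-seed tag with the fraction of its questions
            # that it shares with a seed tag of this technology.
            if a in seed_tags and b not in seed_tags:
                raw[tech][b] += shared / tag_counts[b]
            elif b in seed_tags and a not in seed_tags:
                raw[tech][a] += shared / tag_counts[a]

    # Assumed damping rule: a tag found in several technologies (like [sql])
    # has its weight divided by the number of technologies it shows up in.
    techs_per_tag = Counter(tag for weights in raw.values() for tag in weights)
    return {
        tech: {tag: w / techs_per_tag[tag] for tag, w in weights.items()}
        for tech, weights in raw.items()
    }


# Toy data: each inner list is the tag set of one question.
questions = [
    ["jsp", "java"],
    ["jsp", "java"],
    ["java", "windows"],
    ["winapi", "windows"],
    ["jsp", "sql"],
    ["winapi", "sql"],
]
seeds = {"Java Web Stacks": {"jsp"}, "Windows": {"winapi"}}
weights = tag_weights(questions, seeds)
```

In this toy data, [java] ends up weighted more heavily towards “Java Web Stacks” than [sql] does, because [sql] is shared between both technologies and is damped accordingly.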

This approach required that we hand-select seed tags for each technology.  The seeds used in the first version of Providence for each technology are:

  • PHP – php, php-*, zend-framework
  • Ruby On Rails – ruby-on-rails, ruby-on-rails-*, active-record
  • Microsoft’s Web Stack – asp.net, asp.net-*, iis, iis-*
  • Node.js – node.js, node.js-*, express, npm, meteor
  • Python’s Web Stacks – django, django-*, google-app-engine, flask, flask-*
  • Java’s Web Stacks – jsp, jsp-*, jsf, jsf-*, facelets, primefaces, servlets, java-ee, struts*, spring-mvc, tomcat
  • WordPress – wordpress, wordpress-*
  • Cloud Platforms – cloud, azure, amazon-ec2, amazon-web-services, amazon-s3
  • Salesforce – salesforce, salesforce-*
  • SharePoint – sharepoint, sharepoint-*, wss, wss-*
  • Windows – windows, windows-*, winapi, winforms
  • OS X – osx, osx-*, cocoa, applescript
  • Oracle RDBMS – oracle, oracle*, plsql
  • Microsoft SQL Server – sql-server, sql-server-*, tsql
  • MySQL – mysql, phpmyadmin
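As a rough illustration of how seed patterns like these could be expanded against the full tag list, the sketch below treats each `*` as a shell-style wildcard using Python's standard `fnmatch` module.  That reading of the asterisk, the `expand_seeds` helper, and the sample tags are assumptions; only three of the technologies above are shown.

```python
from fnmatch import fnmatchcase

# A few of the seed lists from above, written as glob patterns.
SEED_PATTERNS = {
    "PHP": ["php", "php-*", "zend-framework"],
    "Oracle RDBMS": ["oracle", "oracle*", "plsql"],
    "MySQL": ["mysql", "phpmyadmin"],
}


def expand_seeds(patterns, all_tags):
    """Return {technology: set of tags matching any of its seed patterns}."""
    return {
        tech: {tag for tag in all_tags if any(fnmatchcase(tag, p) for p in pats)}
        for tech, pats in patterns.items()
    }


# Hypothetical sample of tag names to match against.
all_tags = ["php", "php-5.3", "phpmyadmin", "oracle", "oracle11g",
            "plsql", "mysql", "python"]
seed_tags = expand_seeds(SEED_PATTERNS, all_tags)
```

Note the difference between `php-*` and `oracle*` under this reading: the former requires the hyphen, so [phpmyadmin] is not pulled into PHP, while the latter matches any tag that merely starts with "oracle".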

By this point, the technology classifier was functionally complete, but in order to reduce some noise in our predictions, we decided to constrain the number of tags we consider to 500 per technology and ignore any tags that are used on less than 0.05% of Stack Overflow questions.
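The two noise-reduction constraints can be sketched as a small pruning step.  The post does not say how the 500 tags are chosen or in what order the two filters run, so this sketch assumes the rare tags are dropped first and the highest-weighted survivors are kept; `prune_tags` and its parameters are hypothetical names.

```python
def prune_tags(weights, tag_question_counts, total_questions,
               max_tags=500, min_share=0.0005):
    """Drop rarely used tags, then keep the top-weighted tags per technology.

    weights: {technology: {tag: weight}}
    tag_question_counts: {tag: number of questions carrying that tag}
    min_share: 0.0005 corresponds to 0.05% of all questions.
    """
    pruned = {}
    for tech, tag_weights in weights.items():
        # Ignore tags used on less than `min_share` of all questions.
        frequent = {
            tag: w for tag, w in tag_weights.items()
            if tag_question_counts.get(tag, 0) / total_questions >= min_share
        }
        # Assumed selection rule: keep the `max_tags` heaviest remaining tags.
        top = sorted(frequent.items(), key=lambda kv: kv[1], reverse=True)
        pruned[tech] = dict(top[:max_tags])
    return pruned


# Toy example with a tiny cap so the effect is visible.
example_weights = {
    "MySQL": {"mysql": 1.0, "rare-tag": 0.9, "phpmyadmin": 0.8, "innodb": 0.5},
}
example_counts = {"mysql": 5000, "rare-tag": 3, "phpmyadmin": 600, "innodb": 700}
pruned = prune_tags(example_weights, example_counts,
                    total_questions=1_000_000, max_tags=2)
```

Here the hypothetical [rare-tag] is discarded despite its high weight because it appears on too few questions, and the cap then keeps only the two heaviest remaining tags.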

Next up: Actually using Providence for something useful