Your Future On Stack Overflow
Posted: 2013/05/22
I recently spent a while working on a pretty fun problem over at Stack Exchange: predicting what tags you’re going to be active answering in.
Confirmed some suspicions, learned some lessons, got about a 10% improvement on answer posting from the homepage (which I’m choosing to interpret as better surfacing of unanswered questions).
Why do we care?
Stack Overflow has had the curious problem of being way too popular for a while now. So many new questions are asked, new answers posted, and old posts updated that the old “what’s active” homepage would cover maybe the last 10 minutes. We addressed this years ago by replacing the homepage with the interesting tab, which gives everyone a customized view of stuff to answer.
The interesting algorithm (while kind of magic) has worked pretty well, but the bit where we take your top tags has always seemed a bit sub-par. Intuitively we know that not all tags are equal in volume or scoring potential, and we also know that activity in one tag isn't really indicative of interest in just that tag.
What we’d really like in there is your future, what you’re going to want to answer rather than what you already have. They’re related, certainly, but not identical.
Stated more formally: what we wanted was an algorithm that, given a user and their activity on Stack Overflow to date, predicted for each tag what percentage of their future answers would be on questions in that tag. "Percentage" is a tad misleading since each question on Stack Overflow can have up to five tags, so the percentages don't sum to anything meaningful.
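To make that target concrete (with made-up data): for each tag, we want the fraction of a user's answers whose question carries that tag. Because a question carries up to five tags, an answer counts toward several tags at once, which is why the fractions can sum to more than one:

```python
from collections import Counter

def tag_fractions(answered_question_tags):
    """For each tag, the fraction of answers whose question carried that tag.

    answered_question_tags: one list of tags per answer posted.
    Fractions can sum to > 1 because a question has up to five tags.
    """
    counts = Counter(tag for tags in answered_question_tags for tag in set(tags))
    total = len(answered_question_tags)
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical history: two answers, both on questions tagged "c#",
# one of which was also tagged "linq".
history = [["c#", "linq"], ["c#"]]
print(tag_fractions(history))  # c#: 1.0, linq: 0.5 -- sums to 1.5
```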
The immediate use of such an algorithm would be improving the homepage, making the questions shown to you more tailored to your interests and expertise. With any luck the insights in such an algorithm would let us do similar tailoring elsewhere.
To TL;DR, you can check out what my system thinks it knows about you by going to /users/tag-future/current on any of the older Stack Exchange sites. The rest of this post is about how I built it, and what I learned doing it.
What Do We Know?
A big part of any modeling process is going to be choosing what data to look at. Cast too wide a net and your iteration time explodes, too narrow and you risk missing some easy gains. Practicality is also a factor, as data you technically have but never intended to query en masse may lead you to build something you can’t deploy.
What I ended up using is simply the answers on a site (their text, creation dates, and so on), along with the tags the associated questions had when the answer was posted. This data set has the advantage of being eminently available (after all, Stack Exchange has literally been built for the purpose of serving answers) and of being public knowledge.
At various times I did try using data from the questions themselves and an answerer's history of asking, but to no avail. I'm sure there's more data we could pull in, and probably will over time, though I intend to focus on our public data. In part this is because the public data is easier to explain and consume, but also because, intuitively, answerers are making decisions based on what they can see, so it makes sense to focus there first.
A Model Of A Stack Exchange
The actual process of deriving a model was throwing a lot of assumptions about how Stack Overflow (and other Stack Exchanges) work against the wall, and seeing what actually matched reality. Painstaking, time-consuming iteration. The resulting model does work (confirmed by split testing against the then-current homepage), and backs up with data a lot of things we only knew intuitively.
Some Tags Don’t Matter
It stands to reason that a tag that only occurs once on Stack Overflow is meaningless, and twice is probably just as meaningless. Which raises the question: when, exactly, does a tag start to matter? It turns out that before about forty uses, a tag on Stack Overflow has no predictive ability, so tags below that threshold aren't really worth looking at in isolation.
Similarly, a single answer isn't likely to tell us much about a user; what I'd expect to be significant is a habit of answering within a tag. How many answers before it matters? Looks like about three. My two answers in "windows-desktop-gadgets" say about as much about me as my astrological sign (Pisces, if you're curious).
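These two cutoffs are easy to express together. A minimal sketch, treating the thresholds quoted above (forty site-wide uses for a tag, three answers for a user) as tunable parameters, with hypothetical helper and variable names:

```python
MIN_TAG_USES = 40     # below this, a tag has no predictive ability site-wide
MIN_USER_ANSWERS = 3  # below this, a user's activity in a tag is just noise

def significant_tags(user_tag_counts, site_tag_counts):
    """Tags worth modeling for this user: common enough site-wide
    AND answered habitually (not just once or twice) by the user."""
    return {
        tag
        for tag, n in user_tag_counts.items()
        if n >= MIN_USER_ANSWERS and site_tag_counts.get(tag, 0) >= MIN_TAG_USES
    }

# Hypothetical counts: the rare tag fails both cutoffs, "c#" passes.
site = {"c#": 100_000, "windows-desktop-gadgets": 35}
user = {"c#": 12, "windows-desktop-gadgets": 2}
print(significant_tags(user, site))  # {'c#'}
```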
Most People Are Average (That’s Why It’s An Average)
What’s being asked on Stack Overflow is a pretty good indicator of what’s being used in the greater programming world, so it stands to reason that a lot of people’s future answering behavior is going to look like the “average user’s” answering behavior. In fact, I found that the best naive algorithm for predicting a user’s future was taking the site average and then overlaying their personal activity.
Surprisingly, despite the seemingly breakneck speed of change in software, looking at recent history when calculating the site average is a much worse predictor than considering all-time history. Likewise when looking at user history, even for very highly active users, recent activity is a worse predictor than all-time activity.
One interpretation of those results, which I have no additional evidence for, is that you don't really get worse at things over time; you mostly just learn new things. That would gel with recent observations about older developers being more skilled than younger ones.
You Transition Into A Tag
As I mentioned above, our best baseline algorithm was predicting the average tags of the site and then plugging in a user’s actual observed history. An obvious problem with that is that posting a single answer in say “java.util.date” could get us predicting 10% of your future answers will be in “java.util.date” even though you’ll probably never want to touch that again.
So again I expected there to be some number of uses of a tag after which your history in it is a better predictor than “site average”. On Stack Overflow, it takes about nine answers before you’re properly “in” the tag. Of course there needs to be a transition between “site average” and “your average” between three and nine answers, and I found a linear one works pretty well.
We All Kind Of Look The Same
Intuitively we know there are certain “classes” of users on Stack Overflow, but exactly what those classes are is debatable. Tech stack, FOSS vs MS vs Apple vs Google? Skill level, Junior vs Senior? Hobbyist vs Professional? Front-end vs Back-end vs DB? And on and on.
Instead of trying to guess those lines in the sand, I went with a different intuition: "users who start off similarly will end up similarly". So I clustered users based on their first N answers, then used what I knew about existing users to make predictions for new users who fall into the same cluster.
Turns out you can cut Stack Overflow users into about 440 groups based on about 60 initial tags (or, equivalently, about 30 answers) using some really naive assumptions about minimum distances in Euclidean space. Eyeballing the clusters, it's (very approximately) tech stack plus front/back-end that divides users most cleanly.
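The clustering details aren't spelled out in the text, but the "minimum distance in Euclidean space" assignment step might look like the following sketch, where the cluster centroids are assumed to have been computed elsewhere and each user is a vector of per-tag activity over their initial answers:

```python
import math

def nearest_cluster(user_vector, centroids):
    """Assign a user's initial-tag activity vector to the closest
    centroid by plain Euclidean distance (the 'really naive' part)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda i: dist(user_vector, centroids[i]))

# Hypothetical 3-tag vectors; the centroid labels are illustrative only.
centroids = [
    (0.9, 0.1, 0.0),  # e.g. a "C# back-end" cluster
    (0.0, 0.2, 0.8),  # e.g. a "JavaScript front-end" cluster
]
print(nearest_cluster((0.1, 0.3, 0.7), centroids))  # 1
```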
One Tag Implies Another
Testing the assumption that activity in one tag implies activity in others, I found that it does, in fact, match reality. The best approach I found was predicting activity in a tag given activity in commonly co-occurring tags (via a variation on principal component analysis) and making small up or down tweaks to the baseline prediction accordingly. This approach depends on there being enough data for co-occurrence to be meaningful, which I found to be true for about 12,000 tags on Stack Overflow.
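The actual method is a PCA variant, but the shape of the idea can be shown with a much simpler direct co-occurrence nudge. Everything below (the step size, the co-occurrence table, the tag names) is made up for illustration:

```python
def cooccurrence_adjust(baseline, user_activity, cooccurrence, step=0.02):
    """Nudge each tag's baseline prediction up when the user is active
    in tags that commonly co-occur with it. A gross simplification of
    the PCA-style approach described in the text.

    cooccurrence[(a, b)]: how strongly activity in `a` implies `b`, in [0, 1].
    """
    adjusted = dict(baseline)
    for active_tag, activity in user_activity.items():
        for tag in adjusted:
            strength = cooccurrence.get((active_tag, tag), 0.0)
            adjusted[tag] += step * strength * activity
    return adjusted

baseline = {"hibernate": 0.01, "python": 0.02}
user_activity = {"java": 1.0}
cooccurrence = {("java", "hibernate"): 0.8}  # java activity implies hibernate
result = cooccurrence_adjust(baseline, user_activity, cooccurrence)
# "hibernate" gets a small bump (0.01 -> ~0.026); "python" is untouched.
```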
Trust Your Instincts
Using the Force is optional.
One pretty painful lesson I learned doing all this is: don't put your faith in standard machine learning. It's very easy to get the impression online (or in survey courses) that rubbing a neural net or a decision forest against your data is guaranteed to produce improvements. Perhaps this is true if you've done nothing "by hand" to attack the problem, or if your problem is perfectly suited to off-the-shelf algorithms, but what I found over and over again is that the truthiness of my gut (and that of my co-workers) beats the one-size-fits-all solutions. You know rather a lot about your domain; it makes sense to exploit that expertise.
However, you also have to realize your instincts aren't perfect, and be willing to have the data invalidate your gut. As an example, I spent about a week trying to find a way to roll title words into the predictor, to no avail. TF-IDF, naive co-occurrence, some neural network approaches, and even our home-grown tag suggester never quite did well enough; titles were just too noisy with the tools at my disposal.
Get to testing live as fast as you possibly can; you can't have any real confidence in your model until it's actually running against live data. By necessity much evaluation has to be done offline, especially if you've got a whole bunch of gut checks to make, but once you think you've got a winner, start testing. The biggest gotcha revealed when my predictor went live was that the way I selected training data made for a really bad predictor for low-activity users, effectively shifting everything to the right. I solved this by training two separate predictors (one for low-activity users, and one for high).
Finally, as always, solving the hard part is 90% of the work; solving the easy part is also 90% of the work. If you're coming at a problem indirectly like we were, looking to increase answer rates by improving tag predictions, don't have a ton of faith in your assumptions about the ease of integration. It turned out that simply replacing observed history with a better prediction in our homepage algorithm broke some of the magic, and it took about twenty attempts to realize gains in spite of the predictor doing what we'd intended. The winning approach was considering how unusual a user is when compared to their peers, rather than considering them in isolation.
Again, want to see what we think you’ll be active in? Hit /users/tag-future/current on your Stack Exchange of choice.