Your Future On Stack Overflow

Posted: 2013/05/22 | Filed under: pontification
I recently spent a while working on a pretty fun problem over at Stack Exchange: predicting what tags you’re going to be active answering in.
Confirmed some suspicions, learned some lessons, got about a 10% improvement on answer posting from the homepage (which I’m choosing to interpret as better surfacing of unanswered questions).
Why do we care?
Stack Overflow has had the curious problem of being way too popular for a while now. So many new questions are asked, new answers posted, and old posts updated that the old “what’s active” homepage would cover maybe the last 10 minutes. We addressed this years ago by replacing the homepage with the interesting tab, which gives everyone a customized view of stuff to answer.
The interesting algorithm (while kind of magic) has worked pretty well, but the bit where we take your top tags has always seemed a bit sub-par. Intuitively we know that not all tags are equal in volume or scoring potential, and we also know that activity in one tag is often indicative of interest in other tags, not just that one.
What we’d really like in there is your future, what you’re going to want to answer rather than what you already have. They’re related, certainly, but not identical.
Stated more formally: what we wanted was an algorithm that, given a user and their activity on Stack Overflow to date, predicted for each tag what percentage of their future answers would be on questions in that tag. “Percentage” is a tad misleading since each question on Stack Overflow can have up to five tags, so the percentages don’t sum to anything meaningful.
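To make the target concrete, here’s a small sketch (in Python, purely illustrative, not what runs on Stack Exchange) of the quantity being predicted; the data is hypothetical:

```python
from collections import Counter

def tag_fractions(answered_questions):
    """Fraction of a user's answers that touch each tag.

    answered_questions: one list of tags per answered question.
    Because a question can carry up to five tags, these fractions
    need not sum to 1.0.
    """
    counts = Counter(tag for tags in answered_questions for tag in tags)
    total = len(answered_questions)
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical future activity for one user: three answers.
future = [["java", "android"], ["java"], ["sql"]]
print(tag_fractions(future))  # java appears in 2 of 3 answers -> 2/3
```

Note that the fractions here sum to 4/3, which is exactly why “percentage” is the wrong mental model.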
The immediate use of such an algorithm would be improving the homepage, making the questions shown to you more tailored to your interests and expertise. With any luck the insights in such an algorithm would let us do similar tailoring elsewhere.
To TL;DR, you can check out what my system thinks it knows about you by going to /users/tag-future/current on any of the older Stack Exchange sites. The rest of this post is about how I built it, and what I learned doing it.
What Do We Know?
A big part of any modeling process is going to be choosing what data to look at. Cast too wide a net and your iteration time explodes, too narrow and you risk missing some easy gains. Practicality is also a factor, as data you technically have but never intended to query en masse may lead you to build something you can’t deploy.
What I ended up using is simply the answers on a site (their text, creation dates, and so on), along with the tags the associated questions had when the answer was posted. This data set has the advantage of being eminently available (after all, Stack Exchange has literally been built for the purpose of serving answers) and of being public knowledge.
At various times I did try using data from the questions themselves and an answerer’s history of asking, but to no avail. I’m sure there’s more data we could pull in, and probably will over time, though I intend to focus on our public data. In part this is because the public data is easier to explain and consume, but also because intuitively answerers are making decisions based on what they can see, so it makes sense to focus there first.
A Model Of A Stack Exchange
The actual process of deriving a model was throwing a lot of assumptions about how Stack Overflow (and other Stack Exchanges) work against the wall, and seeing what actually matched reality. Painstaking, time-consuming iteration. The resulting model does work (confirmed by split testing against the then-current homepage), and backs up with data a lot of things we only knew intuitively.
Some Tags Don’t Matter
It stands to reason that a tag that occurs only once on Stack Overflow is meaningless, and twice is probably just as meaningless. Which raises the question: when, exactly, does a tag start to matter? Turns out, before about forty uses a tag on Stack Overflow has no predictive ability, so tags below that threshold aren’t really worth looking at in isolation.
Similarly, a single answer isn’t likely to tell us much about a user; what I’d expect to be significant is a habit of answering within a tag. How many answers before it matters? Looks like about three. My two answers in “windows-desktop-gadgets” say about as much about me as my astrological sign (Pisces, if you’re curious).
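Those two thresholds can be applied as a simple filter. A minimal sketch (names and data hypothetical; the real thresholds were found empirically, as described above):

```python
SITE_MIN_USES = 40      # below ~40 uses, a tag has no predictive ability
USER_MIN_ANSWERS = 3    # below ~3 answers, a user's activity in a tag is noise

def significant_tags(site_tag_counts, user_tag_counts):
    """Tags worth looking at in isolation for this user: the user has
    a habit in them AND the site has enough data on them."""
    return {
        tag
        for tag, n in user_tag_counts.items()
        if n >= USER_MIN_ANSWERS
        and site_tag_counts.get(tag, 0) >= SITE_MIN_USES
    }

site = {"java": 50000, "windows-desktop-gadgets": 12}
user = {"java": 5, "windows-desktop-gadgets": 2}
print(significant_tags(site, user))  # {'java'}
```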
Most People Are Average (That’s Why It’s An Average)
What’s being asked on Stack Overflow is a pretty good indicator of what’s being used in the greater programming world, so it stands to reason that a lot of people’s future answering behavior is going to look like the “average user’s” answering behavior. In fact, I found that the best naive algorithm for predicting a user’s future was taking the site average and then overlaying their personal activity.
Surprisingly, despite the seemingly breakneck speed of change in software, looking at recent history when calculating the site average is a much worse predictor than considering all-time. Likewise when looking at user history, even for very highly active users, recent activity is a worse predictor than all time.
One interpretation of those results, which I have no additional evidence for, is that you don’t really get worse at things over time you mostly just learn new things. That would gel with recent observations about older developers being more skilled than younger ones.
You Transition Into A Tag
As I mentioned above, our best baseline algorithm was predicting the average tags of the site and then plugging in a user’s actual observed history. An obvious problem with that is that posting a single answer in, say, “java.util.date” could get us predicting 10% of your future answers will be in “java.util.date”, even though you’ll probably never want to touch that again.
So again I expected there to be some number of uses of a tag after which your history in it is a better predictor than “site average”. On Stack Overflow, it takes about nine answers before you’re properly “in” the tag. Of course there needs to be a transition between “site average” and “your average” between three and nine answers, and I found a linear one works pretty well.
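The shape of that blend can be sketched in a few lines of Python (illustrative only, with the 3-and-9 answer boundaries from above; `site_avg` and `user_avg` here are the hypothetical per-tag fractions for the site and the user):

```python
def blend_weight(answers_in_tag, start=3, full=9):
    """Weight on the user's own average: 0.0 at or below `start`
    answers, 1.0 at or above `full`, linear in between."""
    if answers_in_tag <= start:
        return 0.0
    if answers_in_tag >= full:
        return 1.0
    return (answers_in_tag - start) / (full - start)

def predict_tag(site_avg, user_avg, answers_in_tag):
    """Blend the site-wide average with the user's observed rate."""
    w = blend_weight(answers_in_tag)
    return (1 - w) * site_avg + w * user_avg

# Halfway through the transition (6 answers), the prediction sits
# halfway between the site average and the user's own rate.
print(predict_tag(0.1, 0.5, 6))  # 0.3
```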
We All Kind Of Look The Same
Intuitively we know there are certain “classes” of users on Stack Overflow, but exactly what those classes are is debatable. Tech stack, FOSS vs MS vs Apple vs Google? Skill level, Junior vs Senior? Hobbyist vs Professional? Front-end vs Back-end vs DB? And on and on.
Instead of trying to guess those lines in the sand, I went with a different intuition, which was “users who start off similarly will end up similarly”. So I clustered users based on some N initial answers, then used what I knew about existing users to make predictions for new users who fall into the same cluster.
Turns out you can cut Stack Overflow users into about 440 groups based on about 60 initial tags (or, equivalently, about 30 answers) using some really naive assumptions about minimum distances in Euclidean space. Eyeballing the clusters, it’s (very approximately) tech stack plus front/back-end that divides users most cleanly.
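A nearest-centroid assignment under plain Euclidean distance is about as naive as it gets, and is the flavor of thing meant here. A toy sketch (the centroids and the 3-tag space are made up for illustration; the real model used ~60 tags):

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_cluster(user_vec, centroids):
    """Index of the nearest centroid: minimum Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: euclidean(user_vec, centroids[i]))

# Toy 3-tag space: (java, javascript, css) answer fractions.
centroids = [(0.9, 0.05, 0.0),   # roughly "back-end JVM"
             (0.0, 0.6, 0.4)]    # roughly "front-end web"
print(assign_cluster((0.1, 0.5, 0.3), centroids))  # 1
```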
One Tag Implies Another
Intuitively, activity in one tag should imply activity in related ones. Testing that assumption, I found it does, in fact, match reality. The best approach I found was predicting activity in a tag given activity in commonly co-occurring tags (via a variation on principal component analysis) and making small up or down tweaks to the baseline prediction accordingly. This approach depends on there being enough data for co-occurrence to be meaningful, which I found to be true for about 12,000 tags on Stack Overflow.
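The tweak step can be sketched as follows. This is a deliberate simplification: it uses a raw signed co-occurrence strength table rather than the PCA-variant actually described, and every name and number in it is hypothetical.

```python
def adjust_with_cooccurrence(baseline, activity, cooc, rate=0.1):
    """Nudge each tag's baseline prediction up or down based on the
    user's activity in co-occurring tags. cooc[(a, b)] is a signed
    strength: positive means answering in b suggests more future
    answering in a."""
    adjusted = {}
    for tag, base in baseline.items():
        tweak = sum(cooc.get((tag, other), 0.0) * level
                    for other, level in activity.items()
                    if other != tag)
        adjusted[tag] = max(0.0, base + rate * tweak)  # clamp at zero
    return adjusted

baseline = {"jquery": 0.10, "css": 0.05}
activity = {"javascript": 1.0}          # heavy javascript answerer
cooc = {("jquery", "javascript"): 0.5}  # javascript implies jquery
print(adjust_with_cooccurrence(baseline, activity, cooc))
# jquery nudged up to ~0.15; css untouched
```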
Trust Your Instincts
Using the Force is optional.
One pretty painful lesson I learned doing all this is: don’t put your faith in standard machine learning. It’s very easy to get the impression online (or in survey courses) that rubbing a neural net or a decision forest against your data is guaranteed to produce improvements. Perhaps this is true if you’ve done nothing “by hand” to attack the problem, or if your problem is perfectly suited to off-the-shelf algorithms, but what I found over and over again is that the truthiness of my gut (and that of my co-workers) beats the one-size-fits-all solutions. You know rather a lot about your domain; it makes sense to exploit that expertise.
However you also have to realize your instincts aren’t perfect, and be willing to have the data invalidate your gut. As an example, I spent about a week trying to find a way to roll title words into the predictor to no avail. TF-IDF, naive co-occurrence, some neural network approaches, and even our home grown tag suggester never quite did well enough; titles were just too noisy with the tools at my disposal.
Get to testing live as fast as you possibly can; you can’t have any real confidence in your model until it’s actually running against live data. By necessity much evaluation has to be done offline, especially if you’ve got a whole bunch of gut checks to make, but once you think you’ve got a winner, start testing. The biggest gotcha revealed when my predictor went live was that the way I selected training data made for a really bad predictor for low activity users, effectively shifting everything to the right. I solved this by training two separate predictors (one for low activity, and one for high).
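The two-predictor fix amounts to routing by activity band. A minimal sketch, assuming the 10-answer cutoff mentioned later in the comments (the models here are stand-in functions, not the real predictors):

```python
LOW_ACTIVITY_CUTOFF = 10  # answers; the dividing line we landed on

def predict(answer_count, low_model, high_model, features):
    """Route a user to the predictor trained for their activity band."""
    model = low_model if answer_count < LOW_ACTIVITY_CUTOFF else high_model
    return model(features)

# Stand-in models, purely for illustration.
low = lambda f: "lean on site average"
high = lambda f: "lean on personal history"
print(predict(3, low, high, {}))   # lean on site average
print(predict(42, low, high, {}))  # lean on personal history
```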
Finally, as always solving the hard part is 90% of the work, solving the easy part is also 90% of the work. If you’re coming at a problem indirectly like we were, looking to increase answer rates by improving tag predictions, don’t have a ton of faith in your assumptions about the ease of integration. It turned out that simply replacing observed history with a better prediction in our homepage algorithm broke some of the magic, and it took about twenty attempts to realize gains in spite of the predictor doing what we’d intended. The winning approach was considering how unusual a user is when compared to their peers, rather than considering them in isolation.
Again, want to see what we think you’ll be active in? Hit /users/tag-future/current on your Stack Exchange of choice.
I wonder how well the “sort into groups, see how you vary from others in your group” approach would work for something like the Amazon or Netflix recommendation engines, and conversely, whether they do anything you could learn from?
I suspect Netflix is doing something very similar already, Amazon appears to get by just watching what you’ve looked at. Both already seem to do well enough, at least for me.
One thing I’m wary of incorporating (that both Netflix and Amazon seem to use) is click data. I don’t think clicking something quite indicates intent to answer on Stack Exchanges, since a lot of users switch gears between consuming and producing answers. There were a few attempts at a homepage tweak that increased click rate but decreased answer rate that I considered failures, and I suspect we’d find that to be a common failure case with traffic-y data.
The problem I see with this is that it could be self-fulfilling. By steering users towards questions you expect them to answer, they become more likely to answer those but also more likely to miss unpredicted questions they would have answered. If that acts to overly narrow a user’s field, then there is a real danger that fringe questions that might otherwise have been answered will be missed.
This is a concern we had back with the original interesting tab changes. And really it’s a meta concern around any recommendation system.
Our fix is twofold.
First, we aren’t only showing you what we think you’re likely to answer; we’re using it as input to an algorithm that cares a lot about “answered-ness”, activity, relative scores, and so on. The very nature of the algorithm makes it so that even if we think a lot of people prefer [tag-a], lots of them will get [tag-b] due to the ensuing dearth of unanswered [tag-a] questions.
Second, we throw a fair number of random questions in there. Conceptually we think everyone can learn everything if they want, so just because you’ve never answered a [c#] question, or one in a related tag, doesn’t mean you won’t ever want to.
We watched for weird changes in behavior or answering after the interesting tab was first deployed and found nothing concerning. While this change hasn’t been live as long I’d be surprised if it fails in a new way.
OTOH, when you are showing users irrelevant questions, you are wasting their time. By wasting their time, you might encourage them to never use the site again, in which case the world will have lost out on their knowledge.
I’ll check this out but in all honesty since day 1 I have *specifically* avoided the homepage and *intentionally* bookmarked the Questions page.
I’ve only ever wanted to come in and see the most “recent” questions… and because I have my “favorite tags” I can instantly identify exactly which questions I might be interested in.
I think you’ll find most programmers “hate” having a system “guess” what they want. It’s why we hate using MS Word… because it never seems to guess correctly what we want. 😉
For the people who care to dig into a subject area specifically, we don’t want any of this preference-guessing stuff; that’d be silly. After all, if you’re willing to tell us you want [.net] questions, we’re more than happy to give them to you.
The interesting tab, though, is aimed at people who don’t want to indicate a preference explicitly. Stack Overflow had more than a hundred questions updated in the last 10 minutes; shoving all of that at someone blindly is overwhelming, so we have to do *something*. We make the firehose available if you want it, but it really is a demonstrably poor choice for the default user experience.
You’ll notice that although we have this predictor for a lot of them, we don’t use the interesting tab on the other Stack Exchange sites. For them it’s still feasible to show someone everything that’s changed recently, and we think that’s better than a guessing algorithm.
Very well written Kevin, as always!
Now, as someone who has worked on machine learning in the past, I wonder… how do you really test your model/algorithm combination?
I completely agree with you when you say “you can’t have any real confidence in your model until it’s actually running against live data”, but how do you know your model is working, how do you test it?
With training data you have techniques (divide into train/test sets, assess accuracy, rinse, repeat); but with real, live data? How did you discover, for example, that you needed two separate predictors?
We did a bunch of split tests to confirm the algorithm worked.
Once I was sure the thing worked against offline data, I rolled it into the interesting tab. Whenever a test failed (and a lot of them did), I went looking at what was posted and by whom, and tried to guess what went wrong. Then I’d tweak the algorithm and roll out a new split test.
For the two predictor problem in particular, I ran a test that came up a wash (no better and no worse). When looking at the answers posted, I noticed that users with about 10 answers or so under their belt tended to agree with the predictor. The next test was then making predictions only for users with >= 10 answers, which worked. I eventually came back around to users with < 10 answers, and through another couple of split tests found that a separately trained predictor for them worked well too.
Sounds like a lot of work..
Here is an idea: why not offload some of the “test” onto the users?
For example: in the page you linked, put up/down votes (+/-, or whatever) on “likely” and “unlikely” tags that your predictor extracted?
Another couple of questions/comments: when was this algorithm deployed?
Is the “interesting” page compiled considering *only* this algorithm, or are there other criteria to mix in other questions?
I ask because I have noticed some strange questions in my interesting tab lately… and I wondered why! Now I know something has changed, but … something still looks strange 🙂
Depending on your activity class (which depends on the # of answers you’ve posted, currently dividing at 10), somewhere between a month and a half and two weeks ago.
The first few people would have started seeing the high activity predictor around the 12th of April.
That was around the time I started noticing some “older” questions in my interesting tab. That is, questions that have a recent edit to an answer, but are > 1 year old and have an accepted answer. Since these don’t fall into the category of “questions I am likely to answer”, I wondered if there are other criteria for going into the “interesting” tab, like “this could be interesting for you to read” (which would also be nice).
The “boost” (it’s really more of a ceiling) for recent activity works the same between the old and new interesting tabs. I suspect you’re just seeing a bunch of old questions getting edited, it happens.
I have spent some time trying to find improvements to the parts of the interesting tab algorithm that don’t have to do with tags, but haven’t found anything that works yet. Seems to be a trickier problem, though I wouldn’t have guessed that.
On a purely curious note… what is the threshold to be a “High Activity” user? I am one… I’m just curious how much I can slack off before being denied membership to the “HA” club 😉
I also checked my tags out… and yeah, they’re pretty accurate. The only thing it didn’t catch… is what it doesn’t know. Which is that if I have a lot of time and a bunch of XSL(T) questions pop up… I’ll likely answer those.
High activity is lower than you’d expect, given the long tail of users. Currently it’s 10 answers.
A couple of friends and I did something similar using Stack Overflow data as a project for one of the courses in the MSc we’re all doing. We used LDA to get topics of questions and then combined that with previous user scores to rank users, given a question, by how likely they were to give a good answer to it. We got decent results; if you’re interested we can send you a copy of the report. This is indeed a fun problem to work on.
Sure, sounds like an interesting read.
If you send it to firstname.lastname@example.org it’ll get to me and the other interested folks on the team.