History Of The Stack Exchange API, Version 1.0

When I was hired by Stack Exchange, it was openly acknowledged that it was in large part because of my enthusiasm for building an API.  We’ve since gone on to produce an initial 1.0 release, a minor 1.1 update (documented here), and are planning for a 2.0 this calendar year.

If you haven't read this book (or the blog it's from), why are you reading this?

Raymond Chen's blog is a great source for Windows history. Equivalents for most other topics are sorely lacking.

What I’d like to talk about isn’t 2.0 (though we are planning to publish a spec for comment in the not-too-distant future), but the thought process behind the 1.0 release.  I always find the history behind such projects fascinating, so I’d like to get some of the Stack Exchange API’s history out there.

We weren’t shooting for the moon; we constrained ourselves.

A big one was that 1.0 had to be read-only.

Pragmatically, we didn’t have the resources to devote to the mounds of refactoring that would be required to get our ask and edit code paths up to snuff.  There are also all sorts of rejection cases to handle (at the time we had bans, too many posts in a certain timeframe, and “are you human” captcha checks), which we’d have to expose, and the mechanism would have to be sufficiently flexible to handle new rejection cases gracefully (and we’ve added some in the 1.0 -> 2.0 interim, validating this concern).  There’s also the difficulty of rendering Markdown (with our Stack Exchange-specific extensions, plus Prettify, MathJax, jTab, and who knows what else in the future), which needs to be solved if applications built on the Stack Exchange API are to mimic our preview pane.

Philosophically, write is incredibly dangerous.  Not just in the buggy-authentication, logged-in-as-Jeff-Atwood, mass-content-deletion sense (though that will keep me up at night), but more significantly (and insidiously) in the lowered-friction, less-guidance, more-likely-to-post-garbage sense.

Then there are quality checks, duplicate checks, history checks...

Similar titles, similar questions, live preview, tag tips, and a markdown helper. This is just the guidance we give a poster *before* they submit.

We do an awful lot to keep the quality of content on the Stack Exchange network very high (to the point where we shut down whole sites that don’t meet our standards).  A poorly thought out write API is a great way to screw it all up, so we pushed it out of the 1.0 time-frame.  It looks like we’ll be revisiting it in 3.0, for the record.

We also wanted to eliminate the need to scrape our sites.

This may seem a bit odd as a constraint, but there was only so much development time available, and a lot of it needed to be dedicated to this one goal.  The influence of this is easy to see: there’s an equivalent API method for nearly every top-level route on a Stack Exchange site (/users, /badges, /questions, /tags, and so on).

Historically we had tolerated a certain amount of scraping, in recognition that there were valid reasons to get up-to-date data out of a Stack Exchange site, and providing it is in the spirit of the cc-wiki license that covers all of our user-contributed content.  However, scraping is hideously inefficient on both the consuming and producing sides, with time wasted rendering HTML, serving scripts, including unnecessary data, and then stripping all that garbage back out.  It’s also very hard to optimize a site for both programs and users; the access patterns are completely different.  By moving scraping off of the main sites and onto an API, we were able to get a lot more aggressive about protecting the user experience by blocking bots that negatively affect it.

Of course, we were willing to try out some neat ideas.

So named for the "vector processors" popularized in the early days of the industry (as in the CM-1 pictured above). More commonly called SIMD today.

Vectorized requests are probably the most distinctive part of our API.  In a nutshell, almost everywhere we accept an id we’ll accept up to 100 of them.

/users/80572;22656;1/

Fetches user records for myself, Jon Skeet, and Jeff Atwood all in one go.

This makes polling for changes nice and easy within a set of questions, users, users’ questions, users’ answers, and so on.  It also makes it faster to fetch lots of data, since you only pay for one round trip per 100 resources.

I’m not contending that this is a novel feature; Twitter’s API does something similar for user lookup.  We do go quite a bit further, making it a fundamental part of our API.
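To make the shape of a vectorized request concrete, here’s a minimal sketch in Python of batching ids into groups of 100 and joining them with semicolons in the route.  It uses the third-party requests library; the base URL and the “users” key in the response body are assumptions based on the 1.0 conventions shown above, not guaranteed details.

```python
import requests  # third-party HTTP client (pip install requests)

# Assumed base URL for illustration; in 1.0 each site exposed its own API host.
API_BASE = "http://api.stackoverflow.com/1.0"

def fetch_users(user_ids):
    """Fetch user records, vectorizing ids into batches of up to 100 per request."""
    users = []
    for i in range(0, len(user_ids), 100):
        batch = user_ids[i:i + 100]
        # Vectorized request: ids are joined with semicolons directly in the route.
        url = f"{API_BASE}/users/{';'.join(str(uid) for uid in batch)}"
        resp = requests.get(url)
        resp.raise_for_status()
        # Assumes the response wraps the records in a "users" array.
        users.extend(resp.json().get("users", []))
    return users

# fetch_users([80572, 22656, 1]) pulls the three user records from the
# example above in a single round trip.
```

The payoff is the loop stride: 250 ids cost three round trips instead of 250.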

Representing compression visually is difficult.

We also forced all responses to be GZIP’d.  The rationale for this has been discussed a bit before, but I’ll reiterate.

Not GZIP’ing responses is a huge waste for all parties.  We waste bandwidth sending responses, and the consumer wastes time waiting for the pointlessly larger responses (especially painful on mobile devices).  And it’s not like GZIP is some exotic new technology; no matter what stack someone is on, they have access to a GZIP library.

This is one of those things in the world that I’d fix if I had a time machine.  There was very little reason not to require that all content be GZIP’d under HTTP, even way back in the ’90s.  Bandwidth has almost always been much more expensive than CPU time.

Initially we tried rejecting all requests without the appropriate Accept-Encoding header, but eventually resorted to always responding with GZIP’d responses, regardless of what the client nominally accepts.  This has to do with some proxies stripping out the Accept-Encoding header, for a variety of (generally terrible) reasons.

I’m unaware of any other API that goes whole hog and requires clients accept compressed responses.  Salesforce.com’s API at least encourages it.
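As a sketch of what this means for a client, here’s roughly what consuming an always-GZIP’d response looks like with nothing but the Python standard library.  The URL is just the example route from earlier, the host is an assumption, and the only behavior relied on is the one described above: the body comes back compressed no matter what the request says.

```python
import gzip
import json
import urllib.request

# Example 1.0 route from earlier in the post; the host is an assumption.
url = "http://api.stackoverflow.com/1.0/users/80572"

req = urllib.request.Request(url)
# Advertise gzip support anyway (well-behaved clients should), but the API
# sends a gzip'd body regardless of what Accept-Encoding says.
req.add_header("Accept-Encoding", "gzip")

with urllib.request.urlopen(req) as resp:
    body = resp.read()

# The body is always compressed, so decompress unconditionally before parsing.
data = json.loads(gzip.decompress(body))
print(data)
```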

Not nearly as complex as SQL can get, but hopefully complex enough for real work.

Finally, we emphasize sorting and filtering to support complex queries.  Most endpoints accept sort, min, max, fromdate, and todate parameters for crafting them.

For example, getting a quick count of how many of my comments have ever been upvoted on Stack Overflow (38, at time of writing):

/users/80572/comments?sort=votes&min=1&pagesize=0

or all the positively voted Meta Stack Overflow answers the Stack Exchange dev team made in July 2011 (all 5995 of them):

/users/130213;...;91687/answers?sort=votes&min=1&fromdate=1309478400&todate=1312156799

We eventually settled on one configurable sort (which varies by method) and an always-present “creation filter” as adequately expressive.  Basically, it’s sufficiently constrained that we don’t have to worry (well… not too much anyway) about crippling our databases with absurdly expensive queries, while still being conveniently powerful in a lot of cases.
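The only fiddly part of the creation filter is that fromdate and todate are Unix timestamps (UTC seconds).  Here’s a small sketch, using only the Python standard library, of reconstructing the July 2011 window from the answers example above; the truncated id list is stood in for by a placeholder rather than guessed at.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# July 2011 as Unix timestamps (UTC seconds), matching the example above.
fromdate = int(datetime(2011, 7, 1, tzinfo=timezone.utc).timestamp())             # 1309478400
todate = int(datetime(2011, 7, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp())  # 1312156799

params = urlencode({
    "sort": "votes",   # the one configurable sort for this method
    "min": 1,          # only positively voted answers
    "fromdate": fromdate,
    "todate": todate,
})

# Placeholder for the elided dev-team id list ("130213;...;91687" above).
dev_team_ids = "130213;91687"
print(f"/users/{dev_team_ids}/answers?{params}")
```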

This isn’t to suggest that our API is perfect.

I’ve got a whole series of articles in me about all the mistakes that were made.  Plus there’s 1.1 and the upcoming 2.0 to discuss, both of which aim to address shortcomings in our initial offering.  I plan to cover these in the future, as time allows.


One Comment on “History Of The Stack Exchange API, Version 1.0”

  1. Thanks for sharing your thoughts, Kevin, I hope to see other API-related posts like this one. Personally I would like to hear the human part of your experience, the feeling of developing with a horde of whining guys (cough, like me) ready to pull the trigger, asking for improbable ad hoc features and so on. It was tough work and you did a great job 🙂 .