History Of The Stack Exchange API, Mistakes

In an earlier post, I wrote about some of the philosophy and “cool bits” in the 1.0 release of the Stack Exchange API.  That’s all well and good, but of course I’m going to tout the good parts of our API; I wrote a lot of it, after all.  More interesting are the things that have turned out to be mistakes; we learn more from failure than from success.

Returning Total By Default

Practically every method in the API returns a count of the elements the query would return if not constrained by paging.

For instance, all questions on Stack Overflow:

{
  "total": 1936398,
  "page": 1,
  "pagesize": 30,
  "questions": [...]
}

Total is useful for rendering paging controls and for count(*)-style queries (how many of my comments have been up-voted, and so on), so it’s not that the total field itself was a mistake.  But returning it by default definitely was.

The trick is that while total can be useful, it’s not always useful.  Quite frequently queries take the form of “give me the most recent N questions/answers/users who X”, or “give me the top N questions/answers owned by U ordered by S”.  Neither of these common queries cares about total, but both pay the cost of fetching it on every request.

For simple queries (like the /1.0/questions call above), at least as much time is spent fetching total as is spent fetching the actual data.
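For the paging-controls case above, using total looks something like this minimal sketch (the helper is hypothetical; Python is used purely for illustration):

import math

def page_numbers(response, window=2):
    # Assumes the 1.0-style envelope shown above, with "total", "page",
    # and "pagesize" fields; the helper itself is hypothetical.
    last_page = math.ceil(response["total"] / response["pagesize"])
    current = response["page"]
    # Current page plus a few neighbors, clamped to the valid range.
    return [p for p in range(current - window, current + window + 1)
            if 1 <= p <= last_page]

# page_numbers({"total": 1936398, "page": 1, "pagesize": 30, "questions": []})
# -> [1, 2, 3]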

“Implicit” Types

Each method in the Stack Exchange API returns a homogeneous set of results, wrapped in a metadata object.  You get collections of questions, answers, comments, users, badges, and so on back.

The mistake is that although the form of the response is conceptually consistent, the key under which the actual data is returned is based on the type.  Examples help illustrate this.

/1.0/questions returns:

{
 "total": 1947127,
 ...
 "questions": [...]
}

/1.0/users returns:

{
 "total": 507795,
 ...
 "users": [...]
}

This makes it something of a pain to write wrappers around our API in statically typed languages.  A much better design would have been a consistent `items` field with an additional `type` field.

How /1.0/questions should have looked:

{
 "total": 1947127,
 "type": "question",
 ...
  "items": [...]
}

This mistake became apparent as more API wrappers were written.  Stacky, for example, has a number of otherwise pointless classes (the “Responses” classes) just to deal with this.
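To make the pain concrete, here’s a rough sketch of the difference (illustrative only, not Stacky’s actual code); with a consistent `items` key, a single generic envelope would have covered every method:

from dataclasses import dataclass
from typing import Generic, List, TypeVar

T = TypeVar("T")

# What 1.0 forces on wrapper authors: one envelope class per response key.
@dataclass
class QuestionsResponse:
    total: int
    questions: List[dict]

@dataclass
class UsersResponse:
    total: int
    users: List[dict]

# What a consistent "items" + "type" design would have allowed instead:
# one envelope, parameterized by the item type.
@dataclass
class Response(Generic[T]):
    total: int
    type: str
    items: List[T]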

It should be obvious what's dangerous, and most things shouldn't be.

Inconsistent HTML “Safety”

This one only affects web apps using our API, but it can be a real doozy when it does.  Essentially, not all text returned from our API is safe to embed directly into HTML.

This is complicated a bit by many of our fields having legitimate HTML in them, which means consumers can’t just HTML-encode everything.  Question bodies, for example, almost always have a great deal of HTML in them.

This led to the situation where question bodies are safe to embed directly but question titles are not; user “about me” sections are, but display names aren’t; and so on.  Ideally, everything would be safe to embed directly except in certain rare circumstances.

This mistake is a consequence of how we store the underlying data.  It just so happens that we encode question titles and user display names “just in time”, while question bodies and user “about me” sections are stored pre-rendered.
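In practice, consumers embedding API data in a page end up writing something like this (a sketch; the pre-rendered versus just-in-time split follows the examples above, and the field names and helper are illustrative):

import html

# Fields the 1.0 API returns pre-rendered as HTML (safe to embed as-is),
# as opposed to fields encoded "just in time" that the consumer must
# escape.  Illustrative, not an exhaustive list.
PRE_RENDERED = {"body", "about_me"}

def embeddable(field_name, value):
    # Return HTML-safe text for a field pulled out of a 1.0 response.
    if field_name in PRE_RENDERED:
        return value            # already HTML; escaping would mangle it
    return html.escape(value)   # e.g. question titles, display names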

A Focus On Registered Users

There are two distinct mistakes here.  First, we have no way of returning non-existent users.  This question, for instance, has no owner.  In the API, we return no user object even though we clearly know at least the display name of the user.  This comes from 1.0 assuming that every user will have an id, which is a flawed assumption.

Second, the /1.0/users route only returns registered users.  Unregistered users can be found via their ids, or via some other resource (their questions, comments, etc.).  This is basically a bug that no one noticed until it was too late, and got frozen into 1.0.

I suppose the lesson to take from these two mistakes is that your beta audience (in our case, registered users) and popular queries (which for us are all around questions and answers) have a very large impact on how much “polish” the different pieces of an API get.  It’s a corollary to Linus’ Law worth being aware of: the eyeballs are not uniformly distributed.

Things not copied from Twitter: API uptime.

Wasteful Request Quotas

Our request quota system is a lift from Twitter’s API for the most part, since we figured it was better to ~~steal~~ borrow from an existing widely used API than risk inventing a worse system.

To quickly summarize, we issue every IP using the API a quota (that can be raised by using an app key) and return the remaining and total quotas in the X-RateLimit-Current and X-RateLimit-Max headers.  These quotas reset 24 hours after they are initially set.

This turns out to be pretty wasteful in terms of bandwidth as, unlike Twitter, our quotas are quite generous (10,000 requests a day) and not dynamic.  As with the total field, many applications don’t really care about the quota (until they exceed it, which is rare) but they pay to fetch it on every request.

Quotas are also the only bit of metadata we place in response headers, making them very easy for developers to miss (since no one reads documentation, they just start poking at APIs).  They also aren’t compressed, due to the nature of headers, which goes against our “always compress responses” design decision.
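For consumers that do care, pulling the quota out of the headers looks something like this sketch (Python standard library; the endpoint URL is illustrative, the header names are the ones above):

import urllib.request

req = urllib.request.Request(
    "https://api.stackoverflow.com/1.0/questions",   # illustrative endpoint
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    remaining = resp.headers.get("X-RateLimit-Current")
    maximum = resp.headers.get("X-RateLimit-Max")
    print(f"quota remaining: {remaining} of {maximum}")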

The Good News

Is that all of these, along with some other less interesting mistakes, are slated to be fixed in 2.0.  We couldn’t address them in 1.1, as we were committed to not breaking backwards compatibility in a point-release (there were also serious time constraints).


History Of The Stack Exchange API, Version 1.0

When I was hired by Stack Exchange, it was openly acknowledged that it was in large part because of my enthusiasm for building an API.  We’ve since gone on to produce an initial 1.0 release, a minor 1.1 update (documented here), and are planning for a 2.0 this calendar year.

If you haven't read Raymond Chen's book (or the blog it's from), why are you reading this?

Raymond Chen's blog is a great source for Windows history. Equivalents for most other topics are sorely lacking.

What I’d like to talk about isn’t 2.0 (though we are planning to publish a spec for comment in the not-too-distant future), but the thought process behind the 1.0 release.  I always find the history behind such projects fascinating, so I’d like to get some of the Stack Exchange API’s history out there.

We weren’t shooting for the moon; we constrained ourselves.

A big one was that 1.0 had to be read-only.

Pragmatically, we didn’t have the resources to devote to the mounds of refactoring that would be required to get our ask and edit code paths up to snuff.  There are also all sorts of rejection cases to handle (at the time we had bans, too many posts in a certain timeframe, and “are you human” captcha checks), which we’d have to expose, and the mechanism would have to be sufficiently flexible to handle new rejection cases gracefully (and we’ve added some in the 1.0 -> 2.0 interim, validating this concern).  There’s also the difficulty in rendering Markdown (with our Stack Exchange specific extensions, plus Prettify, MathJax, jTab, and who knows what else in the future), which needs to be solved if applications built on the Stack Exchange API are to be able to mimic our preview pane.

Philosophically, write is incredibly dangerous.  Not just in the buggy-authentication, logged in as Jeff Atwood, mass content deleting sense; though that will keep me up at night.  More significantly (and insidiously) in the lowered friction, less guidance, more likely to post garbage sense.

Then there are quality checks, duplicate checks, history checks...

Similar titles, similar questions, live preview, tag tips, and a markdown helper. This is just the guidance we give a poster *before* they submit.

We do an awful lot to keep the quality of content on the Stack Exchange network very high (to the point where we shut down whole sites that don’t meet our standards).  A poorly thought out write API is a great way to screw it all up, so we pushed it out of the 1.0 time-frame.  It looks like we’ll be revisiting it in 3.0, for the record.

We also wanted to eliminate the need to scrape our sites.

This may seem a bit odd as a constraint, but there was only so much development time available and a lot of it needed to be dedicated to this one goal.  The influence of this is easy to see: there’s an equivalent API method for nearly every top-level route on a Stack Exchange site (/users, /badges, /questions, /tags, and so on).

Historically we had tolerated a certain amount of scraping, in recognition that there were valid reasons to get up-to-date data out of a Stack Exchange site and that providing it is in the spirit of the cc-wiki license that covers all of our user-contributed content.  However, scraping is hideously inefficient on both the consuming and producing sides, with time wasted rendering HTML, serving scripts, including unnecessary data, and then stripping all that garbage back out.  It’s also very hard to optimize a site for both programs and users; the access patterns are all different.  By moving scraping off of the main sites and onto an API, we were able to get a lot more aggressive about protecting the user experience by blocking bots that negatively affect it.

Of course, we were willing to try out some neat ideas.

So named for the "vector processors" popularized in the early days of the industry (like the CM-1). More commonly called SIMD today.

Vectorized requests are probably the most distinctive part of our API.  In a nutshell, almost everywhere we accept an id we’ll accept up to 100 of them.

/users/80572;22656;1/

Fetches user records for myself, Jon Skeet, and Jeff Atwood all in one go.

This makes polling for changes nice and easy within a set of questions, users, users’ questions, users’ answers, and so on.  It also makes it faster to fetch lots of data, since you only pay for one round trip per 100 resources.

I’m not contending that this is a novel feature; Twitter’s API does something similar for user lookup.  We do go quite a bit further, making it a fundamental part of our API.
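Batching more than 100 ids falls out naturally; a sketch (the helper is hypothetical, while the 100-id cap and the semicolon separator are the API’s):

def vectorized_paths(route, ids, batch_size=100):
    # Split ids into API-sized batches and build vectorized request paths.
    # `route` is something like "/1.0/users".
    for i in range(0, len(ids), batch_size):
        batch = ids[i:i + batch_size]
        yield route + "/" + ";".join(str(x) for x in batch)

# list(vectorized_paths("/1.0/users", [80572, 22656, 1]))
# -> ["/1.0/users/80572;22656;1"]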

Representing compression visually is difficult.

We also forced all responses to be GZIP’d.  The rationale for this has been discussed a bit before, but I’ll reiterate.

Not GZIP’ing responses is a huge waste for all parties.  We waste bandwidth sending responses, and the consumer wastes time waiting for the pointlessly larger responses (especially painful on mobile devices).  And it’s not like GZIP is some exotic new technology; no matter what stack someone is on, they have access to a GZIP library.

This is one of those things in the world that I’d fix if I had a time machine.  There was very little reason to not require all content be GZIP’d under HTTP, even way back in the 90’s.  Bandwidth has almost always been much more expensive than CPU time.

Initially we tried just rejecting all requests without the appropriate Accept-Encoding header, but eventually resorted to always responding with GZIP’d responses, regardless of what the client nominally accepts.  This has to do with some proxies stripping out the Accept-Encoding header, for a variety of (generally terrible) reasons.
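On the consuming side this just means always being ready to gunzip the body; a sketch with Python’s standard library (the URL is illustrative):

import gzip
import json
import urllib.request

req = urllib.request.Request(
    "https://api.stackoverflow.com/1.0/questions",   # illustrative endpoint
    headers={"Accept-Encoding": "gzip"},  # polite, but the body is GZIP'd regardless
)
with urllib.request.urlopen(req) as resp:
    # urllib does not transparently decompress, so gunzip explicitly.
    data = json.loads(gzip.decompress(resp.read()))
print(data["total"])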

I’m unaware of any other API that goes whole hog and requires clients accept compressed responses.  Salesforce.com’s API at least encourages it.

Not nearly as complex as SQL can get, but hopefully complex enough for real work.

Finally, we emphasize sorting and filtering to make complex queries.  Most endpoints accept sort, min, max, fromdate, and todate parameters to craft these queries with.

For example, getting a quick count of how many of my comments have ever been upvoted on Stack Overflow (38, at time of writing):

/users/80572/comments?sort=votes&min=1&pagesize=0

or all the positively voted Meta Stack Overflow answers the Stack Exchange dev team made in July 2011 (all 5995 of them):

/users/130213;...;91687/answers?sort=votes&min=1&fromdate=1309478400&todate=1312156799

We eventually settled on one configurable sort (which varies by method) and an always-present “creation filter” as adequately expressive.  Basically, it’s sufficiently constrained that we don’t have to worry (well… not too much anyway) about crippling our databases with absurdly expensive queries, while still being conveniently powerful in a lot of cases.
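The count(*) trick from the comments example above looks like this in code (a sketch; the sort, min, and pagesize parameters are the API’s, while the helper and base URL are illustrative):

import gzip
import json
import urllib.parse
import urllib.request

def api_total(path, **params):
    # Fetch only the `total` for a query by asking for zero results.
    params["pagesize"] = 0   # no items, just the count
    url = ("https://api.stackoverflow.com/1.0" + path +   # illustrative base URL
           "?" + urllib.parse.urlencode(params))
    with urllib.request.urlopen(url) as resp:
        # Responses are always GZIP'd, as discussed above.
        return json.loads(gzip.decompress(resp.read()))["total"]

# How many of user 80572's comments have ever been up-voted:
# api_total("/users/80572/comments", sort="votes", min=1)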

This isn’t to suggest that our API is perfect.

I’ve got a whole series of articles in me about all the mistakes that were made.  Plus there’s 1.1 and the upcoming 2.0 to discuss, both of which aim to address shortcomings in our initial offering.  I plan to get to these in the future, as time allows.