History Of The Stack Exchange API, Version 1.0

When I was hired by Stack Exchange, it was openly acknowledged that it was in large part because of my enthusiasm for building an API.  We’ve since gone on to produce an initial 1.0 release, a minor 1.1 update (documented here), and are planning for a 2.0 this calendar year.

If you haven't read this book (or the blog it's from), why are you reading this?

Raymond Chen's blog is a great source for Windows history. Equivalents for most other topics are sorely lacking.

What I’d like to talk about isn’t 2.0 (though we are planning to publish a spec for comment in the not-too-distant future), but the thought process behind the 1.0 release.  I always find the history behind such projects fascinating, so I’d like to get some of the Stack Exchange API’s history out there.

We weren’t shooting for the moon; we constrained ourselves.

A big one was that 1.0 had to be read-only.

Pragmatically, we didn’t have the resources to devote to the mounds of refactoring that would be required to get our ask and edit code paths up to snuff.  There are also all sorts of rejection cases to handle (at the time we had bans, too many posts in a certain timeframe, and “are you human” captcha checks), which we’d have to expose, and the mechanism would have to be sufficiently flexible to handle new rejection cases gracefully (and we’ve added some in the 1.0 -> 2.0 interim, validating this concern).  There’s also the difficulty in rendering Markdown (with our Stack Exchange specific extensions, plus Prettify, MathJax, jTab, and who knows what else in the future), which needs to be solved if applications built on the Stack Exchange API are to be able to mimic our preview pane.

Philosophically, write is incredibly dangerous.  Not just in the buggy-authentication, logged-in-as-Jeff-Atwood, mass-content-deleting sense, though that will keep me up at night.  More significantly (and insidiously), in the lowered-friction, less-guidance, more-likely-to-post-garbage sense.

Then there are quality checks, duplicate checks, history checks...

Similar titles, similar questions, live preview, tag tips, and a markdown helper. This is just the guidance we give a poster *before* they submit.

We do an awful lot to keep the quality of content on the Stack Exchange network very high (to the point where we shut down whole sites that don’t meet our standards).  A poorly thought out write API is a great way to screw it all up, so we pushed it out of the 1.0 time-frame.  It looks like we’ll be revisiting it in 3.0, for the record.

We also wanted to eliminate the need to scrape our sites.

This may seem a bit odd as a constraint, but there was only so much development time available, and a lot of it needed to be dedicated to this one goal.  The influence of this is easy to see: there’s an equivalent API method for nearly every top-level route on a Stack Exchange site (/users, /badges, /questions, /tags, and so on).

Historically we had tolerated a certain amount of scraping, in recognition that there were valid reasons to get up-to-date data out of a Stack Exchange site, and that providing it is in the spirit of the cc-wiki license that covers all of our user-contributed content.  However, scraping is hideously inefficient on both the consuming and producing sides, with time wasted rendering HTML, serving scripts, including unnecessary data, and then stripping all that garbage back out.  It’s also very hard to optimize a site for both programs and users; the access patterns are all different.  By moving scraping off of the main sites and onto an API, we were able to get a lot more aggressive about protecting the user experience by blocking bots that negatively affect it.

Of course, we were willing to try out some neat ideas.

So named for the "vector processors" popularized in the early days of the industry (as in the CM-1 pictured above). More commonly called SIMD today.

Vectorized requests are probably the most distinctive part of our API.  In a nutshell, almost everywhere we accept an id we’ll accept up to 100 of them.

/users/80572;22656;1/

Fetches user records for myself, Jon Skeet, and Jeff Atwood all in one go.

This makes polling for changes nice and easy within a set of questions, users, users’ questions, users’ answers, and so on.  It also makes it faster to fetch lots of data, since you’re only paying for one round trip per 100 resources.

I’m not contending that this is a novel feature; Twitter’s API does something similar for user lookup.  We do go quite a bit further, making it a fundamental part of our API.
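
To make the idea concrete, here’s a minimal sketch of a vectorized lookup in Python.  The semicolon-joined id list and the 100-id cap come from the description above; the base URL and the omission of an API key are illustrative assumptions, not documented specifics.

import gzip
import json
import urllib.request

# Assumed 1.x-era endpoint; adjust for whichever API version and site you target.
API_BASE = "https://api.stackoverflow.com/1.1"

def fetch_users(user_ids):
    """Fetch up to 100 user records in a single round trip."""
    if len(user_ids) > 100:
        raise ValueError("vectorized requests accept at most 100 ids")
    ids = ";".join(str(i) for i in user_ids)  # e.g. "80572;22656;1"
    with urllib.request.urlopen(f"{API_BASE}/users/{ids}") as resp:
        body = resp.read()
        # Responses come back GZIP'd whether or not you asked for it.
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)

# One round trip instead of three:
# fetch_users([80572, 22656, 1])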

Representing compression visually is difficult.

We also forced all responses to be GZIP’d.  The rationale for this has been discussed a bit before, but I’ll reiterate.

Not GZIP’ing responses is a huge waste for all parties.  We waste bandwidth sending responses, and the consumer wastes time waiting for the pointlessly larger responses (especially painful on mobile devices).  And it’s not like GZIP is some exotic new technology; no matter what stack someone is on, they have access to a GZIP library.

This is one of those things in the world that I’d fix if I had a time machine.  There was very little reason not to require that all content be GZIP’d under HTTP, even way back in the ’90s.  Bandwidth has almost always been much more expensive than CPU time.

Initially we tried rejecting all requests without the appropriate Accept-Encoding header, but eventually resorted to always responding with GZIP’d responses, regardless of what the client nominally accepts.  This has to do with some proxies stripping out the Accept-Encoding header, for a variety of (generally terrible) reasons.
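
Server-side, that decision boils down to compressing every response and never consulting Accept-Encoding at all.  Here’s a rough sketch of that policy as WSGI middleware; the WSGI wiring is my own illustration of the idea, not how the actual API is implemented.

import gzip

class ForceGzip:
    """Compress every response body, ignoring Accept-Encoding entirely."""
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            captured["status"], captured["headers"] = status, headers

        body = b"".join(self.app(environ, capture))
        compressed = gzip.compress(body)

        # Drop any length/encoding headers the inner app set, then add our own.
        headers = [(k, v) for k, v in captured["headers"]
                   if k.lower() not in ("content-length", "content-encoding")]
        headers += [("Content-Encoding", "gzip"),
                    ("Content-Length", str(len(compressed)))]
        # Accept-Encoding is deliberately never checked: proxies strip it.
        start_response(captured["status"], headers)
        return [compressed]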

I’m unaware of any other API that goes whole hog and requires that clients accept compressed responses.  Salesforce.com’s API at least encourages it.

Not nearly as complex as SQL can get, but hopefully complex enough for real work.

Finally, we emphasize sorting and filtering to enable complex queries.  Most endpoints accept sort, min, max, fromdate, and todate parameters for crafting them.

For example, getting a quick count of how many of my comments have ever been upvoted on Stack Overflow (38, at time of writing):

/users/80572/comments?sort=votes&min=1&pagesize=0

or all the positively voted Meta Stack Overflow answers the Stack Exchange dev team made in July 2011 (all 5995 of them):

/users/130213;...;91687/answers?sort=votes&min=1&fromdate=1309478400&todate=1312156799

We eventually settled on one configurable sort (which varies by method) and an always-present “creation filter” as adequately expressive.  Basically, it’s sufficiently constrained that we don’t have to worry (well… not too much, anyway) about crippling our databases with absurdly expensive queries, while still being conveniently powerful in a lot of cases.
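
As a worked example, the upvoted-comment count from above could be fetched like this.  The parameter names (sort, min, pagesize) come straight from the example URLs; the base URL and the “total” field in the response wrapper are assumptions for the sake of a runnable sketch.

import gzip
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.stackoverflow.com/1.1"  # assumed 1.x-era endpoint

def count_upvoted_comments(user_id):
    """Count a user's comments with at least one upvote, without fetching any of them."""
    params = urllib.parse.urlencode({
        "sort": "votes",  # the one configurable sort for this method
        "min": 1,         # sort filter: at least one vote
        "pagesize": 0,    # no items, just the wrapper
    })
    with urllib.request.urlopen(f"{API_BASE}/users/{user_id}/comments?{params}") as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    # Assumes the wrapper reports the match count in a "total" field.
    return json.loads(body)["total"]

# count_upvoted_comments(80572)  -> 38 at the time this was written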

This isn’t to suggest that our API is perfect.

I’ve got a whole series of articles in me about all the mistakes that were made.  Plus there’s 1.1 and the upcoming 2.0 to discuss, both of which aim to address the shortcomings in our initial offering.  I plan to address these in the future, as time allows.


Your Email Is (Practically) Your Identity

I am exactly 7964 awesomes on the internet.

I'm approaching this from a technical or practical perspective, more so than a personal one. Karma, reciprocity, reputation, and the like are not pertinent to this discussion.

There’s a lot of confusion about what identity, on the internet, is.  I contend that, for all practical purposes, your online identity is your email address.

Let’s look at some other (supposed) identification methods:

  • Username – whatever the user feels like typing in
  • OpenID – a guaranteed unique URL
  • OAuth – some guaranteed unique token in the context of a service provider

What sets an email address apart from these other methods is that it’s a method of contacting an individual.  In fact, it’s a practically universal method of contacting someone on the internet.

Consider: regardless of the mechanism you use to authenticate users, one of them returns to your site and wants to log in… but can’t remember their credentials.  This is not a trick question; obviously you have them enter their email address and then send them something they can use to recover their login information (a password reset link, their OpenID, their OAuth service provider, etc.).  Regardless of the login mechanism, the lack of an associated email address will result in the loss of the account.
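
A minimal sketch of that recovery path, with a hypothetical account store and mailer, just to underline that every branch ends in “send something to their email address”:

import secrets

def recover_account(email, accounts, send_mail):
    """accounts: hypothetical mapping of email -> account record."""
    account = accounts.get(email.strip().lower())
    if account is None:
        return  # don't reveal whether the address is known

    kind = account["credential_type"]
    if kind == "password":
        token = secrets.token_urlsafe(32)
        account["reset_token"] = token
        send_mail(email, f"Reset your password: https://example.com/reset?token={token}")
    elif kind == "openid":
        send_mail(email, f"You log in with this OpenID: {account['openid_url']}")
    else:  # OAuth and friends
        send_mail(email, f"You log in via {account['provider']}.")
    # No email on file means nothing to send -- and a lost account.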

"butter^" is not a valid password for StackID.

~7% of StackID users have forgotten their passwords at some point. The same ratio holds (with OpenID instead of a password) on Stack Overflow.

I find myself considering OpenID, OAuth, username and password combinations, and so on as “credentials” rather than “identities” conceptually.

Pontificating is all well and good, but how has this actually affected anything?

One of the first things I worked on at Stack Exchange (so long ago that the company was still Stack Overflow Internet Services, and the Stack Exchange product had a 1.0 in front of its name that it didn’t know about) was pulling in user emails as part of authenticating an OpenID.  This solved two problems.  One was that users would accidentally create accounts using different credentials; a common trusted email let us avoid creating these accounts (this recently came up on Meta.StackOverflow).  The second was that associations between sites couldn’t be automated, since Google generates a unique OpenID string for each domain a user authenticates to; finding related accounts based on email neatly worked around this wrinkle in Google’s OpenID implementation.

These columns keep me up at night.

Adding those last two columns. Given a time machine we'd require them, but they're optional at time of writing.

Some of this predicament is peculiar to the OpenID ecosystem, but the same basic problem, in both scenarios, is possible even with a bog-standard username/password system.  If you have disjoint user tables (as Stack Exchange’s are, for historical reasons), you can’t just correlate on username (or even username and password hash); you need to verify that the same person controls both accounts.  Really, all you can do is contact both accounts and see if they point to the same person, and the mechanism for that is (once again) email.

In a nutshell, if you’ve got more than one kind of credential in your system, say username/password and Facebook Connect, then the only way you’re going to figure out whether the same user has multiple credentials is by correlating email addresses.  That Stack Exchange needs this internally is a historical accident, but given the popularity of “Login with Facebook” buttons I have to imagine it comes up elsewhere (perhaps others have resigned themselves to duplicate accounts, or to a single external point of failure for user login).
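
A sketch of that correlation: given disjoint credential tables (password accounts, Facebook Connect accounts, OpenID accounts, and so on), the only common key is a verified email address.  The table shapes and field names here are hypothetical.

from collections import defaultdict

def correlate_by_email(*credential_tables):
    """Group account records from separate credential systems by verified email."""
    by_email = defaultdict(list)
    for table in credential_tables:
        for account in table:
            email = account.get("verified_email")
            if email:  # only trust addresses that have actually been confirmed
                by_email[email.strip().lower()].append(account)
    # Any email with more than one record is the same person holding
    # multiple kinds of credentials.
    return {email: accs for email, accs in by_email.items() if len(accs) > 1}

# correlate_by_email(password_accounts, facebook_accounts, openid_accounts)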

These observations about email are why StackID, Stack Exchange’s own OpenID provider, requires (and confirms) email addresses as part of account creation.  We also always share that email address, provided that the relying party asks for it via Simple Registration or Attribute Exchange.
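
For the relying-party side, asking for the email looks roughly like this.  The extension parameters are the standard Simple Registration and Attribute Exchange fields; the surrounding request construction (function name, realm handling) is heavily simplified and purely illustrative.

from urllib.parse import urlencode

IDENTIFIER_SELECT = "http://specs.openid.net/auth/2.0/identifier_select"

def auth_params(return_to, claimed_id=IDENTIFIER_SELECT):
    """Query-string parameters for an OpenID 2.0 checkid_setup request asking for email."""
    return urlencode({
        "openid.ns": "http://specs.openid.net/auth/2.0",
        "openid.mode": "checkid_setup",
        "openid.claimed_id": claimed_id,
        "openid.identity": claimed_id,
        "openid.return_to": return_to,
        "openid.realm": return_to,  # normally your site root; simplified here
        # Simple Registration: ask for the email
        "openid.ns.sreg": "http://openid.net/extensions/sreg/1.1",
        "openid.sreg.required": "email",
        # Attribute Exchange: same request, different extension
        "openid.ns.ax": "http://openid.net/srv/ax/1.0",
        "openid.ax.mode": "fetch_request",
        "openid.ax.type.email": "http://axschema.org/contact/email",
        "openid.ax.required": "email",
    })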

Such names started with the gentry, and spread slowly to the rest of society.

In the English-speaking world, names distinct enough for identification outside of a small area really got started with the Domesday Book, compiled in 1086 CE.

One counterargument I’ve encountered to this position is that changing your email shouldn’t effectively change your identity.  But the real-life equivalent of changing your email address (changing your street address, phone number, legal name, and so on) is pretty disruptive; why would the internet version be trivial?  If nothing else, almost all of your accounts are already relying on your email address for recovery anyway.

I suspect what makes Method of Contact = Email = Identity non-obvious is the tendency of people to assume identity is much simpler than it really is, coupled with the relative youth (and accompanying instability) of the internet.  Anecdotally, while I certainly have changed my email address in the past, I’ve been using my current email address for almost as long as I’ve carried a driver’s license (which is good enough ID for most purposes in the United States).