Stack Exchange API V2.0: Safety

Every method in version 2.0 of the Stack Exchange API (no longer in beta, but an app contest continues) can return either “safe” or “unsafe” results.  This is a rather odd concept that deserves some explanation.

Semantics

Safe results are those for which every field on every object returned can be inlined into HTML without concern for script injections or other “malicious” content.  Unsafe results are those for which this is not the case; however, that is not to imply that every field in an unsafe return is capable of containing malicious content.

An application indicates whether it wants safe or unsafe results via the filter it passes along with the request.  Since most requests should have some filter passed, it seemed reasonable to bundle safety into them.

To the best of my knowledge, this is novel in an API.  It’s not exactly something that we set out to add, however; there’s a strong historical component to its inclusion.

Rationale

In version 1.0 of the Stack Exchange API every result is what we would call unsafe in version 2.0.  We were simply returning the data stored in our database, without much concern for encoding or escaping.  This led to cases where, for example, question bodies were safe to inline (and in fact, escaping them would be an error) but question titles required encoding lest an application open itself to script injections.  This behavior caused difficulties for a number of third-party developers, and bit us internally a few times as well, which is why I resolved to do something to address it in version 2.0.

I feel that a great strength in any API is consistency, and thus the problem was not that you had to encode data; it was that you didn’t always have to.  We had two options: we could either make all fields “high fidelity” by returning the most original data we had (i.e. exactly what users had entered, before tag balancing, entity encoding, and so on), or we could make all fields “safe” by making sure the same sanitization code ran against them regardless of how they were stored in the database.

Unfortunately, it was not to be.

Strictly speaking, I would have preferred to go with the “high fidelity” option, but another consideration forced the “safe” one.  The vast majority of the content in the Stack Exchange network is submitted in markdown, which isn’t really standardized (and we’ve added a number of our own extensions).  Returning the highest fidelity data would amount to assuming that all of our consumers could render our particular markdown variant.  While we have open sourced most of our markdown implementation, even with that we’d be restricting potential consumers to just those built on .NET.  Thus the “high fidelity” option wasn’t just difficult to pursue, it was effectively impossible.

Given this train of thought, I was originally going to just make all returns “safe”, period.  I realize in hindsight that this would have been a pretty grave mistake (thankfully, some of the developers at Stack Exchange, and some from the community, talked me out of it).  I think the parallel with C#’s unsafe keyword is a good one: sometimes you need dangerous things, and you put a big “I know what I’m doing” signal right there when you do.  This parallel ultimately granted the options their names; properly escaped and inline-able returns are “safe”, and those that are pulled straight out of the database are “unsafe”.

Adding a notion of “safety” allows us to be very consistent in how the data we return should be treated.  It also allows developers to ignore encoding issues if they so desire, at least in the web app case.  Of course, if they’re willing to handle encoding themselves they also have that option.
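To make the distinction concrete, here’s a minimal consumer-side sketch (the question object, the container element, and the escapeHtml helper are all illustrative; they aren’t part of the API or any library we ship):

// Safe return: every field has already been sanitized, so inlining directly is fine
container.innerHTML = '<h2>' + question.title + '</h2>' + question.body;

// Unsafe return: the consumer has to know which fields need escaping
// (in V1.0 terms: titles required encoding, while bodies were already-rendered HTML)
function escapeHtml(text) {
    return text.replace(/&/g, '&amp;')
               .replace(/</g, '&lt;')
               .replace(/>/g, '&gt;')
               .replace(/"/g, '&quot;');
}
container.innerHTML = '<h2>' + escapeHtml(question.title) + '</h2>' + question.body;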


Stack Exchange API V2.0: No Write Access

Version 2.0 of the Stack Exchange API (still in beta, with an app contest) introduced authentication, but unlike most other APIs did not introduce write access at the same time.  Write access is one of the most common feature requests we get for the API, so I feel it’s important to explain that this was very much a conscious decision, not one driven solely by the scarcity of development resources.

Why?

The “secret sauce” of the Stack Exchange network is the quality of the content found on the sites.  When compared to competing platforms you’re much more likely to find quality existing answers to your questions, to find interesting and coherent questions to answer, and to quickly get good answers to new questions you post.  Every change we make or feature we add is considered in terms of either preserving or improving quality, and the Stack Exchange API has been no different.  When viewed through the lens of quality, a few observations can be made about the API.

Screw up authentication, and you screw up write access

Write access presupposes authentication, and any flaw in authentication is going to negatively impact the use of write access accordingly.  Authentication, being a new 2.0 feature, is essentially untested, unverified, and above all untrusted in the current Stack Exchange API.  By pushing authentication out in its own release we’re starting to build some confidence in our implementation, allowing us to be less nervous about write access in the future.

Screw up write access, and you screw up quality

The worst possible outcome of adding write access to the Stack Exchange API is a flood of terrible questions and answers on our sites.  As such, the design of our API will need to actively discourage such an outcome; we can’t get away with a simple “POST /questions/create”.  A number of other APIs can get away with very simple write methods, however, and it’s important to understand how they differ from the Stack Exchange model and why they need not be as concerned (beyond not having nearly our quality to protect in the first place).

The biggest observation is that every Stack Exchange site is, in a sense, a single “well” in danger of being poisoned.  Every Stack Exchange site user sees the same Newest Questions page, the same User page, and (with the exception of Stack Overflow) the same Home page.  Compare with social APIs (e.g. Facebook and Twitter, where everything is sharded around the user) or service APIs (like Twilio, which doesn’t really have common content to show users); in those models there are lots of “wells”, none of which is crucial to protect.

Write access requires more than just writing

It’s easy to forget just how much a Stack Exchange site does to encourage proper question asking behavior in users.

I've circled the parts of the Ask page that don't offer the user guidance

A write API for Stack Exchange will need to make all of this guidance available for developers to integrate into their applications, as well as find ways to incentivize that integration.  We also have several automated quality checks against new posts, and a plethora of other rejection causes, all of which need to be conscientiously reported by the API (without creating oracles for bypassing those checks).

Ultimately, the combination of wanting authentication to be independently battle tested, the need to really focus on getting write access right, and the scarcity of development resources caused by other work also slated for V2.0 led to write access being deferred to a subsequent release.


Stack Exchange API V2.0: JS Auth Library

In a previous article I discussed why we went with OAuth 2.0 for authentication in V2.0 of the Stack Exchange API (beta and contest currently underway), and very shortly after we released a simple javascript library to automate the whole affair (currently also considered “in beta”, report any issues on Stack Apps).  The motivations for creating this are, I feel, non-obvious, as is why it’s built the way it is.

Motivations

I’m a strong believer in simple APIs.  Every time a developer using your API has to struggle with a concept or move outside their comfort zone, your design has failed in some small way.

When you look at the Stack Exchange API V2.0, the standout “weird” thing is authentication.  Every other function in the system is a simple GET (well, there is one POST with /filters/create), has no notion of state, returns JSON, and so on.  OAuth 2.0 requires user redirects, obviously has some notion of state, has different flows, and is passing data around on query strings or in hashes.

It follows that, in pursuit of overall simplicity, it’s worthwhile to focus on simplifying consumers using our authentication flows.  The question then becomes “what can we do to simplify authentication?”, with an eye towards doing as much good as possible with our limited resources.  The rationale for a javascript library is that:

  • web applications are prevalent, popular, and all use javascript
  • we lack expertise in the other (smaller) comparable platforms (Android and iOS, basically)
  • web development makes it very easy to push bug fixes to all consumers (high future bang for buck)
  • other APIs offer hosted javascript libraries (Facebook, Twitter, Twilio, etc.)

Considerations

The first thing that had to be decided was the scope of the library; although the primary driver for the library was the complexity of authentication, that did not necessarily mean it was all the library should offer.  Ultimately, all it does cover is authentication, for reasons of both time and avoidance of a chilling effect.  Essentially, scoping the library to just authentication gave us the biggest bang for our buck while alleviating most fears that we’d discourage the development of competing javascript libraries for our API.  It is, after all, in Stack Exchange’s best interest for there to be a healthy development community around our API.

I also decided that it was absolutely crucial that our library be as small as possible, and quickly served up.  Negatively affecting page load is unacceptable in a javascript library, basically.  In fact, concerns about page load times are why the Stack Exchange sites themselves do not use the Facebook or Twitter provided javascript for their share buttons (and also why there is, at time of writing, no Google Plus share option).  It would be hypocritical to expect other developers to not have the same concerns we do about third-party includes.

Implementation

Warning: lots of code follows.

Since it’s been a while since there’s been any code in this discussion, I’m going to go over the current version (which reports as 453) and explain the interesting bits.  The source is here, though I caution that a great many things in it are implementation details that should not be depended upon.  In particular, consumers should always link to our hosted version of the library (at https://api.stackexchange.com/js/2.0/all.js).

The first three lines sort of set the stage for “small as we can make it”.

window.SE = (function (navigator, document, window, encodeURIComponent, Math, undefined) {
    "use strict";
    var seUrl, clientId, loginUrl, proxyUrl, fetchUserUrl, requestKey, buildNumber = '@@~~BuildNumber~~@@';

I’m passing globals as parameters to the closure defining the interface in those cases where we can expect minification to save space (there’s still some work to be done here, where I will literally be counting bytes for every reference).  We don’t actually pass an undefined to this function, which both saves space and assures nobody’s done anything goofy like giving undefined a value.  I intend to spend some time seeing if similar proofing for all passed terms is possible (document and window are already un-assignable, at least in some browsers). Note that we also declare all of our variables in batches throughout this script, to save bytes from repeating “var” keywords.

Implementation Detail: “@@~~BuildNumber~~@@” is replaced as part of our deploy.  Note that we pass it as a string everywhere, allowing us to change the format of the version string in the future.  Version is provided only for bug reporting purposes; consumers should not depend on its format nor use it in any control flow.

function rand() { return Math.floor(Math.random() * 1000000); }

Probably the most boring part of the entire implementation, this gives us a random number.  Smaller than inlining it everywhere we need one, but not by a lot even after minifying references to Math.  Since we only ever use this to avoid collisions, I’ll probably end up removing it altogether in a future version to save some bytes.

function oldIE() {
    if (navigator.appName === 'Microsoft Internet Explorer') {
        var x = /MSIE ([0-9]{1,}[\.0-9]{0,})/.exec(navigator.userAgent);
        if (x) {
            return x[1] <= 8.0;
        }
    }
    return false;
}

Naturally, there’s some Internet Explorer edge case we have to deal with.  For this version of the library, it’s that IE8 has all the appearances of supporting postMessage but does not actually have a compliant implementation.  This is a fairly terse check for Internet Explorer versions <= 8.0, inspired by the Microsoft recommended version.  I suspect a smaller one could be built, and it’d be nice to remove the usage of navigator if possible.

Implementation Detail:  There is no guarantee that this library will always treat IE 8 or lower differently than other browsers, nor is there a guarantee that it will always use postMessage for communication when able.

Now we get into the SE.init function, the first method that every consumer will need to call.  You’ll notice that we accept parameters as properties on an options object; this is a future proofing consideration, as we’ll be able to add new parameters to the method without worrying (much) about breaking consumers.
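For orientation, a consumer-side call looks roughly like this (the option names match the validation checks below, but the values and the callback body are purely illustrative):

SE.init({
    clientId: 1234,                               // your registered app's client id (made-up value)
    key: 'YourAppKeyHere',                        // your request key from Stack Apps (made-up value)
    channelUrl: 'https://example.com/blank.html', // a page on your domain used for communication
    complete: function (data) {
        // only start calling SE.authenticate() once this has fired
        console.log('SE.init finished, library version ' + data.version);
    }
});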

You’ll also notice that I’m doing some parameter validation here:

if (!cid) { throw "`clientId` must be passed in options to init"; }
if (!proxy) { throw "`channelUrl` must be passed in options to init"; }
if (!complete) { throw "a `complete` function must be passed in options to init"; }
if (!requestKey) { throw "`key` must be passed in options to init"; }

This is something of a religious position, but I personally find it incredibly frustrating when a minified javascript library blows up because it expected a parameter that wasn’t passed.  This is inordinately difficult to diagnose given how trivial the error is (often being nothing more than a typo), so I’m checking for it in our library and thereby hopefully saving developers some time.

Implementation Detail:  The exact format of these error messages isn’t specified, in fact I suspect we’ll rework them to reduce some repetition and thereby save some bytes.  It is also not guaranteed that we will always check for required parameters (though I doubt we’ll remove it, it’s still not part of the spec) so don’t go using try-catch blocks for control flow.

This odd bit:

if (options.dev) {
    seUrl = 'https://dev.stackexchange.com';
    fetchUserUrl = 'https://dev.api.stackexchange.com/2.0/me/associated';
} else {
    seUrl = 'https://stackexchange.com';
    fetchUserUrl = 'https://api.stackexchange.com/2.0/me/associated';
}

Is for testing on our dev tier.  At some point I’ll get our build setup to strip this out from the production version; there’s a lot of wasted bytes right there.

Implementation Detail: If the above wasn’t enough, don’t even think about relying on passing dev to SE.init(); it’s going away for sure.

The last bit of note in SE.init, is the very last line:

setTimeout(function () { complete({ version: buildNumber }); }, 1);

This is a bit of future proofing as well.  Currently, we don’t actually have any heavy lifting to do in SE.init(), but there very well could be some in the future.  Since we’ll never accept blocking behavior, we know that any significant additions to SE.init() will be asynchronous; and a complete function would be the obvious way to signal that SE.init() is done.

Implementation Detail:  Currently, you can get away with calling SE.authenticate() immediately, without waiting for the complete function passed to SE.init() to execute.  Don’t do this, as you may find that your code will break quite badly if our provided library starts doing more work in SE.init().

Next up is fetchUsers(), an internal method that handles fetching network_users after an authentication session should the consumer request them.  We make a JSONP request to /me/associated, since we cannot rely on the browser understanding CORS headers (which are themselves a fairly late addition to the Stack Exchange API).

Going a little out of order, here’s how we attach the script tag.

while (window[callbackName] || document.getElementById(callbackName)) {
    callbackName = 'sec' + rand();
}
window[callbackName] = callbackFunction;
src += '?access_token=' + encodeURIComponent(token);
src += '&pagesize=100';
src += '&key=' + encodeURIComponent(requestKey);
src += '&callback=' + encodeURIComponent(callbackName);
src += '&filter=!6RfQBFKB58ckl';
script = document.createElement('script');
script.type = 'text/javascript';
script.src = src;
script.id = callbackName;
document.getElementsByTagName('head')[0].appendChild(script);

The only interesting bit here is the while loop making sure we don’t pick a callback name that is already in use.  Such a collision would be catastrophically bad, and since we can’t guarantee anything about the hosting page we don’t have a choice but to check.

Implementation Detail:  JSONP is the lowest common denominator, since many browsers still in use do not support CORS.  It’s entirely possible we’ll stop using JSONP in the future, if CORS supporting browsers become practically universal.

Our callbackFunction is defined earlier as:

callbackFunction =
    function (data) {
        try {
            delete window[callbackName];
        } catch (e) {
            window[callbackName] = undefined;
        }
        script.parentNode.removeChild(script);
        if (data.error_id) {
            error({ errorName: data.error_name, errorMessage: data.error_message });
            return;
        }
        success({ accessToken: token, expirationDate: expires, networkUsers: data.items });
    };

Again, this is fairly pedestrian.  One important thing that is often overlooked when making these sorts of libraries is the cleanup of script tags and callback functions that are no longer needed.  Leaving those lingering around does nothing but negatively affect browser performance.

Implementation Detail:  The try-catch block is a workaround for older IE behaviors.  Some investigation into whether setting the callback to undefined performs acceptably for all browsers may let us shave some bytes there, and remove the block.

Finally, we get to the whole point of this library: the SE.authenticate() method.

We do the same parameter validation we do in SE.init, though there’s a special case for scope.

if (scopeOpt && Object.prototype.toString.call(scopeOpt) !== '[object Array]') { throw "`scope` must be an Array in options to authenticate"; }

Because we can’t rely on the presence of Array.isArray in all browsers, we have to fall back on this silly toString() check.
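The same check can be wrapped up as a tiny helper; a sketch of the idea (the library itself just inlines the check to save bytes):

var isArray = Array.isArray || function (obj) {
    // fall back to the toString trick on browsers without Array.isArray
    return Object.prototype.toString.call(obj) === '[object Array]';
};

isArray([]);     // true
isArray('nope'); // false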

The meat of SE.authenticate() is in this block:

if (window.postMessage && !oldIE()) {
    if (window.attachEvent) {
        window.attachEvent("onmessage", handler);
    } else {
        window.addEventListener("message", handler, false);
    }
} else {
    poll =
        function () {
            if (!opened) { return; }
            if (opened.closed) {
                clearInterval(pollHandle);
                return;
            }
            var msgFrame = opened.frames['se-api-frame'];
            if (msgFrame) {
                clearInterval(pollHandle);
                handler({ origin: seUrl, source: opened, data: msgFrame.location.hash });
            }
        };
    pollHandle = setInterval(poll, 50);
}
opened = window.open(url, "_blank", "width=660, height=480");

In a nutshell, if a browser supports (and properly implements, unlike IE8) postMessage we use that for cross-domain communication; otherwise we use the old iframe trick.  The iframe approach here isn’t the most elegant (polling isn’t strictly required) but it’s simpler.

Notice that if we end up using the iframe approach, I’m wrapping the results up in an object that quacks enough like a postMessage event to make use of the same handler function.  This is easier to maintain, and saves some space through code reuse.

Implementation Detail:  Hoo boy, where to start.  First, the usage of postMessage or iframes shouldn’t be relied upon.  Nor should the format of those messages sent.  The observant will notice that stackexchange.com detects that this library is in use, and only creates an iframe named “se-api-frame” when it is; this behavior shouldn’t be relied upon.  There’s quite a lot in this method that should be treated as a black box; note that the communication calisthenics this library is doing isn’t necessary if you’re hosting your javascript under your own domain (as is expected of other, more fully featured, libraries like those found on Stack Apps).

Here’s the handler function:

handler =
    function (e) {
        if (e.origin !== seUrl || e.source !== opened) { return; }
        var i,
            pieces,
            parts = e.data.substring(1).split('&'),
            map = {};
        for (i = 0; i < parts.length; i++) {
            pieces = parts[i].split('=');
            map[pieces[0]] = pieces[1];
        }
        if (+map.state !== state) {
            return;
        }
        if (window.detachEvent) {
            window.detachEvent("onmessage", handler);
        } else {
            window.removeEventListener("message", handler, false);
        }
        opened.close();
        if (map.access_token) {
            mapSuccess(map.access_token, map.expires);
            return;
        }
        error({ errorName: map.error, errorMessage: map.error_description });
    };

You’ll notice that we’re religious about checking the message for authenticity (origin, source, and state checks).  This is very important as it helps prevent malicious scripts from using our script as a vector into a consumer; security is worth throwing bytes at.

Again we’re also conscientious about cleaning up, making sure to unregister our event listener, for the same performance reasons.

I’m using a mapSuccess function to handle the conversion of the response and invocation of success (and optionally calling fetchUsers()).  This is probably wasting some space and will get refactored sometime in the future.

I’m passing expirationDate to success as a Date because of a mismatch between the Stack Exchange API (which talks in “seconds since the unix epoch”) and javascript (which while it has a dedicated Date type, thinks in “milliseconds since unix epoch”).  They’re just similar enough to be confusing, so I figured it was best to pass the data in an unambiguous type.
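The conversion itself is just a factor of 1,000; a sketch of the idea (not the library’s exact code, and apiSeconds is a placeholder name):

// the API speaks in seconds, while the javascript Date constructor takes milliseconds
var expirationDate = new Date(apiSeconds * 1000);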

Implementation Detail:  The manner in which we’re currently calculating expirationDate can over-estimate how long the access token is good for.  This is legal, because the expiration date of an access token technically just specifies a date by which the access token is guaranteed to be unusable (consider what happens to an access token for an application a user removes immediately after authenticating to).

Currently we’ve managed to squeeze this whole affair down into a little less than 3K worth of minified code, which gets down under 2K after compression.  Considering caching (and our CDN) I’m pretty happy with the state of the library, though I have some hope that I can get us down close to 1K after compression.

[ed. As of version 568, the Stack Exchange Javascript SDK is down to 1.77K compressed, 2.43K uncompressed.]


Stack Exchange API V2.0: Authentication

The most obvious addition to the 2.0 version of the Stack Exchange API (beta and contest currently under way) is authentication (authorization technically but the distinction isn’t pertinent to this discussion) in the form of OAuth 2.0.  With this new feature, it’s now possible for a user to demonstrate to a third party who they are on any site in the Stack Exchange network.

Why OAuth 2.0?

OAuth 2.0 was a pretty easy choice.  For one, there aren’t that many well known authentication protocols out there.  OAuth 1.0a, OAuth 2.0, OpenID (sort of), … and that’s about it.

Though we’re quite familiar with OpenID, all it does is demonstrate who you are to a consumer; there’s no token for subsequent privileged requests.  Furthermore, OpenID is… tricky to consume.  At Stack Exchange we make use of the excellent dotNetOpenAuth, but when you’re providing an API you can’t assume all clients have an “easy out” in the form of a library; simplicity is king.

OAuth 1.0a is something of a disaster, in my professional opinion.  It certainly works, but it’s very complicated with numerous flows that frankly most applications don’t need.  It also forces developers to deal with the nitty gritty details of implementing signatures, and all sorts of encoding headaches (remember, an API cannot assume that developers can hand that work off to a library).  If we had no other options then we’d go with OAuth 1.0a, but thankfully that’s not the case.

OAuth 2.0’s main strength is consumer simplicity.  A simple redirect, POST (or just a redirect, depending on flow), and then out pops a token for privileged requests.  It does impose a little bit of additional complexity on our end, as HTTPS is mandated by the standard, but extra complexity on our end is fine (as opposed to on the consumer’s end).  OAuth 2.0 is fairly new, and as with OAuth 1.0a a “conforming implementation” is a matter of debate, so it’s not all roses; but it’s the best of the bunch.
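For the explicit (server-side) flow, that consumer simplicity boils down to roughly the following sketch.  The endpoint paths and scope names shown should be checked against our OAuth documentation rather than taken from this snippet, and clientId, redirectUri, and state are placeholders:

// Step 1: send the user off to authorize
var authorizeUrl = 'https://stackexchange.com/oauth' +
    '?client_id=' + clientId +
    '&scope=' + encodeURIComponent('read_inbox no_expiry') +  // space delimited per the spec; commas also accepted
    '&redirect_uri=' + encodeURIComponent(redirectUri) +
    '&state=' + encodeURIComponent(state);

// Step 2: the user lands back on redirect_uri with ?code=...&state=... appended
// Step 3: POST that code (plus client_id, client_secret, and redirect_uri) to
//         https://stackexchange.com/oauth/access_token, and out pops the access token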

Implementation Details

I explicitly modeled our implementation of OAuth 2.0 on Facebook’s, under the assumption that in the face of any ambiguities in the spec it’d be best if we went along with a larger provider’s interpretation.  Not to imply that Facebook is automatically correct, but following it is the best option for those developing against our API; anyone wildly Googling for OAuth 2.0 is likely to find details on Facebook’s implementation, and it’d be best if they were also true for Stack Exchange’s.

For example, the OAuth 2.0 spec calls for scopes to be space delimited while Facebook requires commas; at Stack Exchange we accept both.  The spec also (for some bizarre reason) leaves the success and error responses when exchanging auth codes for access tokens in the explicit flow up to the implementation; in both cases we mimic Facebook’s implementation.

The Stack Exchange user model introduces some complications as well.  On Stack Exchange you conceptually have 1 account (with any number of credentials) and many users (1 on each different site, potentially); you can be logged in as any number of users (with a cookie on each site) but aren’t really logged in at an account level.  Since we didn’t want users to have to authenticate to each site they’re active on (with 70+ sites in the network, this would be incredibly unfriendly for power users) we needed to choose a single site that would serve as a “master site” and mediate logins required during OAuth 2.0 flows; we ended up choosing stackexchange.com to fill this role.

The Elephant In The Room

Pictured: The Stack Exchange conference room

By now, some of you are thinking “what’s the point, it’s trivial to compromise credentials in an OAuth flow”.  Simple phishing for username/password in an app, more complicated script injection schemes, pulling cookies out of hosted browser instances, etc.  There are some arguments against it from a UX standpoint as well.

Honestly, I agree with a good deal of these arguments.  In a lot of cases OAuth really is a lot weaker than it’s been portrayed, though I would argue that it helps protect honest developers from themselves.  You can’t make any silly mistakes around password storage if you never have an opportunity to store a password, for example.

However, these arguments aren’t really pertinent in Stack Exchange’s case because we’ve already settled on OpenID for login.  This means that even if we wanted to support something akin to xAuth we couldn’t; a user’s username/password combo is useless to us.  So we’re stuck with something that depends on a browser, short of pulling off of OpenID altogether (which is almost certainly never going to happen, for reasons I hope are obvious).


Stack Exchange API V2.0: Implementing Filters

As part of this series of articles, I’ve already discussed why we added filters to the Stack Exchange API in version 2.0 (go check it out, you could win a prize).  Now I’m going to discuss how they were implemented and what drove the design.

Considerations

Stability

It is absolutely paramount that filters not break, ever.  A lot of the benefits of filters go away if applications are constantly generating them (that is, if they aren’t “baked into” executables), and “frustrated” would be a gross understatement of how developers would feel if we kept forcing them to redistribute their applications with new filters.

From stability, it follows that filters need to be immutable.  Consider the denial of service attack that would be possible were a malicious party able to extract, and then modify, a filter baked into a popular application.

Speed

One of the big motivations behind filters was improving performance, so it follows that the actual implementation of filters shouldn’t have any measurable overhead.  In practice this means that no extra database queries (and preferably no network traffic at all) can occur as consequence of passing a filter.

Ease of Use

While it’s probably impossible to make using a filter more convenient than not using one, it’s important that using filters not be a big hassle for developers.  Minimizing the number of filters that need to be created, and providing tools to aid in their creation are thus worthwhile.

Implementation

Format

Filters, at their core, ended up being a simple bitfield of which fields to include in a response.  Bitfields are fast to decode, and naturally stable.

Also, every field on every type is encoded in this bitfield.  This is important for the ease of use consideration, as it makes it possible to use a single filter for all your requests.

Encoding

A naive bitfield for a filter would have, at time of writing, 282 bits.  This is a bit hefty (a base64 encoded naive filter would be approximately 47 characters long, for example), so it behooves us to compress it somewhat.

An obvious and simple compression technique is to run-length encode the bitfield.  We make this even more likely to bear fruit by grouping the bits first by “included in the default filter” and then by “type containing the field”.  This grouping exploits the expectation that filters will commonly either diverge from the default filter or focus on particular types.
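To illustrate the idea (this is not our actual encoding scheme), here’s a minimal run-length encoder over a string of bits; the grouping described above is what makes the runs long enough for this to pay off:

// Encode a bit string like '11100000011' as [count, bit] pairs.
function runLengthEncode(bits) {
    var runs = [], i, current = bits[0], count = 0;
    if (!bits.length) { return runs; }
    for (i = 0; i < bits.length; i++) {
        if (bits[i] === current) {
            count++;
        } else {
            runs.push([count, current]);
            current = bits[i];
            count = 1;
        }
    }
    runs.push([count, current]);
    return runs;
}

runLengthEncode('11100000011'); // [[3, '1'], [6, '0'], [2, '1']]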

We also tweak the characters we’ll use to encode a filter a bit, so we’re technically encoding in a base higher than 64; though we’re losing a character to indicate safe/unsafe (which is a discussion for another time).

All told, this gets the size of the filters we’re seeing in the wild down to a manageable 3 to 29 characters.

Bit Shuffling

This one’s a bit odd, but in the middle of the encoding step we do some seemingly pointless bit shuffling.  What we’re trying to do here is enforce opaqueness; why we’d want to do that deserves some explanation.

A common problem when versioning APIs is discovering that a number of consumers (oftentimes an uncomfortably large number) are getting away with doing technically illegal things.  An example is SetWindowsHook in Win16 (short version, developers could exploit knowledge of the implementation to avoid calling UnhookWindowsHook), one from V1.0 of the Stack Exchange API is /questions/{id}/comments also accepting answer ids (this exploits /posts/{ids}/comments, /questions/{ids}/comments, and /answers/{ids}/comments all being aliases in V1.x).  When you find such behavior you’re left choosing between breaking consumers or maintaining “bugs” in your implementation indefinitely, neither of which are great options.

The point of the bit shuffling is both to make it harder to figure out the implementation (though naturally not impossible; the average developer is more than bright enough to figure our scheme out given enough time), so that such “too clever for your own good” behavior is harder to pull off, and to really drive home the point that you shouldn’t be creating filters without calling /filters/create.

Backwards Compatibility

Maintaining compatibility between API versions with filters is actually pretty simple if you add one additional constraint: you never remove a bit from the field.  This lets you use the length of the underlying bitfield as a version number.

Our implementation maintains a list of fields paired with the length of the bitfield they were introduced on.  This lets us figure out which fields were available when a given filter was created, and exclude any newer fields when we encounter an old filter.
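A sketch of that bookkeeping (the field names, lengths, and structure here are invented for illustration):

// Each field records the bitfield length at the time it was introduced.
var fields = [
    { name: 'question.title', introducedAtLength: 240 },
    { name: 'question.link',  introducedAtLength: 282 }  // added later, once the bitfield had grown
];

// The length of an old filter's bitfield tells us which fields existed when it
// was created; anything newer is simply excluded from the response.
function fieldsAvailableTo(filterBitLength) {
    return fields.filter(function (f) {
        return f.introducedAtLength <= filterBitLength;
    });
}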

Composing A Query

One prerequisite for filters is the ability to easily compose queries against your datastore.  After all, it’s useless to know that certain fields shouldn’t be fetched if you can’t actually avoid querying for them.

In the past we would have used LINQ-to-SQL, but performance concerns have long since led us to develop and switch to Dapper, and SqlBuilder in Dapper.Contrib.

Here’s a rough outline of building part of an answer object query.

// While technically optional, we always need this so *always* fetch it
builder.Select("Id AS answer_id");
builder.Select("ParentId AS question_id");

// links and title are backed by the same columns
if (Filter.Answer.Link || Filter.Answer.Title)
{
    builder.LeftJoin("dbo.Posts Q ON Q.Id = ParentId");
    builder.Select("Q.Title as title");
}

if (Filter.Answer.LastEditDate)
{
    builder.Select("LastEditDate AS last_edit_date");
}

The actual code is a bit heavier on extension methods and reuse.

Note that sometimes we’ll grab more data than we intend to return, such as when fetching badge_count objects we always fetch all three counts even if we only intend to return, say, gold.  We rely on some IL magic just before we serialize our response to handle those cases.

Caches

The Stack Exchange network sites would fall over without aggressive caching, and our API has been no different.  However, introducing filters complicates our caching approach a bit.

In V1.x, we just maintained query -> response and type+id -> object caches.  In V2.0, we need to account for the fields actually fetched or we risk responding with too many or too few fields set when we have a cache hit.

The way we deal with this is to tag each object in the cache with a mini-filter which contains only those types that could have been returned by the method called.  For example, the /comment mini-filter would contain all the fields on the comment and shallow_user types.  When we pull something out of the cache, we can check to see if it matches by seeing if the cached mini-filter covers the relevant fields in the current request’s filter; and if so, use the cached data to avoid a database query.

One clever hack on top of this approach lets us service requests for filters that we’ve never actually seen before.  When we have a cache hit for a given type+id pair but the mini-filter doesn’t cover the current request, we run the full request (database hit and all) and then merge the cached object with the returned one and place it back in the cache.  I’ve taken to calling this “merge and return to cache” process widening an object in cache.

Imagine: request A comes in asking for 1/2 the question fields, request B now comes in asking for the other 1/2, then request C comes in asking for all the fields on question.  When A is processed there’s nothing in the cache, we run the query and place 1/2 of a question object in cache.  When B is processed, we find the cached result of A but it doesn’t have the fields needed to satisfy B; so we run the query, widen the cached A with the new B.  When C is processed, we find the cached union of A and B and voilà, we can satisfy C without hitting the database.

One subtlety is that you have to make sure a widened object doesn’t remain in cache forever.  It’s all too easy for an object to gain a handful of fields over many subsequent queries, resetting its expiration each time, causing you to serve exceptionally stale data.  The exact solution depends on your caching infrastructure; we just add another tag to the object with its maximum expiration time, and anything we pull out of the cache that’s past due to be expired is ignored.
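Condensed into code, the whole scheme looks something like this sketch (the cache client, the helper functions, the field-set representation, and the expiration window are all invented for illustration):

// Helpers (illustrative): field sets are plain arrays of field names.
function covers(have, want) { return want.every(function (f) { return have.indexOf(f) !== -1; }); }
function union(a, b) { return a.concat(b.filter(function (f) { return a.indexOf(f) === -1; })); }
function merge(oldObj, newObj) {
    var merged = {}, k;
    for (k in oldObj) { merged[k] = oldObj[k]; }
    for (k in newObj) { merged[k] = newObj[k]; }
    return merged;
}

// cache can be any get/set store (a Map works); fetchFromDb(fields) stands in for the database query.
function getOrFetch(cache, key, requestedFields, fetchFromDb) {
    var entry = cache.get(key),
        live = entry && entry.hardExpiration > Date.now();

    // Cache hit: the stored mini-filter covers everything this request needs.
    if (live && covers(entry.miniFilter, requestedFields)) {
        return entry.object;
    }

    var fresh = fetchFromDb(requestedFields);

    // "Widening": merge the cached fields with the newly fetched ones, but never
    // extend the original hard expiration, so widened objects still age out.
    if (live) {
        fresh = merge(entry.object, fresh);
        requestedFields = union(entry.miniFilter, requestedFields);
    }

    cache.set(key, {
        object: fresh,
        miniFilter: requestedFields,
        hardExpiration: live ? entry.hardExpiration : Date.now() + 60 * 1000
    });

    return fresh;
}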

Tooling

We attacked the problem of simplifying filter usage in two ways: providing a tool to generate filters, and enabling rapid prototyping with a human-friendly way to bypass filter generation.

Stack Exchange's Emmett Nicholas did a lot of the UI work here.

We spent a lot of time getting the GUI for filter editing up to snuff in our API console (pictured, the /questions console).  With just that console you can relatively easily generate a new filter, or use an existing one as a starting point.  For our internal development practically all filters have ended up being created via this UI (which is backed by calls to /filters/create); dogfooding has led me to be pretty satisfied with the result.

For those developers who aren’t using the API console when prototyping, we allow filters to be specified with the “include”, “exclude”, and “base” parameters (the format being the same as calls to /filters/create).  The idea here is that if you just want a quick query for, say, total questions, you probably don’t want to go through the trouble of generating a filter; instead, just call /questions?include=.total&base=none&site=stackoverflow.  However, we don’t want such queries to make their way into released applications (they’re absurdly wasteful of bandwidth, for one) so we need a way to disincentivize them outside of ad hoc queries.  We do this by making them available only when a query doesn’t pass an application key, and since increased quotas are linked to passing application keys we expect the majority of applications to use filters correctly.


Stack Exchange API V2.0: Filters

We’re underway with the 2.0 version of the Stack Exchange API, so there’s no time like the present to get my thoughts on it written down.  This is the first in a series of nine posts about the additions, changes, and ideas in and around our latest API revision.  I’m sharing details because I think they’re interesting, but not with the expectation that everything I talk about will be generally applicable to other API designers.

First up, the addition of filters.

Mechanics

Pictured: the state of our documentation in the V1.0 beta.

Filters take the form of an opaque string passed to a method which specifies which fields you want returned.  For example, passing “!A7x.GE1T” to /sites returns only the names of the sites in the Stack Exchange network (this is a simplification, more details when we get to implementation).  This is similar to, but considerably terser than, partial returns via the “fields” parameter as implemented by Facebook and Google (note that we do allow something similar for key-less requests via the “include” and “exclude” parameters).

You can think of filters as redacting returned fields.  Every method has some set of fields that can be returned, and a filter specifies which of those fields shouldn’t be.  If you’re more comfortable thinking in SQL, filters specify the selected columns (the reality is a bit more complicated).

Filters are created by passing the fields to include, those to exclude, and a base filter to the /filters/create method.  They’re immutable and never expire, making it possible (and recommended) to generate them once and then bake them into applications for distribution.
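A sketch of what that looks like in practice (the field names, the returned filter string, and the exact parameter format are illustrative; check the /filters/create documentation rather than trusting this snippet):

// Create a filter once, at development time, then bake the returned string into the app.
fetch('https://api.stackexchange.com/2.0/filters/create' +
      '?include=' + encodeURIComponent('.items;site.name') +  // fields to keep (illustrative names)
      '&base=none' +                                          // start from an empty filter
      '&unsafe=false')                                        // ask for a "safe" filter
    .then(function (response) { return response.json(); })
    .then(function (data) {
        var filter = data.items[0].filter;  // an opaque string, e.g. something like "!A7x.GE1T"
        // later, at run time:
        // fetch('https://api.stackexchange.com/2.0/sites?filter=' + filter)
    });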

Motivations

There are two big motivations, and a couple of minor ones, for introducing filters.

Performance

We also log and monitor per-request SQL and CPU time for profiling purposes.

The biggest one was improved performance in general, and allowing developers to tweak API performance in particular.  In the previous versions of the Stack Exchange API you generally fetched everything about an object even if you only cared about a few properties.  There were ways to exclude comments and answers (which were egregiously expensive in some cases) but that was it.

For example, imagine all you cared about were the users most recently active in a given tag (let’s say C#).  In both V1.1 and V2.0 the easiest way to query this would be to use the /questions route with the tagged parameter.  In V1.1 you can exclude body, answers, and comments but you’re still paying for close checks, vote totals, view counts, etc.  In V2.0 you can get just the users, letting you avoid several joins and a few queries.  The adage “the fastest query is one you never execute” holds, as always.
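Something along these lines, for instance (the field list is illustrative, and a released application would use a baked-in filter rather than include/exclude):

// V2.0: recent C# questions, but only the owner objects are materialized
var url = 'https://api.stackexchange.com/2.0/questions' +
    '?site=stackoverflow' +
    '&tagged=' + encodeURIComponent('c#') +
    '&sort=activity' +
    '&base=none' +
    '&include=' + encodeURIComponent('.items;question.owner;shallow_user.display_name');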

Bandwidth

Related to performance, some of our returns can be surprisingly large.  Consider /users, which doesn’t return bodies (as /questions, /answers, and so on do) but does return an about_me field.  These fields can be very large (at time of writing, the largest about_me fields are around 4k) and when multiplied by the max page size we’re talking about wasting 100s of kilobytes.

Even in the worst cases this is pretty small potatoes for a PC, but for mobile devices both the wasted data (which can be shockingly expensive) and the latency of fetching those wasted bytes can be a big problem.  In V1.1 the only options we had were per-field true/false parameters (the aforementioned answers and comments) which quickly becomes unwieldy.  Filters in V2.0 let us handle this byte shaving in a generic way.

Saner Defaults

In V1.1, any data we didn’t return by default required either a new method or a new parameter to get at, which made us err on the side of “return it by default”.  Filters let us be much more conservative in what our default returns are.

A glance at the user type reveals a full six fields we’d have paid the cost of returning under the V1.1 regime.  Filters also provide a convenient place to hang properties that are app wide (at least most of the time), such as “safety” (which is a discussion for another time).

Interest Indications

Filters give us a great insight into what fields are actually of interest to API consumers. Looking into usage gives us some indicators on where to focus our efforts, both in terms of optimization and new methods to add.  Historically my intuition about how people use our API has been pretty poor, so having more signals to feed back into future development is a definite nice to have.

While not the sexiest feature (that’s probably authentication), filters are probably my favorite new feature.  They’re a fairly simple idea that solves a lot of common, general problems.  My next post (ed: available here) will deal with some of the implementation details of our filter system.


Disabling Third-Party Cookies Doesn’t (Meaningfully) Improve Privacy

Cookies aren't just for the dark side.

I noticed in some discussion on Hacker News about Google Chrome an argument that disabling third-party cookies somehow improved privacy.  I don’t intend to comment on the rest of the debate, but this particular assertion is troubling.

At time of writing, only two browsers interfere with third-party cookies in any meaningful way.  Internet Explorer denies setting third-party cookies unless a P3P header is sent.  This is basically an evil bit, and just as pointless.  No other browser even pretends to care about this standard.

The other is Apple’s Safari browser, which denies setting third-party cookies unless a user has “interacted” with the framed content.  The definition of “interacted” is a bit fuzzy, but clicking seems to do it.  No other browser does this, or anything like it.  There are some laughably simple hacks around this, like floating an iframe under the user’s cursor (and, for some reason, submitting a form with a POST method).  Even if those hacks didn’t exist, the idea is still pointless.

The reason I know about these rules is that we had to work around them when implementing auto-logins at Stack Exchange (there was an earlier version that straight up did not work for Safari due to its reliance on third-party cookies).  This also came up when implementing the Stack Exchange OpenID Provider, as we frame login and account creation forms on our login page.

For auto-logins, I ended up using a combination of localStorage and postMessage that works on all modern browsers (since it’s not core functionality we were willing to throw IE7 under a bus at the time, and now that IE9 is out we don’t support IE7 at all).  StackID tries some workarounds for Safari, and upon failure displays an error message providing some guidance.
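The rough shape of that approach: frame a page from the central login domain, and have it report what’s in its own first-party localStorage back to the parent via postMessage.  A bare-bones sketch, with the domains, storage key, and message format all invented for illustration (and none of the hardening a real implementation needs):

// --- framed page, served from the central login domain ---
window.addEventListener('message', function (e) {
    if (e.origin !== 'https://some-site.example.com') { return; }  // only answer known parents
    e.source.postMessage(localStorage.getItem('global-auth') || '', e.origin);
}, false);

// --- parent page, on an individual site ---
var frame = document.createElement('iframe');
frame.src = 'https://login.example.com/auth-frame';
frame.style.display = 'none';
document.body.appendChild(frame);

window.addEventListener('message', function (e) {
    if (e.origin !== 'https://login.example.com') { return; }
    if (e.data) { /* a global session exists; establish the local login */ }
}, false);

frame.onload = function () {
    frame.contentWindow.postMessage('auth?', 'https://login.example.com');
};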

These methods are somewhat less nefarious than this, but just slightly.

The joke is that there are alternatives that work just fine

ETags have gotten a lot of press, the gist being that you re-purpose a caching mechanism for tracking (similar tricks are possible with the Last-Modified header).  This is a fundamental problem with any cache expiration scheme that isn’t strictly time based, as a user will always have to present some (potentially identifying) token to a server to see if their cache is still valid.

Panopticlick attacks the problem statistically, using the fact that any given browser is pretty distinctive in terms of headers, plugins, and so on independent of any cookies or cache directives.  My install of Chrome in incognito mode provides ~20 bits of identifying information, which if indicative of the population at large implies a collision about every 1,200 users.  In practice, most of these strings are globally unique so coupled with IP based geo-location it is more than sufficient for tracking if you’re only concerned with a small percentage of everyone on Earth.  Peter Eckersley’s paper on the subject also presents a rudimentary algorithm for following changing fingerprints (section 5.2), so you don’t even have to worry about increased instability when compared to third-party cookies.

You can get increasingly nefarious with things like “image cookies,” where you create a unique image and direct a browser to cache it forever.  You then read the colors out via HTML5’s Canvas, and you’ve got a string that uniquely identifies a browser.  This bypasses any same origin policy (like those applied to cookies and localStorage) since all browsers will just pull the image out of cache regardless of which domain the script is executing under.  I believe this technique was pioneered by Evercookie, but there may be some older work I’m not aware of.

If you’ve been paying attention, you’ll notice that none of these techniques are exactly cutting edge.  They’re still effective due in large part to the fact that closing all of these avenues would basically break the internet.

They aren't the most friendly of UIs, but they exist.

Why do we stick to cookies and localStorage?

The short of it is that we over at Stack Exchange are “Good Guys™,” and as such we don’t want to resort to such grey (or outright black) hat techniques even if we’re not using them nefariously.  I hope the irony of doing the “right thing” being more trouble than the alternative isn’t lost on anyone reading this.

More practically, after 15 years of popular internet usage normal people actually kind-of-sort-of get cookies.  Not in any great technical sense, but in the “clear them when I use a computer at the library” sense.  Every significant browser also has a UI for managing them, and a way to wipe them all out.  It’s for this reason that our OpenID provider only uses cookies, since it’s more important that it be practically secure-able than usable; at least when compared to the Stack Exchange sites themselves.

For global login, localStorage is acceptable since clearing it is somewhat less important.  You can only login to existing accounts, only on our network, and on that network there are significant hurdles preventing really nefarious behavior (you cannot permanently destroy your account, or your content in most cases).

This reference predates Internet Explorer's cookie support.

What good does Safari’s third-party cookie behavior do?

Depending on how cynical you are, one of: nothing, mildly inconveniencing unscrupulous ad networks, or childishly spiting Google.  I’m in the “nothing” category as there’s too much money to be had to believe it deters the seedier elements of the internet, and the notion that Apple would try to undermine a competitor’s revenue stream this way is too conspiracy theory-ish for me to take seriously.

I can believe someone at Apple thinks it helps privacy, but in practice it clearly doesn’t.  At best, it keeps honest developers honest (not that they needed any prompting for this) and at worst it makes it even harder for users to avoid tracking as more and more developers resort to the more nefarious (but more reliable!) alternatives to third-party cookies.

There may be legitimate complaints about browsers’ default behavior with regard to privacy, but having third-party cookies enabled by default isn’t one of them.


History Of The Stack Exchange API, Version 1.1

In February we rolled out version 1.1 of the Stack Exchange API.  This version introduced 18 new methods, a new documentation system, and an application gallery.

Developing this release was decidedly different than developing version 1.0.  We were much more pressed for time as suggested edits (one of our bigger changes to the basic site experience) were being developed at basically the same time.  Total development time on 1.1 amounted to approximately one month, as compared to three for 1.0.

The time constraint meant that our next API release would be a point release (in that we wouldn’t be able to re-implement much), which also meant we were mostly constrained by what had gone before.  Version 1.0 had laid down some basic expectations: vectorized requests, a consistent “meta object” wrapper, JSON returns, and so on.  This was a help, since a lot of the work behind an API release is in deciding these really basic things.  It also hurt some though, since we couldn’t address any of the mistakes that had become apparent.

How we decided what to add in 1.1

There’s one big cheat available to Stack Exchange here; we’ve got a user base chock full of developers requesting features.  This is not to suggest that all requests have been good ones, but they certainly help prevent group-think in the development team.

More generally, I approached each potential feature with this checklist.

  • Has there been any expressed interest in the feature?
  • Is it generally useful?
  • Does it fit within the same model as the rest of the API?

Take everything that passes muster, order it by a combination of usefulness and difficulty of implementation (which is largely educated guesswork), and take however many you think you’ve got time to implement off the top.  I feel the need to stress that this is an ad hoc approach; while bits and pieces of this process were written down (in handy todo.txt files) there wasn’t a formal process or methodology built around it.  No index cards, functional specs, planning poker, or what have you (I’m on record [25 minutes in or so] saying that we don’t do much methodology at Stack Exchange).

Careers's distinguishing feature is contact based, not data based.

Some examples from 1.1

Some new methods, like /questions/{ids}/linked, were the direct results of feature requests.  Others, like /users/…/top-answers, came from internal requests; this one in support of Careers 2.0 (we felt it was important that most of the data backing Careers be publicly available with the introduction of passive candidates).  Both methods easily pass the “expressed interest” bar.

General usefulness is fuzzier, and therefore trickier to show; it is best defined by counter-example in my opinion.  Trivial violators are easy to imagine, the /jon-skeet or /users-born-in-february methods, but more subtle examples are less forthcoming.  A decent example of a less than general method is one which gives access to the elements of a user’s global inbox which are public (almost every type of notification is in response to a public event, but there are a few private notifications).  This would be useful only in the narrow cases where an app wants some subset of a user’s inbox data, but doesn’t want to show the inbox itself.  I suspect this would be a very rare use case, based on the lack of any request for similar features on the sites themselves.  It has the extra problem of being almost certain to be deprecated by a future API version that exposes the whole of an inbox in conjunction with user authentication.

One pitfall that leads to less than generally useful methods is to depend too much on using your own API (by building example apps, or consuming it internally, for example) as a method of validating design.  The approach is a popular one, and it’s not without merit, but you have to be careful not to write “do exactly what my app needs (but nearly no other app will)” methods.  The Stack Exchange API veers a little into this territory with the /users/{ids}/timeline method, which sort of assumes you’re trying to write a Stack Exchange clone; it’s not actually too specialized to be of no other use, but it’s less than ideally general.

Whether something “fits” can be a tad fuzzy as well.  For instance, while there’s nothing technically preventing the /users/moderators method from returning a different type than /users (by adding, say, an elected_on_date field) I feel that would still be very wrong.  A more subtle example would be a /posts method that behaves like a union of /questions, /answers, and /comments.  There’s some clear utility (like using it to get differential updates); however, such a method wouldn’t “fit,” because we currently have no notion of returning a heterogeneous set of objects.  There are also sharper “doesn’t fit” cases, like adding a method that returns XML (as the rest of the API returns JSON) or carrying state over between subsequent API calls (the very thought of which fills me with dread).

There was some experimentation in 1.1

In 1.1 almost everything done was quite safe: we didn’t change existing methods, we didn’t add new fields, and really there weren’t any radical changes anywhere.  Well… except for two methods, /sites and /users/{id}/associated, which got completely new implementations (the old ones naturally still available under /1.0).

These new versions address some of the shortcomings we knew about in the API in general, and some problems peculiar to those methods in 1.0 (most of which stem from underestimating how many sites would be launched as part of Stack Exchange 2.0).  Getting these methods, which would more properly belong in version 2.0, out early allowed us to get some feedback on the direction planned for the API.  We had the fortune of having a couple of well isolated methods (their implementations are completely independent of the rest of the API) that needed some work anyway on which to test our future direction; I’m not sure this is something that can reasonably be applied to other APIs.

The world of tomorrow

Version 1.1 is the current release of the Stack Exchange API, and has been for the last seven months.  Aside from bug fixes, no changes have been made in that period.  While work has not yet begun on version 2.0, it has been promised for this year and some internal discussion has occurred, some documents circulated, and the like.  It’s really just a matter of finding the time now, which at the moment is mostly being taken up by Facebook Stack Overflow and related tasks.


History Of The Stack Exchange API, Mistakes

In an earlier post, I wrote about some of the philosophy and “cool bits” in the 1.0 release of the Stack Exchange API.  That’s all well and good, but of course I’m going to tout the good parts of our API; I wrote a lot of it after all.  More interesting are the things that have turned out to be mistakes, we learn more from failure than success after all.

Returning Total By Default

Practically every method in the API returns a count of the elements the query would return if not constrained by paging.

For instance, all questions on Stack Overflow:

{
  "total": 1936398,
  "page": 1,
  "pagesize": 30,
  "questions": [...]
}

Total is useful for rendering paging controls, and count(*) queries (how many of my comments have been up-voted, and so on); so it’s not that the total field itself was a mistake.  But returning it by default definitely was.

The trick is that while total can be useful, it’s not always useful.  Quite frequently queries take the form of “give me the most recent N questions/answers/users who X”, or “give me the top N questions/answers owned by U ordered by S”.  Neither of these common queries care about total, but they’re paying the cost of fetching it each time.

For simple queries (like the /1.0/questions call above), at least as much time is spent fetching total as is spent fetching data.

“Implicit” Types

Each method in the Stack Exchange API returns a homogeneous set of results, wrapped in a metadata object.  You get collections of questions, answers, comments, users, badges, and so on back.

The mistake is that although the form of the response is conceptually consistent, the key under which the actual data is returned is based on the type.  Examples help illustrate this.

/1.0/questions returns:

{
 "total": 1947127,
 ...
 "questions": [...]
}

/1.0/users returns:

{
 "total": 507795,
 ...
 "users": [...]
}

This makes it something of a pain to write wrappers around our API in statically typed languages.  A much better design would have been a consistent `items` field with an additional `type` field.

How /1.0/questions should have looked:

{
 "total": 1947127,
 "type": "question",
 ...
  "items": [...]
}

This mistake became apparent as more API wrappers were written.  Stacky, for example, has a number of otherwise pointless classes (the “Responses” classes) just to deal with this.
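
To make that pain concrete, here is a hedged TypeScript sketch (not Stacky’s actual code) of what the per-type keys force on a statically typed wrapper, next to what a consistent items field would have allowed:

// With 1.0's design, every response needs its own envelope type,
// because the property holding the data is named after the type.
interface QuestionsResponse {
  total: number;
  page: number;
  pagesize: number;
  questions: Question[];
}

interface UsersResponse {
  total: number;
  page: number;
  pagesize: number;
  users: User[];
}
// ...and so on for answers, comments, badges, etc.

// With a consistent `items` field, one generic envelope covers every method.
interface Envelope<T> {
  total: number;
  page: number;
  pagesize: number;
  type: string; // e.g. "question" or "user"
  items: T[];
}

// Trimmed-down element types, just for the sketch.
interface Question { question_id: number; title: string; }
interface User { user_id: number; display_name: string; }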

It should be obvious what's dangerous, and most things shouldn't be.

Inconsistent HTML “Safety”

This one only affects web apps using our API, but it can be a real doozy when it does.  Essentially, not all text returned from our API is safe to embed directly into HTML.

This is complicated a bit by many of our fields having legitimate HTML in them, making it so consumers can’t just HTML-encode everything.  Question bodies, for example, almost always have a great deal of HTML in them.

This led to the situation where question bodies are safe to embed directly but question titles are not; user “about me” sections are, but display names are not; and so on.  Ideally, everything would be safe to embed directly, except in certain rare circumstances.

This mistake is a consequence of how we store the underlying data.  It just so happens that we encode question titles and user display names “just in time”, while question bodies and user “about me” sections are stored pre-rendered.
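
The upshot for a 1.0 web app is that it has to remember which fields are pre-rendered and which are not.  A minimal TypeScript sketch of the special-casing this forces on consumers (the escaping helper and the trimmed-down type are mine):

// Minimal escaping helper; assume no templating library is at hand.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

interface Question {
  title: string; // stored un-encoded: must be escaped before inlining
  body: string;  // stored pre-rendered HTML: escaping it would be a bug
}

// Rendering a 1.0 question means knowing which field is which.
function renderQuestion(q: Question): string {
  return `<h2>${escapeHtml(q.title)}</h2>\n<div>${q.body}</div>`;
}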

A Focus On Registered Users

There are two distinct mistakes here.  First, we have no way of returning non-existent users.  This question, for instance, has no owner.  In the API, we return no user object even though we clearly know at least the display name of the user.  This comes from 1.0 assuming that every user will have an id, which is a flawed assumption.
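
For illustration only, here is a hypothetical TypeScript shape (not the actual 1.0 wire format, which simply omits the owner) showing how such users could have been returned: an id-less reference that still carries the display name we know.

// Hypothetical shape, for illustration; 1.0 just drops the owner entirely.
interface UserRef {
  user_id?: number;     // absent for unregistered or deleted owners
  display_name: string; // we always know at least this much
}

interface Question {
  question_id: number;
  title: string;
  owner?: UserRef; // 1.0 omits this whenever there is no registered owner
}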

Second, the /1.0/users route only returns registered users.  Unregistered users can be found via their ids, or via some other resource (their questions, comments, etc.).  This is basically a bug that no one noticed until it was too late, and got frozen into 1.0.

I suppose the lesson to take from these two mistakes is that your beta audience (in our case, registered users) and popular queries (which for us are all around questions and answers) have a very large impact on how much polish the different pieces of an API get.  It’s a corollary to Linus’ Law to be aware of, as the eyeballs are not uniformly distributed.

Things not copied from Twitter: API uptime.

Wasteful Request Quotas

Our request quota system is mostly a lift from Twitter’s API, since we figured it was better to borrow from an existing, widely used API than to risk inventing a worse system.

To quickly summarize, we issue every IP using the API a quota (that can be raised by using an app key) and return the remaining and total quotas in the X-RateLimit-Current and X-RateLimit-Max headers.  These quotas reset 24 hours after they are initially set.

This turns out to be pretty wasteful in terms of bandwidth as, unlike Twitter, our quotas are quite generous (10,000 requests a day) and not dynamic.  As with the total field, many applications don’t really care about the quota (until they exceed it, which is rare) but they pay to fetch it on every request.

Quotas are also the only bit of metadata we place in response headers, making them very easy for developers to miss (since no one reads documentation, they just start poking at APIs).  They also aren’t compressed, due to the nature of headers, which goes against our “always compress responses” design decision.
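
For reference, a small TypeScript sketch of how a consumer picks the quota out of those headers.  The header names are the ones documented above; the surrounding function is illustrative.

// Read the remaining/total quota off a response.
async function fetchWithQuota(url: string): Promise<unknown> {
  const res = await fetch(url);

  const remaining = res.headers.get("X-RateLimit-Current");
  const max = res.headers.get("X-RateLimit-Max");

  // Most apps never look at these until they actually hit the limit.
  if (remaining !== null && Number(remaining) === 0) {
    throw new Error(`Quota of ${max} exhausted; it resets 24 hours after it was set.`);
  }

  return res.json();
}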

The Good News

Is that all of these, along with some other less interesting mistakes, are slated to be fixed in 2.0.  We couldn’t address them in 1.1, as we were committed to not breaking backwards compatibility in a point-release (there were also serious time constraints).


History Of The Stack Exchange API, Version 1.0

When I was hired by Stack Exchange, it was openly acknowledged that it was in large part because of my enthusiasm for building an API.  We’ve since gone on to produce an initial 1.0 release, a minor 1.1 update (documented here), and are planning for a 2.0 this calendar year.

If you haven't read Raymond Chen's book (or the blog it's from), why are you reading this?

Raymond Chen's blog is a great source for Windows history. Equivalents for most other topics are sorely lacking.

What I’d like to talk about isn’t 2.0 (though we are planning to publish a spec for comment in the not-too-distant future), but the thought process behind the 1.0 release.  I always find the history behind such projects fascinating, so I’d like to get some of the Stack Exchange API’s history out there.

We weren’t shooting for the moon, we constrained ourselves.

A big one was that 1.0 had to be read-only.

Pragmatically, we didn’t have the resources to devote to the mounds of refactoring that would be required to get our ask and edit code paths up to snuff.  There are also all sorts of rejection cases to handle (at the time we had bans, too many posts in a certain timeframe, and “are you human” captcha checks), which we’d have to expose, and the mechanism would have to be sufficiently flexible to handle new rejection cases gracefully (and we’ve added some in the 1.0 -> 2.0 interim, validating this concern).  There’s also the difficulty in rendering Markdown (with our Stack Exchange specific extensions, plus Prettify, MathJax, jTab, and who knows what else in the future), which needs to be solved if applications built on the Stack Exchange API are to be able to mimic our preview pane.

Philosophically, write is incredibly dangerous.  Not just in the buggy-authentication, logged in as Jeff Atwood, mass content deleting sense; though that will keep me up at night.  More significantly (and insidiously) in the lowered friction, less guidance, more likely to post garbage sense.

Then there are quality checks, duplicate checks, history checks...

Similar titles, similar questions, live preview, tag tips, and a markdown helper. This is just the guidance we give a poster *before* they submit.

We do an awful lot to keep the quality of content on the Stack Exchange network very high (to the point where we shut down whole sites that don’t meet our standards).  A poorly thought out write API is a great way to screw it all up, so we pushed it out of the 1.0 time-frame.  It looks like we’ll be revisiting it in 3.0, for the record.

We also wanted to eliminate the need to scrape our sites.

This may seem a bit odd as a constraint, but there was only so much development time available, and a lot of it needed to be dedicated to this one goal.  The influence of this is really quite easy to see: there’s an equivalent API method for nearly every top-level route on a Stack Exchange site (/users, /badges, /questions, /tags, and so on).

Historically we had tolerated a certain amount of scraping, in recognition that there were valid reasons to get up-to-date data out of a Stack Exchange site, and providing it is in the spirit of the cc-wiki license that covers all of our user-contributed content.  However, scraping is hideously inefficient on both the consuming and producing sides, with time wasted rendering HTML, serving scripts, including unnecessary data, and then stripping all that garbage back out.  It’s also very hard to optimize a site for both programs and users; the access patterns are all different.  By moving scraping off of the main sites and onto an API, we were able to get a lot more aggressive about protecting the user experience by blocking bots that negatively affect it.

Of course, we were willing to try out some neat ideas.

So named for the "vector processors" popularized in the early days of the industry (as in the CM-1 pictured above). More commonly called SIMD today.

Vectorized requests are probably the most distinctive part of our API.  In a nutshell, almost everywhere we accept an id we’ll accept up to 100 of them.

/users/80572;22656;1/

Fetches user records for myself, Jon Skeet, and Jeff Atwood all in one go.

This makes polling for changes nice and easy within a set of questions, users, users’ questions, users’ answers, and so on.  It also makes it faster to fetch lots of data, since you’re only paying for one round trip per 100 resources.

I’m not contending that this is a novel feature, Twitter’s API does something similar for user lookup.  We do go quite a bit further, making it a fundamental part of our API.
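
A sketch of how a consumer leans on this: chunk an arbitrary list of ids into groups of 100 and join each group with semicolons.  The helper and base URL are placeholders of mine; the semicolon-separated id segment is the API’s convention.

// Batch ids into vectorized /users requests, 100 ids per call.
// The base URL is a placeholder, not the real endpoint.
const BASE = "https://api.example.com/1.0";

async function fetchUsers(ids: number[]): Promise<unknown[]> {
  const results: unknown[] = [];

  for (let i = 0; i < ids.length; i += 100) {
    const batch = ids.slice(i, i + 100).join(";");
    const res = await fetch(`${BASE}/users/${batch}`);
    const body = (await res.json()) as { users: unknown[] };
    results.push(...body.users);
  }

  return results;
}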

Representing compression visually is difficult.

We also forced all responses to be GZIP’d.  The rationale for this has been discussed a bit before, but I’ll reiterate.

Not GZIP’ing responses is a huge waste for all parties.  We waste bandwidth sending responses, and the consumer wastes time waiting for the pointlessly larger responses (especially painful on mobile devices).  And it’s not like GZIP is some exotic new technology; no matter what stack someone is on, they have access to a GZIP library.

This is one of those things in the world that I’d fix if I had a time machine.  There was very little reason to not require all content be GZIP’d under HTTP, even way back in the 90’s.  Bandwidth has almost always been much more expensive than CPU time.

Initially we tried rejecting all requests without the appropriate Accept-Encoding header, but eventually resorted to always responding with GZIP’d responses, regardless of what the client nominally accepts.  This has to do with some proxies stripping out the Accept-Encoding header, for a variety of (generally terrible) reasons.

I’m unaware of any other API that goes whole hog and requires clients accept compressed responses.  Salesforce.com’s API at least encourages it.
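
For a client whose HTTP stack doesn’t transparently decompress, handling the always-GZIP’d responses looks roughly like this Node/TypeScript sketch (the URL is whatever route you’re calling; zlib is part of the standard library, and most modern clients do this step for you):

import * as https from "node:https";
import * as zlib from "node:zlib";

// Fetch an always-GZIP'd response and decompress it ourselves.
function fetchGzipped(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { headers: { "Accept-Encoding": "gzip" } }, (res) => {
        const gunzip = zlib.createGunzip();
        const chunks: Buffer[] = [];

        res.pipe(gunzip);
        gunzip.on("data", (chunk: Buffer) => chunks.push(chunk));
        gunzip.on("end", () => resolve(Buffer.concat(chunks).toString("utf8")));
        gunzip.on("error", reject);
      })
      .on("error", reject);
  });
}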

Not nearly as complex as SQL can get, but hopefully complex enough for real work.

Finally, we emphasize sorting and filtering to enable complex queries.  Most endpoints accept sort, min, max, fromdate, and todate parameters with which to craft these queries.

For example, getting a quick count of how many of my comments have ever been upvoted on Stack Overflow (38, at time of writing):

/users/80572/comments?sort=votes&min=1&pagesize=0

or all the positively voted Meta Stack Overflow answers the Stack Exchange dev team made in July 2011 (all 5995 of them):

/users/130213;...;91687/answers?sort=votes&min=1&fromdate=1309478400&todate=1312156799

We eventually settled on one configurable sort, which varies by method, plus an always-present “creation filter” as adequately expressive.  Basically, it’s sufficiently constrained that we don’t have to worry (well… not too much anyway) about crippling our databases with absurdly expensive queries, while still being conveniently powerful in a lot of cases.
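
For the curious, the fromdate/todate values above are just Unix timestamps in seconds (UTC).  A quick TypeScript sketch of building that July 2011 window; the parameter names are the API’s, the assembly code is mine.

// Build the creation-filter window for July 2011, in Unix seconds (UTC).
const fromdate = Math.floor(Date.UTC(2011, 6, 1, 0, 0, 0) / 1000);    // 1309478400
const todate = Math.floor(Date.UTC(2011, 6, 31, 23, 59, 59) / 1000);  // 1312156799

const params = new URLSearchParams({
  sort: "votes",
  min: "1",
  fromdate: String(fromdate),
  todate: String(todate),
});

// Appended to, for example, /users/{ids}/answers
console.log(`?${params.toString()}`);
// => ?sort=votes&min=1&fromdate=1309478400&todate=1312156799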

This isn’t to suggest that our API is perfect.

I’ve got a whole series of articles in me about all the mistakes that were made.  Plus there’s 1.1 and the upcoming 2.0 to discuss, both of which aim to address shortcomings in our initial offering.  I plan to cover these in the future, as time allows.