Your Future On Stack Overflow

I recently spent a while working on a pretty fun problem over at Stack Exchange: predicting what tags you’re going to be active answering in.

Confirmed some suspicions, learned some lessons, got about a 10% improvement on answer posting from the homepage  (which I’m choosing to interpret as better surfacing of unanswered questions).

Good times.

Why do we care?

Stack Overflow has had the curious problem of being way too popular for a while now.  So many new questions are asked, new answers posted, and old posts updated that the old “what’s active” homepage would cover maybe the last 10 minutes.  We addressed this years ago by replacing the homepage with the interesting tab, which gives everyone a customized view of stuff to answer.

The interesting algorithm (while kind of magic) has worked pretty well, but the bit where we take your top tags has always seemed a bit sub-par.  Intuitively we know that not all tags are equal in volume or scoring potential, and we also know that activity in one tag is often indicative of activity in related tags, not just that one.

What we’d really like in there is your future, what you’re going to want to answer rather than what you already have.  They’re related, certainly, but not identical.

Stated more formally: what we wanted was an algorithm that, when given a user and their activity on Stack Overflow to date, predicted for each tag what percentage of their future answers would be on questions in that tag.  “Percentage” is a tad misleading since each question on Stack Overflow can have up to five tags, so the percentages don’t sum to anything meaningful.

The immediate use of such an algorithm would be improving the homepage, making the questions shown to you more tailored to your interests and expertise.  With any luck the insights in such an algorithm would let us do similar tailoring elsewhere.

To TL;DR, you can check out what my system thinks it knows about you by going to /users/tag-future/current on any of the older Stack Exchange sites.  The rest of this post is about how I built it, and what I learned doing it.

Unsurprisingly, that’s a pretty good model of how I think about my Stack Overflow participation.

What Do We Know?

A big part of any modeling process is going to be choosing what data to look at.  Cast too wide a net and your iteration time explodes, too narrow and you risk missing some easy gains.  Practicality is also a factor, as data you technically have but never intended to query en masse may lead you to build something you can’t deploy.

What I ended up using is simply the answers on a site (their text, creation dates, and so on), along with the tags the associated questions had when the answer was posted.  This data set has the advantage of being both eminently available (Stack Exchange has, after all, literally been built for the purpose of serving answers) and public knowledge.

At various times I did try using data from the questions themselves and an answerer’s history of asking, but to no avail.  I’m sure there’s more data we could pull in, and probably will over time; though I intend to focus on our public data.  In part this is because the public data is easier to explain and consume, but also because intuitively answerers are making decisions based on what they can see, so it makes sense to focus there first.

Except with about 315,000 different parts.

A Model Of A Stack Exchange

The actual process of deriving a model was throwing a lot of assumptions about how Stack Overflow (and other Stack Exchanges) work against the wall, and seeing what actually matched reality.  Painstaking, time-consuming iteration.  The resulting model does work (confirmed by split testing against the then-current homepage), and backs up with data a lot of things we only knew intuitively.

Some Tags Don’t Matter

It stands to reason that a tag that only occurs once on Stack Overflow is meaningless, and twice is probably just as meaningless.  Which raises the question: when, exactly, does a tag start to matter?  Turns out, before about forty uses a tag on Stack Overflow has no predictive ability, so tags below that threshold aren’t really worth looking at in isolation.

Similarly, a single answer isn’t likely to tell us much about a user; what I’d expect to be significant is a habit of answering within a tag.  How many answers before it matters?  Looks like about three.  My two answers in “windows-desktop-gadgets” say about as much about me as my astrological sign (Pisces, if you’re curious).

Pictured: the state of programming Q&A before Stack Overflow.

Most People Are Average (That’s Why It’s An Average)

What’s being asked on Stack Overflow is a pretty good indicator of what’s being used in the greater programming world, so it stands to reason that a lot of people’s future answering behavior is going to look like the “average user’s” answering behavior.  In fact, I found that the best naive algorithm for predicting a user’s future was taking the site average and then overlaying their personal activity.

Surprisingly, despite the seemingly breakneck speed of change in software, looking at recent history when calculating the site average is a much worse predictor than considering all-time.  Likewise when looking at user history, even for very highly active users, recent activity is a worse predictor than all time.

One interpretation of those results, which I have no additional evidence for, is that you don’t really get worse at things over time; you mostly just learn new things.  That would gel with recent observations about older developers being more skilled than younger ones.

You Transition Into A Tag

As I mentioned above, our best baseline algorithm was predicting the average tags of the site and then plugging in a user’s actual observed history.  An obvious problem with that is that posting a single answer in, say, “java.util.date” could get us predicting 10% of your future answers will be in “java.util.date”, even though you’ll probably never want to touch that tag again.

So again I expected there to be some number of uses of a tag after which your history in it is a better predictor than “site average”.  On Stack Overflow, it takes about nine answers before you’re properly “in” the tag.  Of course there needs to be a transition between “site average” and “your average” between three and nine answers, and I found a linear one works pretty well.
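
To make that transition concrete, here’s a rough sketch in Python; the thresholds and every name in it are illustrative stand-ins, not the production code:

# Blend the site-wide tag average with the user's own history.  Below
# MIN_ANSWERS the user's history in a tag tells us nothing, above
# FULL_ANSWERS it's the better predictor, and in between we interpolate.
MIN_ANSWERS = 3
FULL_ANSWERS = 9

def predict_tag_share(tag, user_tag_counts, site_average):
    answers_in_tag = user_tag_counts.get(tag, 0)
    total_answers = max(sum(user_tag_counts.values()), 1)
    user_share = answers_in_tag / total_answers
    site_share = site_average.get(tag, 0.0)
    if answers_in_tag < MIN_ANSWERS:
        return site_share    # not enough history in the tag: trust the site
    if answers_in_tag >= FULL_ANSWERS:
        return user_share    # plenty of history: trust the user
    weight = (answers_in_tag - MIN_ANSWERS) / (FULL_ANSWERS - MIN_ANSWERS)
    return (1 - weight) * site_share + weight * user_share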

We All Kind Of Look The Same

Intuitively we know there are certain “classes” of users on Stack Overflow, but exactly what those classes are is debatable. Tech stack, FOSS vs MS vs Apple vs Google?  Skill level, Junior vs Senior?  Hobbyist vs Professional?  Front-end vs Back-end vs DB?  And on and on.

Instead of trying to guess those lines in the sand, I went with a different intuition, which was “users who start off similarly will end up similarly”.  So I clustered users based on some N initial answers, then used what I knew about existing users to make predictions for the new users who fall into each cluster.

Turns out you can cut Stack Overflow users into about 440 groups based on about 60 initial tags (or, equivalently, about 30 answers) using some really naive assumptions about minimum distances in Euclidean space.  Eyeballing the clusters, it’s (very approximately) Tech stack + front/back-end that divides users most cleanly.
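
By way of illustration, the assignment step can be as dumb as nearest-centroid over tag-count vectors; this Python sketch (with made-up cluster names and centroids) glosses over how the centroids themselves were derived:

from math import sqrt

def euclidean(a, b):
    # distance between two sparse tag-count vectors, represented as dicts
    tags = set(a) | set(b)
    return sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in tags))

def nearest_cluster(user_vector, centroids):
    # assign a user (tag counts over their first ~30 answers) to the closest
    # pre-computed cluster; predictions for new users are then borrowed from
    # the history of that cluster's members
    return min(centroids, key=lambda cid: euclidean(user_vector, centroids[cid]))

centroids = {
    "front-end": {"javascript": 12, "css": 9, "html": 9},
    "ms-back-end": {"c#": 18, "sql-server": 7, "asp.net": 5},
}
print(nearest_cluster({"javascript": 11, "jquery": 4, "css": 6}, centroids))  # front-end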

We’d also expect chocolate in there.

One Tag Implies Another

Spend any time answering on Stack Overflow and you’ll notice certain tags tend to go together.  Web tech is especially prone to this (“html”, “css”, “javascript”, and “jquery”), but you see it in pairs like “ios” and “objective-c” too.  It stands to reason, then, that answering a few “c#” questions should raise our confidence that you’re going to answer some “linq-to-object” questions.

Testing that assumption, I found that it does, in fact, match reality.  The best approach I found was predicting activity in a tag given activity in commonly co-occurring tags (via a variation on principal component analysis) and making small up or down tweaks to the baseline prediction accordingly.  This approach depends on there being enough data for co-occurrence to be meaningful, which I found to be true for about 12,000 tags on Stack Overflow.
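
A toy version of that adjustment, using plain pairwise co-occurrence rates in place of the PCA variation, might look something like the following; every name, and the linear form of the tweak, is illustrative:

def cooccurrence_boost(predictions, user_tag_counts, cooccurrence, strength=0.1):
    # predictions: baseline per-tag predictions for this user
    # cooccurrence[a][b]: fraction of questions tagged `a` that are also tagged `b`
    # strength: caps how far co-occurrence can push the baseline
    adjusted = dict(predictions)
    total_answers = max(sum(user_tag_counts.values()), 1)
    for tag, baseline in predictions.items():
        signal = sum(
            (count / total_answers) * cooccurrence.get(other, {}).get(tag, 0.0)
            for other, count in user_tag_counts.items()
            if other != tag
        )
        adjusted[tag] = baseline * (1 + strength * signal)
    return adjusted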

Trust Your Instincts

Using the Force is optional.

One pretty painful lesson I learned doing all this is: don’t put your faith in standard machine learning.  It’s very easy to get the impression online (or in survey courses) that rubbing a neural net or a decision forest against your data is guaranteed to produce improvements.  Perhaps this is true if you’ve done nothing “by hand” to attack the problem, or if your problem is perfectly suited to off-the-shelf algorithms, but what I found over and over again is that the truthiness of my gut (and that of my co-workers) beats the one-size-fits-all solutions.  You know rather a lot about your domain; it makes sense to exploit that expertise.

However you also have to realize your instincts aren’t perfect, and be willing to have the data invalidate your gut.  As an example, I spent about a week trying to find a way to roll title words into the predictor to no avail.  TF-IDF, naive co-occurrence, some neural network approaches, and even our home grown tag suggester never quite did well enough; titles were just too noisy with the tools at my disposal.

Get to testing live as fast as you possibly can; you can’t have any real confidence in your model until it’s actually running against live data.  By necessity much evaluation has to be done offline, especially if you’ve got a whole bunch of gut checks to make, but once you think you’ve got a winner start testing.  The biggest gotcha revealed when my predictor went live was that the way I selected training data made for a really bad predictor for low-activity users, effectively shifting everything to the right.  I solved this by training two separate predictors (one for low activity, and one for high).

Finally, as always: solving the hard part is 90% of the work, and solving the easy part is also 90% of the work.  If you’re coming at a problem indirectly like we were, looking to increase answer rates by improving tag predictions, don’t have a ton of faith in your assumptions about the ease of integration.  It turned out that simply replacing observed history with a better prediction in our homepage algorithm broke some of the magic, and it took about twenty attempts to realize gains in spite of the predictor doing what we’d intended.  The winning approach was considering how unusual a user is when compared to their peers, rather than considering them in isolation.

Again, want to see what we think you’ll be active in?  Hit /users/tag-future/current on your Stack Exchange of choice.


Kids These Days

I freely admit about 1/2 the following content is just an excuse to use this title.

Lately I’ve found myself wondering, how are kids getting into programming these days?  I blame these thoughts on still attending the Stack Exchange “beer bashes” (though I think GitHub’s “drinkups” have won the naming battle there) while no longer being able to drink.  Programmers like to talk about programming, and drunks like to talk about the past, so drunk programmers… well, you see where I’m going with this.  Anyway, it seems like the high-level stuff has become more accessible, while the “what is actually happening” bits are increasingly hidden.

Let Me Explain That Last Bit

Every kid who has touched a computer since 2001 has had access to decent javascript, and the mountains of examples view-source entails.  Today Java, C#, Python, Ruby, etc. are all freely available, along with plenty of tutorials and example code.  That’s a hell of a lot more than what was out there when I started, but it’s all a few dozen levels of abstraction up.  And that’s my concern: the bare metal is really deep down, and stumbling upon it is quite hard.

When I Was Your Age…

I first started coding on a TI-99/4A, in TI-BASIC, at the age of 5.  That’s a machine you can fit in your head.  The processor spec is ~40 pages long (the errata for the Core 2 Duo is almost 100 pages), and there’s none of the modern magic, like pipe-lining, branch prediction, or pre-fetching (there’s not even an on-chip cache!).

Now of course, I didn’t actually understand the whole machine at 5 but the people teaching me did.  Heck, the only reason a machine discontinued in ’84 was available to me in ’92 was because they’d worked on the thing and saved a couple.

A grounding in such a simple machine led me to question a lot about the higher-level languages I eventually learned.  For example, when I first started learning Java the idea that two functions with the same name could exist just messed with my head, because it didn’t gel with my previous experience, driving me to read up on namespacing and vtables.

Never did figure out how to catch Mew though.

Looking back, another avenue for programming knowledge was how horribly glitchy video games were.  Actually, video games are still horribly glitchy, but the consoles are a lot more sophisticated now.  It used to be that when a game went sideways the consoles just did not care, and you could wreak all sorts of havoc exploiting glitches.  I can honestly say I “got” pointers/indirection (though I don’t remember if I knew the words or not) when I wrapped my head around the infamous Cinnabar Island glitch.  Some of these sorts of glitches can still happen on modern consoles, but there’s proper memory protection and some spare cycles to spend on validation now, so they’re a lot rarer.

If you shift my story forward to the present day, a kid learning on 8 year old hardware would be using a PC running Windows XP (probably with a Pentium 4 in it); a simple machine that is not.  Their consoles have hypervisors, proper operating systems, and patches!  If you play Pokémon today you actually only have one masterball.  The most archaic kit they’re likely to encounter is a TI-83 calculator (which still has a Z80 in it), and even then not till high school.

Lies To Children

Which has some serious issues on the Surface RT; fix is pending approval in the app store.

Thinking about this led me to write Code Warriors, a dinky little coding puzzle app, as my “see if XAML sucks any less than it used to” side project (the answer is yes, but the REPL is still bad).  Even it is a lies-to-children version of assembly, making significant compromises in the name of “fun”.  And while I was thinking about kids learning, I certainly didn’t play test it with any children so… yeah, probably not great for kids.

I have no doubts we’re going to keep producing great programmers.  Colleges, the demands of the job market, and random people on the internet will produce employable ones at least.  But I find myself wondering if we aren’t growing past the point where you can slip into programming as if by accident.  It is after all quite a young field, kids in college when ENIAC was completed are still alive, so perhaps it’s just a natural progression.


Trustable Online Community Elections

Keeping votes secure for a couple thousand years, and counting.

One of the hallmarks of successful online communities is self-governance, typically by having some or all admins/moderators/what-have-yous come from “within” the community.  Many communities select such users through a formal election process.  Wikipedia and Stack Exchange (naturally) are two large examples; you’ve probably stumbled across a number of forums with similar conventions.

But this is the internet: anonymity, sock-puppetry, and paranoia run wild.

How can we prove online elections are on the up-and-up?

First, a disclaimer.  Today the solution is always some variation of reputation; sock puppetry is mitigated by the difficulty inherent in creating many “trusted” users, and the talliers are assumed to be acting in good faith (they, after all, are rarely anonymous and stand to lose a lot if caught).  This works pretty well, the status quo is good; what I’m going to discuss is interesting (I hope), but hardly a pressingly needed change.

Also, this is just me thinking aloud; this doesn’t reflect any planned changes over at Stack Exchange.

Modeling the threat

We need to define what we’re actually worried about (so we can proof our solution against it), and what we explicitly don’t care about.

Legitimate concerns:

  • Ballot box stuffing
  • Vote tampering
  • Vote miscounting

Explicitly ignoring:

  • Voter impersonation
  • Voter intimidation

The concerns are self-explanatory: we just want to make sure only people who should vote do vote, that their votes are recorded correctly, and that they are ultimately tallied correctly.

Attacks which we’re explicitly ignoring require a bit more explanation.  They all ultimately descend from the fact that these elections are occurring online, where identity is weak and the stakes are relatively low.

It is hard to prove who’s on the other end of the internet.

Consider how difficult it is to prove that Jimmy Wales is actually the one tweeting from his account, or actually the one editing a Wikipedia article.  Since identity is so much weaker, we’re assuming whatever authentication a community is using for non-election tasks is adequate for elections.

We discount voter intimidation because, for practically all online communities, it’s impractical and unprofitable.  If you were to choose two Stack Overflow users at random, it’s 50/50 whether they’re even on the same side of the Atlantic much less within “driving to their home with a lead pipe”-distance.  Furthermore, the positions being run for aren’t of the high-paying variety.  Essentially, the risk and difficulty grossly outweigh the rewards so I feel confident in discounting the threat.

The Scheme

Before we get too deep into the technical details, I need to credit VoteBox (Sandler, Wallach) for quite a lot of these ideas.

Voter Rolls

To protect against ballot stuffing, it’s important to have a codified list of who can participate in a given election.  This is a really basic idea, you have to register to vote (directly or otherwise) in practically all jurisdictions.  But in the world of software it’s awfully tempting to do a quick “User.CanVote()” check at time of ballot casting, which makes it impossible to detect ballot box stuffing without consulting a historical record of everything that affects eligibility (which itself is almost impossible to guarantee hasn’t been tampered with).

Therefore, prior to an election all eligible voters should be published (for verification).  Once the election starts, this list must remain unchanged.

It’s A Challenge

Vote tampering (by the software, typically) is difficult to proof against, but there are some good techniques.  First, all ballots must actually be stored as ballots; you can’t just “Candidate++” in code somewhere, and normalized values in a database are no good to anyone; you need the full history for audits.  Again, this is common sense in real-world elections but needs to be explicit for online ones.

This does make voting a little more confusing.

A more sophisticated approach is to make individual ballots “challenge-able”.  The idea is simple: break up “casting a ballot” into two parts, “stage” and “commit/challenge”.  When a ballot is staged, you publish it somewhere public (this implies some cryptography magic, which we’ll get back to later).  If the voter then “commits” their vote you make a note of that (again publicly), but if they “challenge” it you reveal the contents of the ballot (this implicitly spoils the ballot, and some more crypto-magic) for verification.  Obviously, if what the voter entered doesn’t match what the ballot contained something is amiss.

Challenge-able ballots make it possible for a group of voters to keep the system honest: since there’s no way for the system to know who will challenge a ballot, any vote tampering carries a risk of being discovered.  And, the internet being the internet, news that something is up will get around quickly once tampered-with ballots start being found, making the odds of a challenge even greater.

Freedom From Hash Chains

For challenges to work there needs to be a method for publishing “committed” ballots.  We could use something low-tech like an RSS feed, mailing list, or similar but ideally we’d have some guarantee that the feed itself isn’t being tampered with.  Consider the attack where some unchallenged ballots are removed from the feed long after they were published, or new ones are added as if they were cast in the past.

Our go-to solution for that is pushing each ballot (as well as some “keep alive” messages, for security reasons) into a modified hash chain.  The idea is simple: every new message contains a hash of the message before it.  Verifying no illicit insertions or deletions is just a matter of rehashing the chain one piece at a time and making sure everything matches.  You can also take a single message and verify that it still belongs in the chain at a later date, meaning that people who want to verify the chain don’t have to capture every message the instant it’s published.
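
The chain itself is only a few lines of code; here’s a minimal Python sketch (SHA-256 purely for convenience, hash-function choice is discussed below, and the fixed “root” placeholder is addressed in the next section):

import hashlib

def chain_append(chain, payload):
    # each entry stores the hash of the entry before it alongside its payload
    prev_hash = chain[-1]["hash"] if chain else "root"
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"prev": prev_hash, "payload": payload, "hash": entry_hash})

def chain_verify(chain):
    # re-hash every entry; any insertion, deletion, or alteration breaks the chain
    prev_hash = "root"
    for entry in chain:
        expected = hashlib.sha256((prev_hash + entry["payload"]).encode()).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

chain = []
chain_append(chain, "ballot: <encrypted ballot goes here>")
chain_append(chain, "keep-alive")
assert chain_verify(chain)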

Note that the VoteBox system goes a step further and creates a “hash mesh”, wherein multiple voting machines hash each other’s messages, to detect tampering done from a single machine.  In our model all the machines are under the direct control of a single entity and voters lack physical access to the hardware, so the additional security of a “hash mesh” isn’t really attainable.

Time Is Not On Your Side

One subtle attack on this hash chaining scheme: if you’re sufficiently motivated you can pre-compute chains.  While far from a sure thing, with sufficient analytic data to predict about when users will vote, how likely they are to vote, challenge, and so on, you could run a large number of fake elections “offline” and then try to get away with publishing the one that deviates the least (in terms of time and challenged ballots) from reality.  With the same information and considerably more resources you could calculate fake elections on the fly, starting from the currently accepted reality, and try to steer things that way.  Since the entities involved typically have lots of computing resources, and elastic computing is widely available, we have to mitigate this admittedly unlikely, but not impossible, attack.

The problem in both cases is time.  An attacker has a lot of it before an election (when the voter roll may not be known, but can probably be guessed) and a fair amount of it during one.

Any independent and verifiable random source will do.

We can completely eliminate the time before an election by using a random root message in the hash chain.  Naturally we can’t trust the system to give us a randomly generated number, but if they commit to using something public and uncontrollable as the root message then all’s well.  An example would be to use the front page headlines of some national newspapers (say the New York Times, USA Today, and the Wall Street Journal; in that order) from the day the election starts as the root message.  Since the root block is essentially impossible to know in advance, the aforementioned analytic data no longer makes faking a hash chain any easier.

To make attacking the hash chain during the election difficult we could introduce a random source for keep-alive payloads, like Twitter messages from a Bieber search.  More important would be selecting a hash function that is both memory and time hard, such as scrypt.  Either is probably sufficient, but scrypt (or similar) would be preferable to Bieber if you had to choose.
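
For what it’s worth, Python exposes scrypt through hashlib (given a recent enough OpenSSL), so making the sketch above memory and time hard is a one-line swap; the salt and cost parameters here are illustrative only:

import hashlib

def hard_hash(prev_hash, payload):
    # scrypt's n/r/p parameters trade memory and CPU cost against verification
    # speed; tune them so pre-computing fake chains is expensive
    return hashlib.scrypt((prev_hash + payload).encode(),
                          salt=b"election-id-goes-here", n=2**14, r=8, p=1).hex()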

Tallying Bananas

Thus far we’ve mostly concerned ourselves with individual ballots being tampered with, but what about the final results?  With what’s been discussed thus far, it’s still entirely possible that everyone votes, challenges, and audits to their hearts’ content and at the end of the election totally bogus results are announced.

The problem is that the ballots that count are necessarily opaque (clearly they have to be encrypted, even if we are publishing them); only the challenged ones have been revealed.

There do exist encryption schemes that can work around this.  Namely, any homomorphically additive encryption scheme; in other words, any encryption function E(x) for which there exists a function F(x1, x2) such that F(E(a), E(b)) = E(a+b).  Two such schemes are (a modified) ElGamal and Paillier (others exist); the ephemeral keys present in ElGamal are useful for our ballot challenge scheme.

Encrypting ballots with ElGamal and publishing them allows anyone to tally the results of the election.  The only constraint imposed by this system is that the voting system being used must be amenable to its ballots being represented by tuples of 1s and 0s; not a big hurdle, but something to be aware of.
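
To make the homomorphic property concrete, here’s a toy “exponential ElGamal” in Python.  The group is laughably small and the final discrete log is brute-forced, so this demonstrates F(E(a), E(b)) = E(a+b) and nothing more; it is a sketch, not usable crypto:

import random

p, q, g = 467, 233, 4       # tiny safe prime p = 2q + 1; g generates the order-q subgroup
x = random.randrange(1, q)  # private key, held by whoever runs the election
h = pow(g, x, p)            # public key

def encrypt(m):
    # encrypt a small integer m (a 0/1 ballot slot) as (g^r, g^m * h^r)
    r = random.randrange(1, q)  # the ephemeral key revealed when a ballot is challenged
    return (pow(g, r, p), (pow(g, m, p) * pow(h, r, p)) % p)

def add(c1, c2):
    # homomorphic addition: component-wise multiplication of ciphertexts
    return ((c1[0] * c2[0]) % p, (c1[1] * c2[1]) % p)

def decrypt(c):
    # recover the tally by brute-forcing the (tiny) discrete log of g^m
    gm = (c[1] * pow(c[0], p - 1 - x, p)) % p
    for m in range(q):
        if pow(g, m, p) == gm:
            return m

ballots = [encrypt(v) for v in (1, 0, 1, 1, 0)]  # five voters, one "race"
tally = ballots[0]
for b in ballots[1:]:
    tally = add(tally, b)
print(decrypt(tally))  # 3, without ever decrypting an individual ballot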

This attack is like ballot stuffing with a single very large ballot.

Another subtle attack: since ballots are opaque, nothing prevents a ballot from containing something other than 0 or 1 for a slot in a “race” (or all 1s, or any other invalid combination).  Some more crypto-magic in the form of zero-knowledge proofs (specifically non-interactive ones) attached to a ballot for every “race” it contains gives us confidence (technically not 100%, but we can get arbitrarily close to it) that every ballot is well-formed.  The exact construction is complicated (and a bit beyond me, honestly), but the Adder system [Kiayias, Korman, Walluck] provides a reference implementation for ElGamal (adapted into VoteBox; the original Adder source seems to have disappeared).

There’s one last elephant in the room with this system.  Although everyone can tally the votes, and everyone can verify each individual vote is well-formed, those running the election are still the only ones who can decrypt the final results (they’re the only ones who have the ElGamal private key).  There honestly isn’t a fool-proof solution to this that I’m aware of, as revealing the private key allows for individual ballots to be decrypted in addition to the tally.  Depending on the community this may be acceptable (i.e. ballot privacy may only matter during the election), but that is more than likely not the case.  If there is a trusted third party, they could hold the key in escrow for the duration of the election and then verify and decrypt the tally.  In the absence of a single third party, a threshold scheme for revealing the private key may be acceptable.

Altogether Now

In short, the proposed system works like so:

  • Prior to the election starting, a full roll of eligible voters is published
  • When the election starts, a hash chain is started with an agreed upon (but random) first message
    • The hash chain uses a memory and time hard hash function
  • A public/private key pair for a homomorphically additive encryption system is stored in an agreed upon manner
  • Optionally, throughout the election agreed upon (but random) messages are added to the hash chain as keep-alives

When a user votes:

  • Their selections are encrypted using the previously generated public key
  • A NIZK is attached to the ballot to prove its well-formedness
  • An identifying token is also attached to the ballot to check against the voter roll
  • All this data is pushed to the hash chain as a single message
  • The voter then chooses to either commit or challenge their ballot
    • If they commit, a message is added to the hash chain to this effect
    • If they challenge, the ephemeral key for the ballot is published; allowing any third-party to decrypt the ballot, letting the voter prove the election is being run in good faith (they may then cast another ballot)

When the election ends:

  • Some pre-agreed upon message is placed in the hash chain
    • It’s not important this message be random, it’s just a signal that those watching can stop as the chain is finished
  • Anyone who wishes to can verify the integrity of the published hash chain, along with the integrity of individual ballots
  • Anyone who wishes to may tally the votes by exploiting the homomorphic properties of the chosen encryption scheme
  • Whichever approach to decrypting the tally has been chosen is undertaken, concluding the election

Improvements

The obvious place to focus on improving is the final decryption.  My gut (with all the weight of a Bachelor’s in Computer Science, so not a lot of weight) tells me there should be a way to construct a proof that the announced tally actually comes from decrypting the summed ciphertexts.  Or perhaps there’s some way to exploit the one-to-many nature of ElGamal encryption to provide (in advance of the election naturally) a function that will transform the tally ciphertext into one a different key can decrypt, but have that transformation not function on the individual ballots.

Of course, as I stated before, the status quo is acceptable.  It’s just fun to think of what we could build to deal with the theoretical problems.


Are you measuring your social media buttons?

I’d like you to go to a Stack Overflow (or any Stack Exchange) question, and see if you spot a nominally major change.  Such as this one, with a nice answer from Raymond Chen (I think it’s been more than a month since I recommended his blog; seriously, go read it).

Spot it?

Of course not, what we did is remove our social media sharing buttons.

Left is old, right is new.  A number of other instances were also removed.

The Process

I’m a big advocate of arguing from data, and while I freely admit there are many things that are hard to quantify, your gains from social media buttons aren’t one of them.

When discussing sharing buttons, be careful about establishing scope.  Keeping or removing social media buttons isn’t about privacy, personal feelings on Twitter/Facebook/G+, or “what every other site does” (an argument I am particularly annoyed by); it’s about referred traffic and people.  There’s a strong temptation to wander into debate, feeling, and anecdotal territory; stick to the data.

You also need to gather lots of data; at Stack Exchange we prefer to slap a pointless query parameter onto urls we’re tracking.  It’s low tech, transparent, and doesn’t have any of the user-facing penalties redirects have.  In particular we slapped ?sfb=1, ?stw=1, and ?sgp=1 onto links shared via the appropriate sharing buttons.  Determining click-throughs is just a matter of querying traffic logs with this approach.  We’ve been gathering data on social media buttons in particular for about four months.

Note that we’re implicitly saying that links that are shared that nobody clicks don’t count.  I don’t think this is contentious, spamming friends and followers (and circlers(?), whatever the G+ equivalent is) is a bit on the evil side. Somebody has to vindicate a share by clicking on it, otherwise we call it a waste.

This sort of nonsense isn’t worth it.

With all this preparation we can actually ask some interesting questions; for Stack Exchange we’re interested in how much traffic is coming from social media (this is an easy one), and how many sharers do so through a button.

Traffic gave us no surprises: we get almost nothing from social media.  Part of this is probably domain specific, my friends really don’t care about my knowledge of the Windows Registry or the Diablo III Auction House.  The sheer quantity of traffic coming from Google also drowns out any other source; it’s hard to get excited about tweaking social media buttons to bring in a few thousand extra views when tiny SEO changes can sling around hundreds of thousands.  To put the difference in scale in perspective, Google Search sends 3 orders of magnitude more traffic our way than the next highest referrer, which itself sends more than twice what Twitter does (and Twitter is the highest “social” referrer).

Now that I’ve established a (rather underwhelming) upper bound for how much our share buttons were getting us, I need to look at what portion can be ascribed to the share buttons versus people just copy/pasting.  This is where those tracking parameters come in; presumably no one is going to add a random query parameter to urls they’re copy/pasting, after all.

You basically want to run a query like this:

SELECT *
FROM TrafficLogs
WHERE
-- everybody tries to set Referer so you know you got some juice from them
(RefererHost = 't.co' OR RefererHost = 'plus.url.google.com' OR RefererHost = 'www.facebook.com')
AND
-- Everybody scrapes links that are shared, don't count those they aren't people
UserAgent NOT LIKE '%bot%'
AND
-- Just the share buttons, ma'am
(Query LIKE '%stw=1%' OR Query LIKE '%sfb=1%' OR Query LIKE '%sgp=1%')

Adapt to your own data store, constrain by appropriate dates, etc. etc.  But you get the idea.  Exclude the final OR clause and the same query gives you all social media referred traffic to compare against.

What we found is that about 90% of everyone who does share, does so by copy/pasting.  Less than 10% of users make use of the share buttons even if they’ve already set out to share.  Such a low percentage of an already pretty inconsequential number doesn’t bode well for these buttons.

One final bit of investigation that’s a bit particular to Stack Exchange is figuring out the odds of a user sharing their newly created post.  We did this because while we always show the question sharing links to everyone, the answer sharing links are only shown to their owner and only for a short time after creation.  Same data, a little more complicated queries, and the answer comes out to ~0.4%.  Four out of every thousand new posts* on Stack Overflow get shared via a social media button (the ratio seems to hold constant for the rest of Stack Exchange, but there’s a lot less data to work with so my confidence is lower on them).

Data Shows They’re No Good, Now What?

Change for change’s sake is bad, right?  Benign things shouldn’t be moved around just because, but for social media buttons…

We know our users don’t like them, they’re sleazy (though we were very careful to avoid any of the tracking gotchas of +1 or Likes), and they’re cluttering some of our most important controls (six controls in a column on every question is a bit much, but to be effective [in theory] these buttons have to be prominent).

These buttons start with a lot of downsides, and the data shows that for us they don’t have much in the way of upsides.  So we could either trudge along telling ourselves “viral marketing” and “new media” (plus we’ve already got them, right? why throw them away?), or admit that these buttons aren’t cutting it and remove them.

So, are you measuring the impact of your site’s social media buttons?

You’re probably not in the same space as Stack Exchange, your users may behave differently, but you might be surprised at what you find out when you crunch the numbers.

*By coincidence this number seems to be about the same for questions and answers; it looks like more people seeing question share buttons balances out people being less enthusiastic about sharing them.


Stack Exchange API V2.0: Throttling

Here’s a post I lost while writing the rest of the Stack Exchange API V2.0 series; it’s been restored (i.e. I hit the publish button) in glorious TechniColor.

Every API has some limit in the number of requests it can serve.  Practically, your servers will eventually fall down under load.  Pragmatically, you’ll want to limit access to keep your servers from falling down under load.  Rationing access to an API is a tricky business; you have to balance technical, business, and philosophical concerns.

History

In early versions of the Stack Exchange API we had two distinct throttles, a “simultaneous requests” throttle that would temporarily ban any IP making more than 30 requests per second and a “daily quota” throttle that would drop any requests over a certain threshold (typically 10,000) per day.

In V2.0, the simultaneous request throttle remains, and unauthenticated applications (those making requests without access tokens) are subject to the same IP-based daily quota that was in V1.x.  We’ve also added two new classes of throttles for authenticated applications: a per-app-user-pair quota and a per-user quota.

Function

When an authenticated app makes a request it is counted against a per-user-app-pair daily quota (notably not dependent upon IP address) and a per-user daily quota.  The user-app pair numbers are returned on each request (in fields on the common wrapper object), but the per-user daily value is hidden.

The authenticated application cases are a bit complicated, so here’s a concrete example:

A user is actively using 6 applications, all of which are making authenticated requests.  The user’s daily quota is 50,000 and each application has a daily quota of 10,000 (these numbers are typical).  Every time an application makes a request it receives its remaining quota with the response, a number which decreases by one with each request; if application #1 makes its 600th request it will receive “9,400” back, and if application #2 simultaneously makes its 9,000th request it will get “1,000” back.  If an application exceeds its personal quota (with its 10,001st request) its requests will begin being denied, but every other application will continue to make requests as normal.  However, if the sum of the requests made by all applications exceeds the user’s quota of 50,000 then all requests will be denied.
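
A rough Python sketch of how the two quotas combine (the counters, names, and return shapes here are illustrative, not the API’s actual bookkeeping):

USER_DAILY_QUOTA = 50000
APP_USER_DAILY_QUOTA = 10000

def handle_request(user_id, app_id, usage):
    # `usage` maps a user id, or a (user id, app id) pair, to requests made today
    per_user = usage.get(user_id, 0)
    per_pair = usage.get((user_id, app_id), 0)
    if per_user >= USER_DAILY_QUOTA or per_pair >= APP_USER_DAILY_QUOTA:
        return {"error": "throttle_violation"}  # request denied
    usage[user_id] = per_user + 1
    usage[(user_id, app_id)] = per_pair + 1
    # only the app-user-pair remainder is reported back with the response
    return {"quota_remaining": APP_USER_DAILY_QUOTA - usage[(user_id, app_id)]}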

There are good reasons to throttle.

Rationale

The simultaneous request throttle is purely technically driven, it helps protect the network from DOS attacks.  This throttle is enforced much higher in the stack than the others (affected requests never even hit our web tier), and is also much “ruder”.  In a perfect world we’d return well formed errors (as we do elsewhere in the API) when this is encountered, but practically speaking it’s much simpler to just let HAProxy respond however it pleases.

Our daily quotas are more philosophically tinted.  We still need to limit access to a scarce resource (our servers’ CPU time) and we don’t want any one application to monopolize the API to the detriment of others (or of direct users of the Stack Exchange sites).  A simple quota system addresses these concerns sufficiently in my opinion.  It’s also “fair”, since all you have to do is register on Stack Apps, after which you have access to all of our CC-wiki’d data.  The quota of 10,000 was chosen because it allows an app to request (100 at a time) one million questions a day via the API; it’s a nice round number, and is also fairly “absurd”, giving us some confidence that most applications can function within the quota.

The per-app-user-pair quota starts pulling some business sense into the discussion.  In essence, IP-based throttles fall apart in the face of computing as a service (Google App Engine, Azure, Amazon Web Services, and others).  While we were aware of this flaw in V1.0, we had no recourse at the time (lacking the resources to add authentication to the API and still ship in a timely manner).  By excluding services like App Engine from our API consumers we were hamstringing them, which made improving our throttling story a high-ticket item for API V2.0.  As a rule of thumb, if a feature makes life easier on the consumers of your API you should seriously investigate it; after all, an API is not in and of itself interesting, the applications built on top of it are.

With the per-app-user-pair quota alone we would still have to worry about DOS attacks, thus the per-user quota.  One wrinkle is that we do not report the state of this quota to applications.  I considered reporting it, but decided that it was something of a privacy leak and not of interest to most applications; after all, there’s no way for one application to control another.


Stack Exchange API V2.0: The Stable Future

This is the last of my planned series about v2.0 of the Stack Exchange API (the contest is still going on, and we froze the interface a few weeks ago), and I’d like to talk about some regrets.  Not another mistakes post (API V2.0 is still too young to judge with the benefit of hindsight), but something in a similar vein.

This is thankfully a pretty short post, as I’ve worked pretty hard to reduce the regrettable things in every iteration of our API.

The need for breaking changes

I definitely tried to break as little as possible, compare the question object in V1.1 to that in V2.0 to see how little changed.  However, we ate two big breaks: restructuring the common response wrapper, and moving the API from per-site urls (api.stackoverflow.com, api.serverfault.com, etc.) to a single shared domain (api.stackexchange.com).

These were definitely needed, and I’m certain the benefits they give us will quickly pay off the transition pain; but still, breaking changes impose work on consumers.  When building an API it’s important never to make breaking changes wantonly; you’re making a non-trivial imposition on other developers’ time.

We couldn’t do everything

I’ve already got a list of things I want in V2.1, unfortunately some of them were things I’d like to have gotten into V2.0.  A truism of development is that there’s always more to be done, real developers ship, and so on.  API development is no different, but I’m hoping to see an increase in new features and a decrease in time between API revisions as Stack Exchange matures.

We should be much stabler now

Naturally, since I’m aware of these problems we’ve taken steps to make them less of an issue.

I’m very satisfied with the current request wrapper, and the filter feature (plus our backwards compatibility tooling around it) makes it much easier to make additions to it without breaking existing consumers.  In general, filters allow for lots of forward compatibility guarantees.

We’ll never be able to do everything, but all sorts of internal gotchas that made V1.1 really rough to roll out have been fixed as part of V2.0.  Subsequent minor revisions to V2.0 shouldn’t be nearly as rough, and with our growing staff, developer resources should be comparatively plentiful.


Stack Exchange API V2.0: Consistency

If you take a look at the changelog for Stack Exchange API V2.0 (go check it out, you could win a prize), you’ll see we did more than just add new methods.  We also made a considerable number of changes which, taken as a whole, aim to improve the consistency of the API’s interface.

Broadly speaking, good APIs are consistent APIs.  Every time your API surprises a developer, or forces them to re-read the documentation, you’ve lost a small but important battle.  As I’ve said before, an API isn’t interesting in and of itself, the applications built on top of it are; and accordingly, any time wasted wrestling with the rough edges of your API is time not spent making those applications even more interesting.

Addressing Past Mistakes

This is an opinion I held during V1.0 as well, but we’re hardly perfect and some oddities slipped through.  Fields were inconsistently HTML encoded/escaped, with question bodies and titles needing completely different treatment by an application.  There were ways to exclude some, but not all, expensive-to-fetch fields on some, but not many, methods.  There were odd naming patterns; for example, /questions/{ids} expected question ids and /comments/{ids} expected comment ids, but /revisions/{ids} expected post ids.  Return types could be unexpected: /questions returned questions, as did /questions/{ids}, but while /badges returned badges, /badges/{ids} returned users (those who had earned the badges in question).

We’ve addressed all of these faults, and more, in V2.0.  Safety addresses the encoding issues, while Filters address the excluding of fields.  /revisions/{ids} was renamed to /posts/{ids}/revisions, while another odd method /revisions/{post ids}/{revision id} became the new /revisions/{revision ids}.  And all methods that exist in /method, /method/{ids} pairs return the same types.

There are some ways we’re perhaps… oddly consistent, that I feel merit mention.  All returns use the same “common wrapper object”, even errors; this lets developers use common de-serialization code paths for all returns.  Relatedly, all methods return arrays of results even if they can reasonably be expected to always return a single item (such as /info); as with the common wrapper, this allows for sharing de-serialization code.  We return quota fields as data in the response body (other APIs, like Twitter, often place them in headers), so developers can handle them in the same manner as they would any other numeric data.
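
The practical upshot is that a consumer really can funnel every response through a single code path; a hedged Python sketch (the wrapper fields named here, items, quota_remaining, error_id, and so on, are as I understand the V2.0 wrapper, so check the current documentation before leaning on them):

import json

def parse_response(body):
    wrapper = json.loads(body)
    # errors come back in the same wrapper, so one de-serialization path suffices
    if "error_id" in wrapper:
        raise RuntimeError("%s: %s" % (wrapper["error_id"], wrapper.get("error_message")))
    # every method returns an array of items, even "singular" ones like /info
    return wrapper.get("items", []), wrapper.get("quota_remaining")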

Identifying this game would make an interesting "age check" form.

Pitfalls

I would caution against what I’ve taken to calling “false consistency”, where an API either implies a pattern that doesn’t exist or sacrifices more important features in the name of consistency.

By way of example: the Stack Exchange API has a number of optional sorts on many methods, many of which are held in common (a great many methods have an activity sort, for instance).  The crux of the consistency objection to these sorts is that they oftentimes apply to the same field (the creation sort applies to question, answer, and comment creation_date fields, for example) but are named differently from those fields; some argue they should have the same name.

However, this would be a false consistency.  Consider: if we just changed the names of the existing sorts, we would then be implying that all fields can be sorted on, which isn’t true.  If we did make that possible, we’d have sacrificed performance (and probably some of our SQL servers) in pursuit of consistency.  In this case the cure would be worse than the disease, though I would argue our sorts are consistent with regard to each other, if not with regard to the fields they operate on.


Stack Exchange API V2.0: Safety

Every method in version 2.0 of the Stack Exchange API (no longer in beta, but an app contest continues) can return either “safe” or “unsafe” results.  This is a rather odd concept that deserves some explanation.

Semantics

Safe results are those for which every field on every object returned can be inlined into HTML without concern for script injections or other “malicious” content.  Unsafe results are those for which this is not the case; however, that is not to imply that all fields are capable of containing malicious content in an unsafe return.

An application indicates whether it wants safe or unsafe results via the filter it passes along with the request.  Since most requests should have some filter passed, it seemed reasonable to bundle safety into them.

To the best of my knowledge, this is novel in an API.  It’s not exactly something that we set out to add, however; there’s a strong historical component to its inclusion.

Rationale

In version 1.0 of the Stack Exchange API every result is what we would call unsafe in version 2.0.  We were simply returning the data stored in our database, without much concern for encoding or escaping.  This led to cases where, for example, question bodies were safe to inline (and in fact, escaping them would be an error) but question titles required encoding lest an application open itself to script injections.  This behavior caused difficulties for a number of third-party developers, and bit us internally a few times as well; which is why I resolved to do something to address it in version 2.0.

I feel that a great strength in any API is consistency, and thus the problem was not that you had to encode data; it was that you didn’t always have to.  We had two options: we could either make all fields “high fidelity” by returning the most original data we had (i.e. exactly what users had entered, before tag balancing, entity encoding, and so on), or we could make all fields “safe” by making sure the same sanitization code ran against them regardless of how they were stored in the database.

Unfortunately, it was not to be.

Strictly speaking, I would have preferred to go with the “high fidelity” option but another consideration forced the “safe” one.  The vast majority of the content in the Stack Exchange network is submitted in markdown, which isn’t really standardized (and we’ve added a number of our own extensions).  Returning the highest fidelity data would be making the assumption that all of our consumers could render our particular markdown variant.  While we have open sourced most of our markdown implementation, even with that we’d be restricting potential consumers to just those built on .NET.  Thus the “high fidelity” option wasn’t just difficult to pursue, it was effectively impossible.

Given this train of thought, I was originally going to just make all returns “safe”, period.  I realize in hindsight that this would have been a pretty grave mistake (thankfully, some of the developers at Stack Exchange, and some from the community, talked me out of it).  I think the parallel with C#’s unsafe keyword is a good one: sometimes you need dangerous things, and you put a big “I know what I’m doing” signal right there when you do.  This parallel ultimately granted the options their names; properly escaped and inline-able returns are “safe”, and those that are pulled straight out of the database are “unsafe”.

Adding a notion of “safety” allows us to be very consistent in how data we return should be treated.  It also allows developers to ignore encoding issues if they so desire, at least in the web app case.  Of course, if they’re willing to handle encoding themselves they also have the option.


Stack Exchange API V2.0: No Write Access

Version 2.0 of the Stack Exchange API (still in beta, with an app contest) introduced authentication, but unlike most other APIs did not introduce write access at the same time.  Write access is one of the most common feature requests we get for the API, so I feel it’s important to explain that this was very much a conscious decision, not one driven solely by the scarcity of development resources.

Why?

The “secret sauce” of the Stack Exchange network is the quality of the content found on the sites.  When compared to competing platforms you’re much more likely to find quality existing answers to your questions, to find interesting and coherent questions to answer, and to quickly get good answers to new questions you post.  Every change we make or feature we add is considered in terms of either preserving or improving quality, and the Stack Exchange API has been no different.  When viewed under the lens of quality, a few observations can be made about the API.

Screw up authentication, and you screw up write access

Write access presupposes authentication, and any flaw in authentication is going to negatively impact the use of write access accordingly.  Authentication, being a new 2.0 feature, is essentially untested, unverified, and above all untrusted in the current Stack Exchange API.  By pushing authentication out in its own release we’re starting to build some confidence in our implementation, allowing us to be less nervous about write access in the future.

Screw up write access, and you screw up quality

The worst possible outcome of adding write access to the Stack Exchange API is a flood of terrible questions and answers on our sites.  As such, the design of our API will need to actively discourage such an outcome; we can’t get away with a simple “POST /questions/create”.  However, a number of other APIs can get away with very simple write methods, and it’s important to understand how they differ from the Stack Exchange model and so need not be as concerned (beyond not having nearly our quality in the first place).

The biggest observation is that every Stack Exchange site is, in a sense, a single “well” in danger of being poisoned.  Every Stack Exchange site user sees the same Newest Questions page, the same User page, and (with the exception of Stack Overflow) the same Home page.  Compare with social APIs (ie. Facebook and Twitter, where everything is sharded around the user), or service APIs (like Twilio, which doesn’t really have common content to show users); there are lots of “wells”, none of which are crucial to protect.

Write access requires more than just writing

It’s easy to forget just how much a Stack Exchange site does to encourage proper question asking behavior in users.

I've circled the parts of the Ask page that don't offer the user guidance

A write API for Stack Exchange will need to make all of this guidance available for developers to integrate into their applications, as well as find ways to incentivize that integration.  We also have several automated quality checks against new posts, and a plethora of other rejection causes; all of which need to be conscientiously reported by the API (without creating oracles for bypassing those checks).

Ultimately the combination of: wanting authentication to be independently battle tested, the need to really focus on getting write access correct, and the scarcity of development resources caused by other work also slated for V2.0 caused write access to be deferred to a subsequent release.


Stack Exchange API V2.0: JS Auth Library

In a previous article I discussed why we went with OAuth 2.0 for authentication in V2.0 of the Stack Exchange API (beta and contest currently underway), and very shortly after we released a simple javascript library to automate the whole affair (currently also considered “in beta”; report any issues on Stack Apps).  The motivations for creating this are, I feel, non-obvious, as is why it’s built the way it is.

Motivations

I’m a strong believer in simple APIs.  Every time a developer using your API has to struggle with a concept or move outside their comfort zone, your design has failed in some small way.  When you look at the Stack Exchange API V2.0, the standout “weird” thing is authentication.  Every other function in the system is a simple GET (well, there is one POST with /filters/create), has no notion of state, returns JSON, and so on.  OAuth 2.0 requires user redirects, obviously has some notion of state, has different flows, and is passing data around on query strings or in hashes.

It follows that, in pursuit of overall simplicity, it’s worthwhile to focus on simplifying consumers using our authentication flows.  The question then becomes “what can we do to simplify authentication?”, with an eye towards doing as much good as possible with our limited resources.  The rationale for a javascript library is that:

  • web applications are prevalent, popular, and all use javascript
  • we lack expertise in the other (smaller) comparable platforms (Android and iOS, basically)
  • web development makes it very easy to push bug fixes to all consumers (high future bang for buck)
  • other APIs offer hosted javascript libraries (Facebook, Twitter, Twilio, etc.)

Considerations

The first thing that had to be decided was the scope of the library; although the primary driver for the library was the complexity of authentication, that did not necessarily mean that’s all the library should offer.  Ultimately, all it does cover is authentication, for reasons of both time and avoidance of a chilling effect.  Essentially, scoping the library to just authentication gave us the biggest bang for our buck while alleviating most fears that we’d discourage the development of competing javascript libraries for our API.  It is, after all, in Stack Exchange’s best interest for there to be a healthy development community around our API.

I also decided that it was absolutely crucial that our library be as small as possible, and quickly served up.  Negatively affecting page load is unacceptable in a javascript library, basically.  In fact, concerns about page load times are why the Stack Exchange sites themselves do not use Facebook- or Twitter-provided javascript for their share buttons (and also why there is, at time of writing, no Google Plus share option).  It would be hypocritical to expect other developers not to have the same concerns we do about third-party includes.

Implementation

Warning, lots of this follows.

Since it’s been a while since there’s been any code in this discussion, I’m going to go over the current version (which reports as 453) and explain the interesting bits.  The source is here, though I caution that a great many things in it are implementation details that should not be depended upon.  In particular, consumers should always link to our hosted version of the library (at https://api.stackexchange.com/js/2.0/all.js).

The first three lines sort of set the stage for “small as we can make it”.

window.SE = (function (navigator, document, window, encodeURIComponent, Math, undefined) {
  "use strict";
  var seUrl, clientId, loginUrl, proxyUrl, fetchUserUrl, requestKey, buildNumber = '@@~~BuildNumber~~@@';

I’m passing globals as parameters to the closure defining the interface in those cases where we can expect minification to save space (there’s still some work to be done here, where I will literally be counting bytes for every reference).  We don’t actually pass an undefined to this function, which both saves space and assures nobody’s done anything goofy like giving undefined a value.  I intend to spend some time seeing if similar proofing for all passed terms is possible (document and window are already un-assignable, at least in some browsers). Note that we also declare all of our variables in batches throughout this script, to save bytes from repeating “var” keywords.
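To make the payoff concrete, here’s a rough sketch (invented single-letter names, not our actual minified output) of what a minifier gets to do once those globals are closure parameters:

// roughly what a minifier can emit once globals are closure parameters
window.SE = (function (a, b, c, d, e, f) {
  // a = navigator, b = document, c = window, d = encodeURIComponent, e = Math, f = undefined
  // every later reference to encodeURIComponent is now 1 byte instead of 18
}(navigator, document, window, encodeURIComponent, Math));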

Implementation Detail: “@@~~BuildNumber~~@@” is replaced as part of our deploy.  Note that we pass it as a string everywhere, allowing us to change the format of the version string in the future.  Version is provided only for bug reporting purposes; consumers should not depend on its format nor use it in any control flow.

function rand() { return Math.floor(Math.random() * 1000000); }

Probably the most boring part of the entire implementation; it just gives us a random number.  It’s smaller than inlining it everywhere we need one, but not by a lot even after minifying references to Math.  Since we only ever use this to avoid collisions, I’ll probably end up removing it altogether in a future version to save some bytes.

function oldIE() {
  if (navigator.appName === 'Microsoft Internet Explorer') {
    var x = /MSIE ([0-9]{1,}[\.0-9]{0,})/.exec(navigator.userAgent);
    if (x) {
      return x[1] <= 8.0;
    }
  }
  return false;
}

Naturally, there’s some Internet Explorer edge case we have to deal with.  For this version of the library, it’s that IE8 has all the appearances of supporting postMessage but does not actually have a compliant implementation.  This is a fairly terse check for Internet Explorer versions <= 8.0, inspired by the Microsoft recommended version.  I suspect a smaller one could be built, and it’d be nice to remove the usage of navigator if possible.

Implementation Detail:  There is no guarantee that this library will always treat IE 8 or lower differently than other browsers, nor is there a guarantee that it will always use postMessage for communication when able.

Now we get into the SE.init function, the first method that every consumer will need to call.  You’ll notice that we accept parameters as properties on an options object; this is a future proofing consideration, as we’ll be able to add new parameters to the method without worrying (much) about breaking consumers.

You’ll also notice that I’m doing some parameter validation here:

if (!cid) { throw "`clientId` must be passed in options to init"; }
if (!proxy) { throw "`channelUrl` must be passed in options to init"; }
if (!complete) { throw "a `complete` function must be passed in options to init"; }
if (!requestKey) { throw "`key` must be passed in options to init"; }

This is something of a religious position, but I personally find it incredibly frustrating when a minified javascript library blows up because it expected a parameter that wasn’t passed.  This is inordinately difficult to diagnose given how trivial the error is (often being nothing more than a typo), so I’m checking for it in our library and thereby hopefully saving developers some time.

Implementation Detail:  The exact format of these error messages isn’t specified, in fact I suspect we’ll rework them to reduce some repetition and thereby save some bytes.  It is also not guaranteed that we will always check for required parameters (though I doubt we’ll remove it, it’s still not part of the spec) so don’t go using try-catch blocks for control flow.

This odd bit:

if (options.dev) {
 seUrl = 'https://dev.stackexchange.com';
 fetchUserUrl = 'https://dev.api.stackexchange.com/2.0/me/associated';
} else {
 seUrl = 'https://stackexchange.com';
 fetchUserUrl = 'https://api.stackexchange.com/2.0/me/associated';
}

Is for testing on our dev tier.  At some point I’ll get our build setup to strip this out of the production version; there are a lot of wasted bytes right there.

Implementation Detail: If the above wasn’t enough, don’t even think about relying on passing dev to SE.init(); it’s going away for sure.

The last bit of note in SE.init is the very last line:

setTimeout(function () { complete({ version: buildNumber }); }, 1);

This is a bit of future proofing as well.  Currently, we don’t actually have any heavy lifting to do in SE.init(), but there very well could be some in the future.  Since we’ll never accept blocking behavior, we know that any significant additions to SE.init() will be asynchronous, and a complete function would be the obvious way to signal that SE.init() is done.

Implementation Detail:  Currently, you can get away with calling SE.authenticate() immediately, without waiting for the complete function passed to SE.init() to execute.  Don’t do this, as you may find that your code will break quite badly if our provided library starts doing more work in SE.init().
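To tie this together, a consumer’s call ends up looking roughly like the following.  All values are placeholders, and exactly what channelUrl should point at is outside the scope of this sketch:

SE.init({
  clientId: 1234,                               // placeholder; your application's client id
  key: 'your-request-key',                      // placeholder; your application's request key
  channelUrl: 'https://example.com/blank.html', // placeholder; a URL on your own domain
  complete: function (data) {
    // data.version is the build number, for bug reports only
    // per the note above, only call SE.authenticate() once this has fired
  }
});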

Next up is fetchUsers(), an internal method that handles fetching network_users after an authentication session should the consumer request them.  We make a JSONP request to /me/associated, since we cannot rely on the browser understanding CORS headers (which are themselves a fairly late addition to the Stack Exchange API).

Going a little out of order, here’s how we attach the script tag.

while (window[callbackName] || document.getElementById(callbackName)) {
 callbackName = 'sec' + rand();
}
window[callbackName] = callbackFunction;
src += '?access_token=' + encodeURIComponent(token);
src += '&pagesize=100';
src += '&key=' + encodeURIComponent(requestKey);
src += '&callback=' + encodeURIComponent(callbackName);
src += '&filter=!6RfQBFKB58ckl';
script = document.createElement('script');
script.type = 'text/javascript';
script.src = src;
script.id = callbackName;
document.getElementsByTagName('head')[0].appendChild(script);

The only interesting bit here is the while loop making sure we don’t pick a callback name that is already in use.  Such a collision would be catastrophically bad, and since we can’t guarantee anything about the hosting page we don’t have a choice but to check.

Implementation Detail:  JSONP is the lowest common denominator, since many browsers still in use do not support CORS.  It’s entirely possible we’ll stop using JSONP in the future, if CORS supporting browsers become practically universal.
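For anyone rusty on JSONP: the response comes back as a script that calls our randomly named callback, so what the browser ends up executing is conceptually something like this (illustrative shape only, not the exact wire format):

// conceptual shape of the /me/associated JSONP response
sec123456({
  "items": [ /* one entry per associated network account */ ],
  "has_more": false
});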

Our callbackFunction is defined earlier as:

callbackFunction =
 function (data) {
  try {
   delete window[callbackName];
  } catch (e) {
   window[callbackName] = undefined;
  }
  script.parentNode.removeChild(script);
  if (data.error_id) {
   error({ errorName: data.error_name, errorMessage: data.error_message });
   return;
  }
  success({ accessToken: token, expirationDate: expires, networkUsers: data.items });
 };

Again, this is fairly pedestrian.  One important thing that is often overlooked when making these sorts of libraries is the cleanup of script tags and callback functions that are no longer needed.  Leaving those lingering around does nothing but negatively affect browser performance.

Implementation Detail:  The try-catch block is a workaround for older IE behaviors.  Some investigation into whether setting the callback to undefined performs acceptably for all browsers may let us shave some bytes there, and remove the block.

Finally, we get to the whole point of this library: the SE.authenticate() method.

We do the same parameter validation we do in SE.init, though there’s a special case for scope.

if (scopeOpt && Object.prototype.toString.call(scopeOpt) !== '[object Array]') { throw "`scope` must be an Array in options to authenticate"; }

Because we can’t rely on the presence of Array.isArray in all browsers, we have to fall back on this silly toString() check.
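Before digging into the internals, here’s roughly what a consumer’s call looks like.  The callback shapes follow from the handler and callbackFunction code in this post; the networkUsers flag and the example scope value are illustrative assumptions, not documentation:

SE.authenticate({
  success: function (data) {
    // data.accessToken, data.expirationDate, and (when requested) data.networkUsers
  },
  error: function (data) {
    // data.errorName and data.errorMessage describe what went wrong
  },
  scope: ['read_inbox'],   // optional, but must be an array if present (illustrative scope)
  networkUsers: true       // (assumption) asks the library to also fetch /me/associated on success
});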

The meat of SE.authenticate() is in this block:

if (window.postMessage && !oldIE()) {
  if (window.attachEvent) {
    window.attachEvent("onmessage", handler);
  } else {
    window.addEventListener("message", handler, false);
  }
} else {
  poll =
    function () {
      if (!opened) { return; }
      if (opened.closed) {
        clearInterval(pollHandle);
        return;
      }
      var msgFrame = opened.frames['se-api-frame'];
      if (msgFrame) {
        clearInterval(pollHandle);
        handler({ origin: seUrl, source: opened, data: msgFrame.location.hash });
      }
    };
  pollHandle = setInterval(poll, 50);
}
opened = window.open(url, "_blank", "width=660, height=480");

In a nutshell, if a browser supports (and properly implements, unlike IE8) postMessage we use that for cross-domain communication, otherwise we fall back to the old iframe trick.  The iframe approach here isn’t the most elegant (polling isn’t strictly required) but it’s simpler.

Notice that if we end up using the iframe approach, I’m wrapping the results up in an object that quacks enough like a postMessage event to make use of the same handler function.  This is easier to maintain, and saves some space through code reuse.

Implementation Detail:  Hoo boy, where to start.  First, the usage of postMessage or iframes shouldn’t be relied upon, nor should the format of the messages sent.  The observant will notice that stackexchange.com detects that this library is in use, and only creates an iframe named “se-api-frame” when it is; this behavior shouldn’t be relied upon either.  There’s quite a lot in this method that should be treated as a black box; note that the communication calisthenics this library is doing aren’t necessary if you’re hosting your javascript under your own domain (as is expected of other, more fully featured, libraries like those found on Stack Apps).

Here’s the handler function:

handler =
  function (e) {
    if (e.origin !== seUrl || e.source !== opened) { return; }
    var i,
        pieces,
        parts = e.data.substring(1).split('&'),
        map = {};
    for (i = 0; i < parts.length; i++) {
      pieces = parts[i].split('=');
      map[pieces[0]] = pieces[1];
    }
    if (+map.state !== state) {
      return;
    }
    if (window.detachEvent) {
      window.detachEvent("onmessage", handler);
    } else {
      window.removeEventListener("message", handler, false);
    }
    opened.close();
    if (map.access_token) {
      mapSuccess(map.access_token, map.expires);
      return;
    }
    error({ errorName: map.error, errorMessage: map.error_description });
  };

You’ll notice that we’re religious about checking the message for authenticity (origin, source, and state checks).  This is very important as it helps prevent malicious scripts from using our script as a vector into a consumer; security is worth throwing bytes at.

Again we’re also conscientious about cleaning up, making sure to unregister our event listener, for the same performance reasons.

I’m using a mapSuccess function to handle the conversion of the response and the invocation of success (and optionally calling fetchUsers()).  This is probably wasting some space and will get refactored sometime in the future.

I’m passing expirationDate to success as a Date because of a mismatch between the Stack Exchange API (which talks in “seconds since the unix epoch”) and javascript (which while it has a dedicated Date type, thinks in “milliseconds since unix epoch”).  They’re just similar enough to be confusing, so I figured it was best to pass the data in an unambiguous type.

Implementation Detail:  The manner in which we’re currently calculating expirationDate can over-estimate how long the access token is good for.  This is legal, because the expiration date of an access token technically just specifies a date by which the access token is guaranteed to be unusable (consider what happens to the access token for an application that a user removes immediately after authenticating).
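For the curious, the mismatch boils down to a factor of one thousand.  A minimal sketch (expiresInSeconds is just a stand-in name, and as noted above the library’s actual calculation may pad the value):

// the API reports seconds since the unix epoch; Date's constructor wants milliseconds
var expirationDate = new Date(expiresInSeconds * 1000);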

Currently we’ve managed to squeeze this whole affair down into a little less than 3K worth of minified code, which gets down under 2K after compression.  Considering caching (and our CDN) I’m pretty happy with the state of the library, though I have some hope that I can get us down close to 1K after compression.

[ed. As of version 568, the Stack Exchange Javascript SDK is down to 1.77K compressed, 2.43K uncompressed.]

