An Absurd Experiment In String Parsing

I’m going to start this off by saying that this post deals with some code that is a bit… silly.  Definitely veering into “why would you do this?” territory for most, but it’s still an interesting endeavor.

The question is, how quickly can you parse a string?

To give some context, at Stack Exchange we’ve got a process that’s slurping logs from our load balancer into SQL Server for us to query; it’s useful for analytics, debugging, and performance monitoring.  The trick is that at our scale this is quite a lot of data to be dealing with, and traffic spikes at peak time have a habit of knocking the service into a death spiral.

Investigations into one of the latest incidents of service suicide indicated, for a short time, that actually parsing the logs was a bottleneck.  Ultimately this turned out to be a red herring (the actual killer was, rather predictably, disk IO), but the seed of “how would we do this faster?” was planted.

To be clear, what follows isn’t solving a real problem; this is just me having some fun with code.  Everyone should do it now and again, builds character.

There are, broadly, two approaches to string parsing

You can either go with regular expressions, or you can roll it yourself with IndexOf (or your framework of choice’s equivalent).  All things being equal, I’m inclined to go with regular expressions for their terseness and comparable ease of maintainability.

If you’re going for maximum speed (as I am here), you really do want to do all the string manipulation yourself and maintainability be damned.  However, it’d be nice if we could get the performance of IndexOf’ing and Substring’ing everywhere with a cleaner interface.

Writing IL, approximately as intense as this song.

Enter IL

ILGenerator, my favorite class to go out of my way to avoid, lets you generate new methods at the bytecode (Common Intermediate Language, in the .NET world) level at runtime.  If you want speed, you’ll find it here.
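To give a flavor of what working with ILGenerator looks like, here's a minimal sketch of the workflow: build a DynamicMethod, emit opcodes into it, and bake it into a delegate.  This toy just calls String.get_Length; it's a stand-in for the much larger parser bodies described below, not code from the actual library.

```csharp
using System;
using System.Reflection.Emit;

class ILDemo
{
    static void Main()
    {
        // Define a new method at runtime: int GetLength(string)
        var dyn = new DynamicMethod("GetLength", typeof(int), new[] { typeof(string) });
        var il = dyn.GetILGenerator();

        il.Emit(OpCodes.Ldarg_0);  // push the string argument onto the stack
        il.Emit(OpCodes.Callvirt, typeof(string).GetProperty("Length").GetGetMethod());
        il.Emit(OpCodes.Ret);      // return the int left on the stack

        // Bake the IL into a callable delegate
        var getLength = (Func<string, int>)dyn.CreateDelegate(typeof(Func<string, int>));
        Console.WriteLine(getLength("hello")); // 5
    }
}
```

The same Emit calls, just a great many more of them, are what produce the parser delegates.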

The approach I went with was to create a nice interface (call chaining, opaque state objects, and all that jazz) for describing a parser, and at the last minute dropping into some hideous IL generation to create an actual delegate to do the deed.  This produces very, very fast string parsing code without being unbearable to use; it also enables some dirty tricks we wouldn’t want to use even in hand rolled parsing code, which I’ll get to later.

In terms of capabilities, I decided that a reasonable string parser would be composed of the following components:

  • Moving forward or backwards in the string a fixed number of characters
  • Moving forward or backwards in the string until a certain string is encountered
  • Taking a given number of characters from the string as a value
  • Taking characters from the string until a certain string is encountered
  • Taking the remainder of the string as a value
  • Performing an arbitrary action when the input string does not conform to the parser’s expectations

I decomposed these six components further into eighteen different method calls, though there are many convenience overloads (referring to a member via either a string or a MemberInfo being the most common).

It ends up looking like the following (contrived example) in practice:

var parser =
 FSBuilder
  .Take(":=", "VarName")
  .Take("(", "MethodName")
  .Take(")", "Parameters")
  .Until(";")
  .TakeRest("Remained")
  .Else((str, obj) => { throw new ArgumentException(); })
  .Seal();

var myObj = new MyObject();
parser("a:=fx(a,b,c);   ... and then some more ...", myObj);

Do note that this isn’t meant to be a replacement for regular expressions, it’s meant to replace a class of string parsers for which I’d expect most developers to use regular expressions.  I do suspect these operations are enough to build DFAs (though I’m not going to spend time trying to prove it, I don’t really care), but not very cleanly and most regular expression engines aren’t strictly regular anyway.

This code runs like greased lightning

On my machine, parsing “(\d+),(\d+),(\d+),(\d+)” and its equivalent hand written and composed versions across one million randomly generated (but compliant) strings yields the following.

  • Regex: 3540ms
  • Hand Written: 1140ms
  • Composed: 690ms

I’m dropping the last digit, as it’s highly variable, and taking the median of five runs, forcing a GC (and pausing for it to complete) between each run to try and minimize GC jitter.
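The methodology above can be sketched as a small harness; the `MeasureMs` helper and the dummy workload here are illustrative, not the actual benchmark code.

```csharp
using System;
using System.Diagnostics;

class Bench
{
    // Median-of-five timing, forcing a completed GC between runs and
    // dropping the highly variable last digit of the result.
    static long MeasureMs(Action action, int runs = 5)
    {
        var times = new long[runs];
        for (var i = 0; i < runs; i++)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            times[i] = sw.ElapsedMilliseconds;
        }
        Array.Sort(times);
        return (times[runs / 2] / 10) * 10; // median, last digit dropped
    }

    static void Main()
    {
        // A stand-in workload; the real runs parse one million strings
        var ms = MeasureMs(() => { for (var i = 0; i < 100000; i++) { var s = i.ToString(); } });
        Console.WriteLine(ms >= 0);
    }
}
```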

Wait, aren’t the hand written and composed versions equivalent?

They’re definitely similar, but there are a lot of dirty tricks you can pull off when you’re emitting IL that you wouldn’t really want to tolerate in hand written code.

A big one: if you’re using IndexOf and Substring you’re still creating lots of strings (creating GC pressure) and implicitly paying for method calls (which are fast but not instant, so they add up over many iterations) to get at actual character data.  The internals of a regular expression are even worse in terms of overhead and object churn.  The IL I’m generating does everything in terms of character arrays (in fact, I can guarantee I create only one to parse a string of integers) and avoids method calls like the plague, effectively inlining everything.

In a related vein, both regular expression and IndexOf/Substring approaches end up giving you strings you need to parse into the appropriate data types.  Naively, you’ll typically use things like Int32.Parse which have the same “strings + method call” overhead as above.  Generating IL lets me inline my own parsing code, which naturally deals in character arrays, avoiding both costs again.
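The kind of inlined integer parsing I'm describing looks roughly like this when written out by hand; this is a sketch of the technique, not the emitted IL itself, and `ParseInt` is a hypothetical name.

```csharp
using System;

class CharParse
{
    // Parse a non-negative integer directly out of a char[] slice, the way
    // the generated IL does, rather than allocating a substring to hand to
    // Int32.Parse.  No intermediate strings, no extra method calls per digit.
    static int ParseInt(char[] chars, int start, int length)
    {
        var result = 0;
        for (var i = start; i < start + length; i++)
        {
            var c = chars[i];
            if (c < '0' || c > '9') throw new FormatException("expected a digit");
            result = result * 10 + (c - '0');
        }
        return result;
    }

    static void Main()
    {
        var buffer = "123,456".ToCharArray();
        Console.WriteLine(ParseInt(buffer, 0, 3)); // 123
        Console.WriteLine(ParseInt(buffer, 4, 3)); // 456
    }
}
```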

The IndexOf/Substring approach does still eke out better performance when you’re just dealing with string members.  Delving into the .NET classes with Reflector shows that Microsoft has (wisely) invested quite a lot of time in optimizing string manipulation; to match them I’d probably have to start linking in C, which is a bigger time investment than I’m willing to make.

There really is no lower limit.

Of course, it could be dirtier

I have left some areas unexplored that would probably squeeze a few more milliseconds out.

Non-integer types still fall back to Parse calls, as do DateTime and TimeSpan and some cases of enumerations.  Alternative implementations are time consuming to produce, but would pay dividends especially in the DateTime and TimeSpan cases.

I suspect allocations can be reduced even further through some dubious stackalloc or ThreadLocal usage, possibly combined with some heap juggling.  I’m not nearly as certain these would work out for the best, but a lack of expertise and time has kept me from digging too deeply.

My string scanning algorithm is basically a for loop; breaking out something like Boyer-Moore would pay noticeable dividends for larger “needles”.  This is a lot of complexity for exactly zero gain in the rather common “single character needle” case, so I haven’t sunk the necessary time in yet and may never.
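For the curious, the “basically a for loop” scan is about this simple (the `NaiveIndexOf` name is mine, for illustration); Boyer-Moore would instead precompute skip tables from the needle so mismatches can jump several positions at once.

```csharp
using System;

class Scan
{
    // Slide the needle along the haystack one position at a time,
    // comparing character by character.  O(haystack * needle) worst case,
    // but unbeatable in simplicity for single-character needles.
    static int NaiveIndexOf(char[] haystack, char[] needle, int from)
    {
        for (var i = from; i <= haystack.Length - needle.Length; i++)
        {
            var j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i; // full needle matched at i
        }
        return -1; // not found
    }

    static void Main()
    {
        Console.WriteLine(NaiveIndexOf("a:=fx(a,b,c);".ToCharArray(), ":=".ToCharArray(), 0)); // 1
    }
}
```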

Not much time has been spent on generating ideal IL either.  There’s certainly a lot of stack shuffling that could be eliminated for truly minuscule, if non-zero, savings.

If you’re interested, play around with the code

I’ve thrown it up on github, and uploaded a package on NuGet as the unexcitingly named FluentStringParser (because naming is hard).  The code’s moderately well documented, but it’s still IL so the learning curve is more of a brick wall really.  Still, learn by doing and all that.

I do stress that this is a personal project, not a Stack Exchange endorsed chunk of code (those are currently documented in this blog post).  Caveat emptor if you pull this code in as it does dirty, dirty things for silly, silly reasons.


Stack Exchange API V2.0: The Stable Future

This is the last of my planned series about v2.0 of the Stack Exchange API (the contest is still going on, and we froze the interface a few weeks ago), and I’d like to talk about some regrets.  Not another mistakes post (API V2.0 is still too young to judge with the benefit of hindsight), but something in a similar vein.

This is thankfully a pretty short post, as I’ve worked pretty hard to reduce the regrettable things in every iteration of our API.

The need for breaking changes

I definitely tried to break as little as possible; compare the question object in V1.1 to that in V2.0 to see how little changed.  However, we ate two big breaks: restructuring the common response wrapper, and moving the API from per-site URLs (api.stackoverflow.com, api.serverfault.com, etc.) to a single shared domain (api.stackexchange.com).

These were definitely needed, and I’m certain the benefits they give us will quickly pay off the transition pain; but still, breaking changes impose work on consumers.  When building an API it’s important never to make breaking changes wantonly; you’re making a non-trivial imposition on other developers’ time.

We couldn’t do everything

I’ve already got a list of things I want in V2.1, unfortunately some of them were things I’d like to have gotten into V2.0.  A truism of development is that there’s always more to be done, real developers ship, and so on.  API development is no different, but I’m hoping to see an increase in new features and a decrease in time between API revisions as Stack Exchange matures.

We should be much stabler now

Naturally, since I’m aware of these problems we’ve taken steps to make them less of an issue.

I’m very satisfied with the current request wrapper, and the filter feature (plus our backwards compatibility tooling around it) makes it much easier to make additions to it without breaking existing consumers.  In general, filters allow for lots of forward compatibility guarantees.

We’ll never be able to do everything, but all sorts of internal gotchas that made V1.1 really rough to roll out have been fixed as part of V2.0.  Subsequent minor revisions to V2.0 shouldn’t be nearly as rough, and with our growing staff developer resources should be comparatively plentiful.


Stack Exchange API V2.0: Consistency

If you take a look at the changelog for Stack Exchange API V2.0 (go check it out, you could win a prize), you’ll see we did more than just add new methods.  We also made a considerable number of changes which, taken as a whole, aim to improve the consistency of the API’s interface.

Broadly speaking, good APIs are consistent APIs.  Every time your API surprises a developer, or forces them to re-read the documentation, you’ve lost a small but important battle.  As I’ve said before, an API isn’t interesting in and of itself; the applications built on top of it are.  Accordingly, any time wasted wrestling with the rough edges of your API is time not spent making those applications even more interesting.

Addressing Past Mistakes

This is an opinion I held during V1.0 as well, but we’re hardly perfect and some oddities slipped through.  Fields were inconsistently HTML encoded/escaped, with question bodies and titles needing completely different treatment by an application.  There were ways to exclude some, but not all, expensive to fetch fields on some, but not many, methods.  There were odd naming patterns: for example, /questions/{ids} expected question ids and /comments/{ids} expected comment ids, but /revisions/{ids} expected post ids.  Return types could be unexpected: /questions returned questions, as did /questions/{ids}; but while /badges returned badges, /badges/{ids} returned users (those who had earned the badges in question).

We’ve addressed all of these faults, and more, in V2.0.  Safety addresses the encoding issues, while Filters address the excluding of fields.  /revisions/{ids} was renamed to /posts/{ids}/revisions, while another odd method /revisions/{post ids}/{revision id} became the new /revisions/{revision ids}.  And all methods that exist in /method, /method/{ids} pairs return the same types.

There are some ways we’re perhaps… oddly consistent, that I feel merit mention.  All returns use the same “common wrapper object”, even errors; this lets developers use common de-serialization code paths for all returns.  Relatedly, all methods return arrays of results even if they can reasonably be expected to always return a single item (such as /info); as with the common wrapper, this allows for sharing de-serialization code.  We return quota fields as data in the response body (other APIs, like Twitter, often place them in headers), so developers can handle them in the same manner as they would any other numeric data.
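The payoff of the common wrapper can be sketched like so; the class and field names here are illustrative stand-ins for whatever a consumer's deserializer produces, not an official client.

```csharp
using System;
using System.Collections.Generic;

class WrapperDemo
{
    // One shape for every return: /info, /questions, even errors all fill
    // the same generic wrapper, so a single handling path serves them all.
    class Wrapper<T>
    {
        public List<T> Items = new List<T>(); // always an array, even for single-item methods like /info
        public int QuotaRemaining;            // quota arrives as body data, not a header
        public int? ErrorId;                  // populated only on error responses
    }

    // The single code path: check for an error, otherwise consume items.
    static void Handle<T>(Wrapper<T> response)
    {
        if (response.ErrorId.HasValue)
        {
            Console.WriteLine("error " + response.ErrorId);
            return;
        }
        Console.WriteLine(response.Items.Count + " item(s), quota " + response.QuotaRemaining);
    }

    static void Main()
    {
        Handle(new Wrapper<string> { Items = { "site info" }, QuotaRemaining = 9999 });
        Handle(new Wrapper<string> { ErrorId = 400 });
    }
}
```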

Identifying this game would make an interesting "age check" form.

Pitfalls

I would caution against what I’ve taken to calling “false consistency”, where an API either implies a pattern that doesn’t exist or sacrifices more important features in the name of consistency.

By way of example: the Stack Exchange API has a number of optional sorts on many methods, many of which are held in common (a great many methods have an activity sort, for instance).  The crux of the consistency objection to these sorts is that they often apply to the same field (creation sorting applying to question, answer, and comment creation_date fields, for example) but are named differently from the fields; some argue they should have the same name.

However, this would be a false consistency.  Consider: if we just changed the names of the existing sorts to match the fields, we would then be implying that all fields can be sorted by, which isn’t true.  If we did make that possible, we’d have sacrificed performance (and probably some of our SQL servers) in pursuit of consistency.  In this case the cure would be worse than the disease, though I would argue our sorts are consistent with regards to each other if not with regards to the fields they operate on.


Stack Exchange API V2.0: Safety

Every method in version 2.0 of the Stack Exchange API (no longer in beta, but an app contest continues) can return either “safe” or “unsafe” results.  This is a rather odd concept that deserves some explanation.

Semantics

Safe results are those for which every field on every object returned can be inlined into HTML without concern for script injections or other “malicious” content.  Unsafe results are those for which this is not the case; however, that is not to imply that all fields are capable of containing malicious content in an unsafe return.

An application indicates whether it wants safe or unsafe results via the filter it passes along with the request.  Since most requests should have some filter passed, it seemed reasonable to bundle safety into them.

To the best of my knowledge, this is novel in an API.  It’s not exactly something that we set out to add, however; there’s a strong historical component to its inclusion.

Rationale

In version 1.0 of the Stack Exchange API every result is what we would call unsafe in version 2.0.  We were simply returning the data stored in our database, without much concern for encoding or escaping.  This led to cases where, for example, question bodies were safe to inline (and in fact, escaping them would be an error) but question titles required encoding lest an application open itself to script injections.  This behavior caused difficulties for a number of third-party developers, and bit us internally a few times as well; which is why I resolved to do something to address it in version 2.0.
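That V1.0 inconsistency, in miniature (the example strings are hypothetical, and WebUtility.HtmlEncode stands in for whatever encoding routine a consumer uses):

```csharp
using System;
using System.Net;

class EncodeDemo
{
    static void Main()
    {
        // Bodies came back already encoded; titles came back raw.
        var body  = "Use <code>&lt;T&gt;</code> here"; // already escaped, safe to inline as-is
        var title = "What does List<T> mean?";         // raw; inlining this invites injection

        Console.WriteLine(body);                         // no further treatment needed
        Console.WriteLine(WebUtility.HtmlEncode(title)); // must be encoded before inlining
    }
}
```

Two fields, two completely different rules: exactly the sort of surprise an application author has to discover the hard way.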

I feel that a great strength in any API is consistency, and thus the problem was not that you had to encode data; it was that you didn’t always have to.  We had two options: we could either make all fields “high fidelity” by returning the most original data we had (i.e. exactly what users had entered, before tag balancing, entity encoding, and so on), or we could make all fields “safe” by making sure the same sanitization code ran against them regardless of how they were stored in the database.

Unfortunately, it was not to be.

Strictly speaking, I would have preferred to go with the “high fidelity” option but another consideration forced the “safe” one.  The vast majority of the content in the Stack Exchange network is submitted in markdown, which isn’t really standardized (and we’ve added a number of our own extensions).  Returning the highest fidelity data would be making the assumption that all of our consumers could render our particular markdown variant.  While we have open sourced most of our markdown implementation, even with that we’d be restricting potential consumers to just those built on .NET.  Thus the “high fidelity” option wasn’t just difficult to pursue, it was effectively impossible.

Given this train of thought, I was originally going to just make all returns “safe” period.  I realize in hindsight that this would have been a pretty grave mistake (thankfully, some of the developers at Stack Exchange, and some from the community, talked me out of it).  I think the parallel with C#’s unsafe keyword is a good one, sometimes you need dangerous things and you put a big “I know what I’m doing” signal right there when you do.  This parallel ultimately granted the options their names; properly escaped and inline-able returns are “safe”, and those that are pulled straight out of the database are “unsafe”.

Adding a notion of “safety” allows us to be very consistent in how the data we return should be treated.  It also allows developers to ignore encoding issues if they so desire, at least in the web app case.  Of course, if they’re willing to handle encoding themselves they have that option as well.


Stack Exchange API V2.0: No Write Access

Version 2.0 of the Stack Exchange API (still in beta, with an app contest) introduced authentication, but unlike most other APIs did not introduce write access at the same time.  Write access is one of the most common feature requests we get for the API, so I feel it’s important to explain that this was very much a conscious decision, not one driven solely by the scarcity of development resources.

Why?

The “secret sauce” of the Stack Exchange network is the quality of the content found on the sites.  When compared to competing platforms you’re much more likely to find quality existing answers to your questions, to find interesting and coherent questions to answer, and to quickly get good answers to new questions you post.  Every change we make or feature we add is considered in terms of either preserving or improving quality, and the Stack Exchange API has been no different.  When viewed through the lens of quality, a few observations can be made about the API.

Screw up authentication, and you screw up write access

Write access presupposes authentication, and any flaw in authentication is going to negatively impact the use of write access accordingly.  Authentication, being a new 2.0 feature, is essentially untested, unverified, and above all untrusted in the current Stack Exchange API.  By pushing authentication out in its own release we’re starting to build some confidence in our implementation, allowing us to be less nervous about write access in the future.

Screw up write access, and you screw up quality

The worst possible outcome of adding write access to the Stack Exchange API is a flood of terrible questions and answers on our sites.  As such the design of our API will need to actively discourage such an outcome; we can’t get away with a simple “POST /questions/create”.  However, a number of other APIs can get away with very simple write methods, and it’s important to understand how they differ from the Stack Exchange model and thus need not be as concerned (beyond not having nearly our quality in the first place).

The biggest observation is that every Stack Exchange site is, in a sense, a single “well” in danger of being poisoned.  Every Stack Exchange site user sees the same Newest Questions page, the same User page, and (with the exception of Stack Overflow) the same Home page.  Compare with social APIs (e.g. Facebook and Twitter, where everything is sharded around the user), or service APIs (like Twilio, which doesn’t really have common content to show users); there are lots of “wells”, none of which are crucial to protect.

Write access requires more than just writing

It’s easy to forget just how much a Stack Exchange site does to encourage proper question asking behavior in users.

I've circled the parts of the Ask page that don't offer the user guidance

A write API for Stack Exchange will need to make all of this guidance available for developers to integrate into their applications, as well as find ways to incentivize that integration.  We also have several automated quality checks against new posts, and a plethora of other rejection causes; all of which need to be conscientiously reported by the API (without creating oracles for bypassing those checks).

Ultimately the combination of wanting authentication to be independently battle tested, the need to really focus on getting write access correct, and the scarcity of development resources (caused by other work also slated for V2.0) led to write access being deferred to a subsequent release.