An Absurd Experiment In String Parsing

I’m going to start this off by saying that this post deals with some code that is a bit… silly.  Definitely veering into “why would you do this?” territory for most, but it’s still an interesting endeavor.

The question is, how quickly can you parse a string?

To give some context, at Stack Exchange we’ve got a process that’s slurping logs from our load balancer into SQL Server for us to query; it’s useful for analytics, debugging, and performance monitoring.  The trick is that at our scale this is quite a lot of data to be dealing with, and traffic spikes at peak time have a habit of knocking the service into a death spiral.

Investigations into one of the latest incidents of service suicide indicated, for a short time, that actually parsing the logs was a bottleneck.  Ultimately this turned out to be a red-herring (the actual killer was, rather predictably, disk IO), but the seed of “how would we do this faster?” was planted.

To be clear, what follows isn’t solving a real problem; this is just me having some fun with code.  Everyone should do it now and again, builds character.

There are, broadly, two approaches to string parsing

You can either go with regular expressions, or you can roll it yourself with IndexOf (or your framework of choice’s equivalent).  All things being equal, I’m inclined to go with regular expressions for their terseness and comparable ease of maintainability.

If you’re going for maximum speed (as I am here), you really do want to do all the string manipulation yourself and maintainability be damned.  However, it’d be nice if we could get the performance of IndexOf’ing and Substring’ing everywhere with a cleaner interface.

Writing IL, approximately as intense as this song.

Enter IL

ILGenerator, my favorite class to go out of my way to avoid, lets you generate new methods at the bytecode (Common Intermediate Language, in the .NET world) level at runtime.  If you want speed, you’ll find it here.

The approach I went with was to create a nice interface (call chaining, opaque state objects, and all that jazz) for describing a parser, and at the last minute dropping into some hideous IL generation to create an actual delegate to do the deed.  This produces very, very fast string parsing code without being unbearable to use; it also enables some dirty tricks we wouldn’t want to use even in hand rolled parsing code, which I’ll get to later.

In terms of capabilities, I decided that a reasonable string parser would be composed of the following components

  • Moving forward or backwards in the string a fixed number of characters
  • Moving forward or backwards in the string until a certain string is encountered
  • Taking a given number of characters from the string as a value
  • Taking characters from the string until a certain string is encountered
  • Taking the remainder of the string as a value
  • Perform an arbitrary action when the input string does not conform to the parser’s expectations

I decomposed these six components further into eighteen different method calls, though there are many convenience overrides (referring to a member via either a string or a MemberInfo being the most common).

It ends up looking like the following (contrived example) in practice:

var parser =
  .Take(":=", "VarName")
  .Take("(", "MethodName")
  .Take(")", "Parameters")
  .Else((str, obj) => { throw new ArgumentException(); })

var myObj = new MyObject();
parser("a:=fx(a,b,c);   ... and then some more ...", myObj);

Do note that this isn’t meant to be a replacement for regular expressions, it’s meant to replace a class of string parsers for which I’d expect most developers to use regular expressions.  I do suspect these operations are enough to build DFAs (though I’m not going to spend time trying to prove it, I don’t really care), but not very cleanly and most regular expression engines aren’t strictly regular anyway.

This code runs like greased lightening

On my machine, parsing “(\d+),(\d+),(\d+),(\d+)” and it’s equivalent hand written and composed versions across one million randomly generated (but compliant) strings yields the following.

  • Regex: 3540ms
  • Hand Written: 1140ms
  • Composed: 690ms

I’m dropping the last digit, as it’s highly variable, and taking the median of five runs; forcing a GC (and pausing for it to complete) between each run to try and minimize GC jitters.

Wait, aren’t the hand written and composed versions equivalent?

They’re definately similar, but there are a lot of dirty tricks you can pull off when you’re emitting IL that you wouldn’t really want to tolerate in hand written code.

A big one, if you’re using IndexOf and Substring you’re still creating lots of strings (creating GC pressure) and implicitly paying for method calls (which are fast but not instant, so they add up over many iterations) to get at actual character data.  The internals of a regular expression are even worse in terms of overhead and object churn.  The IL I’m generating does everything in terms of character arrays (in fact, I can guarantee I create only one to parse a string of integers) and avoids method calls like the plague, effectively inlining everything.

In a related vein, both regular expression and IndexOf/Substring approaches end up giving you strings you need to parse into the appropriate data types.  Naively, you’ll typically use things like Int32.Parse which have the same “strings + method call” overhead as above.  Generating IL lets me inline my own parsing code, which naturally deals in character arrays, avoiding both costs again.

The IndexOf/Substring approach does still eek out better performance when you’re just dealing with string members.  Delving into the .NET classes with Reflector shows that Microsoft has (wisely) invested quite a lot of time in optimizing string manipulation, to match them I’d probably have to start linking C in which is a bigger time investment than I’m willing to make.

There really is no lower limit.

Of course, it could be dirtier

I have left some areas unexplored that would probably squeeze a few more milliseconds out.

Non-integer types still fall back to Parse calls, as do DateTime and TimeSpan and some cases of enumerations.  Alternative implementations are time consuming to produce, but would pay dividends especially in the DateTime and TimeSpan cases.

I suspect allocations can be reduced even further through some dubious stackalloc or ThreadLocal usage, possibly combined with some heap juggling.  Not nearly as certain these would work out for the best, but a lack of expertise and time have kept from digging too deeply.

My string scanning algorithm is basically a for loop, breaking out something like Boyer-Moore would pay noticeable dividends for larger “needles”.  This is a lot of complexity for exactly zero gain in the rather common “single character needle” case, so I haven’t sunk the necessary time in yet and may never.

Not much time has been spent on generating ideal IL either.  There’s certainly a lot of stack shuffling that could be eliminated for truly minuscule, if non-zero, savings.

If you’re interested, play around with the code

I’ve thrown it up on github, and uploaded a package on NuGet as the unexcitingly named FluentStringParser (because naming is hard).  The code’s moderately well documented, but it’s still IL so the learning curve is more of a brick wall really.  Still, learn by doing and all that.

I do stress that this is a personal project, not a Stack Exchange endorsed chunk of code (those are currently documented in this blog post).  Caveat emptor if you pull this code in as it does dirty, dirty things for silly, silly reasons.

One Comment on “An Absurd Experiment In String Parsing”

  1. Matt Warren says:

    Is there a new-rule for StackOverflow developers,

    “Every piece of code you develop must contain some IL Generation code” 😉

    Dapper, FastMemeber and now this!!

    Thanks for posting this though, very useful. I love this low-level stuff!!