An Optimization Exercise

Posted: 2016/04/26 | Author: kevinmontrose | Filed under: code |8 Comments

Nick Craver, one of my coworkers at Stack Overflow, tweets out snippets of the Stack Overflow occasionally. About a week ago he showed off a ContainsToken method which has been tuned for performance. This set off a little bit of a benchmarking contest, which I participated in. My final attempt (which is among the fastest) ended up using a lot of tricks, which I think may be of general interest – so I’m going to walk through how the code works.

First, we need to define the problem. We’re creating a function, which:

Determines if, when split on a given delimiter, a value string contains a given token
- Ordinal comparison of characters is acceptable (this wasn’t clear from the tweet initially)
Does not allocate

I’m also allowing for optimizations that exploit Stack Overflow’s production environment, so 64-bit Intel and .NET 4.6+ specific tricks are allowed. Since Stack Overflow already has some unsafe code in it, unsafe is also acceptable..

My full solution is available on GitHub, my solution is in Program.cs along with several others. Credit to mattwarren for creating the Benchmark gist.

Here’s my full solution, sans comments, which I’ll be going over line-by-line.

public static bool ContainsTokenMonty(string value, string token, char delimiter = ';')
{
  const int charsPerLong = sizeof(long) / sizeof(char);
  const int charsPerInt = sizeof(int) / sizeof(char);
  const int bytesPerChar = sizeof(char) / sizeof(byte);

  if (string.IsNullOrEmpty(token)) return false;
  if (string.IsNullOrEmpty(value)) return false;

  var delimiterTwice = (delimiter << 16) | delimiter;

  var valueLength = value.Length;
  var tokenLength = token.Length;

  if (tokenLength > valueLength) return false;

  int tokenLongs;
  bool tokenTrailingInt, tokenTrailingChar;
  {
    var remainingChars = tokenLength;
    tokenLongs = remainingChars / charsPerLong;
    tokenTrailingInt = (tokenLength & 0x02) != 0;
    tokenTrailingChar = (tokenLength & 0x01) != 0;
  }

  var tokenByteLength = tokenLength * bytesPerChar;

  unsafe
  {
    fixed (char* valuePtr = value, tokenPtr = token)
    {
      var curValuePtr = valuePtr;
      var endValuePtr = valuePtr + valueLength;

      while (true)
      {
        long* tokenLongPtr = (long*)tokenPtr;
        {
          for (var i = 0; i < tokenLongs; i++)
          {
            var tokenLong = *tokenLongPtr;

            var valueLong = *((long*)curValuePtr);

            if (tokenLong == valueLong)
            {
              tokenLongPtr++;
              curValuePtr += charsPerLong;
              continue;
            }
            else
            {
              goto advanceToNextDelimiter;
            }
          }
        }

        int* tokenIntPtr = (int*)tokenLongPtr;
        if (tokenTrailingInt)
        {
          var tokenInt = *tokenIntPtr;

          var valueInt = *((int*)curValuePtr);

          if (tokenInt == valueInt)
          {
            tokenIntPtr++;
            curValuePtr += charsPerInt;
          }
          else
          {
            goto advanceToNextDelimiter;
          }
        }

        char* tokenCharPtr = (char*)tokenIntPtr;
        if (tokenTrailingChar)
        {
          var tokenChar = *tokenCharPtr;

          var valueChar = *curValuePtr;

          if (tokenChar == valueChar)
          {
            tokenCharPtr++;
            curValuePtr++;
          }
          else
          {
            goto advanceToNextDelimiter;
          }
        }

        if (curValuePtr == endValuePtr || *curValuePtr == delimiter)
        {
          return true;
        }

        advanceToNextDelimiter:

        while (true)
        {
          var curVal = *((int*)curValuePtr);

          var masked = curVal ^ delimiterTwice;
          var temp = masked & 0x7FFF7FFF;
          temp = temp + 0x7FFF7FFF;
          temp = (int)(temp & 0x80008000);
          temp = temp | masked;
          temp = temp | 0x7FFF7FFF;
          temp = ~temp;
          var neitherMatch = temp == 0;

          if (neitherMatch)
          {
            curValuePtr += charsPerInt;
            if(curValuePtr >= endValuePtr)
            {
              return false;
            }
            continue;
          }

          var top16 = temp & 0xFFFF0000;
          if (top16 != 0)
          {
            curValuePtr += charsPerInt;

            break;
          }

          var bottom16 = temp & 0x0000FFFF;
          if (bottom16 != 0)
          {
            curValuePtr += 1;
          }
        }

        var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);
        if (remainingBytesInValue < tokenByteLength)
        {
          return false;
        }
      }
    }
  }
}

It’s no accident that this resembles C, performance critical C# often does. Take note that there are no news (or calls out to functions which may allocate), so this code does zero allocations.

The overall algorithm is as follows:

Scan forward in token and value so long as they are equal
If they are not equal, move back to the start of token and advance in value until after the next delimiter
1. If there is no next delimiter, return false
If the end of token is reached…
1. If the next character in value is delimiter, or the end of value has been reached, return true
2. Otherwise, reset to the start of token and advance in value until after the next delimiter
If the end of value is reached, return false

Now for the line-by-line.

const int charsPerLong = sizeof(long) / sizeof(char);
const int charsPerInt = sizeof(int) / sizeof(char);
const int bytesPerChar = sizeof(char) / sizeof(byte);

Some convenient constants for later. These could be hardcoded since long, int,and char are always 8, 4, and 2 bytes long in C# – but I find this style easier to reason about.

if (string.IsNullOrEmpty(token)) return false;
if (string.IsNullOrEmpty(value)) return false;

This code is copied from the initial tweet, no real optimizations are available here. The check is important to maintain compatibility with the original code.

var delimiterTwice = (delimiter << 16) | delimiter;

var valueLength = value.Length;
var tokenLength = token.Length;

Some values I’ll need later. Doubling the delimiter will be used to test for it in value, and the lengths will be used to for bounds calculations.

if (tokenLength > valueLength) return false;

The rest of the code assumes that token is no larger than the remaining parts of value, so I check it right here to make sure that’s true. If token couldn’t even fit in value, we don’t even bother to look at either string and just return false.

int tokenLongs;
bool tokenTrailingInt, tokenTrailingChar;
{
  var remainingChars = tokenLength;
  tokenLongs = remainingChars / charsPerLong;
  tokenTrailingInt = (tokenLength & 0x02) != 0;
  tokenTrailingChar = (tokenLength & 0x01) != 0;
}

I going to compare token to value in 8-byte/4-char/1-long chunks, so I need to know how many longs can fit in token. Because token isn’t always a multiple of 4 in length, I also need to see if there are 0, 1, 2, or 3 characters after the longs. Note that the single division is by a constant power-of-two, so it gets compiled into something much faster than actual division.

var tokenByteLength = tokenLength * bytesPerChar;

One more useful value before we start actually comparing strings. Since there will be some pointer math later, knowing the number of bytes (importantly not the number of characters) in token will come in handy.

unsafe
{
  fixed (char* valuePtr = value, tokenPtr = token)
  {
    var curValuePtr = valuePtr;
    var endValuePtr = valuePtr + valueLength;

Now for the first scary bits. The rest of this method is in an unsafe block which means: we can use pointers, and we have to do our own bounds checking with those pointers.

I take and pin pointers to value and token, make a copy of the value pointer (so it can be mutated, which pointers declared in a fixed block cannot be), and calculate endValuePtr so we can use it to do bounds checks on value.

while (true)
{
  long* tokenLongPtr = (long*)tokenPtr;
  {
    for (var i = 0; i < tokenLongs; i++)
    {
      var tokenLong = *tokenLongPtr;

      var valueLong = *((long*)curValuePtr);

      if (tokenLong == valueLong)
      {
        tokenLongPtr++;
        curValuePtr += charsPerLong;
        continue;
      }
      else
      {
        goto advanceToNextDelimiter;
      }
    }
  }

Here I start comparing the token and value strings. Earlier I calculated how many longs fit in value, so it’s a simple for loop to make sure they all match. Each iteration compares 4 characters worth of value to token, if they match I continue the loop if they don’t we goto advanceToNextDelimiter. This is a big win for larger tokens, having a quarter the comparisons and branches.

Important things to note: the long* casts are free, and curValuePtr being incremented is faster than you might assume (since charsPerLong is a const, the increment value is pre-calculated instead of involving a multiplication).

int* tokenIntPtr = (int*)tokenLongPtr;
if (tokenTrailingInt)
{
  var tokenInt = *tokenIntPtr;

  var valueInt = *((int*)curValuePtr);

  if (tokenInt == valueInt)
  {
    tokenIntPtr++;
    curValuePtr += charsPerInt;
  }
  else
  {
    goto advanceToNextDelimiter;
  }
}

Then essentially the same check, but for a single int now – comparing two characters at a time isn’t as good as four but it’s not nothing. Whether or not this check was necessary was determined earlier, so tokenTrailingInt is unchanging across a single call – this is good for branch prediction.

char* tokenCharPtr = (char*)tokenIntPtr;
if (tokenTrailingChar)
{
  var tokenChar = *tokenCharPtr;

  var valueChar = *curValuePtr;

  if (tokenChar == valueChar)
  {
    tokenCharPtr++;
    curValuePtr++;
  }
  else
  {
    goto advanceToNextDelimiter;
  }
}

And then again for a single char.

if (curValuePtr == endValuePtr || *curValuePtr == delimiter)
{
  return true;
}

If this chunk of code is reached then every character in token has been matched, all that’s left is to make sure that either a delimiter or the end of value has been reached. If it has then we’ve got a match and return true, otherwise we fall through to advanceToNextDelimiter.

advanceToNextDelimiter:

while (true)
{
  var curVal = *((int*)curValuePtr);

  var masked = curVal ^ delimiterTwice;
  var temp = masked & 0x7FFF7FFF;
  temp = temp + 0x7FFF7FFF;
  temp = (int)(temp & 0x80008000);
  temp = temp | masked;
  temp = temp | 0x7FFF7FFF;
  temp = ~temp;
  var neitherMatch = temp == 0;

  if (neitherMatch)
  {
    curValuePtr += charsPerInt;
    if(curValuePtr >= endValuePtr)
    {
      return false;
    }
    continue;
  }

This code advances curValuePtr forward until a delimiter is found. There’s a clever combination of tricks here to allow checking two characters with one branch.

The first trick is advancing two characters at a time (thus the int* cast), and being willing to advance 1 character past the end of value. This works because C# guarantees that a string coerced to a char* is terminated with a null character (not a null byte), so there’s always two additional bytes at the end of the string for us to read into whether we need them or not.

The second trick is a series of bit twiddlings, as follows:

XOR delimiterTwice into curVal, making the the top or bottom char-sized chunks of masked all 0 if they match delimiter
- delimiterTwice was precalculated
AND masked with 0x7FFF7FFF, clearing the high bits in the top and bottom char-sized chunks in temp, store the result in temp
ADD 0x7FFF7FFF to temp, causing the high bits in the top and bottom char-sized chunks of temp to be 1 if any of the low bits in those respective chunks are set
AND temp with 0x80008000, clearing all but the high bits in the top and bottom char-sized chunks in temp
OR temp with masked, making sure the high bits in each char-sized chunk are set in temp when the high-bits in masked are.
OR temp with 0x7FFF7FFF, set all the low bits in the char-sized chunks in temp which will make temp all 1s if both char-sized chunks in curVal do not match delimiter
INVERT temp, making it all 0s if step #5 made it all 1s
COMPARE temp against 0, setting neitherMatch if so

This is followed up with an if that’s taken if neitherMatch is set in which I increment the pointer into value by two chars. There’s then a bounds check, comparing curValPtr to endValuePtr (which was pre-calculated for this purpose) and false is returned if the bounds have been exceeded. Note that endValuePtr points at the null-terminator, so bailing when curValPtr is equal to it is correct.

var top16 = temp & 0xFFFF0000;
if (top16 != 0)
{
  curValuePtr += charsPerInt;

  break;
}

var bottom16 = temp & 0x0000FFFF;
if (bottom16 != 0)
{
  curValuePtr += 1;
}

If this code is run a delimiter has been found but curValuePtr isn’t pointing to that character, just to the int that contains it. top16 and bottom16 contain the top and bottom chars from temp, whichever one isn’t all zeros is the one that contained the delimiter. Depending on which is set, curValuePtr is incremented so that it points at the first character after the delimiter.

Note because we’re executing on a little-endian architecture but C# is big-endian the increment sizes here appear to be flipped. Also remember that token cannot be empty, and that is checked at the start of the method, so checking the second char (in top16) first isn’t an error.

var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);
if (remainingBytesInValue < tokenByteLength)
{
  return false;
}

This code runs once the next delimiter has been found, and curValuePtr advanced to one char past it. It is necessary to reassert that there’s enough space remaining in value for token to fit, just as was asserted back towards the start of the method.

The space check is done in bytes instead of chars (the casts are free) because counting the number of chars between two pointers is a subtraction followed by a division, but counting the number of bytes is only a subtraction. Pre-calculating tokenByteLength was done for this purpose.

Then the while loop begins again, and continues until a match is found or value is exhausted.

Aaaaaaand that’s it. Performance wise, this is about 15x faster than the original ContainsToken code and something like 40% faster than than the next best implementation.

I’m not sure I’d encourage anyone to use this code, as relies on several subtleties which make it hard to understand. It also isn’t tuned for 32-bit architectures, or correct on big-endian ones. Both could be fixed, but require some build tweaks which aren’t particularly interesting or portable themselves.

It sure was fun to write though.

8 Comments on “An Optimization Exercise”

Jimmy says:

2016/04/26 at 15:27

I have used the boyer moore algorithm to search for tokens in inbound trading messages for a while now, we process about 10000 trades a second, and search multiple tokens in text and (more frequently) binary messages. I have a .NET algorithm for it. It would be of interest to me to know how your optimisations perform compared to mine.

https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
- kevinmontrose says:
  
  2016/04/26 at 15:53
  
  The benchmark is up on Github (the trickiest part was getting the benchmark framework built, but it’s included in /lib in my repo) so you can see for yourself.
  
  https://github.com/kevin-montrose/ContainsTokenBenchmark
  
  Adding a new function to benchmark isn’t too hard, should take 20-30 minutes to run. All the standard caveats (RELEASE, x64, disable any power saving stuff, only process doing real work, etc.) apply for benchmarking.
Kevin says:

2016/04/26 at 17:19

What is going on with the bit twiddling part? You explain what each line of code does, which is not that useful if you know what AND, XOR, OR, etc. does, but not why you do it?
- kevinmontrose says:
  
  2016/04/26 at 17:21
  
  It lets you check if either half of the int is a char matching delimiter.
  
  Saves a branch, and since the answer is typically no the branch predictor is pretty good at it.
  - Kevin says:
    
    2016/04/27 at 21:17
    
    I see. Would it work with a long, allowing you to check 4 characters at a time? Probably slower on x86, but would it be faster on x64?
Nicholas Wilson says:

2016/04/27 at 06:54

How do you deal with alignment issues? I don’t know any C#, but surely the long pointer must be 8-byte aligned, or you won’t be able to read from it on non-x86 architectures.
- kevinmontrose says:
  
  2016/04/27 at 13:59
  
  I don’t deal with them, for the same reasons I don’t deal with 32-bit or big-endian architectures, I can rely on Stack Overflow running on x64 processors. I note that I’m allow myself to exploit this fact towards the top of the post.
  
  Getting this sort of low-level, crazy optimized, code running on wildly different platforms is something of a fools errand in my opinion. If I wanted something optimized to this extent on ARM, I’d write it from scratch for ARM (and #ifdef it in, or something) – it’s pretty unlikely the “best” x64 solution is also the “best” ARM one.
Groo says:

2016/05/05 at 08:54

This is very cool, you don’t often get an opportunity to go low level this far with a real justification.

One detail though: you stated that C# is “big-endian” a couple of times, but I don’t think that’s correct (if I understood you correctly). A value of `0xFFFF0000` is a hex representation of the same 32-bit number in any language and doesn’t have any architecture-specific information, and it is this *number* which will get a different value during casting, depending on the architecture. A byte sequence 01-02-03-04 on a little-endian machine will be cast to 0x04030201 in any language (capable of casting). Vice versa, you only need to know the *system* architecture to know that the LSB is the first or last byte in the sequence.