# An Optimization Exercise

Nick Craver, one of my coworkers at Stack Overflow, tweets out snippets of the Stack Overflow occasionally.  About a week ago he showed off a ContainsToken method which has been tuned for performance.  This set off a little bit of a benchmarking contest, which I participated in.  My final attempt (which is among the fastest) ended up using a lot of tricks, which I think may be of general interest – so I’m going to walk through how the code works.

First, we need to define the problem.  We’re creating a function, which:

• Determines if, when split on a given delimiter, a value string contains a given token
• Ordinal comparison of characters is acceptable (this wasn’t clear from the tweet initially)
• Does not allocate

I’m also allowing for optimizations that exploit Stack Overflow’s production environment, so 64-bit Intel and .NET 4.6+ specific tricks are allowed.  Since Stack Overflow already has some unsafe code in it, unsafe is also acceptable..

My full solution is available on GitHub, my solution is in Program.cs along with several others.  Credit to mattwarren for creating the Benchmark gist.

Here’s my full solution, sans comments, which I’ll be going over line-by-line.

```public static bool ContainsTokenMonty(string value, string token, char delimiter = ';')
{
const int charsPerLong = sizeof(long) / sizeof(char);
const int charsPerInt = sizeof(int) / sizeof(char);
const int bytesPerChar = sizeof(char) / sizeof(byte);

if (string.IsNullOrEmpty(token)) return false;
if (string.IsNullOrEmpty(value)) return false;

var delimiterTwice = (delimiter << 16) | delimiter;

var valueLength = value.Length;
var tokenLength = token.Length;

if (tokenLength > valueLength) return false;

int tokenLongs;
bool tokenTrailingInt, tokenTrailingChar;
{
var remainingChars = tokenLength;
tokenLongs = remainingChars / charsPerLong;
tokenTrailingInt = (tokenLength & 0x02) != 0;
tokenTrailingChar = (tokenLength & 0x01) != 0;
}

var tokenByteLength = tokenLength * bytesPerChar;

unsafe
{
fixed (char* valuePtr = value, tokenPtr = token)
{
var curValuePtr = valuePtr;
var endValuePtr = valuePtr + valueLength;

while (true)
{
long* tokenLongPtr = (long*)tokenPtr;
{
for (var i = 0; i < tokenLongs; i++)
{
var tokenLong = *tokenLongPtr;

var valueLong = *((long*)curValuePtr);

if (tokenLong == valueLong)
{
tokenLongPtr++;
curValuePtr += charsPerLong;
continue;
}
else
{
}
}
}

int* tokenIntPtr = (int*)tokenLongPtr;
if (tokenTrailingInt)
{
var tokenInt = *tokenIntPtr;

var valueInt = *((int*)curValuePtr);

if (tokenInt == valueInt)
{
tokenIntPtr++;
curValuePtr += charsPerInt;
}
else
{
}
}

char* tokenCharPtr = (char*)tokenIntPtr;
if (tokenTrailingChar)
{
var tokenChar = *tokenCharPtr;

var valueChar = *curValuePtr;

if (tokenChar == valueChar)
{
tokenCharPtr++;
curValuePtr++;
}
else
{
}
}

if (curValuePtr == endValuePtr || *curValuePtr == delimiter)
{
return true;
}

while (true)
{
var curVal = *((int*)curValuePtr);

var masked = curVal ^ delimiterTwice;
var temp = masked & 0x7FFF7FFF;
temp = temp + 0x7FFF7FFF;
temp = (int)(temp & 0x80008000);
temp = temp | 0x7FFF7FFF;
temp = ~temp;
var neitherMatch = temp == 0;

if (neitherMatch)
{
curValuePtr += charsPerInt;
if(curValuePtr >= endValuePtr)
{
return false;
}
continue;
}

var top16 = temp & 0xFFFF0000;
if (top16 != 0)
{
curValuePtr += charsPerInt;

break;
}

var bottom16 = temp & 0x0000FFFF;
if (bottom16 != 0)
{
curValuePtr += 1;
}
}

var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);
if (remainingBytesInValue < tokenByteLength)
{
return false;
}
}
}
}
}
```

It’s no accident that this resembles C, performance critical C# often does.  Take note that there are no news (or calls out to functions which may allocate), so this code does zero allocations.

The overall algorithm is as follows:

1. Scan forward in `token` and `value` so long as they are equal
2. If they are not equal, move back to the start of `token` and advance in `value` until after the next `delimiter`
1. If there is no next `delimiter`, return false
3. If the end of `token` is reached…
1. If the next character in `value` is `delimiter`, or the end of `value` has been reached, return true
2. Otherwise, reset to the start of `token` and advance in `value` until after the next `delimiter`
4. If the end of `value` is reached, return false

Now for the line-by-line.

```const int charsPerLong = sizeof(long) / sizeof(char);
const int charsPerInt = sizeof(int) / sizeof(char);
const int bytesPerChar = sizeof(char) / sizeof(byte);
```

Some convenient constants for later.  These could be hardcoded since long, int,and char are always 8, 4, and 2 bytes long in C# – but I find this style easier to reason about.

```if (string.IsNullOrEmpty(token)) return false;
if (string.IsNullOrEmpty(value)) return false;
```

This code is copied from the initial tweet, no real optimizations are available here.  The check is important to maintain compatibility with the original code.

```var delimiterTwice = (delimiter << 16) | delimiter;

var valueLength = value.Length;
var tokenLength = token.Length;
```

Some values I’ll need later.  Doubling the delimiter will be used to test for it in value, and the lengths will be used to for bounds calculations.

```if (tokenLength > valueLength) return false;
```

The rest of the code assumes that `token` is no larger than the remaining parts of `value`, so I check it right here to make sure that’s true.  If `token` couldn’t even fit in `value`, we don’t even bother to look at either string and just return false.

```int tokenLongs;
bool tokenTrailingInt, tokenTrailingChar;
{
var remainingChars = tokenLength;
tokenLongs = remainingChars / charsPerLong;
tokenTrailingInt = (tokenLength & 0x02) != 0;
tokenTrailingChar = (tokenLength & 0x01) != 0;
}
```

I going to compare `token` to value in 8-byte/4-char/1-long chunks, so I need to know how many `long`s can fit in token.  Because `token` isn’t always a multiple of 4 in length, I also need to see if there are 0, 1, 2, or 3 characters after the `long`s.  Note that the single division is by a constant power-of-two, so it gets compiled into something much faster than actual division.

```var tokenByteLength = tokenLength * bytesPerChar;
```

One more useful value before we start actually comparing strings.  Since there will be some pointer math later, knowing the number of bytes (importantly not the number of characters) in `token` will come in handy.

```unsafe
{
fixed (char* valuePtr = value, tokenPtr = token)
{
var curValuePtr = valuePtr;
var endValuePtr = valuePtr + valueLength;
```

Now for the first scary bits.  The rest of this method is in an unsafe block which means: we can use pointers, and we have to do our own bounds checking with those pointers.

I take and pin pointers to `value` and `token`, make a copy of the value pointer (so it can be mutated, which pointers declared in a fixed block cannot be), and calculate `endValuePtr` so we can use it to do bounds checks on value.

```while (true)
{
long* tokenLongPtr = (long*)tokenPtr;
{
for (var i = 0; i < tokenLongs; i++)
{
var tokenLong = *tokenLongPtr;

var valueLong = *((long*)curValuePtr);

if (tokenLong == valueLong)
{
tokenLongPtr++;
curValuePtr += charsPerLong;
continue;
}
else
{
}
}
}
```

Here I start comparing the `token` and `value` strings.  Earlier I calculated how many `long`s fit in value, so it’s a simple for loop to make sure they all match.  Each iteration compares 4 characters worth of `value` to `token`, if they match I continue the loop if they don’t we `goto advanceToNextDelimiter`.  This is a big win for larger tokens, having a quarter the comparisons and branches.

Important things to note: the `long*` casts are free, and `curValuePtr` being incremented is faster than you might assume (since `charsPerLong` is a const, the increment value is pre-calculated instead of involving a multiplication).

```int* tokenIntPtr = (int*)tokenLongPtr;
if (tokenTrailingInt)
{
var tokenInt = *tokenIntPtr;

var valueInt = *((int*)curValuePtr);

if (tokenInt == valueInt)
{
tokenIntPtr++;
curValuePtr += charsPerInt;
}
else
{
}
}
```

Then essentially the same check, but for a single `int` now – comparing two characters at a time isn’t as good as four but it’s not nothing.  Whether or not this check was necessary was determined earlier, so `tokenTrailingInt` is unchanging across a single call – this is good for branch prediction.

```char* tokenCharPtr = (char*)tokenIntPtr;
if (tokenTrailingChar)
{
var tokenChar = *tokenCharPtr;

var valueChar = *curValuePtr;

if (tokenChar == valueChar)
{
tokenCharPtr++;
curValuePtr++;
}
else
{
}
}
```

And then again for a single `char`.

```if (curValuePtr == endValuePtr || *curValuePtr == delimiter)
{
return true;
}
```

If this chunk of code is reached then every character in `token` has been matched, all that’s left is to make sure that either a `delimiter` or the end of `value` has been reached.  If it has then we’ve got a match and return true, otherwise we fall through to `advanceToNextDelimiter`.

```advanceToNextDelimiter:

while (true)
{
var curVal = *((int*)curValuePtr);

var masked = curVal ^ delimiterTwice;
var temp = masked & 0x7FFF7FFF;
temp = temp + 0x7FFF7FFF;
temp = (int)(temp & 0x80008000);
temp = temp | 0x7FFF7FFF;
temp = ~temp;
var neitherMatch = temp == 0;

if (neitherMatch)
{
curValuePtr += charsPerInt;
if(curValuePtr >= endValuePtr)
{
return false;
}
continue;
}
```

This code advances `curValuePtr` forward until a `delimiter` is found.  There’s a clever combination of tricks here to allow checking two characters with one branch.

The first trick is advancing two characters at a time (thus the `int*` cast), and being willing to advance 1 character past the end of `value`.  This works because C# guarantees that a string coerced to a char* is terminated with a null character (not a null byte), so there’s always two additional bytes at the end of the string for us to read into whether we need them or not.

The second trick is a series of bit twiddlings, as follows:

1. XOR `delimiterTwice` into `curVal`, making the the top or bottom char-sized chunks of `masked` all 0 if they match `delimiter`
• `delimiterTwice` was precalculated
2. AND `masked` with `0x7FFF7FFF`, clearing the high bits in the top and bottom char-sized chunks in `temp`, store the result in `temp`
3. ADD `0x7FFF7FFF` to `temp`, causing the high bits in the top and bottom char-sized chunks of `temp` to be 1 if any of the low bits in those respective chunks are set
4. AND `temp` with `0x80008000`, clearing all but the high bits in the top and bottom char-sized chunks in `temp`
5. OR `temp` with `masked`, making sure the high bits in each char-sized chunk are set in `temp` when the high-bits in `masked` are.
6. OR `temp` with `0x7FFF7FFF`, set all the low bits in the char-sized chunks in `temp` which will make temp all 1s if both char-sized chunks in `curVal` do not match delimiter
7. INVERT `temp`, making it all 0s if step #5 made it all 1s
8. COMPARE `temp` against 0, setting `neitherMatch` if so

This is followed up with an if that’s taken if `neitherMatch` is set in which I increment the pointer into `value` by two `char`s.  There’s then a bounds check, comparing `curValPtr` to `endValuePtr` (which was pre-calculated for this purpose) and false is returned if the bounds have been exceeded.  Note that `endValuePtr` points at the null-terminator, so bailing when `curValPtr` is equal to it is correct.

```var top16 = temp & 0xFFFF0000;
if (top16 != 0)
{
curValuePtr += charsPerInt;

break;
}

var bottom16 = temp & 0x0000FFFF;
if (bottom16 != 0)
{
curValuePtr += 1;
}
```

If this code is run a `delimiter` has been found but `curValuePtr` isn’t pointing to that character, just to the int that contains it.  `top16` and `bottom16` contain the top and bottom `char`s from `temp`, whichever one isn’t all zeros is the one that contained the `delimiter`.  Depending on which is set, `curValuePtr` is incremented so that it points at the first character after the `delimiter`.

Note because we’re executing on a little-endian architecture but C# is big-endian the increment sizes here appear to be flipped.  Also remember that `token` cannot be empty, and that is checked at the start of the method, so checking the second `char` (in `top16`) first isn’t an error.

```var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);
if (remainingBytesInValue < tokenByteLength)
{
return false;
}
```

This code runs once the next `delimiter` has been found, and `curValuePtr` advanced to one `char` past it.  It is necessary to reassert that there’s enough space remaining in `value` for `token` to fit, just as was asserted back towards the start of the method.

The space check is done in `byte`s instead of `char`s (the casts are free) because counting the number of `char`s between two pointers is a subtraction followed by a division, but counting the number of `byte`s is only a subtraction.  Pre-calculating `tokenByteLength` was done for this purpose.

Then the while loop begins again, and continues until a match is found or value is exhausted.

Aaaaaaand that’s it.  Performance wise, this is about 15x faster than the original ContainsToken code and something like 40% faster than than the next best implementation.

I’m not sure I’d encourage anyone to use this code, as relies on several subtleties which make it hard to understand.  It also isn’t tuned for 32-bit architectures, or correct on big-endian ones.  Both could be fixed, but require some build tweaks which aren’t particularly interesting or portable themselves.

It sure was fun to write though.

### 8 Comments on “An Optimization Exercise”

1. Jimmy says:

I have used the boyer moore algorithm to search for tokens in inbound trading messages for a while now, we process about 10000 trades a second, and search multiple tokens in text and (more frequently) binary messages. I have a .NET algorithm for it. It would be of interest to me to know how your optimisations perform compared to mine.

https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

• The benchmark is up on Github (the trickiest part was getting the benchmark framework built, but it’s included in /lib in my repo) so you can see for yourself.

https://github.com/kevin-montrose/ContainsTokenBenchmark

Adding a new function to benchmark isn’t too hard, should take 20-30 minutes to run. All the standard caveats (RELEASE, x64, disable any power saving stuff, only process doing real work, etc.) apply for benchmarking.

2. Kevin says:

What is going on with the bit twiddling part? You explain what each line of code does, which is not that useful if you know what AND, XOR, OR, etc. does, but not why you do it?

• It lets you check if either half of the int is a char matching delimiter.

Saves a branch, and since the answer is typically no the branch predictor is pretty good at it.

• Kevin says:

I see. Would it work with a long, allowing you to check 4 characters at a time? Probably slower on x86, but would it be faster on x64?

3. Nicholas Wilson says:

How do you deal with alignment issues? I don’t know any C#, but surely the long pointer must be 8-byte aligned, or you won’t be able to read from it on non-x86 architectures.

• I don’t deal with them, for the same reasons I don’t deal with 32-bit or big-endian architectures, I can rely on Stack Overflow running on x64 processors. I note that I’m allow myself to exploit this fact towards the top of the post.

Getting this sort of low-level, crazy optimized, code running on wildly different platforms is something of a fools errand in my opinion. If I wanted something optimized to this extent on ARM, I’d write it from scratch for ARM (and #ifdef it in, or something) – it’s pretty unlikely the “best” x64 solution is also the “best” ARM one.

4. Groo says:

This is very cool, you don’t often get an opportunity to go low level this far with a real justification.

One detail though: you stated that C# is “big-endian” a couple of times, but I don’t think that’s correct (if I understood you correctly). A value of `0xFFFF0000` is a hex representation of the same 32-bit number in any language and doesn’t have any architecture-specific information, and it is this *number* which will get a different value during casting, depending on the architecture. A byte sequence 01-02-03-04 on a little-endian machine will be cast to 0x04030201 in any language (capable of casting). Vice versa, you only need to know the *system* architecture to know that the LSB is the first or last byte in the sequence.