An Optimization Exercise
Posted: 2016/04/26 Filed under: code 8 CommentsNick Craver, one of my coworkers at Stack Overflow, tweets out snippets of the Stack Overflow occasionally. About a week ago he showed off a ContainsToken method which has been tuned for performance. This set off a little bit of a benchmarking contest, which I participated in. My final attempt (which is among the fastest) ended up using a lot of tricks, which I think may be of general interest – so I’m going to walk through how the code works.
First, we need to define the problem. We’re creating a function, which:
- Determines if, when split on a given delimiter, a value string contains a given token
- Ordinal comparison of characters is acceptable (this wasn’t clear from the tweet initially)
- Does not allocate
I’m also allowing for optimizations that exploit Stack Overflow’s production environment, so 64-bit Intel and .NET 4.6+ specific tricks are allowed. Since Stack Overflow already has some unsafe code in it, unsafe is also acceptable..
My full solution is available on GitHub, my solution is in Program.cs along with several others. Credit to mattwarren for creating the Benchmark gist.
Here’s my full solution, sans comments, which I’ll be going over line-by-line.
public static bool ContainsTokenMonty(string value, string token, char delimiter = ';') { const int charsPerLong = sizeof(long) / sizeof(char); const int charsPerInt = sizeof(int) / sizeof(char); const int bytesPerChar = sizeof(char) / sizeof(byte); if (string.IsNullOrEmpty(token)) return false; if (string.IsNullOrEmpty(value)) return false; var delimiterTwice = (delimiter << 16) | delimiter; var valueLength = value.Length; var tokenLength = token.Length; if (tokenLength > valueLength) return false; int tokenLongs; bool tokenTrailingInt, tokenTrailingChar; { var remainingChars = tokenLength; tokenLongs = remainingChars / charsPerLong; tokenTrailingInt = (tokenLength & 0x02) != 0; tokenTrailingChar = (tokenLength & 0x01) != 0; } var tokenByteLength = tokenLength * bytesPerChar; unsafe { fixed (char* valuePtr = value, tokenPtr = token) { var curValuePtr = valuePtr; var endValuePtr = valuePtr + valueLength; while (true) { long* tokenLongPtr = (long*)tokenPtr; { for (var i = 0; i < tokenLongs; i++) { var tokenLong = *tokenLongPtr; var valueLong = *((long*)curValuePtr); if (tokenLong == valueLong) { tokenLongPtr++; curValuePtr += charsPerLong; continue; } else { goto advanceToNextDelimiter; } } } int* tokenIntPtr = (int*)tokenLongPtr; if (tokenTrailingInt) { var tokenInt = *tokenIntPtr; var valueInt = *((int*)curValuePtr); if (tokenInt == valueInt) { tokenIntPtr++; curValuePtr += charsPerInt; } else { goto advanceToNextDelimiter; } } char* tokenCharPtr = (char*)tokenIntPtr; if (tokenTrailingChar) { var tokenChar = *tokenCharPtr; var valueChar = *curValuePtr; if (tokenChar == valueChar) { tokenCharPtr++; curValuePtr++; } else { goto advanceToNextDelimiter; } } if (curValuePtr == endValuePtr || *curValuePtr == delimiter) { return true; } advanceToNextDelimiter: while (true) { var curVal = *((int*)curValuePtr); var masked = curVal ^ delimiterTwice; var temp = masked & 0x7FFF7FFF; temp = temp + 0x7FFF7FFF; temp = (int)(temp & 0x80008000); temp = temp | masked; temp = temp | 0x7FFF7FFF; temp = ~temp; var neitherMatch = temp == 0; if (neitherMatch) { curValuePtr += charsPerInt; if(curValuePtr >= endValuePtr) { return false; } continue; } var top16 = temp & 0xFFFF0000; if (top16 != 0) { curValuePtr += charsPerInt; break; } var bottom16 = temp & 0x0000FFFF; if (bottom16 != 0) { curValuePtr += 1; } } var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr); if (remainingBytesInValue < tokenByteLength) { return false; } } } } }
It’s no accident that this resembles C, performance critical C# often does. Take note that there are no news (or calls out to functions which may allocate), so this code does zero allocations.
The overall algorithm is as follows:
- Scan forward in
token
andvalue
so long as they are equal - If they are not equal, move back to the start of
token
and advance invalue
until after the nextdelimiter
- If there is no next
delimiter
, return false
- If there is no next
- If the end of
token
is reached…- If the next character in
value
isdelimiter
, or the end ofvalue
has been reached, return true - Otherwise, reset to the start of
token
and advance invalue
until after the nextdelimiter
- If the next character in
- If the end of
value
is reached, return false
Now for the line-by-line.
const int charsPerLong = sizeof(long) / sizeof(char); const int charsPerInt = sizeof(int) / sizeof(char); const int bytesPerChar = sizeof(char) / sizeof(byte);
Some convenient constants for later. These could be hardcoded since long, int,and char are always 8, 4, and 2 bytes long in C# – but I find this style easier to reason about.
if (string.IsNullOrEmpty(token)) return false; if (string.IsNullOrEmpty(value)) return false;
This code is copied from the initial tweet, no real optimizations are available here. The check is important to maintain compatibility with the original code.
var delimiterTwice = (delimiter << 16) | delimiter; var valueLength = value.Length; var tokenLength = token.Length;
Some values I’ll need later. Doubling the delimiter will be used to test for it in value, and the lengths will be used to for bounds calculations.
if (tokenLength > valueLength) return false;
The rest of the code assumes that token
is no larger than the remaining parts of value
, so I check it right here to make sure that’s true. If token
couldn’t even fit in value
, we don’t even bother to look at either string and just return false.
int tokenLongs; bool tokenTrailingInt, tokenTrailingChar; { var remainingChars = tokenLength; tokenLongs = remainingChars / charsPerLong; tokenTrailingInt = (tokenLength & 0x02) != 0; tokenTrailingChar = (tokenLength & 0x01) != 0; }
I going to compare token
to value in 8-byte/4-char/1-long chunks, so I need to know how many long
s can fit in token. Because token
isn’t always a multiple of 4 in length, I also need to see if there are 0, 1, 2, or 3 characters after the long
s. Note that the single division is by a constant power-of-two, so it gets compiled into something much faster than actual division.
var tokenByteLength = tokenLength * bytesPerChar;
One more useful value before we start actually comparing strings. Since there will be some pointer math later, knowing the number of bytes (importantly not the number of characters) in token
will come in handy.
unsafe { fixed (char* valuePtr = value, tokenPtr = token) { var curValuePtr = valuePtr; var endValuePtr = valuePtr + valueLength;
Now for the first scary bits. The rest of this method is in an unsafe block which means: we can use pointers, and we have to do our own bounds checking with those pointers.
I take and pin pointers to value
and token
, make a copy of the value pointer (so it can be mutated, which pointers declared in a fixed block cannot be), and calculate endValuePtr
so we can use it to do bounds checks on value.
while (true) { long* tokenLongPtr = (long*)tokenPtr; { for (var i = 0; i < tokenLongs; i++) { var tokenLong = *tokenLongPtr; var valueLong = *((long*)curValuePtr); if (tokenLong == valueLong) { tokenLongPtr++; curValuePtr += charsPerLong; continue; } else { goto advanceToNextDelimiter; } } }
Here I start comparing the token
and value
strings. Earlier I calculated how many long
s fit in value, so it’s a simple for loop to make sure they all match. Each iteration compares 4 characters worth of value
to token
, if they match I continue the loop if they don’t we goto advanceToNextDelimiter
. This is a big win for larger tokens, having a quarter the comparisons and branches.
Important things to note: the long*
casts are free, and curValuePtr
being incremented is faster than you might assume (since charsPerLong
is a const, the increment value is pre-calculated instead of involving a multiplication).
int* tokenIntPtr = (int*)tokenLongPtr; if (tokenTrailingInt) { var tokenInt = *tokenIntPtr; var valueInt = *((int*)curValuePtr); if (tokenInt == valueInt) { tokenIntPtr++; curValuePtr += charsPerInt; } else { goto advanceToNextDelimiter; } }
Then essentially the same check, but for a single int
now – comparing two characters at a time isn’t as good as four but it’s not nothing. Whether or not this check was necessary was determined earlier, so tokenTrailingInt
is unchanging across a single call – this is good for branch prediction.
char* tokenCharPtr = (char*)tokenIntPtr; if (tokenTrailingChar) { var tokenChar = *tokenCharPtr; var valueChar = *curValuePtr; if (tokenChar == valueChar) { tokenCharPtr++; curValuePtr++; } else { goto advanceToNextDelimiter; } }
And then again for a single char
.
if (curValuePtr == endValuePtr || *curValuePtr == delimiter) { return true; }
If this chunk of code is reached then every character in token
has been matched, all that’s left is to make sure that either a delimiter
or the end of value
has been reached. If it has then we’ve got a match and return true, otherwise we fall through to advanceToNextDelimiter
.
advanceToNextDelimiter: while (true) { var curVal = *((int*)curValuePtr); var masked = curVal ^ delimiterTwice; var temp = masked & 0x7FFF7FFF; temp = temp + 0x7FFF7FFF; temp = (int)(temp & 0x80008000); temp = temp | masked; temp = temp | 0x7FFF7FFF; temp = ~temp; var neitherMatch = temp == 0; if (neitherMatch) { curValuePtr += charsPerInt; if(curValuePtr >= endValuePtr) { return false; } continue; }
This code advances curValuePtr
forward until a delimiter
is found. There’s a clever combination of tricks here to allow checking two characters with one branch.
The first trick is advancing two characters at a time (thus the int*
cast), and being willing to advance 1 character past the end of value
. This works because C# guarantees that a string coerced to a char* is terminated with a null character (not a null byte), so there’s always two additional bytes at the end of the string for us to read into whether we need them or not.
The second trick is a series of bit twiddlings, as follows:
- XOR
delimiterTwice
intocurVal
, making the the top or bottom char-sized chunks ofmasked
all 0 if they matchdelimiter
delimiterTwice
was precalculated
- AND
masked
with0x7FFF7FFF
, clearing the high bits in the top and bottom char-sized chunks intemp
, store the result intemp
- ADD
0x7FFF7FFF
totemp
, causing the high bits in the top and bottom char-sized chunks oftemp
to be 1 if any of the low bits in those respective chunks are set - AND
temp
with0x80008000
, clearing all but the high bits in the top and bottom char-sized chunks intemp
- OR
temp
withmasked
, making sure the high bits in each char-sized chunk are set intemp
when the high-bits inmasked
are. - OR
temp
with0x7FFF7FFF
, set all the low bits in the char-sized chunks intemp
which will make temp all 1s if both char-sized chunks incurVal
do not match delimiter - INVERT
temp
, making it all 0s if step #5 made it all 1s - COMPARE
temp
against 0, settingneitherMatch
if so
This is followed up with an if that’s taken if neitherMatch
is set in which I increment the pointer into value
by two char
s. There’s then a bounds check, comparing curValPtr
to endValuePtr
(which was pre-calculated for this purpose) and false is returned if the bounds have been exceeded. Note that endValuePtr
points at the null-terminator, so bailing when curValPtr
is equal to it is correct.
var top16 = temp & 0xFFFF0000; if (top16 != 0) { curValuePtr += charsPerInt; break; } var bottom16 = temp & 0x0000FFFF; if (bottom16 != 0) { curValuePtr += 1; }
If this code is run a delimiter
has been found but curValuePtr
isn’t pointing to that character, just to the int that contains it. top16
and bottom16
contain the top and bottom char
s from temp
, whichever one isn’t all zeros is the one that contained the delimiter
. Depending on which is set, curValuePtr
is incremented so that it points at the first character after the delimiter
.
Note because we’re executing on a little-endian architecture but C# is big-endian the increment sizes here appear to be flipped. Also remember that token
cannot be empty, and that is checked at the start of the method, so checking the second char
(in top16
) first isn’t an error.
var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr); if (remainingBytesInValue < tokenByteLength) { return false; }
This code runs once the next delimiter
has been found, and curValuePtr
advanced to one char
past it. It is necessary to reassert that there’s enough space remaining in value
for token
to fit, just as was asserted back towards the start of the method.
The space check is done in byte
s instead of char
s (the casts are free) because counting the number of char
s between two pointers is a subtraction followed by a division, but counting the number of byte
s is only a subtraction. Pre-calculating tokenByteLength
was done for this purpose.
Then the while loop begins again, and continues until a match is found or value is exhausted.
Aaaaaaand that’s it. Performance wise, this is about 15x faster than the original ContainsToken code and something like 40% faster than than the next best implementation.
I’m not sure I’d encourage anyone to use this code, as relies on several subtleties which make it hard to understand. It also isn’t tuned for 32-bit architectures, or correct on big-endian ones. Both could be fixed, but require some build tweaks which aren’t particularly interesting or portable themselves.
It sure was fun to write though.