A (Nearly) Branchless RESP Request Parser

I’ve been messing around with Garnet at work a lot lately – a Redis compatible, highly scalable, mostly C# remote cache.  This brought an old idea back to mind, and I decided to actually hash it out this time – can you write a branchless RESP request parser, and would it be worth it?

Framing The Problem

Defining some terms up front:

  • RESP – the REdis Serialization Protocol, a human readable protocol for communication with Redis and related (Valkey, DragonflyDB, the aforementioned Garnet) databases/caches.  Tends to look like +OK\r\n or $5\r\nhello\r\n or *1\r\n$5\r\nworld\r\n .
  • RESP Request – A RESP array of at least 1 item, where every element is a bulk (ie. length prefixed) string, and the first element is a (case insensitive) command name.  So “get the value under ‘fizz’” looks like *2\r\n$3\r\nGET\r\n$4\r\nfizz\r\n .  Nearly all operations against a Redis compatible server will be RESP requests.
  • Branchless code – Code which, once compiled/JIT’d/etc., has no branching assembly instructions like JMP, JGE, CALL, etc.  For C# the “once JIT’d” is an important distinction, as the intermediate code generated by the compiler may or may not have branches, but I only care about what ultimately executes.
  • Nearly branchless code – A term I’m inventing right here, to mean code that has a small number of necessarily highly predictable branches.  Sometimes it’s just “worth it” to have a branch because the purely branchless code is so much worse – typically this is in cases where there’s some edge case that needs handling for completeness, but which happens rarely-bordering-on-never in practice.

What’s “worth it?” – While this is inherently a judgement call, I will compare my approach to a naive RESP request parser and one based on Garnet’s (which is decidedly not naive) on metrics of code size, performance, and maintainability.  Spoiler – it’s probably not.

The interface I’ll implement is as follows:

public static void Parse(
    int commandBufferAllocatedSize,
    Span<byte> commandBufferTotalSpan,
    int commandBufferFilledBytes,
    int scratchBufferAllocatedSize,
    Span<byte> scratchBuffer,
    Span<ParsedRespCommandOrArgument> commandsSpan,
    out int commandsSlotsUsed,
    out int bytesConsumed
);

public struct ParsedRespCommandOrArgument
{
    private readonly int argumentCount;
    private readonly RespCommand command;
    private readonly int byteStart;
    private readonly int byteEnd;
}

public enum RespCommand { /* Each Valid Command Has Entry */ }

Command and scratch buffer sizes are determined by the implementation, not the caller, and created with helpers.  We take the two sizes mostly for integrity checking in DEBUG builds, but they do have some real uses.

The Parse(…) method takes the input buffer and parses ALL of the RESP requests in the filled portion.  I do this because, IMO, if you aren’t pipelining your requests to Redis (or similar) then you don’t care about performance. It follows that in a real world workload the receive buffer will have multiple commands in it.

Incomplete commands terminate parsing, as does completely filling the output span.  A malformed command results in a sentinel being written to the output after any valid commands that preceded it, and parsing terminates.  Incomplete commands are typically encountered because some buffer in transmission (on either client or server) filled up, which isn’t guaranteed to ever happen but is hardly uncommon.
When Parse(…) returns, commandsSlotsUsed indicates how many slots were filled, which covers all complete commands plus (optionally) one malformed sentinel.  Likewise, bytesConsumed counts all bytes from fully parsed, valid commands.
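
To make the contract concrete, a hypothetical caller might look like the following (Dispatch and the struct accessors are invented for illustration – the real struct’s fields are private):

RespParserFinal.Parse(
    commandBufferAllocatedSize, commandBufferTotalSpan, commandBufferFilledBytes,
    scratchBufferAllocatedSize, scratchBuffer,
    commandsSpan,
    out var commandsSlotsUsed,
    out var bytesConsumed
);

var slot = 0;
while (slot < commandsSlotsUsed)
{
    var cmd = commandsSpan[slot];
    if (cmd.IsMalformed) // hypothetical accessor
    {
        // the sentinel follows any valid commands; fail the connection or surface an error
        break;
    }

    // the command's arguments occupy the next (ArgumentCount - 1) slots
    Dispatch(cmd.Command, commandsSpan.Slice(slot + 1, cmd.ArgumentCount - 1));
    slot += cmd.ArgumentCount;
}

// everything past bytesConsumed is an incomplete command; carry it over to the next read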

Insights and Tricks

Before diving into the implementation, there are a few key insights I exploited:

  • The minimum valid RESP request is 4 bytes: *0\r\n
    • This is arguably invalid since all requests should have a command, but using 4 lets us reject sooner
  • Although RESP has many sigils, for requests only digits are of variable length
  • Accordingly, we know exactly when all other sigils (in this case, *, $, and \r\n) are expected
  • The maximum length of any valid digit sequence is 10, just large enough to hold 2^31-1
  • If we have a bitmap that represents which bytes are digits, we can figure out how long that run of digits is with some shifting, a bitwise invert, and a trailing zero count
  • The possible commands for a RESP request are known in advance, so a custom hashing function can be used as part of validation – though care must be taken not to accept unexpected values
  • We can treat running out of data the same as encountering some other error, except we don’t want to include a sentinel indicating malformed in the output
  • ParsedRespCommandOrArgument is laid out so that byteStart == byteEnd == 0 represents malformed, command > 0 represents the start of a command request, and anything else represents an argument in a request
  • If our “in error” state is an int of all 0s (for false) or all 1s (for true), we can prevent or roll back changes with clever bitwise ANDs and ORs, as sketched below
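
A minimal sketch of that last trick (variable names invented, standing in for the real parser state):

uint inErrorState = 0;

// set the mask iff value < 0: an arithmetic shift smears the sign bit across all 32 bits
int value = -5; // stand-in for some parsed length
inErrorState |= (uint)(value >> 31);

// advance an offset by 4 only when NOT in an error state
int offset = 0;
offset += (int)(4 & ~inErrorState);

// keep oldValue when in an error state, otherwise take newValue
uint oldValue = 1, newValue = 2;
var blended = (oldValue & inErrorState) | (newValue & ~inErrorState);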

The Branches

The code is dense and full of terrors, so although I will go over it block by block below I doubt I can make it easy to follow.  To emphasize the cheats that make this “nearly” branchless, I’m listing the explicit branches and their justifications here:

  • When we build a bitmap of digits, there’s a loop that terminates once we’ve passed over the whole set of read bytes
    • Because we require the command buffer to be a multiple of 64-bytes in size, the contents of this loop can be processed entirely with SIMD ops.  It still needs to conditionally terminate because the buffer size is determined at runtime, and it is difficult-bordering-on-impossible to guarantee the buffer will always be full.
  • There’s a loop that terminates if we hit its end with the “in error” state mask set, this is used to terminate parsing when we either encounter the end of the available data or a malformed request is encountered
    • The number of requests in each buffer read is expected to be relatively predictable given a workload, but highly variable between workloads.  This means we basically HAVE to rely on the branch predictor.
    • However, it is expected that almost all request buffers will be full of complete, valid, requests, so we never exit the loop early – the only check is the final one in the do { ... } while (...).
  • When parsing numbers, we have a fast SWAR path for values <= 9,999 and a slower SIMD path for all larger numbers.
    • In almost all cases, lengths are expected to be less than 9,999 so this branch should be highly selective.  We must still handle larger numbers for correctness.
  • When parsing the length of an array, we have an even faster fast path for lengths <= 8 and then fall back to the number parsing code above (which itself has a fast and slow path).
    • RESP workloads tend to be dominated by single key get/set workloads, which all fit in fewer than 9 arguments (counting the command as an argument).
    • Accordingly, this branch should be highly selective.

This amounts to 5 branches total, not counting the call into the Parse(...) function or the return out of it.  The JIT (especially with PGO) will introduce some back edges, but we expect only a factor of ~2x from those.

The Code

I’ve elided some asserts and comments, full code is on GitHub.

Build The Bitmap

var zeros = Vector512.Create(ZERO);
var nines = Vector512.Create(NINE);

ref var cmdEnd = ref Unsafe.Add(ref commandBuffer, commandBufferFilledBytes);

ref var curBitmap = ref digitsBitmap;

// We always go at least one round, so elide a comparison
do
{
    var data = Vector512.LoadUnsafe(ref commandBuffer);

    var d0 = Vector512.GreaterThanOrEqual(data, zeros);
    var d1 = Vector512.LessThanOrEqual(data, nines);

    var digits = Vector512.BitwiseAnd(d0, d1);
    var digitsPacked = Vector512.ExtractMostSignificantBits(digits);

    Unsafe.As<byte, ulong>(ref curBitmap) = digitsPacked;

    commandBuffer = ref Unsafe.Add(ref commandBuffer, Vector512<byte>.Count);
    curBitmap = ref Unsafe.Add(ref curBitmap, sizeof(ulong));
} while (Unsafe.IsAddressLessThan(ref commandBuffer, ref cmdEnd));

First, I build the aforementioned bitmap where all [0-9] characters get 1-bits.  This loop exploits the over-allocation of the command buffer to only work in 64-byte chunks (using .NET’s Vector512<T> to abstract over the actual vector instruction set).
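
As a worked example of the digit-run trick from the insights list: for a buffer starting *137\r\n, bits 1 through 3 of the bitmap are set, and the run length falls out of a shift, an invert, and a trailing zero count – exactly what the slow length paths below do:

// bitmap for "*137\r\n..." (bit 0 is '*', bits 1-3 are the digits, bit 4 is '\r')
var digitsBitmapValue = 0b0_1110u >> 1; // shift so the run starts at bit 0 => 0b111
var digitCount = (uint)BitOperations.TrailingZeroCount(~digitsBitmapValue); // ~0b111 has 3 trailing 0s => 3
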
Then, after some basic setup, we enter the per-command loop.  This is broken into 4 steps: initial error checking, parsing the array size, extracting each string, and then parsing the command.

Per-Command Loop

const int MinimumCommandSize = 1 + 1 + 2;
const uint CommonArrayLengthSuffix = 0x240A_0D31;  // $\n\r1
const uint CommonCommandLengthPrefix = 0x0A0D_3024; // \n\r0$

// Track how many bytes we need to rollback the command buffer ref
var rollbackCommandRefCount = 0;
// Track how many ints we need to rollback the into ref
var rollbackWriteIntoRefCount = 0;

var oldErrorState = inErrorState;
// first check that there's sufficient space for a command
var availableBytes = commandBufferFilledBytes - (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);
ranOutOfData |= (uint)((availableBytes - MinimumCommandSize) >> 31);    // availableBytes < MinimumCommandSize
inErrorState |= ranOutOfData;

// check that we can even write a malformed, if needed
ranOutOfData |= (uint)((~(-(int)Unsafe.ByteOffset(ref currentIntoRef, ref intoAsIntsAllocatedEnd))) >> 31);
inErrorState |= ranOutOfData;

// check for *
inErrorState |= (uint)((-(currentCommandRef ^ ArrayStart)) >> 31);

// move past the *
currentCommandRef = ref Unsafe.Add(ref currentCommandRef, 1);
rollbackCommandRefCount++;

The initial error check updates our state flags if we have insufficient data to proceed, insufficient space to store a result, or a missing * (the sigil for RESP arrays).  These patterns for turning conditionals into all 1s occur throughout the rest of the code, so get used to seeing them.
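
In isolation, the two most common mask builders look like this (a and b standing in for small non-negative values like bytes and lengths, so the arithmetic can't overflow):

int a = 5, b = 7;

// all 1s iff a != b: a ^ b is non-zero, negating it sets the sign bit,
// and the arithmetic shift smears that bit across the whole register
var notEqualMask = (uint)((-(a ^ b)) >> 31);

// all 1s iff a < b: a - b is negative exactly when a < b
var lessThanMask = (uint)((a - b) >> 31);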

var arrayLengthProbe = Unsafe.As<byte, uint>(ref currentCommandRef);
arrayLengthProbe -= CommonArrayLengthSuffix;

// Ultra fast check for array length
int arrayLength;
if (arrayLengthProbe < 8)
{
    arrayLength = (int)(arrayLengthProbe + 1);

    currentCommandRef = ref Unsafe.Add(ref currentCommandRef, 3);
    rollbackCommandRefCount += 3;
}
else
{
    // Slower check for array length
    {
        var commandBufferIx = (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);
        var startByteIndex = commandBufferIx / 8;
        var startBitIndex = commandBufferIx % 8;

        var digitsBitmapValue = Unsafe.As<byte, uint>(ref Unsafe.Add(ref digitsBitmap, startByteIndex)) >> startBitIndex;
        var digitCount = (uint)BitOperations.TrailingZeroCount(~digitsBitmapValue);

        // we expect no more than 10 digits (int.MaxValue is 2,147,483,647), and at least 1 digit
        inErrorState |= ((10 - digitCount) >> 31);  // digitCount >= 11

        // check for \r\n
        ref var expectedCrLfRef = ref Unsafe.Add(ref currentCommandRef, digitCount);
        var expectedCrPosition = (int)Unsafe.ByteOffset(ref commandBuffer, ref expectedCrLfRef);
        ranOutOfData |= (uint)((commandBufferFilledBytes - (expectedCrPosition + 2)) >> 31);    // expectedCrPosition + 2 > commandBufferFilledBytes
        inErrorState |= ranOutOfData;

        inErrorState |= (uint)((0 - (Unsafe.As<byte, ushort>(ref expectedCrLfRef) ^ CRLF)) >> 31);    // expectedCrLfRef != CRLF;

        ref var digitsStartRef = ref currentCommandRef;
        digitsStartRef = ref Unsafe.Subtract(ref digitsStartRef, (int)(inErrorState & rollbackCommandRefCount));
        ref var digitsEndRef = ref expectedCrLfRef;
        digitsEndRef = ref Unsafe.Subtract(ref digitsEndRef, (int)(inErrorState & (rollbackCommandRefCount + digitCount)));

        // Having found the bounds of the number, parse it
        {
            var length = (int)Unsafe.ByteOffset(ref digitsStartRef, ref digitsEndRef);

            var multLookupIx = length * 12;
            ref var multStart = ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(UnconditionalParsePositiveIntMultiples), multLookupIx);

            // Fast-ish approach for numbers <= 9,999
            if (length <= 4)
            {
                // load all the digits (and some extra junk, maybe)
                var fastPathDigits = Unsafe.As<byte, uint>(ref digitsStartRef);

                // reverse so we can pad more easily
                fastPathDigits = BinaryPrimitives.ReverseEndianness(fastPathDigits); // this is an intrinsic, so should be a single op

                // shift right to pad with zeros
                fastPathDigits = fastPathDigits >> (32 - (8 * length));

                var onesAndHundreds = fastPathDigits & ~0xFF_30_FF_30U;
                var tensAndThousands = fastPathDigits & ~0x30_FF_30_FFU;

                var mixedOnce = (tensAndThousands * 10U) + (onesAndHundreds << 8);

                ulong topPair = mixedOnce & 0xFF_FF_00_00U;
                ulong lowPair = mixedOnce & 0x00_00_FF_FFU;

                var mixedTwice = (topPair * 100U) + (lowPair << 16);

                var result = (int)(mixedTwice >> 24);

                // leading zero check, force result to -1 if below expected value
                result |= ((-(length ^ 1)) >> 31) & ((result - (int)multStart) >> 31);

                arrayLength = result;
            }
            else
            {
                // Slow path for numbers >= 10,000, uses SIMD and lookup tables to remain fast and branchless

                var maskLookupIx = length * 16;

                ref var maskStart = ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(UnconditionalParsePositiveIntLookup), maskLookupIx);

                var mask = Vector128.LoadUnsafe(ref maskStart);

                // load 16 bytes of data, which will have \r\n and some trailing garbage
                var data = Vector128.LoadUnsafe(ref digitsStartRef);

                var toUseBytes = Vector128.BitwiseAnd(data, mask);                        // bytes 0-9 are (potentially) valid

                // expand so we can multiply cleanly
                var (d0To7, d8To15) = Vector128.Widen(toUseBytes);
                var (d0To3, d4To7) = Vector128.Widen(d0To7);
                var (d8To11, _) = Vector128.Widen(d8To15);

                // load per digit multiples
                var scale0To3 = Vector128.LoadUnsafe(ref Unsafe.Add(ref multStart, 0));
                var scale4To7 = Vector128.LoadUnsafe(ref Unsafe.Add(ref multStart, 4));
                var scale8To11 = Vector128.LoadUnsafe(ref Unsafe.Add(ref multStart, 8));

                // scale each digit by its place value
                // 
                // at maximum length, m0To3[0] might overflow
                var m0To3 = Vector128.Multiply(d0To3, scale0To3);
                var m4To7 = Vector128.Multiply(d4To7, scale4To7);
                var m8To11 = Vector128.Multiply(d8To11, scale8To11);

                var maybeMultOverflow = Vector128.LessThan(m0To3, scale0To3);

                // calculate the vertical sum
                // 
                // vertical sum cannot overflow (assuming m0To3 didn't overflow)
                var verticalSum = Vector128.Add(Vector128.Add(m0To3, m4To7), m8To11);

                // calculate the horizontal sum
                //
                // horizontal sum can overflow, even if m0To3 didn't overflow
                var horizontalSum = Vector128.Dot(verticalSum, Vector128<uint>.One);

                var maybeHorizontalOverflow = Vector128.GreaterThan(verticalSum, Vector128.CreateScalar(horizontalSum));

                var maybeOverflow = Vector128.BitwiseOr(maybeMultOverflow, maybeHorizontalOverflow);

                var res = horizontalSum;

                // check for overflow
                //
                // note that overflow can only happen when length == 10
                // only the first digit can overflow on its own (if it's > 2)
                // everything else has to overflow when summed
                // 
                // we know length is <= 10, we can exploit this in the == check
                var allOnesIfOverflow = (uint)(((9 - length) >> 31) & ((0L - maybeOverflow.GetElement(0)) >> 63));  // maybeOverflow.GetElement(0) != 0 & length == 10

                // check for leading zeros
                var allOnesIfLeadingZeros = (uint)(((-(length ^ 1)) >> 31) & (((int)res - (int)multStart) >> 31)); // length != 1 & multStart > (int)res

                // turn into -1 if things are invalid
                res |= allOnesIfOverflow | allOnesIfLeadingZeros;

                arrayLength = (int)res;
            }
        }
        inErrorState |= (uint)(arrayLength >> 31);    // value < 0

        // update to point one past the \r\n if we're not in an error state
        currentCommandRef = ref Unsafe.Add(ref currentCommandRef, (int)((digitCount + 2) & ~inErrorState));
        rollbackCommandRefCount += (int)((digitCount + 2) & ~inErrorState);
    }

    inErrorState |= (uint)((arrayLength - 1) >> 31);    // arrayLength < 1
}

Then we proceed into parsing the actual array length.  We use one of our branches to special case very short arrays (8 or fewer items), then handle reasonably sized arrays (<= 9,999), and then finally handle quite large arrays (10,000+).

One subtlety here is that RESP forbids leading zeros in lengths, something many parsers get wrong.  Accordingly we have to do some range checks in all paths except the fastest path – both slower paths use a lookup to determine if the final result is lower than the lowest legal value for the extracted length.
At the end of parsing the array length, we have some number of items to parse AND might have set inErrorState or ranOutOfData to all 1s.
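
To make the SWAR path concrete, here's “1234” traced through it:

// after the load, reverse, and shift: fastPathDigits = 0x31_32_33_34 ('1','2','3','4')
// onesAndHundreds  = 0x31323334 & 0x00CF00CF = 0x0002_0004         (digits 2 and 4)
// tensAndThousands = 0x31323334 & 0xCF00CF00 = 0x0100_0300         (digits 1 and 3)
// mixedOnce  = 0x01000300 * 10 + (0x00020004 << 8) = 0x0C00_2200   (pairs 12 and 34)
// topPair    = 0x0C00_0000, lowPair = 0x2200
// mixedTwice = 0x0C000000 * 100 + (0x2200 << 16) = 0x4_D200_0000
// result     = 0x4_D200_0000 >> 24 = 0x4D2 = 1234
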

// first string will be a command, so we can do a smarter length check MOST of the time
var cmdLengthProbe = Unsafe.As<byte, uint>(ref currentCommandRef);
cmdLengthProbe -= CommonCommandLengthPrefix;
cmdLengthProbe = BitOperations.RotateRight(cmdLengthProbe, 8);
if (cmdLengthProbe < 10)
{
    // handle the common case where command length is < 10

    ranOutOfData |= (uint)~(((int)-Unsafe.ByteOffset(ref currentIntoRef, ref intoAsIntsAllocatedEnd)) >> 31);                // currentIntoRef == intoAsIntsAllocatedEnd

    var byteAfterCurrentCommandRef = (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);
    ranOutOfData |= (uint)(((commandBufferFilledBytes - byteAfterCurrentCommandRef) - 4) >> 31);    // (commandBufferFilledBytes - byteAfterCurrentCommandRef) < 4  (considering $\d\r\n)
    inErrorState |= ranOutOfData;

    // if we're out of data, we also need an extra bit of rollback to prevent writing past the end of the output
    var outOfDataRollback = (int)ranOutOfData & 4;

    // get a ref which is rolled back to the original value IF we're in an errorState
    ref var commandIntoRef = ref Unsafe.Subtract(ref currentIntoRef, (rollbackWriteIntoRefCount & (int)inErrorState) + outOfDataRollback);
    var commonCommandStart = (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef) + 4;
    var commonCommandEnd = (int)(commonCommandStart + cmdLengthProbe + 2);

    ranOutOfData |= (uint)((commandBufferFilledBytes - commonCommandEnd) >> 31);    // commonCommandEnd > commandBufferFilledBytes
    inErrorState |= ranOutOfData;

    // check for \r\n
    var trailingCrLfOffset = ~inErrorState & (4 + cmdLengthProbe);
    ref var trailingCrLfRef = ref Unsafe.Add(ref currentCommandRef, trailingCrLfOffset);
    var trailingCrLf = Unsafe.As<byte, ushort>(ref trailingCrLfRef);
    inErrorState |= (uint)((-(trailingCrLf ^ CRLF)) >> 31);  // trailingCrLf != CRLF

    ref var commonCmdAndArgCountRef = ref Unsafe.As<int, long>(ref Unsafe.Add(ref commandIntoRef, ParsedRespCommandOrArgument.ArgumentCountIx));
    ref var commonByteStartAndEndRef = ref Unsafe.As<int, long>(ref Unsafe.Add(ref commandIntoRef, ParsedRespCommandOrArgument.ByteStartIx));

    // clear argCount & cmd if not in error state
    commonCmdAndArgCountRef &= (int)inErrorState;

    // leave unmodified if in error state, or set to commonCommandStart & commonCommandEnd
    var packedStartAndEnd = (long)(((uint)commonCommandStart) | (((ulong)(uint)commonCommandEnd) << 32));
    commonByteStartAndEndRef = (commonByteStartAndEndRef & (long)(int)inErrorState) | (packedStartAndEnd & ~(long)(int)inErrorState);

    // advance writeNextResultInto if not in error state
    rollbackWriteIntoRefCount += (int)(4 & ~inErrorState);
    currentIntoRef = ref Unsafe.Add(ref currentIntoRef, (int)(4 & ~inErrorState));
    remainingArrayItems--;

    // update currentCommandBufferIx if not in error state
    currentCommandRef = ref Unsafe.Add(ref currentCommandRef, (int)(~inErrorState & (4 + cmdLengthProbe + 2)));
    rollbackCommandRefCount += (int)(~inErrorState & (4 + cmdLengthProbe + 2));
}

Next we optimistically try to read a string with length < 10.  This will almost always succeed, since the expected string will be a command name AND most RESP commands are short.  There’s no fallback for longer strings, because we’ll handle those in a more general loop that follows.
Note that the bounds of the parsed string are written into the results, but only if inErrorState is all 0s – otherwise we leave the values unmodified through bit twiddling trickery.  Also observe that we actually update the ParsedRespCommandOrArgument fields as longs instead of the individual ints they’ve been decomposed into, as a small efficiency hack.
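
The probe at the top of that block is its own neat trick – it validates the sigil, a single length digit, and the surrounding \r\n bytes with one subtract, one rotate, and one compare.  A standalone sketch (BinaryPrimitives standing in for the raw Unsafe load):

ReadOnlySpan<byte> header = "$3\r\nGET\r\n"u8;

var probe = BinaryPrimitives.ReadUInt32LittleEndian(header); // 0x0A0D_3324 for "$3\r\n"
probe -= CommonCommandLengthPrefix;                          // '$'-'$'=0, '3'-'0'=3, '\r'-'\r'=0, '\n'-'\n'=0
probe = BitOperations.RotateRight(probe, 8);                 // move the length digit into the low byte
var isSingleDigitHeader = probe < 10;                        // any mismatched byte leaves high bits set
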

// Loop for handling remaining strings
// 
// This always executes at least once, but remainingArrayItems == 0 will prevent
// it from taking effect for the (rare) commands that take no arguments
do
{
    // Read one string
    {
        const int MinimumBulkStringSize = 1 + 1 + 2 + 2; // $0\r\n\r\n

        // if we've read everything, we need to just do nothing (including updating inErrorState and ranOutOfData)
        // but still run through everything unconditionally
        var readAllItems = (uint)~((-remainingArrayItems) >> 31);  // remainingArrayItems == 0
        var oldInErrorState = inErrorState;
        var oldRanOutOfData = ranOutOfData;

        // we need to prevent action now, so act like we're in an error state
        inErrorState |= readAllItems;

        // check that there's enough data to even fit a string
        var availableBytesSub = commandBufferFilledBytes - (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);
        ranOutOfData |= (uint)((availableBytesSub - MinimumBulkStringSize) >> 31); // availableBytes < MinimumBulkStringSize
        inErrorState |= ranOutOfData;

        // check for $
        ref var bulkStringStartExpectedRef = ref currentCommandRef;

        inErrorState |= (uint)((-(bulkStringStartExpectedRef ^ BulkStringStart)) >> 31); // (ref commandBuffer + bulkStringStartExpectedIx) != BulkStringStart

        // move past $
        currentCommandRef = ref Unsafe.Add(ref currentCommandRef, 1);
        rollbackCommandRefCount++;

        // Parse the length again, this is a duplicate of the array length logic w/o the ultra fast path for lengths <=9
        //
        // Most values are not going to be that small, unlike most arrays
        int bulkStringLength;
        {
            var commandBufferIx = (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);
            var startByteIndex = commandBufferIx / 8;
            var startBitIndex = commandBufferIx % 8;

            var digitsBitmapValue = Unsafe.As<byte, uint>(ref Unsafe.Add(ref digitsBitmap, startByteIndex)) >> startBitIndex;
            var digitCount = (uint)BitOperations.TrailingZeroCount(~digitsBitmapValue);

            inErrorState |= ((10 - digitCount) >> 31);  // digitCount >= 11

            ref var expectedCrLfRef = ref Unsafe.Add(ref currentCommandRef, digitCount);
            var expectedCrPosition_inline = (int)Unsafe.ByteOffset(ref commandBuffer, ref expectedCrLfRef);
            ranOutOfData |= (uint)((commandBufferFilledBytes - (expectedCrPosition_inline + 2)) >> 31);    // expectedCrPosition + 2 > commandBufferFilledBytes
            inErrorState |= ranOutOfData;

            inErrorState |= (uint)((-(Unsafe.As<byte, ushort>(ref expectedCrLfRef) ^ CRLF)) >> 31);    // expectedCrLfRef != CRLF;

            ref var digitsStartRef = ref currentCommandRef;
            digitsStartRef = ref Unsafe.Subtract(ref digitsStartRef, (int)(inErrorState & rollbackCommandRefCount));
            ref var digitsEndRef = ref expectedCrLfRef;
            digitsEndRef = ref Unsafe.Subtract(ref digitsEndRef, (int)(inErrorState & (rollbackCommandRefCount + digitCount)));

            // actually do the parsing
            {
                var length = (int)Unsafe.ByteOffset(ref digitsStartRef, ref digitsEndRef);

                var multLookupIx = length * 12;
                ref var multStart = ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(UnconditionalParsePositiveIntMultiples), multLookupIx);

                if (length <= 4)
                {
                    // load all the digits (and some extra junk, maybe)
                    var fastPathDigits = Unsafe.As<byte, uint>(ref digitsStartRef);

                    // reverse so we can pad more easily
                    fastPathDigits = BinaryPrimitives.ReverseEndianness(fastPathDigits); // this is an intrinsic, so should be a single op

                    // shift right to pad with zeros
                    fastPathDigits = fastPathDigits >> (32 - (8 * length));

                    var onesAndHundreds = fastPathDigits & ~0xFF_30_FF_30U;
                    var tensAndThousands = fastPathDigits & ~0x30_FF_30_FFU;

                    var mixedOnce = (tensAndThousands * 10U) + (onesAndHundreds << 8);

                    ulong topPair = mixedOnce & 0xFF_FF_00_00U;
                    ulong lowPair = mixedOnce & 0x00_00_FF_FFU;

                    var mixedTwice = (topPair * 100U) + (lowPair << 16);

                    var result = (int)(mixedTwice >> 24);

                    result |= ((-(length ^ 1)) >> 31) & ((result - (int)multStart) >> 31);   // length != 1 & multStart > result;

                    bulkStringLength = result;
                }
                else
                {
                    var maskLookupIx = length * 16;

                    ref var maskStart = ref Unsafe.Add(ref MemoryMarshal.GetArrayDataReference(UnconditionalParsePositiveIntLookup), maskLookupIx);

                    var mask = Vector128.LoadUnsafe(ref maskStart);

                    // load 16 bytes of data, which will have \r\n and some trailing garbage
                    var data = Vector128.LoadUnsafe(ref digitsStartRef);

                    var toUseBytes = Vector128.BitwiseAnd(data, mask);                        // bytes 0-9 are (potentially) valid

                    var (d0To7, d8To15) = Vector128.Widen(toUseBytes);
                    var (d0To3, d4To7) = Vector128.Widen(d0To7);
                    var (d8To11, _) = Vector128.Widen(d8To15);

                    var scale0To3 = Vector128.LoadUnsafe(ref Unsafe.Add(ref multStart, 0));
                    var scale4To7 = Vector128.LoadUnsafe(ref Unsafe.Add(ref multStart, 4));
                    var scale8To11 = Vector128.LoadUnsafe(ref Unsafe.Add(ref multStart, 8));

                    var m0To3 = Vector128.Multiply(d0To3, scale0To3);
                    var m4To7 = Vector128.Multiply(d4To7, scale4To7);
                    var m8To11 = Vector128.Multiply(d8To11, scale8To11);

                    var maybeMultOverflow = Vector128.LessThan(m0To3, scale0To3);

                    var verticalSum = Vector128.Add(Vector128.Add(m0To3, m4To7), m8To11);

                    var horizontalSum = Vector128.Dot(verticalSum, Vector128<uint>.One);

                    var maybeHorizontalOverflow = Vector128.GreaterThan(verticalSum, Vector128.CreateScalar(horizontalSum));

                    var maybeOverflow = Vector128.BitwiseOr(maybeMultOverflow, maybeHorizontalOverflow);

                    var res = horizontalSum;

                    var allOnesIfOverflow = (uint)(((9 - length) >> 31) & ((0L - maybeOverflow.GetElement(0)) >> 63));  // maybeOverflow.GetElement(0) != 0 & length == 10

                    var allOnesIfLeadingZeros = (uint)(((-(length ^ 1)) >> 31) & (((int)res - (int)multStart) >> 31)); // length != 1 & multStart > (int)res

                    res |= allOnesIfOverflow | allOnesIfLeadingZeros;

                    bulkStringLength = (int)res;
                }
            }
            inErrorState |= (uint)(bulkStringLength >> 31);    // value < 0

            // update to point one past the \r\n if we're not in an error state
            currentCommandRef = ref Unsafe.Add(ref currentCommandRef, (int)((digitCount + 2) & ~inErrorState));
            rollbackCommandRefCount += (int)((digitCount + 2) & ~inErrorState);
        }

        var bulkStringStart = (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);
        var bulkStringEnd = bulkStringStart + bulkStringLength;

        // check for \r\n
        var expectedCrPosition = bulkStringEnd;
        ranOutOfData |= (uint)((commandBufferFilledBytes - (expectedCrPosition + 2)) >> 31);    // expectedCrPosition + 2 > commandBufferFilledBytes
        inErrorState |= ranOutOfData;

        expectedCrPosition &= (int)~inErrorState;
        inErrorState |= (uint)((-(Unsafe.As<byte, ushort>(ref Unsafe.Add(ref commandBuffer, expectedCrPosition)) ^ CRLF)) >> 31);    // (ref commandBuffer + expectedCrPosition) != CRLF

        // move past the \r\n
        currentCommandRef = ref Unsafe.Add(ref currentCommandRef, bulkStringLength + 2);
        rollbackCommandRefCount += bulkStringLength + 2;

        // now we're ready to write out results

        // do we even have space to write something?
        ranOutOfData |= (uint)~(((int)-Unsafe.ByteOffset(ref currentIntoRef, ref intoAsIntsAllocatedEnd)) >> 31);    // currentIntoRef == intoAsIntsAllocatedEnd
        inErrorState |= ranOutOfData;

        // write the results out
        ref var effectiveInto = ref Unsafe.Subtract(ref currentIntoRef, inErrorState & 4);
        
        ref var argCountAndCommandRef = ref Unsafe.As<int, long>(ref Unsafe.Add(ref effectiveInto, ParsedRespCommandOrArgument.ArgumentCountIx));
        ref var byteStartAndEndRefSub = ref Unsafe.As<int, long>(ref Unsafe.Add(ref effectiveInto, ParsedRespCommandOrArgument.ByteStartIx));

        // clear argCount and command UNLESS we're in an error state
        argCountAndCommandRef &= (int)inErrorState;

        // set byteStart and byteEnd if we're not in an error state, otherwise leave them unmodified
        var packedByteStartAndEnd = (long)(((uint)bulkStringStart) | (((ulong)(uint)(expectedCrPosition + 2)) << 32));
        byteStartAndEndRefSub = (byteStartAndEndRefSub & (int)inErrorState) | (packedByteStartAndEnd & ~(int)inErrorState);

        // now update state

        // advance write target if we succeeded
        currentIntoRef = ref Unsafe.Add(ref currentIntoRef, (int)(~inErrorState & 4));
        rollbackWriteIntoRefCount += (int)(~inErrorState & 4);

        // note that we rollback currentCommandBufferIx if there's any error
        //
        // if we've read past the end of the command, we will _for sure_ be in error state
        currentCommandRef = ref Unsafe.Subtract(ref currentCommandRef, (int)(inErrorState & (uint)(bulkStringLength + 2 + 1)));
        rollbackCommandRefCount -= (int)(inErrorState & (uint)(bulkStringLength + 2 + 1));

        // update the number of items read from the array, if appropriate
        var shouldUpdateRemainingItemCount = ~readAllItems & ~inErrorState;
        remainingArrayItems -= (int)(shouldUpdateRemainingItemCount & 1);

        // but we rollback errors and out of data if we fully consumed the array
        inErrorState = ((readAllItems & oldInErrorState) | (~readAllItems & inErrorState));
        ranOutOfData = ((readAllItems & oldRanOutOfData) | (~readAllItems & ranOutOfData));

    }
} while ((remainingArrayItems & ~inErrorState) != 0);

The loop which follows handles the remaining strings in a command array.  This strongly resembles the above except instead of just having a path for strings of length < 10 it has two paths, one for strings of length <= 9,999 and one for everything else.
Peppered throughout that code are places where we track old values (like oldInErrorState) or enough data to reverse a computation (like rollbackWriteIntoRefCount).  These are then applied conditionally, but without a branch, where needed – again using bit twiddling hackery.
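
The “already consumed the whole array” case shows the pattern in miniature: the flags are saved up front, the iteration runs unconditionally, and the old flags are blended back in at the end, so a no-op pass leaves no trace:

var oldInErrorState = inErrorState;
var oldRanOutOfData = ranOutOfData;

// ... the entire iteration runs, possibly setting inErrorState and ranOutOfData ...

// restore the old flags iff readAllItems is all 1s, keep the new ones otherwise
inErrorState = ((readAllItems & oldInErrorState) | (~readAllItems & inErrorState));
ranOutOfData = ((readAllItems & oldRanOutOfData) | (~readAllItems & ranOutOfData));
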

// we've now parsed everything EXCEPT the command enum
// and we haven't written the argument count either
//
// we might be in an error state, or we might be out of data
//
// if we're out of data, we just need to take nothing (rolling everything back)
ref var effectInto = ref Unsafe.Subtract(ref currentIntoRef, rollbackWriteIntoRefCount + effectiveIntoReverse);
var cmdStart = (int)(Unsafe.Add(ref effectInto, ParsedRespCommandOrArgument.ByteStartIx) & ~inErrorState);
var cmdEnd = (int)(Unsafe.Add(ref effectInto, ParsedRespCommandOrArgument.ByteEndIx) & ~inErrorState);

// Parse the first string we read as a command
//
// If we're in an error state, we try to parse a 0 length string, which will fail with no side effects
RespCommand cmd;
{
    // We read the command string back into a 32 byte vector, extra padding in the command buffer makes this safe
    // We then AND off any extra data, AND off high bits to upper case everything
    // And then use the vector to calculate a hash, with constants found by a brute force search

    var commandLength = (cmdEnd - cmdStart) - 2;
    var commandVector = Vector256.LoadUnsafe(ref Unsafe.Add(ref commandBuffer, cmdStart));

    var effectiveLength = commandLength & 31;
    var maskIx = effectiveLength * 64;
    var xorIx = maskIx + 32;

    ref var lengthMaskAndHashXorsRef = ref MemoryMarshal.GetArrayDataReference(UnconditionalParseRespCommandImpl.LengthMaskAndHashXors);
    var maskVector = Vector256.LoadUnsafe(ref Unsafe.Add(ref lengthMaskAndHashXorsRef, maskIx));
    var xorVector = Vector256.Create(Unsafe.As<byte, uint>(ref Unsafe.Add(ref lengthMaskAndHashXorsRef, xorIx)));

    var truncatedCommandVector = Vector256.BitwiseAnd(commandVector, Vector256.GreaterThan(maskVector, Vector256<byte>.Zero));
    var upperCommandVector = Vector256.BitwiseAnd(truncatedCommandVector, maskVector);

    var invalid = Vector256.GreaterThanAny(truncatedCommandVector, Vector256.Create((byte)'z'));
    var allZerosIfInvalid = Unsafe.As<bool, byte>(ref invalid) - 1;

    var xored = Vector256.Xor(upperCommandVector, Vector256.As<uint, byte>(xorVector));
    var mixed = Vector256.Dot(Vector256.As<byte, uint>(xored), Vector256<uint>.One);
    var moded = mixed % 71;

    // Now that we have a unique index for a command, we lookup what SHOULD be in the command buffer (after ANDing away lowercase)
    // and compare to see if the hashed value should actually be parsed as a command
    var dataStartIx = 8192 * effectiveLength;
    dataStartIx += (int)(moded * 64);

    ref var commandAndExpectedValuesRef = ref MemoryMarshal.GetArrayDataReference(UnconditionalParseRespCommandImpl.CommandAndExpectedValues);
    var cmdSub = Unsafe.As<byte, int>(ref Unsafe.Add(ref commandAndExpectedValuesRef, dataStartIx));
    var expectedCommandVector = Vector256.LoadUnsafe(ref Unsafe.Add(ref commandAndExpectedValuesRef, dataStartIx + 4));

    var matches = Vector256.EqualsAll(upperCommandVector, expectedCommandVector);
    var allOnesIfMatches = (-Unsafe.As<bool, byte>(ref matches)) >> 31;

    // cmd is left as None if an error was encountered
    cmd = (RespCommand)(cmdSub & allOnesIfMatches & allZerosIfInvalid);
}
inErrorState |= (uint)~(-(int)cmd >> 31);   // cmd == 0

Then we have what is probably the trickiest part of the code, parsing the first string in a command array as a RespCommand.  Conceptually, this uses a custom hash function whose constants were derived through a brute force search.

The idea is to:

  • Load the command text into a SIMD register
  • Mask away bits to force the command to be upper case
  • Perform a simple hash function (in this case an XOR, a dot product, and a mod by a constant factor)
  • Lookup the expected RespCommand and command string given that hash
  • Compare the command string to what we actually have, erroring if there’s no match

One subtlety is that RESP commands sometimes have non-alpha characters.  For the set I handle here, a comparison against ‘z’ is sufficient to exclude the collisions that the basic approach can’t catch on its own.
This hash approach could probably be greatly improved upon.  The function is very basic, and there’s a vast literature on fast (or perfect) hash functions to draw upon.
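
For a sense of the shape (though not the speed), here's a much-simplified scalar sketch of hash-then-verify – the table, hash, and command names are invented stand-ins, not the real brute-forced constants:

enum Cmd { None = 0, Get, Set /* ... */ }

static uint Hash(ReadOnlySpan<byte> upper)
{
    uint h = 0;
    foreach (var b in upper)
    {
        h = (h * 31) + b; // any cheap mix works, so long as build and lookup agree
    }

    return h;
}

static readonly (byte[] Upper, Cmd Command)[] Table = BuildTable();

static (byte[] Upper, Cmd Command)[] BuildTable()
{
    var table = new (byte[], Cmd)[64];
    foreach (var (name, cmd) in new[] { ("GET"u8.ToArray(), Cmd.Get), ("SET"u8.ToArray(), Cmd.Set) })
    {
        table[(int)(Hash(name) % (uint)table.Length)] = (name, cmd);
    }

    return table;
}

static Cmd TryParseCommand(ReadOnlySpan<byte> name)
{
    Span<byte> upper = stackalloc byte[name.Length];
    for (var i = 0; i < name.Length; i++)
    {
        upper[i] = (byte)(name[i] & ~0x20); // clear bit 5: ASCII lower -> upper
    }

    var (expectedUpper, command) = Table[(int)(Hash(upper) % (uint)Table.Length)];

    // only accept on an exact byte match, so collisions can't smuggle unexpected strings through
    return expectedUpper != null && upper.SequenceEqual(expectedUpper) ? command : Cmd.None;
}

The real version replaces the loop hash with a vectorized XOR, dot product, and mod, and the byte-by-byte comparison with a single Vector256 equality – but the “hash, then verify against the expected bytes” shape is the same.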

        // we need to update the first entry, to either have command and argument values, or be marked malformed
        ref var argCountAndCmdRef = ref Unsafe.As<int, long>(ref Unsafe.Add(ref effectInto, ParsedRespCommandOrArgument.ArgumentCountIx));
        ref var byteStartAndEndRef = ref Unsafe.As<int, long>(ref Unsafe.Add(ref effectInto, ParsedRespCommandOrArgument.ByteStartIx));

        // update command and arg count if not in error state, otherwise leave them unchanged
        var packedArgCountAndCmd = (long)((uint)arrayLength | (((ulong)(uint)cmd) << 32));
        argCountAndCmdRef = (argCountAndCmdRef & (long)(int)inErrorState) | (packedArgCountAndCmd & ~(long)(int)inErrorState);

        // we need to zero these out if we're in error state, but NOT out of data
        // this will produce a Malformed entry
        var writeMalformed = inErrorState & ~ranOutOfData;
        byteStartAndEndRef &= ~(int)writeMalformed;

        // update the pointer for storage

        // if we ENTERED the error state, we wrote a Malformed out
        // so we need to advance writeNextResultInto by 4
        // if we ran out of data, we roll writeNextResultInto all the way back
        // 
        // so the logic is writeNextResultInto - inErrorState
        var rollbackNextResultIntoBy = (int)(inErrorState & ~oldErrorState) & (rollbackWriteIntoRefCount - 4);
        rollbackNextResultIntoBy = ((int)ranOutOfData & rollbackWriteIntoRefCount) | ((int)~ranOutOfData & rollbackNextResultIntoBy);
        currentIntoRef = ref Unsafe.Subtract(ref currentIntoRef, rollbackNextResultIntoBy);

        // roll current pointer back if we are in an error state
        currentCommandRef = ref Unsafe.Subtract(ref currentCommandRef, (int)(inErrorState & (uint)rollbackCommandRefCount));
    }
} while (inErrorState == 0);

After parsing the command, we end the main loop with some error handling and update the first ParsedRespCommandOrArgument to its final (either malformed or command w/ arguments) state.

Wrapping Up

// At the end of everything, calculate the amount of commandsSpan and commandBufferTotalSpan consumed
commandsSlotsUsed = (int)(Unsafe.ByteOffset(ref intoAsInts, ref currentIntoRef) / (sizeof(int) * 4));
bytesConsumed = (int)Unsafe.ByteOffset(ref commandBuffer, ref currentCommandRef);

Finally, after exiting the main loop, we calculate how many bytes we consumed and ParsedRespCommandOrArgument slots we used.  Recall that, if we entered an error state, we rolled back advances to currentIntoRef and currentCommandRef – so we’ll be left pointing at the start of the incomplete or malformed command.

The Assembly

To prove the assertion about branches in the final code, here’s the JIT’d assembly.  This is x64 assembly, generated by .NET 8 on an AMD Ryzen 7 machine with PGO and tiered compilation enabled.

; Assembly listing for method OptimizationExercise.SIMDRespParsing.RespParserFinal:Parse(int,System.Span`1[ubyte],int,int,System.Span`1[ubyte],System.Span`1[OptimizationExercise.SIMDRespParsing.ParsedRespCommandOrArgument],byref,byref) (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 274176
; 0 inlinees with PGO data; 42 single block inlinees; 6 inlinees without PGO data

G_M000_IG01:
       push     r15
       push     r14
       push     r13
       push     r12
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 200
       vzeroupper 
       vmovaps  xmmword ptr [rsp+0xB0], xmm6
       vmovaps  xmmword ptr [rsp+0xA0], xmm7
       xor      eax, eax
       mov      qword ptr [rsp+0x90], rax
       mov      rcx, bword ptr [rsp+0x130]
       mov      rax, bword ptr [rsp+0x138]
 
G_M000_IG02:
       mov      r10, bword ptr [rdx]
       mov      rcx, bword ptr [rcx]
       mov      bword ptr [rsp+0x30], rcx
       movsxd   r9, r8d
       add      r9, r10
       mov      r11, rcx
 
G_M000_IG03:
       vmovups  ymm0, ymmword ptr [r10]
       vmovups  ymm1, ymmword ptr [r10+0x20]
       vpmaxub  ymm2, ymm0, ymmword ptr [reloc @RWD00]
       vpcmpeqb ymm2, ymm2, ymm0
       vpmaxub  ymm3, ymm1, ymmword ptr [reloc @RWD00]
       vpcmpeqb ymm3, ymm3, ymm1
       vpminub  ymm4, ymm0, ymmword ptr [reloc @RWD32]
       vpcmpeqb ymm0, ymm4, ymm0
       vpminub  ymm4, ymm1, ymmword ptr [reloc @RWD32]
       vpcmpeqb ymm1, ymm4, ymm1
       vpand    ymm0, ymm2, ymm0
       vpand    ymm1, ymm3, ymm1
       vpmovmskb ebx, ymm0
       mov      ebx, ebx
       vpmovmskb esi, ymm1
       mov      esi, esi
       shl      rsi, 32
       or       rbx, rsi
       mov      qword ptr [r11], rbx
       add      r10, 64
       add      r11, 8
       cmp      r10, r9
       jb       SHORT G_M000_IG03 ; end of bitmap loop
       mov      r10, bword ptr [rdx]
       mov      rdx, bword ptr [rax]
       mov      eax, dword ptr [rax+0x08]
       xor      r9d, r9d
       xor      r11d, r11d
       mov      bword ptr [rsp+0x28], rdx
       shl      eax, 2
       cdqe     
       lea      rax, bword ptr [rdx+4*rax]
       mov      bword ptr [rsp+0x20], rax
       mov      rbx, r10
       mov      rsi, rdx
       vmovups  ymm0, ymmword ptr [reloc @RWD64]
       vmovups  ymm1, ymmword ptr [reloc @RWD96]
 
G_M000_IG04:
       xor      edi, edi
       mov      dword ptr [rsp+0x80], r9d
       mov      rbp, rbx
       sub      rbp, r10
       mov      r14d, r8d
       sub      r14d, ebp
       add      r14d, -4
       sar      r14d, 31
       or       r11d, r14d
       mov      ebp, r9d
       or       ebp, r11d
       mov      r14, rax
       sub      r14, rsi
       neg      r14d
       not      r14d
       sar      r14d, 31
       or       r14d, r11d
       mov      r11d, r14d
       or       ebp, r11d
       movzx    r14, byte  ptr [rbx]
       xor      r14d, 42
       neg      r14d
       sar      r14d, 31
       or       ebp, r14d
       inc      rbx
       mov      r14d, dword ptr [rbx]
       add      r14d, 0xD1FFAB1E
       cmp      r14d, 8
       jae      G_M000_IG16 ; take slow path if array probe fails
 
G_M000_IG05:
       inc      r14d
       add      rbx, 3
       mov      r15d, 4
 
G_M000_IG06:
       mov      dword ptr [rsp+0x7C], r14d
       mov      r13d, r14d
       mov      r12d, dword ptr [rbx]
       add      r12d, 0xD1FFAB1E
       ror      r12d, 8
       cmp      r12d, 10
       jae      G_M000_IG15 ; failed command length fast path
 
G_M000_IG07:
       mov      rdi, rbx
       sub      rdi, r10
       mov      dword ptr [rsp+0x3C], edi
       mov      edx, r8d
       sub      edx, edi
       add      edx, -4
       sar      edx, 31
       mov      rdi, rax
       sub      rdi, rsi
       neg      edi
       sar      edi, 31
       not      edi
       or       edi, r11d
       mov      r11d, edx
       or       r11d, edi
       or       ebp, r11d
       mov      edx, r11d
       and      edx, 4
       mov      dword ptr [rsp+0x70], edx
       mov      edi, dword ptr [rsp+0x3C]
       add      edi, 4
       mov      dword ptr [rsp+0x6C], edi
       mov      edx, r12d
       lea      edx, [rdi+rdx+0x02]
       mov      dword ptr [rsp+0x68], edx
       mov      edi, r8d
       sub      edi, edx
       sar      edi, 31
       or       edi, r11d
       mov      r11d, edi
       movsxd   rdi, dword ptr [rsp+0x70]
       shl      rdi, 2
       mov      rdx, rsi
       sub      rdx, rdi
       or       ebp, r11d
       lea      edi, [r12+0x04]
       andn     edi, ebp, edi
       movzx    rdi, word  ptr [rbx+rdi]
       xor      edi, 0xA0D
       neg      edi
       sar      edi, 31
       or       edi, ebp
       mov      ebp, edi
       movsxd   rdi, ebp
       and      qword ptr [rdx], rdi
       mov      edi, dword ptr [rsp+0x6C]
       mov      qword ptr [rsp+0x88], rdi
       mov      edi, dword ptr [rsp+0x68]
       shl      rdi, 32
       or       rdi, qword ptr [rsp+0x88]
       mov      qword ptr [rsp+0x60], rdi
       add      rdx, 8
       movsxd   rdi, ebp
       and      rdi, qword ptr [rdx]
       mov      qword ptr [rsp+0x88], rdi
       movsxd   rdi, ebp
       andn     rdi, rdi, qword ptr [rsp+0x60]
       or       rdi, qword ptr [rsp+0x88]
       mov      qword ptr [rdx], rdi
       mov      edx, 4
       andn     edx, ebp, edx
       mov      edi, edx
       movsxd   rdx, edx
       lea      rsi, bword ptr [rsi+4*rdx]
       dec      r13d
       add      r12d, 6
       andn     edx, ebp, r12d
       movsxd   r12, edx
       add      r12, rbx
       mov      rbx, r12
       add      r15d, edx
 
G_M000_IG08:
       mov      r12d, r13d
       neg      r12d
       sar      r12d, 31
       not      r12d
       mov      dword ptr [rsp+0x5C], ebp
       mov      dword ptr [rsp+0x58], r11d
       mov      edx, ebp
       or       edx, r12d
       mov      rcx, rbx
       sub      rcx, r10
       mov      eax, r8d
       sub      eax, ecx
       add      eax, -6
       sar      eax, 31
       or       eax, r11d
       or       edx, eax
       movzx    rcx, byte  ptr [rbx]
       xor      ecx, 36
       neg      ecx
       sar      ecx, 31
       or       edx, ecx
       inc      rbx
       inc      r15d
       mov      rcx, rbx
       sub      rcx, r10
       mov      r9d, ecx
       sar      r9d, 31
       and      r9d, 7
       add      r9d, ecx
       sar      r9d, 3
       movsxd   r9, r9d
       mov      r11, bword ptr [rsp+0x30]
       mov      r9d, dword ptr [r11+r9]
       mov      r11d, ecx
       sar      r11d, 31
       and      r11d, 7
       add      r11d, ecx
       and      r11d, -8
       sub      ecx, r11d
       shrx     ecx, r9d, ecx
       not      ecx
       tzcnt    ecx, ecx
       mov      r9d, ecx
       neg      r9d
       add      r9d, 10
       shr      r9d, 31
       or       r9d, edx
       mov      edx, r9d
       mov      r9d, ecx
       add      r9, rbx
       mov      r11, r9
       sub      r11, r10
       add      r11d, 2
       mov      ebp, r8d
       sub      ebp, r11d
       sar      ebp, 31
       or       ebp, eax
       mov      eax, ebp
       or       edx, eax
       movzx    r11, word  ptr [r9]
       xor      r11d, 0xA0D
       neg      r11d
       sar      r11d, 31
       or       edx, r11d
       mov      ebp, edx
       mov      edx, ebp
       and      edx, r15d
       movsxd   rdx, edx
       mov      r11, rbx
       sub      r11, rdx
       mov      bword ptr [rsp+0x08], r11
       mov      edx, ecx
       add      edx, r15d
       mov      r11d, ebp
       and      edx, r11d
       movsxd   rdx, edx
       sub      r9, rdx
       sub      r9, qword ptr [rsp+0x08]
       mov      rdx, 0xD1FFAB1E
       mov      rdx, gword ptr [rdx]
       lea      r11d, [r9+2*r9]
 
G_M000_IG09:
       shl      r11d, 2
       movsxd   r11, r11d
       lea      rdx, bword ptr [rdx+4*r11+0x10]
       mov      bword ptr [rsp], rdx
       cmp      r9d, 4
       jg       G_M000_IG19 ; take the slow path for string length
       lea      r11d, [8*r9]
       neg      r11d
       add      r11d, 32
       mov      rdx, bword ptr [rsp+0x08]
       movbe    edx, dword ptr [rdx]
       shrx     edx, edx, r11d
       mov      r11d, edx
       and      r11d, 0xD1FFAB1E
       lea      r11d, [r11+4*r11]
       and      edx, 0xD1FFAB1E
       shl      edx, 8
       lea      edx, [rdx+2*r11]
       mov      r11d, edx
       and      r11d, -0x10000
       imul     r11, r11, 100
       movzx    rdx, dx
       shl      rdx, 16
       add      rdx, r11
       shr      rdx, 24
       mov      dword ptr [rsp+0x44], edx
       xor      r9d, 1
       neg      r9d
       sar      r9d, 31
       mov      r11, bword ptr [rsp]
       sub      edx, dword ptr [r11]
       sar      edx, 31
       and      edx, r9d
       or       edx, dword ptr [rsp+0x44]
 
G_M000_IG10:
       mov      r11d, edx
       sar      r11d, 31
       or       ebp, r11d
       add      ecx, 2
       andn     ecx, ebp, ecx
       movsxd   r9, ecx
       add      rbx, r9
       add      r15d, ecx
       mov      rcx, rbx
       sub      rcx, r10
       mov      dword ptr [rsp+0x54], ecx
       lea      r9d, [rcx+rdx]
       lea      r11d, [r9+0x02]
       mov      ecx, r8d
       sub      ecx, r11d
       sar      ecx, 31
       or       ecx, eax
       mov      eax, ecx
       or       ebp, eax
       andn     r9d, ebp, r9d
       movsxd   rcx, r9d
       movzx    rcx, word  ptr [r10+rcx]
       xor      ecx, 0xA0D
       neg      ecx
       sar      ecx, 31
       or       ecx, ebp
       mov      ebp, ecx
       lea      ecx, [rdx+0x02]
       movsxd   rcx, ecx
       add      rbx, rcx
       lea      r15d, [r15+rdx+0x02]
       mov      rcx, bword ptr [rsp+0x20]
       mov      r11, rcx
       sub      r11, rsi
       neg      r11d
       sar      r11d, 31
       not      r11d
       or       r11d, eax
       mov      eax, r11d
       or       ebp, eax
       mov      r11d, ebp
       and      r11d, 4
       shl      r11, 2
       mov      rcx, rsi
       sub      rcx, r11
       movsxd   r11, ebp
       and      qword ptr [rcx], r11
       mov      r11d, dword ptr [rsp+0x54]
       add      r9d, 2
       shl      r9, 32
       or       r9, r11
       mov      qword ptr [rsp+0x48], r9
       add      rcx, 8
       movsxd   r11, ebp
       and      r11, qword ptr [rcx]
       mov      r9d, ebp
       not      r9d
       movsxd   r14, r9d
       and      r14, qword ptr [rsp+0x48]
       or       r11, r14
       mov      qword ptr [rcx], r11
       mov      ecx, r9d
       and      ecx, 4
       movsxd   r11, ecx
       lea      rsi, bword ptr [rsi+4*r11]
       add      edi, ecx
       add      edx, 3
       and      edx, ebp
       movsxd   rcx, edx
       sub      rbx, rcx
       sub      r15d, edx
       mov      ecx, r12d
       not      ecx
       and      r9d, ecx
       and      r9d, 1
       sub      r13d, r9d
       mov      edx, r12d
       and      edx, dword ptr [rsp+0x5C]
       and      ebp, ecx
 
G_M000_IG11:
       or       ebp, edx
       and      r12d, dword ptr [rsp+0x58]
       and      ecx, eax
       mov      eax, r12d
       or       eax, ecx
       mov      ecx, ebp
       not      ecx
       movsxd   rdx, r13d
       test     rcx, rdx
       mov      r11d, eax
       jne      G_M000_IG08 ; loop back to continue extracting strings
       mov      r13d, r11d
       and      r13d, 4
       add      r13d, edi
       movsxd   r13, r13d
       shl      r13, 2
       mov      r12, rsi
       sub      r12, r13
       mov      r13d, ebp
       not      r13d
       mov      edx, r13d
       and      edx, dword ptr [r12+0x08]
       and      r13d, dword ptr [r12+0x0C]
       sub      r13d, edx
       add      r13d, -2
       movsxd   rdx, edx
       vmovups  ymm2, ymmword ptr [r10+rdx]
       and      r13d, 31
       mov      edx, r13d
       shl      edx, 6
       lea      ecx, [rdx+0x20]
       mov      rax, 0xD1FFAB1E
       mov      rax, gword ptr [rax]
       add      rax, 16
       movsxd   rdx, edx
       vmovups  ymm3, ymmword ptr [rax+rdx]
       movsxd   rcx, ecx
       vpbroadcastd ymm4, dword ptr [rax+rcx]
       vpsubb   ymm5, ymm3, ymm0
       vpcmpgtb ymm5, ymm5, ymmword ptr [reloc @RWD64]
       vpand    ymm2, ymm2, ymm5
       vpand    ymm3, ymm2, ymm3
       vpsubb   ymm2, ymm2, ymm0
       vpcmpgtb ymm2, ymm2, ymmword ptr [reloc @RWD128]
       vptest   ymm2, ymm2
       setne    al
       movzx    rax, al
       dec      eax
       vpxor    ymm2, ymm3, ymm4
       vpmulld  ymm2, ymm2, ymm1
       vphaddd  ymm2, ymm2, ymm2
       vphaddd  ymm2, ymm2, ymm2
       vperm2i128 ymm4, ymm2, ymm2, 1
       vpaddd   ymm2, ymm2, ymm4
       vmovd    ecx, xmm2
       mov      edx, 0xD1FFAB1E
       mov      r9d, ecx
       imul     rdx, r9
       shr      rdx, 38
       imul     edx, edx, 71
       sub      ecx, edx
       shl      ecx, 6
       shl      r13d, 13
       add      ecx, r13d
       mov      rdx, 0xD1FFAB1E
       mov      rdx, gword ptr [rdx]
       add      rdx, 16
       movsxd   r9, ecx
       mov      r9d, dword ptr [rdx+r9]
       add      ecx, 4
       movsxd   rcx, ecx
 
G_M000_IG12:
       vpcmpeqb ymm2, ymm3, ymmword ptr [rdx+rcx]
       vpmovmskb ecx, ymm2
       cmp      ecx, -1
       sete     cl
       movzx    rcx, cl
       neg      ecx
       sar      ecx, 31
       and      ecx, r9d
       and      eax, ecx
       movzx    rax, ax
       mov      ecx, eax
       neg      ecx
       sar      ecx, 31
       not      ecx
       or       ecx, ebp
       mov      ebp, ecx
       mov      ecx, dword ptr [rsp+0x7C]
       shl      rax, 32
       or       rax, rcx
       movsxd   rcx, ebp
       andn     rax, rcx, rax
       movsxd   rcx, ebp
       and      rcx, qword ptr [r12]
       or       rax, rcx
       mov      qword ptr [r12], rax
       andn     eax, r11d, ebp
       mov      ecx, eax
       add      r12, 8
       not      ecx
       movsxd   rcx, ecx
       and      qword ptr [r12], rcx
       mov      r9d, dword ptr [rsp+0x80]
       andn     r9d, r9d, eax
       lea      eax, [rdi-0x04]
       and      eax, r9d
       mov      ecx, r11d
       and      ecx, edi
       or       eax, ecx
       cdqe     
       shl      rax, 2
       sub      rsi, rax
       mov      eax, ebp
       and      eax, r15d
       cdqe     
       sub      rbx, rax
       test     ebp, ebp
       mov      r9d, ebp
       mov      rax, bword ptr [rsp+0x20]
       mov      rcx, bword ptr [rsp+0x30]
       je       G_M000_IG04 ; loop back to parse another command array
 
G_M000_IG13:
       mov      r8, rsi
       sub      r8, qword ptr [rsp+0x28]
       mov      rdx, r8
       sar      rdx, 63
       and      rdx, 15
       add      rdx, r8
       sar      rdx, 4
       mov      rax, bword ptr [rsp+0x140]
       mov      dword ptr [rax], edx
       mov      rax, rbx
       sub      rax, r10
       mov      rcx, bword ptr [rsp+0x148]
       mov      dword ptr [rcx], eax
 
G_M000_IG14:
       vmovaps  xmm6, xmmword ptr [rsp+0xB0]
       vmovaps  xmm7, xmmword ptr [rsp+0xA0]
       vzeroupper 
       add      rsp, 200
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r12
       pop      r13
       pop      r14
       pop      r15
       ret      
 
G_M000_IG15:
       jmp      G_M000_IG08 ; PGO introduced for unlikely case
 
G_M000_IG16:
       mov      r14, rbx
       sub      r14, r10
       mov      r15d, r14d
       sar      r15d, 31
       and      r15d, 7
       add      r15d, r14d
       sar      r15d, 3
       movsxd   r15, r15d
       mov      r15d, dword ptr [rcx+r15]
       mov      r13d, r14d
       sar      r13d, 31
       and      r13d, 7
       add      r13d, r14d
       and      r13d, -8
       sub      r14d, r13d
       shrx     r14d, r15d, r14d
       not      r14d
       tzcnt    r14d, r14d
       mov      r15d, r14d
       neg      r15d
       add      r15d, 10
       shr      r15d, 31
       or       r15d, ebp
       mov      ebp, r15d
       mov      r15d, r14d
       add      r15, rbx
       mov      r13, r15
       sub      r13, r10
       add      r13d, 2
       mov      r12d, r8d
       sub      r12d, r13d
       sar      r12d, 31
       or       r12d, r11d
       mov      r11d, r12d
       or       ebp, r11d
       movzx    r13, word  ptr [r15]
       xor      r13d, 0xA0D
       neg      r13d
       sar      r13d, 31
       or       ebp, r13d
       mov      r13d, ebp
       and      r13d, 1
       movsxd   r13, r13d
       mov      r12, rbx
       sub      r12, r13
       mov      bword ptr [rsp+0x18], r12
       mov      r13d, r14d
       inc      r13d
       mov      r12d, ebp
       and      r13d, r12d
       movsxd   r13, r13d
       sub      r15, r13
       sub      r15, qword ptr [rsp+0x18]
       mov      r13, 0xD1FFAB1E
       mov      r13, gword ptr [r13]
       lea      r12d, [r15+2*r15]
       shl      r12d, 2
       movsxd   r12, r12d
       lea      r13, bword ptr [r13+4*r12+0x10]
       mov      bword ptr [rsp+0x10], r13
       cmp      r15d, 4
       jg       G_M000_IG20 ; take slowest array length path
       lea      r12d, [8*r15]
       neg      r12d
       add      r12d, 32
       mov      r13, bword ptr [rsp+0x18]
       movbe    r13d, dword ptr [r13]
       shrx     r13d, r13d, r12d
       mov      r12d, r13d
       and      r12d, 0xD1FFAB1E
       lea      r12d, [r12+4*r12]
       and      r13d, 0xD1FFAB1E
       shl      r13d, 8
       lea      r13d, [r13+2*r12]
 
G_M000_IG17:
       mov      r12d, r13d
       and      r12d, -0x10000
       imul     r12, r12, 100
       movzx    r13, r13w
       shl      r13, 16
       add      r13, r12
       shr      r13, 24
       mov      dword ptr [rsp+0x78], r13d
       xor      r15d, 1
       neg      r15d
       sar      r15d, 31
       mov      r12, bword ptr [rsp+0x10]
       sub      r13d, dword ptr [r12]
       sar      r13d, 31
       and      r15d, r13d
       or       r15d, dword ptr [rsp+0x78]
 
G_M000_IG18:
       mov      r12d, r15d
       sar      r12d, 31
       or       ebp, r12d
       add      r14d, 2
       andn     r14d, ebp, r14d
       movsxd   r13, r14d
       add      rbx, r13
       inc      r14d
       lea      r13d, [r15-0x01]
       sar      r13d, 31
       or       ebp, r13d
       mov      edx, r14d
       mov      r14d, r15d
       mov      r15d, edx
       mov      rcx, bword ptr [rsp+0x30]
       jmp      G_M000_IG06 ; backedge to return from slow array path
 
G_M000_IG19:
       mov      dword ptr [rsp+0x84], eax
       mov      r11, 0xD1FFAB1E
       mov      r11, gword ptr [r11]
       mov      gword ptr [rsp+0x90], r11
       mov      r11d, r9d
       shl      r11d, 4
       movsxd   r11, r11d
       mov      rax, gword ptr [rsp+0x90]
       vmovups  xmm2, xmmword ptr [rax+r11+0x10]
       mov      r11, bword ptr [rsp+0x08]
       vpand    xmm2, xmm2, xmmword ptr [r11]
       vpmovzxbw xmm3, xmm2
       vpsrldq  xmm2, xmm2, 8
       vpmovzxbw xmm2, xmm2
       vpmovzxwd xmm4, xmm3
       vpsrldq  xmm3, xmm3, 8
       vpmovzxwd xmm3, xmm3
       vpmovzxwd xmm2, xmm2
       mov      rdx, bword ptr [rsp]
       vmovups  xmm5, xmmword ptr [rdx]
       vmovups  xmm6, xmmword ptr [rdx+0x10]
       vmovups  xmm7, xmmword ptr [rdx+0x20]
       vpmulld  xmm3, xmm3, xmm6
       vpmulld  xmm2, xmm2, xmm7
       vpmulld  xmm4, xmm4, xmm5
       vpsubd   xmm6, xmm4, xmmword ptr [reloc @RWD160]
       vpsubd   xmm5, xmm5, xmmword ptr [reloc @RWD160]
       vpcmpgtd xmm5, xmm5, xmm6
       vpaddd   xmm3, xmm4, xmm3
       vpaddd   xmm2, xmm3, xmm2
       vpmulld  xmm3, xmm2, xmmword ptr [reloc @RWD96]
       vphaddd  xmm3, xmm3, xmm3
       vphaddd  xmm3, xmm3, xmm3
       vmovd    eax, xmm3
       mov      dword ptr [rsp+0x40], eax
       vpsubd   xmm2, xmm2, xmmword ptr [reloc @RWD160]
       vmovd    xmm3, rax
       vpsubd   xmm3, xmm3, xmmword ptr [reloc @RWD160]
       vpcmpgtd xmm2, xmm2, xmm3
       vpor     xmm2, xmm5, xmm2
       mov      r11d, r9d
       xor      r11d, 1
       neg      r11d
       sar      r11d, 31
       sub      eax, dword ptr [rdx]
       sar      eax, 31
       and      eax, r11d
       mov      edx, r9d
       neg      edx
       add      edx, 9
       sar      edx, 31
       vmovd    r9d, xmm2
       neg      r9
       sar      r9, 63
       and      edx, r9d
       or       edx, dword ptr [rsp+0x40]
       or       edx, eax
       mov      eax, dword ptr [rsp+0x84]
       jmp      G_M000_IG10 ; backedge from slowest array path
 
G_M000_IG20:
       mov      r12, 0xD1FFAB1E
       mov      r12, gword ptr [r12]
       mov      gword ptr [rsp+0x90], r12
       mov      r12d, r15d
       shl      r12d, 4
       movsxd   r12, r12d
       mov      r13, gword ptr [rsp+0x90]
       vmovups  xmm2, xmmword ptr [r13+r12+0x10]
       mov      r13, bword ptr [rsp+0x18]
       vpand    xmm2, xmm2, xmmword ptr [r13]
       vpmovzxbw xmm3, xmm2
       vpsrldq  xmm2, xmm2, 8
       vpmovzxbw xmm2, xmm2
       vpmovzxwd xmm4, xmm3
       vpsrldq  xmm3, xmm3, 8
       vpmovzxwd xmm3, xmm3
       vpmovzxwd xmm2, xmm2
       mov      r13, bword ptr [rsp+0x10]
       vmovups  xmm5, xmmword ptr [r13]
       vmovups  xmm6, xmmword ptr [r13+0x10]
       vmovups  xmm7, xmmword ptr [r13+0x20]
       vpmulld  xmm3, xmm3, xmm6
       vpmulld  xmm2, xmm2, xmm7
       vpmulld  xmm4, xmm4, xmm5
       vpsubd   xmm6, xmm4, xmmword ptr [reloc @RWD160]
       vpsubd   xmm5, xmm5, xmmword ptr [reloc @RWD160]
       vpcmpgtd xmm5, xmm5, xmm6
       vpaddd   xmm3, xmm4, xmm3
       vpaddd   xmm2, xmm3, xmm2
       vpmulld  xmm3, xmm2, xmmword ptr [reloc @RWD96]
       vphaddd  xmm3, xmm3, xmm3
       vphaddd  xmm3, xmm3, xmm3
       vmovd    r12d, xmm3
       mov      dword ptr [rsp+0x74], r12d
       vpsubd   xmm2, xmm2, xmmword ptr [reloc @RWD160]
       vmovd    xmm3, r12
       vpsubd   xmm3, xmm3, xmmword ptr [reloc @RWD160]
       vpcmpgtd xmm2, xmm2, xmm3
       vpor     xmm2, xmm5, xmm2
       mov      r12d, r15d
       xor      r12d, 1
       neg      r12d
       sar      r12d, 31
       mov      dword ptr [rsp+0x9C], r12d
       mov      r12d, dword ptr [rsp+0x74]
       sub      r12d, dword ptr [r13]
       sar      r12d, 31
       and      r12d, dword ptr [rsp+0x9C]
       neg      r15d
       add      r15d, 9
       sar      r15d, 31
       vmovd    r13d, xmm2
       neg      r13
       sar      r13, 63
       and      r15d, r13d
       or       r15d, dword ptr [rsp+0x74]
       or       r15d, r12d
       jmp      G_M000_IG18 ; backedge return to slow array path

For a total of 11 branches.  The JIT has introduced 6 of these – 5 are backedges that fall out of keeping the likely-to-run code close together (which trades branches for fewer reads from memory and reduced instruction cache pressure).  The sixth is an oddball: the jmp G_M000_IG08 appears replaceable to me – perhaps there’s an alignment concern, or a branch distance issue I’m not seeing.

That’s in line with my goals so far as codegen goes.  The likely hot path has the five explicit branches, and the worst case path doubles that.
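As an aside, if you want to pull dumps like this yourself, BenchmarkDotNet’s DisassemblyDiagnoser will produce them.  A minimal sketch – my suggestion for reproducing this kind of listing, not necessarily how these dumps were captured – looks like:

// Sketch only: capture JIT disassembly for a benchmarked method.
using BenchmarkDotNet.Attributes;

[DisassemblyDiagnoser(maxDepth: 1)] // dump the benchmark and its direct callees
public class ParseDisassembly
{
    private readonly byte[] input = "*2\r\n$3\r\nGET\r\n$4\r\nfizz\r\n"u8.ToArray();

    [Benchmark]
    public void ParseOne()
    {
        // Invoke the Parse(...) under test against `input` here; body elided.
    }
}

On .NET 7+ the DOTNET_JitDisasm environment variable gets you similar output without a benchmark harness.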

The Performance

With any benchmarking, we first have to define what we’re comparing to.  In this case, I’m comparing three implementations:

  • Naive – a simple parser that leans on lots of small helper calls
  • Garnet – a parser modeled on Garnet’s, which special cases common commands
  • SIMD – the nearly-branchless approach built in this post

My top level benchmark varies the mix of commands, the size of the batches, the size of various sub elements, and what percentage of batches are incomplete.  The full set of results can be found here, but here are some of the highlights.
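For flavor, the shape of such a benchmark with BenchmarkDotNet – where the parameter names and value ranges below are illustrative assumptions, not the exact suite behind these numbers – is roughly:

using System;
using BenchmarkDotNet.Attributes;

public class RespParserBenchmarks
{
    [Params(1, 4, 8)]        // commands per batch
    public int BatchSize;

    [Params(6, 10)]          // key length in bytes
    public int KeyLength;

    [Params(0.0, 0.5, 1.0)]  // fraction of batches that end early
    public double IncompleteFraction;

    private byte[] buffer = Array.Empty<byte>();

    [GlobalSetup]
    public void Setup()
    {
        // Pre-render a receive buffer full of RESP requests matching the
        // parameters above; generation elided.
        buffer = BuildRequests(BatchSize, KeyLength, IncompleteFraction);
    }

    [Benchmark(Baseline = true)]
    public void Naive() { /* parse buffer with the naive parser */ }

    [Benchmark]
    public void Garnet() { /* parse buffer with the Garnet-style parser */ }

    [Benchmark]
    public void Simd() { /* parse buffer with the SIMD parser */ }

    private static byte[] BuildRequests(int batch, int keyLen, double incomplete)
        => Array.Empty<byte>(); // placeholder
}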

This SIMD/nearly-branchless approach is a bit smaller than Garnet, though the Naive approach (which makes many more function calls) is smaller than both:

Method   Code size (bytes)
Naive    2,016
Garnet   5,189
SIMD     3,610
[Chart: Commands per batch 4-8, keys of 6-10 bytes, values of 100-499 bytes, 3-10 sub items, and ½ of batches end early]

We see that as the mix of commands widens, SIMD does increasingly well but the Garnet approach remains superior.  This is probably because Garnet doesn’t treat all commands equally, special casing common commands with faster parsing.  Since I chose common commands to benchmark, Garnet does quite well.  As the mix includes more commands that aren’t special cased, the nearly-branchless SIMD approach closes the gap.

[Chart: All commands, keys of 6-10 bytes, values of 100-499 bytes, 3-10 sub items, and ½ of batches end early]
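To make the special casing concrete, here’s a sketch of mine in the same spirit – not Garnet’s actual code, and ignoring case-insensitivity for brevity.  A short, hot command’s entire bulk string header and name fit in 8 bytes, so a single integer compare recognizes it:

using System;
using System.Buffers.Binary;

enum RespCommand { NONE, GET, SET } // illustrative subset

static class FastPathSketch
{
    // "$3\r\nGET\r" is the first 8 bytes of a well-formed GET bulk string.
    private static readonly ulong GetPattern =
        BinaryPrimitives.ReadUInt64LittleEndian("$3\r\nGET\r"u8);
    private static readonly ulong SetPattern =
        BinaryPrimitives.ReadUInt64LittleEndian("$3\r\nSET\r"u8);

    public static RespCommand TryFastParse(ReadOnlySpan<byte> buffer)
    {
        if (buffer.Length >= 8)
        {
            ulong first8 = BinaryPrimitives.ReadUInt64LittleEndian(buffer);
            if (first8 == GetPattern) return RespCommand.GET;
            if (first8 == SetPattern) return RespCommand.SET;
        }
        return RespCommand.NONE; // fall back to the general parser
    }
}

Each special cased command costs one compare and one (well predicted, when that command dominates) branch – hard to beat when the workload is mostly those commands.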

Both SIMD and Garnet deal well with larger batch sizes, and both handle randomly varying batch sizes much better than Naive does.

[Chart: All commands, batch size of 4-8, keys of 6-10 bytes, values of 100-499 bytes, 3-10 sub items, and ½ of batches end early]

Increasing the share of incomplete batches doesn’t show an inflection point, suggesting that the reduction in data to parse dominates any misprediction penalty in the Garnet and SIMD cases.

So across the board in what I consider the interesting cases, the SIMD/nearly-branchless approach is close to, but a little worse than, the Garnet approach – code size excepted.  Given the trends, I would expect the SIMD approach to do best when the mix of commands isn’t special cased by Garnet and batch sizes are large.

And that is what we see in the “just hash commands” case, since Garnet doesn’t special case HDEL, HMGET, or HMSET.  As an example:

[Chart: Batch size of 8, keys of 6-10 bytes, values of 500-999 bytes, 3-10 sub items, and 100% of batches end early]

Is It Worth It?

While an interesting exercise, on balance: no.

The fundamental issue is that if your workload falls predictably into Garnet’s assumptions (which are reasonable), the performance just isn’t there – the SIMD/nearly-branchless approach isn’t faster.

Beyond that, this code is also disgustingly un-maintainable.

But there are some things that might shift the balance:

  • A pointer-y implementation might be a bit more maintainable
    • I did this all with refs, since that allows un-pinned memory for the command buffers – but realistically you’re gonna be reading from a socket, and if you really care about perf you’ll pull that memory from the pinned object heap
  • More optimizations around command parsing are possible
    • Commands are parsed twice, once to find boundaries and then once to determine the RespCommand – these could be combined
    • The hash calculation isn’t very sophisticated; a better implementation may be possible, as mentioned above
  • Garnet’s big-ol’ switch approach for parsing gets a little worse for each implemented command, and Redis has a lot of commands
    • In fact, the implementation I used here has removed a lot of commands in comparison to mainline Garnet
  • I’ve used .NET’s generic VectorXXX implementation; specializing for a processor (and its vector width) might yield gains – see the sketch after this list
    • While this wouldn’t introduce branches at runtime, it would make the code even harder to maintain
  • Instruction cache and branch prediction slots are global resources, so the smaller and nearly branchless approach frees up more of those resources for other work
    • Any benefit here would be workload dependent, as these resources are allocated dynamically at runtime
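On that VectorXXX point, here’s a sketch of what specialization might look like – my illustration, not code from this parser – for the kind of \r scan a RESP parser does.  The portable path uses 128-bit vectors; the specialized path uses AVX2’s 256-bit registers:

using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class CrScanSketch
{
    // Portable path: 128-bit vectors, available on x64 and ARM64 alike.
    public static int IndexOfCr(ReadOnlySpan<byte> buffer)
    {
        ref byte start = ref MemoryMarshal.GetReference(buffer);
        Vector128<byte> cr = Vector128.Create((byte)'\r');

        for (nuint i = 0; i + 16 <= (nuint)buffer.Length; i += 16)
        {
            Vector128<byte> chunk = Vector128.LoadUnsafe(ref start, i);
            uint mask = Vector128.Equals(chunk, cr).ExtractMostSignificantBits();
            if (mask != 0)
                return (int)i + BitOperations.TrailingZeroCount(mask);
        }
        return -1; // tail handling elided
    }

    // Specialized path: 32 bytes per iteration when AVX2 is available
    // (callers would check Avx2.IsSupported first).
    public static int IndexOfCrAvx2(ReadOnlySpan<byte> buffer)
    {
        ref byte start = ref MemoryMarshal.GetReference(buffer);
        Vector256<byte> cr = Vector256.Create((byte)'\r');

        for (nuint i = 0; i + 32 <= (nuint)buffer.Length; i += 32)
        {
            Vector256<byte> chunk = Vector256.LoadUnsafe(ref start, i);
            int mask = Avx2.MoveMask(Avx2.CompareEqual(chunk, cr));
            if (mask != 0)
                return (int)i + BitOperations.TrailingZeroCount(mask);
        }
        return -1; // tail handling elided
    }
}

The early-exit branches here are for clarity; a branchless variant would accumulate masks instead.  Either way, the specialized path doubles the bytes processed per iteration at the cost of a second copy of the code to maintain.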

Of course, some of the nearly-branchless techniques (like the hashing command lookup) could be adopted independently – so a hybrid approach may come out ahead even without the overall balance shifting.
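As a sketch of what an independently-adoptable hashing lookup could look like – my illustration, where the multiplier below is a stand-in and a real implementation would search offline for a constant that puts every command in a distinct slot:

using System;
using System.Buffers.Binary;

enum RespCommand { NONE, GET, SET, HDEL } // illustrative subset

static class CommandLookupSketch
{
    // 256 slots; a real table would be generated offline with a multiplier
    // proven collision free over the full command set.
    private static readonly ulong[] Patterns = new ulong[256];
    private static readonly RespCommand[] Commands = new RespCommand[256];

    static CommandLookupSketch()
    {
        Add("GET"u8, RespCommand.GET);
        Add("SET"u8, RespCommand.SET);
        Add("HDEL"u8, RespCommand.HDEL);
    }

    public static RespCommand Lookup(ReadOnlySpan<byte> name)
    {
        ulong key = Pad(name);
        int slot = Slot(key);
        // A single, highly predictable compare confirms the slot matches;
        // any unknown or colliding name falls out as NONE.
        return Patterns[slot] == key ? Commands[slot] : RespCommand.NONE;
    }

    private static void Add(ReadOnlySpan<byte> name, RespCommand cmd)
    {
        ulong key = Pad(name);
        Patterns[Slot(key)] = key;
        Commands[Slot(key)] = cmd;
    }

    // First 8 bytes of the (already upper-cased) name, zero padded.
    private static ulong Pad(ReadOnlySpan<byte> name)
    {
        Span<byte> padded = stackalloc byte[8];
        name.Slice(0, Math.Min(name.Length, 8)).CopyTo(padded);
        return BinaryPrimitives.ReadUInt64LittleEndian(padded);
    }

    // Multiply-shift hash into the table; 0x9E3779B97F4A7C15 is illustrative.
    private static int Slot(ulong key) => (int)((key * 0x9E3779B97F4A7C15UL) >> 56);
}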

While it’s always a little disappointing when an exercise doesn’t pan out, this sure was a fun one to hack together.