LinqAF: Allocating

This is part of a series on LinqAF, you should start with the first post.

When does LinqAF allocate?

The whole point of LinqAF is to be a low-allocation re-implementation of LINQ-to-Objects, so why does it allocate at all?

The answer is that there are a few operations that, logically, must allocate:

  • Distinct
    • Needs to track what has already been yielded
  • Except
    • Needs to track what cannot be yielded
  • GroupJoin
    • Needs to keep track of groups as it iterates over a sequence
  • Intersect
    • Needs to track what can be yielded
  • Join
    • Needs to keep track of items in the first sequence to join against the second
  • OrderBy(Descending) & ThenBy(Descending)
    • Needs to enumerate the whole sequence before it can begin sorting
  • Reverse
    • Needs to enumerate the whole sequence to find the last item before yielding
  • ToArray
    • Has to allocate the returned array
  • ToList
    • Has to allocate the returned list
  • ToLookup
    • Needs to build an internal dictionary to lookup groups by
  • ToDictionary
    • Has to allocate the returned dictionary
  • Union
    • Needs to track what has already been yielded

There are many cases (like passing known Empty enumerables) that can be optimized to not allocate, but in general these are unavoidable.

LinqAF allows for some application-specific optimizations by letting consumers register an alternative IAllocator which handles actual allocations.  For example, you could exploit the knowledge that no enumerables in a web app survive past a request and re-use allocated List<T>s.

IAllocator exposes the following methods:

  • GetArray<T>(int)
  • ResizeArray<T>(ref T[], int)
  • GetEmptyList<T>(int?)
  • GetPopulatedList<T>(IEnumerable<T>)
  • GetEmptyDictionary<TKey, TValue>(int?, IEqualityComparer<TKey>)
  • EnumerableBoxed<T>()
    • This is called for profiling purposes only
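As a sketch of what a consumer-provided allocator might look like (the interface shape here is inferred from the method list above, and the Return method is a hypothetical addition for handing lists back to the pool at end-of-request – neither is LinqAF's exact API):

```csharp
using System;
using System.Collections.Generic;

// Assumed interface shape, based on the method list above.
public interface IAllocator
{
    T[] GetArray<T>(int length);
    void ResizeArray<T>(ref T[] array, int newLength);
    List<T> GetEmptyList<T>(int? capacity);
    List<T> GetPopulatedList<T>(IEnumerable<T> source);
    Dictionary<TKey, TValue> GetEmptyDictionary<TKey, TValue>(int? capacity, IEqualityComparer<TKey> comparer);
    void EnumerableBoxed<T>();   // profiling hook only
}

// A request-scoped allocator that hands back pooled List<T>s, exploiting the
// knowledge that no enumerable outlives the current request.
public sealed class PoolingAllocator : IAllocator
{
    private readonly Dictionary<Type, Stack<object>> _listPool = new Dictionary<Type, Stack<object>>();

    public T[] GetArray<T>(int length) => new T[length];

    public void ResizeArray<T>(ref T[] array, int newLength) => Array.Resize(ref array, newLength);

    public List<T> GetEmptyList<T>(int? capacity)
    {
        if (_listPool.TryGetValue(typeof(T), out var stack) && stack.Count > 0)
        {
            var reused = (List<T>)stack.Pop();
            reused.Clear();             // keep the backing array, drop old contents
            return reused;
        }
        return capacity.HasValue ? new List<T>(capacity.Value) : new List<T>();
    }

    public List<T> GetPopulatedList<T>(IEnumerable<T> source)
    {
        var list = GetEmptyList<T>(null);
        list.AddRange(source);
        return list;
    }

    public Dictionary<TKey, TValue> GetEmptyDictionary<TKey, TValue>(int? capacity, IEqualityComparer<TKey> comparer)
        => capacity.HasValue
            ? new Dictionary<TKey, TValue>(capacity.Value, comparer)
            : new Dictionary<TKey, TValue>(comparer);

    public void EnumerableBoxed<T>() { }   // no-op outside of profiling

    // Hypothetical: called at end-of-request to make lists available for reuse.
    public void Return<T>(List<T> list)
    {
        if (!_listPool.TryGetValue(typeof(T), out var stack))
            _listPool[typeof(T)] = stack = new Stack<object>();
        stack.Push(list);
    }
}
```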

How does LinqAF minimize the size of necessary allocations?


It’s possible to safely reduce the number of cats pictured.

In cases where allocations must be made LinqAF can still minimize allocations by using specialized types rather than the generic ones in the BCL.

CompactSet<T> is used by Intersect, Except, Union, and Distinct in place of HashSet<T>.  It stores all elements in a single array (which can be shared in some cases, reducing duplication) and optimizes the common case where a “bucket” has a single value to avoid allocating an array.  CompactSet<T> only supports adding elements, as that is the only method needed for LINQ operators.
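The single-value-bucket idea can be sketched like so (an illustrative simplification, not LinqAF's actual CompactSet&lt;T&gt; – in particular, LinqAF stores all elements in one shareable array, while this sketch uses per-bucket overflow arrays):

```csharp
using System;
using System.Collections.Generic;

// Each bucket stores its first element inline; an overflow array is only
// allocated when a second element hashes to the same bucket.
struct Bucket<T>
{
    public bool HasValue;
    public T First;
    public T[] Overflow;     // null in the common single-value case
    public int OverflowCount;
}

sealed class CompactSetSketch<T>
{
    private readonly Bucket<T>[] _buckets;
    private readonly IEqualityComparer<T> _comparer;

    public CompactSetSketch(int bucketCount, IEqualityComparer<T> comparer = null)
    {
        _buckets = new Bucket<T>[bucketCount];
        _comparer = comparer ?? EqualityComparer<T>.Default;
    }

    // Returns true if the element was added, false if already present.
    // Add is the only mutation Distinct, Except, Intersect, and Union need.
    public bool Add(T item)
    {
        var ix = (_comparer.GetHashCode(item) & 0x7FFFFFFF) % _buckets.Length;
        ref var bucket = ref _buckets[ix];

        if (!bucket.HasValue)
        {
            bucket.HasValue = true;
            bucket.First = item;
            return true;         // common case: no array allocated
        }

        if (_comparer.Equals(bucket.First, item)) return false;

        if (bucket.Overflow == null) bucket.Overflow = new T[4];
        for (var i = 0; i < bucket.OverflowCount; i++)
            if (_comparer.Equals(bucket.Overflow[i], item)) return false;

        if (bucket.OverflowCount == bucket.Overflow.Length)
            Array.Resize(ref bucket.Overflow, bucket.Overflow.Length * 2);
        bucket.Overflow[bucket.OverflowCount++] = item;
        return true;
    }
}
```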

LookupHashtable<TKey, TElement> is used by ToLookup (which itself backs Join, GroupBy, and GroupJoin) in place of Dictionary<TKey, TElement>.  It also stores all elements in a single array, and optimizes the common case where a “bucket” has a single value.  Unlike CompactSet<T>, LookupHashtable<TKey, TElement> can be enumerated in a stable order.

OrderBy(Descending) uses a StructStack<T> in place of a Stack<T> internally to keep track of subsets needing sorting.  StructStack<T> only supports Push(T) and Pop(), as that’s all that is needed for OrderBy(Descending).

LinqAF also reduces allocation sizes by shrinking structs by omitting as many fields as possible, and by liberally applying [StructLayout(LayoutKind.Auto)] to allow the runtime to reorder fields for packing efficiency.
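A quick illustration of why LayoutKind.Auto helps (the types are made up, and the sizes noted are typical for 64-bit .NET rather than guaranteed):

```csharp
using System.Runtime.InteropServices;

// With sequential layout, the char fields force padding around the long;
// auto layout lets the runtime reorder fields for tighter packing.
[StructLayout(LayoutKind.Sequential)]
public struct SequentialExample { public char A; public long B; public char C; }  // typically 24 bytes

[StructLayout(LayoutKind.Auto)]
public struct AutoExample { public char A; public long B; public char C; }        // typically 16 bytes
```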

When does LINQ inherently allocate?

There are cases where the LINQ expression-to-method-calls transformation introduces some inherent allocations.  For example, repeat chains of from expressions like

int[] xs, ys, zs; /* initialization code */

var e = 
  from x in xs
  from y in ys
  from z in zs
  select x + y + z;

introduces an anonymous object to hold the intermediate x and y values.  The compiler produces code equivalent to

int[] xs, ys, zs; /* initialization code */

var e = 
  xs.SelectMany(_ => ys, (x, y) => new { x, y })
    .SelectMany(_ => zs, (xy, z) => xy.x + xy.y + z);

which allocates four delegates, one of which itself allocates an anonymous object per invocation.

I personally always use explicit method calls rather than the expression-style of LINQ which makes any “inherent” allocations an explicit and visible part of my code.  If you prefer the expression-style, then you should be aware that there are cases where the compiler will insert allocations outside of LinqAF’s code.
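For comparison, here is the same chain written both ways with explicit method calls (the array values are just for illustration); the nested form avoids the anonymous type but pays for it with closures instead:

```csharp
using System;
using System.Linq;

int[] xs = { 1 }, ys = { 2 }, zs = { 3 };

// Explicit method calls: the delegate and anonymous-object allocations are
// visible right in the source.
var e1 = xs.SelectMany(x => ys, (x, y) => new { x, y })
           .SelectMany(xy => zs, (xy, z) => xy.x + xy.y + z);

// Nesting the calls avoids the anonymous type, but x and y are now captured
// in closures – the allocations move, they don't disappear.
var e2 = xs.SelectMany(x => ys.SelectMany(y => zs.Select(z => x + y + z)));

Console.WriteLine(e1.Single());  // 6
Console.WriteLine(e2.Single());  // 6
```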

What’s next?

I’ll cover the peculiar case of SelectMany and why LinqAF must have a notion of “boxing”.


LinqAF: All the operations

This is part of a series on LinqAF, you should start with the first post.

What operations does LinqAF support?


So many operations.  And cats.

All the ones that LINQ does, and LinqAF sports numerous optimizations that additional type information makes possible.

Concretely:

There are also three static methods on Enumerable that expose special enumerables which sport many optimizations.  They are:

  • Empty
    • Aggregate() with no seed always throws
    • Aggregate() always returns seed, or seed passed to resultSelector, depending on overload
    • Any() is always false
    • AsEnumerable() always results in an empty IEnumerable
    • Average() over non-nullable types always throws an exception
    • Average() over nullable types always returns null
    • DefaultIfEmpty() can be optimized into a “OneItem” enumerable
    • Cast() always results in an empty enumerable of the destination type
    • Concat() calls return the non-empty parameter
    • Contains() always returns false
    • Count() is always 0
    • Distinct() always returns an empty enumerable
    • ElementAt() always throws an exception
    • ElementAtOrDefault() always returns default
    • First() always throws an exception
    • FirstOrDefault() always returns default
    • GroupJoin() results in an empty enumerable
    • Join() results in an empty enumerable
    • Last() always throws an exception
    • LastOrDefault() always returns default
    • LongCount() is always 0
    • Min() and Max() over non-nullable types always throw exceptions
    • Min() and Max() over nullable types always return null
    • OfType() always results in an empty enumerable of the destination type
    • OrderBy() and OrderByDescending() result in a special EmptyOrdered enumerable
    • Reverse() results in an empty enumerable
    • Select() results in an empty enumerable
    • SelectMany() results in an empty enumerable
    • SequenceEqual() always has the same result for many types (DefaultIfEmpty => false, another Empty => true)
    • SequenceEqual() devolves to a Count or Length check on things like List<T> or Dictionary<T,V>
    • Single() always throws an exception
    • SingleOrDefault() always returns default
    • Skip() and SkipWhile() result in an empty enumerable
    • Sum() is always 0
    • Take() and TakeWhile() result in an empty enumerable
    • ToArray(), ToList(), and ToDictionary() results can all be pre-sized to 0
    • ToLookup() can return a global, known empty, Lookup
    • Where() results in an empty enumerable
    • Zip() results in an empty enumerable
  • Range
    • Any() without a predicate just checks the inner count
    • Count() and LongCount() just return the inner count
    • Distinct() with no comparer returns itself
    • ElementAt() and ElementAtOrDefault() are simple math ops
    • Contains() is a range check
    • First() and FirstOrDefault() return the starting element, if count > 0
    • Last() and LastOrDefault() are simple math ops
    • Max() is equivalent to Last()
    • Min() is equivalent to First()
    • Reverse() can be optimized into a “ReverseRange” enumerable
    • SequenceEqual() against another range enumerable can just compare start and count
    • SequenceEqual() against a repeat enumerable is only true if count is 0 or 1
    • Single() and SingleOrDefault() just check against count and return start
    • Skip() advances start and reduces count
    • Take() reduces count
    • ToArray(), ToDictionary(), and ToList() can be pre-sized appropriately
  • Repeat
    • Any() without a predicate just checks the inner count
    • Contains() just checks against the inner object
    • Count() and LongCount() just return the inner count
    • Distinct() without a comparer returns a repeat enumerable with 0 or 1 items as appropriate
    • ElementAt() and ElementAtOrDefault() return the inner item
    • First() and FirstOrDefault() return the inner item
    • Last() and LastOrDefault() return the inner item
    • Max() and Min() return the inner item
    • SequenceEqual() against an empty enumerable is just a check against the inner count
    • SequenceEqual() against a range enumerable is only true if count is 0 or 1
    • SequenceEqual() against another repeat enumerable just checks that the inner items and inner counts are equal
    • Single() and SingleOrDefault() just check the inner count and return the inner item
    • Skip() is equivalent to reducing the inner count
    • Take() is equivalent to replacing the inner count
    • ToList() and ToArray() can be pre-sized appropriately
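Most of the Range entries above reduce to arithmetic; checking a few of them against LINQ-to-Objects' observable behavior (the start and count values here are arbitrary):

```csharp
using System;
using System.Linq;

var start = 5;
var count = 4;                                        // Enumerable.Range(5, 4) yields 5, 6, 7, 8
var range = Enumerable.Range(start, count);

var containsSeven = 7 >= start && 7 < start + count;  // Contains() as a range check
var third = start + 2;                                // ElementAt(2) as simple math
var last = start + count - 1;                         // Last(), which is also Max()

Console.WriteLine(containsSeven == range.Contains(7));  // True
Console.WriteLine(third == range.ElementAt(2));         // True
Console.WriteLine(last == range.Last());                // True
```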

As mentioned in the above lists, LinqAF introduces a few new “kinds” of enumerables as optimizations.  Some of these actually exist behind the scenes in LINQ-to-Objects, but the IEnumerable<T> interface conceals that.  These new enumerables are:

  • EmptyOrdered
    • Representing an empty sequence that has been ordered
    • All the optimizations on Empty apply, except that Concat() calls with EmptyOrdered as a parameter cannot return EmptyOrdered – they return Empty if they would otherwise return EmptyOrdered
  • OneItemDefault
    • Represents a sequence that is known to only yield one value, and that value is default(T)
    • All() only checks against the single item
    • Any() without a predicate is always true
    • Contains() checks against the single item
    • Count() and LongCount() are always 1
    • Count() and LongCount() with predicates are 0 or 1, depending on the single item
    • DefaultIfEmpty() returns itself
    • ElementAt() throws if index is not 0, and always returns default(T) if it is
    • ElementAtOrDefault() always returns default(T)
    • First(), FirstOrDefault(), Last(), LastOrDefault(), Single(), and SingleOrDefault() all return default(T)
    • Reverse() returns itself
    • ToArray(), ToList(), and ToDictionary() can be pre-sized to 1 item
    • OrderBy() and OrderByDescending() result in a special OneItemDefaultOrdered enumerable
  • OneItemSpecific
    • Represents a sequence that is known to only yield one value, but that value is determined at runtime
    • Has all the same optimizations as OneItemDefault, except the specified value is returned where appropriate and OrderBy() and OrderByDescending() result in a special OneItemSpecificOrdered enumerable
  • OneItem(Default|Specific)Ordered
    • Represents one of the OneItem enumerables that have been ordered
    • All the optimizations on their paired OneItem enumerable apply, except that Concat() calls cannot return a OneItem(Default|Specific)Ordered – they return the paired unordered enumerable if they otherwise would
  • ReverseRange
    • Represents a Range that has been reversed – that is, it counts down from a value
    • All the optimizations on Range apply, except that calling Reverse() on a ReverseRange results in a Range
  • SelectSelect
    • Represents repeated calls to Select() – this allows a single enumerator to handle the entire chain of calls
    • Select() results in yet another SelectSelect
    • Where() results in a SelectWhere
  • SelectWhere
    • Represents any number of Select() calls followed by any number of Where() calls – this allows a single enumerator to handle the entire chain of calls
    • Where() results in a SelectWhere
  • WhereSelect
    • Represents any number of Where() calls followed by any number of Select() calls – this allows a single enumerator to handle the entire chain of calls
    • Select() results in a WhereSelect
  • WhereWhere
    • Represents repeated calls to Where() – this allows a single enumerator to handle the entire chain of calls
    • Select() results in a WhereSelect
    • Where() results in a WhereWhere

These gigantic lists are a pretty good demonstration of how many cases additional compile-time typing lets us optimize.
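As an illustration of the WhereWhere fusion described above (a simplification, not LinqAF's code – LinqAF builds a chain of predicates rather than composing closures, and this sketch leans on IEnumerable&lt;T&gt; for brevity):

```csharp
using System;
using System.Collections.Generic;

struct WhereEnumerableSketch<T>
{
    private readonly IEnumerable<T> _source;
    private readonly Func<T, bool> _predicate;

    public WhereEnumerableSketch(IEnumerable<T> source, Func<T, bool> predicate)
    {
        _source = source;
        _predicate = predicate;
    }

    // Where() on a Where returns another WhereEnumerableSketch with a combined
    // predicate – no extra enumerator layer is stacked per call.
    public WhereEnumerableSketch<T> Where(Func<T, bool> next)
    {
        var prev = _predicate;
        return new WhereEnumerableSketch<T>(_source, x => prev(x) && next(x));
    }

    public List<T> ToList()
    {
        var ret = new List<T>();
        foreach (var item in _source)
            if (_predicate(item)) ret.Add(item);
        return ret;
    }
}
```

For example, `new WhereEnumerableSketch<int>(new[] { 1, 2, 3, 4 }, x => x > 1).Where(x => x % 2 == 0).ToList()` yields 2 and 4 with a single pass over the source.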

What’s next?

I’ll cover when LinqAF does allocate, why it does, and how you can provide your own allocator scheme to minimize the impact of those allocations.


LinqAF: Generating way too much code

This is part of a series on LinqAF, you should start with the first post.

Why use codegen for LinqAF?

As mentioned in the previous article, LinqAF defines a type per operator and has to define many instance methods on each type.  When you add it all up, replacing LINQ-to-Objects requires ~50 enumerable structs each of which exposes ~1,300 methods.

So yeah, I didn’t do it by hand.

The C# compiler has been re-implemented and open-sourced, resulting in the Roslyn project.  This makes it really easy to manipulate and reason about C# code, which I used to fashion a templating system to generate LinqAF.  Specifically, I was able to keep the templates mostly “correct” C# that benefited from type checking and Visual Studio’s navigation facilities.

Concretely:

  • Every LINQ operator has an interface defined (IConcat, IMin, ISelectMany, etc.)
  • Each logically distinct method gets a CommonImplementation… implementation (So Concat gets a few methods, Min gets a lot, SelectMany gets several)
  • A template is defined that implements the interfaces and calls the appropriate CommonImplementation methods.
    • Here dynamic comes into play, as many templates elide generic parameters (these parameters are filled in later)
  • Similarly templates are defined for various extension methods
    • Inter-operating with IEnumerable<T> and other collections defined in the BCL
    • Average
    • Concat for enumerables with two generic parameters
    • Except for enumerables with two generic parameters
    • Intersect for enumerables with two generic parameters
    • Max
    • Min
    • SequenceEqual for enumerables with two generic parameters
    • Sum
    • Union for enumerables with two generic parameters
  • Overrides that replace particular generated methods with other, static, C# code
    • This allows LinqAF to leverage static type information to avoid doing pointless work
    • For example, almost all operators on the EmptyEnumerable are replaced with implementations that elide most work
  • A separate project, LinqAF.Generator, processes all these templates to generate the final code

This code generation approach let me keep the “actual” implementation of LinqAF under 20,000 lines of code.

What are the downsides to code generation?


Code generation is typically a worse idea than a cat

The biggest one is that iteration time is much worse, and it got worse as more operators were implemented.  This is mitigated somewhat by sharing common implementations in (…) CommonImplementation, allowing bug fixes to be copy/pasted for quick testing; bugs in the code generation parts are still slow to fix.

While limited in scope, the reliance on dynamic in certain places also means that the LinqAF project can compile even if there are actually type errors.  This was most common when adding new operators, and commenting out the other operators let me decrease the iteration time considerably.

Code generation is also harder to understand and set up, as this post’s existence demonstrates.  Thankfully the Roslyn project, and its availability on NuGet, makes code generation considerably less difficult – using something like T4 or outright concatenation would have been even worse.

What’s next?

In what will probably be the longest post in the series, I cover most of the operators and the various optimizations that LinqAF has for them.


LinqAF: Replacing LINQ and not allocating

This is part of a series on LinqAF, you should start with the first post.

How is LINQ-to-Objects replaceable?

The C# language has a number of keywords that are used with any LINQ provider.  Fortunately for me, these keywords are equivalent to method calls.  In other words: from myItem in myList where myPredicate(myItem) is equivalent to myList.Where(myItem => myPredicate(myItem)).  These method calls are unqualified and untyped, so all that matters is that some code somewhere provides a definition for Where – System.Linq isn’t privileged, nor is IEnumerable<T>.

Similarly, C#’s foreach statement makes untyped and unprivileged calls to GetEnumerator().  The BCL uses this trick extensively to reduce allocations when enumerating over collections in System.Collections.Generic (for example see how List<T> has a GetEnumerator() method that returns a struct).
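A minimal example of that duck typing: foreach only requires a GetEnumerator() method returning something with MoveNext() and Current, so both the enumerable and its enumerator can be structs and the loop allocates nothing (the CountUp type here is hypothetical, not from LinqAF or the BCL):

```csharp
struct CountUp
{
    private readonly int _count;
    public CountUp(int count) { _count = count; }

    // No IEnumerable implementation anywhere – foreach binds to this by pattern.
    public Enumerator GetEnumerator() => new Enumerator(_count);

    public struct Enumerator
    {
        private readonly int _count;
        private int _next;
        public Enumerator(int count) { _count = count; _next = 0; Current = 0; }
        public int Current { get; private set; }
        public bool MoveNext()
        {
            if (_next >= _count) return false;
            Current = _next++;
            return true;
        }
    }
}

// usage: foreach (var i in new CountUp(3)) { /* i = 0, 1, 2 */ }
```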

How does LinqAF avoid allocating?


Allocations frighten kitty

The biggest trick is that all operations in LinqAF are represented as structs, whereas LINQ-to-Objects always returns an IEnumerable<T> or IOrderedEnumerable<T>.  This means that LinqAF is going to perform 0 allocations for most LINQ operations (the exceptions are covered in a later post) versus LINQ-to-Objects’ minimum of 1 allocation.

LinqAF also implements custom struct enumerators for each enumerable which can result in fewer allocations.  The savings are less for these enumerators because LINQ-to-Objects uses some tricks (like this internal Iterator<T> class) to avoid allocating enumerators in common cases.

Technically LinqAF could achieve these optimizations just by duck typing, but for ease of development and some type checking peace of mind I have all enumerables implement the IStructEnumerable interface and all enumerators implement IStructEnumerator.

There are lots of consequences (and some additional advantages) to having all these different structs flying around.  The first big consequence is that LinqAF cannot implement most operations as extension methods; instead I had to use instance methods.  This is because generic parameter inferencing ignores type constraints, and I have lots of : struct and : SomeInterface constraints to ensure the correct guarantees are upheld.

Another consequence is that I could not have parameters of a single enumerable type; I had to provide overloads for each of the concrete operator types to avoid boxing.  This meant that rather than each enumerable struct having a single Concat method they had sixty-nine(!), with similar increases in method counts for other operators.  All told LinqAF ended up being 2,500+ files and 100,000+ methods, which is much, much larger than the official LINQ-to-Objects implementation.  Naturally creating all of those files and methods by hand would have been masochistic; I cover exactly how I generated them in the next post.

Using structs also introduces a new kind of “uninitialized” error.  In C# you cannot hide a struct’s parameterless constructor (technically you cannot implement a struct’s default constructor either, but that matters less), which means it’s possible to create a non-null but also not valid enumerable to use with LinqAF.  Since LINQ’s semantics require failing-fast on invalid arguments, I added an IsDefaultValue method to all enumerables to detect this scenario.
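A sketch of the problem and the guard (type and member names here are illustrative, not LinqAF's actual code):

```csharp
using System;

struct RepeatEnumerableSketch<T>
{
    private readonly T _item;
    private readonly int _count;
    private readonly bool _initialized;   // false for default(RepeatEnumerableSketch<T>)

    public RepeatEnumerableSketch(T item, int count)
    {
        _item = item;
        _count = count;
        _initialized = true;
    }

    // default(RepeatEnumerableSketch<T>) is non-null but unusable; this detects it.
    public bool IsDefaultValue() => !_initialized;

    public int Count()
    {
        // fail fast, matching LINQ's semantics for invalid arguments
        if (IsDefaultValue()) throw new ArgumentException("enumerable not initialized");
        return _count;
    }

    public T First()
    {
        if (IsDefaultValue()) throw new ArgumentException("enumerable not initialized");
        return _item;
    }
}
```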

There are advantages to having all these different types – I can avoid dynamic dispatch (all calls on structs are non-virtual calls), and allow more optimization opportunities.  LINQ-to-Objects has some optimizations for when the actual type of an IEnumerable<T> is a List, array, WhereEnumerableIterator, and so on – but relies on runtime type checking.  LinqAF can perform many of these same optimizations, but statically – for example WhereEnumerable’s Select method returns a WhereSelectEnumerable and its Where method builds up a chain of predicates (both optimizations avoid creating pointless additional enumerables and enumerators).

What’s next?

I cover how LinqAF is structured and how the final (massive) code is produced.


LinqAF: A series of questionable ideas

Introducing LinqAF, a low allocation re-implementation of LINQ-to-Objects.  It’s a pretty questionable idea, but (in my opinion) a really interesting one too.

What is LINQ-to-Objects?

LINQ is a collection of methods provided by .NET that is used to perform a variety of operations like filtering, mapping, grouping, ordering, and so on.  There are a multitude of LINQ providers but one of the most commonly used is LINQ-to-Objects, which works on plain old .NET objects in memory (others include LINQ-to-Entities, LINQ-to-XML, and the defunct LINQ-to-SQL).

Microsoft’s implementation is done as a series of extension methods to IEnumerable.  You can browse the source on Github if you’re curious, but I personally believe the best explanation of this pattern is Jon Skeet’s Edulinq series from 2011.

Low Allocation?


LINQ doesn’t visualize well, so I’m using cat photos to breakup the text.

Many (but not all) of LINQ’s operations can be done without allocating, and when I started LinqAF those were all I intended to provide (LinqAF stood for “Linq: Allocation Free”).  When I decided to extend LinqAF to include reversing, sorting, grouping, and the various ToXXX operations (all of which require allocations) I started aiming for “a low number” of allocations.

LinqAF also includes the ability to re-use previously allocated collections, which I’ll expand on in a later post.

Give me the TL;DR

  • LinqAF is available on Github and Nuget
  • Most operations perform no allocations
  • All operations (including foreach statements) work as you’d expect
  • It’s “type inference compatible” with System.Linq
    • If you’ve written out explicit types, you’ll probably need to make source changes to use LinqAF
    • There’s one exception with SelectMany delegates which is covered in a later post
  • The whole endeavor qualifies as a Bad Idea™, but a fun one

What’s next?

I cover why it’s possible to replace LINQ so seamlessly, and how it’s possible to do so with no allocations.


An Optimization Exercise

Nick Craver, one of my coworkers at Stack Overflow, occasionally tweets out snippets of the Stack Overflow codebase.  About a week ago he showed off a ContainsToken method which has been tuned for performance.  This set off a little bit of a benchmarking contest, which I participated in.  My final attempt (which is among the fastest) ended up using a lot of tricks, which I think may be of general interest – so I’m going to walk through how the code works.

First, we need to define the problem.  We’re creating a function, which:

  • Determines if, when split on a given delimiter, a value string contains a given token
    • Ordinal comparison of characters is acceptable (this wasn’t clear from the tweet initially)
  • Does not allocate

I’m also allowing for optimizations that exploit Stack Overflow’s production environment, so 64-bit Intel and .NET 4.6+ specific tricks are allowed.  Since Stack Overflow already has some unsafe code in it, unsafe is also acceptable.

My full solution is available on GitHub, my solution is in Program.cs along with several others.  Credit to mattwarren for creating the Benchmark gist.

Here’s my full solution, sans comments, which I’ll be going over line-by-line.

public static bool ContainsTokenMonty(string value, string token, char delimiter = ';')
{
  const int charsPerLong = sizeof(long) / sizeof(char);
  const int charsPerInt = sizeof(int) / sizeof(char);
  const int bytesPerChar = sizeof(char) / sizeof(byte);

  if (string.IsNullOrEmpty(token)) return false;
  if (string.IsNullOrEmpty(value)) return false;

  var delimiterTwice = (delimiter << 16) | delimiter;

  var valueLength = value.Length;
  var tokenLength = token.Length;

  if (tokenLength > valueLength) return false;

  int tokenLongs;
  bool tokenTrailingInt, tokenTrailingChar;
  {
    var remainingChars = tokenLength;
    tokenLongs = remainingChars / charsPerLong;
    tokenTrailingInt = (tokenLength & 0x02) != 0;
    tokenTrailingChar = (tokenLength & 0x01) != 0;
  }

  var tokenByteLength = tokenLength * bytesPerChar;

  unsafe
  {
    fixed (char* valuePtr = value, tokenPtr = token)
    {
      var curValuePtr = valuePtr;
      var endValuePtr = valuePtr + valueLength;

      while (true)
      {
        long* tokenLongPtr = (long*)tokenPtr;
        {
          for (var i = 0; i < tokenLongs; i++)
          {
            var tokenLong = *tokenLongPtr;

            var valueLong = *((long*)curValuePtr);

            if (tokenLong == valueLong)
            {
              tokenLongPtr++;
              curValuePtr += charsPerLong;
              continue;
            }
            else
            {
              goto advanceToNextDelimiter;
            }
          }
        }

        int* tokenIntPtr = (int*)tokenLongPtr;
        if (tokenTrailingInt)
        {
          var tokenInt = *tokenIntPtr;

          var valueInt = *((int*)curValuePtr);

          if (tokenInt == valueInt)
          {
            tokenIntPtr++;
            curValuePtr += charsPerInt;
          }
          else
          {
            goto advanceToNextDelimiter;
          }
        }

        char* tokenCharPtr = (char*)tokenIntPtr;
        if (tokenTrailingChar)
        {
          var tokenChar = *tokenCharPtr;

          var valueChar = *curValuePtr;

          if (tokenChar == valueChar)
          {
            tokenCharPtr++;
            curValuePtr++;
          }
          else
          {
            goto advanceToNextDelimiter;
          }
        }

        if (curValuePtr == endValuePtr || *curValuePtr == delimiter)
        {
          return true;
        }

        advanceToNextDelimiter:

        while (true)
        {
          var curVal = *((int*)curValuePtr);

          var masked = curVal ^ delimiterTwice;
          var temp = masked & 0x7FFF7FFF;
          temp = temp + 0x7FFF7FFF;
          temp = (int)(temp & 0x80008000);
          temp = temp | masked;
          temp = temp | 0x7FFF7FFF;
          temp = ~temp;
          var neitherMatch = temp == 0;

          if (neitherMatch)
          {
            curValuePtr += charsPerInt;
            if(curValuePtr >= endValuePtr)
            {
              return false;
            }
            continue;
          }

          var top16 = temp & 0xFFFF0000;
          if (top16 != 0)
          {
            curValuePtr += charsPerInt;

            break;
          }

          var bottom16 = temp & 0x0000FFFF;
          if (bottom16 != 0)
          {
            curValuePtr += 1;
          }
        }

        var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);
        if (remainingBytesInValue < tokenByteLength)
        {
          return false;
        }
      }
    }
  }
}

It’s no accident that this resembles C; performance critical C# often does.  Take note that there are no new expressions (or calls out to functions which may allocate), so this code performs zero allocations.

The overall algorithm is as follows:

  1. Scan forward in token and value so long as they are equal
  2. If they are not equal, move back to the start of token and advance in value until after the next delimiter
    1. If there is no next delimiter, return false
  3. If the end of token is reached…
    1. If the next character in value is delimiter, or the end of value has been reached, return true
    2. Otherwise, reset to the start of token and advance in value until after the next delimiter
  4. If the end of value is reached, return false

Now for the line-by-line.

const int charsPerLong = sizeof(long) / sizeof(char);
const int charsPerInt = sizeof(int) / sizeof(char);
const int bytesPerChar = sizeof(char) / sizeof(byte);

Some convenient constants for later.  These could be hardcoded since long, int, and char are always 8, 4, and 2 bytes long in C# – but I find this style easier to reason about.

if (string.IsNullOrEmpty(token)) return false;
if (string.IsNullOrEmpty(value)) return false;

This code is copied from the initial tweet, no real optimizations are available here.  The check is important to maintain compatibility with the original code.

var delimiterTwice = (delimiter << 16) | delimiter;

var valueLength = value.Length;
var tokenLength = token.Length;

Some values I’ll need later.  Doubling the delimiter will be used to test for it in value, and the lengths will be used for bounds calculations.
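To make the doubling concrete: ';' is 0x003B, so delimiterTwice packs the delimiter into both 16-bit halves of an int, ready to test two characters at once later on:

```csharp
char delimiter = ';';

// ';' == 0x003B, so (0x003B << 16) | 0x003B == 0x003B003B
var delimiterTwice = (delimiter << 16) | delimiter;

System.Console.WriteLine(delimiterTwice == 0x003B003B);  // True
```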

if (tokenLength > valueLength) return false;

The rest of the code assumes that token is no larger than the remaining parts of value, so I check that right here.  If token couldn’t even fit in value, we don’t bother to look at either string and just return false.

int tokenLongs;
bool tokenTrailingInt, tokenTrailingChar;
{
  var remainingChars = tokenLength;
  tokenLongs = remainingChars / charsPerLong;
  tokenTrailingInt = (tokenLength & 0x02) != 0;
  tokenTrailingChar = (tokenLength & 0x01) != 0;
}

I’m going to compare token to value in 8-byte/4-char/1-long chunks, so I need to know how many longs fit in token.  Because token isn’t always a multiple of 4 in length, I also need to see if there are 0, 1, 2, or 3 characters after the longs.  Note that the single division is by a constant power-of-two, so it gets compiled into something much faster than actual division.
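For example, a 7-character token decomposes into 1 long (4 chars), plus 1 trailing int (2 chars), plus 1 trailing char:

```csharp
var tokenLength = 7;
var tokenLongs = tokenLength / 4;                   // 1: one 8-byte/4-char chunk
var tokenTrailingInt = (tokenLength & 0x02) != 0;   // true: one 4-byte/2-char chunk
var tokenTrailingChar = (tokenLength & 0x01) != 0;  // true: one 2-byte/1-char chunk

// 7 chars = 1 long (4) + 1 int (2) + 1 char (1)
System.Console.WriteLine(tokenLongs == 1 && tokenTrailingInt && tokenTrailingChar);  // True
```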

var tokenByteLength = tokenLength * bytesPerChar;

One more useful value before we start actually comparing strings.  Since there will be some pointer math later, knowing the number of bytes (importantly not the number of characters) in token will come in handy.

unsafe
{
  fixed (char* valuePtr = value, tokenPtr = token)
  {
    var curValuePtr = valuePtr;
    var endValuePtr = valuePtr + valueLength;

Now for the first scary bits.  The rest of this method is in an unsafe block which means: we can use pointers, and we have to do our own bounds checking with those pointers.

I take and pin pointers to value and token, make a copy of the value pointer (so it can be mutated, which pointers declared in a fixed block cannot be), and calculate endValuePtr so we can use it to do bounds checks on value.

while (true)
{
  long* tokenLongPtr = (long*)tokenPtr;
  {
    for (var i = 0; i < tokenLongs; i++)
    {
      var tokenLong = *tokenLongPtr;

      var valueLong = *((long*)curValuePtr);

      if (tokenLong == valueLong)
      {
        tokenLongPtr++;
        curValuePtr += charsPerLong;
        continue;
      }
      else
      {
        goto advanceToNextDelimiter;
      }
    }
  }

Here I start comparing the token and value strings.  Earlier I calculated how many longs fit in token, so it’s a simple for loop to make sure they all match.  Each iteration compares 4 characters worth of value to token; if they match I continue the loop, and if they don’t we goto advanceToNextDelimiter.  This is a big win for larger tokens, having a quarter the comparisons and branches.

Important things to note: the long* casts are free, and curValuePtr being incremented is faster than you might assume (since charsPerLong is a const, the increment value is pre-calculated instead of involving a multiplication).

int* tokenIntPtr = (int*)tokenLongPtr;
if (tokenTrailingInt)
{
  var tokenInt = *tokenIntPtr;

  var valueInt = *((int*)curValuePtr);

  if (tokenInt == valueInt)
  {
    tokenIntPtr++;
    curValuePtr += charsPerInt;
  }
  else
  {
    goto advanceToNextDelimiter;
  }
}

Then essentially the same check, but for a single int now – comparing two characters at a time isn’t as good as four but it’s not nothing.  Whether or not this check was necessary was determined earlier, so tokenTrailingInt is unchanging across a single call – this is good for branch prediction.

char* tokenCharPtr = (char*)tokenIntPtr;
if (tokenTrailingChar)
{
  var tokenChar = *tokenCharPtr;

  var valueChar = *curValuePtr;

  if (tokenChar == valueChar)
  {
    tokenCharPtr++;
    curValuePtr++;
  }
  else
  {
    goto advanceToNextDelimiter;
  }
}

And then again for a single char.

if (curValuePtr == endValuePtr || *curValuePtr == delimiter)
{
  return true;
}

If this chunk of code is reached then every character in token has been matched; all that’s left is to make sure that either a delimiter or the end of value has been reached.  If so, we’ve got a match and return true; otherwise we fall through to advanceToNextDelimiter.

advanceToNextDelimiter:

while (true)
{
  var curVal = *((int*)curValuePtr);

  var masked = curVal ^ delimiterTwice;
  var temp = masked & 0x7FFF7FFF;
  temp = temp + 0x7FFF7FFF;
  temp = (int)(temp & 0x80008000);
  temp = temp | masked;
  temp = temp | 0x7FFF7FFF;
  temp = ~temp;
  var neitherMatch = temp == 0;

  if (neitherMatch)
  {
    curValuePtr += charsPerInt;
    if(curValuePtr >= endValuePtr)
    {
      return false;
    }
    continue;
  }

This code advances curValuePtr forward until a delimiter is found.  There’s a clever combination of tricks here to allow checking two characters with one branch.

The first trick is advancing two characters at a time (thus the int* cast), and being willing to advance one character past the end of value.  This works because C# guarantees that a string coerced to a char* is terminated with a null character (not a null byte), so there are always two additional bytes at the end of the string for us to read into, whether we need them or not.

The second trick is a series of bit twiddlings, as follows:

  1. XOR delimiterTwice into curVal, making the top or bottom char-sized chunks of masked all 0s if they match delimiter
    • delimiterTwice was precalculated
  2. AND masked with 0x7FFF7FFF, clearing the high bits in the top and bottom char-sized chunks, and store the result in temp
  3. ADD 0x7FFF7FFF to temp, causing the high bits in the top and bottom char-sized chunks of temp to be 1 if any of the low bits in those respective chunks are set
  4. AND temp with 0x80008000, clearing all but the high bits in the top and bottom char-sized chunks in temp
  5. OR temp with masked, making sure the high bits in each char-sized chunk are set in temp when the high bits in masked are
  6. OR temp with 0x7FFF7FFF, setting all the low bits in the char-sized chunks in temp, which makes temp all 1s if neither char-sized chunk in curVal matches delimiter
  7. INVERT temp, making it all 0s if step #6 made it all 1s
  8. COMPARE temp against 0, setting neitherMatch if so

This is followed up with an if that’s taken when neitherMatch is set, in which I increment the pointer into value by two chars.  There’s then a bounds check comparing curValuePtr to endValuePtr (which was pre-calculated for this purpose), and false is returned if the bounds have been exceeded.  Note that endValuePtr points at the null terminator, so bailing when curValuePtr is equal to it is correct.

var top16 = temp & 0xFFFF0000;
if (top16 != 0)
{
  curValuePtr += charsPerInt;

  break;
}

var bottom16 = temp & 0x0000FFFF;
if (bottom16 != 0)
{
  curValuePtr += 1;
}

If this code is run, a delimiter has been found but curValuePtr isn’t pointing at that character, just at the int that contains it.  top16 and bottom16 contain the top and bottom chars from temp; whichever one isn’t all zeros is the one that contained the delimiter.  Depending on which is set, curValuePtr is incremented so that it points at the first character after the delimiter.

Note that because we’re executing on a little-endian architecture, the first char of the pair lands in the bottom 16 bits of the int (and the second char in the top 16 bits), so the increment sizes here can appear flipped.  Also remember that token cannot be empty (that is checked at the start of the method), so checking the second char (in top16) first isn’t an error.

var remainingBytesInValue = ((byte*)endValuePtr) - ((byte*)curValuePtr);
if (remainingBytesInValue < tokenByteLength)
{
  return false;
}

This code runs once the next delimiter has been found, and curValuePtr advanced to one char past it.  It is necessary to reassert that there’s enough space remaining in value for token to fit, just as was asserted back towards the start of the method.

The space check is done in bytes instead of chars (the casts are free) because counting the number of chars between two pointers is a subtraction followed by a division, but counting the number of bytes is only a subtraction.  Pre-calculating tokenByteLength was done for this purpose.

Then the while loop begins again, and continues until a match is found or value is exhausted.

Aaaaaaand that’s it.  Performance-wise, this is about 15x faster than the original ContainsToken code and something like 40% faster than the next best implementation.

I’m not sure I’d encourage anyone to use this code, as it relies on several subtleties which make it hard to understand.  It also isn’t tuned for 32-bit architectures, or correct on big-endian ones.  Both could be fixed, but that would require some build tweaks which aren’t particularly interesting or portable themselves.

It sure was fun to write though.


Scripting and Profiling Improvements in StackExchange.Redis

At Stack Exchange, we make pretty heavy use of Redis.  We’re so invested we’ve had our best minds build not one, but two different .NET libraries for it.  Recent(ish) experience with the Providence project has prompted additions to the modern library, StackExchange.Redis, in the realms of Lua scripting and Profiling.

Lua Scripting Improvements

StackExchange.Redis already had full support for Redis’s scripting commands when Providence started, but Lua scripting wasn’t really used at StackExchange at that time.  Early versions of Providence started making extensive use of Lua scripts to make atomic updates to our data during our daily refresh process, and we started having painful experiences.  The root cause of our pain was that Redis requires you write your script using KEYS and ARGV arrays to pass parameters.

Here’s a toy example to illustrate the issue.  The command

eval "redis.call('set', KEYS[1], ARGV[1])" 1 theKey theArg

rendered in C# as

IDatabase db = /* ... */;
var theKey = /* ... */;
var theArg = /* ... */;
db.ScriptEvaluate(
  "redis.call('set', KEYS[1], ARGV[1])",
  new RedisKey[] { theKey },
  new RedisValue[] { theArg }
);

sets the key “theKey” to “theArg”.  The actual script in your source would be “redis.call(‘set’, KEYS[1], ARGV[1])” which is… less than ideal, especially once your scripts start getting complicated.

In recent versions of StackExchange.Redis there’s an alternative: the new LuaScript class.  This lets you refer to variables by name, in a manner inspired by Dapper (the Stack Exchange micro-ORM).

The previous example would now be rendered in C# as:

IDatabase db = /* ... */;
var theKey = /* ... */;
var theArg = /* ... */;
var script = LuaScript.Prepare("redis.call('set', @theKey, @theArg)");
db.ScriptEvaluate(script, new { theKey, theArg });

LuaScript rewrites your script to refer to ARGV and also handles mapping your parameters to the appropriate ARGV index (and places any parameters of type RedisKey into the KEYS array as well, so as to play nicely with Redis Cluster).  You can use any object to specify parameters, so long as it has a public field or property for each @parameter.

To avoid the overhead of transmitting the script text for every call to IDatabase.ScriptEvaluate, you can convert a LuaScript into a LoadedLuaScript like so:

IServer server = /* ... */;
LuaScript script = /* ... */;
LoadedLuaScript loaded = server.ScriptLoad(script);

The LoadedLuaScript class provides the same functionality as the LuaScript class, but uses the SCRIPT LOAD and EVALSHA commands to avoid re-transmitting the entire script when evaluated.

Profiling Improvements

While Redis is really (really really) fast, at Stack Overflow we’ve still been profiling its impact on page load times since time immemorial.  Every time a developer looks at Stack Overflow, they get informed of how much time was spent talking to Redis.

That 48.8ms includes a fair amount of “moderator only” work too.

The way Stack Overflow uses Redis is analogous to how we use SQL Server – send query, wait for it to complete, read the response, then proceed.  StackExchange.Redis multiplexes and pipelines so the application doesn’t block waiting for responses, but individual HTTP requests do as a consequence of their logic.

Accordingly, Stack Overflow’s Redis profiling code has been conceptually:

var watch = Stopwatch.StartNew();
IDatabase db = /* ... */;
db.GetString(/* ... */);
watch.Stop();
PushToProfilingStats(watch.Elapsed);

Obviously the actual code is more sophisticated.

Providence, however, works quite differently.  For each request to Providence, it dispatches several Redis commands and then combines them once they all finish.

In other words, the Providence Redis code tends to look like:

IDatabase db = /* ... */;
var inFlight =
  new List<Task>
  {
    GatherFooFieldsAsync(),  // does an async redis query or two
    GatherBarFieldsAsync()   // likewise
    /* and so on */
  };
await Task.WhenAll(inFlight);

To better support profiling in situations like Providence’s, StackExchange.Redis now has the IProfiler interface, the RegisterProfiler method, the BeginProfiling method, and the FinishProfiling method.  These new additions are used by StackExchange.Redis to associate dispatched commands with a user-provided context object, even when using the async and await keywords.

Exactly how to determine context is application-specific, but the following code is used in the Providence API project, which is a stock MVC5 website.

// The IProfiler implementation
public class RedisProfiler : IProfiler
{
    const string RequestContextKey = "RequestProfilingContext";

    public object GetContext()
    {
        var ctx = HttpContext.Current;
        if (ctx == null) return null;

        return ctx.Items[RequestContextKey];
    }

    public object CreateContextForCurrentRequest()
    {
        var ctx = HttpContext.Current;
        if (ctx == null) return null;

        object ret;
        ctx.Items[RequestContextKey] = ret = new object();

        return ret;
    }
}

// In Global.asax.cs
static RedisProfiler RedisProfiler;
static ConnectionMultiplexer RedisConnection;

protected void Application_Start()
{
    RedisProfiler = new RedisProfiler();
    RedisConnection = ConnectionMultiplexer.Connect(...);
    RedisConnection.RegisterProfiler(RedisProfiler);
    // and actual app startup logic ...
}

protected void Application_BeginRequest()
{
    var ctxObj = RedisProfiler.CreateContextForCurrentRequest();
    if (ctxObj != null)
    {
      RedisConnection.BeginProfiling(ctxObj);
    }

    // and actual per-request logic ...
}

protected void Application_EndRequest()
{
    var ctxObj = RedisProfiler.GetContext();
    if (ctxObj != null)
    {
        var timings = RedisConnection.FinishProfiling(ctxObj);

        var totalCommands = 0;
        double totalMilliseconds = 0;

        foreach (var timing in timings)
        {
            totalCommands++;
            totalMilliseconds += timing.ElapsedTime.TotalMilliseconds;
        }

        // and then perf recording logic ...
    }

    // and actual end of request logic ...
}

More examples of profiling with different notions of context can be found in the StackExchange.Redis profiling tests.

Profiling provides details on each command dispatched, not just their count and time to execute.  Timings are exposed as IProfiledCommands, which includes information about:

  • The Redis instance the command was sent to
  • The DB the command was run in
  • The command itself
  • Any CommandFlags specified
  • When the command was created, which corresponds to when the StackExchange.Redis method was called
  • How long it took to enqueue the command after it was created
  • How long it took to send the command after it was enqueued
  • How long it took Redis to respond after the command was sent
  • How long it took to process the response after it was received
  • Total time elapsed from command creation until the response was processed
  • If the command was a retransmission due to Redis Cluster
  • If the command was a retransmission, a reference to the original command’s timings
  • If the command was a retransmission, whether it was an ASK or MOVED that prompted it

StackExchange.Redis is available on GitHub and NuGet.