Jil: Dynamic Deserialization

With version 1.3.0 Jil now supports dynamic deserialization of JSON, with the appropriately named JSON.DeserializeDynamic(…) methods.

What Types Are Supported?

Jil’s dynamic deserializer parses the same set of types that its static deserializer does.  Supported types are:

  • Strings and Chars
  • Booleans
  • Integers
  • Floating point numbers
  • DateTimes
  • Nullables
  • Enumerations
  • Guids, in the “D” format
  • Arrays
  • Objects

DateTime formats must be configured through an Options object; four popular JSON date formats are supported.
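
For example, here’s a minimal sketch of picking a date format (the Options overload and the ISO8601 preset are assumptions on my part; check Jil’s docs for the exact names of the four formats):

using System;
using Jil;

// Options.ISO8601 is assumed to be one of the predefined option sets
dynamic parsed = JSON.DeserializeDynamic(
    @"{""When"":""2014-01-01T12:34:56Z""}",
    Options.ISO8601
);
var when = (DateTime)parsed.When;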

How Dynamic Is It?

Jil returns a dynamic object (a feature introduced in C# 4) rather than some custom “JSON” class.  Using the parsed object is done with natural-feeling C# instead of calls against a bespoke API.

The following are supported operations on a Jil dynamic:

  • Casts
    • e.g. (int)JSON.DeserializeDynamic("123")
  • Member access
    • e.g. JSON.DeserializeDynamic(@"{""A"":123}").A
  • Indexers
    • e.g. JSON.DeserializeDynamic(@"{""A"":123}")["A"]
    • or JSON.DeserializeDynamic("[0, 1, 2]")[0]
  • Foreach loops
    • e.g. foreach(var keyValue in JSON.DeserializeDynamic(@"{""A"":123}")) { … }
      • in this example, keyValue is a dynamic with Key and Value properties
    • or foreach(var item in JSON.DeserializeDynamic("[0, 1, 2]")) { … }
      • in this example, item is a dynamic and will have values 0, 1, and 2
  • Common unary operators (+, -, and !)
  • Common binary operators (&&, ||, +, -, *, /, ==, !=, <, <=, >, and >=)
  • .Length & .Count on arrays
  • .ContainsKey(string) on objects

Jil also allows you to cast JSON arrays to IEnumerable<T>, JSON objects to IDictionary<string, T>, and either to plain IEnumerables.
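
For illustration, here’s a quick sketch exercising several of those operations (the JSON literal is invented):

using System;
using System.Collections.Generic;
using Jil;

dynamic obj = JSON.DeserializeDynamic(@"{""A"":123,""B"":[1,2,3]}");

var a = (int)obj.A;                    // cast
var alsoA = (int)obj["A"];             // indexer
var hasA = (bool)obj.ContainsKey("A"); // .ContainsKey on objects
var len = (int)obj.B.Length;           // .Length on arrays

foreach (var item in obj.B)            // foreach over an array
{
    Console.WriteLine((int)item);      // 1, then 2, then 3
}

var asInts = (IEnumerable<int>)obj.B;  // cast to IEnumerable<T>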

Dynamic objects returned by Jil are immutable.  Setting properties, setting through an indexer, .Add, +=, and so on are not supported and will throw exceptions at runtime.

Performance Tricks

As always, Jil works quite hard to be quite fast.  While dynamic deserialization is necessarily somewhat slower than its static counterpart, it’s still quite fast.

Custom Number Parsers

.NET’s built-in number parsers are much more general purpose than what Jil needs, so a custom parser that only does what is demanded by the JSON spec is used instead.

Custom Number Format

The JSON spec explicitly doesn’t specify the precision of numbers, and Jil exploits that to implement a non-IEEE 754 floating point representation while parsing.  Jil uses a less compact, but easier to parse, “Fast Number” of 169 bits for most purposes internally.  Converting from this format to .NET’s built-in floating point values requires some extra work, but is still a net speedup for a small number of accesses.  In addition, converting from integer “Fast Numbers” to .NET’s built-in integer types is quite fast.

Minimal Allocations

While dynamic dispatch implies a great many more allocations than with Jil’s static deserializer, the dynamic deserializer still strives to minimize them.  You can see this in the methods for parsing strings, where a char[] is used if possible rather than allocating strings via a StringBuilder.

Benchmarks

Benchmarking dynamic code is tricky.  Comparisons to other libraries may not be apples-to-apples if the same dynamic behavior hasn’t been implemented, or if parsing or conversion is delayed to different points.

Rather than put together a quite possibly misleading comparison to other JSON libraries, the following compares Jil’s dynamic deserializer to its static deserializer.

Comparison of Jil’s static and dynamic deserializers

The code for this benchmark can be found on Github.

While the dynamic deserializer is quite fast, it’s worth remembering that dynamic dispatch is considerably slower than static dispatch and that some type conversions are deferred when deserializing dynamically.  In practice this means that using the static deserializer is still noticeably faster than using the dynamic deserializer, but the dynamic deserializer is still usable for most purposes.

Grab the latest Jil on Nuget or check out the code on Github

After the inevitable bug fixes, I’ll probably be revisiting Jil’s existing SerializeDynamic() method to make it actually dynamic.  Currently it only dynamically dispatches on the first type it sees, which isn’t nearly as useful.


Jil: Doing JSON Really, Really Quickly

After about three months of work, and some time in a production environment, the second half of Jil (a fast JSON library built on Sigil) is ready for release.

Jil now supports both serializing and deserializing JSON.

As with serialization, Jil deserialization supports very few configuration options and requires static typing.  The interface is, accordingly, just a simple “JSON.Deserialize<T>(…)” which takes either a string or a TextReader.
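
In other words, usage looks like the following (MyType here is a hypothetical POCO, not part of Jil):

using System.IO;
using Jil;

// MyType is any plain class with ordinary members
var fromString = JSON.Deserialize<MyType>(@"{""Id"":1}");

using (var reader = new StringReader(@"{""Id"":2}"))
{
    var fromReader = JSON.Deserialize<MyType>(reader);
}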

The Numbers

Jil’s entire raison d’être is ridiculous optimizations in the name of speed, so let’s get right to the benchmarks.

Let’s start with the (recently updated) SimpleSpeedTester:

Performance Graph

The numbers show Jil’s deserializers are about 50% faster than the next fastest JSON library.  Protobuf-net is included as a baseline; it isn’t a JSON serializer (it does Protocol Buffers) but it’s the fastest .NET serializer (of anything) that I’m aware of.

The Tricks

Just like with serializing, Jil does runtime code generation using Sigil to produce tight, monolithic deserialization methods.  That right there accounts for quite a lot of performance.

To speed up deserializing numbers, Jil has a large collection of methods tailored to each built-in number type.  Deserializing bytes, for example, can omit sign checks and all but one overflow check.  This specialization avoids unnecessary allocations, and sidesteps all the overhead in .NET imposed by support for culture-specific number formatting.
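
As a sketch of what that specialization buys (illustrative only, not Jil’s actual code):

using System;

// A byte is unsigned and at most three digits, so there's no sign
// check and a single overflow check at the end suffices.
static byte ParseByte(string text, ref int ix)
{
    int value = 0, digits = 0;
    while (ix < text.Length && text[ix] >= '0' && text[ix] <= '9' && digits < 4)
    {
        value = value * 10 + (text[ix] - '0');
        ix++;
        digits++;
    }

    if (digits == 0 || value > byte.MaxValue) throw new FormatException("invalid byte");

    return (byte)value;
}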

When deserializing JSON objects, Jil tries to avoid allocations when reading member names.  It does this by checking to see if a hash function (Jil uses variants of the Fowler–Noll–Vo hash function) can be truncated into a jump table, i.e. Jil calculates hash(memberName) % X for many Xs to see if a collision-free option exists.  If there’s an option that works, Jil uses it instead of actually storing member names, which saves an allocation and the cost of a lookup in a Dictionary.
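
The search for such a jump table might look something like this sketch (again illustrative, not Jil’s implementation):

using System;

// Find a table size where hash(name) % size is collision-free for
// this type's member names; -1 means fall back to a Dictionary.
static int FindCollisionFreeTableSize(string[] memberNames, Func<string, uint> hash)
{
    for (var size = memberNames.Length; size <= memberNames.Length * 4; size++)
    {
        var seen = new bool[size];
        var collided = false;

        foreach (var name in memberNames)
        {
            var slot = (int)(hash(name) % (uint)size);
            if (seen[slot]) { collided = true; break; }
            seen[slot] = true;
        }

        if (!collided) return size; // member names never need to be allocated
    }

    return -1;
}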

A fair amount of performance is gained by being very parsimonious with memory on the heap.  Jil typically allocates a single character buffer, and may allocate a StringBuilder if certain conditions are met by the deserialized data.  Additional string allocations will be made if Jil cannot use hash functions for member name lookups.

Interested in Jil?

Grab It On Nuget or Check Out The Code

Next up is probably dynamic deserialization, though naturally I’d love to make the existing functionality even faster.  Don’t hesitate to ping me with more performance tricks.


Jil: Serializing JSON Really, Really Quickly

Several months ago I more or less finished up my “just for fun” coding side project Sigil (which I’ve written about a few times before), and started looking for something new to hack on.  I settled on a JSON serializer for .NET that pulled out all the stops to be absolutely as fast as I could make it.

About three months later, I’m ready to start showing off…

Jil

a fast JSON serializer built on Sigil.

This is still a very early release (arbitrarily numbered 0.5.5, available on Nuget [but not in search results]), so there are a lot of caveats:

  1. Jil has very few configuration options
  2. Jil has very limited support for dynamic serialization
  3. Jil only serializes, no deserialization
  4. Jil only supports a subset of types
  5. Jil has seen very little production use
    • At time of writing, Jil has been pulled into the Stack Exchange API for about a week without issue

Some Numbers

Benchmarks are the most exciting thing about Jil, so let’s get right to them.  The following are from a fork of theburningmonk’s SimpleSpeedTester project.

In tabular form, the same data:

Raw numbers are available in a Google document.  There are also further benchmarks I created while making Jil on Github.

The takeaway from these benchmarks is that Jil is about twice as fast as the next fastest .NET JSON serializer.  Protobuf-net is still faster (as you’d expect from an efficient binary protocol from Google and a library written by Marc Gravell) but Jil’s closer to it than to the next JSON serializer.

Some Tricks

I could write a whole series of how Jil shaves microseconds, and may yet do so.  I’ll briefly go over some of the highlights right now, though.

The first one’s right there in the name: Jil is built on Sigil.  That means a type’s serializer gets created in nice tight CIL, which becomes nice tight machine code, and no reflection at all occurs after the first call.

Second, Jil has a number of very specialized methods for converting data types into strings.  Rather than relying on Object.ToString() or similar, Jil has separate dedicated methods for shoving Int32s, UInt64s, Guids, and so on into character arrays.  These specialized methods avoid extra allocations, sidestep all the culture-specific infrastructure .NET makes available, and let me do crazy things like divide exactly 14 times to print a DateTime.
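
To give a flavor of those methods, here’s a simplified sketch (not Jil’s actual code) of writing an Int32 without Int32.ToString():

using System.IO;

// buffer must hold at least 11 characters (sign plus 10 digits)
static void WriteInt32(TextWriter writer, int value, char[] buffer)
{
    if (value == 0) { writer.Write('0'); return; }

    var ix = buffer.Length;
    var negative = value < 0;
    var v = negative ? value : -value; // negate into negative space so int.MinValue is safe

    while (v != 0)
    {
        buffer[--ix] = (char)('0' - (v % 10)); // v is negative, so -(v % 10) is the digit
        v /= 10;
    }

    if (negative) buffer[--ix] = '-';

    writer.Write(buffer, ix, buffer.Length - ix);
}

No culture lookup, no intermediate string, and no allocation beyond the reused buffer.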

As you’d expect of performance focused code in a garbage collected environment, the third thing Jil focuses on is trying not to allocate anything, ever.  In practice, Jil can keep allocations to a single small character array except when dealing with Guids and IDictionary<TKey, TValue>s.  For Guids Jil must allocate an array for each since Guid.ToByteArray() doesn’t take a buffer, while serializing Dictionaries still allocates an enumerator.

If you’ve clicked through to Jil’s source by now, you might have noticed some MethodImpl attributes.  That’s part of Jil’s fourth big trick, trading a fair amount of memory for more speed.  Aggressively inlining code saves a few instructions spent branching, and even more time if instruction prefetching isn’t perfect in the face of method calls.

Last but not least, Jil avoids branches whenever possible; your CPU’s branch predictor can ruin your day.  This means everything from using jump tables, to skipping negative checks on unsigned types, to only doing JSONP specific escaping when requested, and even baking configuration options into serializers to avoid the runtime checks.  This does mean that Jil can create up to 64 different serializers for the same type, though in practice only a few different configurations are used within a single program.

Check Out The Code or Grab Jil From Nuget

I’m definitely interested in any crazy code that shaves more time off.  Also of interest are faster ways to create a string (rather than write to a TextWriter); my experiment with capacity estimation works… but not with reliable enough speedups to flip it on by default.


Your Future On Stack Overflow

I recently spent a while working on a pretty fun problem over at Stack Exchange: predicting what tags you’re going to be active answering in.

Confirmed some suspicions, learned some lessons, got about a 10% improvement on answer posting from the homepage (which I’m choosing to interpret as better surfacing of unanswered questions).

Good times.

Why do we care?

Stack Overflow has had the curious problem of being way too popular for a while now.  So many new questions are asked, new answers posted, and old posts updated that the old “what’s active” homepage would cover maybe the last 10 minutes.  We addressed this years ago by replacing the homepage with the interesting tab, which gives everyone a customized view of stuff to answer.

The interesting algorithm (while kind of magic) has worked pretty well, but the bit where we take your top tags has always seemed a bit sub-par.  Intuitively we know that not all tags are equal in volume or scoring potential, and we also know that activity in one tag isn’t really indicative of interest in just that tag.

What we’d really like in there is your future, what you’re going to want to answer rather than what you already have.  They’re related, certainly, but not identical.

Stated more formally: what we wanted was an algorithm that, when given a user and their activity on Stack Overflow to date, predicted for each tag what percentage of their future answers would be on questions in that tag.  “Percentage” is a tad misleading since each question on Stack Overflow can have up to five tags, so the percentages don’t sum to anything meaningful.

The immediate use of such an algorithm would be improving the homepage, making the questions shown to you more tailored to your interests and expertise.  With any luck the insights in such an algorithm would let us do similar tailoring elsewhere.

To TL;DR, you can check out what my system thinks it knows about you by going to /users/tag-future/current on any of the older Stack Exchange sites.  The rest of this post is about how I built it, and what I learned doing it.

Unsurprisingly, that’s a pretty good model of how I think about my Stack Overflow participation.

What Do We Know?

A big part of any modeling process is going to be choosing what data to look at.  Cast too wide a net and your iteration time explodes, too narrow and you risk missing some easy gains.  Practicality is also a factor, as data you technically have but never intended to query en masse may lead you to build something you can’t deploy.

What I ended up using is simply the answers on a site (their text, creation dates, and so on), along with the tags the associated questions had when the answer was posted.  This data set has the advantage of being eminently available (after all, Stack Exchange has literally been built for the purpose of serving answers) and being public knowledge.

At various times I did try using data from the questions themselves and an answerer’s history of asking, but to no avail.  I’m sure there’s more data we could pull in, and probably will over time; though I intend to focus on our public data.  In part this is because the public data is easier to explain and consume, but also because answerers are intuitively making decisions based on what they can see, so it makes sense to focus there first.

Except with about 315,000 different parts.

A Model Of A Stack Exchange

The actual process of deriving a model was throwing a lot of assumptions about how Stack Overflow (and other Stack Exchanges) work against the wall, and seeing what actually matched reality.  Painstaking, time-consuming iteration.  The resulting model does work (confirmed by split testing against the then current homepage), and backs up with data a lot of things we only knew intuitively.

Some Tags Don’t Matter

It stands to reason that a tag that only occurs once on Stack Overflow is meaningless, and twice is probably just as meaningless.  Which begs the question: when, exactly, does a tag start to matter?  Turns out, before about forty uses a tag on Stack Overflow has no predictive ability; so all these tags aren’t really worth looking at in isolation.

Similarly, a single answer isn’t likely to tell us much about a user; what I’d expect to be significant is a habit of answering within a tag.  How many answers before it matters?  Looks like about three.  My two answers in “windows-desktop-gadgets” say about as much about me as my astrological sign (Pisces if you’re curious).

Pictured: the state of programming Q&A before Stack Overflow.

Most People Are Average (That’s Why It’s An Average)

What’s being asked on Stack Overflow is a pretty good indicator of what’s being used in the greater programming world, so it stands to reason that a lot of people’s future answering behavior is going to look like the “average user’s” answering behavior.  In fact, I found that the best naive algorithm for predicting a user’s future was taking the site average and then overlaying their personal activity.

Surprisingly, despite the seemingly breakneck speed of change in software, looking at recent history when calculating the site average is a much worse predictor than considering all-time.  Likewise when looking at user history, even for very highly active users, recent activity is a worse predictor than all time.

One interpretation of those results, which I have no additional evidence for, is that you don’t really get worse at things over time you mostly just learn new things.  That would gel with recent observations about older developers being more skilled than younger ones.

You Transition Into A Tag

As I mentioned above, our best baseline algorithm was predicting the average tags of the site and then plugging in a user’s actual observed history.  An obvious problem with that is that posting a single answer in say “java.util.date” could get us predicting 10% of your future answers will be in “java.util.date” even though you’ll probably never want to touch that again.

So again I expected there to be some number of uses of a tag after which your history in it is a better predictor than “site average”.  On Stack Overflow, it takes about nine answers before you’re properly “in” the tag.  Of course there needs to be a transition between “site average” and “your average” between three and nine answers, and I found a linear one works pretty well.
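
In code, that blend looks something like the following sketch (the three and nine answer thresholds are from above; the exact weighting is an assumption):

// below 3 answers trust the site average, above 9 trust the user,
// and interpolate linearly in between
static double PredictTagShare(int answersInTag, double siteAverage, double userAverage)
{
    if (answersInTag < 3) return siteAverage;
    if (answersInTag >= 9) return userAverage;

    var t = (answersInTag - 3) / 6.0; // 0 at three answers, 1 at nine
    return (1 - t) * siteAverage + t * userAverage;
}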

We All Kind Of Look The Same

Intuitively we know there are certain “classes” of users on Stack Overflow, but exactly what those classes are is debatable. Tech stack, FOSS vs MS vs Apple vs Google?  Skill level, Junior vs Senior?  Hobbyist vs Professional?  Front-end vs Back-end vs DB?  And on and on.

Instead of trying to guess those lines in the sand, I went with a different intuition, which was “users who start off similarly will end up similarly”.  So I clustered users based on some N initial answers, then used what I knew about existing users to make predictions for new users who fall into the cluster.

Turns out you can cut Stack Overflow users into about 440 groups based on about 60 initial tags (or about 30 answers equivalently) using some really naive assumptions about minimum distances in euclidean space.  Eyeballing the clusters, it’s (very approximately) Tech stack + front/back-end that divides users most cleanly.
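
The assignment step of that kind of clustering is the familiar nearest-centroid one; a toy version over tag-activity vectors (everything else about the real clustering is omitted):

static int NearestCluster(double[] userVector, double[][] centroids)
{
    var best = -1;
    var bestDistance = double.MaxValue;

    for (var i = 0; i < centroids.Length; i++)
    {
        double distance = 0;
        for (var j = 0; j < userVector.Length; j++)
        {
            var diff = userVector[j] - centroids[i][j];
            distance += diff * diff; // squared euclidean distance
        }

        if (distance < bestDistance) { bestDistance = distance; best = i; }
    }

    return best;
}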

We’d also expect chocolate in there.

One Tag Implies Another

Spend any time answering on Stack Overflow and you’ll notice certain tags tend to go together.  Web techs are really good for this, like “html” and “css” and “javascript” and “jquery”, but you see it in things like “ios” and “objective-c”.  It stands to reason, then, that answering a few “c#” questions should raise our confidence that you’re going to answer some “linq-to-object” questions.

Testing that assumption I find that it does, in fact, match reality.  The best approach I found was predicting activity in a tag given activity in commonly co-occurring tags (via a variation on principal component analysis) and making small up or down tweaks to the baseline prediction accordingly.  This approach depends on there being enough data for co-occurrence to be meaningful, which I found to be true for about 12,000 tags on Stack Overflow.

Trust Your Instincts

Using the Force is optional.

One pretty painful lesson I learned doing all this is: don’t put your faith in standard machine learning.  It’s very easy to get the impression online (or in survey courses) that rubbing a neural net or a decision forest against your data is guaranteed to produce improvements.  Perhaps this is true if you’ve done nothing “by hand” to attack the problem or if your problem is perfectly suited to off the shelf algorithms, but what I found over and over again is that the truthiness of my gut (and that of my co-workers) beats the one-size-fits-all solutions.  You know rather a lot about your domain, it makes sense to exploit that expertise.

However you also have to realize your instincts aren’t perfect, and be willing to have the data invalidate your gut.  As an example, I spent about a week trying to find a way to roll title words into the predictor to no avail.  TF-IDF, naive co-occurrence, some neural network approaches, and even our home grown tag suggester never quite did well enough; titles were just too noisy with the tools at my disposal.

Get to testing live as fast as you possibly can; you can’t have any real confidence in your model until it’s actually running against live data.  By necessity much evaluation has to be done offline, especially if you’ve got a whole bunch of gut checks to make, but once you think you’ve got a winner start testing.  The biggest gotcha revealed when my predictor went live is that the way I selected training data made for a really bad predictor for low activity users, effectively shifting everything to the right.  I solved this by training two separate predictors (one for low activity, and one for high).

Finally, as always solving the hard part is 90% of the work, solving the easy part is also 90% of the work.  If you’re coming at a problem indirectly like we were, looking to increase answer rates by improving tag predictions, don’t have a ton of faith in your assumptions about the ease of integration.  It turned out that simply replacing observed history with a better prediction in our homepage algorithm broke some of the magic, and it took about twenty attempts to realize gains in spite of the predictor doing what we’d intended.  The winning approach was considering how unusual a user is when compared to their peers, rather than considering them in isolation.

Again, want to see what we think you’ll be active in?  Hit /users/tag-future/current on your Stack Exchange of choice.


Making Magic: How Sigil Works

Version 3.0.0 of Sigil was just released (grab it on Nuget and check out the source on Github).  The big new feature this release is a disassembler, one which allows for some inspection of the flow of values through a .NET delegate.

But that’s not what I’m writing about.  I figure now’s as good a time as any to write up the “how” of Sigil, given that I covered the “what” and “why” in an earlier post and that recent releases have refactored Sigil’s internals into a state I’m happier with.

Bytecode Verifiers And You

In essence Sigil is a “bytecode verifier”.  If you’ve done JVM or .NET development you should be familiar with the concept, the bytecode verifiers on those platforms make sure that the class files or assemblies you load contain bytecode that can safely be executed.

The definition of “safe” is very particular: a bytecode verifier doesn’t prevent errors from occurring at execution time but rather prevents invariants of the runtime from being violated.  For example, a bytecode verifier would guarantee that an instance method invocation passes the proper number and types of parameters, and that it is invoked against an instance of the appropriate type.

One way to think about bytecode verifiers is that they guarantee that every operation receives the correct types as inputs and every operation leaves the runtime in a well formed state.

Sigil’s Special Features

Where Sigil differs from other bytecode verifiers is that it doesn’t operate on “finished” instruction sequences.  It verifies as you build a sequence, failing as soon as it can be sure the sequence is invalid.

Because Sigil deals with incomplete instruction sequences it also has to deal with a lot of unknowns, especially around branches.  It’s quite common to branch to an instruction you haven’t actually emitted yet or emit instructions that aren’t yet provably reachable, both cases a traditional verifier can never encounter.

Sigil also has to explain itself when it fails, so it has to be able to deliver where and why a given sequence became invalid (which can be far removed from the last emitted instruction because of branches).  Similar complications exist when verification is successful, as things like eliding trivial casts and replacing branches with their short forms (which are deferred until an instruction stream is finalized) require that a lot of information about the instruction stream be retained.

Simple Verifying

If you ignore branches, verifying a bytecode sequence is pretty simple.  You can think of it as executing instructions as if they consumed and produced types instead of values.  Since Sigil is a .NET library I’ll be using .NET examples, though the basic idea applies to all similar verifiers.

For example, assume the following is an implementation of a Func<int>:

ldc.i4 1
ldc.i4 2
add
ret

We know “ldc.i4” consumes nothing and produces an int.  “add” actually consumes and produces a wide range of types, but one of them is “int int -> int”.  The “ret” instruction either consumes nothing or a single type, dependent on the signature of the method it is used in; in this case it consumes an “int”, which we know because the method is a “Func<int>”.

I’ve written out the state of the stack (the .NET runtime is specified as a stack machine) after each instruction executes:

         // --empty--
ldc.i4 1 // int
ldc.i4 2 // int int
add      // int
ret      // --empty--

We need to add one more rule, which is that control flow ends with a “ret” which leaves the stack empty, and with that it’s clear that this sequence would verify.

How each instruction affects the stack, in terms of types.

The following sequence wouldn’t verify:

ldc.i4 1            // int
ldstr "hello world" // string int
mul                 // !ERROR!
ret

Since “mul” doesn’t have a transition from “string int”, verification could be shown to fail as soon as “mul” is emitted.
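
A toy verifier in the spirit of these examples (nothing like Sigil’s actual implementation) makes the idea concrete:

using System;
using System.Collections.Generic;

class ToyVerifier
{
    readonly Stack<Type> stack = new Stack<Type>();

    public void LoadConstant(int value) => stack.Push(typeof(int));
    public void LoadString(string value) => stack.Push(typeof(string));

    // only the "int int -> int" transition of add, for brevity
    public void Add()
    {
        Pop(typeof(int));
        Pop(typeof(int));
        stack.Push(typeof(int));
    }

    public void Return(Type returnType)
    {
        if (returnType != typeof(void)) Pop(returnType);
        if (stack.Count != 0) throw new Exception("stack must be empty at ret");
    }

    void Pop(Type expected)
    {
        if (stack.Count == 0 || stack.Pop() != expected)
            throw new Exception("expected " + expected + " on the stack");
    }
}

Run the failing sequence above through it and the exception is thrown as soon as Add() executes after LoadString(), just as described.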

The earliest releases of Sigil were simple verifiers, as they didn’t try to trace through branches.  Through version 1.2.9 you instead had to assert types when marking labels under certain circumstances.

Complex Verifying

As soon as you add branches life gets considerably more complicated.  Now you have to deal with cases where you can’t infer what types are on the stack immediately.

Consider the following:

ldc.i4 1
br end_label

middle_label:
add
ret

end_label:
ldc.i4 2
br middle_label

In this example the types passed to “add” are unclear when it is emitted; it’s only when “br middle_label” is encountered that it becomes clear that “add” will verify.

Furthermore, whether or not “ret” verifies depends on the method being built.  If we’re building a Func<int> then it verifies.  If we’re building a Func<string> it should fail when emitted, since there’s no way for “add” to produce a string to pass to “ret”.  If we’re building a Func<double>, then it should fail when “br middle_label” is emitted, since then “add” is known to produce and pass an Int32 to “ret”.

How Sigil Copes

Sigil deals with the complexities of verifying partial sequences in two ways: tracking the possible types on the stack, and creating a verifier for each possible code path.

Reconsidering the above example:

                // ** Initial Verifier: V1, with empty stack **
ldc.i4 1        // int
br end_label    // int
                // ** New Verifier: V2 with unknown stack, V1 stashed **
middle_label:
add             // int|long|float|double|etc.
ret             // --empty--
                // ** New Verifier: V3 with unknown stack, V2 stashed **
end_label:      // ** V1 restored (stack is: int) **
ldc.i4 2        // int int
br middle_label // int int (V2 now verified with int int as initial stack)

There’s rather a lot going on now, what with verifiers being stashed and restored.  Also note that when “add” is first encountered it places an “or” clause onto the stack, which allows “ret” to verify if it expects any type in that “or” clause.

How different verifiers see the instruction stream

The logic around creating and restoring verifiers is tricky, but boiled down (a toy sketch follows the list):

  • At an unconditional branch, store the current stack state and remove the current verifiers
    • If the branched to label has already been marked, take the current stack and check it against the expected initial stack at that label
  • At a conditional branch do the same checks and stores as an unconditional one, but don’t remove the current verifiers
  • When marking a label, if there is no verifier create a new one with an unknown initial stack
    • If there are stacks stored against that label from previous branches, check that those stacks are consistent and make them the new verifier’s initial stack
  • Treat “ret” as an unconditional branch which doesn’t store the stack state
  • If at any point an instruction is emitted and there is no current verifier, the newly emitted code is unreachable
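
Here’s that toy sketch; the names are invented and the stack-consistency checks are reduced to placeholder comments, so it captures the shape of the rules rather than Sigil’s actual RollingVerifier:

using System;
using System.Collections.Generic;

class ToyRollingVerifier
{
    // null = no verifier in scope, i.e. subsequent code is unreachable
    List<Stack<Type>> current = new List<Stack<Type>> { new Stack<Type>() };
    readonly Dictionary<string, List<Stack<Type>>> stored = new Dictionary<string, List<Stack<Type>>>();
    readonly HashSet<string> marked = new HashSet<string>();

    public void UnconditionalBranch(string label)
    {
        StoreOrCheck(label);
        current = null; // everything until the next MarkLabel is unreachable
    }

    // same checks and stores, but the current verifiers stay in scope
    public void ConditionalBranch(string label) => StoreOrCheck(label);

    // an unconditional branch that stores no stack state
    public void Return() => current = null;

    public void MarkLabel(string label)
    {
        marked.Add(label);
        if (current != null) return; // fall-through merging elided in this toy

        current = stored.TryGetValue(label, out var stacks)
            ? stacks                   // prior branches fixed the stack here
            : new List<Stack<Type>>(); // unknown initial stack ("baseless")
    }

    public void Emit(string instruction)
    {
        if (current == null) throw new InvalidOperationException(instruction + " is unreachable");
        // ... check the instruction's transitions against every stack in `current` ...
    }

    void StoreOrCheck(string label)
    {
        if (current == null) throw new InvalidOperationException("branch is unreachable");

        if (marked.Contains(label))
        {
            // ... check `current` against the stack expected at `label` ...
        }
        else
        {
            stored[label] = current;
        }
    }
}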

This logic is captured in Sigil’s RollingVerifier UnconditionalBranch, ConditionalBranch, Return, and Mark methods.

As for tracking all possible values on the stack (as in the “or” with “add” in the previous example), Sigil considers transitions rather than types on the stack.  So long as there’s at least one transition that could be taken for every instruction a verifier has seen, the instruction stream is considered verifiable.

Take for example the, rather complex, transitions for “add”:

new[]
{
    new StackTransition(new [] { typeof(int), typeof(int) }, new [] { typeof(int) }),
    new StackTransition(new [] { typeof(int), typeof(NativeIntType) }, new [] { typeof(NativeIntType) }),
    new StackTransition(new [] { typeof(long), typeof(long) }, new [] { typeof(long) }),
    new StackTransition(new [] { typeof(NativeIntType), typeof(int) }, new [] { typeof(NativeIntType) }),
    new StackTransition(new [] { typeof(NativeIntType), typeof(NativeIntType) }, new [] { typeof(NativeIntType) }),
    new StackTransition(new [] { typeof(float), typeof(float) }, new [] { typeof(float) }),
    new StackTransition(new [] { typeof(double), typeof(double) }, new [] { typeof(double) }),
    new StackTransition(new [] { typeof(AnyPointerType), typeof(int) }, new [] { typeof(SamePointerType) }),
    new StackTransition(new [] { typeof(AnyPointerType), typeof(NativeIntType) }, new [] { typeof(SamePointerType) }),
    new StackTransition(new [] { typeof(AnyByRefType), typeof(int) }, new [] { typeof(SameByRefType) }),
    new StackTransition(new [] { typeof(AnyByRefType), typeof(NativeIntType) }, new [] { typeof(SameByRefType) })
};

One more thing to track when dealing with verifiers is whether or not we know their initial stack, what I call “baseless” in the code.  It is not an error for an instruction stream to underflow a baseless verifier, since its stack could be anything.  Instead of failing verification, Sigil considers the result of underflowing a baseless stack to be a “wildcard” type which satisfies any transition; this is how “add” can pop a value to continue verification after “middle_label”.

Trickier Things

.NET has a few hairy CIL opcodes that require special handling: dup, localloc, and leave.

“dup” duplicates the current value on the stack, the difficulty being that we only know the type on the stack if we can verify the preceding instructions, which isn’t always possible.  Sigil handles this by making “dup” place a special type on the stack which, when encountered by the verifier, pushes a copy of the preceding transition’s result or a wildcard if underflowing a baseless verifier.

“localloc” is analogous to alloca(), pushing a pointer to memory on the stack, and requires that only a single value be on the stack when executed.  This means the current verifier cannot be baseless if “localloc” is to verify.  In this case Sigil uses a special transition which asserts that the size of the stack is one if the verifier is based, and is ignored if it is not.

“leave” is an unconditional branch out of a try or catch block which empties the stack entirely.  Sigil considers this equivalent to “pop”-ing exactly the number of items currently on the stack, which like “localloc” means the current verifier cannot be baseless.  Like “dup”, Sigil uses a special type to indicate to the verifier that the stack needs to be emptied.

Optimizing Emitted Code

There are two kinds of optimizations Sigil can do to emitted CIL: eliding trivial casts, and replacing instructions with their short forms.

Conceptually eliding is straightforward: just keep track of what types are guaranteed to go into a castclass or isinst operation, and if those types are assignable to the type encoded in the instruction, elide it.  Sigil attaches callbacks, like this one, to “castclass” and “isinst” transitions, which are called whenever a verifier processes those operations and are passed enough information to decide whether to elide themselves or not.

Some short forms are easy: any of the LoadConstant methods with short forms can be changed at call time.  The trickier ones are branches, as we need to wait until we’re done emitting code and can calculate offsets.  Tracking offsets is handled by BufferedILGenerator (which maintains a cache of byte offsets to each instruction), and a last-minute call to Emit.PatchBranches patches all the branch instructions that can fit their offsets into a signed byte.

Optimizing Sigil

Sigil employs some tricks to keep it relatively snappy.

Perhaps most importantly, it doesn’t re-verify operations unless it absolutely has to.  Every currently in-scope verifier maintains a cache of its state when it last verified the instruction stream, and re-uses that cache when new instructions are added.  The cache does have to be invalidated when the initial stack state changes, which only happens when branching to or marking labels.

Sigil also tries to keep the number of verifiers in scope limited by discarding any non-baseless verifiers past the first one.  Since any verifier that isn’t baseless can be traced back to the start of the method, we know that there are no “or” type clauses on the stack so the verifiers are equivalent going forward even if they took different paths through the code.

Other Tidbits

To wrap this up I’m just going to list some minor things of note found in Sigil.

  • Sigil has most of an implementation of Linq-to-Objects to run on pre-.NET 3.5 runtimes, heavily influenced by Jon Skeet’s Edulinq series
  • Sigil has its own Tuple implementation for similar reasons
  • Sigil’s disassembler, while doing a great deal more, started as a replacement for the Mono.Reflection disassembler in our code base
  • Sigil’s exception blocks are slightly different from ILGenerator’s in that you explicitly attach catches and finallys to them; this makes nested exception handling easier to debug

And again, Sigil can be installed from Nuget and the source is available on Github.


Improving Sigil

A few weeks ago I announced Sigil, a fail-fast validating wrapper around .NET’s ILGenerator.  Since then I’ve pushed a number of updates and bug fixes, culminating in the fairly significant 2.0.0 release now on Nuget.

I figured now was a good time to write down what’s changed, and where I see this code going.

A Smarter Validator

The biggest change between 1.x and 2.x is that Sigil no longer requires stack type assertions after unconditional branches.

In version 1.x you had to do something like the following:

var emit = Emit<Func<int>>.NewDynamicMethod();

var label1 = emit.DefineLabel();
var label2 = emit.DefineLabel();

emit.LoadConstant(0);
emit.Branch(label1);

emit.MarkLabel(label2, new [] { typeof(int) }); // Sigil wasn't smart enough to know what the stack was here
emit.Return();

emit.MarkLabel(label1);
emit.Branch(label2);

Version 2.x removes the need to pass types to certain MarkLabel() calls entirely.  This removes the most painful bit of busywork entailed in using the initial release of Sigil.

As a consequence of this new validator, a few breaking changes were made to Sigil’s interface.  The obvious removal of stack assertions, plus the removal of GetStack (as the stack state is not always known) and the addition of some required parameters to array manipulation methods (as the array involved cannot always be inferred).

Better Debugging Information

Sigil is all about failing fast and helpfully, and several changes have been made to the string returned by SigilVerificationException.GetDebugInfo() since Sigil’s initial release.

Examples include:

  • Non-returning paths indicated by the labels that path flows through
  • Instructions that take locals are annotated with the names and types of the involved locals
  • The instruction that causes a verification error
  • As in version 1.x the stack state at any error, now including all possible types

Optimizations

Since its initial release, Sigil has gained the ability to elide trivial casts (thanks to Marc Gravell for the first version of this feature).

Modest performance improvements have been made to rewriting branches into their short forms, and the ability to disable such rewriting has been added (credit to Jason Punyon for some of this).

Performance

Sigil has been tuned somewhat since its initial release, but the introduction of a smarter validator has come at a performance price under certain circumstances.  As a rough guideline, if you’re generating methods with hundreds of labels and branches or tens of thousands of instructions Sigil may be too slow for you.

In addition to the aforementioned optimization options (passed as OptimizationOptions), the option to defer some validation has also been added (via the ValidationOptions enumeration).  Between the two a great deal of Sigil’s utility can still be had even for very large or very branch intensive generated code.

Minor Improvements

  • While debugging, all declared labels and in-scope locals are exposed as the Labels and Locals properties
  • The WriteLine(string, params Locals[]) method
  • Automatic insertion of the volatile prefix when able
  • The ability to refer to locals and labels by name

The Future

I’m not really satisfied with Sigil’s performance just yet.  Profiling suggests that the slowest part of Sigil is gluing together instruction sequences for post-branch validation; I’m sure this can be done more quickly and probably skipped altogether in a lot of cases.

I’d also like to support the last four CIL instructions: Arglist, Mkrefany, Refanytype, and Refanyval.  Sigil has focused mostly on the DynamicMethod case where these instructions are rarely useful, but for completeness’ sake they should be available.

Sigil currently has .NET 4.5 and 3.5 builds, it’d be nice to get a .NET 2.0 build as well.  This is non-trivial as Sigil makes heavy use of LINQ and extension methods.

A more long term goal is to abstract most of Sigil’s dependency on System.Reflection and System.Reflection.Emit so that its validation and optimization features can be used for a wider array of meta-programming tasks.  Freely switching between standard .NET and IKVM.Reflection is probably the first place to start, as there’s already a similar feature in protobuf-net’s precompiler.

As before, Sigil’s source is on github and the latest stable releases are on Nuget.


Sigil: Adding Some (More) Magic To IL

A nifty thing you can do in .NET is generate bytecode (properly Common Intermediate Language [CIL], formerly Microsoft Intermediate Language [MSIL], commonly called just IL) on the fly.  Previously I’ve used it to do dumb things with strings, and build a serializer on top of protobuf-net.  Over at Stack Exchange it’s used in small but critical parts of our API, in our caching layer via protobuf-net, and in our micro-ORM Dapper.

The heart of IL generation is the ILGenerator class which lets you emit individual opcodes and keeps track of labels, locals, try/catch/finally blocks, and stack depth.

To illustrate .NET’s built-in IL generation, here’s how you’d add 1 & 2:

var method = new DynamicMethod("AddOneAndTwo", typeof(int), Type.EmptyTypes);
var il = method.GetILGenerator();
il.Emit(OpCodes.Ldc_I4, 1);
il.Emit(OpCodes.Ldc_I4, 2);
il.Emit(OpCodes.Add);
il.Emit(OpCodes.Ret);
var del = (Func<int>)method.CreateDelegate(typeof(Func<int>));

del(); // returns 3

But…

ILGenerator is quite powerful, but it leaves a lot to be desired in terms of ease of use.  For example, leave one of the Ldc_I4’s out of the above…

var method = new DynamicMethod("AddOneAndTwo", typeof(int), Type.EmptyTypes);
var il = method.GetILGenerator();
il.Emit(OpCodes.Ldc_I4, 1);
// Whoops, we left out the 2!
il.Emit(OpCodes.Add);
il.Emit(OpCodes.Ret);
var del = (Func<int>)method.CreateDelegate(typeof(Func<int>));

del();

And what happens?  You’d expect an error to be raised when we emit the Add opcode, but I’d understand deferring verification until the delegate was actually created.

Of course nothing’s ever easy, and what actually happens is an InvalidProgramException is thrown when the delegate is first used with a phenomenally unhelpful “Common Language Runtime detected an invalid program.” message.  Most of the time, ILGenerator gives you no indicator as to where or why you went wrong.

Frustrations I’ve had with ILGenerator, in descending severity:

  • Fails very late during code generation, and doesn’t indicate what went wrong
  • Allows obviously malformed instructions, like Emit(OpCodes.Ldc_I4, "hello")
  • Lack of validation around “native int” allows for code that only works on specific architectures

Enter Sigil

Naturally I have a solution, and that solution’s name is Sigil (defined as “an inscribed or painted symbol considered to have magical power”, pronounced “Si-jil”).  Sigil wraps ILGenerator, exposes much less error prone alternatives to Emit, and does immediate verification of the instruction stream.

The erroneous code above becomes:

var il = Emit<Func<int>>.NewDynamicMethod("AddOneAndTwo");
il.LoadConstant(1);
// Still missing that 2!
il.Add();
il.Return();
var del = il.CreateDelegate();
del();

And Sigil throws an exception at il.Add() with the much more helpful message “Add expects 2 values on the stack”.  Also notice how Sigil does away with all that nasty casting.

Sigil does much more than just checking that enough values are on the stack.  It’ll catch type mismatches (including esoteric ones, like trying to add a float and a double), illegal control transfers (like branching out of catch blocks), and bad method calls.

Data For Debugging

In addition to not failing helpfully, ILGenerator doesn’t give you much to go on when it does fail.  You don’t get an instruction listing or stack states, and your locals and labels are nothing but indexes and offsets.

When verification fails using Sigil, the full instruction stream to date and the current state of the stack (possibly two stacks, if a branch is involved) are captured by the thrown SigilVerificationException.  Every local and label gets a name in the instruction listing (which you can override), and the values on the stack that caused the failure are indicated.

For example…


var il = Emit<Func<string, Func<string, int>, string>>.NewDynamicMethod("E1");
var invoke = typeof(Func<string, int>).GetMethod("Invoke");
var notNull = il.DefineLabel("not_null");

il.LoadArgument(0);
il.LoadNull();
il.UnsignedBranchIfNotEqual(notNull);
il.LoadNull();
il.Return();
il.MarkLabel(notNull);
il.LoadArgument(1);
il.LoadArgument(0);
il.CallVirtual(invoke);
il.Return();

var d1 = il.CreateDelegate();

… throws a SigilVerificationException on Return(), and calling GetDebugInfo() on it gives you the following:

Top of stack
------------
System.Int32 // Bad value

Instruction stream
------------------
ldarg.0
ldnull
bne.un not_null
ldnull
ret
not_null:
ldarg.1
ldarg.0
callvirt Int32 Invoke(System.String)

You still have to puzzle through it, but it’s a lot easier to see what went wrong (that return from the passed delegate needs to be converted to a string before calling Return()).

But Wait, There’s More

Since Sigil is already doing some correctness validation that requires waiting until a method is “finished” (like making sure branches end up with their “expected” stacks), it has all it needs to automate a lot of tedious optimizations you typically do by hand when using ILGenerator.

For example, “Emit(OpCodes.Ldc_I4, {count})” shouldn’t be used if {count} is between -1 and 8; but who wants to remember that, especially if you’re rapidly iterating?  Similarly almost every branching instruction has a short form you should use when the offset (in bytes, not instructions) fits into a single byte.  Sigil automates all of that; you just call “LoadConstant” or “Branch” and move on.
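
Done by hand with ILGenerator, that bookkeeping looks something like this sketch (which LoadConstant renders unnecessary):

using System.Reflection.Emit;

static class ShortForms
{
    static readonly OpCode[] LdcI4Short =
    {
        OpCodes.Ldc_I4_0, OpCodes.Ldc_I4_1, OpCodes.Ldc_I4_2,
        OpCodes.Ldc_I4_3, OpCodes.Ldc_I4_4, OpCodes.Ldc_I4_5,
        OpCodes.Ldc_I4_6, OpCodes.Ldc_I4_7, OpCodes.Ldc_I4_8
    };

    // pick the shortest ldc.i4 variant that can encode the value
    public static void EmitLoadConstant(ILGenerator il, int value)
    {
        if (value == -1) il.Emit(OpCodes.Ldc_I4_M1);
        else if (value >= 0 && value <= 8) il.Emit(LdcI4Short[value]);
        else if (value >= sbyte.MinValue && value <= sbyte.MaxValue) il.Emit(OpCodes.Ldc_I4_S, (sbyte)value);
        else il.Emit(OpCodes.Ldc_I4, value);
    }
}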

Sigil also automates picking the appropriate version of some opcodes based on type.  In raw IL, there are separate instructions for loading bytes, ints, arbitrary ValueTypes, and reference types from an array.  Using ILGenerator you’d have to pick the appropriate opcode, but with Sigil you just call “LoadElement()” and the preceding instructions are used to figure it out.

Finally, Sigil detects when the Tailcall and Readonly prefixes can be used and inserts them into the command stream.  It’s not possible to detect when the Volatile and Unaligned prefixes should be inserted (at least so far as I know), but Sigil does only allow them to be added in conjunction with opcodes they’re legal on, which is still better than ILGenerator.

Unconditional Branch Caveat

There is one pain point Sigil does not yet address, though I have plans.  Right now, Sigil requires type assertions immediately after unconditional branches (Br and Leave, to be precise) as it’s incapable of inferring the state of the stack in this case.  This doesn’t come up quite as much as you’d expect, since truly unconditional branches are rare; especially when creating DynamicMethods.

Asserting types is attached to marking labels, and looks like the following:

var il = Emit<Func<int>>.NewDynamicMethod();

var b0 = il.DefineLabel("b0"), b1 = il.DefineLabel("b1"), b2 = il.DefineLabel("b2");
il.LoadConstant("abc");
il.Branch(b0); // jump to b0 with "abc"

il.MarkLabel(b1, new [] { typeof(int) }); // incoming: 3
il.LoadConstant(4);
il.Call(typeof(Math).GetMethod("Max", new[] { typeof(int), typeof(int) }));
il.Branch(b2); // jump to b2 with 4

il.MarkLabel(b0, new[] { typeof(string) }); // incoming: "abc"
il.CallVirtual(typeof(string).GetProperty("Length").GetGetMethod());
il.Branch(b1); // jump to b1 with 3
il.MarkLabel(b2, new[] { typeof(int) }); // incoming: 4
il.Return();

You can assert types along with any MarkLabel call; in cases where Sigil can infer the stack state, a SigilVerificationException will be thrown when there’s a mismatch.

Check It Out, Try It Out, Break It

Sigil’s source is on github, and it’s available on Nuget.

While I’ve done a fair amount of testing and converted some projects from ILGenerator to Sigil to flush out bugs, I wouldn’t at all be surprised if there are more.  Likewise, I wouldn’t be shocked if Sigil’s validation has some holes or if it’s too strict in some cases.

So grab Sigil and try it out, I love working on this sort of stuff so don’t be shy about opening issues.


Public Broadcasting: A Self-Describing Wrapper Around protobuf-net

Familiar with Protocol Buffers?  It’s a neat binary serialization format out of Google which aims to be efficient and extensible.  Java, C++, and Python have “official” libraries, and there are a plethora of libraries for other platforms.

In fact, we’ve been using protobuf-net over at Stack Exchange for a good long time, since August 18th, 2010 if our commit logs are to be believed.  It’s famously fast and simple to use, which has let it worm its way into basically all of our projects.  We even got the mind behind it to come and slum it with chumps like me.

But…

There is one pain point to using Protocol Buffers and that’s, well, defining the protocol bits.  You can either define your messages in .proto files and compile them, or if you’re using protobuf-net (or similar) annotate your existing types.

With protobuf-net a typical class ends up looking like so:

[ProtoContract]
public class RedisInboxItem : IComparable<RedisInboxItem>
{
  [ProtoMember(1)]
  public DateTime CreationDate { get; set; }
  [ProtoMember(2, IsRequired = true)]
  public InboxItemType Type { get; set; }
  [ProtoMember(3)]
  public int Id { get; set; }
  [ProtoMember(4)]
  public string Title { get; set; }
  [ProtoMember(5)]
  public bool IsPersisted { get; set; }

  // ...
}

This isn’t a problem if you’re marshaling between different code bases or acting as a service – you need to document the types involved after all; might as well do it in annotations or .proto files.  But if you’re communicating within the same code base, or with internal services, this manual protocol declaration can be onerous and a tad error prone.

Easily beat Brock, or eventually have a giant fire-breathing dragon?

Trade-Offs For Convenience

What would be really keen is a wrapper around Protocol Buffers that carries its own description, so that mapping tag numbers to fields doesn’t need any foreknowledge of the serialized types.  It’s also been a while since I committed any really terrible ILGenerator code.

So I wrote one; it trades some of protobuf-net’s speed and some of Protocol Buffers’ compactness for the convenience of not having to use .proto files or annotations.  I call it Public Broadcasting, because that’s the first phrase that popped into my head with a P followed by a B.  Naming is hard.

In short, what Public Broadcasting does is provide a structurally typed wrapper around protobuf-net.  Any member names that match when deserializing are mapped correctly, any missing members are ignored, and any safe conversions (like byte -> int or Enum <-> String) happen automatically.  In addition, Nullable<struct>’s will be converted to default(struct) if necessary when deserializing.  Inheritance, type names, interfaces, and so on are ignored; we care about how the data “looks” not how it’s “named”.

Public Broadcasting works by describing a type using Protocol Buffers, then including that description in an “envelope” along with the actual data.  When deserializing, Public Broadcasting constructs a new type with all the appropriate annotations and then lets protobuf-net do the heavy lifting of deserializing the data.  Since we only care about the “structure” of the data a good deal of .NET’s type system is discarded, namely only classes (no distinction between ReferenceType and ValueType), Lists, Dictionaries, Enumerations, Nullables, and “simple” types (int, byte, string, GUID, etc.) are used.

Since the original type being deserialized doesn’t actually need to be known with Public Broadcasting, there is a Deserialize method which returns dynamic.  Although dynamic is something I normally avoid, in the “grab a data object, then discard” style I typically use protobuf-net in, I think it’s a good fit.
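
In use that looks roughly like the sketch below; the Serializer class name and the byte[] overloads are my guesses at the API, only the dynamic-returning Deserialize is promised above:

using System;
using PublicBroadcasting; // namespace assumed

class InboxItem
{
    public int Id { get; set; }
    public string Title { get; set; }
}

// Serializer.Serialize/Deserialize are assumed names, not verified
byte[] bytes = Serializer.Serialize(new InboxItem { Id = 1, Title = "hello" });

dynamic back = Serializer.Deserialize(bytes);
Console.WriteLine((string)back.Title); // members are matched structurally, by name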

In my (admittedly limited) testing, Public Broadcasting is typically within an order of magnitude of raw protobuf-net usage.  And protobuf-net is hella-fast, so even 10x slower is going to be plenty fast most of the time.

In terms of compactness, Public Broadcasting is going to be approximately the “length of all involved strings” larger than raw protobuf-net.  As soon as you start having many instances or recursive types this overhead shrinks relative to the size of the message, as Public Broadcasting doesn’t repeat member names like JSON.

However, if you absolutely must have the smallest messages and the fastest (de)serializations then you’ll want to use protobuf-net directly; overhead imposed by Public Broadcasting is noticeable.

If You Like What You’ve Read

Grab the source, or pull the dll off of Nuget, and let me know what you think.


Kids These Days

I freely admit about 1/2 the following content is just an excuse to use this title.

Lately I’ve found myself wondering, how are kids getting into programming these days?  I blame still attending the Stack Exchange “beer bashes” (though I think Github’s “drinkups” have won the naming battle there), while no longer being able to drink for these thoughts.  Programmers like to talk about programming, and drunks like to talk about the past, so drunk programmers… well you see where I’m going with this.  Anyway, it seems like the high level stuff has become more accessible, while the “what is actually happening”-bits are increasingly hidden.

Let Me Explain That Last Bit

Every kid who has touched a computer since 2001 has had access to decent javascript, and the mountains of examples view-source entails.  Today Java, C#, Python, Ruby, etc. are all freely available along with plenty of tutorials and example code.  That’s a hell of a lot more than what was out there when I started, but it’s all a few dozen levels of abstraction up.  And that’s my concern, the bare metal is really deep down and stumbling upon it is quite hard.

When I Was Your Age…

I first started coding on a TI-99/4A, in TI-BASIC, at the age of 5.  That’s a machine you can fit in your head.  The processor spec is ~40 pages long (the errata for the Core 2 Duo is almost 100 pages), and there’s none of the modern magic, like pipelining, branch prediction, or pre-fetching (there’s not even an on chip cache!).

Now of course, I didn’t actually understand the whole machine at 5 but the people teaching me did.  Heck, the only reason a machine discontinued in ’84 was available to me in ’92 was because they’d worked on the thing and saved a couple.

A grounding in such a simple machine led me to question a lot about the higher level languages I eventually learned.  For example when I first started learning Java the idea that two functions with the same name could exist just messed with my head, because it didn’t gel with my previous experience, driving me to read up on namespacing and vtables.

Never did figure out how to catch Mew though.

Looking back, another avenue for programming knowledge is how horribly glitchy video games were.  Actually, video games are still horribly glitchy but the consoles are a lot more sophisticated now.  Used to be when a game went sideways the consoles just did not care, and you could wreak all sorts of havoc exploiting glitches.  I can honestly say I “got” pointers/indirection (though I don’t remember if I knew the words or not) when I wrapped my head around the infamous Cinnabar Island glitch.  Some glitches of this style can still happen on modern consoles, but there’s proper memory protection and some spare cycles to spend on validation now so they’re a lot rarer.

If you shift my story forward to the present day, a kid learning on 8 year old hardware would be using a PC running Windows XP (probably with a Pentium 4 in it); a simple machine that is not.  Their consoles have hypervisors, proper operating systems, and patches!  If you play Pokémon today you actually only have one masterball.  The most archaic kit they’re likely to encounter is a TI-83 calculator (which still has a Z80 in it), and even then not till high school.

Lies To Children

Which has some serious issues on the Surface RT; fix is pending approval in the app store.

Thinking about this led me to write Code Warriors, a dinky little coding puzzle app, as my “see if XAML sucks any less than it used to” side project (the answer is yes, but the REPL is still bad).  Even it is a lies-to-children version of assembly, making significant compromises in the name of “fun”.  And while I was thinking about kids learning, I certainly didn’t playtest it with any children so… yeah, probably not great for kids.

I have no doubts we’re going to keep producing great programmers.  Colleges, the demands of the job market, and random people on the internet will produce employable ones at least.  But I find myself wondering if we aren’t growing past the point where you can slip into programming as if by accident.  It is after all quite a young field, kids in college when ENIAC was completed are still alive, so perhaps it’s just a natural progression.


Trustable Online Community Elections

Keeping votes secure for a couple thousand years, and counting.

One of the hallmarks of successful online communities is self-governance, typically by having some or all admins/moderators/what-have-yous come from “within” the community.  Many communities select such users through a formal election process.  Wikipedia and Stack Exchange (naturally) are two large examples; you’ve probably stumbled across a number of forums with similar conventions.

But this is the internet: anonymity, sock-puppetry, and paranoia run wild.

How can we prove online elections are on the up-and-up?

First, a disclaimer.  Today the solution is always some variation of reputation; sock puppetry is mitigated by the difficulty inherent in creating many “trusted” users, and the talliers are assumed to be acting in good faith (they, after all, are rarely anonymous and stand to lose a lot if caught).  This works pretty well; the status quo is good.  What I’m going to discuss is interesting (I hope), but hardly a pressingly needed change.

Also, this is just me thinking aloud; this doesn’t reflect any planned changes over at Stack Exchange.

Modeling the threat

We need to define what we’re actually worried about (so we can proof our solution against it), and what we explicitly don’t care about.

Legitimate concerns:

  • Ballot box stuffing
  • Vote tampering
  • Vote miscounting

Explicitly ignoring:

  • Voter impersonation
  • Voter intimidation

The concerns are self-explanatory: we just want to make sure that only people who should vote do vote, that their votes are recorded correctly, and that they’re ultimately tallied correctly.

The attacks we’re explicitly ignoring require a bit more explanation; they all ultimately descend from the fact that these elections occur online, where identity is weak and the stakes are relatively low.

It is hard to prove who’s on the other end of the internet.

Consider how difficult it is to prove that Jimmy Wales is actually the one tweeting from that account, or actually the one editing a Wikipedia article.  Since identity is so much weaker online, we’re assuming that whatever authentication a community uses for non-election tasks is adequate for elections.

We discount voter intimidation because, for practically all online communities, it’s impractical and unprofitable.  If you were to choose two Stack Overflow users at random, it’s 50/50 whether they’re even on the same side of the Atlantic, much less within “driving to their home with a lead pipe”-distance.  Furthermore, the positions being run for aren’t of the high-paying variety.  Essentially, the risk and difficulty grossly outweigh the rewards, so I feel confident in discounting the threat.

The Scheme

Before we get too deep into the technical details, I need to credit VoteBox (Sandler, Wallach) for quite a lot of these ideas.

Voter Rolls

To protect against ballot stuffing, it’s important to have a codified list of who can participate in a given election.  This is a really basic idea: you have to register to vote (directly or otherwise) in practically all jurisdictions.  But in the world of software it’s awfully tempting to do a quick “User.CanVote()” check at the time a ballot is cast, which makes it impossible to detect ballot box stuffing without consulting a historical record of everything that affects eligibility (and it’s almost impossible to guarantee such a record hasn’t itself been tampered with).

Therefore, prior to an election all eligible voters should be published (for verification).  Once the election starts, this list must remain unchanged.
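
To make “publish and freeze” concrete, here’s a minimal sketch in C# (the sorted-list digest is my own illustration, not something the scheme mandates): publish the roll itself plus a hash of a canonical form, so anyone holding a copy can prove later edits.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    static class VoterRoll
    {
        // Hash a canonical (ordinally sorted, newline-joined) form of the roll.
        // Publish this digest alongside the list itself before the election starts.
        public static string Commitment(IEnumerable<string> voterIds)
        {
            var canonical = string.Join("\n", voterIds.OrderBy(id => id, StringComparer.Ordinal));
            return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)));
        }
    }

Anyone can recompute the digest from the published list; any post-hoc addition or removal changes it, and the unchanged digest is what auditors compare against once voting is underway.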

It’s A Challenge

Vote tampering (typically by the software) is difficult to proof against, but there are some good techniques.  First, all ballots must actually be stored as ballots; you can’t just “Candidate++” in code somewhere, and normalized values in a database are no good to anyone; you need the full history for audits.  Again, this is common sense in real-world elections but needs to be made explicit for online ones.

This does make voting a little more confusing.

A more sophisticated approach is to make individual ballots “challenge-able”.  The idea is simple: break “casting a ballot” into two parts, “stage” and “commit/challenge”.  When a ballot is staged, you publish it somewhere public (this implies some cryptographic magic, which we’ll get back to later).  If the voter then “commits” their vote you make a note of that (again publicly), but if they “challenge” it you reveal the contents of the ballot for verification (this implicitly spoils the ballot, and involves some more crypto-magic).  Obviously, if what the voter entered doesn’t match what the ballot contained, something is amiss.

Challenge-able ballots make it possible for a group of voters to keep the system honest: since there’s no way for the system to know who will challenge a ballot, any vote tampering carries a risk of being discovered.  And, the internet being the internet, news will get around quickly once tampered-with ballots start turning up, making the odds of further challenges even greater.
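
As a sketch of that stage/commit/challenge shape (the scheme described later reveals an ElGamal ephemeral key; here a plain hash commitment, with names of my own invention, stands in to keep things short):

    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    enum BallotState { Staged, Committed, Spoiled }

    class StagedBallot
    {
        private readonly string choice;
        private readonly byte[] nonce;
        public byte[] PublishedCommitment { get; }   // published the moment the ballot is staged
        public BallotState State { get; private set; } = BallotState.Staged;

        public StagedBallot(string choice)
        {
            this.choice = choice;
            nonce = RandomNumberGenerator.GetBytes(32);
            PublishedCommitment = Hash(choice, nonce);
        }

        public void Commit() => State = BallotState.Committed;

        // Challenging reveals the ballot's contents, spoiling it, so anyone can
        // check the published commitment against what the voter says they entered.
        public (string Choice, byte[] Nonce) Challenge()
        {
            State = BallotState.Spoiled;
            return (choice, nonce);
        }

        public static byte[] Hash(string choice, byte[] nonce) =>
            SHA256.HashData(Encoding.UTF8.GetBytes(choice).Concat(nonce).ToArray());
    }

A challenger recomputes Hash(choice, nonce) from the revealed values; if it doesn’t match the published commitment, the system recorded something other than what the voter entered.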

Freedom From Hash Chains

For challenges to work there needs to be a method for publishing “committed” ballots.  We could use something low-tech like an RSS feed, a mailing list, or similar, but ideally we’d have some guarantee that the feed itself isn’t being tampered with.  Consider the attack where some unchallenged ballots are removed from the feed long after they were published, or new ones are added as if they had been cast in the past.

Our go-to solution for that is pushing each ballot (as well as some “keep alive” messages, for security reasons) into a modified hash chain.  The idea is simple: every new message contains a hash of the message before it.  Verifying that no illicit insertions or deletions have occurred is just a matter of rehashing the chain one piece at a time and making sure everything matches.  You can also take a single message and verify that it still belongs in the chain at a later date, meaning that people who want to verify the chain don’t have to capture every message the instant it’s published.
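
A minimal sketch of such a chain (types and names mine):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    record Entry(byte[] PrevHash, string Payload)
    {
        public byte[] Hash() =>
            SHA256.HashData(PrevHash.Concat(Encoding.UTF8.GetBytes(Payload)).ToArray());
    }

    class HashChain
    {
        private readonly List<Entry> entries = new();

        // The root message should be agreed upon but unpredictable; see the
        // newspaper-headline trick discussed below.
        public HashChain(string rootMessage) =>
            entries.Add(new Entry(Array.Empty<byte>(), rootMessage));

        public void Append(string payload) =>
            entries.Add(new Entry(entries[^1].Hash(), payload));

        // Re-hash every link; any insertion, deletion, or edit breaks a match.
        public bool Verify()
        {
            for (var i = 1; i < entries.Count; i++)
                if (!entries[i].PrevHash.SequenceEqual(entries[i - 1].Hash()))
                    return false;
            return true;
        }
    }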

Note that the VoteBox system goes a step further and creates a “hash mesh”, wherein multiple voting machines hash each other’s messages, to detect tampering originating from a single machine.  In our model, all the machines are under the direct control of a single entity and voters lack physical access to the hardware, so the additional security of a “hash mesh” doesn’t really apply.

Time Is Not On Your Side

One subtle attack on this hash-chaining scheme: if you’re sufficiently motivated, you can pre-compute chains.  While far from a sure thing, with enough analytic data to predict roughly when users will vote and how likely they are to vote, challenge, and so on, you could run a large number of fake elections “offline” and then try to get away with publishing the one that deviates least (in terms of timing and challenged ballots) from reality.  With the same information and considerably more resources, you could compute fake elections on the fly, starting from the currently accepted reality, and try to steer things that way.  Since the entities involved typically have lots of computing resources, and elastic computing is widely available, we have to mitigate this admittedly unlikely, but not impossible, attack.

The problem in both cases is time.  An attacker has a lot of it before an election (when the voter roll may not be known, but can probably be guessed) and a fair amount of it during one.

Any independent and verifiable random source will do.

We can completely eliminate the time before an election by using a random root message in the hash chain.  Naturally we can’t trust the system to give us a randomly generated number, but if it commits to using something public and uncontrollable as the root message then all’s well.  An example would be to use the front-page headlines of some national newspapers (say the New York Times, USA Today, and the Wall Street Journal, in that order) from the day the election starts as the root message.  Since the root block is essentially impossible to know in advance, faking a hash chain ahead of time is no longer feasible, aforementioned analytic data or not.

To make attacking the hash chain during the election difficult, we could introduce a random source for the keep-alive payloads, like Twitter messages from a Bieber search.  More important would be selecting a hash function that is both memory- and time-hard, such as scrypt.  Either is probably sufficient, but scrypt (or similar) would be preferable to Bieber if you had to choose.
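
scrypt isn’t in the .NET base class library, so purely as an illustration of the idea: PBKDF2 (time-hard via its iteration count, though not memory-hard like scrypt) shows the shape of swapping a deliberately slow hash into the chain sketch above.

    using System.Security.Cryptography;

    static class ChainHash
    {
        // Tune the iteration count so each link costs meaningful CPU time,
        // making on-the-fly fabrication of an entire chain expensive.
        public static byte[] SlowHash(byte[] message, byte[] prevHash) =>
            Rfc2898DeriveBytes.Pbkdf2(message, prevHash, 500_000,
                                      HashAlgorithmName.SHA256, 32);
    }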

Tallying Bananas

Thus far we’ve mostly concerned ourselves with individual ballots being tampered with, but what about the final results?  With what’s been discussed so far, it’s still entirely possible that everyone votes, challenges, and audits to their heart’s content, and at the end of the election totally bogus results are announced.

The problem is that the ballots that count are necessarily opaque (they have to be encrypted, clearly, even though we’re publishing them); only the challenged ones have been revealed.

There do exist encryption schemes that can work around this: namely, any homomorphically additive encryption function; in other words, any encryption function E(x) for which there exists a function F(x1, x2) such that F(E(a), E(b)) = E(a+b).  Two such schemes are (a modified) ElGamal and Paillier (others exist); the ephemeral keys present in ElGamal are useful for our ballot challenge scheme.

Encrypting ballots with ElGamal and publishing them allows anyone to tally the results of the election.  The only constraint imposed by this system is that the voting method in use must be amenable to its ballots being represented as tuples of 1s and 0s: not a big hurdle, but something to be aware of.
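
To make the homomorphic tally concrete, here’s a toy “exponential” ElGamal sketch (every parameter is laughably small, and the whole thing is for illustration only): the vote rides in the exponent, so multiplying ciphertexts component-wise adds the underlying plaintexts.

    using System;
    using System.Numerics;
    using System.Security.Cryptography;

    // Toy exponential ElGamal: E(m) = (g^r, g^m * h^r) mod p.
    static class ToyElGamal
    {
        static readonly BigInteger p = 30803;   // prime (toy size; a real system needs a proper group)
        static readonly BigInteger g = 2;
        static readonly BigInteger x = 1234;    // private key, held by those running the election
        static readonly BigInteger h = BigInteger.ModPow(g, x, p); // public key

        public static (BigInteger C1, BigInteger C2) Encrypt(int m)
        {
            // r is the ephemeral key; publishing it is what "challenging" a ballot reveals
            var r = new BigInteger(RandomNumberGenerator.GetInt32(2, 10_000));
            return (BigInteger.ModPow(g, r, p),
                    BigInteger.ModPow(g, m, p) * BigInteger.ModPow(h, r, p) % p);
        }

        // The homomorphism: F(E(a), E(b)) = E(a + b).
        public static (BigInteger C1, BigInteger C2) Add(
            (BigInteger C1, BigInteger C2) a, (BigInteger C1, BigInteger C2) b) =>
            (a.C1 * b.C1 % p, a.C2 * b.C2 % p);

        // Decrypt to g^m, then brute-force the small exponent; fine for a
        // tally, which can't exceed the size of the voter roll.
        public static int Decrypt((BigInteger C1, BigInteger C2) c, int maxTally)
        {
            var gm = c.C2 * BigInteger.ModPow(c.C1, p - 1 - x, p) % p;
            for (var m = 0; m <= maxTally; m++)
                if (BigInteger.ModPow(g, m, p) == gm) return m;
            throw new InvalidOperationException("tally out of range");
        }
    }

Tallying a yes/no race is then just folding Add over every committed ciphertext and calling Decrypt once at the very end; the individual ballots stay opaque throughout.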

This attack is like ballot stuffing with a single very large ballot.

Another subtle attack: since ballots are opaque, nothing prevents a ballot from containing something other than a 0 or 1 for a slot in a “race” (or all 1s, or any other invalid combination).  Some more crypto-magic, in the form of zero-knowledge proofs (specifically non-interactive ones) attached to a ballot for every “race” it contains, gives us confidence (technically not 100%, but we can get arbitrarily close to it) that every ballot is well-formed.  The exact construction is complicated (and a bit beyond me, honestly), but the Adder system [Kiayias, Korman, Walluck] provided a reference implementation for ElGamal (adapted into VoteBox; the Adder source seems to have disappeared).

There’s one last elephant in the room with this system.  Although everyone can tally the votes, and everyone can verify that each individual vote is well-formed, those running the election are still the only ones who can decrypt the final results (they’re the only ones who have the ElGamal private key).  There honestly isn’t a fool-proof solution to this that I’m aware of, as revealing the private key allows individual ballots to be decrypted in addition to the tally.  Depending on the community this may be acceptable (ie. ballot privacy may only matter during the election), but that is more than likely not the case.  If there is a trusted third party, they can hold the key in escrow for the duration of the election and then verify and decrypt the tally.  In the absence of a single trusted third party, a threshold scheme for revealing the private key may be acceptable.
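
For the threshold route, Shamir secret sharing is the classic construction; this sketch (field choice and API mine) splits the private key among n trustees so that any k of them, and no fewer, can reconstruct it once the election ends.

    using System.Collections.Generic;
    using System.Linq;
    using System.Numerics;
    using System.Security.Cryptography;

    static class Shamir
    {
        static readonly BigInteger P = BigInteger.Pow(2, 127) - 1; // a prime field

        // Split `secret` into n shares, any k of which reconstruct it.
        public static List<(int X, BigInteger Y)> Split(BigInteger secret, int k, int n)
        {
            // random polynomial of degree k-1 whose constant term is the secret
            var coeffs = new List<BigInteger> { secret };
            for (var i = 1; i < k; i++)
                coeffs.Add(new BigInteger(RandomNumberGenerator.GetBytes(16)
                               .Append((byte)0).ToArray()) % P); // extra 0 byte keeps it non-negative
            return Enumerable.Range(1, n).Select(x => (x, Eval(coeffs, x))).ToList();
        }

        static BigInteger Eval(List<BigInteger> coeffs, int x)
        {
            BigInteger acc = 0, xp = 1;
            foreach (var c in coeffs) { acc = (acc + c * xp) % P; xp = xp * x % P; }
            return acc;
        }

        // Lagrange interpolation at x = 0 recovers the constant term.
        // Pass exactly k distinct shares.
        public static BigInteger Combine(List<(int X, BigInteger Y)> shares)
        {
            BigInteger secret = 0;
            foreach (var (xi, yi) in shares)
            {
                BigInteger num = 1, den = 1;
                foreach (var (xj, _) in shares.Where(s => s.X != xi))
                {
                    num = num * -xj % P;
                    den = den * (xi - xj) % P;
                }
                var inv = BigInteger.ModPow((den % P + P) % P, P - 2, P); // Fermat inverse
                secret = (secret + yi * num % P * inv % P + P) % P;
            }
            return secret;
        }
    }

Hand a share each to, say, five well-known community members with a 3-of-5 threshold and no lone insider can decrypt anything mid-election; the key (and with it, sadly, the individual ballots) only becomes recoverable once enough trustees cooperate.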

Altogether Now

In short, the proposed system works like so:

  • Prior to the election starting, a full roll of eligible voters is published
  • When the election starts, a hash chain is started with an agreed upon (but random) first message
    • The hash chain uses a memory and time hard hash function
  • A public/private key pair for a homomorphically additive encryption scheme is generated, with the private key stored in an agreed-upon manner
  • Optionally, throughout the election agreed upon (but random) messages are added to the hash chain as keep-alives

When a user votes:

  • Their selections are encrypted using the previously generated public key
  • A non-interactive zero-knowledge proof (NIZK) is attached to the ballot to prove its well-formedness
  • An identifying token is also attached to the ballot to check against the voter roll
  • All this data is pushed to the hash chain as a single message
  • The voter then chooses to either commit or challenge their ballot
    • If they commit, a message is added to the hash chain to this effect
    • If they challenge, the ephemeral key for the ballot is published, allowing any third party to decrypt that ballot and letting the voter confirm the election is being run in good faith (they may then cast another ballot)

When the election ends:

  • Some pre-agreed upon message is placed in the hash chain
    • It’s not important that this message be random; it’s just a signal that the chain is finished and those watching can stop
  • Anyone who wishes to can verify the integrity of the published hash chain, along with the integrity of individual ballots
  • Anyone who wishes to may tally the votes by exploiting the homomorphic properties of the chosen encryption scheme
  • Whichever approach to decrypting the tally has been chosen is undertaken, concluding the election

Improvements

The obvious place to focus on improving is the final decryption.  My gut (with all the weight of a Bachelor’s in Computer Science, so not a lot of weight) tells me there should be a way to construct a proof that the announced tally actually comes from decrypting the summed ciphertexts.  Or perhaps there’s some way to exploit the one-to-many nature of ElGamal encryption to provide (in advance of the election, naturally) a function that transforms the tally ciphertext into one a different key can decrypt, but have that transformation not work on the individual ballots.

Of course, as I stated before, the status quo is acceptable.  It’s just fun to think of what we could build to deal with the theoretical problems.

