Overthinking CSV With Cesil: Source Generators

Posted: 2021/02/05 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

As of 0.8.0, Cesil now has accompanying Source Generators (available as Cesil.SourceGenerators on Nuget). These allow for Ahead-Of-Time (AOT) generation of serializers and deserializers, allowing Cesil to be used without runtime code generation.

What the heck is a Source Generator?

Released as part of .NET 5 and C# 9, Source Generators are a really cool addition to the C# (and VB.NET) compiler (also known as Roslyn). They let you hook into the compiler and add additional source code to a compilation, and join Analyzers (which let you raise custom warnings and errors as part of compilation) as official ways to extend the .NET compilation process.

Source Generators may sound like a small addition – after all, generating code at compile time has a long history even in .NET with things like T4, and (some configurations of) Razor. What sets Source Generators apart is their integration into Roslyn – so you get the actual ASTs and “semantic models” the compiler is using, you’re not stuck with string manipulation or home grown parsers.

Having the compiler’s view of the code makes everything more reliable, even relatively simple things like finding all the uses of an attribute. In the past where I would have done a half-baked approach looking for various string literals (and ignoring aliases and other “hard” problems, now you just write:

// comp is a Microsoft.CodeAnalysis.Compilation
var myAttr = comp.GetTypeByMetadataName("MyNamespace.MyAttribute");
var typesWithAttribute =
  comp.SyntaxTrees
    .SelectMany(
      tree =>
      {
        var semanticModel = comp.GetSemanticModel(tree);
        var typeDecls = tree.GetRoot().DescendantNodesAndSelf().OfType();

        var withAttr =
          typeDecls.Where(
            type =>
              type.AttributeLists.Any(
                attrList =>
                  attrList.Attributes.Any(
                    attr =>
                    {
                      var attrTypeInfo = semanticModel.GetTypeInfo(attr);

                      return myAttr.Equals(attrTypeInfo.Type, SymbolEqualityComparer.Default);
                    }
                  )
              )
          );

        return withAttr;
      }
    );

If you prefer LINQ syntax, you can express this more tersely.

var myAttr = comp.GetTypeByMetadataName("MyNamespace.MyAttribute");
var typesWithAttribute = (
  from tree in comp.SyntaxTrees
  let semanticModel = comp.GetSemanticModel(tree)
  let typeDecls = tree.GetRoot().DescendantNodesAndSelf().OfType()
    from type in typeDecls
    from attrList in type.AttributeLists
    from attr in attrList.Attributes
    where myAttr.Equals(semanticModel.GetTypeInfo(attr).Type, SymbolEqualityComparer.Default)
    select type
  ).Distinct();

It’s also really easy to test Source Generators. There’s a bit of preamble to configure Roslyn enough to compile something, but it’s all just Nuget packages in the end – and once you’ve got a Compilation it’s a simple matter to use a CSharpGeneratorDriver to run an ISourceGenerator and get the updated Compilation. With that done, extracting the generated source or emitting and loading an assembly is a simple matter. For Cesil I wrapped all of this logic up into a few helper methods.

How to use Cesil’s Source Generators

First, you need to add a reference to the Cesil.SourceGenerator nuget project. The versions of Cesil and Cesil.SourceGenerator must match, for reasons I discuss below.

After that, the simplest cases (all public properties on POCOs) just require adding a [GenerateSerializer] or [GenerateDeserializer] to a class, and using the AheadOfTimeTypeDescriber with your Options. For cases where you want to customize behavior (exposing fields, non-public members, using non-property methods for getters and setters, custom parsers, etc.) you apply a [DeserializerMember] or [SerializerMember] attributes with the appropriate options.

For example, a class that uses a custom Parser for a particular (non-public) field would be marked up like so:

namespace MyNamespace
{
  [GenerateDeserializer]
  public class MyType
  {
    [DeserializerMember(ParserType = typeof(MyType), ParserMethodName = nameof(ForDateTime))]
    internal DateTime Fizz;

    public static bool ForDateTime(ReadOnlySpan data, in ReadContext ctx, out DateTime val)
    => DateTime.TryParse(data, out val);
  }
}

There are a few extra constraints when compared to the runtime code generation options Cesil gives you. Since we’re actually generating source, anything exposed has to be accessible inside to that generated code – this means private members and types cannot be read or written. At runtime, Cesil lets you use delegate for most customization options (like Parsers, Getters, and so on) – this isn’t an option with Source Generators because delegates cannot be passed to attributes (nor do they really “exist” at compile time, in the same way they do at runtime).

How Cesil’s Source Generators work

While Source Generators are pretty new, it seems like things are settling on a few different forms:

Filling out partials
- For these, a partial type (potentially with partial members) is filled out with a source generator.
- This means the types being “generated” (and uses of its various members) already exist in the original project, however the partial keyword lets the actual code in tha type be generated.
- An example is StringLiteralGenerator.
Generating new types based on attributes, interfaces, or base classes
- For these, a whole new type is generated based on an existing type in the original project.
- This means the types being generated don’t appear in the original project, and are presumably found via reflection (but not code generation) at runtime.
- An example is Generator.Equals.
Generating new types based on non-code resources
- For these, generated code is generated from something other than C# code. While this must be found somehow through some configuration in the project, the logical source could be a database, the file system, an API, or really anything.
- As with #2, this implies the generated types are found dynamically at runtime.
- An example is (or will be) protobuf-net.

Cesil is an example of #2, each annotated type gets a (de)serializer class generated at compile time and the AheadOfTimeTypeDescriber knows how to find these generated classes given the original type. A consequence of this design is that some runtime reflection does happen, even though no code generation does.

There’s something of a parallel consideration for Source Generators as well: whether the source generated is something that could have been handwritten in a supported way, versus generated code that is an implementation detail. I think of this distinction as being the difference between automating the use of a “construction kit”-style library (say protobuf-net’s protogen, which generates C# that uses only the public bits of protobuf-net), and those that merely seek to avoid runtime code generation but may rely on implementation details to do so.

Cesil’s Source Generators are of the latter style, the code they generate uses conventions that are not public and may change from version to version. This is the main reason that you can only use the Cesil.SourceGenerator package with a Cesil package of the exact same version, as I mentioned above. Another reason is the versioning rough edge discussed below.

Viewing one of the generated classes can illustrate what I mean:

//***************************************************************************//
// This file was automatically generated and should not be modified by hand.
//
// Library: Cesil.SourceGenerator
//
// Purpose: Deserialize
//
// For: MyNamespace.MyClass
// On: 2021-01-26 20:34:50Z
//***************************************************************************//

#nullable disable warnings
#nullable enable annotations

namespace Cesil.SourceGenerator.Generated
{
  #pragma warning disable CS0618 // only Obsolete to prevent direct use
  [Cesil.GeneratedSourceVersionAttribute("0.8.0", typeof(MyNamespace.MyClass), Cesil.GeneratedSourceVersionAttribute.GeneratedTypeKind.Deserializer)]
  #pragma warning restore CS0618
  internal sealed class __Cesil_Generated_Deserializer_MyNamespace_MyClass
  {
    // ... actual code here
  }
}

(As an aside, a handy tip for developing Source Generators: setting EmitCompilerGeneratedFiles to true and CompilerGeneratedFilesOutputPath to a directory in your .csproj will write generated C# out to disk during compilation)

Rough edges with Source Generators

Of course, not everything is perfect with Source Generators. While very, very cool in my opinion there are still some pretty rough edges that you’ll want to be aware of.

Roslyn and Visual Studio versioning are still painful
- Since introduction, using the “wrong” version of Roslyn in an Analyzer has broken Visual Studio integration. This makes some sense (any given Visual Studio version is using a specific Roslyn version internally, after all) but is just incredibly painful during development, especially if you’re routinely updating dependencies and tooling.
- Source Generators have exactly the same problem, except now such breaks affect things more important than your typical diagnostic message.
- At time of writing (and for Visual Studio 16.8.4) you must package a SourceGenerator which targets netstandard2.0, and references Roslyn 3.8.0. Any other combination might work in tests, but cannot be made to work reliably with Visual Studio.
- Over time I expect this to get pretty painful if you try to support older Visual Studio (and implicit, Roslyn) versions with your latest package versions.
- These concerns are part of why I split Cesil’s Source Generators into a separate package, so I can at least keep that pain away from the main Cesil code.
Source Generators are testable, but you’ll be rolling your own conventions
- Any Source Generator is likely to be broken up into separate logical steps – in Cesil these are: finding needed types, discovering annotated types, mapping members on those types to columns, generating any parsers or formatters needed, and generating the actual (de)serializer type.
- However, ISourceGenerator just gives you two methods – and you’ll probably only use Execute(). “Partially” running a Source Generator isn’t a thing, and the passed contexts are not readily mock-able. So you’ll have to break things up and test them however makes sense to you, there’s nothing really built-in to guide you.
- For Cesil, I exposed the results of these interim steps as internal properties and used those in tests. It would be just as valid to break those steps into separate classes, or wrap contexts in your own types to enable mocking, or anything really.
Some things you’d expect to be one-liners aren’t
- With generated code, especially using other code as inputs, there’s lots of small tasks you’d expect to have to do. Some of these are simple tasks with Roslyn, like NormalizeWhitespace(), but lots aren’t. Examples include:
  - Removing unused usings from a file
  - Inlining a method call
  - Converting a generic type or method to a concrete one, via substituting a type
  - Telling if a type needs to be referred to in a particular manner (ie. fully qualified or not) in a given block of code
- All of these things are doable with Roslyn, and oftentimes there’s example code out there, but none of them are trivial. Inlining in particular is something I’d expect lots of folks to need, but isn’t built-in and requires a lot of care to get right (with Cesil I settled on only inlining some very specific calls, in order to avoid the complexity of the more general case).
- In theory these sorts of things could be provided via a third-party library, but shipping dependencies with Source Generators (and Analyzers) is painful because of the versioning issues in #1 – so they kind of need to be built-in in my opinion.
A big use case for Source Generators is avoiding runtime code generation. However, there’s no easy way to check that you have avoided runtime code generation.
- There are some runtimes (like Xamarin.iOS) that don’t allow runtime code generation, full stop. Building and running tests on these would be a kind of check, but that’s far from easy.
- It’d be nice if there was a way to “electrify” the runtime to cause runtime code generation to fail. A less attractive, but easier, option would be to document what types or methods depend on runtime code generation – then an Analyzer could be written that flags uses.

And that about covers Source Generators, and Cesil’s use of them. While I have no specific Open Questions for this topic, any feedback on these Source Generators is welcome – just open an Issue over on GitHub. Next time, I’m going to go over the other new features in C# 9 and .NET 5 and talk about how Cesil is using them.

Kevin Montrose