Overthinking CSV With Cesil: Performance

It’s time to talk about Cesil’s performance.

Now, as I laid out in an earlier post, Cesil’s raison d’etre is to be a “modern” take on a .NET CSV library – not to be the fastest library possible. That said, an awful lot of .NET and C#’s recent additions have had an explicit performance focus. Just off the top of my head, the following additions have all been made to improve performance:

Accordingly, I’d certainly expect a “modern” .NET CSV library to be quite fast.

However, I deliberately chose to wait until now to talk about Cesil’s performance. Since I was actively soliciting feedback on capabilities and interface with Cesil’s Open Questions, it was a real possibility that Cesil’s performance would change in response to feedback. Accordingly, it felt dishonest to lead with performance numbers that I knew could be rapidly outdated.

The Cesil solution has a benchmarking project, using the excellent BenchmarkDotNet library, which includes benchmarks for:

  • Reading and Writing static types
    • These include comparisons to other CSV libraries
  • Reading and Writing dynamic types
    • These are compared to their static equivalents
  • Various internals
    • These are compared to naïve alternative implementations

The command line interface allows selecting single or collections of benchmarks, and running them over ranges of commits – which enables easy comparisons of Cesil’s performance as changes are made.

Benchmarking in general is fraught with issues, so it is important to be clear on what exactly is being compared. The main comparison benchmarks for Cesil compare reading and writing two types of rows (“narrow” and “wide” ones), and a small (10) and large (10,000) number of rows. Narrow rows have a single column of a built-in type, while wide rows have a column for each built-in type. Built-in types benchmarked are:

  • byte
  • char
  • DateTime
  • DateTimeOffset
  • decimal
  • double
  • float
  • Guid
  • int
  • long
  • sbyte
  • short
  • string
  • uint
  • ulong
  • ushort
  • Uri
  • An enum
  • A [Flags] enum
  • And nullable equivalents of all above ValueTypes

Note that while Cesil supports Index and Range out of the box, most other libraries do not do so at time of writing so they are not included in benchmarks.

The only other library compared to currently is CsvHelper (version 15.0.6 specifically). This is because that is the most popular, flexible, and feature-ful .NET CSV library that I’ve previously used – not because it is particularly slow. It was also created almost a decade ago, so provides a good comparison for “modern” C# approaches.

Benchmarks were run under .NET Core 3.1.9, in X64 process, on a machine running Windows 10 (release 10.0.19041.630) with an Intel Core i7-6900K CPU (3.20GHz [Skylake], having 16 logical and 8 physical cores), and 128 GB of RAM.

Cesil’s benchmarks report both runtime and allocations, meaning that there is quite a lot of data to compare. A full summary is checked in, but I have selected some subsets to graph here.

(charts and raw numbers can be found in this Google Sheet)

There are also benchmarks for reading and writing one million of these “wide” rows. Cesil can read ~59,000 wide rows per second (versus ~47,000 for CsvHelper), and write ~97,000 rows per second (versus ~78,000 for CsvHelper).
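Taking the rounded figures above at face value, the relative throughput and the implied wall-clock time for one million wide rows work out roughly as follows. This is back-of-the-envelope arithmetic on the numbers already quoted, not additional measurements.

```latex
\begin{align*}
\text{read speedup}  &\approx \frac{59{,}000}{47{,}000} \approx 1.26 \\
\text{write speedup} &\approx \frac{97{,}000}{78{,}000} \approx 1.24 \\
t_{\text{read}}^{\text{Cesil}}      &\approx \frac{10^{6}}{59{,}000} \approx 17\,\text{s} \qquad
t_{\text{read}}^{\text{CsvHelper}}  \approx \frac{10^{6}}{47{,}000} \approx 21\,\text{s} \\
t_{\text{write}}^{\text{Cesil}}     &\approx \frac{10^{6}}{97{,}000} \approx 10\,\text{s} \qquad
t_{\text{write}}^{\text{CsvHelper}} \approx \frac{10^{6}}{78{,}000} \approx 13\,\text{s}
\end{align*}
```

In other words, on this machine and this synthetic data, Cesil comes out roughly a quarter faster in both directions.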

With all the typical benchmarking caveats (test your own use case, these are defaults not tailor tuned to any particular case, data is synthetic, etc.), Cesil is noticeably faster and performs fewer allocations, especially in the cases where relatively few rows are being written. Be aware that both Cesil and CsvHelper perform a fair amount of setup on the “first hit” for a particular type and configuration pair – and BenchmarkDotNet performs a warmup step that will elide that work from most benchmarks. Accordingly, if your workload is dominated by writing unique types (or configurations) a single time these benchmarks will not be indicative of either library’s performance.

And that wraps up Cesil’s performance, at least for now. There are no new Open Questions for this post, but the issue for naming suggestions is still open.

In the next post, I’ll be digging into Source Generators.


Overthinking CSV With Cesil: Open Source Update

It’s been about 4 months since I started this series on Cesil. In that time I’ve published 12 blog posts and made numerous updates to Cesil. Having just released a new version (0.6.0), it feels like a good time to do a small retrospective on some of the less technical parts of my efforts.

First, the GitHub Sponsors update: not a single one. I find this unsurprising, as I’m sure most readers do – honestly, it took some effort to not snark about the likely outcome in earlier posts. I do think this serves as a good experimental validation of my expectations though.

I’m not exactly new to OSS, I’ve got a couple libraries with 1M+ downloads, this non-trivial blog, and have some contributions back to the broader ecosystem. In other words, I’m probably a bit above average in terms of OSS footprint. But, I do this for fun (like many others) – I’ve never gone out and solicited sponsorships, or otherwise tried to cultivate a following. Some have seen success with Patreons, or consulting, or sponsored screencasts – all of which I find decidedly unfun.

My big takeaway from this little sponsorship experiment is: things like GitHub Sponsors are tools you can use but creating a sustainable open source project is ultimately a job, and if you’re coding for fun you probably aren’t going to do that job. Modulate your expectations accordingly.

Second, all the Open Questions. I’ve sprinkled nine throughout the blog series so far, and four (~44%) have seen some engagement – not a bad ratio in my opinion.

The “answered” Open Questions which all shipped in version 0.6.0:

Remaining Open Questions at time of writing are:

Third and finally, an aside on naming. When I first started on what became Cesil, I was expecting to do a lot of IL generation which meant I’d probably pull in Sigil, and thus “CSV with Sigil” became Cesil in the same way “JSON with Sigil” became Jil. However, that never happened. As I got more into development I became convinced that the future is going to look more AOT-y, more source-generator-y, and just less ILGenerator-y.

Then the second I published Cesil folks pointed out how close it was to Cecil, a library for manipulating IL. Given the above, I’m not particularly attached to the name but didn’t have a good alternative and figured both libraries were in different enough areas it was unlikely to be an issue in practice. So naturally I was immediately proven wrong, as I went to contribute some small improvements to Coverlet… which makes extensive use of Cecil. Discussing these changes (which kept happening as Stack Overflow was also considering using Coverlet) was a real laugh riot.

So, I should really change the name of Cesil. I still don’t have any great ideas (naming is hard, after all) so I’ve opened another “Open Question” Issue to collect alternatives. Primary goal is to find something that won’t be confused for other projects, while still at least hinting at “CSV”.

And that wraps up the Open Source update. In the next post, I’ll be digging into performance and maybe giving an update on naming.


Overthinking CSV With Cesil: Documentation in 2020

It goes without saying that a library’s documentation is important, and Cesil is no exception. Cesil has two broad kinds of documentation: handwritten wiki documentation, and automatically generated reference documentation.

The handwritten documentation is all in a GitHub Wiki, which means it’s basically just a bunch of markdown files sitting in a git repository. This means there’s revision history, diffing, and editing is as simple as pushing to https://github.com/kevin-montrose/Cesil.wiki.git. I wrote high level documentation in the wiki, covering things like using Contexts and an overview of the Default Type Describer rather than covering individual methods.

The automatically generated documentation is extracted from C#’s Documentation Comments. Every member of a type can have a structured /// comment which, in conjunction with some well known XML tags, can be used to create an XML file that editors understand. As an example, this pop-up in Visual Studio:

comes directly from this code comment. Generating this is built into Visual Studio, you just need to check this box:

or you can pass -doc to the compiler. The C# compiler will raise a warning whenever a public member lacks a documentation comment, so by treating warnings as errors (which can also be seen above) I effectively force myself to write some amount of documentation.
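As a rough illustration of what such a comment looks like in source, here is a small sketch. The RowCounter type is hypothetical and is not part of Cesil; only the XML tags (summary, param, paramref, returns, exception) are the standard ones.

```csharp
using System;

/// <summary>
/// Counts rows in delimiter separated text. Hypothetical example type, not part of Cesil.
/// </summary>
public static class RowCounter
{
    /// <summary>
    /// Returns the number of non-empty lines in <paramref name="text"/>.
    /// </summary>
    /// <param name="text">The text to inspect; must not be null.</param>
    /// <returns>The count of lines that contain at least one character.</returns>
    /// <exception cref="ArgumentNullException">If <paramref name="text"/> is null.</exception>
    public static int CountRows(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        // Split on \r\n or \n, dropping the empty entries that trailing row endings produce
        var lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        return lines.Length;
    }
}
```

With documentation generation turned on, removing any one of the /// blocks on a public member is what triggers the missing-comment warning mentioned above.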

Nothing about Documentation Comments is new, they’ve been in C# since 1.0. What is relatively recent is the DocFX tool, which makes it considerably easier to generate static HTML for a C# project’s reference documentation than it used to be. With a simple enough script, Cesil’s reference documentation is now easily generated, checked into its repo, and served up as a set of GitHub Pages. DocFX does have some limitations currently, the biggest one being that it is not aware of nullable reference types. I’ve taken a whack at adding that functionality in a draft PR, but it’s not a trivial change unfortunately.

It’s a pity that something so important as documentation is rarely all that interesting to write about. Perhaps something meatier will come out of all the Open Questions raised in previous posts, because that’s what the next post will cover. Open Questions all have corresponding Issues in Cesil’s GitHub repository.


Overthinking CSV With Cesil: Testing And Code Coverage in 2020

I’ve always been a proponent of extensive test suites, especially for libraries. Accordingly, Cesil has a lot of tests – 665 at time of writing. I also decided to set some code coverage goals: covering 90+% of lines and 80+% of branches. None of this is new to modern .NET, though there have been some small changes over the years.

The big one compared to some of my older projects is that, since .NET Standard 2.1 in 2018, .NET has a standard way to run tests – dotnet test. I used to always write up a small test runner project so I could easily run tests without other tooling (like Visual Studio’s test runner), but that is no longer necessary with this new command.

I’ve also moved away from the testing framework that shipped with Visual Studio (Microsoft.VisualStudio.TestTools.UnitTesting) and am now using xUnit. There’s no loss in functionality (Visual Studio and dotnet test understand xUnit out of the box), and I find that xUnit is just generally more popular. The final straw was Stack Overflow (my day job) adopting xUnit for their internal tests. If this sounds like a mediocre endorsement, it sort of is – I don’t get super excited about unit testing frameworks provided they check the appropriate boxes. xUnit gives you asserts, parameterization, and decent logging – it’ll do.

While you don’t get code coverage out of the box, it has also gotten a lot easier. Coverlet is a NuGet package that, once installed, lets you just add “/p:CollectCoverage=true” to “dotnet test” and get line and branch coverage statistics from your test suite. Results can be exported in a variety of formats, but for Cesil I’ve stuck with ReportGenerator.

This is a notable improvement from OpenCover, what I had been using for code coverage (even early versions of Cesil still used OpenCover). Coverlet is considerably less fiddly and (at least in theory) supports non-Windows platforms.

This modern tooling also plays nicely with GitHub Actions, which let me automate reporting code coverage statistics. Now whenever a change hits the appropriate branches, a full run of tests is kicked off and results are automatically extracted into little badges that appear in Cesil’s README.

A slightly clever trick is using a separate branch (the shields branch, in Cesil’s case) to hold code coverage results – alleviating the need to store any data outside of the repository. The whole dance can be seen in the CodeCoverageBadges.yml file that runs on check in.

Patterns

When aiming for high rates of code coverage you have to figure out patterns for making sure rarely taken code paths are visited. In Cesil these rare paths are those caused by restricted allocations, async optimizations, and cancellation. I dealt with these by introducing various interfaces only implemented during DEBUG builds, and then making extensive use of helpers in tests that use those interfaces to run the test multiple times for each relevant code path. A concrete example is a good illustration of what I’m talking about.

  1. Most async methods in Cesil have optimizations for when a (Value)Task returned by a relied upon method has already completed.
  2. If at any point in this fast path a task has not yet completed, Cesil will switch to an implementation that actually awaits the task.
  3. This means that there is a (hopefully rarely taken) branch in the fast path after every (Value)Task returning method is called.
  4. Each class with methods like this implements ITestableAsyncProvider, and calls AsyncTestHelper.IsCompletedSuccessfully() instead of (Value)Task.IsCompletedSuccessfully.
  5. In Cesil.Tests, the various RunAsyncXXXVariants() methods repeatedly re-run the actual tests (wrapped as delegates) using ITestableAsyncProvider to signal that IsCompletedSuccessfully() should return false after an explicit number of calls.
  6. By increasing the switch-to-async change over point we can ensure that the test explores all branches introduced by this sort of async optimization.
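The steps above can be sketched in miniature. This is a simplified stand-in under my own assumptions, not Cesil’s code: GoAsyncAfter and Summer are hypothetical names, and the real ITestableAsyncProvider and AsyncTestHelper are shaped differently. The idea it shows is the same though: a fast path that never awaits, a hook that can force the “not completed” branch after a chosen number of calls, and a test loop that moves the change-over point.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical test hook: reports the real task state until its budget of calls
// is used up, then claims "not completed" to force the slow path.
public sealed class GoAsyncAfter
{
    private int remaining;

    public GoAsyncAfter(int callsBeforeGoingAsync) { remaining = callsBeforeGoingAsync; }

    public bool IsCompletedSuccessfully(ValueTask<int> task)
    {
        if (remaining == 0) return false;
        remaining--;
        return task.IsCompletedSuccessfully;
    }
}

public static class Summer
{
    // Sums the results of `count` ValueTask-returning calls, staying await-free
    // for as long as every task reports that it has already completed.
    public static ValueTask<int> SumAsync(Func<int, ValueTask<int>> next, int count, GoAsyncAfter hook)
    {
        var total = 0;
        for (var i = 0; i < count; i++)
        {
            var task = next(i);
            if (!hook.IsCompletedSuccessfully(task))
            {
                // The (hopefully rare) branch: switch to an implementation that awaits
                return ContinueAsync(task, total, i + 1, next, count);
            }
            total += task.Result;
        }
        return new ValueTask<int>(total);

        // Static local function, so nothing is captured implicitly
        static async ValueTask<int> ContinueAsync(
            ValueTask<int> pending, int soFar, int from, Func<int, ValueTask<int>> source, int limit)
        {
            soFar += await pending;
            for (var j = from; j < limit; j++) soFar += await source(j);
            return soFar;
        }
    }
}
```

A test then runs the same delegate once per change-over point (0, 1, 2, and so on up to the number of calls), which visits the switch-to-async branch after every call in turn.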

Deficiencies

I don’t have a way to signal that certain statements are unreachable. That may sound like a weird problem to have, but Cesil employs the Throw Helper pattern so it’s commonly found in any error branches. I alleviate it some by having the methods return a value, so I can write return Throw.BlahBlah(…) instead but in void returning methods you still see uncovered, yet unreachable, lines like this one:

It’d be useful if I could signal to a code coverage tool that certain methods never return, and thereby remove these lines from Cesil’s metrics. .NET already has attributes for that purpose, but Coverlet does not support them – I’ve opened an issue to see if this can be improved.
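For readers unfamiliar with the pattern, here is a minimal sketch of a throw helper and of where the unreachable line shows up. The Throw and Parser types below are hypothetical (Cesil’s real Throw class has different members); DoesNotReturnAttribute is the real attribute from System.Diagnostics.CodeAnalysis.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

// Hypothetical throw helper, for illustration only
public static class Throw
{
    // Returning T lets callers write `return Throw.InvalidOperation<T>(...)`
    [DoesNotReturn]
    public static T InvalidOperation<T>(string message)
        => throw new InvalidOperationException(message);
}

public static class Parser
{
    // Non-void method: the error branch is a single, coverable return statement
    public static int ParseDigit(char c)
    {
        if (c < '0' || c > '9')
        {
            return Throw.InvalidOperation<int>($"'{c}' is not a digit");
        }
        return c - '0';
    }

    // Void method: something has to follow the helper call
    public static void RequireDigit(char c)
    {
        if (c < '0' || c > '9')
        {
            Throw.InvalidOperation<object>($"'{c}' is not a digit");
            return; // never executed, so a coverage tool reports it as uncovered
        }
    }
}
```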

As you can see, Cesil has pretty extensive tests but there are still some real deficiencies. Naturally, that means the Open Question is: Are there any changes I should make to improve Cesil’s testing?

As before, I’ve opened an issue to gather long form responses.  Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

The next post in this series will cover Cesil’s documentation.


Overthinking CSV With Cesil: Adopting Nullable Reference Types

In the previous post I went over how Cesil has adopted all the cool, new, C# 8 things – except for the coolest new thing, nullable reference types. They are C#’s answer to the billion dollar mistake, and I treated this new feature differently from all the rest – I intentionally delayed adopting it until Cesil was fairly far along in development.

I delayed because I felt it was important to learn how an established codebase would adopt nullable reference types. Unlike other features, nullable reference types aren’t a feature you want to use in only a few places (though they can be adopted piecemeal, more on that below) – you want them everywhere. Unfortunately (or fortunately, I suppose) as nullable reference types force you to prove to the compiler that a reference can never be null, or explicitly check that a value is non-null before using it, an awful lot of existing code will get new compiler warnings. I put my first pass at adopting nullable reference types into a single squashed commit, which nicely illustrates how it required updates to nearly everything in Cesil.

A Refresher On Nullable Reference Types

Before getting any further into the Cesil specific bits, a quick refresher on nullable reference types. In short, when enabled, all types in C# are assumed to be non-nullable unless a ? is appended to the type name. Take for example, this code:

public string GetFoo() { /* something */ }

In older versions of C# it would be legal for this method to return an instance of a string or null since null is a legal value for all reference types. In C# 8+, if nullable reference types are enabled, the compiler will raise a warning if it cannot prove that /* something */ won’t return a null value. If we wanted to be able to return a null, then we’d instead use string? as its return type. One additional bit of new syntax, the “null forgiving” ! in suffix position, lets you declare to the compiler that a value is non-null.

Importantly, this is purely a compile time check – there’s nothing preventing a null from sneaking into a variable at runtime. This is a necessary limitation given the need to interoperate with existing .NET code, but it does mean that you have to be diligent about references that come from code that has not opted into nullable reference types or is outside your control. Accordingly, it’s perfectly legal to lie using the null forgiving operator.
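To make the refresher concrete, here is a small self-contained sketch. NullableRefresher and its members are made up for this illustration; the syntax (#nullable enable, the ? suffix on a type, and the ! suffix on an expression) is standard C# 8.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class NullableRefresher
{
    private static readonly Dictionary<string, string> Names =
        new Dictionary<string, string> { ["km"] = "Kevin" };

    // string? : callers are told the result may be null
    public static string? FindName(string key)
        => Names.TryGetValue(key, out var name) ? name : null;

    // string : the compiler would warn if a maybe-null value could be returned here
    public static string GetNameOrDefault(string key)
        => FindName(key) ?? "(unknown)";

    // The ! suffix silences the warning, but nothing is checked at runtime:
    // an unknown key still produces a NullReferenceException
    public static int UncheckedLength(string key)
        => FindName(key)!.Length;
}
```

UncheckedLength is the “legal lie”: it compiles without warnings, and still fails at runtime for any key that isn’t present.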

Adopting Nullable Reference Types

If you’ve got a large existing codebase, adopting nullable reference types all at once is basically impossible in my opinion. I tried a couple times with Cesil and abandoned each attempt: you just drown in warnings and it’s really difficult to make progress given all the noise. The C# team anticipated this, thankfully, and has made it possible to enable or disable nullable reference types on subsets of your project with the #nullable directive. I worked through Cesil file by file, enabling nullable reference types and fixing warnings until everything was converted over, at which point I enabled them project wide.

The changes I had to make fell into four buckets:

  1. Adding explicit null checks in places where I “knew” things were always non-null, but needed to prove it to the compiler
    1. An example are the reflection extension methods I added, that now assert that, say, looking up a method actually succeeds
    2. These also aided in debugging, since operations now fail faster and with more useful information
  2. Wrapping “should be initialized before use” nullable reference types in a helper struct or method that asserts a non-null value is available.
    1. I used a NonNull struct and Utils.NonNull method I wrote for the purpose
    2. A good example of these are optional parts of (de)serialization, like Resets. A DeserializeMember won’t always have a non-null Reset, but it should never read a null Reset during correct operation
  3. Explicit checks at the public interface for consumers violating (knowingly or otherwise) the nullability constraints on parameters.
    1. I used a Utils.CheckArgumentNull method for this
    2. You have to have these because, as noted earlier, nullable reference types are purely a compile time construct
    3. Interestingly, there is a proposal for a future version of C# that would make it simpler to do this kind of check
  4. Refactoring code so a reference can be proven to always be non-null
    1. A lot of times this was minor stuff, like always fully initializing in a constructor
    2. In some cases what was needed was some value to serve as a placeholder, so I introduced types like EmptyMemoryOwner
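Buckets two and three can be sketched roughly as below. These are my own simplified versions written for this post’s purposes, assuming only the names mentioned above; the actual NonNull struct and Utils methods in Cesil have different shapes and more members.

```csharp
#nullable enable
using System;

// Hypothetical sketch of a "should be initialized before use" wrapper
public struct NonNull<T> where T : class
{
    private T? _value;

    public bool HasValue => _value != null;

    // Reading asserts that a value is available, failing fast if it is not
    public T Value
        => _value ?? throw new InvalidOperationException("Value was read before being set");

    public void SetAllowNull(T? value) => _value = value;
}

public static class Utils
{
    // Guard for public entry points: annotations alone cannot stop a caller passing null
    public static void CheckArgumentNull<T>(T? arg, string name) where T : class
    {
        if (arg == null) throw new ArgumentNullException(name);
    }
}
```

The wrapper keeps the “maybe null” storage in one place, so every read site gets a non-nullable T and a clear exception if the invariant is ever broken.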

Initially most changes fell into the first two buckets, but over time more was converted into the fourth bucket. Once I started doing some light profiling, I found a lot of time was being spent in null checks which prompted even more refactoring.

At time of writing, there are a handful of places where I do use the null forgiving operator. Few enough that an explicit accounting can be given, which can illustrate some of the limitations I found with adopting nullable reference types:

That’s a total of 32 uses over ~32K lines of code, which isn’t all that bad – but any violations of safety are by definition not great. Upon further inspection, you’ll see that 29 of them are some variation of default! being assigned to a generic type – and the one in DynamicRow.GetAtTyped is around casting to a generic type (the remaining two are in DEBUG-only test code). Basically, and unfortunately, nullable references can get awkward around unconstrained generics. The problem is an unconstrained generic T could be any combination of nullable and value or reference types at compile time, but default(T) is going to produce a null for all reference types – it’s a mismatch you can’t easily work around. I’d definitely be interested in solutions to this, admittedly rare, annoyance.
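A minimal sketch of the default! situation, using a hypothetical Lookup.TryFirst method rather than anything from Cesil:

```csharp
#nullable enable
using System.Collections.Generic;

public static class Lookup
{
    // T is unconstrained: it could be a struct, a non-nullable reference type,
    // or a nullable reference type, and the method cannot know which
    public static bool TryFirst<T>(IReadOnlyList<T> items, out T first)
    {
        if (items.Count == 0)
        {
            // default(T) is null whenever T is a reference type, so the C# 8
            // compiler warns on a plain `default` here; the ! suppresses that
            first = default!;
            return false;
        }

        first = items[0];
        return true;
    }
}
```

The out parameter has to be assigned something on the failure path, and for an unconstrained T there is no value other than default to assign.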

Accommodating Clients In Different Nullable Reference Modes

That covers Cesil’s adoption of nullable reference types, but there’s one big Open Question around client’s adoption of them. A client using Cesil could be in several different “modes”:

  1. Completely oblivious to nullable reference types, neither using nullable annotations themselves nor using types from libraries that do
    1. All pre-C# 8 code will be in this mode
    2. Post-C# 8 code that has not enabled nullable reference types, and whose relevant types aren’t provided by libraries that have enabled them is also in this mode
    3. At time of writing this is probably the most common mode
  2. Oblivious themselves, but using types with Cesil that do have nullable annotations
    1. Pre-C# 8 code that uses types provided by dependencies that have been updated with nullable reference types is in this mode
    2. Post-C# 8 code with a mix of files with nullable reference types enabled will also be in this mode
  3. Opted into nullable reference types, but using types with Cesil from dependencies that have not enabled nullable reference types
    1. Post-C# 8 code with dependencies that have not yet enabled nullable reference types will be in this mode
  4. Opted into nullable reference types, and all types used with Cesil likewise have them enabled
    1. At time of writing this is the least common mode

Cesil currently does not look for nullable annotations, instead if you want to require a reference type be non-null you must annotate with a DataMemberAttribute setting IsRequired (if you’re using the DefaultTypeDescriber), provide a Parser that never produces a null value, or a Setter that rejects null values. This behavior is most in line with the “completely oblivious” case (mode #1 above). However, Cesil could check for nullable annotations and default to enforcing them (as always, you’d be able to change that behavior with a custom ITypeDescriber). This aligns most with mode #4, which represents the desired future of C# code. Either behavior can result in some weirdness, with Cesil either doing things the compiler would prevent a client from doing or failing to do things a client could easily do.

To illustrate, if Cesil ignores nullable annotations (as it does today) but the client has enabled them (ie. they are in modes #3 or #4) this could happen:

class Example
{
  public string Foo { get; set; } // note that Foo is non-nullable
}

// ...
var cesilFoo = CesilUtils.ReadFromString<Example>("Foo\r\n").Single().Foo;
cesilFoo.ToString(); // this will throw NullReferenceException

// ...
var clientExample = new Example();
clientExample.Foo = null; // this will raise a warning

Basically, Cesil could set Example.Foo to null but the client’s own code couldn’t.

However, if Cesil instead enforces nullable annotations but the client is oblivious to them (the client is in mode #2 above) then this is possible:

// Example is defined in some place that _has_ opted into nullable references

public class Example
{
  public string Foo { get; set; } // note that Foo is non-nullable
}

// ...

// Cesil will throw an exception, since Foo has no value
var cesilFoo = CesilUtils.ReadFromString<Example>("Foo\r\n");

// ...

var clientExample = new Example();
clientExample.Foo = null; // this code from the client raises no warnings

In this case, Cesil cannot do something that the client can trivially do.

As I see it there are a few different options, and Cesil needs to commit to one of them. The Open Question is:

  • Which of these options for nullable reference treatment should Cesil adopt?
    1. Ignore nullable annotations, clients should perform their own null checks
    2. Enforce nullable annotations as part of the DefaultTypeDescriber, if a client needs to disable it they can provide their own ITypeDescriber
    3. Provide an Option to enable/disable nullable reference type enforcement
      • This will move the logic out of ITypeDescribers, so client control via that method will no longer be available
      • If this is the route to take, what should the value for Options.(Dynamic)Default be?

As before, I’ve opened an issue to gather long form responses.  Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

The next post in this series will cover how I went about developing tests for Cesil.


Overthinking CSV With Cesil: C# 8 Specifics

Way back in the first post of this series I mentioned that part of the motivation for Cesil was to get familiar with new C# 8 features, and to use modern patterns. This post will cover how, and why, Cesil has adopted these features.

The feature with the biggest impact is probably IAsyncEnumerable<T>, and its associated await foreach syntax. This shows up in Cesil’s public interface, as the returned value of IAsyncReader<TRow>.EnumerateAllAsync(), a parameter of IAsyncWriter<TRow>.WriteAllAsync(…), and as a returned value or parameter on various CesilUtils methods. IAsyncEnumerable<T> enables a nice way to yield elements that are obtained asynchronously, a perfect match for serialization libraries that consume streams. Pre-C# 8 you could kind of accomplish this with an IEnumerable<Task<T>>, but that’s both more cumbersome for clients to consume and slightly weird since MoveNext() shouldn’t block so you’d have to smuggle whether the stream is complete into the yielded T. IAsyncEnumerable<T> is also disposed asynchronously, using another new-to-C#-8 feature…

IAsyncDisposable, which is the async equivalent to IDisposable, also sees substantial use in Cesil – although mostly internally. It is implemented on IAsyncReader<TRow> and IAsyncWriter<TRow> and, importantly, IDisposable is not implemented. Using IAsyncDisposable lets you require that disposal happen asynchronously, which Cesil uses to require that all operations on an XXXAsync interface are themselves async. C# 8 also introduces the await using syntax, which makes consuming IAsyncDisposables as simple for clients as consuming IDisposables. Pre-C# 8 if a library wanted to allow clients to write idiomatic code with usings it would have to support synchronous disposal on interfaces with asynchronous operations, essentially mandating sync-over-async and all the problems that it introduces.
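Here is a small sketch of how these two features fit together from both sides. LineReader is a hypothetical type written for this example and is not Cesil’s IAsyncReader<TRow>; it only shows the shape: an async iterator that yields as data arrives, asynchronous-only disposal, and a consumer using await using with await foreach.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical reader, only meant to show the shape of the C# 8 features
public sealed class LineReader : IAsyncDisposable
{
    private readonly TextReader _inner;

    public bool Disposed { get; private set; }

    public LineReader(TextReader inner) { _inner = inner; }

    // Async iterator: each line is yielded as soon as it has been read
    public async IAsyncEnumerable<string> EnumerateAllAsync()
    {
        string line;
        while ((line = await _inner.ReadLineAsync()) != null)
        {
            yield return line;
        }
    }

    // Only asynchronous disposal is offered; IDisposable is deliberately absent
    public ValueTask DisposeAsync()
    {
        _inner.Dispose();
        Disposed = true;
        return default;
    }
}

public static class LineReaderDemo
{
    public static async Task<List<string>> ReadAllAsync(LineReader reader)
    {
        var rows = new List<string>();

        // await using calls DisposeAsync() when the block exits
        await using (reader)
        {
            await foreach (var row in reader.EnumerateAllAsync())
            {
                rows.Add(row);
            }
        }

        return rows;
    }
}
```

Because LineReader does not implement IDisposable, a plain using on it will not compile, which is the kind of enforcement described above.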

The rest of the features introduced in C# 8 mostly see use internally, resulting in a code base that’s a little easier to work on but not having much impact on consumers. From roughly most to least impact-ful, the features adopted in Cesil’s code are:

  • Static local functions
    • These were extensively used to implement the “actually go async”-parts of reading and writing, while keeping the fast path await-free.
    • The big benefit is having the compiler enforce that local functions aren’t closing over any variables not explicitly passed into them, which means you can be confident invoking the function involves no implicit allocations.
  • Switch expressions
    • These were mostly adopted in a tail position, where previously I’d have a switch where each case returned some calculated value.
    • Using switch expressions instead of switch statements results in more compact code, which is a welcome quality-of-life improvement.
  • Default interface methods
    • These let you attach a method with an implementation to an interface. The primary use case is to allow libraries to make additions to an already published interface without that breaking consumers.
    • There’s another use case though, the one Cesil adopts, which is to attach an implemented method that all implementers of an interface will need. An example of this is ITestableDisposable, where the AssertNotDisposed method is the same everywhere but IsDisposed logic needs to be implemented on each implementing type.
    • In older versions of C#, I’d use an extension method or some other static method to share this implementation but default interface methods let me keep the declarations and implementations closer together. Just another small quality-of-life improvement, but there’s potential for this to be a much bigger help in post-1.0 releases of Cesil.
  • Indices and Ranges
    • These simplify taking elements or slices of strings, Spans, and so on. Cesil also supports reading and writing the new Index and Range types.
    • Another small quality-of-life improvement, though I have seen this one catch some bugs when changing foo[something.Length - 1] to foo[^1].
  • Readonly Members
    • You use these when you can’t make an entire struct readonly, but want the compiler to guarantee certain members don’t mutate the struct.
    • I only did this in a few places, there aren’t that many mutable structs in Cesil, but having the compiler guarantee invariants is always a useful safety net.
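For anyone who hasn’t used them yet, here is a compact tour of four of the features listed above. All the type and member names are invented for this sketch and none of it is taken from Cesil. Default interface methods are left out because they also need runtime support, which makes a tiny standalone example less portable.

```csharp
using System;

public enum RowEndingKind { LineFeed, CarriageReturn, CarriageReturnLineFeed }

// A mutable struct with one member the compiler guarantees will not mutate it
public struct Counter
{
    public int Count;

    public void Increment() => Count++;

    public readonly int Peek() => Count; // readonly member
}

public static class CSharp8Tour
{
    // Switch expression in tail position, replacing a switch of `case ...: return ...;`
    public static string ToText(RowEndingKind kind) => kind switch
    {
        RowEndingKind.LineFeed => "\n",
        RowEndingKind.CarriageReturn => "\r",
        RowEndingKind.CarriageReturnLineFeed => "\r\n",
        _ => throw new ArgumentOutOfRangeException(nameof(kind))
    };

    // Indices and ranges: ^1 is the last element, 1..^1 drops the first and last
    public static char LastChar(string value) => value[^1];

    public static string TrimEnds(string value) => value[1..^1];

    public static int SumOfSquares(int[] values)
    {
        var total = 0;
        foreach (var v in values) total += Square(v);
        return total;

        // Static local function: referring to `total` or `values` in here
        // would be a compile error, so no closure can be allocated
        static int Square(int x) => x * x;
    }
}
```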

Readers who closely follow C# are probably thinking “wait, what about nullable reference types?”. Those were the big new feature in C# 8, and Cesil has adopted them. However, unlike the other new C# 8 features, I intentionally deferred adopting them until Cesil was fairly mature as I wanted to explore converting an existing code base. My next post will go into that process in detail.

There aren’t really any Open Questions around the C# 8 features in this post. There were so many in the previous post on flexibility, that I think it’s probably best to just go and leave your thoughts on them instead.

As a reminder, they were…

  1. Are there any missing Format-specific options Cesil should have?
  2. Is the amount of control given over Cesil’s allocations sufficient?
  3. Are there any interesting .NET types that Cesil’s type mapping scheme doesn’t support?

Overthinking CSV With Cesil: “Maximum” Flexibility

Over the course of this series I’ve alluded to a future post where I’ll dig into all the configuration options Cesil offers.

This is that, gigantic, post.

I conceptualize Cesil’s configurability as being along three axes: the format, memory use, and type mapping. Format options control the style of delimiter separated value (DSV) you’re reading or writing, memory options give fine grained control over allocations, and type mappings handle converting from .NET types to text and vice versa.

To begin, let’s start with…

Format Options

The necessity of being able to configure different format options is clear for any CSV library, since as I said in an earlier post CSV isn’t really a format – it’s a bunch of related formats. For a library like Cesil, which aims to support all reasonable DSV formats, the necessity is even more obvious.

All configuration options relevant to formatting live on the Options type, with corresponding WithXXX methods on OptionsBuilder. These options are:

  • ValueSeparator – the single character used to separate columns in a row
  • RowEnding – whether rows end in the \n, \r, or \r\n character sequence
    • Most CSV files use \r\n, but Cesil can automatically detect this when reading if you use RowEnding.Detect
    • When using Detect, Cesil will use the character sequence it first encounters as the expected row ending
    • When writing, a RowEnding other than Detect must be provided or an exception will be raised when an IWriter<TRow> or IAsyncWriter<TRow> is created
  • EscapedValueStartAndEnd – the character used to start and end an escaped value
    • Typically this is a double quote, but it can be left unset
    • If your format treats , as a value separator and would store Montrose, Kevin as "Montrose, Kevin" then you’re using a double quote for this
  • EscapedValueEscapeCharacter – the character used to start an escape sequence when you are already in an escaped value
    • Typically this is a double quote, but it can be left unset
    • If your format would store Kevin "Monty" Montrose as "Kevin ""Monty"" Montrose" then you’re using a double quote for this
  • ReadHeader – whether to always expect a header row, never expect a header row, or automatically detect a header row
    • This tends to vary file to file, so it is often set to ReadHeader.Detect
    • If you use Always or Detect, Cesil will use the header to infer column order when mapping columns to .NET types
    • If your format supports comments, it is legal for comments to precede a header row
  • CommentCharacter – the single character that starts a comment line
    • Typically this is not set, but if set it is often #
    • For example a single line of #hello world would be a comment of hello world, in formats with # for this
  • WhitespaceTreatment – whether to trim whitespace that is encountered in certain places while parsing
    • Most formats preserve whitespace if it is encountered in a value, and do not permit whitespace as padding around escaped values
    • If your format is one that is unusual, Cesil supports automatically trimming whitespace in certain cases. Refer to WhitespaceTreatments for the full list of trimming behaviors
    • Note that WhitespaceTreatments is a [Flags] enum, so its trimming behaviors can be freely combined.
  • ExtraColumnTreatment – how to handle encountering “extra” columns when reading
    • Cesil considers a column “extra” if it doesn’t map to a member, or if it’s in a column that didn’t appear in the header row (if there is a header row)
    • If you’re reading into dynamics and not requiring a header row, extra columns will be any that have an index greater than the highest index in the first read row
    • This must be one of:
  • WriteHeader – whether or not to write a header row before writing any values
  • WriteTrailingRowEnding – whether or not to end the final written row with the configured RowEnding
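To make the interplay between EscapedValueStartAndEnd and EscapedValueEscapeCharacter concrete, here is a minimal sketch of encoding a single value. It is not Cesil’s implementation: EncodeValue and its parameters are hypothetical names invented for illustration, and it assumes a format with , as the separator and a double quote for both escaping options.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Cesil: encodes one value for a format that uses
// ',' as the value separator and '"' as both the escape start/end and escape character.
static string EncodeValue(string value, char separator = ',', char startAndEnd = '"', char escape = '"')
{
    // A value needs escaping if it contains the separator, the start/end character,
    // or a row ending character.
    bool needsEscape = value.IndexOfAny(new[] { separator, startAndEnd, '\r', '\n' }) != -1;
    if (!needsEscape) return value;

    var sb = new StringBuilder();
    sb.Append(startAndEnd);
    foreach (char c in value)
    {
        // Inside an escaped value, the start/end character is preceded by the escape character.
        if (c == startAndEnd) sb.Append(escape);
        sb.Append(c);
    }
    sb.Append(startAndEnd);
    return sb.ToString();
}

Console.WriteLine(EncodeValue("Montrose, Kevin"));          // "Montrose, Kevin"
Console.WriteLine(EncodeValue("Kevin \"Monty\" Montrose")); // "Kevin ""Monty"" Montrose"
```

Values that contain none of the special characters pass through unchanged, which is why most cells in a typical file are never quoted.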

And that’s it. Ten options which, hopefully, allow Cesil to cope with all reasonable DSV formats out there. I’d be quite interested to learn of any that Cesil can’t cope with – it’s always a fun challenge to make a system more flexible without sacrificing ease of use or performance.

Now we’ll move on to…

Allocation Options

Beyond “don’t allocate more than necessary” it may strike some as odd to care about memory allocation in a .NET library – after all, .NET is a managed (ie. garbage collected) platform. I believe that in fact a modern .NET library should strive to both minimize allocations and provide ways for clients to control those allocations that must happen. The .NET ecosystem has been evolving in a much more performance focused direction for a while now, with fancy new types like Span and Pipelines encouraging low allocation and low copy patterns, first class support for processor intrinsics, and struct alternatives (that don’t default to allocating on the heap) like ValueTuple and ValueTask. The .NET GC is good, but it’s never going to be free so a laser focus on allocations is common when concerned with performance. It follows that if a library’s clients are focused on controlling allocations, a library needs to give them the tools they need to control allocations.

That said, some heap allocations are unavoidable. Cesil does its best to perform all unavoidable allocations prior to returning an I(Async)Reader<TRow> or I(Async)Writer<TRow> – so creating Options, binding IBoundConfigurations<TRow>, and the actual creation of a reader or writer may allocate but after that, allocations are under client control. There are exceptions, but I’ll dig into those in a later section.

On Options, there are a few relevant members:

  • MemoryPool – when Cesil needs to allocate, the MemoryPool<char> it uses to obtain a block of memory
    • Cesil will always request a size it can work with, but if the pool returns a chunk of memory smaller than the requested size an exception may be raised
    • Cesil will always call IMemoryOwner<char>.Dispose() when finished using a chunk of memory
    • MemoryPools must be thread safe, as Cesil makes no guarantees that IMemoryOwner<char> references remain on any given thread
  • ReadBufferSizeHint – when reading, Cesil needs a buffer to store characters it has not yet processed. This value specifies how large that buffer should be
    • There is often a tradeoff between buffer size and performance: the larger the buffer, the fewer calls to the underlying stream are needed to load all data, and thus the more quickly reading completes. This stops being true once a buffer is large enough that the underlying stream cannot fill it on each call, or if the underlying stream is frequently blocked waiting for more data
    • Setting ReadBufferSizeHint to 0 tells Cesil to request a “reasonable default” buffer size
    • The read buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
  • WriteBufferSizeHint – when writing, Cesil can stage writes into a buffer to improve performance. This value specifies if a buffer should be used, and how large it should be
    • As with ReadBufferSizeHint there is often a tradeoff between buffer size and performance. If there is no buffer, every write must call into the underlying stream, which can make writing take considerably longer.
    • Setting WriteBufferSizeHint to 0 disables write buffering; all data will be sent directly to the underlying stream
    • Setting WriteBufferSizeHint to null tells Cesil to request a “reasonable default” buffer size
    • The write buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
  • DynamicRowDisposal – controls when dynamic rows obtained during reading are disposed
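The MemoryPool<char> contract described above can be seen with nothing but the base class library. This is a small sketch of the rent, use, dispose cycle, using the shared pool and an arbitrary size of 4096 that I picked for illustration; it does not show Cesil’s own buffer handling.

```csharp
using System;
using System.Buffers;

const int sizeHint = 4096; // arbitrary size chosen for this sketch
MemoryPool<char> pool = MemoryPool<char>.Shared;

string roundTripped;
bool bigEnough;
using (IMemoryOwner<char> owner = pool.Rent(sizeHint))
{
    Memory<char> buffer = owner.Memory;

    // A pool may hand back more memory than requested; the check mirrors the
    // "at least the requested size" expectation.
    bigEnough = buffer.Length >= sizeHint;

    // Stage some characters in the rented buffer, then read them back.
    "a,b,c".AsSpan().CopyTo(buffer.Span);
    roundTripped = buffer.Span.Slice(0, 5).ToString();
} // Dispose() hands the memory back to the pool

Console.WriteLine(bigEnough);    // True
Console.WriteLine(roundTripped); // a,b,c
```

A custom MemoryPool<char> subclass plugs into the same cycle, which is what lets a client decide where buffers actually come from.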

Moving beyond Options, IReader<TRow> and IAsyncReader<TRow> have the XXXWithReuse() methods – these methods take a ref parameter that points to a row to reuse when reading. When processing many rows in sequence, these methods let you allocate a single row and then just repeatedly reuse it – greatly reducing the number of allocations. There are a few caveats to keep in mind. First, if a row has a Setter backed by constructor parameters (more on those below) the row cannot be reused and will always be reallocated. Second, value types are always zero initialized, so there is always a row to reuse if the row is a value type – this means your InstanceProvider (more on those below) may not be invoked when you’d expect it to be, based on how reference type rows behave. Finally, because the ref parameter is initialized with the row that will ultimately be populated before the XXXWithReuse() methods return, it is possible (especially in async cases, when the underlying stream blocks) for Cesil to allocate a row it ends up not needing.
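The reuse pattern itself is easy to sketch without Cesil. In the example below TryReadWithReuse is a hypothetical stand-in I made up for a XXXWithReuse() method, a List<string> plays the part of the row, and the counter only tracks row allocations (the naive Split call still allocates, which a real reader would avoid).

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

int allocations = 0;

// Hypothetical stand-in for a XXXWithReuse() method: fills the caller's row if one
// was passed, and only allocates a new row when the caller passed null.
bool TryReadWithReuse(TextReader reader, ref List<string>? row)
{
    string? line = reader.ReadLine();
    if (line == null) return false;

    if (row == null)
    {
        row = new List<string>();
        allocations++;
    }

    row.Clear();
    row.AddRange(line.Split(','));
    return true;
}

var source = new StringReader("a,b\nc,d\ne,f");
List<string>? reused = null;
int rows = 0;
while (TryReadWithReuse(source, ref reused))
{
    rows++;
}

Console.WriteLine($"{rows} rows, {allocations} row allocation(s)"); // 3 rows, 1 row allocation(s)
```

The important detail is that the caller must be finished with the previous row’s contents before asking for the next one, since both live in the same instance.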

The last piece of allocation control has lots of overlap with type mapping, which is covered in the next section, but in brief: InstanceProviders give clients control over how rows are obtained, and Parsers give control over how ReadOnlySpan<char>s are turned into instances of other types. Other types that participate in type mapping allow for control over accessing members, assigning members, and so on – so a client can customize any step of the process that might concern them.

With allocations covered, let’s now proceed to the final axis of configuration…

Type Mapping

DSVs provide rows and columns of text, and that’s it really. .NET has a much richer type system, and so Cesil must provide some way to move between these two worlds. Complicating that is how many different styles of .NET coding are out there; a good library must provide clients with the tools they need to match Cesil’s behavior to their own applications.

Cesil breaks this process of mapping types to and from text into several logical pieces, many of which have been mentioned in earlier posts:

Each of these types has particular rules about what kind of method, delegate, constructor, etc. can back them which are detailed in the documentation for each type, and on Cesil’s wiki.

Additionally, a number of these types (like Parser) have a notion of failure (indicated by returning false from a method or delegate) and support delegating to another instance as a fallback. This is used via their Else(…) method, which creates and returns a new instance that will delegate on failure.
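The shape of that fallback chaining can be sketched with plain delegates. Nothing here is Cesil’s API: the parsers are ordinary Func instances over strings that return a success flag plus a value, and this Else is a hypothetical local function that composes two of them.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-ins for parsers: each returns (success, value).
Func<string, (bool Ok, int Value)> decimalParser =
    text => int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out int v)
        ? (true, v)
        : (false, 0);

Func<string, (bool Ok, int Value)> hexParser =
    text => text.StartsWith("0x", StringComparison.Ordinal)
            && int.TryParse(text.AsSpan(2), NumberStyles.HexNumber, CultureInfo.InvariantCulture, out int v)
        ? (true, v)
        : (false, 0);

// Builds a new parser that tries `first`, and delegates to `second` only on failure.
static Func<string, (bool Ok, T Value)> Else<T>(
    Func<string, (bool Ok, T Value)> first,
    Func<string, (bool Ok, T Value)> second)
    => text =>
    {
        var result = first(text);
        return result.Ok ? result : second(text);
    };

var combined = Else(decimalParser, hexParser);

Console.WriteLine(combined("42"));   // (True, 42)
Console.WriteLine(combined("0x2A")); // (True, 42)
Console.WriteLine(combined("nope")); // (False, 0)
```

Because Else returns a new instance rather than mutating either input, the original parsers stay usable on their own and chains can be built up incrementally.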

As mentioned above, the ITypeDescriber interface requires a little more discussion. It has six methods, each of which supports a particular use case for Cesil:

In addition to the raw ITypeDescriber interface, Cesil also provides three implementations of the interface out of the box. They are:

And that wraps up my deep dive into Cesil’s flexibility. There’s even more detail in the wiki and on the reference documentation (linked throughout this post) for the involved types, but this post should at least give you a decent basic understanding.

Which brings us to the Open Questions of this post:

  1. Are there any missing Format-specific options Cesil should have?
  2. Is the amount of control given over Cesil’s allocations sufficient?
  3. Are there any interesting .NET types that Cesil’s type mapping scheme doesn’t support?

As before, I’ve opened three issues to gather long form responses.  Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

My next post will tackle a smaller subject – I’ll be going over some of the new features that came with C# 8, and the how and why of Cesil’s adoption of them.


Overthinking CSV With Cesil: Writing Dynamic Types

I covered how to write known, static, types with Cesil in my previous post. As with reading, Cesil also supports dynamic types.

In my post on dynamic reading, I argued dynamic is still worth supporting due to how convenient it makes some common read operations. I feel the case for writing dynamic types is much weaker – it is rare to want to write heterogeneous types, and even rarer to not be able to easily map such a mixed collection to a single known type. All that said, for symmetry’s sake Cesil does have extensive support for writing dynamic types.

As with reading, writing static and dynamic types is essentially symmetric. All the same methods are provided, supporting all the same operations. The only difference is rather than using Configuration.For<TRow>() you use Configuration.ForDynamic(), and rather than IBoundConfiguration<TRow> being parameterized by a type TRow it’s parameterized by dynamic.

When using the DefaultTypeDescriber, performance varies considerably based on the “kind” of dynamic you are writing. Cesil special cases “well known” dynamic types for improved performance – namely the dynamic rows Cesil creates and ExpandoObject are treated specially. For other DLR aware types Cesil will use IDynamicMetaObjectProvider directly, which is considerably slower. Plain .NET types delegate to the usual EnumerateMembersToSerialize method, which implements “normal” .NET behavior.
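As an illustration of what makes ExpandoObject a convenient “well known” dynamic type (this is not Cesil’s code, just the base class library): ExpandoObject implements IDictionary<string, object>, so its members can be enumerated as key/value pairs without any dynamic dispatch.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

dynamic row = new ExpandoObject();
row.Name = "Kevin";
row.Age = 30;

// Viewing the same object through its dictionary interface exposes each member
// as a (name, value) pair, which maps naturally onto (column, cell).
var cells = (IDictionary<string, object>)row;
foreach (KeyValuePair<string, object> cell in cells)
{
    Console.WriteLine($"{cell.Key}={cell.Value}");
}
```

An arbitrary IDynamicMetaObjectProvider offers no such shortcut, so member discovery and access have to go through the DLR’s binding machinery instead.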

Cesil allows customizing the members discovered, and the order they’ll be written in, by using a custom ITypeDescriber with your Options and implementing GetCellsForDynamicRow directly. Simple inclusion or exclusion can be controlled by subclassing DefaultTypeDescriber and overriding the ShouldIncludeCell method. I’ll cover how this works in more detail in a later post that goes in depth into all of Cesil’s configuration options.

And that’s about it for dynamic serialization – there’s not a lot to cover since so much of it is “just like writing static types, but dynamic.”  This post’s Open Question is, accordingly, more “tactical” than previous ones:

The interface isn’t technically wrong, but it has the undesirable property that general implementations will allocate at least a little bit for each row written. An allocation-free alternative would be a marked improvement, provided it doesn’t come at the cost of flexibility or reasonable performance.

As before, I’ve opened an issue to gather long form responses.  Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

In my next post I’ll go into detail on all the configuration options Cesil supports. It’ll be a long post, as Cesil supports customizing the expected format, as well as almost every detail of describing and mapping types.


Overthinking CSV With Cesil: Writing Known Types

My last two posts have covered deserializing with Cesil; the next two will cover serialization. This post will specifically dig into the case where you know the types involved at compile time, while the next one will cover the dynamic type case. If you’ve read the previous posts on read operations hopefully a lot of this will seem intuitive, just in reverse.

Again, CesilUtils exposes a bunch of utility methods – this time with names like WriteXXX. Variants exist for single row, multiple row, synchronous, asynchronous, and “straight to a file” operations. Just like with reading, CesilUtils doesn’t allow you to reuse an IBoundConfiguration<TRow> nor does it expose the underlying I(Async)Writer<TRow> but is convenient when performance and customization aren’t of paramount importance.

As with reading, maximum performance and flexibility is found in using either IWriter<TRow> or IAsyncWriter<TRow> interfaces obtained from an IBoundConfiguration<TRow> created via Configuration.For<TRow>. Creating configurations is mildly expensive, so caching and reusing them can be beneficial.

The writer interfaces expose methods to do the following:

  • Write a collection of rows with WriteAll(Async)
    • The sync version accepts an IEnumerable<T>
    • The async version can take either an IEnumerable<T> or an IAsyncEnumerable<T>
  • Write a single row with Write(Async)
  • Write a comment with WriteComment(Async)
    • If a comment contains a row ending sequence of characters, it will be split into multiple comments automatically
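The comment splitting behavior can be sketched in a few lines. ToCommentLines is a hypothetical helper of my own (not how Cesil implements WriteComment), and it assumes # as the comment character and \r\n as the configured row ending.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits a comment wherever the configured row ending occurs,
// and prefixes each piece with the comment character so every line is a valid comment.
static IEnumerable<string> ToCommentLines(string comment, char commentChar = '#', string rowEnding = "\r\n")
{
    foreach (string piece in comment.Split(new[] { rowEnding }, StringSplitOptions.None))
    {
        yield return commentChar + piece;
    }
}

foreach (string line in ToCommentLines("hello\r\nworld"))
{
    Console.WriteLine(line);
}
// #hello
// #world
```

Without the split, the text after the row ending would be read back as a data row rather than as part of the comment.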

Mapping a type to a set of columns, the order of those columns, and the conversion of the values of those columns to text is done with the ITypeDescriber registered on the Options provided to Configuration.For<TRow> or the method on CesilUtils (by default, this is an instance of DefaultTypeDescriber). When an IBoundConfiguration<TRow> is created, ITypeDescriber.EnumerateMembersToSerialize is invoked once and the returned SerializableMembers detail how Cesil will map a TRow instance to a set of text columns.

Specifically a SerializableMember details

  • The name of the column, which may be written as part of a header row
  • The Getter to use to obtain a value from a TRow instance
  • An (optional) ShouldSerialize to control, per-row, whether a column should be included
  • The Formatter used to turn the column’s value into a sequence of characters
  • Whether or not to include a column if it has the default value for its type
    • Cesil uses Activator.CreateInstance to obtain a default instance of ValueTypes, and uses null as the default value for reference types
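The default value rule from the last bullet looks roughly like this in code. IsDefault is a hypothetical helper written for this post’s purposes; only Activator.CreateInstance and the null-for-reference-types convention come from the description above.

```csharp
#nullable enable
using System;

// Hypothetical helper: decides whether `value` is the default for `declaredType`.
static bool IsDefault(object? value, Type declaredType)
{
    // Value types get a zero-initialized instance; reference types use null.
    object? defaultValue = declaredType.IsValueType
        ? Activator.CreateInstance(declaredType)
        : null;

    return object.Equals(value, defaultValue);
}

Console.WriteLine(IsDefault(0, typeof(int)));                        // True
Console.WriteLine(IsDefault(5, typeof(int)));                        // False
Console.WriteLine(IsDefault(null, typeof(string)));                  // True
Console.WriteLine(IsDefault("", typeof(string)));                    // False
Console.WriteLine(IsDefault(default(DateTime), typeof(DateTime)));   // True
```

Note that an empty string is not a default value under this rule, since the default for string is null.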

The order of columns is taken from the order they are yielded by the IEnumerable<SerializableMember> returned by ITypeDescriber.EnumerateMembersToSerialize.

There is quite a lot of flexibility in how Getters, ShouldSerializes, and Formatters can be created. They will be covered in detail in a later post.

There’s less internal state being managed when Cesil is writing in comparison to when it is reading, so there are no fancy state machines or lookup tables. The most interesting part is NeedsEncodeHelper, which checks for characters that would require escaping and makes use of the X64 intrinsics supported in modern .NET (provided your processor supports them).

There are some minor additional details to keep in mind while writing with Cesil:

  • All XXXAsync() methods try to make as much progress as they can without blocking; they don’t just yield to yield.
  • All XXXAsync() methods do take an optional CancellationToken, and pass it down to the underlying stream. CancellationTokens are checked at reasonable intervals, but no guarantees are made about how often.
  • If you try to write a comment without having configured your Options with a comment character, an exception will be raised.
  • If you try and write a value that would require escaping without having configured your Options with a way to start and end escaped values, an exception will be raised.
    • Options.Default has " as its escape start and stop character.
  • If you try to write a value that includes the escape start and stop character, but have not configured your Options with an escape character, an exception will be raised.

And that about covers how to write static types with Cesil.

The Open Question for this post is a return to an earlier one, but with a particular focus on writing: Is there anything missing from I(Async)Writer<TRow> that you’d expect to be supported in a modern .NET CSV library?

This question has already led to some changes, which will appear in the next release of Cesil – adding comment writing methods that take ReadOnlySpan<char> and ReadOnlyMemory<char> parameters, clarifying some parameter names, and returning counts of the number of rows written from the enumerable taking write methods.

Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

In my next post I’ll cover how Cesil supports writing dynamic types, those not known at compile time. As you might expect from how reading static and dynamic types compare, it is very similar to how static types are written…


Overthinking CSV With Cesil: Reading Dynamic Types

In my last post I went over how to use Cesil to deserialize to known, static, types. Since version 4.0, C# has also had a notion of dynamic types – ones whose bindings, members, and conversions are all resolved at runtime – and Cesil also supports deserializing into these.

In 2020, supporting dynamic isn’t exactly a given – dynamic is relatively rare in the .NET ecosystem, the big “Iron” use cases in 2015 (dynamic languages running on .NET) are all dead as far as I can tell, and the static-vs-dynamic-typing pendulum has been swinging back towards static with the increasing popularity of languages like Go, Rust, and TypeScript (even Python supports type annotations these days). All that said, I still believe there are niches in C# well served by dynamic – “quick and dirty” data loading without declaring types, and loading heterogeneous data. These are both niches Cesil aims to support well, and therefore dynamic support is a first-class feature.

Part of being a first-class feature means that all the flexibility and ease of use from static types is also present when working with dynamic. There aren’t any new types or interfaces, just use Configuration.ForDynamic() instead of Configuration.For<TRow>(), Options.DynamicDefault (which assumes a header row is present) instead of Options.Default (which will detect if a header row is present or not, which isn’t possible with unknown types), and the EnumerateDynamicXXX() methods on CesilUtils. The same readers with the same methods are all available, only now instead of some concrete T you’ll get a dynamic back. And, while dynamic operation does impose additional overhead, Cesil still aims for dynamic operations to be reasonably performant – within a factor of 3 or so of their static equivalent.

Regardless of the Options used, the dynamic rows returned by Cesil always support:

  • Casting to IDisposable
  • Calling the Dispose() method
  • Get accessor with an int (ie. someRow[0]), which returns a dynamic cell
    • This will throw if the int is out of bounds
  • Get accessor with a column name (ie. someRow["someColumn"]), which returns a dynamic cell
    • If there was no header row present when reading (or if the column name is not found), this will throw
  • Get accessor with an Index (ie. someRow[^1]), which returns a dynamic cell
    • This will throw if the Index is out of bounds
  • Get accessor with a Range (ie. someRow[1..2]), which returns a dynamic row
    • This will throw if the Range is out of bounds
  • Get accessor with a ColumnIdentifier (ie. someRow[ColumnIdentifier.Create(3)]), which returns a dynamic cell
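If you haven’t used Index and Range yet, the semantics are the same as they are on arrays. The sketch below uses a plain string[] as a stand-in for a row (it does not touch Cesil’s dynamic rows) to show what ^1 and 1..3 select.

```csharp
using System;

string[] cells = { "a", "b", "c", "d" };

Index last = ^1;      // counts from the end: the final element
Range middle = 1..3;  // start inclusive, end exclusive

Console.WriteLine(cells[0]);                        // a
Console.WriteLine(cells[last]);                     // d
Console.WriteLine(string.Join(",", cells[middle])); // b,c

// Out of bounds access throws, here because ^5 points before the start of a 4 element array.
try
{
    _ = cells[^5];
}
catch (IndexOutOfRangeException)
{
    Console.WriteLine("out of bounds");
}
```

The exclusive end is the detail most likely to trip people up: someRow[1..2] selects a single column, not two.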

Likewise, regardless of the Options used, dynamic cells (obtained by indexing a dynamic row per above) always support casting to IConvertible. IConvertible is a temperamental interface, so Cesil’s implementation is limited – it doesn’t support non-null IFormatProviders, and makes a very coarse attempt at determining TypeCode. Basically, Cesil does just enough for the various methods on Convert to work “as you’d expect” for dynamic cells.

Just like with static deserialization, the ITypeDescriber on the Options used to create the IBoundConfiguration<TRow> controls how values are mapped to types. The differences are that dynamic conversions are discovered each time they occur (versus once, for static types) and conversion decisions are deferred until a cast (versus happening during reading, for static types). Dynamic deserialization does not allow custom InstanceProviders (as the dynamic backing infrastructure is provided directly by Cesil) – however the XXXWithReuse() methods on I(Async)Reader<TRow> still allow for some control over allocations.

Customization of dynamic conversions can be done with the DynamicRowConverter type (for rows) and the ITypeDescriber.GetDynamicCellParserFor() method (for cells). I’ll dig further into these capabilities in a later post. Out of the box, the DefaultTypeDescriber (used by Options.DynamicDefault) implements the conversions you would expect.

Namely, for dynamic rows Cesil’s defaults allow conversion to:

  • Object
  • Tuples
    • Rows with more than 7 columns can be mapped to nested Tuples using the TRest generic parameter
  • ValueTuples, including those with a TRest parameter
    • Rows with more than 7 columns can be mapped to nested ValueTuples using the TRest generic parameter
  • IEnumerable<T>
    • Each cell is lazily converted to T
  • IEnumerable
    • Each cell becomes an object, with no conversion occurring
  • Any type with a constructor taking the same number of parameters as the row has columns
    • Each cell is converted to the expected parameter type
  • Any type with a constructor taking zero parameters, provided the row has column names
    • Any properties (public or private, static or instance) whose name matches a column name will be set to the column’s value
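The TRest nesting mentioned for Tuples and ValueTuples is a property of the types themselves rather than anything Cesil-specific. This short sketch shows how the C# compiler represents a nine element tuple, which is the shape a nine column row would need to be mapped onto.

```csharp
using System;

var wide = (1, 2, 3, 4, 5, 6, 7, 8, 9);

// The underlying type has seven element slots plus a TRest slot holding the remainder:
// ValueTuple<int, int, int, int, int, int, int, ValueTuple<int, int>>
Console.WriteLine(wide.GetType().GetGenericArguments().Length); // 8
Console.WriteLine(wide.Rest);                                   // (8, 9)

// Item8 is compiler sugar for Rest.Item1.
Console.WriteLine(wide.Item8 == wide.Rest.Item1);               // True
```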

If no conversion is possible, Cesil will raise an exception. If a conversion is chosen that requires converting cells to static values, those conversions may also fail and raise exceptions.

For dynamic cells, Cesil’s defaults allow conversion to:

As with rows, finding no conversion or having a conversion fail will cause Cesil to raise an exception.

And that covers the why and what of dynamic deserialization in Cesil. This post leaves me with two Open Questions:

  1. Are there any useful dynamic operations around reading that are missing from Cesil?
  2. Do the conversions provided by the DefaultTypeDescriber for dynamic rows and cells cover all common use cases?

As before, I’ve opened two issues to gather long form responses.  Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Next time I’ll dive into the write operations Cesil supports, starting with static types.