Overthinking CSV With Cesil: “Maximum” Flexibility

Over the course of this series I’ve alluded to a future post where I’ll dig into all the configuration options Cesil offers.

This is that gigantic post.

I conceptualize Cesil’s configurability as being along three axes: the format, memory use, and type mapping. Format options control the style of delimiter separated value (DSV) you’re reading or writing, memory options give fine grained control over allocations, and type mappings handle converting from .NET types to text and vice versa.

To begin, let’s start with…

Format Options

The need to configure format options is clear for any CSV library since, as I said in an earlier post, CSV isn’t really a format – it’s a bunch of related formats. For a library like Cesil, which aims to support all reasonable DSV formats, the need is even more obvious.

All configuration options relevant to formatting live on the Options type, with corresponding WithXXX methods on OptionsBuilder. These options are:

  • ValueSeparator – the single character used to separate columns in a row
  • RowEnding – whether rows end in the \n, \r, or \r\n character sequence
    • Most CSV files use \r\n, but Cesil can automatically detect this when reading if you use RowEnding.Detect
    • When using Detect, Cesil will use the character sequence it first encounters as the expected row ending
    • When writing, a RowEnding other than Detect must be provided or an exception will be raised when an IWriter<TRow> or IAsyncWriter<TRow> is created
  • EscapedValueStartAndEnd – the character used to start and end an escaped value
    • Typically this is a double quote, but it can be left unset
    • If your format treats , as a value separator and would store Montrose, Kevin as "Montrose, Kevin" then you’re using a double quote for this
  • EscapedValueEscapeCharacter – the character used to start an escape sequence when you are already in an escaped value
    • Typically this is a double quote, but it can be left unset
    • If your format would store Kevin "Monty" Montrose as "Kevin ""Monty"" Montrose" then you’re using a double quote for this
  • ReadHeader – whether to always expect a header row, never expect a header row, or automatically detect a header row
    • This tends to vary file to file, so it is often set to ReadHeader.Detect
    • If you use Always or Detect, Cesil will use the header to infer column order when mapping columns to .NET types
    • If your format supports comments, it is legal for comments to precede a header row
  • CommentCharacter – the single character that starts a comment line
    • Typically this is not set, but if set it is often #
    • For example a single line of #hello world would be a comment of hello world, in formats with # for this
  • WhitespaceTreatment – whether to trim whitespace that is encountered in certain places while parsing
    • Most formats preserve whitespace if it is encountered in a value, and do not permit whitespace as padding around escaped values
    • If your format is one that is unusual, Cesil supports automatically trimming whitespace in certain cases. Refer to WhitespaceTreatments for the full list of trimming behaviors
    • Note that WhitespaceTreatments is a [Flags] enum, so the individual trimming behaviors can be freely combined
  • ExtraColumnTreatment – how to handle encountering “extra” columns when reading
    • Cesil considers a column “extra” if it doesn’t map to a member, or if it’s in a column that didn’t appear in the header row (if there is a header row)
    • If you’re reading into dynamics and not requiring a header row, extra columns will be any that have an index greater than the highest index in the first read row
    • This must be one of:
  • WriteHeader – whether or not to write a header row before writing any values
  • WriteTrailingRowEnding – whether or not to end the final written row with the configured RowEnding
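
To make the interplay between ValueSeparator, EscapedValueStartAndEnd, EscapedValueEscapeCharacter, and RowEnding a bit more concrete, here is a rough sketch in Python. It is not Cesil code and makes no claim about how Cesil implements any of this; encode_value, encode_row, and detect_row_ending are hypothetical helpers invented for illustration, and the detection function deliberately ignores line breaks inside escaped values.

```python
def encode_value(value, separator=",", quote='"', escape='"'):
    """Encode one value; wrap it in quotes only if it contains a special character."""
    specials = (separator, quote, "\r", "\n")
    if not any(s in value for s in specials):
        return value
    inner = value
    if escape != quote:
        # A distinct escape character has to be escaped itself first.
        inner = inner.replace(escape, escape + escape)
    # Prefix every embedded quote with the escape character.
    inner = inner.replace(quote, escape + quote)
    return quote + inner + quote


def encode_row(values, separator=",", row_ending="\r\n"):
    """Join encoded values with the separator and finish with the row ending."""
    return separator.join(encode_value(v, separator) for v in values) + row_ending


def detect_row_ending(text):
    """Return the first row ending seen in text, or None if there is none.

    Simplification (assumption): line breaks inside escaped values are not skipped.
    """
    for i, c in enumerate(text):
        if c == "\n":
            return "\n"
        if c == "\r":
            return "\r\n" if text[i + 1:i + 2] == "\n" else "\r"
    return None
```

With the defaults above, encode_value("Montrose, Kevin") yields "Montrose, Kevin" wrapped in double quotes, and the embedded quotes in Kevin "Monty" Montrose are doubled, matching the two examples in the list.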

And that’s it. Ten options which, hopefully, allow Cesil to cope with all reasonable DSV formats out there. I’d be quite interested to learn of any that Cesil can’t cope with – it’s always a fun challenge to make a system more flexible without sacrificing ease of use or performance.

Now we’ll move on to…

Allocation Options

Beyond “don’t allocate more than necessary” it may strike some as odd to care about memory allocation in a .NET library – after all, .NET is a managed (i.e. garbage collected) platform. I believe that a modern .NET library should in fact strive both to minimize allocations and to provide ways for clients to control those allocations that must happen. The .NET ecosystem has been evolving in a much more performance focused direction for a while now, with fancy new types like Span and Pipelines encouraging low allocation and low copy patterns, first class support for processor intrinsics, and struct alternatives (that don’t default to allocating on the heap) like ValueTuple and ValueTask. The .NET GC is good, but it’s never going to be free, so a laser focus on allocations is common when performance is a concern. It follows that if a library’s clients are focused on controlling allocations, the library needs to give them the tools to do so.

That said, some heap allocations are unavoidable. Cesil does its best to perform all unavoidable allocations prior to returning an I(Async)Reader<TRow> or I(Async)Writer<TRow> – so creating Options, binding IBoundConfigurations<TRow>, and the actual creation of a reader or writer may allocate but after that, allocations are under client control. There are exceptions, but I’ll dig into those in a later section.

On Options, there are a few relevant members:

  • MemoryPool – when Cesil needs to allocate, the MemoryPool<char> it uses to obtain a block of memory
    • Cesil will always request a size it can work with, but if the client’s pool returns a chunk of memory smaller than the requested size an exception may be raised
    • Cesil will always call IMemoryOwner<char>.Dispose() when finished using a chunk of memory
    • MemoryPools must be thread safe, as Cesil makes no guarantees that IMemoryOwner<char> references remain on any given thread
  • ReadBufferSizeHint – when reading, Cesil needs a buffer to store characters it has not yet processed. This value specifies how large that buffer should be
    • There is often a tradeoff between buffer size and performance: the larger the buffer, the fewer calls to an underlying stream are needed to load all data, and thus the more quickly reading will complete. This stops being true once a buffer is large enough that the underlying stream cannot fill it on each call, or if the underlying stream is frequently blocked waiting for more data
    • Setting ReadBufferSizeHint to 0 tells Cesil to request a “reasonable default” buffer size
    • The read buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
  • WriteBufferSizeHint – when writing, Cesil can stage writes into a buffer to improve performance. This value specifies if a buffer should be used, and how large it should be
    • As with ReadBufferSizeHint there is often a tradeoff between buffer size and performance. If there is no buffer, every write must call into the underlying stream, which can make writing take considerably longer
    • Setting WriteBufferSizeHint to 0 disables write buffering; all data will be sent directly to the underlying stream
    • Setting WriteBufferSizeHint to null tells Cesil to request a “reasonable default” buffer size
    • The write buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
  • DynamicRowDisposal – controls when dynamic rows obtained during reading are disposed
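
The buffer size tradeoff described for ReadBufferSizeHint and WriteBufferSizeHint can be illustrated with a small Python sketch. None of this is Cesil code: count_reads, CountingSink, and StagedWriter are hypothetical names, the in-memory stream stands in for a real one, and treating a size of 0 as “no write buffering” simply mirrors the behavior described above.

```python
import io


def count_reads(text, buffer_size):
    """Count how many reads that return data are needed to consume text."""
    stream = io.StringIO(text)
    calls = 0
    while stream.read(buffer_size):
        calls += 1
    return calls


class CountingSink:
    """Stand-in for an underlying stream; records how often it is written to."""

    def __init__(self):
        self.writes = 0
        self.parts = []

    def write(self, s):
        self.writes += 1
        self.parts.append(s)


class StagedWriter:
    """Stages writes in memory and forwards them to the sink in larger pieces."""

    def __init__(self, sink, buffer_size):
        self.sink = sink
        self.buffer_size = buffer_size
        self.pending = []
        self.pending_len = 0

    def write(self, s):
        if self.buffer_size == 0:
            # No buffering: every write goes straight to the sink.
            self.sink.write(s)
            return
        self.pending.append(s)
        self.pending_len += len(s)
        if self.pending_len >= self.buffer_size:
            self.flush()

    def flush(self):
        """Send any staged text to the sink as a single write."""
        if self.pending:
            self.sink.write("".join(self.pending))
            self.pending = []
            self.pending_len = 0
```

For 7,000 characters of input, count_reads needs 110 reads with a 64 character buffer but only 2 with a 4,096 character buffer. Likewise, writing 100 five-character rows through StagedWriter causes 100 sink writes with a size of 0, and 8 with a size of 64 (counting the final flush).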

Moving beyond Options, IReader<TRow> and IAsyncReader<TRow> have the XXXWithReuse() methods – these methods take a ref parameter that points to a row to reuse when reading. When processing many rows in sequence, these methods let you allocate a single row and then repeatedly reuse it – greatly reducing the number of allocations. There are a few caveats to keep in mind. First, if a row has a Setter backed by constructor parameters (more on those below), the row cannot be reused and will always be reallocated. Second, value types are always zero initialized, so there is always a row to reuse if the row is a value type – this means your InstanceProvider (more on those below) may not be invoked when you’d expect it to be if the row were a reference type. Finally, because the ref parameter is initialized with a row before Cesil knows whether that row will actually be populated, it is possible (especially in async cases, when the underlying stream blocks) for Cesil to allocate a row it ends up not needing.
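
The general shape of the reuse pattern looks something like the following Python sketch. Python has no ref parameters, so the row is passed in and returned instead; Row and read_with_reuse are hypothetical names, and this says nothing about how Cesil's XXXWithReuse() methods are actually implemented.

```python
class Row:
    """A simple mutable row with two columns."""

    __slots__ = ("name", "age")

    def __init__(self):
        self.name = None
        self.age = None


def read_with_reuse(line, row=None):
    """Parse 'name,age' into row, allocating a new Row only if none was given."""
    if row is None:
        row = Row()
    name, age = line.split(",")
    row.name = name
    row.age = int(age)
    return row


def total_age(lines):
    """Sum the age column while allocating at most one Row for all lines."""
    row = None
    total = 0
    for line in lines:
        row = read_with_reuse(line, row)
        # Copy out what you need now; the next read overwrites this row.
        total += row.age
    return total
```

The flip side of reuse is visible in total_age: anything you want to keep from a row has to be copied out before the next read, because the same instance is overwritten each time.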

The last piece of allocation control has lots of overlap with type mapping, which is covered in the next section, but in brief: InstanceProviders give clients control over how rows are obtained, and Parsers give control over how ReadOnlySpan<char>s are turned into instances of other types. Other types that participate in type mapping allow for control over accessing members, assigning members, and so on – so a client can customize any step of the process that might concern them.

With allocations covered, let’s now proceed to the final axis of configuration…

Type Mapping

DSVs provide rows and columns of text, and that’s it really. .NET has a much richer type system, and so Cesil must provide some way to move between these two worlds. Complicating that is how many different styles of .NET coding are out there; a good library must provide clients with the tools they need to match its behavior to their own applications.

Cesil breaks this process of mapping types to and from text into several logical pieces, many of which have been mentioned in earlier posts:

Each of these types has particular rules about what kind of method, delegate, constructor, etc. can back them which are detailed in the documentation for each type, and on Cesil’s wiki.

Additionally, a number of these types (like Parser) have a notion of failure (indicated by returning false from a method or delegate) and support delegating to another instance as a fallback. This is used via their Else(…) method, which creates and returns a new instance that will delegate on failure.
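
The fallback idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not Cesil's Parser type: the Parser class here, its try_parse and else_ methods, and the from_converter helper are all hypothetical, and failure is modeled as a (False, None) result rather than a false return with an out parameter.

```python
class Parser:
    """Wraps a function text -> (ok, value); ok is False when parsing fails."""

    def __init__(self, func):
        self._func = func

    def try_parse(self, text):
        return self._func(text)

    def else_(self, fallback):
        """Return a NEW parser that tries this one first, then the fallback."""

        def chained(text):
            ok, value = self._func(text)
            if ok:
                return True, value
            return fallback.try_parse(text)

        return Parser(chained)


def from_converter(convert):
    """Build a Parser from a callable that raises ValueError on bad input."""

    def func(text):
        try:
            return True, convert(text)
        except ValueError:
            return False, None

    return Parser(func)


parse_int = from_converter(int)
parse_float = from_converter(float)

# Try int first; only if that fails, fall back to float.
parse_number = parse_int.else_(parse_float)
```

Note that else_ leaves both original parsers untouched and returns a new instance, which is the same shape described for Else(…) above.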

As mentioned above, the ITypeDescriber interface requires a little more discussion. It has six methods, each of which supports a particular use case for Cesil:

In addition to the raw ITypeDescriber interface, Cesil also provides three implementations of the interface out of the box. They are:

And that wraps up my deep dive into Cesil’s flexibility. There’s even more detail in the wiki and in the reference documentation (linked throughout this post) for the involved types, but this post should at least give you a decent basic understanding.

Which brings us to the Open Questions of this post:

  1. Are there any missing Format-specific options Cesil should have?
  2. Is the amount of control given over Cesil’s allocations sufficient?
  3. Are there any interesting .NET types that Cesil’s type mapping scheme doesn’t support?

As before, I’ve opened three issues to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

My next post will tackle a smaller subject – I’ll be going over some of the new features that came with C# 8, and the how and why of Cesil’s adoption of them.


One Comment on “Overthinking CSV With Cesil: “Maximum” Flexibility”

  1. iamauser says:

    Would be cool if you could also do a write up about how you started this project…

    Did you write a spec first?

    One of the hardest parts in my opinion is starting a new thing,…
    How should I start?
    What is essential to write the first code?

    Thanks