Overthinking CSV With Cesil: “Maximum” Flexibility

Posted: 2020/06/24 | Author: kevinmontrose | Filed under: code | Tags: cesil |1 Comment

Over the course of this series I’ve alluded to a future post where I’ll dig into all the configuration options Cesil offers.

This is that, gigantic, post.

I conceptualize Cesil’s configurability as being along three axes: the format, memory use, and type mapping. Format options control the style of delimiter-separated value (DSV) you’re reading or writing, memory options give fine-grained control over allocations, and type mappings handle converting from .NET types to text and vice versa.

To begin, let’s start with…

Format Options

The necessity of being able to configure different format options is clear for any CSV library, since as I said in an earlier post CSV isn’t really a format – it’s a bunch of related formats. For a library like Cesil, which aims to support all reasonable DSV formats, the necessity is even more obvious.

All configuration options relevant to formatting live on the Options type, with corresponding WithXXX methods on OptionsBuilder. These options are:

ValueSeparator – the single character used to separate columns in a row
- This is the C(omma) in CSV, so it is almost always a comma
- This must have a value, it cannot be left unset
- If your format stores the sequential values Kevin Monty Montrose as Kevin,Monty,Montrose then you’re using a comma for this
- Due to feedback on earlier posts, this will become a string in the next release of Cesil
RowEnding – whether rows end in the \n, \r, or \r\n character sequence
- Most CSV files use \r\n, but Cesil can automatically detect this when reading if you use RowEnding.Detect
- When using Detect, Cesil will use the character sequence it first encounters as the expected row ending
- When writing, a RowEnding other than Detect must be provided or an exception will be raised when an IWriter<TRow> or IAsyncWriter<TRow> is created
EscapedValueStartAndEnd – the character used to start and end an escaped value
- Typically this is a double quote, but it can be left unset
- If your format treats , as a value separator and would store Montrose, Kevin as "Montrose, Kevin" then you’re using a double quote for this
EscapedValueEscapeCharacter – the character used to start an escape sequence when you are already in an escaped value
- Typically this is a double quote, but it can be left unset
- If your format would store Kevin "Monty" Montrose as "Kevin ""Monty"" Montrose" then you’re using a double quote for this
ReadHeader – whether to always expect a header row, never expect a header row, or automatically detect a header row
- This tends to vary file to file, so it is often set to ReadHeader.Detect
- If you use Always or Detect, Cesil will use the header to infer column order when mapping columns to .NET types
- If your format supports comments, it is legal for comments to precede a header row
CommentCharacter – the single character that starts a comment line
- Typically this is not set, but if set it is often #
- For example a single line of #hello world would be a comment of hello world, in formats with # for this
WhitespaceTreatment – whether to trim whitespace that is encountered in certain places while parsing
- Most formats preserve whitespace if it is encountered in a value, and do not permit whitespace as padding around escaped values
- If your format is one that is unusual, Cesil supports automatically trimming whitespace in certain cases. Refer to WhitespaceTreatments for the full list of trimming behaviors
- Note that WhitespaceTreatments is a [Flags] enum, so the individual trimming behaviors can be freely combined
ExtraColumnTreatment – how to handle encountering “extra” columns when reading
- Cesil considers a column “extra” if it doesn’t map to a member, or if it’s in a column that didn’t appear in the header row (if there is a header row)
- If you’re reading into dynamics and not requiring a header row, extra columns will be any that have an index greater than the highest index in the first read row
- This must be one of:
  - ExtraColumnTreatment.Ignore – extra columns are ignored, provided they don’t violate the format
  - ExtraColumnTreatment.IncludeDynamic – identical to Ignore when reading static types, but if reading into dynamics then extra columns are included. Extra columns are accessed either by index, or via a conversion into an IEnumerable or IEnumerable<T> or other type that permits access
  - ExtraColumnTreatment.ThrowException – an exception is raised if an extra column is encountered
WriteHeader – whether or not to write a header row before writing any values
- This tends to vary file to file, but Options.Default and Options.DynamicDefault do write headers to ease development
WriteTrailingRowEnding – whether or not to end the final written row with the configured RowEnding
- This seems to vary wildly, but Options.Default and Options.DynamicDefault do not write a trailing row ending

And that’s it. Ten options which, hopefully, allow Cesil to cope with all reasonable DSV formats out there. I’d be quite interested to learn of any that Cesil can’t cope with – it’s always a fun challenge to make a system more flexible without sacrificing ease of use or performance.
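
To make the escaping and row-ending options concrete, here is a minimal Python sketch of the behavior described above. It is not Cesil's code or API: `format_value`, `detect_row_ending`, and their parameters are hypothetical names chosen for illustration, and the sketch assumes the escape character and the start/end character are the same (the common double-quote case).

```python
from typing import Optional


def format_value(value: str, separator: str = ",", quote: str = '"',
                 escape: str = '"') -> str:
    """Wrap a value in the start/end character only when it needs escaping."""
    needs_escape = any(c in value for c in (separator, quote, "\r", "\n"))
    if not needs_escape:
        return value
    # Inside an escaped value, each quote is preceded by the escape character.
    return quote + value.replace(quote, escape + quote) + quote


def detect_row_ending(text: str) -> Optional[str]:
    """Return the first row ending encountered, in the spirit of a Detect mode.

    Simplification: this ignores the possibility of a row ending appearing
    inside an escaped value.
    """
    for i, c in enumerate(text):
        if c == "\n":
            return "\n"
        if c == "\r":
            return "\r\n" if text[i + 1:i + 2] == "\n" else "\r"
    return None
```

With these defaults, `format_value('Kevin "Monty" Montrose')` produces `"Kevin ""Monty"" Montrose"`, matching the example given for EscapedValueEscapeCharacter.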

Now we’ll move on to…

Allocation Options

Beyond “don’t allocate more than necessary” it may strike some as odd to care about memory allocation in a .NET library – after all, .NET is a managed (i.e. garbage collected) platform. I believe that in fact a modern .NET library should strive to both minimize allocations and provide ways for clients to control those allocations that must happen. The .NET ecosystem has been evolving in a much more performance-focused direction for a while now, with fancy new types like Span and Pipelines encouraging low allocation and low copy patterns, first class support for processor intrinsics, and struct alternatives (that don’t default to allocating on the heap) like ValueTuple and ValueTask. The .NET GC is good, but it’s never going to be free, so a laser focus on allocations is common when concerned with performance. It follows that if a library’s clients are focused on controlling allocations, the library needs to give them the tools to do so.

That said, some heap allocations are unavoidable. Cesil does its best to perform all unavoidable allocations prior to returning an I(Async)Reader<TRow> or I(Async)Writer<TRow> – so creating Options, binding IBoundConfigurations<TRow>, and the actual creation of a reader or writer may allocate but after that, allocations are under client control. There are exceptions, but I’ll dig into those in a later section.

On Options, there are a few relevant members:

MemoryPool – when Cesil needs to allocate, the MemoryPool<char> it uses to obtain a block of memory
- Cesil will always request a size it can work with, but if the client’s pool returns a chunk of memory smaller than the requested size an exception may be raised
- Cesil will always call IMemoryOwner<char>.Dispose() when finished using a chunk of memory
- MemoryPools must be thread safe, as Cesil makes no guarantees that IMemoryOwner<char> references remain on any given thread
ReadBufferSizeHint – when reading, Cesil needs a buffer to store characters it has not yet processed. This value specifies how large that buffer should be
- There is often a tradeoff between buffer size and performance: the larger the buffer, the fewer calls to the underlying stream are needed to load all data, and thus the more quickly reading completes. This stops being true once a buffer is large enough that the underlying stream cannot fill it on each call, or if the underlying stream is frequently blocked waiting for more data
- Setting ReadBufferSizeHint to 0 tells Cesil to request a “reasonable default” buffer size
- The read buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
WriteBufferSizeHint – when writing, Cesil can stage writes into a buffer to improve performance. This value specifies if a buffer should be used, and how large it should be
- As with ReadBufferSizeHint there is often a trade-off between buffer size and performance. If there is no buffer, every write must call into the underlying stream which can make writing take considerably longer.
- Setting WriteBufferSizeHint to 0 disables write buffering; all data will be sent directly to the underlying stream
- Setting WriteBufferSizeHint to null tells Cesil to request a “reasonable default” buffer size
- The write buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
DynamicRowDisposal – controls when dynamic rows obtained during reading are disposed
- Data that backs dynamic rows is kept in memory obtained from the configured MemoryPool<char>, to avoid allocating on the heap if not necessary. This allows Cesil to cast a dynamic to a ValueTuple (for example) without first converting the column values to strings. As a consequence of this, Cesil’s dynamic rows are effectively IDisposable
- The two options for DynamicRowDisposal are:
  - DynamicRowDisposal.OnExplicitDispose – rows must have .Dispose() called on them; not doing so will leak an IMemoryOwner<char>
  - DynamicRowDisposal.OnReaderDispose – rows will be automatically disposed when the I(Async)Reader<TRow> that last touched them is disposed
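
The write-buffer tradeoff can be illustrated with a small Python sketch. This is not how Cesil is implemented: `StagedWriter`, `CountingStream`, and `DEFAULT_BUFFER_SIZE` are hypothetical names, and the default size is an arbitrary assumption. It only mirrors the convention described above, where a hint of 0 disables buffering and an unset hint requests a default size.

```python
import io
from typing import Optional

DEFAULT_BUFFER_SIZE = 4096  # assumed "reasonable default"; not Cesil's actual value


class CountingStream(io.StringIO):
    """Underlying stream that counts how many times write() is called."""

    def __init__(self) -> None:
        super().__init__()
        self.write_calls = 0

    def write(self, s: str) -> int:
        self.write_calls += 1
        return super().write(s)


class StagedWriter:
    """Stages writes in memory and forwards them in larger chunks."""

    def __init__(self, inner, buffer_size_hint: Optional[int] = None) -> None:
        self._inner = inner
        # None -> default size, 0 -> no buffering, otherwise the requested size.
        self._size = DEFAULT_BUFFER_SIZE if buffer_size_hint is None else buffer_size_hint
        self._staged = []
        self._staged_len = 0

    def write(self, text: str) -> None:
        if self._size == 0:
            # Buffering disabled: every write goes straight to the stream.
            self._inner.write(text)
            return
        self._staged.append(text)
        self._staged_len += len(text)
        if self._staged_len >= self._size:
            self.flush()

    def flush(self) -> None:
        if self._staged:
            self._inner.write("".join(self._staged))
            self._staged.clear()
            self._staged_len = 0
```

Writing the same three fragments through a `StagedWriter` with a hint of 0 calls the underlying stream three times, while the default-sized buffer forwards them in a single call when flushed.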

Moving beyond Options, IReader<TRow> and IAsyncReader<TRow> have the XXXWithReuse() methods – these methods take a ref parameter that points to a row to reuse when reading. When processing many rows in sequence, these methods let you allocate a single row and then just repeatedly reuse it – greatly reducing the number of allocations. There are a few caveats to keep in mind. First, if a row has a Setter backed by constructor parameters (more on those below) the row cannot be reused and will always be reallocated. Second, value types are always zero initialized so there is always a row to reuse if the row is a value type – this means your InstanceProvider (more on those below) may not be invoked when you would expect it to be had the row been a reference type. Finally, because the ref parameter is already set to the row that will ultimately be populated by the time an XXXWithReuse() method returns, it is possible (especially in async cases, when the underlying stream blocks) for Cesil to allocate a row it ends up not needing.
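
The reuse pattern itself is language-agnostic, so here is a rough Python sketch of the idea. Python has no ref parameters, so this hypothetical `try_read_with_reuse` returns the row instead; `Row`, `CountingProvider`, and the sample columns are likewise illustrative assumptions rather than anything from Cesil.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Row:
    name: str = ""
    count: int = 0


class CountingProvider:
    """Stand-in for an InstanceProvider; counts how often a row is created."""

    def __init__(self) -> None:
        self.calls = 0

    def create(self) -> Row:
        self.calls += 1
        return Row()


def try_read_with_reuse(cells: Iterator[List[str]], row: Optional[Row],
                        provider: CountingProvider) -> Optional[Row]:
    """Populate `row` in place if one was supplied; allocate only when it is None.

    Returns None once the input is exhausted.
    """
    raw = next(cells, None)
    if raw is None:
        return None
    if row is None:
        row = provider.create()
    row.name, row.count = raw[0], int(raw[1])
    return row
```

Looping with `row = try_read_with_reuse(cells, row, provider)` over any number of input rows invokes the provider once, since every later call finds an existing row to overwrite. The flip side is that a caller must copy out anything it wants to keep before the next read.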

The last piece of allocation control has lots of overlap with type mapping, which is covered in the next section, but in brief: InstanceProviders give clients control over how rows are obtained, and Parsers give control over how ReadOnlySpan<char>s are turned into instances of other types. Other types that participate in type mapping allow for control over accessing members, assigning members, and so on – so a client can customize any step of the process that might concern them.

With allocations covered, let’s now proceed to the final axis of configuration…

Type Mapping

DSVs provide rows and columns of text, and that’s it really. .NET has a much richer type system, and so Cesil must provide some way to move between these two worlds. Complicating that is how many different styles of .NET coding are out there; a good library must provide clients with the tools they need to match its behavior to their own applications.

Cesil breaks this process of mapping types to and from text into several logical pieces, many of which have been mentioned in earlier posts:

InstanceProviders, which obtain instances of rows to populate
- These can be created from delegates, MethodInfos, or ConstructorInfos
Parsers, which turn text data into instances of .NET types
- These can be created from delegates, MethodInfos, or ConstructorInfos
Resets, which allow per-column code to run prior to assignment of a row member
- These can be created from delegates, or MethodInfos
Setters, which take the instances produced by Parsers and assign them to members on a row obtained from an InstanceProvider
- These can be created from delegates, MethodInfos, FieldInfos, PropertyInfos, or ParameterInfos
DeserializableMembers, which group a name, Parser, Reset, Setter, and MemberRequired together to describe the treatment of a single column when performing read operations
ShouldSerializes, which allow for per-row control over whether a member is written
- These can be created from delegates, or MethodInfos
Getters, which obtain instances of .NET types from rows which are placed in columns
- These can be created from delegates, MethodInfos, FieldInfos, or PropertyInfos
Formatters, which turn the instances obtained from Getters into text data
- These can be created from delegates, or MethodInfos
SerializableMembers, which group a name, ShouldSerialize, Getter, Formatter, and EmitDefaultValue together to describe the treatment of a single column when performing write operations
DynamicRowConverters, which back casting dynamic rows into concrete instances of .NET types
- These can be created from delegates, MethodInfos, ConstructorInfos, or a combination of ConstructorInfos and Setters
DynamicCellValues, which back converting cells obtained from a dynamic row into text data
- These can be created from Formatters
ITypeDescribers, the interface which Cesil uses to obtain all the above
- As the primary way Cesil allows customizing this type mapping, it is discussed in more detail below

Each of these types has particular rules about what kind of method, delegate, constructor, etc. can back them which are detailed in the documentation for each type, and on Cesil’s wiki.

Additionally, a number of these types (like Parser) have a notion of failure (indicated by returning false from a method or delegate) and support delegating to another instance as a fallback. This is used via their Else(…) method, which creates and returns a new instance that will delegate on failure.
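
The fallback pattern can be sketched in a few lines of Python. This is an illustration of the general technique only: the `Parser` class, `else_` method, and the two parse functions below are hypothetical and do not reflect Cesil's actual signatures. Failure is modeled as a `(False, None)` result, standing in for a method or delegate that returns false.

```python
from typing import Callable, Optional, Tuple

ParseResult = Tuple[bool, Optional[object]]


class Parser:
    """Wraps a try-parse function that reports success or failure."""

    def __init__(self, func: Callable[[str], ParseResult]) -> None:
        self._func = func

    def parse(self, text: str) -> ParseResult:
        return self._func(text)

    def else_(self, fallback: "Parser") -> "Parser":
        """Return a NEW parser that tries this one, then `fallback` on failure."""
        def chained(text: str) -> ParseResult:
            ok, value = self._func(text)
            return (ok, value) if ok else fallback.parse(text)
        return Parser(chained)


def parse_int(text: str) -> ParseResult:
    try:
        return True, int(text)
    except ValueError:
        return False, None


def parse_float(text: str) -> ParseResult:
    try:
        return True, float(text)
    except ValueError:
        return False, None


# Try integers first, then fall back to floats.
number_parser = Parser(parse_int).else_(Parser(parse_float))
```

Because `else_` returns a new instance rather than mutating the original, the integer-only parser remains usable on its own, and chains of any length can be built by calling `else_` repeatedly.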

As mentioned above, the ITypeDescriber interface requires a little more discussion. It has six methods, each of which supports a particular use case for Cesil:

Invoked once per IBoundConfiguration<TRow>
- GetInstanceProvider(TypeInfo) to determine how any created I(Async)Reader<TRow> should obtain the rows they return, if the rows aren’t provided via a XXXWithReuse method
  - When reading dynamic rows, this method is not invoked
- EnumerateMembersToDeserialize(TypeInfo) to determine what columns any I(Async)Reader<TRow> should expect to find, and how to map the text in those columns into members on the row
  - The order of returned DeserializableMembers is used if there is no header row when reading
  - When reading dynamic rows, this method is not invoked
- EnumerateMembersToSerialize(TypeInfo) to determine how many columns an I(Async)Writer<TRow> should write per row, and how to obtain and format the values behind those columns from a row
  - The order of returned SerializableMembers controls the order of the columns when written
  - When writing dynamic rows, this method is not invoked
Invoked as needed in response to operations on dynamic rows
- GetDynamicCellParserFor(in ReadContext, TypeInfo) to obtain a Parser that can convert the text data backing a cell in a dynamic row to a particular .NET type
  - The provided ReadContext describes where the cell occurred in the data, as well as the Options used to originally parse the data
- GetDynamicRowConverter(in ReadContext, IEnumerable, TypeInfo) to obtain a DynamicRowConverter that can convert a whole dynamic row to a particular .NET type
  - The provided ReadContext describes where the row occurred in the data, as well as the Options used to originally parse the data
- GetCellsForDynamicRow(in WriteContext, Object) to obtain the DynamicCellValues to write from a dynamic row
  - The provided WriteContext describes the index of the row being written, as well as the Options used by the I(Async)Writer<TRow>

In addition to the raw ITypeDescriber interface, Cesil also provides three implementations of the interface out of the box. They are:

The DefaultTypeDescriber which is used by, well, default and implements “normal” .NET (de)serialization behaviors
- This class isn’t sealed, and has numerous virtual methods as extension points for when you want minor tweaks to “normal” behavior
The ManualTypeDescriber which, in conjunction with ManualTypeDescriberBuilder, lets you specify exactly which InstanceProviders, Setters, Parsers, etc. to use for specific types
The SurrogateTypeDescriber which, in conjunction with SurrogateTypeDescriberBuilder, lets you specify a type as a surrogate for some other, original, type. When any of ITypeDescriber’s members are invoked for the original type, a SurrogateTypeDescriber instead inspects its surrogate type and then maps the results back to the original type
- Surrogate types are useful when you want to apply attributes to a type you don’t control
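
The surrogate idea can be sketched in Python with dataclass field metadata standing in for .NET attributes. Everything here is a hypothetical illustration (the `SurrogateDescriber` class, its method names, and the `column` metadata key are assumptions, not Cesil's API): the describer inspects the surrogate's annotations, then builds getters that run against instances of the original type.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ThirdPartyPoint:
    """A type we pretend not to control, so it cannot be annotated directly."""
    x: int
    y: int


@dataclass
class PointSurrogate:
    """Same member names as the original, carrying the desired column names."""
    x: int = field(default=0, metadata={"column": "X Coord"})
    y: int = field(default=0, metadata={"column": "Y Coord"})


class SurrogateDescriber:
    def __init__(self) -> None:
        self._surrogates: Dict[type, type] = {}

    def add_surrogate(self, original: type, surrogate: type) -> None:
        self._surrogates[original] = surrogate

    def members_to_serialize(self, row_type: type) -> List[Tuple[str, Callable[[Any], Any]]]:
        # Inspect the surrogate (if one is registered) instead of the original...
        described = self._surrogates.get(row_type, row_type)
        members = []
        for f in fields(described):
            column = f.metadata.get("column", f.name)
            # ...but map each result back to a getter on the original type,
            # matching members by name.
            members.append((column, lambda row, name=f.name: getattr(row, name)))
        return members
```

Asking this describer about `ThirdPartyPoint` after registering `PointSurrogate` yields the columns `X Coord` and `Y Coord`, each with a getter that reads `x` or `y` from a real `ThirdPartyPoint` instance.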

And that wraps up my deep dive into Cesil’s flexibility. There’s even more detail in the wiki and in the reference documentation (linked throughout this post) for the involved types, but this post should at least give you a decent basic understanding.

Which brings us to the Open Questions of this post:

As before, I’ve opened three issues to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

My next post will tackle a smaller subject – I’ll be going over some of the new features that came with C# 8, and the how and why of Cesil’s adoption of them.

One Comment on “Overthinking CSV With Cesil: “Maximum” Flexibility”

iamauser says:

2020/06/29 at 09:09

Would be cool if you could also do a write up about how you started this project…

Did you wrote a spec first?

One of the hardest parts in my opinion is starting a new thing,…
How should I start?
What is essential to write the first code?
…

Thanks

Kevin Montrose