Overthinking CSV With Cesil: Reading Known Types

Posted: 2020/06/02 | Author: kevinmontrose | Filed under: code | Tags: cesil

The most common operation for a C# serialization library is usually reading into a known, static, type. That is, you’re given a stream or a blob of bytes and need to turn it into an instance of some type T. Cesil aims to make this common operation simple, fast, and customizable.

For cases where performance and customization are less important, CesilUtils exposes a bunch of EnumerateXXX methods. Both synchronous and asynchronous versions are available, and all of them return results lazily.

Maximum performance and flexibility come from using the IReader<TRow> or IAsyncReader<TRow> interfaces, obtained from an IBoundConfiguration<TRow> created via Configuration.For<TRow>. Unlike CesilUtils, these interfaces let you cache and reuse an IBoundConfiguration<TRow>, and they allow you to read comments and reuse rows.
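As a rough illustration of the two entry points, here is a minimal sketch. It assumes the Cesil NuGet package is referenced, that CesilUtils.EnumerateFromString<TRow>, Configuration.For<TRow> and IBoundConfiguration<TRow>.CreateReader have the shapes I remember from the project README, and that default Options are acceptable. Person and KnownTypeExample are hypothetical names made up for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Cesil;

// Hypothetical row type; the default type describer is assumed to pick up
// public properties with setters.
public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
}

public static class KnownTypeExample
{
    // Created once and reused: binding a configuration does reflection work
    // that is worth caching.
    private static readonly IBoundConfiguration<Person> Config =
        Configuration.For<Person>();

    // Convenience path: no configuration to manage, results are lazy until ToList().
    public static List<Person> ReadSimple(string csv)
    {
        return CesilUtils.EnumerateFromString<Person>(csv).ToList();
    }

    // Explicit path: reuse the cached configuration for every reader.
    public static List<Person> ReadWithCachedConfig(string csv)
    {
        using var text = new StringReader(csv);
        using IReader<Person> reader = Config.CreateReader(text);
        return reader.EnumerateAll().ToList();
    }
}
```

Both methods should produce the same rows; the second one just makes the cached IBoundConfiguration<Person> explicit.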

Concretely, I(Async)Reader<TRow> methods let you:

Lazily enumerate rows with EnumerateAll(Async)
- The async version returns an IAsyncEnumerable<T>, which is new to C# 8
Eagerly read rows with ReadAll(Async)
- You can also control the collection read into with specific overloads
Read a single row with TryRead(Async)
- You can reuse an already allocated row with TryReadWithReuse(Async)
Read a row or a comment with TryReadWithComment(Async)
- As above, you can reuse an already allocated row with TryReadWithCommentWithReuse(Async)
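To make the reuse variant concrete, here is a small sketch that sums a column while allocating at most one row. It assumes the Cesil package is referenced and that TryReadWithReuse takes the row by ref, as described above; Reading and ReuseExample are hypothetical names.

```csharp
using System.IO;
using Cesil;

// Hypothetical row type for the example.
public class Reading
{
    public string Sensor { get; set; }
    public int Value { get; set; }
}

public static class ReuseExample
{
    public static int SumValues(string csv)
    {
        var config = Configuration.For<Reading>();

        using var text = new StringReader(csv);
        using var reader = config.CreateReader(text);

        // Starts out null, so the first call has to obtain an instance;
        // later calls are handed the same instance back to overwrite.
        Reading row = null;
        var total = 0;

        while (reader.TryReadWithReuse(ref row))
        {
            total += row.Value;
        }

        return total;
    }
}
```

The trade-off is the usual one for object reuse: you must not hold on to a row across iterations, because the next read overwrites it.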

Determining which members on the given TRow type map to which columns, how those columns should be parsed, and how members should be set is done with the ITypeDescriber registered on the Options provided to Configuration.For<TRow> or the method on CesilUtils (by default, this is an instance of DefaultTypeDescriber). When an IBoundConfiguration<TRow> is created, ITypeDescriber.EnumerateMembersToDeserialize is invoked once, and the returned DeserializableMembers detail how Cesil will map rows of data to TRow instances.

Precisely, you can specify:

The name of the column a member maps to
- If a CSV lacks a header row, the order of the DeserializableMembers will be used to match columns instead
The Parser to use to turn a ReadOnlySpan<char> into a specific type
An (optional) Reset to call before setting a member
The Setter to use to place the type created by the Parser on a member of TRow
Whether or not a member is required
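The pieces in that list can be pictured with a toy model. This is not Cesil's DeserializableMember type or its API; MemberSketch, CellParser, MappingSketch and Point are hypothetical stand-ins that only show how a column name, a parser, an optional reset, a setter, and a required flag might cooperate when binding one row.

```csharp
using System;

public sealed class Point
{
    public int X { get; set; }
    public int Y { get; set; }
}

// Spans cannot be generic type arguments, so a custom delegate is used.
public delegate bool CellParser(ReadOnlySpan<char> text, out object value);

public sealed class MemberSketch<TRow>
{
    public string ColumnName;
    public CellParser Parser;
    public Action<TRow> Reset;           // optional, may be null
    public Action<TRow, object> Setter;
    public bool IsRequired;
}

public static class MappingSketch
{
    private static bool ParseInt(ReadOnlySpan<char> text, out object value)
    {
        var ok = int.TryParse(text, out var parsed);
        value = parsed;
        return ok;
    }

    private static readonly MemberSketch<Point>[] Members =
    {
        new MemberSketch<Point> { ColumnName = "X", Parser = ParseInt, Setter = (p, v) => p.X = (int)v, IsRequired = true },
        new MemberSketch<Point> { ColumnName = "Y", Parser = ParseInt, Reset = p => p.Y = -1, Setter = (p, v) => p.Y = (int)v },
    };

    public static Point Bind(string[] header, string[] cells)
    {
        var row = new Point();
        foreach (var member in Members)
        {
            var index = Array.IndexOf(header, member.ColumnName);
            if (index == -1)
            {
                // A missing column only matters for required members.
                if (member.IsRequired) throw new InvalidOperationException("Missing column " + member.ColumnName);
                continue;
            }
            if (!member.Parser(cells[index].AsSpan(), out var value))
            {
                throw new FormatException("Could not parse column " + member.ColumnName);
            }
            member.Reset?.Invoke(row);   // runs before the setter, if present
            member.Setter(row, value);
        }
        return row;
    }
}
```

For example, binding the header { "Y", "X" } against the cells { "2", "1" } yields a Point with X = 1 and Y = 2, because members are matched to columns by name rather than by position.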

A separate call to ITypeDescriber.GetInstanceProvider will be made to obtain an InstanceProvider which is used to get TRow instances needed when reading a row. While the call to get the InstanceProvider always happens, the InstanceProvider won’t be used if the XXXWithReuse methods are called with a non-null TRow reference. InstanceProviders allow you to implement sophisticated row re-use or initialization logic that a simple “ref TRow” isn’t adequate for.

There’s a great deal of flexibility in how InstanceProviders, Parsers, Resets, and Setters can be created which will be covered in a later post.

Internally, Cesil models reading a CSV as transitions through a state machine. Each character read is mapped to a CharacterType (one of EscapeStartAndEnd, Escape, ValueSeparator, CarriageReturn, LineFeed, CommentStart, Whitespace, Other, and DataEnd), which is then used in conjunction with the current State to look up a TransitionRule. TransitionRules specify the new State as well as an AdvanceResult, which instructs Cesil to take certain actions (like skipping the character, appending a character to the read buffer, finishing a column or row, etc.). Only the mapping from char to CharacterType is dependent on the configured Options; Cesil pre-allocates and reuses the TransitionRules that back the state machine.
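A heavily simplified sketch of the same idea follows. It is not Cesil's implementation: the states, the reduced set of character types, and the rule table are all invented for illustration, and it handles only one quote character (doubled to escape itself), one separator, and line feeds. What it does share with the description above is the shape: only Classify depends on the configured characters, while the rule table is built once and reused.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StateMachineSketch
{
    private enum CharType { EscapeStartAndEnd, ValueSeparator, LineFeed, Other, DataEnd }
    private enum State { Unescaped, Escaped, EscapedQuote, Done }
    private enum Advance { Append, Skip, FinishValue, FinishRecord }

    private readonly struct Rule
    {
        public readonly State Next;
        public readonly Advance Result;
        public Rule(State next, Advance result) { Next = next; Result = result; }
    }

    // Built once; independent of which separator or quote is configured.
    private static readonly Rule[,] Rules = BuildRules();

    private static Rule[,] BuildRules()
    {
        var rules = new Rule[3, 5];
        void Set(State from, CharType on, State to, Advance result) =>
            rules[(int)from, (int)on] = new Rule(to, result);

        Set(State.Unescaped, CharType.EscapeStartAndEnd, State.Escaped, Advance.Skip);
        Set(State.Unescaped, CharType.ValueSeparator, State.Unescaped, Advance.FinishValue);
        Set(State.Unescaped, CharType.LineFeed, State.Done, Advance.FinishRecord);
        Set(State.Unescaped, CharType.Other, State.Unescaped, Advance.Append);
        Set(State.Unescaped, CharType.DataEnd, State.Done, Advance.FinishRecord);

        Set(State.Escaped, CharType.EscapeStartAndEnd, State.EscapedQuote, Advance.Skip);
        Set(State.Escaped, CharType.ValueSeparator, State.Escaped, Advance.Append);
        Set(State.Escaped, CharType.LineFeed, State.Escaped, Advance.Append);
        Set(State.Escaped, CharType.Other, State.Escaped, Advance.Append);
        Set(State.Escaped, CharType.DataEnd, State.Done, Advance.FinishRecord);

        // A quote seen inside an escaped value: either a doubled quote or the end.
        Set(State.EscapedQuote, CharType.EscapeStartAndEnd, State.Escaped, Advance.Append);
        Set(State.EscapedQuote, CharType.ValueSeparator, State.Unescaped, Advance.FinishValue);
        Set(State.EscapedQuote, CharType.LineFeed, State.Done, Advance.FinishRecord);
        Set(State.EscapedQuote, CharType.Other, State.Unescaped, Advance.Append);
        Set(State.EscapedQuote, CharType.DataEnd, State.Done, Advance.FinishRecord);
        return rules;
    }

    // The only part that depends on the configured characters.
    private static CharType Classify(char c, char separator, char quote)
    {
        if (c == quote) return CharType.EscapeStartAndEnd;
        if (c == separator) return CharType.ValueSeparator;
        if (c == '\n') return CharType.LineFeed;
        return CharType.Other;
    }

    // Reads the first record of data and returns its values.
    public static List<string> ReadRow(string data, char separator = ',', char quote = '"')
    {
        var values = new List<string>();
        var pending = new StringBuilder();
        var state = State.Unescaped;

        for (var i = 0; state != State.Done; i++)
        {
            var atEnd = i >= data.Length;
            var c = atEnd ? '\0' : data[i];
            var type = atEnd ? CharType.DataEnd : Classify(c, separator, quote);
            var rule = Rules[(int)state, (int)type];

            switch (rule.Result)
            {
                case Advance.Append: pending.Append(c); break;
                case Advance.Skip: break;
                case Advance.FinishValue:
                case Advance.FinishRecord:
                    values.Add(pending.ToString());
                    pending.Clear();
                    break;
            }
            state = rule.Next;
        }
        return values;
    }
}
```

Swapping the separator for a tab or the quote for an apostrophe changes nothing in the table, which is the property that lets the rules be shared across configurations.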

Although Cesil’s state machine progresses one character at a time, Cesil reads multiple characters at a time in order to maximize performance and better match modern C# interfaces like PipeReader. Control over the read buffer’s size is provided through ReadBufferSizeHint. Cesil also batches certain common AdvanceResults, like skipping or appending characters, so that the overhead of certain method calls is minimized in hot paths.
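The batching idea can be sketched without any of Cesil's types. In this toy version (BatchingSketch is a hypothetical name, and the logic is my own simplification), consecutive "append" decisions are recorded as a run within the buffer and copied in a single call when something non-batchable, here a separator or the end of the buffer, shows up.

```csharp
using System.Collections.Generic;
using System.Text;

public static class BatchingSketch
{
    // Splits buffer on separator, returning the values and how many
    // Append calls were needed to build them.
    public static (List<string> Values, int AppendCalls) Split(string buffer, char separator)
    {
        var values = new List<string>();
        var pending = new StringBuilder();
        var appendCalls = 0;
        var runStart = 0;
        var runLength = 0;

        for (var i = 0; i < buffer.Length; i++)
        {
            if (buffer[i] != separator)
            {
                // Batchable: just note that this character belongs to the run.
                if (runLength == 0) runStart = i;
                runLength++;
                continue;
            }

            // Not batchable: flush the pending run, then finish the value.
            if (runLength > 0)
            {
                pending.Append(buffer, runStart, runLength);
                appendCalls++;
                runLength = 0;
            }
            values.Add(pending.ToString());
            pending.Clear();
        }

        // End of data behaves like one final non-batchable action.
        if (runLength > 0)
        {
            pending.Append(buffer, runStart, runLength);
            appendCalls++;
        }
        values.Add(pending.ToString());

        return (values, appendCalls);
    }
}
```

Splitting "abc,de,f" this way takes three Append calls instead of six, one per value rather than one per character.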

Taken all together, and at a very high level, when Cesil reads a single row this is what happens:

1. Characters are read into the read buffer, if it is empty
   1. If there are no more characters to read into the buffer, proceed as if we have read a single DataEnd CharacterType
2. If no instance of TRow has been provided, Cesil obtains one using the InstanceProvider
3. For each character in the read buffer…
   1. The character is mapped to a CharacterType
   2. The current State and CharacterType are used to find the next State and an AdvanceResult
      1. If the AdvanceResult is batchable, note is made of it but no action is taken
      2. If the AdvanceResult is not batchable, any pending batched actions are taken and then the new action is taken
         1. If the AdvanceResult finishes a value, the current pending value is Parsed, the Reset for the current column is called (if it exists), and the Setter is called
         2. If the AdvanceResult finishes a record, we return the row and are finished
   3. Remove the read character from the buffer
4. If we haven’t returned a row, go back to step 1

There are a few consequences of this design:

There can be pending data in the read buffer when a row is returned, which means that you cannot use Cesil to read “up to a particular row” in the underlying data stream. Once Cesil starts reading, no guarantees are made about the state of the underlying stream.
For maximum performance it’s worth reusing IBoundConfigurations, as a decent amount of reflection and lookup creation happens when one is created. All I(Async)Readers that one creates will reuse that work, making a cache very efficient.
In asynchronous cases, Cesil will await only when the read buffer is empty and cannot be filled without blocking. This means that Cesil can “go async” much less frequently than might naively be expected, were it to be reading characters one at a time.
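The first consequence is easy to reproduce with nothing but the BCL. The sketch below is not Cesil code; BufferedReadSketch is a hypothetical stand-in that reads in fixed-size chunks, returns the first row, and then reports what is still left in the underlying reader.

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedReadSketch
{
    public static (string FirstRow, string LeftInReader) ReadFirstRow(string data, int bufferSize)
    {
        using var reader = new StringReader(data);
        var buffer = new char[bufferSize];
        var row = new StringBuilder();

        while (true)
        {
            // Fill the buffer with as much as is available, not just one row.
            var read = reader.Read(buffer, 0, buffer.Length);
            if (read == 0) break;

            var newline = Array.IndexOf(buffer, '\n', 0, read);
            if (newline >= 0)
            {
                // Anything after the newline stays in our buffer and is
                // already gone from the underlying reader.
                row.Append(buffer, 0, newline);
                break;
            }
            row.Append(buffer, 0, read);
        }

        return (row.ToString(), reader.ReadToEnd());
    }
}
```

With the input "a,b\nc,d\ne,f\n" and an 8 character buffer, the first row comes back as "a,b" but only "e,f\n" is left in the reader: the second row was pulled into the buffer along with the first.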

Finally, Cesil does offer support for reading whole-line CSV comments. Although non-standard and rather rare, they arise often enough to be worth supporting. The reader interfaces expose TryReadWithComment(WithReuse)(Async) methods that return a ReadWithCommentResult, a tagged union type that wraps the comment or row read. In order to read comments, Options.CommentCharacter must have been set when the IBoundConfiguration<TRow> was created – calling any of the XXXWithComment methods when it has not been set will raise an exception. If a comment is encountered when a non-XXXWithComment method is invoked, but Options was configured with comment support, the comment will be silently skipped.
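A sketch of what reading comments might look like is below. It assumes the Cesil package is referenced and leans on API shapes I believe exist but have not verified against this exact release: Options.CreateBuilder(Options.Default), WithCommentCharacter, ToOptions, and the HasComment, HasValue, Comment and Value members on ReadWithCommentResult<TRow>. Entry and CommentExample are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using Cesil;

// Hypothetical row type for the example.
public class Entry
{
    public string Key { get; set; }
    public int Value { get; set; }
}

public static class CommentExample
{
    public static (List<Entry> Rows, List<string> Comments) Read(string csv)
    {
        // Comment support has to be configured before binding.
        var options = Options.CreateBuilder(Options.Default)
            .WithCommentCharacter('#')
            .ToOptions();
        var config = Configuration.For<Entry>(options);

        var rows = new List<Entry>();
        var comments = new List<string>();

        using var text = new StringReader(csv);
        using var reader = config.CreateReader(text);

        while (true)
        {
            var result = reader.TryReadWithComment();
            if (result.HasComment)
            {
                comments.Add(result.Comment);
            }
            else if (result.HasValue)
            {
                rows.Add(result.Value);
            }
            else
            {
                break;   // neither a row nor a comment: end of data
            }
        }

        return (rows, comments);
    }
}
```

Reading the same data with plain TryRead on this configuration would return only the rows, per the skipping behavior described above.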

That wraps up what static deserialization looks like in Cesil.

The Open Question for this post is the same as the previous post, but with a particular focus on reading: Is there anything missing from I(Async)Reader that you’d expect to be supported in a modern .NET CSV library?

This question has already led to some planned changes, namely removing the class constraint on I(Async)Reader’s TCollection generic parameter, and adding comment writing methods that take ReadOnlySpan<char> and ReadOnlyMemory<char> parameters.

Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Next time I’ll be discussing reading dynamic types, and why I think that’s still worth supporting in 2020…

Kevin Montrose
