Overthinking CSV With Cesil: A “Modern” Interface

Posted: 2020/05/29 | Author: kevinmontrose | Filed under: code | Tags: cesil

Part of Cesil’s raison d’être is to be a “modern” library for CSV, one which takes advantage of all the fancy new additions in recent C# and .NET Core versions. What exactly is “modern” is debatable, so this post lays out my particular take.

To make things concrete, the “main” interfaces for Cesil are split into:

- Configuration: the Options class (and its Builder) and the Configuration class
  - The static Configuration class produces instances of the IBoundConfiguration<TRow> interface
- Reading: the IReader<TRow> and IAsyncReader<TRow> interfaces
  - Each interface has a way to read single rows, lazily enumerate all rows, greedily read all rows, and read all rows into a provided collection
- Writing: the IWriter<TRow> and IAsyncWriter<TRow> interfaces
  - Each interface has a way to write a single row, write several rows lazily, and write several rows greedily
- Utilities: numerous methods on the CesilUtils static class
  - These methods provide single-call ways to read and write collections of rows, at the expense of some efficiency
- Type Describing: many types describing things like “creating rows” and “getting members”
  - These will be covered in detail in a later post

The first thing you’ll notice when using Cesil is that it splits setup into two logical steps: building Options and binding Configurations. Options cover all the generally reusable parts of working with CSVs (things like separators and memory pools), while a Configuration represents the binding of Options to a particular type. Binding a type involves a fair amount of work, in particular a decent amount of reflection to determine columns. By separating Options and Configurations, Cesil allows easy and efficient reuse of the “cheap” parts of a setup while giving control over when the expensive parts happen.
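The two-step idea can be sketched in a few lines of C#. This is only an illustration of the pattern, not Cesil's implementation: CsvOptions, BoundConfig<TRow>, Bind, FormatRow, and Row are hypothetical names invented for the sketch.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for the cheap, reusable settings (not Cesil's Options).
public sealed class CsvOptions
{
    public char Separator { get; }
    public CsvOptions(char separator) => Separator = separator;
}

// Hypothetical stand-in for a binding of options to one row type.
public sealed class BoundConfig<TRow>
{
    public CsvOptions Options { get; }
    public PropertyInfo[] Columns { get; }

    private BoundConfig(CsvOptions options, PropertyInfo[] columns)
    {
        Options = options;
        Columns = columns;
    }

    // The expensive step: reflection runs here, once per binding,
    // instead of once per row.
    public static BoundConfig<TRow> Bind(CsvOptions options)
    {
        var columns = typeof(TRow)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => p.CanRead)
            .ToArray();
        return new BoundConfig<TRow>(options, columns);
    }

    // The per-row step only walks the columns discovered during Bind.
    public string FormatRow(TRow row) =>
        string.Join(Options.Separator.ToString(), Columns.Select(c => c.GetValue(row)));
}

// Hypothetical row type used only for the demonstration.
public sealed class Row
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public static class Demo
{
    public static void Main()
    {
        var tabSeparated = new CsvOptions('\t');           // cheap, shareable across types
        var bound = BoundConfig<Row>.Bind(tabSeparated);   // expensive, done once and reused
        Console.WriteLine(bound.FormatRow(new Row { Id = 1, Name = "a" }));
    }
}
```

The same CsvOptions instance could be bound to any number of row types, and each binding pays the reflection cost exactly once, at a moment the caller chooses.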

You’ll also quickly notice that Cesil tends to hand you interfaces instead of base classes. This is a consequence of my belief that encouraging inheritance in end user code is generally a mistake, combined with a desire to keep implementation details hidden. Thus Cesil exposes IReader<TRow> rather than SyncReaderBase<TRow>, and nearly every exported class is sealed.

Cesil splits reading and writing into separate interfaces in a manner similar to the recent System.IO.Pipelines namespace. Coupling reading and writing would mean that certain operations would be allowed by the type system even if they couldn’t possibly work at runtime – say, writing to something that was backed by a ReadOnlySequence<T>. The BCL has some examples of this failure, like Stream, whose Remarks call out that “Depending on the underlying data source or repository, streams might support only some of these capabilities”. Effectively this means that there are methods on all Streams that cannot be safely called in all cases, and that is a poor design choice to make in 2020.
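The Stream problem is easy to reproduce with nothing but the BCL. The sketch below wraps a byte array in a non-writable MemoryStream and then calls a write method that the compiler happily accepts. The IRowSource<TRow> and IRowSink<TRow> interfaces at the end are hypothetical names, included only to show the shape of the alternative; they are not Cesil's interfaces.

```csharp
using System;
using System.IO;

public static class StreamCapabilityDemo
{
    public static void Main()
    {
        // A MemoryStream over an existing array, explicitly marked non-writable.
        Stream readOnly = new MemoryStream(new byte[] { 1, 2, 3 }, writable: false);

        // The static type still offers Write; only a runtime flag tells the truth.
        Console.WriteLine(readOnly.CanWrite); // prints "False"

        try
        {
            readOnly.WriteByte(42); // compiles without complaint
        }
        catch (NotSupportedException)
        {
            Console.WriteLine("write failed at runtime");
        }
    }
}

// Hypothetical split interfaces: something that can only be read from never
// exposes a Write method, so the mistake above cannot be written at all.
public interface IRowSource<TRow>
{
    bool TryRead(out TRow row);
}

public interface IRowSink<TRow>
{
    void Write(TRow row);
}
```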

Asynchronous and synchronous operations also get separate interfaces rather than one shared one. While not as footgun-y as mixing reading and writing, mixing synchronous and asynchronous operations is fraught with potential for error, either in correctness (such as starting synchronous operations while asynchronous ones are pending completion) or in performance (such as sync-over-async). The potential for error grew with the introduction of IAsyncDisposable and await using, because the synchronous nature of IDisposable and using can now be hidden in otherwise asynchronous code. Accordingly, all methods on IAsyncReader<TRow> and IAsyncWriter<TRow> are asynchronous and all methods on IReader<TRow> and IWriter<TRow> are synchronous; the former two implement IAsyncDisposable and the latter two implement IDisposable.
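To see how a synchronous dispose can hide in asynchronous code, consider a type that implements both disposal interfaces. This is a minimal sketch with a hypothetical DualResource class (not a Cesil type): the two methods differ only in the missing await keyword, yet they take different disposal paths.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical resource that supports both disposal patterns and records
// which one was actually used.
public sealed class DualResource : IDisposable, IAsyncDisposable
{
    public string DisposedVia { get; private set; } = "none";

    public void Dispose() => DisposedVia = "Dispose";

    public ValueTask DisposeAsync()
    {
        DisposedVia = "DisposeAsync";
        return default;
    }
}

public static class DisposalDemo
{
    // Looks asynchronous, but plain `using` calls the synchronous Dispose().
    public static async Task<string> WithUsingAsync()
    {
        var resource = new DualResource();
        using (resource)
        {
            await Task.Yield();
        }
        return resource.DisposedVia;
    }

    // `await using` calls DisposeAsync() and awaits the result.
    public static async Task<string> WithAwaitUsingAsync()
    {
        var resource = new DualResource();
        await using (resource)
        {
            await Task.Yield();
        }
        return resource.DisposedVia;
    }

    public static async Task Main()
    {
        Console.WriteLine(await WithUsingAsync());      // prints "Dispose"
        Console.WriteLine(await WithAwaitUsingAsync()); // prints "DisposeAsync"
    }
}
```

A type that implements only one of the two interfaces rules out the mismatch at compile time, which is the property the separate reader and writer interfaces rely on.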

Other, less immediately obvious, choices made in Cesil:

- Most types are immutable, and all immutable types implement IEquatable<T>
  - Mutability is a footgun in the highly concurrent code that is increasingly common, so it is avoided wherever possible
- Relatively few primitives appear in the interface; enums (like EmitDefaultValue) and semantic wrappers (like ColumnIdentifier) are preferred
  - Primitive types are easy to accidentally misuse and harder to read (i.e. what does “true” mean when passed to method “Foo”?)
- Comments in CSVs are read and written with specific methods (TryReadWithComment(Async) and WriteComment(Async)); by default they are ignored when read (even if supported by a set of Options)
  - Comments are relatively rare, so the basic operations shouldn’t be encumbered by having to deal with them
  - They must be different methods because the implicit type of all comments is `string`, not TRow
- Recently introduced types like ReadOnlySequence<T>, IBufferWriter<T>, PipeReader, and PipeWriter have first-class support
  - Older types like TextReader and TextWriter are also supported, since these are still supported in the BCL and lots of code continues to use them
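The first two points can be illustrated with a short sketch. The names here are assumptions made for the example: the post names the EmitDefaultValue enum but does not list its members, so Yes and No are guesses, and ColumnIndex and Formatter are hypothetical types rather than anything from Cesil.

```csharp
using System;

// Assumed members; the post mentions this enum without listing its values.
public enum EmitDefaultValue { Yes, No }

// Hypothetical semantic wrapper around a raw column index. It is immutable
// and implements IEquatable<T>, matching the first point in the list above.
public readonly struct ColumnIndex : IEquatable<ColumnIndex>
{
    public int Value { get; }

    public ColumnIndex(int value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        Value = value;
    }

    public bool Equals(ColumnIndex other) => Value == other.Value;
    public override bool Equals(object? obj) => obj is ColumnIndex other && Equals(other);
    public override int GetHashCode() => Value;
}

public static class Formatter
{
    // Primitive-heavy signature: a call like Describe(3, true) says nothing
    // about what 3 or true mean.
    public static string Describe(int column, bool emitDefault) =>
        $"column {column}, emit default: {emitDefault}";

    // Enum and wrapper: the call site documents itself, and an invalid
    // index is rejected when the wrapper is constructed.
    public static string Describe(ColumnIndex column, EmitDefaultValue emit) =>
        $"column {column.Value}, emit default: {emit}";
}

public static class PrimitivesDemo
{
    public static void Main()
    {
        Console.WriteLine(Formatter.Describe(3, true));
        Console.WriteLine(Formatter.Describe(new ColumnIndex(3), EmitDefaultValue.No));
    }
}
```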

Spelling out Cesil’s read and write interfaces leads to the second Open Question: Is there anything missing from IReader(Async) and IWriter(Async) that you’d expect to be supported in a modern .NET CSV library?

As before, I’ve opened an Issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Now that we’ve covered, at a very high level, the overall interface for Cesil, the next post will dig into how reading static types works in detail…

Kevin Montrose
