Overthinking CSV With Cesil: A “Modern” Interface

Posted: 2020/05/29 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

Part of Cesil’s raison d’être is to be a “modern” library for CSV, one which takes advantage of all the fancy new additions in recent C# and .NET Core versions. What exactly is “modern” is debatable, so this post lays out my particular take.

To make things concrete, the “main” interfaces for Cesil are split into:

Configuration – with the Options (and its Builder) and Configuration classes
- The static Configuration class produces instances of the IBoundConfiguration<TRow> interface
Reading – with the IReader<TRow> and IAsyncReader<TRow> interfaces
- Each interface has a way to read single rows, lazily enumerate all rows, greedily read all rows, and read all rows into the provided collection
Writing – with the IWriter<TRow> and IAsyncWriter<TRow> interfaces
- Each interface has a way to write a single row, write several rows lazily, and write several rows greedily.
Utilities – with numerous methods on the CesilUtils static class
- These methods provide single call ways to read and write collections of rows at the expense of some efficiency
Type Describing – with many types describing things like “creating rows” and “getting members”
- These will be covered in detail in a later post

The first thing you’ll notice when using Cesil is that it splits setup into two logical steps: building Options and binding Configurations. Options cover all the generally reusable parts of working with CSVs (things like separators and memory pools), while Configurations represent a binding of Options to a particular Type. Binding a type implies a fair amount of work, in particular a decent amount of reflection to determine columns. By separating Options and Configurations, Cesil allows easy and efficient reuse of the “cheap” parts of a setup while giving control over when the expensive parts happen.
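The split between cheap options and expensive bindings can be sketched roughly as follows. The types here (SketchOptions, SketchBoundConfiguration<TRow>, Person, City) are hypothetical stand-ins invented for illustration, not Cesil's real classes; the only point is that reflection runs once per bound type while the options object is shared.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-ins, NOT Cesil's real types: they only mirror the split
// between cheap, reusable options and an expensive per-type binding.
public sealed class SketchOptions
{
    public char ValueSeparator { get; }
    public SketchOptions(char valueSeparator) => ValueSeparator = valueSeparator;
}

public sealed class SketchBoundConfiguration<TRow>
{
    public SketchOptions Options { get; }
    public string[] Columns { get; }

    public SketchBoundConfiguration(SketchOptions options)
    {
        Options = options;
        // The "expensive" step: reflect over TRow once, at binding time.
        Columns = typeof(TRow)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Select(p => p.Name)
            .ToArray();
    }

    // Joins the discovered column names with the configured separator.
    public string HeaderLine() => string.Join(Options.ValueSeparator.ToString(), Columns);
}

public sealed class Person { public string Name { get; set; } public int Age { get; set; } }
public sealed class City { public string Title { get; set; } }

public static class Program
{
    public static void Main()
    {
        var tabs = new SketchOptions('\t');                       // cheap, built once
        var people = new SketchBoundConfiguration<Person>(tabs);  // reflection happens here
        var cities = new SketchBoundConfiguration<City>(tabs);    // same options, new binding

        Console.WriteLine(people.HeaderLine());
        Console.WriteLine(cities.HeaderLine());
    }
}
```

Holding on to a bound configuration and reusing it for every file of the same row type is what keeps the reflection cost a one-time expense.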

You’ll also quickly notice that Cesil tends to hand you interfaces instead of base classes. This is a consequence of my belief that encouraging inheritance in end user code is generally a mistake, combined with a desire to keep implementation details hidden. Thus Cesil exposes IReader<TRow> rather than SyncReaderBase<TRow>, and nearly every exported class is sealed.

Cesil splits reading and writing into separate interfaces in a manner similar to the recent System.IO.Pipelines namespace. Coupling reading and writing would mean that certain operations would be allowed by the type system even if they couldn’t possibly work at runtime – say, writing to something that was backed by a ReadOnlySequence<T>. The BCL has some examples of this failure, like Stream, whose Remarks call out that “Depending on the underlying data source or repository, streams might support only some of these capabilities”. Effectively this means that there are methods on all Streams that cannot be safely called in all cases, and that is a poor design choice to make in 2020.
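The Stream problem is easy to reproduce with nothing but the BCL. In this small sketch (the StreamDemo class and TryWriteByte helper are names made up for the example), a read-only MemoryStream still exposes write methods at compile time and only rejects them at runtime.

```csharp
using System;
using System.IO;

public static class StreamDemo
{
    // Returns true if the write succeeded, false if the stream rejected it at runtime.
    public static bool TryWriteByte(Stream stream)
    {
        try
        {
            stream.WriteByte(42);
            return true;
        }
        catch (NotSupportedException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // writable: false produces a MemoryStream that can only be read.
        Stream readOnly = new MemoryStream(new byte[] { 1, 2, 3 }, writable: false);

        Console.WriteLine(readOnly.CanWrite);       // False
        Console.WriteLine(TryWriteByte(readOnly));  // False: NotSupportedException was thrown
    }
}
```

With separate read and write interfaces there is simply no write method to call on something that can only be read, so this mistake becomes a compile error rather than an exception.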

Asynchronous and synchronous operations also get separate interfaces rather than one shared one. While not as footgun-y as mixing reading and writing, mixing synchronous and asynchronous operation is fraught with potential for error – either in correctness (such as starting synchronous operations while asynchronous ones are pending completion) or performance (such as sync-over-async). The potential for error increased with the introduction of IAsyncDisposable and await using, because the synchronous nature of IDisposable and using can now be hidden in otherwise asynchronous code. Accordingly, all methods on IAsyncReader<TRow> and IAsyncWriter<TRow> are asynchronous and all methods on IReader<TRow> and IWriter<TRow> are synchronous – the former two implement IAsyncDisposable and the latter two implement IDisposable.
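To see how a synchronous dispose can hide in asynchronous code, consider this sketch. DualDisposable is a hypothetical type that implements both disposal interfaces (exactly what Cesil's readers and writers avoid doing) and logs which one ran.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical resource that supports both disposal styles and records which one ran.
public sealed class DualDisposable : IDisposable, IAsyncDisposable
{
    private readonly List<string> _log;
    public DualDisposable(List<string> log) => _log = log;

    public void Dispose() => _log.Add("Dispose");

    public ValueTask DisposeAsync()
    {
        _log.Add("DisposeAsync");
        return default; // an already-completed ValueTask
    }
}

public static class DisposalDemo
{
    public static async Task<List<string>> RunAsync()
    {
        var log = new List<string>();

        // Easy to miss in an async method: plain `using` calls the synchronous Dispose().
        using (var a = new DualDisposable(log))
        {
            await Task.Yield();
        }

        // `await using` calls DisposeAsync() instead.
        await using (var b = new DualDisposable(log))
        {
            await Task.Yield();
        }

        return log;
    }

    public static async Task Main()
    {
        Console.WriteLine(string.Join(", ", await RunAsync())); // Dispose, DisposeAsync
    }
}
```

When a type implements only one of the two interfaces, choosing the wrong keyword fails to compile instead of silently picking the wrong disposal path.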

Other, less immediately obvious, choices made in Cesil:

Most types are immutable, and all immutable types implement IEquatable<T>
- Mutability is a footgun in the highly concurrent code that is increasingly common, and so is avoided everywhere possible
Relatively few primitives are in the interface; enums (like EmitDefaultValue) and semantic wrappers (like ColumnIdentifier) are preferred
- Primitive types are easy to accidentally misuse and harder to read (i.e. what does “true” mean when passed to method “Foo”)
Comments in CSVs are read and written with specific methods (TryReadWithComment(Async) and WriteComment(Async)); by default they are ignored when read (even if supported by a set of Options)
- Comments are relatively rare, so the basic operations shouldn’t be encumbered by having to deal with them
- They must be different methods because the implicit type of all comments is `string` not TRow
Recently introduced types like ReadOnlySequence<T>, IBufferWriter<T>, PipeReader, and PipeWriter have first class support
- Older types like TextReader and TextWriter are also supported, since these are still supported in the BCL and lots of code continues to use them
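The preference for enums and semantic wrappers over primitives can be illustrated with a small sketch. The names below (EmitDefault, ColumnIndex, Formatter) are hypothetical and deliberately differ from Cesil's own EmitDefaultValue and ColumnIdentifier, whose actual declarations are not reproduced here.

```csharp
using System;

// Hypothetical names for illustration; not Cesil's actual declarations.
public enum EmitDefault { Yes, No }

// A semantic wrapper: an int that can only be used where a column index is expected.
// It is immutable and implements IEquatable<T>, matching the first bullet above.
public readonly struct ColumnIndex : IEquatable<ColumnIndex>
{
    public int Value { get; }

    public ColumnIndex(int value)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        Value = value;
    }

    public bool Equals(ColumnIndex other) => Value == other.Value;
    public override bool Equals(object obj) => obj is ColumnIndex other && Equals(other);
    public override int GetHashCode() => Value;
}

public static class Formatter
{
    // Compare with a hypothetical Format(int value, int column, bool flag):
    // a call like Format(0, 2, true) says nothing about what each argument means.
    public static string Format(int value, ColumnIndex column, EmitDefault emitDefault)
    {
        if (value == 0 && emitDefault == EmitDefault.No) return "";
        return $"col{column.Value}={value}";
    }

    public static void Main()
    {
        Console.WriteLine(Format(0, new ColumnIndex(2), EmitDefault.Yes)); // col2=0
        Console.WriteLine(Format(7, new ColumnIndex(2), EmitDefault.No));  // col2=7
    }
}
```

The call sites read on their own, and swapping the value and column arguments no longer compiles.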

Having spelled out Cesil’s read and write interfaces, we arrive at the second Open Question: Is there anything missing from IReader<TRow>, IAsyncReader<TRow>, IWriter<TRow>, and IAsyncWriter<TRow> that you’d expect to be supported in a modern .NET CSV library?

As before, I’ve opened an Issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Now that we’ve covered, at a very high level, the overall interface for Cesil, the next post will dig into how reading static types works in detail…

Overthinking CSV With Cesil: CSV Isn’t A Thing

Posted: 2020/05/28 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

For those who read my previous post, when you read “CSV library” you likely had one of two thoughts depending on how much exposure you’ve had to CSV files – either:

1. Dealing with CSVs is so simple, how much could there be to write about?
2. Dealing with CSVs is insanely complicated, why would you ever do that?

My day job is running a data team, so I’m firmly in camp #2: lots of things run on CSV and it’s crazy complicated. Fundamentally this is because CSV isn’t a format, it’s a family of related formats. If you work with arbitrary CSV files long enough, you’ll eventually encounter one that doesn’t even use commas for separators.

Like most weird things, this is a consequence of history. The idea of CSV dates back at least 40 years, while the RFC “standardizing” it (RFC 4180) is from 2005. That’s a lot of time for different versions to flourish.

To get more concrete, CSV is a subset of the Delimiter Separated Values (DSV) family of tabular data formats – one which often (but not always!) uses commas to separate values. The most common variant is almost certainly that produced by Microsoft Excel (on Windows, in an English locale) – it uses commas to separate values, double quotes to start escaped values, double quotes to escape within escaped values, and the carriage-return line-feed character sequence to end a row.
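As a concrete illustration of that most common variant, here is a minimal sketch of a row encoder following the conventions just described. It is not how Cesil writes rows; ExcelStyleCsv and EncodeRow are names made up for this example, and the rule for when a value needs escaping is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ExcelStyleCsv
{
    // Encodes one row: comma separated, double-quote escaped, CRLF terminated.
    public static string EncodeRow(IEnumerable<string> values)
    {
        var sb = new StringBuilder();
        var first = true;
        foreach (var value in values)
        {
            if (!first) sb.Append(',');
            first = false;

            // Assumed rule: escape if the value contains a separator, a quote, or a line break.
            var needsEscape = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
            if (needsEscape)
            {
                // Wrap in double quotes and double any embedded double quotes.
                sb.Append('"').Append(value.Replace("\"", "\"\"")).Append('"');
            }
            else
            {
                sb.Append(value);
            }
        }
        return sb.Append("\r\n").ToString();
    }

    public static void Main()
    {
        // Prints: plain,"has, comma","say ""hi"""
        Console.Write(EncodeRow(new[] { "plain", "has, comma", "say \"hi\"" }));
    }
}
```

Every one of those four choices (separator, escape start and stop, escape within an escaped value, row ending) is something other variants do differently.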

Cesil aims to support all “reasonable” DSV formats, with defaults for the most common kind of CSV. A later post will go into exactly how flexible Cesil can be, but from a format perspective Cesil can handle:

Any single character value separator
Either no way to escape a value, or a single character starting and stopping escaped values
Either no way to escape a character within an escaped value, or a single character escape
Any of the \r, \n, or \r\n character sequences for ending a row
No comments, or “whole row” comments
Optional leading or trailing whitespace around values
Requiring a header row, forbidding a header row, or making a header row optional

This flexibility makes it possible to handle relatively standard things like Tab Separated Value (TSV) files or CSV files which use an unusual character for escaping, as well as kind of crazy things like CSVs that use semicolons to separate values or whose values have been visually aligned with whitespace. All of this functionality, and much more, is configured with Cesil’s Options and OptionsBuilder classes.

And now we encounter Cesil’s first Open Question: Do these options provide adequate flexibility?

I’ve opened an Issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Now that I’ve covered the formats Cesil can handle, in the next post I will cover the whats and whys of the interface it exposes…

Overthinking CSV With Cesil: An Introduction

Posted: 2020/05/28 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

~~Several months ago~~ About a year ago (how time flies) I decided to spin up a new personal project to get familiar with all the new goodies in C# 8 and .NET Core 3. I happened to be dealing with some frustrating CSV issues at the time, so the project was a CSV library.

Once I got into the meat of the project, I started really overthinking things. The end result was Cesil – a pre-release package is available on NuGet, source is on GitHub, and it’s got reference documentation and a prose wiki. It’s released under the MIT license.

When I say I was overthinking things, I mean that rather than build a toy just for my own edification I ended up trying to do The Right Thing™ for a .NET library released in 2020. This blog series, which will run to at least 14 parts, will cover exactly what that entailed, but in short I committed to:

Async as a first class citizen
Maximum consumer flexibility
Extensive documentation
Comprehensive test coverage
Adopting C# 8 features
Modern patterns and conventions
Efficiency, especially in terms of allocations

Interpretations of each of those points can be a matter of opinion, and I’m not going to claim to have 100% correct opinions. I attempted to record both things I consider opinions and open questions, both of which I’ll expound upon as this series continues.

My hope is that Cesil is easy to use, hard to misuse, handles the common cases out of the box, and can be configured to handle almost anything you might want to do with CSV. I intend to respond to feedback and make changes as needed over the course of this series to make it more likely those hopes are realized.

A final bit of overthinking on the whole project has been around sustainable open source. There’s been a fair amount of discussion on the subject, the gist of which is that loads of people and companies benefit from volunteers doing skilled work without compensation – and that is an unsustainable practice. As a small experiment in line with these thoughts, I’ve set up GitHub Sponsors for Cesil with a few low commitment tiers. I’ll be using the tiers to prioritize responding to some feedback, and I’ll report on the results of this experiment towards the end of the blog series.

Now with the introduction out of the way, I’m ready to dive into technical bits in the next post…

Kevin Montrose
