Overthinking CSV With Cesil: CSV Isn’t A Thing
Posted: 2020/05/28
For those who read my previous post, when you read “CSV library” you likely had one of two thoughts depending on how much exposure you’ve had to CSV files – either:
- Dealing with CSVs is so simple, how much could there be to write about?
- Dealing with CSVs is insanely complicated, why would you ever do that?
My day job is running a data team, so I’m firmly in camp #2: lots of things run on CSV, and it’s crazy complicated. Fundamentally this is because CSV isn’t a format, it’s a family of related formats. If you work with arbitrary CSV files long enough, you’ll eventually encounter one that doesn’t even use commas for separators.
Like most weird things, this is a consequence of history. The idea of CSV dates back at least 40 years, while the RFC “standardizing” it is from 2005. That’s a lot of time for different versions to flourish.
To get more concrete, CSV is a subset of the Delimiter Separated Values (DSV) family of tabular data formats – one which often (but not always!) uses commas to separate values. The most common variant is almost certainly that produced by Microsoft Excel (on Windows, in an English locale) – it uses commas to separate values, double quotes to start escaped values, double quotes to escape within escaped values, and the carriage-return line-feed character sequence to end a row.
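To make that variant concrete, here is a minimal sketch using Python's standard `csv` module rather than Cesil (the sample data is made up for illustration). Its built-in `excel` dialect follows the same rules described above: comma separators, double quotes around escaped values, a doubled double quote inside them, and `\r\n` row endings.

```python
import csv
import io

# A small made-up file in the "Excel on Windows, English locale" style:
# comma separators, double-quoted values, "" for a literal quote, \r\n row endings.
raw = 'name,quote\r\n"Smith, Jane","She said ""hi"""\r\nDoe,plain\r\n'

# newline="" hands the \r\n sequences to the csv module untouched.
rows = list(csv.reader(io.StringIO(raw, newline=""), dialect="excel"))

# Writing the rows back with the same dialect reproduces the original text.
out = io.StringIO(newline="")
csv.writer(out, dialect="excel").writerows(rows)
round_trip = out.getvalue()
```

The second row parses to the two values `Smith, Jane` and `She said "hi"`: the comma and the quotes survive because the surrounding double quotes mark the value as escaped.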
Cesil aims to support all “reasonable” DSV formats, with defaults for the most common kind of CSV. A later post will go into exactly how flexible Cesil can be, but from a format perspective Cesil can handle:
- Any single character value separator
- Either no way to escape a value, or a single character starting and stopping escaped values
- Either no way to escape a character within an escaped value, or a single character escape
- Any of the \r, \n, or \r\n character sequences for ending a row
- No comments, or “whole row” comments
- Optional leading or trailing whitespace around values
- Requiring a header row, forbidding a header row, or making a header row optional
This flexibility makes it possible to handle relatively standard things like Tab Separated Value (TSV) files, or CSV files which use an unusual character for escaping, as well as kind of crazy things like CSVs using semicolons to separate values, or where values have been visually aligned with whitespace. All of this functionality, and much more, is configured with Cesil’s Options and OptionsBuilder classes.
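To show what those variations look like in practice without guessing at Cesil's API, here is a rough sketch using Python's standard `csv` module instead. The `parse` helper and the sample data are hypothetical, made up for illustration. Each call changes one or two format parameters, mirroring the dimensions listed above.

```python
import csv
import io

def parse(text, **fmt):
    """Hypothetical helper: parse text with the stdlib csv module and the given format parameters."""
    return list(csv.reader(io.StringIO(text, newline=""), **fmt))

# Tab Separated Values: only the separator changes.
tsv = parse("id\tname\r\n1\tAda\r\n", delimiter="\t")

# Semicolon separators, so a comma can appear inside a value unquoted.
semi = parse("id;price\r\n1;3,50\r\n", delimiter=";")

# An unusual escaping setup: single quotes wrap values, a backslash escapes inside them.
odd = parse("id,note\n1,'it\\'s fine'\n",
            quotechar="'", escapechar="\\", doublequote=False)

# Values visually aligned with spaces after each separator.
aligned = parse("id,   name\n1,    Ada\n22,   Grace\n", skipinitialspace=True)
```

Note that the standard module only trims whitespace that follows a separator, and has no notion of comment rows, so it covers a narrower set of formats than the list above describes.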
And now we encounter Cesil’s first Open Question: Do these options provide adequate flexibility?
I’ve opened an Issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.
Now that I’ve covered the formats Cesil can handle, in the next post I will cover the whats and whys of the interface it exposes…