Overthinking CSV With Cesil: C# 8 Specifics

Posted: 2020/06/30 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

Way back in the first post of this series I mentioned that part of the motivation for Cesil was to get familiar with new C# 8 features, and to use modern patterns. This post will cover how, and why, Cesil has adopted these features.

The feature with the biggest impact is probably IAsyncEnumerable<T>, and its associated await foreach syntax. This shows up in Cesil’s public interface, as the returned value of IAsyncReader<TRow>.EnumerateAllAsync(), a parameter of IAsyncWriter<TRow>.WriteAllAsync(…), and as a returned value or parameter on various CesilUtils methods. IAsyncEnumerable<T> enables a nice way to yield elements that are obtained asynchronously, a perfect match for serialization libraries that consume streams. Pre-C# 8 you could kind of accomplish this with an IEnumerable<Task<T>>, but that’s both more cumbersome for clients to consume and slightly weird since MoveNext() shouldn’t block so you’d have to smuggle whether the stream is complete into the yielded T. IAsyncEnumerable<T> is also disposed asynchronously, using another new-to-C#-8 feature…
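To make the shape of this concrete, here is a minimal sketch using only the base class library. LineSource, ReadLinesAsync, and CountNonEmptyAsync are hypothetical names made up for illustration; none of this is Cesil's API.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper, not part of Cesil.
public static class LineSource
{
    // An async iterator: each line is awaited, then yielded to the consumer.
    public static async IAsyncEnumerable<string> ReadLinesAsync(TextReader reader)
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            yield return line;
        }
    }

    // Consumes the sequence with await foreach, no Task<T> unwrapping needed.
    public static async Task<int> CountNonEmptyAsync(TextReader reader)
    {
        var count = 0;
        await foreach (var line in ReadLinesAsync(reader))
        {
            if (line.Length > 0)
            {
                count++;
            }
        }

        return count;
    }
}
```

The end of the sequence is signalled by the enumerator itself, so nothing has to be smuggled into the yielded value.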

IAsyncDisposable, which is the async equivalent to IDisposable, also sees substantial use in Cesil – although mostly internally. It is implemented on IAsyncReader<TRow> and IAsyncWriter<TRow> and, importantly, IDisposable is not implemented. Using IAsyncDisposable lets you require that disposal happen asynchronously, which Cesil uses to require that all operations on an XXXAsync interface are themselves async. C# 8 also introduces the await using syntax, which makes consuming IAsyncDisposables as simple for clients as consuming IDisposables. Pre-C# 8 if a library wanted to allow clients to write idiomatic code with usings it would have to support synchronous disposal on interfaces with asynchronous operations, essentially mandating sync-over-async and all the problems that it introduces.
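A small sketch of the pattern, assuming nothing beyond the base class library. AsyncOnlyWriter is a hypothetical type invented for this example; it is not how Cesil's writers are implemented.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical type: disposal is async-only, IDisposable is deliberately absent.
public sealed class AsyncOnlyWriter : IAsyncDisposable
{
    private readonly TextWriter _inner;

    public bool IsDisposed { get; private set; }

    public AsyncOnlyWriter(TextWriter inner) => _inner = inner;

    public async ValueTask WriteLineAsync(string value)
    {
        await _inner.WriteLineAsync(value);
    }

    public async ValueTask DisposeAsync()
    {
        if (IsDisposed) return;

        IsDisposed = true;
        // Any remaining work happens asynchronously, never blocking the caller.
        await _inner.FlushAsync();
    }
}

public static class AsyncOnlyWriterDemo
{
    public static async Task<string> RunAsync()
    {
        var buffer = new StringWriter();

        await using (var writer = new AsyncOnlyWriter(buffer))
        {
            await writer.WriteLineAsync("hello");
        } // DisposeAsync() is awaited here

        return buffer.ToString();
    }
}
```

Because the type does not implement IDisposable, a plain using on it will not compile, which is what pushes callers toward await using.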

The rest of the features introduced in C# 8 mostly see use internally, resulting in a code base that’s a little easier to work on but not having much impact on consumers. From roughly most to least impactful, the features adopted in Cesil’s code are:

Static local functions
- These were extensively used to implement the “actually go async”-parts of reading and writing, while keeping the fast path await-free.
- The big benefit is having the compiler enforce that local functions aren’t closing over any variables not explicitly passed into them, which means you can be confident invoking the function involves no implicit allocations.
Switch expressions
- These were mostly adopted in a tail position, where previously I’d have a switch where each case returned some calculated value.
- Using switch expressions instead of switch statements results in more compact code, which is a welcome quality-of-life improvement.
Default interface methods
- These let you attach a method with an implementation to an interface. The primary use case is to allow libraries to make additions to an already published interface without that breaking consumers.
- There’s another use case though, the one Cesil adopts, which is to attach an implemented method that all implementers of an interface will need. An example of this is ITestableDisposable, where the AssertNotDisposed method is the same everywhere but IsDisposed logic needs to be implemented on each implementing type.
- In older versions of C#, I’d use an extension method or some other static method to share this implementation but default interface methods let me keep the declarations and implementations closer together. Just another small quality-of-life improvement, but there’s potential for this to be a much bigger help in post-1.0 releases of Cesil.
Indices and Ranges
- These simplify taking elements or slices of strings, Spans, and so on. Cesil also supports reading and writing the new Index and Range types.
- Another small quality-of-life improvement, though I have seen this one catch some bugs when changing foo[something.Length - 1] to foo[^1].
Readonly Members
- You use these when you can’t make an entire struct readonly, but want the compiler to guarantee certain members don’t mutate the struct.
- I only did this in a few places, there aren’t that many mutable structs in Cesil, but having the compiler guarantee invariants is always a useful safety net.
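The features in the list above can be sketched together in a few lines. Everything here is illustrative: ITestableDisposableSketch is modeled on the description of ITestableDisposable above but is not Cesil's code, and Resource, Counter, and Csharp8Samples are hypothetical names. Default interface methods and array ranges assume a runtime that supports them (.NET Core 3.0 or later).

```csharp
using System;

// Default interface method: AssertNotDisposed is shared, IsDisposed is per-type.
public interface ITestableDisposableSketch : IDisposable
{
    bool IsDisposed { get; }

    void AssertNotDisposed()
    {
        if (IsDisposed)
        {
            throw new ObjectDisposedException(nameof(ITestableDisposableSketch));
        }
    }
}

public sealed class Resource : ITestableDisposableSketch
{
    public bool IsDisposed { get; private set; }

    public void Dispose() => IsDisposed = true;
}

public struct Counter
{
    public int Value;

    public void Increment() => Value++;

    // Readonly member: the compiler rejects any mutation of the struct in here.
    public readonly string Describe() => $"Count={Value}";
}

public static class Csharp8Samples
{
    // Switch expression in tail position, replacing a switch of return statements.
    public static string RowEndingText(int kind) => kind switch
    {
        0 => "\n",
        1 => "\r",
        2 => "\r\n",
        _ => throw new ArgumentOutOfRangeException(nameof(kind))
    };

    public static int SumOfLastTwo(int[] values)
    {
        // Range takes the last two elements; Index ^1 is the final element.
        var tail = values[^2..];
        return Add(tail[0], tail[^1]);

        // Static local function: capturing 'values' or 'tail' here is a compile error.
        static int Add(int a, int b) => a + b;
    }
}
```

Note that a default interface method is only callable through the interface type, so a Resource must be cast (or assigned) to ITestableDisposableSketch before AssertNotDisposed() can be invoked.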

Readers who closely follow C# are probably thinking “wait, what about nullable reference types?”. Those were the big new feature in C# 8, and Cesil has adopted them. However, unlike the other new C# 8 features, I intentionally deferred adopting them until Cesil was fairly mature as I wanted to explore converting an existing code base. My next post will go into that process in detail.

There aren’t really any Open Questions around the C# 8 features in this post. There were so many in the previous post on flexibility, that I think it’s probably best to just go and leave your thoughts on them instead.

As a reminder, they were…

Overthinking CSV With Cesil: “Maximum” Flexibility

Posted: 2020/06/24 | Author: kevinmontrose | Filed under: code | Tags: cesil | 1 Comment

Over the course of this series I’ve alluded to a future post where I’ll dig into all the configuration options Cesil offers.

This is that, gigantic, post.

I conceptualize Cesil’s configurability as being along three axes: the format, memory use, and type mapping. Format options control the style of delimiter separated value (DSV) you’re reading or writing, memory options give fine grained control over allocations, and type mappings handle converting from .NET types to text and vice versa.

To begin, let’s start with…

Format Options

The necessity of being able to configure different format options is clear for any CSV library, since as I said in an earlier post CSV isn’t really a format – it’s a bunch of related formats. For a library like Cesil, which aims to support all reasonable DSV formats, the necessity is even more obvious.

All configuration options relevant to formatting live on the Options type, with corresponding WithXXX methods on OptionsBuilder. These options are:

ValueSeparator – the single character used to separate columns in a row
- This is the C(omma) in CSV, so it is almost always a comma
- This must have a value, it cannot be left unset
- If your format stores the sequential values Kevin Monty Montrose as Kevin,Monty,Montrose then you’re using a comma for this
- Due to feedback on earlier posts, this will become a string in the next release of Cesil
RowEnding – whether rows end in the \n, \r, or \r\n character sequence
- Most CSV files use \r\n, but Cesil can automatically detect this when reading if you use RowEnding.Detect
- When using Detect, Cesil will use the character sequence it first encounters as the expected row ending
- When writing, a RowEnding other than Detect must be provided or an exception will be raised when an IWriter<TRow> or IAsyncWriter<TRow> is created
EscapedValueStartAndEnd – the character used to start and end an escaped value
- Typically this is a double quote, but it can be left unset
- If your format treats , as a value separator and would store Montrose, Kevin as "Montrose, Kevin" then you’re using a double quote for this
EscapedValueEscapeCharacter – the character used to start an escape sequence when you are already in an escaped value
- Typically this is a double quote, but it can be left unset
- If your format would store Kevin "Monty" Montrose as "Kevin ""Monty"" Montrose" then you’re using a double quote for this
ReadHeader – whether to always expect a header row, never expect a header row, or automatically detect a header row
- This tends to vary file to file, so it is often set to ReadHeader.Detect
- If you use Always or Detect, Cesil will use the header to infer column order when mapping columns to .NET types
- If your format supports comments, it is legal for comments to precede a header row
CommentCharacter – the single character that starts a comment line
- Typically this is not set, but if set it is often #
- For example a single line of #hello world would be a comment of hello world, in formats with # for this
WhitespaceTreatment – whether to trim whitespace that is encountered in certain places while parsing
- Most formats preserve whitespace if it is encountered in a value, and do not permit whitespace as padding around escaped values
- If your format is one that is unusual, Cesil supports automatically trimming whitespace in certain cases. Refer to WhitespaceTreatments for the full list of trimming behaviors
- Note that WhitespaceTreatments is a [Flags] enum, so the individual trimming behaviors can be freely combined.
ExtraColumnTreatment – how to handle encountering “extra” columns when reading
- Cesil considers a column “extra” if it doesn’t map to a member, or if it’s in a column that didn’t appear in the header row (if there is a header row)
- If you’re reading into dynamics and not requiring a header row, extra columns will be any that have an index greater than the highest index in the first read row
- This must be one of:
  - ExtraColumnTreatment.Ignore – extra columns are ignored, provided they don’t violate the format
  - ExtraColumnTreatment.IncludeDynamic – identical to Ignore when reading static types, but if reading into dynamics then extra columns are included. Extra columns are accessed either by index, or via a conversion into an IEnumerable or IEnumerable<T> or other type that permits access
  - ExtraColumnTreatment.ThrowException – an exception is raised if an extra column is encountered
WriteHeader – whether or not to write a header row before writing any values
- This tends to vary file to file, but Options.Default and Options.DynamicDefault do write headers to ease development
WriteTrailingRowEnding – whether or not to end the final written row with the configured RowEnding
- This seems to vary wildly, but Options.Default and Options.DynamicDefault do not write a trailing row ending
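To illustrate how the separator and the two escape characters interact, here is a rough, self-contained sketch of encoding a single value. DsvEscaping.Encode is a hypothetical helper written for this post's examples; it is not Cesil's implementation and it ignores comments and whitespace treatment entirely.

```csharp
using System.Text;

// Hypothetical helper, not Cesil's implementation.
public static class DsvEscaping
{
    public static string Encode(
        string value,
        char valueSeparator = ',',
        char escapedValueStartAndEnd = '"',
        char escapedValueEscapeCharacter = '"')
    {
        // A value only needs escaping if it contains a character that is
        // otherwise meaningful to the format.
        var needsEscape =
            value.IndexOf(valueSeparator) >= 0 ||
            value.IndexOf(escapedValueStartAndEnd) >= 0 ||
            value.IndexOf('\r') >= 0 ||
            value.IndexOf('\n') >= 0;

        if (!needsEscape)
        {
            return value;
        }

        var builder = new StringBuilder();
        builder.Append(escapedValueStartAndEnd);
        foreach (var c in value)
        {
            // Inside an escaped value, the start/end character must itself be escaped.
            if (c == escapedValueStartAndEnd)
            {
                builder.Append(escapedValueEscapeCharacter);
            }

            builder.Append(c);
        }

        builder.Append(escapedValueStartAndEnd);
        return builder.ToString();
    }
}
```

With the defaults, Encode("Montrose, Kevin") produces "Montrose, Kevin" wrapped in double quotes, and the inner quotes in Kevin "Monty" Montrose are doubled, matching the two examples in the list above.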

And that’s it. Ten options which, hopefully, allow Cesil to cope with all reasonable DSV formats out there. I’d be quite interested to learn of any that Cesil can’t cope with – it’s always a fun challenge to make a system more flexible without sacrificing ease of use or performance.

Now we’ll move on to…

Allocation Options

Beyond “don’t allocate more than necessary” it may strike some as odd to care about memory allocation in a .NET library – after all, .NET is a managed (ie. garbage collected) platform. I believe that in fact a modern .NET library should strive to both minimize allocations and provide ways for clients to control those allocations that must happen. The .NET ecosystem has been evolving in a much more performance focused direction for a while now, with fancy new types like Span and Pipelines encouraging low allocation and low copy patterns, first class support for processor intrinsics, and struct alternatives (that don’t default to allocating on the heap) like ValueTuple and ValueTask. The .NET GC is good, but it’s never going to be free so a laser focus on allocations is common when concerned with performance. It follows that if a library’s clients are focused on controlling allocations, a library needs to give them the tools they need to control allocations.

That said, some heap allocations are unavoidable. Cesil does its best to perform all unavoidable allocations prior to returning an I(Async)Reader<TRow> or I(Async)Writer<TRow> – so creating Options, binding IBoundConfigurations<TRow>, and the actual creation of a reader or writer may allocate but after that, allocations are under client control. There are exceptions, but I’ll dig into those in a later section.

On Options, there are a few relevant members:

MemoryPool – when Cesil needs to allocate, the MemoryPool<char> it uses to obtain a block of memory
- Cesil will always request a size it can work with, but if the client’s MemoryPool<char> does not return a chunk of memory at least the requested size an exception may be raised
- Cesil will always call IMemoryOwner<char>.Dispose() when finished using a chunk of memory
- MemoryPools must be thread safe, as Cesil makes no guarantees that IMemoryOwner<char> references remain on any given thread
ReadBufferSizeHint – when reading, Cesil needs a buffer to store characters it has not yet processed. This value specifies how large that buffer should be
- There is often a tradeoff between buffer size and performance, the larger the buffer the fewer calls to an underlying stream are needed to load all data, and thus reading will complete more quickly. This is not true once a buffer is large enough that the underlying stream cannot fill it on each call, or if the underlying stream is frequently blocked waiting for more data
- Setting ReadBufferSizeHint to 0 tells Cesil to request a “reasonable default” buffer size
- The read buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
WriteBufferSizeHint – when writing, Cesil can stage writes into a buffer to improve performance. This value specifies if a buffer should be used, and how large it should be
- As with ReadBufferSizeHint there is often a trade-off between buffer size and performance. If there is no buffer, every write must call into the underlying stream which can make writing take considerably longer.
- Setting WriteBufferSizeHint to 0 disables write buffering, all data will be sent directly to the underlying stream
- Setting WriteBufferSizeHint to null tells Cesil to request a “reasonable default” buffer size
- The write buffer is obtained from the configured MemoryPool<char>, allowing clients to control precisely how the buffer is allocated
DynamicRowDisposal – controls when dynamic rows obtained during reading are disposed
- Data that backs dynamic rows is kept in memory obtained from the configured MemoryPool<char>, to avoid allocating on the heap if not necessary. This allows Cesil to cast a dynamic to a ValueTuple (for example) without first converting the column values to strings. As a consequence of this, Cesil’s dynamic rows are effectively IDisposable
- The two options for DynamicRowDisposal are:
  - DynamicRowDisposal.OnExplicitDispose – rows must have .Dispose() called on them, not doing so will leak an IMemoryOwner<char>
  - DynamicRowDisposal.OnReaderDispose – rows will be automatically disposed when the I(Async)Reader<TRow> that last touched them is disposed
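The contract described for MemoryPool<char> (rent at least the requested size, always dispose the IMemoryOwner<char>) looks roughly like this from the consuming side. PoolDemo is a hypothetical name, and this only shows the general System.Buffers pattern, not Cesil's internals.

```csharp
using System;
using System.Buffers;

// Hypothetical consumer of a MemoryPool<char>, not Cesil's code.
public static class PoolDemo
{
    public static string CopyThroughPool(string text, MemoryPool<char> pool)
    {
        // Rent may hand back more than was asked for, but should never hand back less.
        using (IMemoryOwner<char> owner = pool.Rent(text.Length))
        {
            if (owner.Memory.Length < text.Length)
            {
                throw new InvalidOperationException("Pool returned a smaller buffer than requested");
            }

            // Only use the slice that was actually needed.
            Span<char> span = owner.Memory.Span.Slice(0, text.Length);
            text.AsSpan().CopyTo(span);

            return span.ToString();
        } // owner.Dispose() returns the memory to the pool
    }
}
```

Calling PoolDemo.CopyThroughPool("hello", MemoryPool<char>.Shared) round-trips the text through pooled memory; a custom MemoryPool<char> subclass could be passed instead to control where that memory comes from.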

Moving beyond Options, IReader<TRow> and IAsyncReader<TRow> have the XXXWithReuse() methods – these methods take a ref parameter that points to a row to reuse when reading. When processing many rows in sequence, these methods let you allocate a single row and then just repeatedly reuse it – greatly reducing the number of allocations. There are a few caveats to keep in mind. First, if a row has a Setter backed by constructor parameters (more on those below) a row cannot be reused and will always be reallocated. Second, value types are always zero initialized so there is always a row to reuse if the row is a value type – this means your InstanceProvider (more on those below) may not be invoked when you would expect it to be if the row were a reference type. Finally, because the ref parameter must be initialized with the row that will ultimately be populated before an XXXWithReuse() method returns, it is possible (especially in async cases, when the underlying stream blocks) for Cesil to allocate a row it ends up not needing.
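The reuse idea itself is easy to sketch in isolation. ReusingReader and Row below are hypothetical stand-ins invented for this example; the method name echoes the XXXWithReuse() naming, but the signature is an assumption and should not be read as Cesil's.

```csharp
using System.Collections.Generic;

// Hypothetical row type.
public sealed class Row
{
    public string Name;
}

// Hypothetical reader that populates a caller-supplied row instead of allocating.
public sealed class ReusingReader
{
    private readonly Queue<string> _pending;

    public int Allocations { get; private set; }

    public ReusingReader(IEnumerable<string> names)
    {
        _pending = new Queue<string>(names);
    }

    public bool TryReadWithReuse(ref Row row)
    {
        if (_pending.Count == 0)
        {
            return false;
        }

        // Only allocate when the caller did not hand us a row to reuse.
        if (row == null)
        {
            row = new Row();
            Allocations++;
        }

        row.Name = _pending.Dequeue();
        return true;
    }
}
```

Looping with a single Row variable passed by ref means the first call allocates and every later call overwrites the same instance, so reading three rows costs one allocation rather than three.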

The last piece of allocation control has lots of overlap with type mapping, which is covered in the next section, but in brief: InstanceProviders give clients control over how rows are obtained, and Parsers give control over how ReadOnlySpan<char>s are turned into instances of other types. Other types that participate in type mapping allow for control over accessing members, assigning members, and so on – so a client can customize any step of the process that might concern them.

With allocations covered, let’s now proceed to the final axis of configuration…

Type Mapping

DSVs provide rows and columns of text, and that’s it really. .NET has a much richer type system, and so Cesil must provide some way to move between these two worlds. Complicating that is how many different styles of .NET coding are out there; a good library must provide clients with the tools they need to match its behavior to their own applications.

Cesil breaks this process of mapping types to and from text into several logical pieces, many of which have been mentioned in earlier posts:

InstanceProviders, which obtain instances of rows to populate
- These can be created from delegates, MethodInfos, or ConstructorInfos
Parsers, which turn text data into instances of .NET types
- These can be created from delegates, MethodInfos, or ConstructorInfos
Resets, which allow per-column code to run prior to assignment of a row member
- These can be created from delegates, or MethodInfos
Setters, which take the instances produced by Parsers and assign them to members on a row obtained from an InstanceProvider
- These can be created from delegates, MethodInfos, FieldInfos, PropertyInfos, or ParameterInfos
DeserializableMembers, which group a name, Parser, Reset, Setter, and MemberRequired together to describe the treatment of a single column when performing read operations
ShouldSerializes, which allow for per-row control over whether a member is written
- These can be created from delegates, or MethodInfos
Getters, which obtain instances of .NET types from rows which are placed in columns
- These can be created from delegates, MethodInfos, FieldInfos, or PropertyInfos
Formatters, which turn the instances obtained from Getters into text data
- These can be created from delegates, or MethodInfos
SerializableMembers, which group a name, ShouldSerialize, Getter, Formatter, and EmitDefaultValue together to describe the treatment of a single column when performing write operations
DynamicRowConverters, which back casting dynamic rows into concrete instances of .NET types
- These can be created from delegates, MethodInfos, ConstructorInfos, or a combination of ConstructorInfos and Setters
DynamicCellValues, which back converting cells obtained from a dynamic row into text data
- These can be created from Formatters
ITypeDescribers, the interface which Cesil uses to obtain all the above
- As the primary way Cesil allows customizing this type mapping, it is discussed in more detail below

Each of these types has particular rules about what kind of method, delegate, constructor, etc. can back them, which are detailed in the documentation for each type, and on Cesil’s wiki.

Additionally, a number of these types (like Parser) have a notion of failure (indicated by returning false from a method or delegate) and support delegating to another instance as a fallback. This is used via their Else(…) method, which creates and returns a new instance that will delegate on failure.
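The fallback idea can be sketched with plain delegates. TryParseDelegate<T>, ParserFallback, and ParserFallbackDemo are hypothetical names for this illustration; Cesil's own types wrap more than a delegate, and this Else is an extension method written here, not the library's.

```csharp
using System;
using System.Globalization;

// Hypothetical delegate shape: returns false to signal failure.
public delegate bool TryParseDelegate<T>(ReadOnlySpan<char> text, out T value);

public static class ParserFallback
{
    // Returns a new delegate that tries 'first', and only on failure tries 'fallback'.
    public static TryParseDelegate<T> Else<T>(this TryParseDelegate<T> first, TryParseDelegate<T> fallback)
    {
        return (ReadOnlySpan<char> text, out T value) =>
            first(text, out value) || fallback(text, out value);
    }
}

public static class ParserFallbackDemo
{
    public static int? ParseIntOrHex(string text)
    {
        TryParseDelegate<int> plain =
            (ReadOnlySpan<char> t, out int v) => int.TryParse(t, out v);
        TryParseDelegate<int> hex =
            (ReadOnlySpan<char> t, out int v) =>
                int.TryParse(t, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out v);

        var combined = plain.Else(hex);

        if (combined(text.AsSpan(), out var parsed))
        {
            return parsed;
        }

        return null;
    }
}
```

Neither original delegate is modified; Else hands back a new one, so the same building blocks can be chained in different orders for different columns.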

As mentioned above, the ITypeDescriber interface requires a little more discussion. It has six methods, each of which supports a particular use case for Cesil:

Invoked once per IBoundConfiguration<TRow>
- GetInstanceProvider(TypeInfo) to determine how any created I(Async)Reader<TRow> should obtain the rows they return, if the rows aren’t provided via a XXXWithReuse method
  - When reading dynamic rows, this method is not invoked
- EnumerateMembersToDeserialize(TypeInfo) to determine what columns any I(Async)Reader<TRow> should expect to find, and how to map the text in those columns into members on the row
  - The order of returned DeserializableMembers is used if there is no header row when reading
  - When reading dynamic rows, this method is not invoked
- EnumerateMembersToSerialize(TypeInfo) to determine how many columns an I(Async)Writer<TRow> should write per row, and how to obtain and format the values behind those columns from a row
  - The order of returned SerializableMembers controls the order of the columns when written
  - When writing dynamic rows, this method is not invoked
Invoked as needed in response to operations on dynamic rows
- GetDynamicCellParserFor(in ReadContext, TypeInfo) to obtain a Parser that can convert the text data backing a cell in a dynamic row to a particular .NET type
  - The provided ReadContext describes where the cell occurred in the data, as well as the Options used to originally parse the data
- GetDynamicRowConverter(in ReadContext, IEnumerable, TypeInfo) to obtain a DynamicRowConverter that can convert a whole dynamic row to a particular .NET type
  - The provided ReadContext describes where the row occurred in the data, as well as the Options used to originally parse the data
- GetCellsForDynamicRow(in WriteContext, Object) to obtain the DynamicCellValues to write from a dynamic row
  - The provided WriteContext describes the index of the row being written, as well as the Options used by the I(Async)Writer<TRow>

In addition to the raw ITypeDescriber interface, Cesil also provides three implementations of the interface out of the box. They are:

The DefaultTypeDescriber which is used by, well, default and implements “normal” .NET (de)serialization behaviors
- This class isn’t sealed, and has numerous virtual methods as extension points for when you want minor tweaks to “normal” behavior
The ManualTypeDescriber which, in conjunction with ManualTypeDescriberBuilder, lets you specify exactly which InstanceProviders, Setters, Parsers, etc. to use for specific types
The SurrogateTypeDescriber which, in conjunction with SurrogateTypeDescriberBuilder, lets you specify a type as a surrogate for some other, original, type. When any of ITypeDescriber’s members is invoked for the original type, a SurrogateTypeDescriber instead inspects its surrogate type and then maps the results back to the original type
- Surrogate types are useful when you want to apply attributes to a type you don’t control

And that wraps up my deep dive into Cesil’s flexibility. There’s even more detail in the wiki and on the reference documentation (linked throughout this post) for the involved types, but this post should at least give you a decent basic understanding.

Which brings us to the Open Questions of this post:

As before, I’ve opened three issues to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

My next post will tackle a smaller subject – I’ll be going over some of the new features that came with C# 8, and the how and why of Cesil’s adoption of them.

Overthinking CSV With Cesil: Writing Dynamic Types

Posted: 2020/06/16 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

I covered how to write known, static, types with Cesil in my previous post. As with reading, Cesil also supports dynamic types.

In my post on dynamic reading, I argued dynamic is still worth supporting due to how convenient it makes some common read operations. I feel the case for writing dynamic types is much weaker – it is rare to want to write heterogeneous types, and even rarer to not be able to easily map such a mixed collection to a single known type. All that said, for symmetry’s sake Cesil does have extensive support for writing dynamic types.

As with reading, writing static and dynamic types is essentially symmetric. All the same methods are provided, supporting all the same operations. The only difference is rather than using Configuration.For<TRow>() you use Configuration.ForDynamic(), and rather than IBoundConfiguration<TRow> being parameterized by a type TRow it’s parameterized by dynamic.

When using the DefaultTypeDescriber, performance varies considerably based on the “kind” of dynamic you are writing. Cesil special cases “well known” dynamic types for improved performance – namely the dynamic rows Cesil creates and ExpandoObject are treated specially. For other DLR aware types Cesil will use IDynamicMetaObjectProvider directly, which is considerably slower. Plain .NET types delegate to the usual EnumerateMembersToSerialize method, which implements “normal” .NET behavior.
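One reason ExpandoObject is cheap to special case is that it can be enumerated as a plain dictionary without going through the DLR. The sketch below shows that idea only; ExpandoCells is a hypothetical helper and says nothing about how Cesil actually handles ExpandoObject internally.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;
using System.Globalization;

// Hypothetical helper, not Cesil's implementation.
public static class ExpandoCells
{
    // Turns each member of an ExpandoObject into a (column name, text) pair.
    public static List<KeyValuePair<string, string>> Cells(ExpandoObject row)
    {
        var cells = new List<KeyValuePair<string, string>>();

        // ExpandoObject implements IDictionary<string, object>, so no dynamic binding is needed.
        foreach (var member in (IDictionary<string, object>)row)
        {
            var text = Convert.ToString(member.Value, CultureInfo.InvariantCulture);
            cells.Add(new KeyValuePair<string, string>(member.Key, text));
        }

        return cells;
    }
}
```

An arbitrary IDynamicMetaObjectProvider offers no such shortcut, which is consistent with the slower path described above for other DLR aware types.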

Cesil allows customizing the members discovered, and the order they’ll be written in, by using a custom ITypeDescriber with your Options and implementing GetCellsForDynamicRow directly. Simple inclusion/exclusion can be controlled by subclassing the DefaultTypeDescriber and overriding the ShouldIncludeCell method. I’ll cover how this works in more detail in a later post that goes in depth into all of Cesil’s configuration options.

And that’s about it for dynamic serialization – there’s not a lot to cover since so much of it is “just like writing static types, but dynamic.” This post’s Open Question is, accordingly, more “tactical” than previous ones:

Is there a better interface for discovering dynamic “cells” than the IEnumerable<DynamicCellValue>-returning GetCellsForDynamicRow() method currently on ITypeDescriber?

The interface isn’t technically wrong, but it has the undesirable property that general implementations will allocate at least a little bit for each row written. An allocation-free alternative would be a marked improvement, provided it doesn’t come at the cost of flexibility or reasonable performance.

As before, I’ve opened an issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

In my next post I’ll go into detail on all the configuration options Cesil supports. It’ll be a long post, as Cesil supports customizing the expected format, as well as almost every detail of describing and mapping types.

Overthinking CSV With Cesil: Writing Known Types

Posted: 2020/06/12 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

My last two posts have covered deserializing with Cesil, the subsequent two will cover serialization. This post will specifically dig into the case where you know the types involved at compile time, while the next one will cover the dynamic type case. If you’ve read the previous posts on read operations hopefully a lot of this will seem intuitive, just in reverse.

Again, CesilUtils exposes a bunch of utility methods – this time with names like WriteXXX. Variants exist for single row, multiple row, synchronous, asynchronous, and “straight to a file” operations. Just like with reading, CesilUtils doesn’t allow you to reuse an IBoundConfiguration<TRow> nor does it expose the underlying I(Async)Writer<TRow> but is convenient when performance and customization aren’t of paramount importance.

As with reading, maximum performance and flexibility is found in using either IWriter<TRow> or IAsyncWriter<TRow> interfaces obtained from an IBoundConfiguration<TRow> created via Configuration.For<TRow>. Creating configurations is mildly expensive, so caching and reusing them can be beneficial.

The writer interfaces expose methods to do the following:

Write a collection of rows with WriteAll(Async)
- The sync version accepts an IEnumerable<T>
- The async version can take either an IEnumerable<T> or an IAsyncEnumerable<T>
Write a single row with Write(Async)
Write a comment with WriteComment(Async)
- If a comment contains a row ending sequence of characters, it will be split into multiple comments automatically

Mapping a type to a set of columns, the order of those columns, and the conversion of the values of those columns to text is done with the ITypeDescriber registered on the Options provided to Configuration.For<TRow> or the method on CesilUtils (by default, this is an instance of DefaultTypeDescriber). When an IBoundConfiguration<TRow> is created ITypeDescriber.EnumerateMembersToSerialize is invoked once and the returned SerializableMembers detail how Cesil will map a TRow instance to a set of text columns.

Specifically a SerializableMember details

The name of the column, which may be written as part of a header row
The Getter to use to obtain a value from a TRow instance
An (optional) ShouldSerialize to control, per-row, whether a column should be included
The Formatter used to turn the column’s value into a sequence of characters
Whether or not to include a column if it has the default value for its type
- Cesil uses Activator.CreateInstance to obtain a default instance of ValueTypes, and uses null as the default value for reference types
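The default-value rule in the last item can be sketched in a few lines. DefaultValues is a hypothetical helper for illustration; only the use of Activator.CreateInstance for value types and null for reference types comes from the description above.

```csharp
using System;

// Hypothetical helper, not Cesil's implementation.
public static class DefaultValues
{
    // Value types get a zero-initialized (boxed) instance; reference types use null.
    public static object DefaultFor(Type type)
    {
        return type.IsValueType ? Activator.CreateInstance(type) : null;
    }

    // True if 'value' equals the default for the member's declared type.
    public static bool IsDefault(object value, Type declaredType)
    {
        return object.Equals(value, DefaultFor(declaredType));
    }
}
```

Under this rule an int column holding 0 or a string column holding null counts as default, while an empty string does not.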

The order of columns is taken from the order they are yielded by the IEnumerable<SerializableMember> returned by ITypeDescriber.EnumerateMembersToSerialize.

There is quite a lot of flexibility in how Getters, ShouldSerializes, and Formatters can be created. They will be covered in detail in a later post.

There’s less internal state being managed when Cesil is writing in comparison to when it is reading, so there are no fancy state machines or lookup tables. The most interesting part is NeedsEncodeHelper which is used to check for characters that would require escaping, which makes use of the X64 intrinsics supported in modern .NET (provided your processor supports them).

There are some minor additional details to keep in mind while writing with Cesil:

All XXXAsync() methods try to make as much progress as they can without blocking, they don’t just yield to yield.
All XXXAsync() methods do take an optional CancellationToken, and pass it down to the underlying stream. CancellationTokens are checked at reasonable intervals, but no guarantees are made about how often.
If you try to write a comment without having configured your Options with a comment character, an exception will be raised.
- Options.Default does not have a comment character set.
If you try and write a value that would require escaping without having configured your Options with a way to start and end escaped values, an exception will be raised.
- Options.Default has " as its escape start and stop character.
If you try to write a value that includes the escape start and stop character, but have not configured your Options with an escape character, an exception will be raised.
- Options.Default also has " as its escape character.

And that about covers how to write static types with Cesil.

The Open Question for this post is a return to an earlier one, but with a particular focus on writing: Is there anything missing from I(Async)Writer<TRow> that you’d expect to be supported in a modern .NET CSV library?

This question has already led to some changes, which will appear in the next release of Cesil – adding comment writing methods that take ReadOnlySpan<char> and ReadOnlyMemory<char> parameters, clarifying some parameter names, and returning counts of the number of rows written from the enumerable-taking write methods.

Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

In my next post I’ll cover how Cesil supports writing dynamic types, those not known at compile time. As you might expect from reading static and dynamic types, it is very similar to how static types are written…

Overthinking CSV With Cesil: Reading Dynamic Types

Posted: 2020/06/08 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

In my last post I went over how to use Cesil to deserialize to known, static, types. Since version 4.0, C# has also had a notion of dynamic types – ones whose bindings, members, and conversions are all resolved at runtime – and Cesil also supports deserializing into these.

In 2020, supporting dynamic isn’t exactly a given – dynamic is relatively rare in the .NET ecosystem, the big “Iron” use cases in 2015 (dynamic languages running on .NET) are all dead as far as I can tell, and the static-vs-dynamic-typing pendulum has been swinging back towards static with the increasing popularity of languages like Go, Rust, and TypeScript (even Python supports type annotations these days). All that said, I still believe there are niches in C# well served by dynamic – “quick and dirty” data loading without declaring types, and loading heterogeneous data. These are both niches Cesil aims to support well, and therefore dynamic support is a first-class feature.

Part of being a first-class feature means that all the flexibility and ease of use from static types is also present when working with dynamic. There aren’t any new types or interfaces, just use Configuration.ForDynamic() instead of Configuration.For<TRow>(), Options.DynamicDefault (which assumes a header row is present) instead of Options.Default (which will detect if a header row is present or not, which isn’t possible with unknown types), and the EnumerateDynamicXXX() methods on CesilUtils. The same readers with the same methods are all available, only now instead of some concrete T you’ll get a dynamic back. And, while dynamic operation does impose additional overhead, Cesil still aims for dynamic operations to be reasonably performant – within a factor of 3 or so of their static equivalent.

Regardless of the Options used, the dynamic rows returned by Cesil always support:

Casting to IDisposable
Calling the Dispose() method
Get accessor with an int (ie. someRow[0]), which returns a dynamic cell
- This will throw if the int is out of bounds
Get accessor with a column name (ie. someRow[“someColumn”]), which returns a dynamic cell
- If there was no header row present when reading (or if the column name is not found), this will throw
Get accessor with an Index (ie. someRow[^1]), which returns a dynamic cell
- This will throw if the Index is out of bounds
Get accessor with a Range (ie. someRow[1..2]), which returns a dynamic row
- This will throw if the Range is out of bounds
Get accessor with a ColumnIdentifier (ie. someRow[ColumnIdentifier.Create(3)]), which returns a dynamic cell
- If the ColumnIdentifier has a Name, and a header row is present, this will throw if Name is not found.
- If the ColumnIdentifier does not have a Name, or a header row is not present, this will throw if its Index is out of bounds
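
To make the int, Index, Range, and column name accessors above concrete, here is a minimal sketch of a row-like type that supports them. SketchRow is a hypothetical stand-in built only on the BCL; it is not Cesil's dynamic row, it returns plain strings rather than dynamic cells, and slicing the header along with the cells is my assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a row; not Cesil's dynamic row type.
sealed class SketchRow
{
    private readonly string[] cells;
    private readonly string[] header; // null when no header row was read

    public SketchRow(string[] cells, string[] header = null)
    {
        this.cells = cells;
        this.header = header;
    }

    public int Count => cells.Length;

    // someRow[0]: throws IndexOutOfRangeException when out of bounds
    public string this[int index] => cells[index];

    // someRow[^1]: Index counts from the end when written with ^
    public string this[Index index] => cells[index];

    // someRow[1..2]: returns another row; throws ArgumentOutOfRangeException when out of bounds
    public SketchRow this[Range range]
        => new SketchRow(cells[range], header == null ? null : header[range]);

    // someRow["someColumn"]: throws if there is no header or the name is unknown
    public string this[string name]
    {
        get
        {
            var ix = header == null ? -1 : Array.IndexOf(header, name);
            if (ix < 0) throw new KeyNotFoundException("No column named " + name);
            return cells[ix];
        }
    }
}
```

With a row built from cells a, b, c and header x, y, z, the expressions row[0], row[^1], row[1..3], and row["y"] would yield a, c, a two-cell row, and b respectively.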

Likewise, regardless of the Options used, dynamic cells (obtained by indexing a dynamic row per above) always support casting to IConvertible. IConvertible is a temperamental interface, so Cesil’s implementation is limited – it doesn’t support non-null IFormatProviders, and makes a very coarse attempt at determining TypeCode. Basically, Cesil does just enough for the various methods on Convert to work “as you’d expect” for dynamic cells.
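
As an illustration of "just enough IConvertible for Convert to work", here is a minimal sketch. SketchCell is a hypothetical type that wraps a string; the coarse TypeCode and the rejection of non-null IFormatProviders mirror the description above, but the parsing choices (invariant culture, the BCL Parse methods) are my assumptions and not Cesil's implementation.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for a cell; Cesil's real dynamic cell is more involved.
sealed class SketchCell : IConvertible
{
    private static readonly CultureInfo Inv = CultureInfo.InvariantCulture;
    private readonly string text;

    public SketchCell(string text) => this.text = text;

    // Only a null IFormatProvider is accepted, as described above.
    private string Text(IFormatProvider provider)
        => provider == null ? text : throw new NotSupportedException("Only null IFormatProviders are supported");

    // A very coarse answer: everything is reported as a string.
    public TypeCode GetTypeCode() => TypeCode.String;

    public bool ToBoolean(IFormatProvider p) => bool.Parse(Text(p));
    public byte ToByte(IFormatProvider p) => byte.Parse(Text(p), Inv);
    public char ToChar(IFormatProvider p) => char.Parse(Text(p));
    public DateTime ToDateTime(IFormatProvider p) => DateTime.Parse(Text(p), Inv);
    public decimal ToDecimal(IFormatProvider p) => decimal.Parse(Text(p), Inv);
    public double ToDouble(IFormatProvider p) => double.Parse(Text(p), Inv);
    public short ToInt16(IFormatProvider p) => short.Parse(Text(p), Inv);
    public int ToInt32(IFormatProvider p) => int.Parse(Text(p), Inv);
    public long ToInt64(IFormatProvider p) => long.Parse(Text(p), Inv);
    public sbyte ToSByte(IFormatProvider p) => sbyte.Parse(Text(p), Inv);
    public float ToSingle(IFormatProvider p) => float.Parse(Text(p), Inv);
    public string ToString(IFormatProvider p) => Text(p);
    public object ToType(Type conversionType, IFormatProvider p) => Convert.ChangeType(Text(p), conversionType, Inv);
    public ushort ToUInt16(IFormatProvider p) => ushort.Parse(Text(p), Inv);
    public uint ToUInt32(IFormatProvider p) => uint.Parse(Text(p), Inv);
    public ulong ToUInt64(IFormatProvider p) => ulong.Parse(Text(p), Inv);
}
```

Because the object overloads on Convert pass a null provider, something like Convert.ToInt32(new SketchCell("42")) goes through ToInt32(null) and parses the wrapped text.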

Just like with static deserialization, the ITypeDescriber on the Options used to create the IBoundConfiguration<TRow> controls how values are mapped to types. The differences are that dynamic conversions are discovered each time they occur (versus once, for static types) and conversion decisions are deferred until a cast (versus happening during reading, for static types). Dynamic deserialization does not allow custom InstanceProviders (as the dynamic backing infrastructure is provided directly by Cesil) – however the XXXWithReuse() methods on I(Async)Reader<TRow> still allow for some control over allocations.

Customization of dynamic conversions can be done with the DynamicRowConverter type (for rows) and the ITypeDescriber.GetDynamicCellParserFor() method (for cells). I’ll dig further into these capabilities in a later post. Out of the box, the DefaultTypeDescriber (used by Options.DynamicDefault) implements the conversions you would expect.

Namely, for dynamic rows Cesil’s defaults allow conversion to:

Object
Tuples
- Rows with more than 7 columns can be mapped to nested Tuples using the TRest generic parameter
ValueTuples, including those with a TRest parameter
- Rows with more than 7 columns can be mapped to nested ValueTuples using the TRest generic parameter
IEnumerable<T>
- Each cell is lazily converted to T
IEnumerable
- Each cell becomes an object, with no conversion occurring
Any type with a constructor taking the same number of parameters as the row has columns
- Each cell is converted to the expected parameter type
Any type with a constructor taking zero parameters, provided the row has column names
- Any properties (public or private, static or instance) whose name matches a column name will be set to the column’s value
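
The TRest nesting mentioned for Tuples and ValueTuples is a property of the BCL types themselves, so it can be shown without Cesil. The sketch below (TupleRestSketch is a name I made up) checks that a nine-element C# tuple is really a seven-element ValueTuple whose TRest slot holds the remaining two elements.

```csharp
using System;

static class TupleRestSketch
{
    // C# lets you write nine elements flat; the runtime type is nested via TRest.
    public static (int, int, int, int, int, int, int, int, int) NineColumns()
        => (1, 2, 3, 4, 5, 6, 7, 8, 9);

    public static bool IsNestedViaRest()
    {
        var row = NineColumns();
        object boxed = row;

        // Elements 8 and 9 live in a nested ValueTuple stored in Rest.
        return boxed is ValueTuple<int, int, int, int, int, int, int, ValueTuple<int, int>> nested
            && nested.Rest.Item1 == row.Item8
            && nested.Rest.Item2 == row.Item9;
    }
}
```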

If no conversion is possible, Cesil will raise an exception. If a conversion is chosen that requires converting cells to static values, those conversions may also fail and raise exceptions.

For dynamic cells, Cesil’s defaults allow conversion to:

Any type that has a public constructor which takes a single ReadOnlySpan<char> parameter
Any type that has a public constructor which takes a ReadOnlySpan<char> parameter, and an in ReadContext parameter
Any type that has a default Parser

As with rows, finding no conversion or having a conversion fail will cause Cesil to raise an exception.

And that covers the why and what of dynamic deserialization in Cesil. This post leaves me with two Open Questions:

As before, I’ve opened two issues to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Next time I’ll dive into the write operations Cesil supports, starting with static types.

Overthinking CSV With Cesil: Reading Known Types

Posted: 2020/06/02 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

The most common operation for a C# serialization library is usually reading into a known, static, type. That is, you’re given a stream or a blob of bytes and need to turn it into an instance of some type T. Cesil aims to make this common operation simple, fast, and customizable.

For cases where performance and customization are less important, CesilUtils exposes a bunch of EnumerateXXX methods. Both synchronous and asynchronous versions are available, but all methods return results lazily.

Maximum performance and flexibility is found in using either the IReader<TRow> or IAsyncReader<TRow> interfaces, obtained from an IBoundConfiguration<TRow> created via Configuration.For<TRow>. Unlike CesilUtils, using these interfaces lets you cache and reuse an IBoundConfiguration<TRow>, read comments, and reuse rows.

Concretely, I(Async)Reader<TRow> methods let you:

Lazily enumerate rows with EnumerateAll(Async)
- The async version returns an IAsyncEnumerable<T>, which is new to C# 8
Eagerly read rows with ReadAll(Async)
- You can also control the collection read into with specific overloads
Read a single row with TryRead(Async)
- You can reuse an already allocated row with TryReadWithReuse(Async)
Read a row or a comment with TryReadWithComment(Async)
- As above, you can reuse an already allocated row with TryReadWithCommentWithReuse(Async)
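
To show the shape of lazy asynchronous enumeration in general, here is a minimal sketch that uses only the BCL. It yields raw lines from a TextReader rather than parsed rows, and AsyncLinesSketch and its methods are hypothetical names, not part of Cesil.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class AsyncLinesSketch
{
    // Each line is produced only when the consumer asks for the next element.
    public static async IAsyncEnumerable<string> ReadLinesAsync(TextReader reader)
    {
        string line;
        while ((line = await reader.ReadLineAsync()) != null)
        {
            yield return line;
        }
    }

    // await foreach is the consuming half of IAsyncEnumerable<T>.
    public static async Task<int> CountLinesAsync(TextReader reader)
    {
        var count = 0;
        await foreach (var line in ReadLinesAsync(reader))
        {
            count++;
        }
        return count;
    }
}
```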

Determining which members on the given TRow type map to which columns, how those columns should be parsed, and how members should be set is done with the ITypeDescriber registered on the Options provided to Configuration.For<TRow> or the method on CesilUtils (by default, this is an instance of DefaultTypeDescriber). When an IBoundConfiguration<TRow> is created, ITypeDescriber.EnumerateMembersToDeserialize is invoked once and the returned DeserializableMembers detail how Cesil will map rows of data to TRow instances.

Precisely, you can specify:

The name of the column a member maps to
- If a CSV lacks a header row, the order of the DeserializableMembers will be used to match columns instead
The Parser to use to turn a ReadOnlySpan<char> into a specific type
An (optional) Reset to call before setting a member
The Setter to use to place the type created by the Parser on a member of TRow
Whether or not a member is required

A separate call to ITypeDescriber.GetInstanceProvider will be made to obtain an InstanceProvider which is used to get TRow instances needed when reading a row. While the call to get the InstanceProvider always happens, the InstanceProvider won’t be used if the XXXWithReuse methods are called with a non-null TRow reference. InstanceProviders allow you to implement sophisticated row re-use or initialization logic that a simple “ref TRow” isn’t adequate for.

There’s a great deal of flexibility in how InstanceProviders, Parsers, Resets, and Setters can be created which will be covered in a later post.

Internally, Cesil models reading a CSV as transitions through a state machine. Each character read is mapped to a CharacterType (one of EscapeStartAndEnd, Escape, ValueSeparator, CarriageReturn, LineFeed, CommentStart, Whitespace, Other, and DataEnd), which is then used in conjunction with the current State to look up a TransitionRule. TransitionRules specify the new State as well as an AdvanceResult, which instructs Cesil to take certain actions (like skipping the character, appending a character to the read buffer, finishing a column or row, etc.). Only the mapping from char to CharacterType depends on the configured Options, so Cesil pre-allocates and reuses the TransitionRules that back the state machine.
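
A heavily simplified sketch of the idea, to show how a character class plus the current state selects the next state and an action. Everything here is hypothetical: the enums are a small subset with made-up names, the rules hard-code comma and double quote, carriage returns are treated as ordinary characters, and none of it is Cesil's actual state machine.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

enum CharType { EscapeStartAndEnd, ValueSeparator, LineFeed, Other, DataEnd }
enum State { StartOfValue, InUnescaped, InEscaped, AfterQuoteInEscaped, RowDone }
enum Advance { Skip, Append, FinishValue, FinishRow }

static class TinyCsvStateMachine
{
    // The only part that would depend on configuration in this sketch.
    static CharType Classify(char c) => c switch
    {
        '"' => CharType.EscapeStartAndEnd,
        ',' => CharType.ValueSeparator,
        '\n' => CharType.LineFeed,
        _ => CharType.Other
    };

    // (current state, character type) -> (next state, action to take)
    static (State Next, Advance Action) Transition(State state, CharType type) => (state, type) switch
    {
        (State.InEscaped, CharType.EscapeStartAndEnd) => (State.AfterQuoteInEscaped, Advance.Skip),
        (State.InEscaped, CharType.DataEnd) => throw new FormatException("Unterminated escaped value"),
        (State.InEscaped, _) => (State.InEscaped, Advance.Append),
        // A doubled quote inside an escaped value is a literal quote.
        (State.AfterQuoteInEscaped, CharType.EscapeStartAndEnd) => (State.InEscaped, Advance.Append),
        (State.AfterQuoteInEscaped, CharType.Other) => throw new FormatException("Unexpected character after closing quote"),
        (State.StartOfValue, CharType.EscapeStartAndEnd) => (State.InEscaped, Advance.Skip),
        (_, CharType.ValueSeparator) => (State.StartOfValue, Advance.FinishValue),
        (_, CharType.LineFeed) => (State.RowDone, Advance.FinishRow),
        (_, CharType.DataEnd) => (State.RowDone, Advance.FinishRow),
        _ => (State.InUnescaped, Advance.Append)
    };

    // Reads the first row of the given text, one character at a time.
    public static List<string> ReadRow(string text)
    {
        var row = new List<string>();
        var pending = new StringBuilder();
        var state = State.StartOfValue;

        for (var i = 0; i <= text.Length && state != State.RowDone; i++)
        {
            var type = i == text.Length ? CharType.DataEnd : Classify(text[i]);
            Advance action;
            (state, action) = Transition(state, type);

            switch (action)
            {
                case Advance.Append:
                    pending.Append(text[i]);
                    break;
                case Advance.FinishValue:
                case Advance.FinishRow:
                    row.Add(pending.ToString());
                    pending.Clear();
                    break;
                // Advance.Skip: the character is consumed without being recorded
            }
        }

        return row;
    }
}
```

In this sketch the transition function is a switch expression; a pre-allocated lookup table indexed by state and character type would be the closer analogue to reusable TransitionRules.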

Although Cesil’s state machine progresses one character at a time, Cesil reads multiple characters at a time in order to maximize performance and better match modern C# interfaces like PipeReader. Control over the read buffer’s size is provided through ReadBufferSizeHint. Cesil also batches certain common AdvanceResults, like skipping or appending characters, so that the overhead of certain method calls is minimized in hot paths.

Taken altogether, and at a very high level, when Cesil reads a single row this is what happens:

1. Characters are read into the read buffer, if it is empty
   1. If there are no more characters to read into the buffer, proceed as if we have read a single DataEnd CharacterType
2. If no instance of TRow has been provided, Cesil obtains one using the InstanceProvider
3. For each character in the read buffer…
   1. The character is mapped to a CharacterType
   2. The current State and CharacterType are used to find the next State and an AdvanceResult
      1. If the AdvanceResult is batchable, note is made of it but no action is taken
      2. If the AdvanceResult is not batchable, any pending batched actions are taken and then the new action is taken
         1. If the AdvanceResult finishes a value, the current pending value is Parsed, the Reset for the current column is called (if it exists), and the Setter is called
         2. If the AdvanceResult finishes a record, we return the row and are finished
   3. Remove the read character from the buffer
4. If we haven’t returned a row, go back to step 1

There are a few consequences of this design:

There can be pending data in the read buffer when a row is returned, which means that you cannot use Cesil to read “up to a particular row” in the underlying data stream. Once Cesil starts reading, no guarantees are made about the state of the underlying stream.
For maximum performance it’s worth reusing IBoundConfigurations, as a decent amount of reflection and lookup creation happens when one is created. All I(Async)Readers that one creates will reuse that work, making a cache very efficient.
In asynchronous cases, Cesil will await only when the read buffer is empty and cannot be filled without blocking. This means that Cesil can “go async” much less frequently than might naively be expected, were it to be reading characters one at a time.

Finally, Cesil does offer support for reading whole line CSV comments. Although non-standard and rather rare, they arise often enough to be worth supporting. The reader interfaces expose TryReadWithComment(WithReuse)(Async) methods that return a ReadWithCommentResult, a tagged union type that wraps the comment or row read. In order to read comments, Options.CommentCharacter must have been set when the IBoundConfiguration<TRow> was created – calling any of the XXXWithComment methods when it has not been set will raise an exception. If a comment is encountered when a non-XXXWithComment method is invoked, but Options was configured with comment support, the comment will be silently skipped.
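
For readers unfamiliar with the term, a tagged union holds exactly one of several alternatives plus a tag saying which. The sketch below shows one way to express that in C# for "a row, a comment, or nothing"; SketchReadResult and SketchResultType are hypothetical names and shapes, not the definition of ReadWithCommentResult.

```csharp
using System;

enum SketchResultType { NoValue, HasValue, HasComment }

// Hypothetical tagged union: the tag decides which payload may be read.
readonly struct SketchReadResult<TRow>
{
    private readonly TRow value;
    private readonly string comment;

    public SketchResultType ResultType { get; }

    private SketchReadResult(SketchResultType type, TRow value, string comment)
    {
        ResultType = type;
        this.value = value;
        this.comment = comment;
    }

    // default(...) has the tag NoValue because it is the enum's zero value.
    public static SketchReadResult<TRow> Empty => default;
    public static SketchReadResult<TRow> ForValue(TRow row) => new SketchReadResult<TRow>(SketchResultType.HasValue, row, null);
    public static SketchReadResult<TRow> ForComment(string text) => new SketchReadResult<TRow>(SketchResultType.HasComment, default, text);

    // Reading the wrong payload is a programming error, so it throws.
    public TRow Value => ResultType == SketchResultType.HasValue ? value : throw new InvalidOperationException("No value present");
    public string Comment => ResultType == SketchResultType.HasComment ? comment : throw new InvalidOperationException("No comment present");
}
```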

That wraps up what static deserialization looks like in Cesil.

The Open Question for this post is the same as the previous post, but with a particular focus on reading: Is there anything missing from IReader(Async) that you’d expect to be supported in a modern .NET CSV library?

This question has already led to some planned changes, namely removing the class constraint on I(Async)Reader’s TCollection generic parameter, and adding comment writing methods that take ReadOnlySpan<char> and ReadOnlyMemory<char> parameters.

Next time I’ll be discussing reading dynamic types, and why I think that’s still worth supporting in 2020…

Overthinking CSV With Cesil: A “Modern” Interface

Posted: 2020/05/29 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

Part of Cesil’s raison d’être is to be a “modern” library for CSV, one which takes advantage of all the fancy new additions in recent C# and .NET Core versions. What exactly is “modern” is debatable, so this post lays out my particular take.

To make things concrete, the “main” interfaces for Cesil are split into:

Configuration – with the Options (and its Builder) and Configuration classes
- The static Configuration class produces instances of the IBoundConfiguration<TRow> interface
Reading – with the IReader<TRow> and IAsyncReader<TRow> interfaces
- Each interface has a way to read single rows, lazily enumerate all rows, greedily read all rows, and read all rows into the provided collection
Writing – with the IWriter<TRow> and IAsyncWriter<TRow> interfaces
- Each interface has a way to write a single row, write several rows lazily, and write several rows greedily.
Utilities – with numerous methods on the CesilUtils static class
- These methods provide single call ways to read and write collections of rows at the expense of some efficiency
Type Describing – with many types describing things like “creating rows” and “getting members”
- These will be covered in detail in a later post

The first thing you’ll notice when using Cesil is that it splits setup into two logical steps, building Options and binding Configurations. Options cover all the generally reusable parts of working with CSVs (things like separators, and memory pools), while Configurations represent a binding of Options to a particular Type. Binding a type implies a fair amount of work, in particular a decent amount of reflection to determine columns. By separating Options and Configurations, Cesil allows easy and efficient reuse of the “cheap” parts of a setup while giving control over when the expensive parts happen.

You’ll also quickly notice that Cesil tends to hand you interfaces instead of base classes. This is a consequence of my belief that encouraging inheritance in end user code is generally a mistake, combined with a desire to keep implementation details hidden. Thus Cesil exposes IReader<TRow> rather than SyncReaderBase<TRow>, and nearly every exported class is sealed.

Cesil splits reading and writing into separate interfaces in a manner similar to the recent System.IO.Pipelines namespace. Coupling reading and writing would mean that certain operations would be allowed by the type system even if they couldn’t possibly work at runtime – say, writing to something that was backed by a ReadOnlySequence<T>. The BCL has some examples of this failure, like Stream, whose Remarks call out that “Depending on the underlying data source or repository, streams might support only some of these capabilities”. Effectively this means that there are methods on all Streams that cannot be safely called in all cases, and that is a poor design choice to make in 2020.
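
The Stream problem is easy to reproduce with the BCL alone. In this small sketch (StreamCapabilitySketch is a made-up name) the type system happily allows a write to a read-only MemoryStream, and the failure only shows up at runtime.

```csharp
using System;
using System.IO;

static class StreamCapabilitySketch
{
    // Returns true if writing to a read-only stream compiles but fails at runtime.
    public static bool WriteFailsOnlyAtRuntime()
    {
        using var readOnly = new MemoryStream(new byte[] { 1, 2, 3 }, writable: false);

        if (readOnly.CanWrite)
        {
            return false;
        }

        try
        {
            // Nothing at compile time stops this call.
            readOnly.WriteByte(4);
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }
}
```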

Asynchronous and synchronous operations also get separate interfaces rather than one shared one. While not as footgun-y as mixing reading and writing, mixing synchronous and asynchronous operation is fraught with potential for error – either in correctness (such as starting synchronous operations while asynchronous ones are pending completion) or performance (such as sync-over-async). The potential for error increases with the introduction of IAsyncDisposable and await using, since the synchronous nature of IDisposable and using can now be hidden in otherwise asynchronous code. Accordingly, all methods on IAsyncReader<TRow> and IAsyncWriter<TRow> are asynchronous and all methods on IReader<TRow> and IWriter<TRow> are synchronous – the former two implement IAsyncDisposable and the latter implement IDisposable.
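
Here is a minimal sketch of what "asynchronous only" looks like from the consumer's side. AsyncOnlyLineWriter is a hypothetical type that wraps a TextWriter and is not one of Cesil's interfaces; the point is just that implementing IAsyncDisposable without IDisposable makes a plain using a compile error, so disposal has to go through await using.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical async-only writer; note that it does not implement IDisposable.
sealed class AsyncOnlyLineWriter : IAsyncDisposable
{
    private readonly TextWriter inner;

    public AsyncOnlyLineWriter(TextWriter inner) => this.inner = inner;

    public Task WriteLineAsync(string line) => inner.WriteLineAsync(line);

    public async ValueTask DisposeAsync()
    {
        await inner.FlushAsync();
        await inner.DisposeAsync();
    }
}

static class AsyncOnlySketch
{
    public static async Task<string> WriteTwoLinesAsync()
    {
        var target = new StringWriter();

        // "using (var writer = ...)" would not compile here: there is no Dispose().
        await using (var writer = new AsyncOnlyLineWriter(target))
        {
            await writer.WriteLineAsync("a,b");
            await writer.WriteLineAsync("c,d");
        }

        return target.ToString();
    }
}
```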

Other, less immediately obvious, choices made in Cesil:

Most types are immutable, and all immutable types implement IEquatable<T>
- Mutability is a footgun in the highly concurrent code that is increasingly common, and so is avoided everywhere possible
Relatively few primitives are in the interface; enums (like EmitDefaultValue) and semantic wrappers (like ColumnIdentifier) are preferred
- Primitive types are easy to accidentally misuse and harder to read (ie. what does “true” mean when passed to method “Foo”)
Comments in CSVs are read and written with specific methods (TryReadWithComment(Async) and WriteComment(Async)); by default they are ignored when read (even if supported by a set of Options)
- Comments are relatively rare, so the basic operations shouldn’t be encumbered by having to deal with them
- They must be different methods because the implicit type of all comments is `string` not TRow
Recently introduced types like ReadOnlySequence<T>, IBufferWriter<T>, PipeReader, and PipeWriter have first class support
- Older types like TextReader and TextWriter are also supported, since these are still supported in the BCL and lots of code continues to use them

Having spelled out Cesil’s read and write interfaces leads to the second Open Question: Is there anything missing from IReader(Async) and IWriter(Async) that you’d expect to be supported in a modern .NET CSV library?

As before, I’ve opened an Issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Now that we’ve covered, at a very high level, the overall interface for Cesil, the next post will dig into how reading static types works in detail…

Overthinking CSV With Cesil: CSV Isn’t A Thing

Posted: 2020/05/28 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

For those who read my previous post, when you read “CSV library” you likely had one of two thoughts depending on how much exposure you’ve had to CSV files – either:

Dealing with CSVs is so simple, how much could there be to write about?
Dealing with CSVs is insanely complicated, why would you ever do that?

My day job is running a data team so I’m firmly in camp #2, lots of things run on CSV and it’s crazy complicated. Fundamentally this is because CSV isn’t a format, it’s a family of related formats. If you work with arbitrary CSV files long enough, you’ll eventually encounter one that doesn’t even use commas for separators.

Like most weird things, this is a consequence of history. The idea of CSV dates back at least 40 years, while the RFC “standardizing” it is from 2005. That’s a lot of time for different versions to flourish.

To get more concrete, CSV is a subset of the Delimiter Separated Values (DSV) family of tabular data formats – one which often (but not always!) uses commas to separate values. The most common variant is almost certainly that produced by Microsoft Excel (on Windows, in an English locale) – it uses commas to separate values, double quotes to start escaped values, double quotes to escape within escaped values, and the carriage-return line-feed character sequence to end a row.
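
As a small illustration of that Excel-style variant, the sketch below writes one row using commas, double quotes for escaping, doubled double quotes inside escaped values, and \r\n as the row ending. ExcelStyleCsvSketch is a made-up helper built on the BCL, not how Cesil writes rows, and it ignores edge cases like leading whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ExcelStyleCsvSketch
{
    private static readonly char[] NeedsEscaping = { ',', '"', '\r', '\n' };

    // Wrap the value in double quotes only when it contains a special character,
    // doubling any double quotes inside it.
    public static string EscapeValue(string value)
    {
        if (value.IndexOfAny(NeedsEscaping) < 0)
        {
            return value;
        }

        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Values are separated by commas and the row is ended with \r\n.
    public static string FormatRow(IEnumerable<string> values)
        => string.Join(",", values.Select(EscapeValue)) + "\r\n";
}
```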

Cesil aims to support all “reasonable” DSV formats, with defaults for the most common kind of CSV. A later post will go into exactly how flexible Cesil can be, but from a format perspective Cesil can handle:

Any single character value separator
Either no way to escape a value, or a single character starting and stopping escaped values
Either no way to escape a character within an escaped value, or a single character escape
Any of the \r, \n, or \r\n character sequences for ending a row
No comments, or “whole row” comments
Optional leading or trailing whitespace around values
Requiring a header row, forbidding a header row, or making a header row optional

This flexibility makes it possible to handle relatively standard things like Tab Separated Value (TSV) files, or CSV files which use an unusual character for escaping, as well as kind of crazy things like CSVs using semicolons to separate values, or where values have been visually aligned with whitespace. All of this functionality, and much more, is configured with Cesil’s Options and OptionsBuilder classes.

And now we encounter Cesil’s first Open Question: Do these options provide adequate flexibility?

I’ve opened an Issue to gather long form responses. Remember that, as part of the sustainable open source experiment I detailed in the first post of this series, any commentary from a Tier 2 GitHub Sponsor will be addressed in a future comment or post. Feedback from non-sponsors will receive equal consideration, but may not be directly addressed.

Now that I’ve covered the formats Cesil can handle, in the next post I will cover the whats and whys of the interface it exposes…

Overthinking CSV With Cesil: An Introduction

Posted: 2020/05/28 | Author: kevinmontrose | Filed under: code | Tags: cesil | Comments Off

~~Several months ago~~ About a year ago (how time flies) I decided to spin up a new personal project to get familiar with all the new goodies in C# 8 and .NET Core 3. I happened to be dealing with some frustrating CSV issues at the time, so the project was a CSV library.

Once I got into the meat of the project, I started really overthinking things. The end result was Cesil – a pre-release package is available on NuGet, source is on GitHub, and it’s got reference documentation and a prose wiki. It’s released under the MIT license.

When I say I was overthinking things, I mean that rather than build a toy just for my own edification I ended up trying to do The Right Thing™ for a .NET library released in 2020. This, at least 14 part, blog series will cover exactly what that entailed but in short I committed to:

Async as a first class citizen
Maximum consumer flexibility
Extensive documentation
Comprehensive test coverage
Adopting C# 8 features
Modern patterns and conventions
Efficiency, especially in terms of allocations

Interpretations of each of those points can be a matter of opinion, and I’m not going to claim to have 100% correct opinions. I attempted to record both things I consider opinions and open questions, both of which I’ll expound upon as this series continues.

My hope is that Cesil is easy to use, hard to misuse, handles the common cases out of the box, and can be configured to handle almost anything you might want to do with CSV. I intend to respond to feedback and make changes as needed over the course of this series to make it more likely those hopes are realized.

A final bit of overthinking on the whole project has been around sustainable open source. There’s been a fair amount of discussion on the subject, the gist of which is that loads of people and companies benefit from volunteers doing skilled work without compensation – and that is an unsustainable practice. As a small experiment in line with these thoughts, I’ve set up GitHub Sponsors for Cesil with a few low commitment tiers. I’ll both be using the tiers to prioritize responding to some feedback, and reporting on the results of this experiment towards the end of the blog series.

Now with the introduction out of the way, I’m ready to dive into technical bits in the next post…

Adding Static Code Analysis to Stack Overflow

Posted: 2019/10/04 | Author: kevinmontrose | Filed under: code | Comments Off

As of September 23rd 2019 we’re applying static analysis to some of the code behind public Stack Overflow, Stack Overflow for Teams, and Stack Overflow Enterprise in order to pre-emptively find and eliminate certain kinds of vulnerabilities. How we accomplished this is an interesting story, and also illustrative of advancements in .NET’s open source community.

But first…

What did we have before static analysis?

The Stack Overflow codebase has been under continuous development for around a decade, starting all the way back on ASP.NET MVC Preview 2. As .NET has advanced we’ve adopted tools that encourage safe practices like Razor (which defaults to encoding strings, helping protect against cross site scripting vulnerabilities). We’ve also created new tools that encourage doing things the Right Way™, like Dapper which handles parameterizing SQL automatically while still being an incredibly performant (lite-)ORM.

An incomplete, but illustrative, list of default-safe patterns in our codebase:

Automated SQL parameterization with Dapper
Default encoding strings in views with Razor
Requiring cross site request forgery (XSRF) tokens for non-idempotent (ie. POST, PUT, DELETE, etc.) routes by default
HMACs with default expirations and common validation code
Adopting TypeScript, an ongoing process, which increases our confidence around shipping correct JavaScript
Private data, for Teams and Enterprise, is on separate infrastructure with separate access controls

In essence we were safe, at least in theory, from most classes of injection and cross site scripting attacks.

So, …

What did static analysis give us?

In large part, confidence that we were consistently following our pre-established best practices. Even though our engineers are talented and our tooling is easy to use, we’ve had dozens of people working on Stack Overflow for 10+ years – inevitably some mistakes slipped into the codebase. Accordingly most fixes were just moving to doing something “the right way,” and pretty minor. Things like “use our route registration attribute, instead of [HttpPost]” or “remove old uses of SHA1, and switch to SHA256”.

The more “exciting” fixes required introducing new patterns, and updating old code to use them. While we had no evidence that any of these were exploited, or even exploitable in practice, we felt it was best to err on the side of caution and address them anyway. We added three new patterns as part of adopting static code analysis:

We replaced uses of System.Random with an equivalent interface backed by System.Security.Cryptography.RandomNumberGenerator.
1. It is very hard to prove whether a predictable random number is or isn’t safe in a given use, so we standardized on numbers that are always hard to predict.
We now default to forbidding HTTP redirects to domains we do not control, requiring all exceptions be explicitly documented.
1. The concern here is open redirects, which can be used for phishing or other malicious purposes.
2. Most of our redirects were already appropriately validating this, but the checks were scattered across the code base. There were a few missing or buggy checks, but we found no evidence of them being exploited.
We strengthened XSRF checks to account for cases where users move between unauthenticated and authenticated states.
1. Our XSRF checks previously assumed there was a single token tied to a user’s identity. Since this changes during authentication, some of our code suppressed this check and relied on other validation (completing an OAuth flow, for example).
2. Even though all cases did have some kind of XSRF prevention, having any opt-outs of our default XSRF checking code is risky – so we decided to improve our checks to handle this case. Our fix was to allow two tokens to be acceptable, briefly, on certain routes.
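
To give a feel for the first two patterns, here is a minimal sketch using only the BCL. It is not Stack Overflow's code: SafeDefaultsSketch, the allow-listed hosts, and the exact redirect rules are all assumptions made for illustration. RandomNumberGenerator.GetInt32 is available from .NET Core 3.0 onwards.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static class SafeDefaultsSketch
{
    // Same shape as Random.Next(min, max), but backed by a cryptographic generator.
    public static int NextInt(int minInclusive, int maxExclusive)
        => RandomNumberGenerator.GetInt32(minInclusive, maxExclusive);

    // Hypothetical list of domains "we control".
    private static readonly HashSet<string> AllowedHosts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "example.com", "www.example.com" };

    // Default-deny: only site-relative paths and allow-listed hosts are accepted.
    public static bool IsSafeRedirect(string target)
    {
        if (string.IsNullOrEmpty(target))
        {
            return false;
        }

        // "/questions" is fine; "//evil.test" and "/\evil.test" can be treated as other hosts by browsers.
        if (target[0] == '/')
        {
            return target.Length == 1 || (target[1] != '/' && target[1] != '\\');
        }

        return Uri.TryCreate(target, UriKind.Absolute, out var uri)
            && (uri.Scheme == Uri.UriSchemeHttps || uri.Scheme == Uri.UriSchemeHttp)
            && AllowedHosts.Contains(uri.Host);
    }
}
```

Centralizing the check in one place like this is what makes explicit, documented exceptions practical, compared with validation scattered across controllers.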

Our checks run on every pull request for Stack Overflow, and additionally (and explicitly) on every Enterprise build – meaning we aren’t just confident that we’re following our best practices today but we’re confident we will keep following them in the future.

In terms of Open Web Application Security Project (OWASP) lists, we gained automatic detection of:

SQL injection attacks [2017 #1]
XML external entity (XXE) attacks [2017 #4]
Cross site scripting (XSS) attacks [2017 #7]
Insecure deserialization [2017 #8]
XSRF attacks [2013 #8]
Open redirects [2013 #10]

That wraps up what we found and fixed, but…

How did we add static code analysis?

This is boring because all we did was write a config file and add a PackageReference to SecurityCodeScan.

That’s it – Visual Studio will pick it up as an analyzer (so you get squigglies) and the C# compiler will do the same so you get warnings or errors (we treat all warnings as errors).

Not real code, ’cause by the time I thought to take a screenshot we’d already fixed everything.

Far more interesting is all the open source stuff that made this possible:

In 2014 Microsoft open sourced Roslyn, their C# and VB.NET compiler
Visual Studio 2015 ships with support of Roslyn analyzers
The authors of Security Code Scan start work in 2016
I contribute some minor changes to accommodate Stack Overflow peculiarities in 2019

If you’d told me 6 years ago that we’d be able to add any sort of code analysis to the Stack Overflow solution: trivially, for free, and in a way that contributes back to the greater developer community – I wouldn’t have believed you. It’s great to see “the new Microsoft’s” behavior benefit us so directly, but it’s even greater to see what the OSS community has built because of it.

We’ve only just shipped this, which raises the question…

What’s next with static code analysis?

Security is an ongoing process, not a bit you flip or a feature you add. Accordingly there will always be more to do and places we want to make improvements, and static code analysis is no different.

As I alluded to at the start, we’re only analyzing some of the code behind Stack Overflow. More precisely we’re not analyzing views, or tracing through inter-procedural calls – and analyzing both is an obvious next step.

We’ll be able to start analyzing views once our migration to ASP.NET Core is complete. Pre-Core Razor view compilation doesn’t give us an easy way to add any analyzers, but that should be trivial once we’re upgraded. Razor’s default behavior gives us some confidence around injection attacks, and views usually aren’t doing anything scary – but it will be nice to have stronger guarantees of correctness in the future.

Not tracing through inter-procedural calls is a bit more complicated. Technically this is a limitation of Security Code Scan, and there’s an issue for it. That we can’t analyze views reduces the value of inter-procedural analysis today, since we almost always pass user-provided data into views. For now, we’re comfortable focusing on our controller action methods since basically all user-provided data passes through them before going onto views or other inter-procedural calls.

The beauty of open source is that when we do come back and do these next steps (and any other quality of life changes), we’ll be making them available to the community so everyone benefits. It’s a wonderful thing to be able to benefit ourselves, our customers, and .NET developers everywhere – all at the same time.

Kevin Montrose