Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CsvTextFieldFormatter (#18) #19

Merged
merged 3 commits into from
Jan 8, 2023
Merged

Add CsvTextFieldFormatter (#18) #19

merged 3 commits into from
Jan 8, 2023

Conversation

22222
Copy link
Owner

@22222 22222 commented Dec 4, 2022

Adds a formatter to get a CSV file back from the parser's output.

var specialChars = specialCharsCache;
if (specialChars == null)
{
specialChars = new SortedSet<char> { delimiterChar, quoteChar, quoteEscapeChar, '\n', '\r' }.ToArray();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, why go via SortedSet instead of simply new char[]{ delimiterChar, quoteChar, quoteEscapeChar, '\n', '\r'}

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason is that the quote and quote escape character are the same character by default. The SortedSet will remove the duplicate.

It's not 100% essential to do that, as the IndexOfAny method won't break if there are duplicates. And there are other ways the duplicates could be avoided, like using the LINQ Distinct method or a more manual check.

{
get
{
return new string[] { delimiterChar.ToString(CultureInfo.InvariantCulture) };
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use ToString(CultureInfo.InvariantCulture) here?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's to satisfy the .NET Analyzer warning CA1304: Specify CultureInfo.

@osexpert
Copy link

osexpert commented Dec 4, 2022

I did not test it (yet) but looks promising. I added 2 comments. I miss option to escape always (or similar). Reason is that in case leading or trailing white space need to be preserved the spec says it should be quoted. I guess this problem can be solved in 2 ways:
-an option to quote always. pro: easy to implement, con: will quote strings that don't need it
-an option to quote if leading\trailing white space (eg. bool PreserveLeadingAndTrailingWhiteSpace). pro: will only quote string that need it, con: a little more code\logic to implement to check if string start or end with whitespace.

But I guess the logic won't be that hard, maybe like:
bool aBool = PreserveLeadingAndTrailingWhiteSpace && (char.IsWhiteSpace(value[0]) || char.IsWhiteSpace(value[value.Length - 1])));

Edit: I see since the parser has a bool TrimWhiteSpace, the bool in the formatter should simply be named PreserveWhiteSpace (leading\trailing implied)
Edit2: PreserveWhiteSpace sound a bit weird thou. It make it feel like the default is to trim the strings...but this is not what is done. So maybe bool EscapeWhiteSpace is the best naming


private void WriteField(string field)
{
var isQuoteNecessary = field.IndexOfAny(SpecialChars) >= 0;
Copy link

@osexpert osexpert Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I suggested in the comment:

        /// <summary>
        /// Some csv consumers trim leading and trailing white space from fields unless they are escaped.
        /// </summary>
        public bool EscapeWhiteSpace { get; set; }

and

var isQuoteNecessary = field.IndexOfAny(SpecialChars) >= 0 ||
 (EscapeWhiteSpace && field.Length > 0 && (char.IsWhiteSpace(field[0]) || char.IsWhiteSpace(field[field.Length - 1])));

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been going back and forth on this in my head for a few days, and I think I've landed on preferring an option to always quote over an option that's just about whitespace handling.

My thoughts on why:

The parser in this library matches the behavior of Microsoft's VisualBasic TextFieldParser, and that parser doesn't consider quotes at all when it comes to whitespace trimming. Depending on the TrimWhiteSpace option, it either always preserves whitespace (even in unquoted fields) or always trims whitespace (even if the field is enclosed in quotes).

So if the goal of the formatter is to act as a companion to the parser, I don't think an option to quote for leading or trailing whitespace is a good fit.

However, I do think an option to always quote would be reasonable. The parser already has a "HasFieldsEnclosedInQuotes" option, so it already has a concept of a CSV file never using quotes. A corresponding "ForceFieldsEnclosedinQuotes" option would cover the cases where you want to use the formatter with other parsers that handle leading/trailing whitespace differently, and might even handle other situations where you'd want to require quotes (like maybe a parser collapses whitespace in the middle of a field if it's unquoted).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The white space logic is too specific.

@22222 22222 merged commit e2dd94b into master Jan 8, 2023
@22222 22222 deleted the formatter branch January 8, 2023 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants