This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

CSV writing: distinguish between missing and empty strings #865

Closed
ritchie46 opened this issue Feb 25, 2022 · 1 comment
Labels
no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@ritchie46
Collaborator

Currently the CSV writer encodes missing data and empty strings identically. Because a CSV field may be quoted, we could use a quoted empty field ("") to represent the empty string and an unquoted empty field to represent missing data.

2 empty strings

"",""\n

2 missing data

,\n
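
To show that the two encodings above stay distinguishable on the reading side, here is a minimal sketch of a line parser that remembers whether a field was quoted. The function name `parse_line` is hypothetical and not part of any library; it assumes single-line records and RFC 4180 style doubled quotes.

```python
from typing import List, Optional


def parse_line(line: str, delimiter: str = ",") -> List[Optional[str]]:
    """Split one CSV line; unquoted empty -> None, quoted empty -> ""."""
    fields: List[Optional[str]] = []
    chars: List[str] = []
    quoted = False     # did the current field start with a quote?
    in_quotes = False  # are we currently inside a quoted section?
    line = line.rstrip("\n")

    def finish() -> Optional[str]:
        # An empty field is "missing" only if it was never quoted.
        if not chars and not quoted:
            return None
        return "".join(chars)

    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    chars.append('"')  # doubled quote is an escaped quote
                    i += 1
                else:
                    in_quotes = False
            else:
                chars.append(c)
        elif c == '"':
            in_quotes = True
            quoted = True
        elif c == delimiter:
            fields.append(finish())
            chars = []
            quoted = False
        else:
            chars.append(c)
        i += 1
    fields.append(finish())
    return fields


print(parse_line('"",""\n'))  # two empty strings
print(parse_line(",\n"))      # two missing values
```

Note that with a single column, a missing value would be an entirely empty line under this convention, which many readers skip; that edge case would need its own decision.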

Current behavior:

If we encode a single column, we seem to quote both missing and empty values:

import io
import polars as pl

f = io.BytesIO()
pl.DataFrame({
    "s1": ["", None],
}).to_csv(f)

f.seek(0)
print(f.read())
b's1\n""\n""\n'

However, if we encode multiple columns, the behavior changes and neither value is quoted:

f = io.BytesIO()
pl.DataFrame({
    "s1": ["", None],
    "s2": ["", None],
}).to_csv(f)

f.seek(0)
print(f.read())
b's1,s2\n,\n,\n'

I think we should also be consistent no matter how many columns we encode.
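
As a point of comparison, Python's standard `csv` module shows the same pattern: a record consisting of a single empty field is quoted (otherwise the line would be completely empty), while empty fields in a multi-column record are left bare, and `None` is written the same way as an empty string. This is only an analogy; that the Rust csv crate quotes for the same reason is an assumption. The helper name `rows_to_csv` is made up for the sketch.

```python
import csv
import io


def rows_to_csv(rows):
    """Write rows with the stdlib csv writer and return the text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# One column: both the empty string and None come out as a quoted field.
single = rows_to_csv([[""], [None]])
# Two columns: neither value is quoted.
double = rows_to_csv([["", ""], [None, None]])

print(repr(single))
print(repr(double))
```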

@ritchie46
Collaborator Author

This turns out to be harder than I thought in the current setup. We dispatch the &[u8] to the csv crate, which handles escaping. It has no notion of a missing string vs. an empty string.

One thing that could help is doing the CSV encoding (quoting and escaping) only in the string serializers. This gives us more control and will likely also be faster, since we would no longer run the encoding pass over other data types. At the moment the whole row is passed through the encoder, which is a wasted scan when the row contains no string data.

When writing the whole row, we can just slam a delimiter between the serialized fields and end with a new line.
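
A rough sketch of that design, in Python rather than Rust to match the snippets above. All names here (`serialize_int`, `make_string_serializer`, `write_rows`) are hypothetical and do not correspond to the actual serializer API; the point is only that quoting lives in the string serializer, while the row writer just joins already-serialized fields.

```python
import io
from typing import BinaryIO, Callable, Optional, Sequence

Serializer = Callable[[object], bytes]


def serialize_int(value: Optional[int]) -> bytes:
    # Non-string types never need quoting, so no escaping scan happens here.
    return b"" if value is None else str(value).encode("ascii")


def make_string_serializer(delimiter: bytes = b",") -> Serializer:
    special = (delimiter, b'"', b"\n", b"\r")

    def serialize(value: Optional[str]) -> bytes:
        if value is None:
            return b""  # missing: unquoted empty field
        raw = value.encode("utf-8")
        if raw == b"" or any(s in raw for s in special):
            # Empty string or special characters: quote, doubling inner quotes.
            return b'"' + raw.replace(b'"', b'""') + b'"'
        return raw

    return serialize


def write_rows(
    out: BinaryIO,
    columns: Sequence[Sequence[object]],
    serializers: Sequence[Serializer],
    delimiter: bytes = b",",
) -> None:
    # The row writer only joins serialized fields and appends a newline.
    for row in zip(*columns):
        out.write(delimiter.join(ser(v) for ser, v in zip(serializers, row)))
        out.write(b"\n")


buf = io.BytesIO()
write_rows(
    buf,
    columns=[[1, None, 3], ["", None, "a,b"]],
    serializers=[serialize_int, make_string_serializer()],
)
print(buf.getvalue())
```

With this split the behavior no longer depends on the number of columns, although a single-column missing value then produces an empty line, which would need a separate decision.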

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Feb 26, 2022