Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 encoding of csv #17

Open
timpol opened this issue Sep 19, 2023 · 4 comments
Open

UTF-8 encoding of csv #17

timpol opened this issue Sep 19, 2023 · 4 comments

Comments

@timpol
Copy link

timpol commented Sep 19, 2023

Just a suggestion regarding the documentation...

It seems thatD47data.read() will fail if the input csv file is in UTF-8 encoding. This is not a problem with D47crunch, but may trip up some new users if they were to produce their csv file using the default csv format of certain spreadsheet programs.

Would it be worth putting a warning about this in section 1.2 of the documentation just in case?

@mdaeron
Copy link
Owner

mdaeron commented Sep 19, 2023

Thanks for reporting the issue.

Would it be worth putting a warning about this in section 1.2 of the documentation just in case?

Would you mind sharing a CSV that fails? The best option may be to correct this behavior.

@timpol
Copy link
Author

timpol commented Sep 19, 2023

I should have mentioned that it depends on the ordering of the columns as to whether or not it fails...

If you take the rawdata.csv example file and re-order the columns so that any column other than UID or Session is first, then save as a .csv in UTF-8 encoding from the spreadsheet program, it fails with a KeyError: 'Sample' on line for s in sorted({r['Sample'] for r in self}) when self.refresh_samples() is called (or another such key error).

I think this is because the spreadsheet program puts a UTF-8 byte order mark (BOM) at the start of the file, so D47crunch can't find the column headings it expects to see.

@mdaeron
Copy link
Owner

mdaeron commented Sep 19, 2023

In that case, the best case scenario is that another field gets a spurious BOM in its name, which is likely to cause cryptic problems down the line. If you send me an example file I can try to debug.

The longer-term solution, of course, it to use a more robust csv parsing implementation. There are probably many out there.

On the other hand, you could argue that producing clean csv files is Excel's job, not D47crunch's. Implementing corrective measures in every library likely to encounter this format is expensive. Converting “quasi-CSV” to ”bare-CSV” between Excel and these libraries would probably be less expensive. YMMV.

@timpol
Copy link
Author

timpol commented Sep 19, 2023

I agree 100% that it is an Excel problem not a D47crunch problem... I just thought it might be discouraging for a relatively new python user trying to get their D47crunch / D47calib script working.

A simple error msg informing the user that they need to clean up their filthy csv file could do the job?

See file attached:
rawdata_dirty.csv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants