UTF-8 encoding of csv #17

timpol · 2023-09-19T01:11:47Z

Just a suggestion regarding the documentation...

It seems thatD47data.read() will fail if the input csv file is in UTF-8 encoding. This is not a problem with D47crunch, but may trip up some new users if they were to produce their csv file using the default csv format of certain spreadsheet programs.

Would it be worth putting a warning about this in section 1.2 of the documentation just in case?

The text was updated successfully, but these errors were encountered:

mdaeron · 2023-09-19T05:03:51Z

Thanks for reporting the issue.

Would it be worth putting a warning about this in section 1.2 of the documentation just in case?

Would you mind sharing a CSV that fails? The best option may be to correct this behavior.

timpol · 2023-09-19T08:34:39Z

I should have mentioned that it depends on the ordering of the columns as to whether or not it fails...

If you take the rawdata.csv example file and re-order the columns so that any column other than UID or Session is first, then save as a .csv in UTF-8 encoding from the spreadsheet program, it fails with a KeyError: 'Sample' on line for s in sorted({r['Sample'] for r in self}) when self.refresh_samples() is called (or another such key error).

I think this is because the spreadsheet program puts a UTF-8 byte order mark (BOM) at the start of the file, so D47crunch can't find the column headings it expects to see.

mdaeron · 2023-09-19T08:42:45Z

In that case, the best case scenario is that another field gets a spurious BOM in its name, which is likely to cause cryptic problems down the line. If you send me an example file I can try to debug.

The longer-term solution, of course, it to use a more robust csv parsing implementation. There are probably many out there.

On the other hand, you could argue that producing clean csv files is Excel's job, not D47crunch's. Implementing corrective measures in every library likely to encounter this format is expensive. Converting “quasi-CSV” to ”bare-CSV” between Excel and these libraries would probably be less expensive. YMMV.

timpol · 2023-09-19T08:53:53Z

I agree 100% that it is an Excel problem not a D47crunch problem... I just thought it might be discouraging for a relatively new python user trying to get their D47crunch / D47calib script working.

A simple error msg informing the user that they need to clean up their filthy csv file could do the job?

See file attached:
rawdata_dirty.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF-8 encoding of csv #17

UTF-8 encoding of csv #17

timpol commented Sep 19, 2023

mdaeron commented Sep 19, 2023

timpol commented Sep 19, 2023

mdaeron commented Sep 19, 2023

timpol commented Sep 19, 2023

UTF-8 encoding of csv #17

UTF-8 encoding of csv #17

Comments

timpol commented Sep 19, 2023

mdaeron commented Sep 19, 2023

timpol commented Sep 19, 2023

mdaeron commented Sep 19, 2023

timpol commented Sep 19, 2023