Filename normalisation of form-data/multipart file uploads (umlauts on Apple clients) #2625
Normalise filenames to Unicode NFC, so that Mac and iOS clients behave identically to clients on other operating systems. Apple devices normally use NFD, which may cause trouble on other systems.
This patch only affects `form-data/multipart` file uploads, not downloads, nor any uploads handled by client-side JavaScript (which may need additional normalisation by app developers). The problem only affects filenames (not text input fields and the like, which always seem to use NFC).

This specifically affects umlauts and other letters that can be decomposed into two characters. For instance, ä can be represented as `\u00E4` (NFC) or as `a\u0308` (NFD), i.e. `a` followed by COMBINING DIAERESIS.

Some applications may already be doing such normalisation and should not be affected (the same work is merely done twice). Without this PR, applications see differently encoded names depending on which OS the client is running; this patch removes the disparity, and the changes are not expected to be breaking.
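To make the NFC/NFD difference concrete, here is a minimal sketch using only the standard library. The helper name `normalise_filename` is hypothetical and is not part of Sanic's API; it only illustrates the kind of normalisation described above.

```python
import unicodedata


def normalise_filename(name: str) -> str:
    """Return the filename in Unicode NFC (hypothetical helper, not Sanic's API)."""
    return unicodedata.normalize("NFC", name)


nfc = "\u00e4"   # ä as a single precomposed code point
nfd = "a\u0308"  # a followed by COMBINING DIAERESIS

# The two forms render identically but compare unequal as strings.
assert nfc != nfd
assert len(nfc) == 1 and len(nfd) == 2

# After normalisation both forms collapse to the same NFC string.
assert normalise_filename(nfd) == nfc
assert normalise_filename(nfc) == nfc

# An NFD filename, as an Apple client might send it, becomes NFC.
print(normalise_filename("Ma\u0308rchen.txt") == "M\u00e4rchen.txt")  # True
```

Normalising an already-NFC name returns it unchanged, which is why applications that already normalise should see no difference.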
On the download side, Mac browsers appear to accept either NFC or NFD and convert to NFD, as is native for Apple devices, so no changes are needed there. NFC should work everywhere.
A bit of background on the filesystem side (not directly affecting Sanic)
macOS considers the NFC and NFD forms of a filename to be the same, although everything should be in NFD. Creating a file under the NFC name and then writing a second file under the NFD name leaves one file with the first-created (NFC) filename and the content of the second file (overwrite). Reading or overwriting existing files accepts either form and preserves the form already on the filesystem (similar to case-insensitivity, and probably a side effect of implementing it). Reading filenames off the filesystem (glob etc.) returns whichever form is present on disk (on macOS often NFD, which might then need converting to NFC for interoperation with other systems over the web).
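For names read back from disk, an application could apply the same normalisation before sending them over the web. The following is a rough sketch under that assumption; `names_to_nfc` and `listdir_nfc` are hypothetical helpers, not part of Sanic or of this patch.

```python
import os
import unicodedata
from typing import Iterable, List


def names_to_nfc(names: Iterable[str]) -> List[str]:
    """Convert each filename to NFC, leaving names already in NFC unchanged."""
    return [unicodedata.normalize("NFC", name) for name in names]


def listdir_nfc(directory: str) -> List[str]:
    """List a directory and return its entries in NFC (hypothetical helper).

    The names on disk are not renamed; only the returned strings are normalised.
    """
    return names_to_nfc(os.listdir(directory))


# Names as they might be returned by a macOS filesystem (NFD) and by others (NFC).
on_disk = ["u\u0308ber.txt", "\u00fcber.pdf", "plain.txt"]
print(names_to_nfc(on_disk) == ["\u00fcber.txt", "\u00fcber.pdf", "plain.txt"])  # True
```

Because the files themselves keep their on-disk form, opening them by the NFC name relies on the filesystem accepting either form, as described above for macOS; on filesystems that treat the two forms as distinct, the original name should be kept for file access.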
Those who already find incorrectly encoded filenames on their system should find the convmv utility helpful, as it can mass convert names either way: