Add the text-file flag #152
This adds the text flag which is stored in the internal file attributes.
This is in response to the difference in the zipinfo output seen in #151. It does not fix that bug, however.
Two things I’d like to check before we merge:
- see the utf8 comment
- can we add a test?
// Check if the buffer is text (UTF-8) or contains binary data. Note that if a UTF-8
// character is split between two calls to write, this will falsely mark it as binary.
// For ASCII characters, this is always correct.
Can we use Utf8Error::valid_up_to to figure out if we’re in this situation? I think to know for sure we would have to keep the bytes from any previous read that are potentially an incomplete character and, if validation fails, prepend them and check again.
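A minimal sketch of that carry-over idea, not the PR's implementation: `Utf8Tracker` and its methods are hypothetical names invented for illustration. It assumes `Utf8Error::error_len` is used alongside `valid_up_to` to tell an incomplete trailing character (`None`) apart from a definitely invalid sequence (`Some(_)`).

```rust
use std::str;

/// Hypothetical helper (not part of the crate): tracks whether everything
/// written so far is valid UTF-8, even if a character is split across writes.
#[derive(Debug, Default)]
pub struct Utf8Tracker {
    /// Bytes of a possibly incomplete character left over from the last chunk.
    pending: Vec<u8>,
    /// Set once a definitely invalid sequence has been seen.
    invalid: bool,
}

impl Utf8Tracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Feed the next chunk that was passed to `write`.
    pub fn update(&mut self, chunk: &[u8]) {
        if self.invalid {
            return;
        }
        // Prepend the bytes carried over from the previous chunk.
        let mut buf = std::mem::take(&mut self.pending);
        buf.extend_from_slice(chunk);
        if let Err(e) = str::from_utf8(&buf) {
            match e.error_len() {
                // The input ended in the middle of a character: keep the tail.
                None => self.pending = buf[e.valid_up_to()..].to_vec(),
                // An invalid sequence inside the buffer: not UTF-8.
                Some(_) => self.invalid = true,
            }
        }
    }

    /// Call after the last write; a dangling partial character counts as binary.
    pub fn is_text(&self) -> bool {
        !self.invalid && self.pending.is_empty()
    }
}
```

This version copies every chunk into a temporary buffer for simplicity; since at most three bytes are ever carried over, a real implementation could validate the chunk in place and only copy the tail.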
We could also just mark it as text for ASCII files. The appnote is a bit vague here:
4.4.14.1 The lowest bit of this field indicates, if set,
that the file is apparently an ASCII or text file. If not
set, that the file apparently contains binary data.
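For reference, the bit described in 4.4.14.1 could be read and written roughly as follows. This is only a sketch: the constant and function names are assumptions for illustration and are not taken from the PR or the crate.

```rust
/// Lowest bit of the internal file attributes (APPNOTE 4.4.14.1).
/// The name is hypothetical; the crate may not define such a constant.
const TEXT_FLAG: u16 = 0x0001;

/// Returns true if the attributes mark the file as "apparently" text.
fn is_marked_text(internal_attributes: u16) -> bool {
    internal_attributes & TEXT_FLAG != 0
}

/// Returns the attributes with the text bit set or cleared,
/// leaving all other bits untouched.
fn with_text_flag(internal_attributes: u16, is_text: bool) -> u16 {
    if is_text {
        internal_attributes | TEXT_FLAG
    } else {
        internal_attributes & !TEXT_FLAG
    }
}
```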
@@ -231,6 +231,8 @@ pub struct ZipFileData {
    pub header_start: u64,
    /// Specifies where the compressed data of the file starts
    pub data_start: u64,
    /// Internal file attributes
    pub internal_attributes: u16,
This is out of scope for this PR, but I wonder if keeping this internally as a bitflags enum would be better.
I'd rather not provide this API in this way. The issue came up because there are cases where it is significant whether or not this bit is set. There are also cases where it will be important that the flag isn't set, and this leaves the user only the option of deliberately providing invalid UTF-8. There is also the silent failure when someone expects to create a text file, but that seems less important. That said, this does get the job done, and implementing this the way I'd like to requires a major rewrite of
I am fine leaving this out for now. I do not know of any use cases which actually read this flag, except for
Thanks for bringing it up! We'll have to figure out a good solution in the end, but I don't think this is it - there are too many downsides. I'll let @rylev make the call since I got to it late :P
For now, we don't have any examples of code that would check the flag, no. I'm fairly sure there'll be some zip-derived format that enforces the binary flag for, say, images, and there are images that technically contain valid Unicode. Also, this will create false negatives too: "text" doesn't necessarily mean UTF-8.
What seems most reasonable to me in the long run: offer a way for the user to decide how files are classified, with helpers for such things as declaring all valid UTF-8 as text. The default would be to mark ASCII as text and everything else as binary.
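One possible shape for such a user-selectable policy, purely as a sketch: `TextDetection` and its variants are hypothetical and do not exist in the crate, and the check runs on a complete buffer rather than on streamed writes.

```rust
/// Hypothetical classification policy; none of these names exist in the crate.
pub enum TextDetection {
    /// Mark the entry as text only if every byte is ASCII (the proposed default).
    Ascii,
    /// Mark the entry as text if the data is valid UTF-8.
    Utf8,
    /// The caller decides explicitly, regardless of content.
    Explicit(bool),
}

impl Default for TextDetection {
    fn default() -> Self {
        TextDetection::Ascii
    }
}

impl TextDetection {
    /// Decide whether `data` should get the text flag under this policy.
    pub fn is_text(&self, data: &[u8]) -> bool {
        match self {
            TextDetection::Ascii => data.is_ascii(),
            TextDetection::Utf8 => std::str::from_utf8(data).is_ok(),
            TextDetection::Explicit(flag) => *flag,
        }
    }
}
```

The `Explicit` variant covers the case raised above where a format requires the binary flag even for content that happens to be valid Unicode.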
I'm moving this to issues as there haven't been any requests for the feature, and we're still not sure how best to implement it.