Skip to content
This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

Add the text-file flag #152

Closed
wants to merge 1 commit into from
Closed

Add the text-file flag #152

wants to merge 1 commit into from

Conversation

mvdnes
Copy link
Collaborator

@mvdnes mvdnes commented Jun 20, 2020

This adds the text flag which is stored in the internal file attributes.

This adds the text flag which is stored in the internal file attributes.
@mvdnes
Copy link
Collaborator Author

mvdnes commented Jun 20, 2020

This is in response to the difference in the zipinfo output as seen in #151. This does not solve that bug however.

Copy link
Contributor

@rylev rylev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things I’d like to check before we merge:

  • see the utf8 comment
  • can we add a test?


// Check if the buffer is text (UTF8) or contains binary data. Note that if an UTF8
// character is split between to calls to write, this will falsly mark it as binary.
// For ASCII characters, this is always correct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use Utf8Error::valid_up_to to figure out if we’re in this situation? I think to know for sure we would have to keep the bytes from any previous read the are potentially a non-complete character and if validation fails, prepend and check again.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also just mark it as text for ASCII files. The appnote is a bit vague here:

       4.4.14.1 The lowest bit of this field indicates, if set, 
       that the file is apparently an ASCII or text file.  If not
       set, that the file apparently contains binary data.

@@ -231,6 +231,8 @@ pub struct ZipFileData {
pub header_start: u64,
/// Specifies where the compressed data of the file starts
pub data_start: u64,
/// Internal file attributes
pub internal_attributes: u16,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of scope for this PR but I wonder if keeping this internally as a bitflags enum would be better....

src/write.rs Show resolved Hide resolved
@Plecra
Copy link
Member

Plecra commented Jun 20, 2020

I'd rather not provide this API in this way - the issue came up because there are cases where it is significant whether or not this bit is set. There are cases where it will also be important that the flag isn't set, and this only leaves the user the option to deliberately provide invalid UTF-8.

There is also the issue with the silent failure when someone is expecting to create a text file, but that seems less important

That said, this does get the job done, and implementing this the way I'd like to requires a major rewrite of ZipWriter, so I don't have a leg to stand on here

@mvdnes
Copy link
Collaborator Author

mvdnes commented Jun 20, 2020

I am fine leaving this out for now. I do not know of any use cases which actually read this flag, except for zipinfo. This was a quick addition to test if this fixes the epub problem, but it did not seem to have any effect.

@Plecra
Copy link
Member

Plecra commented Jun 20, 2020

Thanks for bringing it up! We'll have to figure out a good solution in the end, but I don't think this is it - there are too many downsides. I'll let @rylev make the call since I got to it late :P

@rylev
Copy link
Contributor

rylev commented Jun 22, 2020

@mvdnes @Plecra do we know what the use cases are where utf-8 (or even just ascii) data should be marked as a binary file?

@Plecra
Copy link
Member

Plecra commented Jun 22, 2020

For now, we don't have any examples of code that would check the flag, no.

I'm fairly sure there'll be some zip-derived format that enforces the binary flag for, say, images, and images that technically contain valid unicode.

Also also, this will create false negatives too: "text" doesn't necessarily mean utf8.

@rylev
Copy link
Contributor

rylev commented Jun 23, 2020

What seems most reasonable to me in the long run: offer a way for the user to decide how files are classified with helpers for such things as declaring all valid utf8 as text. The default would be ASCII being text and everything else being marked binary.

@Plecra
Copy link
Member

Plecra commented Jun 23, 2020

I'm moving this to issues as there haven't been any requests for the feature, and we're still not sure how best to implement it.

@Plecra Plecra closed this Jun 23, 2020
@Plecra Plecra mentioned this pull request Feb 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants