Issue with saving XML attachments #228

Closed
it-can opened this issue Oct 5, 2017 · 13 comments

it-can commented Oct 5, 2017

Version: 1.0.1

I had an error after issue #226, seems to be related to encoding with "us-ascii"

@Slamdunk These are my email headers btw

------=_NextPart_000_01B0_01D33D5D.1AB50480
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Description: body

------=_NextPart_000_01B0_01D33D5D.1AB50480
Content-Type: text/xml; name="test.xml"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="test.xml"

------=_NextPart_000_01B0_01D33D5D.1AB50480
Content-Type: application/octet-stream; name="test.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="test.pdf"

When saving the attachment, the encoding of the XML file seems to be broken... The PDF seems correct...

<?xml version="1.0" encoding="utf-8"?>
Collaborator

Slamdunk commented Oct 5, 2017

Hi, the stray characters at the start of the file are the byte order mark (BOM). Seeing them is expected if the XML was created with a BOM and you open the file with software that reads it in the ASCII charset.

A BOM has always caused errors in a lot of software, and modern practice is to create UTF-8 documents without it. Still, if the BOM is there, you need to handle it yourself: this library decodes the attachment correctly.
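
For anyone who needs to handle the BOM on their side, a minimal sketch could look like the following. It assumes the attachment content is already available as a byte string; `stripUtf8Bom` is a hypothetical helper name and is not part of this library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ddeboer/imap): removes a leading
 * UTF-8 byte order mark (bytes EF BB BF) if one is present.
 */
function stripUtf8Bom(string $content): string
{
    $bom = "\xEF\xBB\xBF";

    // Compare only the first three bytes; the rest is left untouched.
    if (strncmp($content, $bom, 3) === 0) {
        return substr($content, 3);
    }

    return $content;
}
```

Content without a BOM is returned unchanged, so the helper should be safe to apply to every text attachment before handing it to an XML parser.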

Author

it-can commented Oct 5, 2017

Ok it worked in version 0.5.2

Collaborator

Slamdunk commented Oct 5, 2017

I don't understand your last message: did it behave differently in 0.5.2?

By the way the base64 encoding of the BOM is 77u/: if the first four chars of the first line of the attachment in the raw message are 77u/ you are dealing with an XML with the BOM.
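
That check can be sketched in a few lines. This assumes you have the raw, still base64-encoded body of the attachment part as a string; the function name is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: tells whether a base64-encoded body starts with
 * the UTF-8 BOM. The three BOM bytes (EF BB BF) always encode to the
 * four characters "77u/", whatever follows them.
 */
function base64BodyStartsWithUtf8Bom(string $rawBody): bool
{
    // Ignore leading whitespace or line breaks before the payload.
    return strncmp(ltrim($rawBody), '77u/', 4) === 0;
}
```

Because three input bytes map to exactly four base64 characters, the prefix does not depend on the rest of the file.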

Author

it-can commented Oct 5, 2017

Well I switched today to version 1 of this library, and now I have this issue... It worked last night correctly...

Collaborator

Slamdunk commented Oct 5, 2017

If you can provide me with the full raw message, with sensitive information redacted, I will be happy to inspect the change and publish a fix if one is needed.

Author

it-can commented Oct 5, 2017

Should an XML attachment be passed to Transcoder::decode ? I think this is the problem?

https://github.com/ddeboer/imap/blob/master/src/Message/AbstractPart.php#L280

Collaborator

Slamdunk commented Oct 5, 2017

An attachment never needs charset decoding, since it's (almost) always sent encoded in Base64.
Even in version 0.5.2, attachments were never charset-decoded.

I'm sorry, but I can't help you without the original mail that produces the output you consider wrong.

Author

it-can commented Oct 5, 2017

Yeah, so I think an XML attachment has a type of TEXT, and that is what gets passed to the transcoder. If the mail had been sent with Content-Type: application/octet-stream; name="test.xml" it would work correctly, because it would not be passed to the transcoder.

I can't send you the email because it is very sensitive to our business...

This works for me now:

// Take the raw part body and undo only the Content-Transfer-Encoding,
// so no charset decoding is applied to the attachment.
$content = $attachment->getContent();

if (AbstractPart::ENCODING_BASE64 === $attachment->getEncoding()) {
    $content = base64_decode($content);
} elseif (AbstractPart::ENCODING_QUOTED_PRINTABLE === $attachment->getEncoding()) {
    $content = quoted_printable_decode($content);
}

Collaborator

Slamdunk commented Oct 5, 2017

I was wrong: encoded attachments are charset-decoded if they have a text type, like XML.

The issue is that we treat us-ascii as the default server charset. The previous version did some guessing and, in your case, found the right charset; that behaviour is no longer acceptable because it is very brittle.

I need to investigate further (over the next few days).

Slamdunk added the bug label Oct 5, 2017
Author

it-can commented Oct 5, 2017

Thanks for the help! I now have a quick "fix" for my issue, for now it works for me... I will keep a close eye on this! Thanks again!

Collaborator

Slamdunk commented Oct 6, 2017

I have bad news about this issue.

  1. This bug affects only attachments that have a text MIME type other than plain text, such as HTML, XML, or CSV.
  2. This is not a bug in this library or in the IMAP server: it appears while the email is being created, on the client side, before the client sends it to the SMTP server.

An example: take two XML files with the same content, made up of charset-specific characters such as £, à, ß and €, but saved in different charsets: the first in US-ASCII and the second in UTF-8.

If we compose the email in Thunderbird 52 with both attachments, the receiver gets:

--------------9E93C7BA2D80D3B544BCD1A5
Content-Type: text/xml;
 name="att-utf8-xml.xml"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="att-utf8-xml.xml"

eG1sOiBBX1x8ISLCoyQlJigpPT/DoDw+LUAjJ3t9W11fw59f4oKsX1o=
--------------9E93C7BA2D80D3B544BCD1A5
Content-Type: text/xml;
 name="att-ascii-xml.xml"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="att-ascii-xml.xml"

eG1sOiBBX1x8ISI/JCUmKCk9Pz88Pi1AIyd7fVtdXz9fP19a
--------------9E93C7BA2D80D3B544BCD1A5

You can see that the charset is missing from the Content-Type header. However you try to charset-decode the content, one attachment will always be decoded wrongly, because the charset was never declared at the starting point, when the email was composed.
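
To make the ambiguity concrete, here is a small sketch that decodes the two base64 payloads from the Thunderbird message and inspects the resulting bytes. The helper names are hypothetical, and the checks rely only on PCRE, which ships with PHP.

```php
<?php
declare(strict_types=1);

// Payloads copied from the Thunderbird example in this thread.
$utf8Payload  = 'eG1sOiBBX1x8ISLCoyQlJigpPT/DoDw+LUAjJ3t9W11fw59f4oKsX1o=';
$asciiPayload = 'eG1sOiBBX1x8ISI/JCUmKCk9Pz88Pi1AIyd7fVtdXz9fP19a';

// Hypothetical helper: true when every byte is in the 7-bit range.
function isPureAscii(string $bytes): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $bytes) === 1;
}

// Hypothetical helper: true when the bytes form valid UTF-8.
// With the "u" modifier, preg_match() fails on malformed UTF-8 input.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$utf8Bytes  = base64_decode($utf8Payload, true);
$asciiBytes = base64_decode($asciiPayload, true);

var_dump(isPureAscii($asciiBytes)); // bool(true)
var_dump(isValidUtf8($asciiBytes)); // bool(true): ASCII is a subset of UTF-8
var_dump(isPureAscii($utf8Bytes));  // bool(false): contains multi-byte characters
var_dump(isValidUtf8($utf8Bytes));  // bool(true)
```

The first attachment is not valid US-ASCII, so treating it as the assumed default charset damages it, and nothing in its headers says it is UTF-8. Validity checks like these can rule a charset out, but they cannot tell which charset the sender intended.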

Gmail is smarter: it tries to detect the charset of the attachment and declare it:

--001a113d736c26ee77055adcc524
Content-Type: text/xml; charset="UTF-8"; name="att-utf8-xml.xml"
Content-Disposition: attachment; filename="att-utf8-xml.xml"
Content-Transfer-Encoding: base64

eG1sOiBBX1x8ISLCoyQlJigpPT/DoDw+LUAjJ3t9W11fw59f4oKsX1o=
--001a113d736c26ee77055adcc524
Content-Type: text/xml; charset="US-ASCII"; name="att-ascii-xml.xml"
Content-Disposition: attachment; filename="att-ascii-xml.xml"
Content-Transfer-Encoding: base64

eG1sOiBBX1x8ISI/JCUmKCk9Pz88Pi1AIyd7fVtdXz9fP19a
--001a113d736c26ee77055adcc524

After receiving this email, we can safely charset-decode both attachments the right way.
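
The difference between the two clients comes down to whether a charset parameter is present in the Content-Type header. As a rough sketch, it could be read like this; the function name is hypothetical, and the regular expression is a simplification that does not cover every header form MIME allows.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the charset declared in a Content-Type
 * header value (upper-cased), or null when none is declared.
 * Simplified matching only; not a full MIME header parser.
 */
function declaredCharset(string $contentType): ?string
{
    $pattern = '/;\s*charset\s*=\s*"?([^";\s]+)"?/i';

    if (preg_match($pattern, $contentType, $matches) === 1) {
        return strtoupper($matches[1]);
    }

    // No charset parameter: the receiver has nothing reliable to go on.
    return null;
}

// Gmail-style header from this thread: the charset is declared.
var_dump(declaredCharset('text/xml; charset="UTF-8"; name="att-utf8-xml.xml"'));

// Thunderbird-style header from this thread: no charset parameter.
var_dump(declaredCharset("text/xml;\r\n name=\"att-utf8-xml.xml\""));
```

A null result is the case discussed above, where any charset the receiver picks is a guess.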

The fix I pushed in #227 introduces the default behaviour of most email clients.

At the time of writing I don't see a robust solution to this issue 🙍

Author

it-can commented Oct 6, 2017

Ok thanks for the explanation... I will have to create a workaround for my tool...

Collaborator

Slamdunk commented Oct 6, 2017

There is no way to solve this: as it stands, we now completely avoid charset-decoding attachments.

@it-can I would really appreciate your feedback on the new release, 1.0.2
