Text Part decode enhancement when the charset param is mising #417

Closed
ekalchev opened this issue Jul 11, 2018 · 13 comments

@ekalchev

ekalchev commented Jul 11, 2018

I noticed that for some MIME parts that are missing the charset parameter, this method fails to decode the text when the encoding is UTF-16, even if a UTF-16 BOM is present:

MimeKit/MimeKit/TextPart.cs

Lines 299 to 338 in 2249574

public string Text {
	get {
		if (Content == null)
			return string.Empty;

		var charset = ContentType.Parameters["charset"];

		using (var memory = new MemoryStream ()) {
			Content.DecodeTo (memory);

#if !PORTABLE && !NETSTANDARD
			var content = memory.GetBuffer ();
#else
			var content = memory.ToArray ();
#endif
			Encoding encoding = null;

			if (charset != null) {
				try {
					encoding = CharsetUtils.GetEncoding (charset);
				} catch (NotSupportedException) {
				}
			}

			if (encoding == null) {
				try {
					return CharsetUtils.UTF8.GetString (content, 0, (int) memory.Length);
				} catch (DecoderFallbackException) {
					// fall back to iso-8859-1
					encoding = CharsetUtils.Latin1;
				}
			}

			return encoding.GetString (content, 0, (int) memory.Length);
		}
	}
	set {
		SetText (Encoding.UTF8, value);
	}
}

With the following change I was able to decode the part that previously failed to decode, so I thought I would share it. It makes use of byte order marks:

if (encoding == null) {
	try {
		encoding = CharsetUtils.UTF8;

		if (content.Length >= 3 && content[0] == 0x2b && content[1] == 0x2f && content[2] == 0x76)
			encoding = Encoding.UTF7;
		else if (content.Length >= 3 && content[0] == 0xef && content[1] == 0xbb && content[2] == 0xbf)
			encoding = Encoding.UTF8;
		else if (content.Length >= 2 && content[0] == 0xff && content[1] == 0xfe)
			encoding = Encoding.Unicode; // UTF-16LE
		else if (content.Length >= 2 && content[0] == 0xfe && content[1] == 0xff)
			encoding = Encoding.BigEndianUnicode; // UTF-16BE
		else if (content.Length >= 4 && content[0] == 0 && content[1] == 0 && content[2] == 0xfe && content[3] == 0xff)
			encoding = new UTF32Encoding (true, true); // UTF-32BE (Encoding.UTF32 is little-endian)

		return encoding.GetString (content, 0, (int) memory.Length);
	} catch (DecoderFallbackException) {
		// fall back to iso-8859-1
		encoding = CharsetUtils.Latin1;
	}
}

This was tested only with UTF-16.

@jstedfast
Owner

Sure, we could do that... UTF-7 should never be used, though, so checking for that is a waste. Same for UTF-32 and no need to bother checking for a UTF-8 BOM, just check for UTF-16 LE/BE and if it's not one of those, try UTF-8 as normal.
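
The narrower check suggested here (UTF-16 LE/BE only, then UTF-8 as normal) could look roughly like the sketch below. It is not the code from the referenced commit: BomSniffer and Decode are hypothetical names, and the sketch assumes the decoded bytes and their valid length are already in hand. Unlike the patch above, it skips the two BOM bytes so that U+FEFF does not end up in the returned string.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of MimeKit.
static class BomSniffer
{
    // Strict UTF-8: throws DecoderFallbackException on invalid bytes
    // instead of silently substituting U+FFFD.
    static readonly Encoding StrictUtf8 = new UTF8Encoding (false, true);

    // Decodes the first `length` bytes of `content` when no usable charset is known.
    public static string Decode (byte[] content, int length)
    {
        // UTF-16LE BOM: FF FE (decode from offset 2 to skip the BOM)
        if (length >= 2 && content[0] == 0xFF && content[1] == 0xFE)
            return Encoding.Unicode.GetString (content, 2, length - 2);

        // UTF-16BE BOM: FE FF
        if (length >= 2 && content[0] == 0xFE && content[1] == 0xFF)
            return Encoding.BigEndianUnicode.GetString (content, 2, length - 2);

        try {
            return StrictUtf8.GetString (content, 0, length);
        } catch (DecoderFallbackException) {
            // ISO-8859-1 maps every byte to a character, so this cannot fail.
            return Encoding.GetEncoding ("iso-8859-1").GetString (content, 0, length);
        }
    }
}
```

Passing the length explicitly matters when the bytes come from MemoryStream.GetBuffer (), since that buffer can be larger than the data written to it.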

jstedfast added a commit that referenced this issue Jul 11, 2018
… found

Fixes issue #417

Thanks to @ekalchev for this suggestion and initial patch
@eriknuds

eriknuds commented Aug 28, 2019

"UTF-7 should never be used"

...are you sure about that? If it's legal MIME and a standard encoding, it should be supported IMHO. Also, in real life, I keep getting emails from Outlook 2019 with UTF-7 MIME parts:

x-mailer: Microsoft Outlook 16.0
Content-Type: multipart/related;
	boundary="_003_HE1PR03MB2841801DA6B4AB41D96FD91895A00HE1PR03MB2841eurp_";
	type="text/html"
MIME-Version: 1.0

--_003_HE1PR03MB2841801DA6B4AB41D96FD91895A00HE1PR03MB2841eurp_
Content-Type: text/html; charset="utf-7"

<meta http-equiv="Content-Type" content="text/html; charset=utf-7">
+ADw-html xmlns:v+AD0AIg-urn:schemas-microsoft-com:vml+ACI- xmlns:o+AD0AIg-urn:schemas-microsoft-com:office:office+ACI- xmlns:w+AD0AIg-urn:schemas-microsoft-com:office:word+ACI- xmlns:m+AD0AIg-http://schemas.microsoft.com/office/2004/12/omml+ACI- xmlns+AD0AIg-http://www.w3.org/TR/REC-html40+ACIAPg-
+ADw-head+AD4-
+ADw-meta name+AD0AIg-Generator+ACI- content+AD0AIg-Microsoft Word 15 (filtered medium)+ACIAPg-
+ADwAIQ---+AFs-if +ACE-mso+AF0APgA8-style+AD4-v+AFw-:+ACo- +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
o+AFw-:+ACo- +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
w+AFw-:+ACo- +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
.shape +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
+ADw-/style+AD4APAAhAFs-endif+AF0---+AD4APA-style+AD4APAAh---
/+ACo- Font Definitions +ACo-/
+AEA-font-face
	+AHs-font-family:Helvetica+ADs-
	panose-1:2 11 6 4 2 2 2 2 2 4+ADsAfQ-
+AEA-font-face

...and UTF-8 decoding fails when I access the MimeMessage TextBody or HtmlBody properties. The exception "Unable to translate Unicode character \uDE9A at index 108 to specified code page." is thrown.

Stack trace:
at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
at System.Text.UTF8Encoding.GetBytes(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, EncoderNLS baseEncoder)
at System.Text.EncoderNLS.Convert(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
at System.Text.EncoderNLS.Convert(Char[] chars, Int32 charIndex, Int32 charCount, Byte[] bytes, Int32 byteIndex, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
at MimeKit.IO.Filters.CharsetFilter.Filter(Byte[] input, Int32 startIndex, Int32 length, Int32& outputIndex, Int32& outputLength, Boolean flush)
at MimeKit.IO.Filters.MimeFilterBase.Filter(Byte[] input, Int32 startIndex, Int32 length, Int32& outputIndex, Int32& outputLength)
at MimeKit.IO.FilteredStream.Write(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
at MimeKit.IO.FilteredStream.Write(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
at MimeKit.MimeContent.WriteTo(Stream stream, CancellationToken cancellationToken)
at MimeKit.MimeContent.DecodeTo(Stream stream, CancellationToken cancellationToken)
at MimeKit.TextPart.GetText(Encoding encoding)
at MimeKit.MimeMessage.TryGetMultipartBody(Multipart multipart, TextFormat format, String& body)
at MimeKit.MimeMessage.GetTextBody(TextFormat format)

As can be seen from the stack trace, MimeKit tries to use System.Text.UTF8Encoding.GetBytes() to decode the UTF-7 data, which fails.

In my example there is a charset parameter that says utf-7, as shown, but it seems MimeKit still doesn't try to decode the content as UTF-7.
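
One possible reading of the stack trace: the top frames (EncoderExceptionFallbackBuffer, UTF8Encoding.GetBytes) are on the encoding side, which suggests the failure occurs while already-decoded characters are being written out as UTF-8, not while the UTF-7 bytes are being read. The character named in the message, \uDE9A, falls in the low-surrogate range, and a strict UTF-8 encoder rejects a surrogate that has no partner. The sketch below reproduces only that encoder behaviour with plain System.Text; LoneSurrogateDemo is a hypothetical name, and nothing here shows how the lone surrogate came about in the original message.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of MimeKit.
static class LoneSurrogateDemo
{
    // Returns true if a strict (exception-fallback) UTF-8 encoder rejects the text.
    public static bool StrictUtf8Rejects (string text)
    {
        // Second argument: throw on invalid input instead of substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding (false, true);

        try {
            strictUtf8.GetBytes (text);
            return false;
        } catch (EncoderFallbackException) {
            return true;
        }
    }
}
```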

@jstedfast
Owner

@eriknuds You are misunderstanding the original discussion in this feature request.

This feature request was for MimeKit to "sniff" the proper Unicode charset of the content by checking for a Unicode BOM if-and-only-if MimeKit fails to convert the content into a string using the specified charset (or if no charset parameter is specified).

UTF-7 does not have a BOM that I am aware of.

Also: if your content really was UTF-7, then MimeKit would never have tried to fall back to UTF-8.

Try this:

var text = textPart.GetText (Encoding.UTF7);

My prediction is that you will get an exception.

@jstedfast
Owner

I just tested this locally and I'm not getting any exceptions. I also tested message.HtmlBody.

I suspect the problem is that your version of .NET might not support UTF-7.

What version of .NET are you using?

@eriknuds

Hi Jeffrey,

I realize the original post was about detecting character set when it was not given by the content charset parameter, but the statements made about utf-7 support made it relevant enough I thought.

Also, there are in fact five BOM byte patterns that signify UTF-7 encoding:
https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

About UTF-7 BOM:
...
While a Unicode signature is typically a single, fixed byte sequence, the nature of UTF-7 necessitates 5 variations: The last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the following character, resulting in 4 possible bit patterns and therefore 4 different possible bytes in the 4th position. The 5th variation is needed to disambiguate the case where no characters at all follow the signature.
...
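
As a rough illustration of the passage quoted above: U+FEFF in UTF-7 starts with the bytes 2B 2F 76 ("+/v"), and the fourth byte is one of 38, 39, 2B or 2F ("8", "9", "+", "/") depending on the next character. The fifth variation, 2B 2F 76 38 2D ("+/v8-"), shares the "+/v8" prefix. The sketch below checks for those patterns; Utf7Signature is a hypothetical name and not something MimeKit provides.

```csharp
using System;

// Hypothetical helper, not part of MimeKit.
static class Utf7Signature
{
    // Returns true if the first `length` bytes of `content` begin with one of
    // the UTF-7 encodings of U+FEFF.
    public static bool IsPresent (byte[] content, int length)
    {
        // Common prefix "+/v"
        if (length < 4 || content[0] != 0x2B || content[1] != 0x2F || content[2] != 0x76)
            return false;

        // Fourth byte: '8', '9', '+' or '/'. The five-byte form "+/v8-"
        // is already covered by the "+/v8" case.
        byte fourth = content[3];
        return fourth == 0x38 || fourth == 0x39 || fourth == 0x2B || fourth == 0x2F;
    }
}
```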

I believe .NET 4.7.2 does support UTF-7 just fine, but I will do some more investigation at work to pin this issue down; maybe there are some environment or OS issues playing a part in this.

Also, I believe the .NET charset decoders can be set to have a best-fit or replacement (question mark) fallback policy. It would be really nice if MimeKit supported that. As you touch on, there are different versions of .NET, different OSes and evolving Unicode standards that may make the decoding process fail for certain characters, and I think it's much better to get the text with an occasional question mark for new/special Unicode characters than nothing at all, at least in my use cases.

https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding#FallbackStrategy

Thanks for the help and quick replies so far Jeffrey!

@jstedfast
Owner

jstedfast commented Aug 28, 2019

.NET 4.7.2 should have UTF-7 and it should work fine. I'm not sure why it isn't for you but it is for me? I thought maybe you were using the UniversalWindowsPhone81 or UniversalWindows81 targets (which have limited charset support outside of UTF8).

I know about the fallbacks, and yes, they are valuable. That is why I have TextPart.GetText (Encoding)

The problem with having MimeKit do it behind your back is that then you'd have no idea as a mail client author whether something went wrong or not so that you could deal with it.

As far as the UTF-7 BOMs, well, I stand corrected but it wouldn't matter anyway seeing as how your text/html content does not begin with any of those 5 BOM sequences.

I'm curious what happens when you call TextPart.GetText (Encoding.UTF7) on your system.

I also can't understand why Outlook decided to use UTF-7 in your sample message, seeing as how all it does is encode HTML markup tokens, e.g. the < character becomes +ADw-.

It's not necessary at all.

I suspect that the sending client is misconfigured somehow because my Outlook client doesn't seem to do that (mine sends messages as UTF-8, even the HTML bodies).
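
To make the +ADw- example concrete, the sketch below decodes one line from the sample message with the framework's own UTF-7 decoder. Utf7Demo is a hypothetical name. On .NET 5 and later, Encoding.UTF7 is marked obsolete (warning SYSLIB0001), so the pragma is there on the assumption that the code is built for such a target; on .NET Framework it is harmless.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of MimeKit.
static class Utf7Demo
{
    // Decodes an ASCII-only string containing UTF-7 escape sequences.
    public static string Decode (string ascii)
    {
#pragma warning disable SYSLIB0001 // Encoding.UTF7 is obsolete on .NET 5 and later
        return Encoding.UTF7.GetString (Encoding.ASCII.GetBytes (ascii));
#pragma warning restore SYSLIB0001
    }
}
```

For example, Utf7Demo.Decode ("+ADw-head+AD4-") yields "<head>": the base64 run ADw is the UTF-16 code unit 0x003C, AD4 is 0x003E, and each trailing "-" only terminates its run.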

@eriknuds

eriknuds commented Aug 28, 2019

Outlook/Exchange does so many weird and annoying things in different configurations and scenarios. I'm not sure why UTF-7 is used, and I agree it seems like a very weird choice, but we get some of these emails more or less every day (among thousands of emails). Unfortunately I don't have user access to the email accounts in question; I pick these examples up from the error logs, where they are only partly written. I will continue to investigate and update here with any interesting findings.

If MailKit supported specifying a fallback strategy, I would turn on best-fit or replacement, since in my use cases I don't much mind the occasional missed character conversion. Because I would want to log these cases, I would first try with the exception fallback and then retry with the other fallback strategies in a catch block, to get as much of the text as we could while also logging the exception.

@jstedfast
Owner

Right, but my point was that if MimeKit did the fallback, then you wouldn't know. You'd just get question marks and never be sure if that was supposed to be that way or if it was a fallback character because you'd never get an exception.

@eriknuds

That is why I would probably default to exception fallback, so I could try/catch/log any issues and retry decoding (property access) with a more lenient fallback strategy in my catch block, to get the best effort or replacement version of the data. Then I would both know it failed (I can log it and mark it with a warning or whatever) and also get the best text representation I could possibly get at the same time. If fallback strategy was a property of the MimeMessage object I could have a strict exception fallback default, then if an exception is thrown in my try block I could set the more lenient fallback strategy on the MimeMessage object and retry in the catch block to get the text again.
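
The strict-first, lenient-on-failure idea described here can be sketched with plain System.Text, outside of MimeKit. LenientDecoder and its lossy flag are hypothetical; the sketch assumes the raw bytes of the part are already available (for instance via Content.DecodeTo into a MemoryStream, as the Text property quoted earlier in the thread does) and it only covers failures on the decoding side.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of MimeKit.
static class LenientDecoder
{
    // Tries a strict decode first. On failure, retries with a '?' replacement
    // fallback and reports via `lossy` that the result is best-effort.
    public static string Decode (byte[] bytes, string charset, out bool lossy)
    {
        var strict = Encoding.GetEncoding (charset,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        try {
            lossy = false;
            return strict.GetString (bytes);
        } catch (DecoderFallbackException) {
            // The caller can log the failure here before accepting lossy text.
            var lenient = Encoding.GetEncoding (charset,
                EncoderFallback.ReplacementFallback, new DecoderReplacementFallback ("?"));

            lossy = true;
            return lenient.GetString (bytes);
        }
    }
}
```

The lossy flag addresses the concern raised earlier in the thread: the caller still learns that something went wrong, and can tell a fallback question mark apart from one that was in the original text by checking the flag.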

@eriknuds

eriknuds commented Aug 29, 2019

I could not reproduce the error observed in the logs using the data from the logs. If there were any unsavory characters there, or characters belonging to a newer revision of the Unicode standard than the OS/.NET supported, they may have been filtered out before I got to see the data in my log viewer, so that might be the reason. But there are obviously some instances in which "mostly" valid UTF-7 data fails to decode, and I must handle it somehow. So I guess I will have to use TextPart.GetText(Encoding) to decode instead of using the MimeMessage.HtmlBody etc. properties directly, unless the fallback strategy of those properties becomes configurable in the near future.

@jstedfast
Owner

jstedfast commented Aug 29, 2019

@eriknuds The TextBody and HtmlBody properties on MimeMessage are quick & dirty ways of getting that text, but I've always recommended that developers a little more serious about getting things right use a MimeVisitor approach along with TextPart.Text and/or TextPart.GetText() (I also provide GetText() methods that take a string denoting the override charset name).

For an example of this strategy, take a look at the HtmlPreviewVisitor class in the FAQ and/or the MessageReader sample in the samples directory of the MimeKit GitHub repo.

I know the class seems "complicated", but that's why I wrote the demo which pretty much anyone should be able to more-or-less copy & paste as a really good starting point for doing this.

In the MessageReader sample, the HtmlPreviewVisitor has a comment about saving the image attachments that are referred to in the HTML body to a temp directory so that the browser control used to render the images can load them using file:// URIs, but it looks like I modified the code to use data: URIs instead (which embed the entire image data as a base64 blob inside of the HTML).

Ignore the comment and I'll go fix it to say the correct thing ;-)

FWIW, you could use file:// URIs if you wanted to and the original version of that implementation did, but I think I changed it to data: because that's what most people seemed to want to do, probably because it doesn't require managing temp files ;-)

Access to the referenced image data is another reason why using a MimeVisitor approach is better than HtmlBody, since that's just the raw HTML content w/o any image data.

@eriknuds

Thanks Jeffrey,
I will use a visitor instead as you recommend!

@jstedfast
Owner

FWIW, I just updated the HtmlPreviewVisitor implementation in the FAQ to allow you to switch easily between file:// and data: URIs.
