Text Part decode enhancement when the charset param is missing #417

ekalchev · 2018-07-11T11:11:58Z

I noticed that for some MIME parts that are missing the charset parameter, this method fails to decode the text when the content is UTF-16, even if a UTF-16 BOM is present:

MimeKit/MimeKit/TextPart.cs

Lines 299 to 338 in 2249574

    public string Text {
        get {
            if (Content == null)
                return string.Empty;

            var charset = ContentType.Parameters["charset"];

            using (var memory = new MemoryStream ()) {
                Content.DecodeTo (memory);

    #if !PORTABLE && !NETSTANDARD
                var content = memory.GetBuffer ();
    #else
                var content = memory.ToArray ();
    #endif
                Encoding encoding = null;

                if (charset != null) {
                    try {
                        encoding = CharsetUtils.GetEncoding (charset);
                    } catch (NotSupportedException) {
                    }
                }

                if (encoding == null) {
                    try {
                        return CharsetUtils.UTF8.GetString (content, 0, (int) memory.Length);
                    } catch (DecoderFallbackException) {
                        // fall back to iso-8859-1
                        encoding = CharsetUtils.Latin1;
                    }
                }

                return encoding.GetString (content, 0, (int) memory.Length);
            }
        }
        set {
            SetText (Encoding.UTF8, value);
        }
    }

With the following change I was able to decode the part that previously failed to decode, so I thought I would share it. It makes use of byte order marks (BOMs):

    if (encoding == null) {
        try {
            encoding = CharsetUtils.UTF8;

            if (content.Length >= 3 && content[0] == 0x2b && content[1] == 0x2f && content[2] == 0x76)
                encoding = Encoding.UTF7;
            else if (content.Length >= 3 && content[0] == 0xef && content[1] == 0xbb && content[2] == 0xbf)
                encoding = Encoding.UTF8;
            else if (content.Length >= 2 && content[0] == 0xff && content[1] == 0xfe)
                encoding = Encoding.Unicode; // UTF-16LE
            else if (content.Length >= 2 && content[0] == 0xfe && content[1] == 0xff)
                encoding = Encoding.BigEndianUnicode; // UTF-16BE
            else if (content.Length >= 4 && content[0] == 0 && content[1] == 0 && content[2] == 0xfe && content[3] == 0xff)
                encoding = new UTF32Encoding (true, true); // UTF-32BE (Encoding.UTF32 is little-endian and would not match this BOM)

            return encoding.GetString (content, 0, (int) memory.Length);
        } catch (DecoderFallbackException) {
            // fall back to iso-8859-1
            encoding = CharsetUtils.Latin1;
        }
    }

This was tested only with UTF-16.

jstedfast · 2018-07-11T12:27:39Z

Sure, we could do that... UTF-7 should never be used, though, so checking for that is a waste. Same for UTF-32 and no need to bother checking for a UTF-8 BOM, just check for UTF-16 LE/BE and if it's not one of those, try UTF-8 as normal.
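The order of checks described above (UTF-16 LE/BE BOM first, then UTF-8, then the existing iso-8859-1 fallback) could look roughly like the sketch below. It is not the code from the actual commit: `BomSniffer` and `Decode` are hypothetical names, it uses only `System.Text` rather than MimeKit's `CharsetUtils`, and skipping the two BOM bytes before decoding is an assumption of this sketch.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Strict UTF-8: throws DecoderFallbackException on invalid input
    // instead of silently substituting replacement characters.
    static readonly Encoding StrictUtf8 = new UTF8Encoding (false, true);
    static readonly Encoding Latin1 = Encoding.GetEncoding ("iso-8859-1");

    // Hypothetical helper: decodes text content that has no usable charset parameter.
    public static string Decode (byte[] content, int length)
    {
        if (length >= 2) {
            // UTF-16LE BOM: FF FE (assumption: the BOM itself is not part of the text)
            if (content[0] == 0xFF && content[1] == 0xFE)
                return Encoding.Unicode.GetString (content, 2, length - 2);

            // UTF-16BE BOM: FE FF
            if (content[0] == 0xFE && content[1] == 0xFF)
                return Encoding.BigEndianUnicode.GetString (content, 2, length - 2);
        }

        try {
            // No UTF-16 BOM: try UTF-8 as normal.
            return StrictUtf8.GetString (content, 0, length);
        } catch (DecoderFallbackException) {
            // fall back to iso-8859-1, which accepts any byte sequence
            return Latin1.GetString (content, 0, length);
        }
    }
}
```

For example, `BomSniffer.Decode (new byte[] { 0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00 }, 6)` yields `"hi"`. UTF-7 and UTF-32 BOMs are deliberately not checked, following the comment above.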

eriknuds · 2019-08-28T12:59:52Z

> UTF-7 should never be used

...are you sure about that? If it's legal MIME and a standard encoding, it should be supported IMHO. Also, in real life, I keep getting emails from Outlook 2019 with UTF-7 MIME parts:

x-mailer: Microsoft Outlook 16.0
Content-Type: multipart/related;
	boundary="_003_HE1PR03MB2841801DA6B4AB41D96FD91895A00HE1PR03MB2841eurp_";
	type="text/html"
MIME-Version: 1.0

--_003_HE1PR03MB2841801DA6B4AB41D96FD91895A00HE1PR03MB2841eurp_
Content-Type: text/html; charset="utf-7"

<meta http-equiv="Content-Type" content="text/html; charset=utf-7">
+ADw-html xmlns:v+AD0AIg-urn:schemas-microsoft-com:vml+ACI- xmlns:o+AD0AIg-urn:schemas-microsoft-com:office:office+ACI- xmlns:w+AD0AIg-urn:schemas-microsoft-com:office:word+ACI- xmlns:m+AD0AIg-http://schemas.microsoft.com/office/2004/12/omml+ACI- xmlns+AD0AIg-http://www.w3.org/TR/REC-html40+ACIAPg-
+ADw-head+AD4-
+ADw-meta name+AD0AIg-Generator+ACI- content+AD0AIg-Microsoft Word 15 (filtered medium)+ACIAPg-
+ADwAIQ---+AFs-if +ACE-mso+AF0APgA8-style+AD4-v+AFw-:+ACo- +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
o+AFw-:+ACo- +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
w+AFw-:+ACo- +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
.shape +AHs-behavior:url(+ACM-default+ACM-VML)+ADsAfQ-
+ADw-/style+AD4APAAhAFs-endif+AF0---+AD4APA-style+AD4APAAh---
/+ACo- Font Definitions +ACo-/
+AEA-font-face
	+AHs-font-family:Helvetica+ADs-
	panose-1:2 11 6 4 2 2 2 2 2 4+ADsAfQ-
+AEA-font-face

...and UTF8 decoding fails when I access the MimeMessage TextBody or HtmlBody attributes. The exception "Unable to translate Unicode character \uDE9A at index 108 to specified code page." is thrown.

Stack trace:

```
at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
at System.Text.UTF8Encoding.GetBytes(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, EncoderNLS baseEncoder)
at System.Text.EncoderNLS.Convert(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
at System.Text.EncoderNLS.Convert(Char[] chars, Int32 charIndex, Int32 charCount, Byte[] bytes, Int32 byteIndex, Int32 byteCount, Boolean flush, Int32& charsUsed, Int32& bytesUsed, Boolean& completed)
at MimeKit.IO.Filters.CharsetFilter.Filter(Byte[] input, Int32 startIndex, Int32 length, Int32& outputIndex, Int32& outputLength, Boolean flush)
at MimeKit.IO.Filters.MimeFilterBase.Filter(Byte[] input, Int32 startIndex, Int32 length, Int32& outputIndex, Int32& outputLength)
at MimeKit.IO.FilteredStream.Write(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
at MimeKit.IO.FilteredStream.Write(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
at MimeKit.MimeContent.WriteTo(Stream stream, CancellationToken cancellationToken)
at MimeKit.MimeContent.DecodeTo(Stream stream, CancellationToken cancellationToken)
at MimeKit.TextPart.GetText(Encoding encoding)
at MimeKit.MimeMessage.TryGetMultipartBody(Multipart multipart, TextFormat format, String& body)
at MimeKit.MimeMessage.GetTextBody(TextFormat format)
```

As can be seen from the stack trace MimeKit tries to use System.Text.UTF8Encoding.GetBytes() to decode the utf-7 data, which fails.

In my example there is a charset parameter that says utf-7, as shown, but it still doesn't try to decode as utf-7 it seems.

jstedfast · 2019-08-28T14:16:16Z

@eriknuds You are misunderstanding the original discussion in this feature request.

This feature request was for MimeKit to "sniff" the proper unicode charset of the content by checking for a unicode BOM if-and-only-if MimeKit fails to convert the content into a string using the specified charset (or if no charset parameter is specified).

UTF-7 does not have a BOM that I am aware of.

Also: if your content really was UTF-7, then MimeKit would never have tried to fall back to UTF-8.

Try this:

var text = textPart.GetText (Encoding.UTF7);

My prediction is that you will get an exception.
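To check whether the runtime itself can decode UTF-7, independently of MimeKit, a small stand-alone test along these lines may help. `Utf7Demo` is a hypothetical name; the sketch only uses `System.Text`. Note that on .NET 5 and later `Encoding.UTF7` is marked obsolete (warning SYSLIB0001), which the pragma below suppresses.

```csharp
using System;
using System.Text;

public static class Utf7Demo
{
    // Decodes a string of UTF-7 "wire" text, such as a line copied from the raw message.
    public static string Decode (string wireText)
    {
        // UTF-7 wire text is pure ASCII, so this conversion is lossless.
        byte[] raw = Encoding.ASCII.GetBytes (wireText);

#pragma warning disable SYSLIB0001 // Encoding.UTF7 is obsolete on .NET 5+
        return Encoding.UTF7.GetString (raw);
#pragma warning restore SYSLIB0001
    }
}
```

`Utf7Demo.Decode ("+ADw-head+AD4-")` returns `"<head>"`, matching the second line of the sample body above.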

jstedfast · 2019-08-28T14:27:34Z

I just tested this locally and I'm not getting any exceptions. I also tested message.HtmlBody.

I suspect the problem is that your version of .NET might not support UTF-7.

What version of .NET are you using?

eriknuds · 2019-08-28T15:17:02Z

Hi Jeffrey,

I realize the original post was about detecting character set when it was not given by the content charset parameter, but the statements made about utf-7 support made it relevant enough I thought.

Also, there are BOM byte patterns, five in fact, that signify UTF-7 encoding:
https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

About UTF-7 BOM:
...
While a Unicode signature is typically a single, fixed byte sequence, the nature of UTF-7 necessitates 5 variations: The last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the following character, resulting in 4 possible bit patterns and therefore 4 different possible bytes in the 4th position. The 5th variation is needed to disambiguate the case where no characters at all follow the signature.
...
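As an illustration of the passage quoted above: U+FEFF encodes in UTF-7 as `+/v` followed by one of `8`, `9`, `+` or `/`, depending on the two bits borrowed from the next character, so a check could be sketched as follows. `Utf7Bom` and `HasUtf7Bom` are hypothetical names and not part of MimeKit.

```csharp
public static class Utf7Bom
{
    // Returns true if the buffer starts with one of the UTF-7 signatures.
    public static bool HasUtf7Bom (byte[] content, int length)
    {
        // All variations start with "+/v" (2B 2F 76).
        if (length < 4 || content[0] != 0x2B || content[1] != 0x2F || content[2] != 0x76)
            return false;

        // The 4th byte is '8', '9', '+' or '/'. The fifth variation, "+/v8-",
        // also starts with "+/v8", so it is covered by the same test.
        byte b = content[3];
        return b == 0x38 || b == 0x39 || b == 0x2B || b == 0x2F;
    }
}
```

The sample body earlier in this thread starts with `<meta`, so a check like this would return false for it.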

I believe .NET 4.7.2 does support UTF-7 just fine, but I will do some more investigation at work to pin this issue down. Maybe there are some environment or OS issues playing a part in this.

Also, I believe the .NET charset decoders can be set to use a best-fit or replacement (question mark) fallback policy. It would be really nice if MimeKit supported that. As you touch on, there are different versions of .NET, different OSes and evolving Unicode standards that may make the decoding process fail for certain characters, and I think it's much better to get the text with an occasional question mark for new/special Unicode characters than nothing at all, at least in my use cases.

https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding#FallbackStrategy

Thanks for the help and quick replies so far Jeffrey!

jstedfast · 2019-08-28T16:27:35Z

.NET 4.7.2 should have UTF-7 and it should work fine. I'm not sure why it isn't for you but it is for me? I thought maybe you were using the UniversalWindowsPhone81 or UniversalWindows81 targets (which have limited charset support outside of UTF8).

I know about the fallbacks, and yes, they are valuable. That is why I have TextPart.GetText (Encoding)

The problem with having MimeKit do it behind your back is that then you'd have no idea as a mail client author whether something went wrong or not so that you could deal with it.

As far as the UTF-7 BOMs, well, I stand corrected but it wouldn't matter anyway seeing as how your text/html content does not begin with any of those 5 BOM sequences.

I'm curious what happens when you call TextPart.GetText (Encoding.UTF7) on your system.

I also can't understand why Outlook decided to use UTF-7 in your sample message seeing as how all it does is encode HTML markup tokens, e.g. the < character becomes +ADw-.

It's not necessary at all.

I suspect that the sending client is misconfigured somehow because my Outlook client doesn't seem to do that (mine sends messages as UTF-8, even the HTML bodies).

eriknuds · 2019-08-28T18:33:10Z

Outlook/exchange does so many weird and annoying things in different configurations and scenarios. I'm not sure why utf-7 is used, and I agree it seems like a very weird choice, but we get some of these emails more or less every day (among thousands of emails). Unfortunately I don't have user access to the email accounts in question, I pick these examples up from the error logs where they are only partly written. I will continue to investigate and update here with any interesting findings.

If MailKit supported specifying a fallback strategy, I would turn on best-fit or replacement fallback, since that approach suits my use cases and I don't mind the occasional missed character conversion that much. As I would want to log these cases, I would first try with the exception fallback and then retry with the other fallback strategies in a catch block to get as much of the text as we could, as well as logging the exception.

jstedfast · 2019-08-28T19:21:55Z

Right, but my point was that if MimeKit did the fallback, then you wouldn't know. You'd just get question marks and never be sure if that was supposed to be that way or if it was a fallback character because you'd never get an exception.

eriknuds · 2019-08-28T19:41:58Z

That is why I would probably default to exception fallback, so I could try/catch/log any issues and retry decoding (property access) with a more lenient fallback strategy in my catch block, to get the best-fit or replacement version of the data. Then I would both know it failed (I can log it and mark it with a warning or whatever) and also get the best text representation I could possibly get. If the fallback strategy were a property of the MimeMessage object, I could have a strict exception fallback as the default; then, if an exception is thrown in my try block, I could set the more lenient fallback strategy on the MimeMessage object and retry in the catch block to get the text again.
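The strict-first, lenient-on-failure idea described here can be sketched on raw bytes with `System.Text` alone. `LenientDecoder`, `Decode` and the `lossy` flag are hypothetical names for illustration; with MimeKit, encodings built this way could presumably be passed to `TextPart.GetText (Encoding)` as mentioned earlier in the thread. The sketch only covers the decoding direction.

```csharp
using System;
using System.Text;

public static class LenientDecoder
{
    // Tries a strict decode first; on failure, retries with a '?' replacement
    // fallback and reports through 'lossy' that something went wrong.
    public static string Decode (byte[] content, string charset, out bool lossy)
    {
        var strict = Encoding.GetEncoding (charset,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        try {
            lossy = false;
            return strict.GetString (content);
        } catch (DecoderFallbackException) {
            // The caller can log this case; undecodable bytes become '?'.
            var lenient = Encoding.GetEncoding (charset,
                EncoderFallback.ReplacementFallback,
                new DecoderReplacementFallback ("?"));

            lossy = true;
            return lenient.GetString (content);
        }
    }
}
```

With this shape the caller still learns that a fallback happened, which addresses the concern about silent question marks, while also getting a best-effort string.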

eriknuds · 2019-08-29T11:24:02Z

I could not reproduce the error observed in the logs using the data from the logs. If there were any unsavory characters there, or characters belonging to a newer revision of the unicode standard than the OS/.NET supported, it may be filtered out of the data before I get to see the data in my log viewer, so that might be the reason. But there are obviously some instances in which "mostly" valid utf-7 data fails to decode, and I must handle it somehow, so I guess I will have to use TextPart.GetText(Encoding) to decode instead of using the MimeMessage.HtmlBody etc. properties directly unless the fallback strategy of those properties will be configurable in the near future.

jstedfast · 2019-08-29T12:18:51Z

@eriknuds The TextBody and HtmlBody properties on MimeMessage are quick & dirty ways of getting that text, but I've always recommended that developers a little more serious about getting things right use a MimeVisitor approach along with TextPart.Text and/or TextPart.GetText() (I also provide GetText() methods that take a string denoting the override charset name).

For an example of this, take a look at the HtmlPreviewVisitor class in the FAQ and/or take a look at the MessageReader sample in the samples directory of the MimeKit github repo for an example of how to use this strategy.

I know the class seems "complicated", but that's why I wrote the demo which pretty much anyone should be able to more-or-less copy & paste as a really good starting point for doing this.

In the MessageReader sample, the HtmlPreviewVisitor has a comment about saving the image attachments that are referred to in the HTML body to a temp directory so that the browser control used to render the images can load them using file:// URIs, but it looks like I modified the code to use data: URIs instead (which embed the entire image data as a base64 blob inside of the HTML).

Ignore the comment and I'll go fix it to say the correct thing ;-)

FWIW, you could use file:// URIs if you wanted to and the original version of that implementation did, but I think I changed it to data: because that's what most people seemed to want to do, probably because it doesn't require managing temp files ;-)

Access to the referenced image data is another reason why using a MimeVisitor approach is better than HtmlBody, since that's just the raw HTML content w/o any image data.

eriknuds · 2019-08-29T13:12:13Z

Thanks Jeffrey,
I will use a visitor instead as you recommend!

jstedfast · 2019-08-29T13:30:59Z

FWIW, I just updated the HtmlPreviewVisitor implementation in the FAQ to allow you to switch easily between file:// and data: URIs.

jstedfast added a commit that referenced this issue Jul 11, 2018

Check TextPart content for a UTF-16 BOM and use a UTF-16 converter if found

Fixes issue #417

71932a7

jstedfast added a commit that referenced this issue Jul 11, 2018

Check TextPart content for a UTF-16 BOM and use a UTF-16 converter if found

Fixes issue #417 Thanks to @ekalchev for this suggestion and initial patch

8cb80e8

jstedfast closed this as completed Jul 11, 2018
