Text Part decode enhancement when the charset param is missing #417
Comments
Sure, we could do that... UTF-7 should never be used, though, so checking for that is a waste. Same for UTF-32 and no need to bother checking for a UTF-8 BOM, just check for UTF-16 LE/BE and if it's not one of those, try UTF-8 as normal.
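The check described above (look for a UTF-16 LE/BE BOM, and if there is none, try UTF-8 as normal) could be sketched roughly as follows. This is only an illustration using the .NET base class library on a raw byte array; `CharsetSniffer` and `DecodeWithBomSniffing` are hypothetical names, not MimeKit APIs, and it is not the change that was proposed or merged.

```csharp
using System;
using System.Text;

static class CharsetSniffer
{
    // Hypothetical helper (not part of MimeKit): decodes content that has no
    // usable charset parameter by checking only for a UTF-16 BOM.
    public static string DecodeWithBomSniffing (byte[] content)
    {
        if (content.Length >= 2) {
            // FF FE => UTF-16 little-endian; skip the 2-byte BOM.
            if (content[0] == 0xFF && content[1] == 0xFE)
                return Encoding.Unicode.GetString (content, 2, content.Length - 2);

            // FE FF => UTF-16 big-endian; skip the 2-byte BOM.
            if (content[0] == 0xFE && content[1] == 0xFF)
                return Encoding.BigEndianUnicode.GetString (content, 2, content.Length - 2);
        }

        // No UTF-16 BOM: try UTF-8. The strict decoder throws a
        // DecoderFallbackException on invalid bytes instead of hiding them.
        var utf8 = new UTF8Encoding (encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        return utf8.GetString (content);
    }
}
```

UTF-7 and UTF-32 are deliberately not checked here, in line with the comment above.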
...are you sure about that? If it's legal mime and a standard encoding it should be supported IMHO. Also, in real life, I keep getting emails from Outlook 2019 with utf-7 mime parts:
...and UTF8 decoding fails when I access the MimeMessage TextBody or HtmlBody attributes. The exception "Unable to translate Unicode character \uDE9A at index 108 to specified code page." is thrown. Stack trace: As can be seen from the stack trace, MimeKit tries to use System.Text.UTF8Encoding.GetBytes() to decode the utf-7 data, which fails. In my example there is a charset parameter that says utf-7, as shown, but it still doesn't try to decode as utf-7, it seems.
@eriknuds You are misunderstanding the original discussion in this feature request. This feature request was for MimeKit to "sniff" the proper unicode charset of the content by checking for a unicode BOM if-and-only-if MimeKit fails to convert the content into a string using the specified charset (or if no charset parameter is specified). UTF-7 does not have a BOM that I am aware of. Also: if your content really was UTF-7, then MimeKit would never have tried to fall back to UTF-8. Try this: var text = textPart.GetText (Encoding.UTF7); My prediction is that you will get an exception.
I just tested this locally and I'm not getting any exceptions. I also tested I suspect the problem is that your version of .NET might not support UTF-7. What version of .NET are you using?
Hi Jeffrey, I realize the original post was about detecting the character set when it was not given by the content charset parameter, but the statements made about utf-7 support made it relevant enough, I thought. Also, there are BOM byte patterns, in fact 5, that signify utf-7 encoding: About UTF-7 BOM: I believe .NET 4.7.2 does support UTF-7 just fine, but I will do some more investigation at work to pin this issue down; maybe there are some environment or OS issues playing a part in this. Also, I believe the .NET charset decoders can be set to have a best fit or replacement (question mark) fallback policy. It would be really nice if MimeKit supported that, as there are, as you touch on, different versions of .NET, OSes and evolving unicode standards that may make the decoding process fail for certain characters, and I think it's much better to get the text with an occasional question mark for new/special unicode characters than nothing at all, at least in my use cases. https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-encoding#FallbackStrategy Thanks for the help and quick replies so far Jeffrey!
.NET 4.7.2 should have UTF-7 and it should work fine. I'm not sure why it isn't for you but it is for me? I thought maybe you were using the UniversalWindowsPhone81 or UniversalWindows81 targets (which have limited charset support outside of UTF8). I know about the fallbacks, and yes, they are valuable. That is why I have The problem with having MimeKit do it behind your back is that then you'd have no idea as a mail client author whether something went wrong or not so that you could deal with it. As far as the UTF-7 BOMs, well, I stand corrected but it wouldn't matter anyway seeing as how your text/html content does not begin with any of those 5 BOM sequences. I'm curious what happens when you call I also can't understand why Outlook decided to use UTF-7 in your sample message seeing as how all it does is encode HTML markup tokens, e.g. It's not necessary at all. I suspect that the sending client is misconfigured somehow because my Outlook client doesn't seem to do that (mine sends messages as UTF-8, even the HTML bodies).
Outlook/Exchange does so many weird and annoying things in different configurations and scenarios. I'm not sure why utf-7 is used, and I agree it seems like a very weird choice, but we get some of these emails more or less every day (among thousands of emails). Unfortunately I don't have user access to the email accounts in question; I pick these examples up from the error logs, where they are only partly written. I will continue to investigate and update here with any interesting findings. If MailKit supported specifying a fallback strategy, I would turn on best-effort or replacement fallback, since in my use cases I don't much mind the occasional missed character conversion. As I would want to log these cases, I would first try with the exception fallback and retry with the other fallback strategies in a catch block to get as much of the text as we could, as well as logging the exception.
Right, but my point was that if MimeKit did the fallback, then you wouldn't know. You'd just get question marks and never be sure if that was supposed to be that way or if it was a fallback character because you'd never get an exception.
That is why I would probably default to exception fallback, so I could try/catch/log any issues and retry decoding (property access) with a more lenient fallback strategy in my catch block, to get the best effort or replacement version of the data. Then I would both know it failed (I can log it and mark it with a warning or whatever) and also get the best text representation I could possibly get at the same time. If fallback strategy was a property of the MimeMessage object I could have a strict exception fallback default, then if an exception is thrown in my try block I could set the more lenient fallback strategy on the MimeMessage object and retry in the catch block to get the text again.
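The try-strict, log, then retry-lenient approach described here can be approximated today outside of MimeKit with the standard .NET fallback classes. The following is a minimal sketch under assumptions: `LenientDecoder` and `DecodeStrictThenLenient` are hypothetical names, the input is the already transfer-decoded bytes of the part, and the logging call is a placeholder for whatever logger the application uses.

```csharp
using System;
using System.Text;

static class LenientDecoder
{
    // Hypothetical helper (not a MimeKit API): decode strictly first so that
    // failures are visible, then retry with '?' replacement to salvage text.
    public static string DecodeStrictThenLenient (byte[] content, string charset, out bool usedFallback)
    {
        try {
            // Exception fallback: invalid bytes raise DecoderFallbackException.
            var strict = Encoding.GetEncoding (charset,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);
            var text = strict.GetString (content);
            usedFallback = false;
            return text;
        } catch (DecoderFallbackException ex) {
            // Placeholder logging; the caller now knows decoding was lossy.
            Console.Error.WriteLine ($"Strict {charset} decode failed: {ex.Message}");

            // Replacement fallback: each invalid sequence becomes "?".
            var lenient = Encoding.GetEncoding (charset,
                new EncoderReplacementFallback ("?"),
                new DecoderReplacementFallback ("?"));
            usedFallback = true;
            return lenient.GetString (content);
        }
    }
}
```

The `usedFallback` flag addresses the concern raised earlier in the thread: the caller still learns that something went wrong, while also getting the best available text. An `Encoding` built this way could presumably also be passed to `TextPart.GetText (Encoding)`.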
I could not reproduce the error observed in the logs using the data from the logs. If there were any unsavory characters there, or characters belonging to a newer revision of the unicode standard than the OS/.NET supported, it may be filtered out of the data before I get to see the data in my log viewer, so that might be the reason. But there are obviously some instances in which "mostly" valid utf-7 data fails to decode, and I must handle it somehow, so I guess I will have to use TextPart.GetText(Encoding) to decode instead of using the MimeMessage.HtmlBody etc. properties directly unless the fallback strategy of those properties will be configurable in the near future.
@eriknuds The For an example of this, take a look at the I know the class seems "complicated", but that's why I wrote the demo which pretty much anyone should be able to more-or-less copy & paste as a really good starting point for doing this. In the Ignore the comment and I'll go fix it to say the correct thing ;-) FWIW, you could use Access to the referenced image data is another reason why using a |
Thanks Jeffrey,
FWIW, I just updated the HtmlPreviewVisitor implementation in the FAQ to allow you to switch easily between file:// and data: URIs.
I noticed that for some MIME parts that are missing the charset parameter, this method fails to decode the text when the encoding is UTF-16, even if the UTF-16 BOM is present:
MimeKit/MimeKit/TextPart.cs
Lines 299 to 338 in 2249574
With this change I was able to successfully decode the part that was previously failing, and I thought I might share it. It makes use of byte order marks.
This was tested only for UTF-16.