Text part encodings is incorrect after saving (problem observed with Japanese encodings) #804
Comments
The problem is that the Content-Type does not define a charset which, according to the MIME specs, means the content is supposed to be us-ascii. MimeKit tries to be smarter than that, so it first tries converting the content using utf-8 and then falls back to iso-8859-1 if that fails. Unfortunately the text in your case is neither. |
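To make the fallback described above concrete, here is a minimal sketch of a "try UTF-8, then fall back to iso-8859-1" decoder. It is not MimeKit's actual implementation; the `CharsetFallback` class and `Decode` method are hypothetical names used only for illustration, and the sketch relies only on `System.Text`.

```csharp
using System;
using System.Text;

public static class CharsetFallback
{
    // Strict UTF-8 decoder: throws on invalid byte sequences instead of
    // silently substituting U+FFFD replacement characters.
    static readonly Encoding StrictUtf8 = new UTF8Encoding (false, true);

    // Hypothetical helper: decodes content with no declared charset.
    public static string Decode (byte[] content, out Encoding used)
    {
        try {
            used = StrictUtf8;
            return StrictUtf8.GetString (content);
        } catch (DecoderFallbackException) {
            // Every byte maps to a character in iso-8859-1, so this
            // never fails, even when the result is meaningless.
            used = Encoding.GetEncoding ("iso-8859-1");
            return used.GetString (content);
        }
    }
}
```

Because iso-8859-1 accepts any byte sequence, Shift_JIS or EUC-JP content that is not valid UTF-8 still "decodes" without error, just into the wrong characters. That would be consistent with garbled output rather than an exception.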
Is there anything I should do differently in the code? I am trying to add some text to the Text part. As you can see the MHT file opens ok before the modification, but is completely garbled after making the modification. |
@oren-boop My previous comment was posted from my phone so I wasn't able to view the zip attachments but am on my desktop now and taking a look. The "Good Example" has:
So that's why that one works. Taking a closer look at the raw HTML, it looks like they have a meta tag like this:
I've got to take the dog out for a walk, but when I get back I'll explain how that could be parsed to get the charset. Then you could use:
var text = tp.GetText (charset) + "\nBLABLABLA";
tp.SetText (charset, text);
One problem with using the TextPart.Text property setter is that TextPart currently always encodes the text as UTF-8 (which is why it changed in your output). That may be fine with you, but if it's not, you can use the SetText() method and pass either a charset string or a System.Text.Encoding if you have one. |
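As a rough sketch of the difference between the two approaches, the helpers below append text either through `SetText()` with an explicit charset or through the `Text` setter. The `SetTextDemo` class and its method names are made up for this example; only `TextPart.GetText (string)`, `TextPart.SetText (string, string)` and the `Text` property come from the discussion above.

```csharp
using MimeKit;

public static class SetTextDemo
{
    // Decodes and re-encodes with the same caller-supplied charset,
    // so the part's content stays in its original encoding.
    public static void AppendPreservingCharset (TextPart part, string charset, string suffix)
    {
        var text = part.GetText (charset) + suffix;
        part.SetText (charset, text);
    }

    // Uses the Text property setter, which (per the comment above)
    // re-encodes the content as UTF-8 regardless of the original charset.
    public static void AppendAsUtf8 (TextPart part, string suffix)
    {
        part.Text = part.Text + suffix;
    }
}
```

The first helper only helps if the caller already knows the right charset, which is exactly what is missing when the Content-Type header omits it.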
private static readonly Regex EncodingRegex = new Regex(@"<meta\s+http-equiv\s*=\s*""Content-Type""\s*content\s*=\s*""[^""]+charset\s*=\s*([^\s;""]+)",
RegexOptions.Compiled);
...
var match = EncodingRegex.Match(ret.Text, 0, Math.Min(1024, ret.Text.Length));
if (match.Success)
{
string charset= match.Groups[1].Value;
try
{
... = Encoding.GetEncoding(charset);
}
catch (ArgumentException)
{
}
}
but unfortunately, when I try forcing the encoding onto the text, it still doesn't help:
var text = tp.GetText("Shift_JIS") + "blablabla";
tp.SetText("Shift_JIS", text);
I still end up with a garbled file. |
Here's what I came up with to extract the charset:
static bool TryGetAttr (HtmlAttributeCollection attributes, HtmlAttributeId id, out HtmlAttribute attribute)
{
for (int i = 0; i < attributes.Count; i++) {
if (attributes[i].Id == id) {
attribute = attributes[i];
return true;
}
}
attribute = null;
return false;
}
static string AutodetectHtmlCharset (TextPart part)
{
using (var content = part.Content.Open ()) {
using (var reader = new StreamReader (content, true)) {
var tokenizer = new HtmlTokenizer (reader);
var insideHead = false;
while (tokenizer.ReadNextToken (out var token)) {
if (token.Kind != HtmlTokenKind.Tag)
continue;
var tag = (HtmlTagToken) token;
if (tag.Id == HtmlTagId.Head) {
if (tag.IsEndTag || tag.IsEmptyElement) {
// Stop tokenizing once we've reached </head>.
break;
}
insideHead = true;
} else if (insideHead && tag.Id == HtmlTagId.Meta && !tag.IsEndTag) {
if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) && attr.Value != null && attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
continue;
if (TryGetAttr (tag.Attributes, HtmlAttributeId.Content, out attr) && attr.Value != null && ContentType.TryParse (attr.Value, out var contentType))
return contentType.Charset;
}
}
}
}
return null;
}
And then my little test case looks like this:
var message = MimeMessage.Load (path);
foreach (var part in message.BodyParts.OfType<TextPart> ()) {
Encoding encoding;
string text;
if (string.IsNullOrEmpty (part.ContentType.Charset)) {
var charset = AutodetectHtmlCharset (part);
if (string.IsNullOrEmpty (charset)) {
// Let MimeKit try to auto-detect
text = part.GetText (out encoding);
} else {
encoding = Encoding.GetEncoding (charset);
text = part.GetText (encoding);
}
} else {
// Let MimeKit try to auto-detect
text = part.GetText (out encoding);
}
text += "\nblahblahblah";
part.SetText (encoding, text);
}
message.WriteTo (updatedPath);
And yes, I still see that the encoded text in the text/html parts changes (beyond the "\nblahblahblah" appended at the end). Okay, so the Shift_JIS Encoding's WebName and BodyName are iso-2022-jp, which is why MimeKit sets that value on the Content-Type MIME header (MimeKit also has its own logic for that mapping, because I think Mono's implementation didn't have that right or something). But I don't understand why the output bytes are different because the same System.Text.Encoding is used in |
Actually, on closer inspection, it looks like my code gets it right; the differences are because MimeKit's line-wrapping uses a different margin. If you want, I bet I could modify my solution above to avoid changing the Content-Type charset value, to minimize changes:
foreach (var part in message.BodyParts.OfType<TextPart> ()) {
string charset, text;
Encoding encoding;
charset = part.ContentType.Charset;
if (string.IsNullOrEmpty (charset)) {
charset = AutodetectHtmlCharset (part);
if (string.IsNullOrEmpty (charset)) {
// Let MimeKit try to auto-detect
text = part.GetText (out encoding);
} else {
encoding = Encoding.GetEncoding (charset);
text = part.GetText (encoding);
}
// Set charset back to null because that's what it originally was
charset = null;
} else {
// Let MimeKit try to auto-detect
text = part.GetText (out encoding);
}
text += "\nblahblahblah";
part.SetText (encoding, text);
// Override the charset value back to what it originally was
part.ContentType.Charset = charset;
} |
Thanks! Your solution works perfectly! Thanks! |
Awesome! And thanks for the donation! ❤️ |
FYI, my code snippet has a bug in it:
if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) && attr.Value != null && attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
should be:
if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) || attr.Value == null || !attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase)) |
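To spell out why the `&&` form is wrong: when the lookup succeeds, `!TryGetAttr (...)` is false, so the tag is never skipped; when it fails, `attr` is null and evaluating `attr.Value` would throw. The sketch below reproduces the corrected guard against a plain dictionary so it can be run without MimeKit. The `GuardDemo` class and the dictionary standing in for the tag's attributes are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;

public static class GuardDemo
{
    // Hypothetical stand-in for a <meta> tag's attributes (name -> value).
    // Returns true when the tag should be skipped, i.e. when it is NOT
    // an http-equiv="Content-Type" meta tag.
    public static bool ShouldSkip (IDictionary<string, string> attrs)
    {
        return !attrs.TryGetValue ("http-equiv", out var value)   // attribute missing
            || value == null                                        // attribute has no value
            || !value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase);
    }
}
```

The three `||` operands are the negations of the three conditions a matching tag must satisfy, so the guard skips a tag as soon as any one of them fails.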
I'm also going to integrate the TryGetAttr method into HtmlAttributeCollection as TryGetValue() (like the Dictionary API) and I am also thinking that I'm going to add an HTML charset detection API to TextPart because I definitely want this to be easier than it is right now. |
Addresses the complexity of detecting text encodings in cases like the one encountered in issue #804.
I've added a TryDetectEncoding() method to TextPart that simplifies this whole process. This new API will be included in a future MimeKit v3.4.0 release. |
MimeKit v3.4.0 has been released with this feature. |
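For readers on v3.4.0 or later, the earlier test case might shrink to something like the sketch below. It assumes `TryDetectEncoding` has the shape `bool TryDetectEncoding (out Encoding encoding)`, which this thread does not spell out, so check the MimeKit API reference before relying on it. The `AppendWithDetection` class and `AppendToTextParts` method are hypothetical names.

```csharp
using System.Linq;
using System.Text;
using MimeKit;

public static class AppendWithDetection
{
    public static void AppendToTextParts (MimeMessage message, string suffix)
    {
        foreach (var part in message.BodyParts.OfType<TextPart> ()) {
            // Remember the original charset (possibly null) so it can be restored.
            var originalCharset = part.ContentType.Charset;
            Encoding encoding;
            string text;

            // Assumed signature: bool TryDetectEncoding (out Encoding encoding).
            if (string.IsNullOrEmpty (originalCharset) && part.TryDetectEncoding (out encoding)) {
                text = part.GetText (encoding);
            } else {
                // Let MimeKit use the declared charset or its own fallback.
                text = part.GetText (out encoding);
            }

            part.SetText (encoding, text + suffix);

            // Restore the Content-Type charset to minimize changes to the message.
            part.ContentType.Charset = originalCharset;
        }
    }
}
```

This keeps the same structure as the manual version above: detect only when no charset is declared, re-encode with the same encoding, then put the charset parameter back the way it was.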
Thanks for the update!
Oren
From: Jeffrey Stedfast ***@***.***>
Sent: Thursday, June 16, 2022 17:26
To: jstedfast/MimeKit ***@***.***>
Cc: Oren Shnitzer ***@***.***>; State change ***@***.***>
Subject: Re: [jstedfast/MimeKit] Text part encodings is incorrect after saving (problem observed with Japanese encodings) (Issue #804)
|
Description
We have encountered an issue working with some Japanese MHT files. I believe the issue is caused by an encoding problem.
Platform (please complete the following information)
Steps to reproduce the behavior
Expected behavior
Expected: the output file looks like the original
Actual: some characters are garbled
Code Snippets
Here is the original MHT:
Here is the MHT after running the code:
Here is an image of the diff of the files. Changes to the encoding and characters:
Additional context
A good example.zip
Problem (JIS).zip
Problem 2 (EUC).zip