Text part encodings is incorrect after saving (problem observed with Japanese encodings) #804

oren-boop · 2022-06-15T10:00:30Z

Description

We have encountered an issue working with some Japanese MHT files. I believe the issue is caused by an encoding problem.

Platform (please complete the following information)

OS: Windows
.NET Runtime:
.NET Framework: .NET 6 as well as .NET Framework 4.7.2
MimeKit Version: 3.3

Steps to reproduce the behavior

Run the attached code snippet
Open the resulting MHT file

Expected behavior
Expected: the output file looks like the original
Actual: some characters are garbled

Code Snippets

using System;
using System.IO;
using System.Linq;
using System.Text;
using MimeKit;

namespace MimeKit_Encoding
{
    internal class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
            {
                Console.WriteLine("Usage: Mimekit_encoding {mime file path}");
                return;
            }

            using (var stream = new FileStream(args[0], FileMode.Open))
            {
                var message = MimeMessage.Load(stream);

                foreach (var item in message.BodyParts)
                {
                    Console.Write($"{item.GetType()}: {item.ContentType}");
                    if (item is TextPart tp)
                    {
                        Console.Write($" charset:{tp.ContentType.Charset??"NULL"}");
                        tp.Text += "\nBLABLABLA";
                    }
                    Console.WriteLine();
                }

                message.WriteTo(@"out.mht");
            }
        }
    }
}

Here is the original MHT:

Here is the MHT after running the code:

Here is an image of the diff of the files. Changes to the encoding and characters:

Additional context
A good example.zip
Problem (JIS).zip
Problem 2 (EUC).zip

jstedfast · 2022-06-15T11:57:27Z

The problem is that the Content-Type does not define a charset which, according to the MIME specs, means the content is supposed to be us-ascii.

MimeKit tries to be smarter than that, so it first tries converting the content using utf-8 and then falls back to iso-8859-1 if that fails.

Unfortunately the text in your case is neither.
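
To illustrate why that fallback cannot recover Shift_JIS text, here is a minimal sketch of a "strict UTF-8, then ISO-8859-1" decode using only System.Text. It is not MimeKit's actual implementation; the class and the `Decode` helper are hypothetical names chosen for this example.

```csharp
using System;
using System.Text;

static class CharsetFallbackSketch
{
    // Hypothetical helper: try strict UTF-8 first, then fall back to ISO-8859-1.
    static string Decode (byte[] bytes, out Encoding used)
    {
        // throwOnInvalidBytes: true makes invalid UTF-8 raise an exception
        // instead of silently producing replacement characters.
        var strictUtf8 = new UTF8Encoding (false, true);

        try {
            used = strictUtf8;
            return strictUtf8.GetString (bytes);
        } catch (DecoderFallbackException) {
            // ISO-8859-1 maps every byte to a character, so this never fails,
            // but for Shift_JIS input the resulting characters are wrong.
            used = Encoding.GetEncoding ("iso-8859-1");
            return used.GetString (bytes);
        }
    }

    static void Main ()
    {
        // "日本" encoded as Shift_JIS: not valid UTF-8, so the fallback is taken.
        var shiftJisBytes = new byte[] { 0x93, 0xFA, 0x96, 0x7B };
        var text = Decode (shiftJisBytes, out var used);

        Console.WriteLine ($"decoded as {used.WebName}, {text.Length} chars");
    }
}
```

Because ISO-8859-1 accepts any byte sequence, the fallback always "succeeds", which is why the damage only shows up once the text is written back out in a different charset.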

oren-boop · 2022-06-15T12:02:04Z

Is there anything I should do differently in the code? I am trying to add some text to the Text part. As you can see the MHT file opens ok before the modification, but is completely garbled after making the modification.

jstedfast · 2022-06-15T12:17:33Z

@oren-boop My previous comment was posted from my phone so I wasn't able to view the zip attachments but am on my desktop now and taking a look.

The "Good Example" has:

Content-Type: text/html;
	charset="shift_jis"

So that's why that one works.

Taking a closer look at the raw HTML, it looks like they have a meta tag like this:

<meta http-equiv=3D"Content-Type" con=
tent=3D"text/html; charset=3DShift_JIS">

I've got to take the dog out for a walk, but I'll respond when I get back about how that could be parsed to get the charset and then you could use:

var text = tp.GetText (charset) + "\nBLABLABLA";
tp.SetText (charset, text);

One problem with using the TextPart.Text property setter is that currently TextPart will always set the text as UTF-8 (hence why it changed in your output).

That may be fine with you, but if it's not, then you can use the SetText() method and either pass a charset string to use or a System.Text.Encoding if you have it.

oren-boop · 2022-06-15T12:36:44Z

I am able to get the encoding from the <meta ...> tag using:

private static readonly Regex EncodingRegex = new Regex(@"<meta\s+http-equiv\s*=\s*""Content-Type""\s*content\s*=\s*""[^""]+charset\s*=\s*([^\s;""]+)", 
            RegexOptions.Compiled);

...

var match = EncodingRegex.Match(ret.Text, 0, Math.Min(1024, ret.Text.Length));
if (match.Success)
{
    string charset = match.Groups[1].Value;
    try
    {
        ... = Encoding.GetEncoding(charset);
    }
    catch (ArgumentException)
    {
    }
}

but unfortunately, when I try forcing the encoding onto the text, it still doesn't help:

                        var text = tp.GetText("Shift_JIS") + "blablabla";
                        tp.SetText("Shift_JIS", text);

I still end up with a garbled file.

jstedfast · 2022-06-15T13:41:04Z

Here's what I came up with to extract the charset:

static bool TryGetAttr (HtmlAttributeCollection attributes, HtmlAttributeId id, out HtmlAttribute attribute)
{
	for (int i = 0; i < attributes.Count; i++) {
		if (attributes[i].Id == id) {
			attribute = attributes[i];
			return true;
		}
	}

	attribute = null;

	return false;
}

static string AutodetectHtmlCharset (TextPart part)
{
	using (var content = part.Content.Open ()) {
		using (var reader = new StreamReader (content, true)) {
			var tokenizer = new HtmlTokenizer (reader);
			var insideHead = false;

			while (tokenizer.ReadNextToken (out var token)) {
				if (token.Kind != HtmlTokenKind.Tag)
					continue;

				var tag = (HtmlTagToken) token;
				if (tag.Id == HtmlTagId.Head) {
					if (tag.IsEndTag || tag.IsEmptyElement) {
						// Stop tokenizing once we've reached </head>.
						break;
					}

					insideHead = true;
				} else if (insideHead && tag.Id == HtmlTagId.Meta && !tag.IsEndTag) {
					if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) && attr.Value != null && attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
						continue;

					if (TryGetAttr (tag.Attributes, HtmlAttributeId.Content, out attr) && attr.Value != null && ContentType.TryParse (attr.Value, out var contentType))
						return contentType.Charset;
				}
			}
		}
	}

	return null;
}

And then my little test case looks like this:

var message = MimeMessage.Load (path);

foreach (var part in message.BodyParts.OfType<TextPart> ()) {
	Encoding encoding;
	string text;

	if (string.IsNullOrEmpty (part.ContentType.Charset)) {
		var charset = AutodetectHtmlCharset (part);
		if (string.IsNullOrEmpty (charset)) {
			// Let MimeKit try to auto-detect
			text = part.GetText (out encoding);
		} else {
			encoding = Encoding.GetEncoding (charset);
			text = part.GetText (encoding);
		}
	} else {
		// Let MimeKit try to auto-detect
		text = part.GetText (out encoding);
	}

	text += "\nblahblahblah";
	part.SetText (encoding, text);
}

message.WriteTo (updatedPath);

And yes, I still see that the encoded text in the text/html parts changes (beyond the "\nblahblahblah" appended at the end).

Okay, so the Shift_JIS Encoding's WebName and BodyName are iso-2022-jp, which is why MimeKit sets that value on the Content-Type MIME header. MimeKit also has its own logic for that mapping, because I think Mono's implementation didn't have that right or something.
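
The names a given runtime reports for an encoding can be checked directly. The sketch below uses only System.Text and prints the three name properties for Shift_JIS. The values may differ between runtimes, so no expected output is shown. The `RegisterProvider` call is an assumption that the code runs on .NET Core or .NET 5+, where legacy code pages are not registered by default.

```csharp
using System;
using System.Text;

static class EncodingNamesSketch
{
    static void Main ()
    {
        // Assumption: running on .NET Core / .NET 5+, where Shift_JIS is only
        // available after registering the code pages provider.
        Encoding.RegisterProvider (CodePagesEncodingProvider.Instance);

        var encoding = Encoding.GetEncoding ("shift_jis");

        // These three properties are what mail-oriented code typically consults
        // when choosing a charset label for headers and bodies.
        Console.WriteLine ($"WebName:    {encoding.WebName}");
        Console.WriteLine ($"BodyName:   {encoding.BodyName}");
        Console.WriteLine ($"HeaderName: {encoding.HeaderName}");
    }
}
```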

But I don't understand why the output bytes are different because the same System.Text.Encoding is used in GetText(Encoding) to convert from bytes->string and in SetText(Encoding, string) to convert from string->bytes, so the only thing I can think of is that the Encoding has multiple ways of encoding the same text?

jstedfast · 2022-06-15T13:53:42Z

Actually, on closer inspection, it looks like my code gets it right; the differences are because MimeKit's line-wrapping is using a different margin.
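
As a rough illustration of why a different wrap margin is harmless, assume the part is quoted-printable encoded (as the `=3D` escapes and trailing `=` in the meta tag excerpt suggest). Soft line breaks then decode to nothing, so the same content wrapped at two margins yields identical bytes. The decoder below is a hypothetical minimal sketch that assumes well-formed input; it is not MimeKit's decoder.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class QuotedPrintableSketch
{
    // Hypothetical minimal decoder: handles only "=XX" escapes and "=" soft line breaks.
    static byte[] Decode (string input)
    {
        var output = new List<byte> ();

        for (int i = 0; i < input.Length; i++) {
            if (input[i] != '=') {
                output.Add ((byte) input[i]);
                continue;
            }

            if (i + 1 < input.Length && input[i + 1] == '\n') {
                i += 1; // soft line break ("=" + LF): emits nothing
            } else if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n') {
                i += 2; // soft line break ("=" + CRLF): emits nothing
            } else if (i + 2 < input.Length) {
                // Escaped byte, e.g. "=3D" decodes to 0x3D ('=').
                output.Add (Convert.ToByte (input.Substring (i + 1, 2), 16));
                i += 2;
            } else {
                output.Add ((byte) '='); // truncated escape: keep the '=' literally
            }
        }

        return output.ToArray ();
    }

    static void Main ()
    {
        // The same content wrapped at two different margins.
        var wrappedEarly = "<meta con=\ntent=3D\"text/html\">";
        var wrappedLate = "<meta content=3D\"text/=\nhtml\">";

        // Prints True: both inputs decode to <meta content="text/html">.
        Console.WriteLine (Decode (wrappedEarly).SequenceEqual (Decode (wrappedLate)));
    }
}
```

A byte-level diff of the encoded files will therefore show changes even when the decoded content is identical apart from the appended text.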

If you want, I bet I could modify my solution above to avoid changing the Content-Type charset value to minimize changes.

foreach (var part in message.BodyParts.OfType<TextPart> ()) {
	string charset, text;
	Encoding encoding;

	charset = part.ContentType.Charset;

	if (string.IsNullOrEmpty (charset)) {
		charset = AutodetectHtmlCharset (part);
		if (string.IsNullOrEmpty (charset)) {
			// Let MimeKit try to auto-detect
			text = part.GetText (out encoding);
		} else {
			encoding = Encoding.GetEncoding (charset);
			text = part.GetText (encoding);
		}

		// Set charset back to null because that's what it originally was
		charset = null;
	} else {
		// Let MimeKit try to auto-detect
		text = part.GetText (out encoding);
	}

	text += "\nblahblahblah";
	part.SetText (encoding, text);

	// Override the charset value back to what it originally was
	part.ContentType.Charset = charset;
}

oren-boop · 2022-06-15T19:12:57Z

Thanks! Your solution works perfectly!

Thanks!

jstedfast · 2022-06-15T20:31:34Z

Awesome! And thanks for the donation! ❤️

jstedfast · 2022-06-16T14:25:57Z

FYI, my code snippet has a bug in it:

if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) && attr.Value != null && attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))

should be:

if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) || attr.Value == null || !attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))

jstedfast · 2022-06-16T14:29:03Z

I'm also going to integrate the TryGetAttr method into HtmlAttributeCollection as TryGetValue() (like the Dictionary API) and I am also thinking that I'm going to add an HTML charset detection API to TextPart because I definitely want this to be easier than it is right now.


jstedfast · 2022-06-26T16:31:38Z

I've added a TryDetectEncoding() method to TextPart that simplifies this whole process.

This new API will be included in a future MimeKit v3.4.0 release.

jstedfast · 2022-08-20T12:16:51Z

MimeKit v3.4.0 has been released with this feature.
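
With that release, the earlier workaround might be reduced to something like the sketch below. It assumes the new method has the signature `bool TryDetectEncoding (out Encoding encoding)` and requires MimeKit 3.4.0 or later. The class, the `AppendToTextParts` helper, and its parameters are hypothetical names for this example.

```csharp
using System.Linq;
using System.Text;
using MimeKit;

static class AppendTextSketch
{
    // Hypothetical helper: appends a suffix to every text part while keeping
    // the bytes in the encoding the part was originally written in.
    public static void AppendToTextParts (string inputPath, string outputPath, string suffix)
    {
        var message = MimeMessage.Load (inputPath);

        foreach (var part in message.BodyParts.OfType<TextPart> ()) {
            // Remember the declared charset (possibly null) so it can be restored.
            var declaredCharset = part.ContentType.Charset;

            // Assumed signature: bool TryDetectEncoding (out Encoding encoding).
            if (!part.TryDetectEncoding (out Encoding encoding))
                continue; // leave the part untouched when nothing could be detected

            var text = part.GetText (encoding);
            part.SetText (encoding, text + suffix);

            // SetText updates the charset parameter; put the original value back
            // to keep the diff against the input file small.
            part.ContentType.Charset = declaredCharset;
        }

        message.WriteTo (outputPath);
    }
}
```

Skipping parts where detection fails is a deliberately conservative choice for this sketch; falling back to `GetText (out encoding)` as in the earlier snippets would be another option.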

oren-boop · 2022-10-11T09:28:55Z

Thanks for the update!

Oren

oren-boop changed the title ~~Detecting some text part encodings (problem observed with Japanese encodings)~~ Text part encodings is incorrect after saving (problem observed with Japanese encodings) Jun 15, 2022

oren-boop closed this as completed Jun 15, 2022

jstedfast added a commit that referenced this issue Jun 26, 2022

Added a new TextPart.TryDetectEncoding() API

7bc4c8c

Addresses the complexity of detecting text encodings in cases like the one encountered in issue #804.
