Text part encodings is incorrect after saving (problem observed with Japanese encodings) #804

Closed
oren-boop opened this issue Jun 15, 2022 · 13 comments


@oren-boop

oren-boop commented Jun 15, 2022

Description

We have encountered an issue working with some Japanese MHT files. I believe the issue is caused by an encoding problem.

Platform (please complete the following information)

  • OS: Windows
  • .NET Runtime:
  • .NET Framework: .Net Core 6 as well as .NET 4.7.2
  • MimeKit Version: 3.3

Steps to reproduce the behavior

  1. Run the attached code snippet
  2. Open the resulting MHT file

Expected behavior
Expected: the output file looks like the original
Actual: some characters are garbled

Code Snippets

using System;
using System.IO;
using System.Linq;
using System.Text;
using MimeKit;

namespace MimeKit_Encoding
{
    internal class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
            {
                Console.WriteLine("Usage: MimeKit_Encoding {mime file path}");
                return;
            }

            using (var stream = new FileStream(args[0], FileMode.Open))
            {
                var message = MimeMessage.Load(stream);

                foreach (var item in message.BodyParts)
                {
                    Console.Write($"{item.GetType()}: {item.ContentType}");
                    if (item is TextPart tp)
                    {
                        Console.Write($" charset:{tp.ContentType.Charset??"NULL"}");
                        tp.Text += "\nBLABLABLA";
                    }
                    Console.WriteLine();
                }

                message.WriteTo(@"out.mht");
            }
        }
    }
}

Here is the original MHT:
image

Here is the MHT after running the code:
image

Here is an image of the diff of the files. Changes to the encoding and characters:
image

Additional context
A good example.zip
Problem (JIS).zip
Problem 2 (EUC).zip

@oren-boop oren-boop changed the title Detecting some text part encodings (problem observed with Japanese encodings) Text part encodings is incorrect after saving (problem observed with Japanese encodings) Jun 15, 2022
@jstedfast
Owner

The problem is that the Content-Type does not define a charset which, according to the MIME specs, means the content is supposed to be us-ascii.

MimeKit tries to be smarter than that, so it first tries converting the content using utf-8 and then falls back to iso-8859-1 if that fails.

Unfortunately the text in your case is neither.
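
For illustration, the fallback order described above (strict UTF-8 first, then ISO-8859-1) can be sketched with plain System.Text calls. This is not MimeKit's actual implementation; `CharsetFallbackSketch` and `DecodeWithFallback` are hypothetical names, and the only thing taken from the thread is the order of the two attempts.

```csharp
using System;
using System.Text;

static class CharsetFallbackSketch
{
	// Hypothetical helper: decode bytes that carry no declared charset.
	// Tries strict UTF-8 first and falls back to ISO-8859-1 on invalid input.
	public static string DecodeWithFallback (byte[] bytes, out Encoding used)
	{
		// throwOnInvalidBytes: true makes GetString throw instead of
		// silently substituting U+FFFD for malformed sequences.
		var strictUtf8 = new UTF8Encoding (false, true);

		try {
			used = strictUtf8;
			return strictUtf8.GetString (bytes);
		} catch (DecoderFallbackException) {
			// ISO-8859-1 maps every byte to a character, so this never fails,
			// but the result is only meaningful if the text really is Latin-1.
			used = Encoding.GetEncoding ("iso-8859-1");
			return used.GetString (bytes);
		}
	}
}
```

Feeding it the Shift_JIS bytes for 日本語 (0x93 0xFA 0x96 0x7B 0x8C 0xEA) fails the strict UTF-8 pass, because 0x93 cannot start a UTF-8 sequence. The bytes are then read as ISO-8859-1 and come out as unrelated Latin characters and control codes, which is the kind of garbling reported here.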

@oren-boop
Author

Is there anything I should do differently in the code? I am trying to add some text to the Text part. As you can see the MHT file opens ok before the modification, but is completely garbled after making the modification.

@jstedfast
Owner

@oren-boop My previous comment was posted from my phone, so I wasn't able to view the zip attachments, but I'm on my desktop now and taking a look.

The "Good Example" has:

Content-Type: text/html;
	charset="shift_jis"

So that's why that one works.

Taking a closer look at the raw HTML, it looks like they have a meta tag like this:

<meta http-equiv=3D"Content-Type" con=
tent=3D"text/html; charset=3DShift_JIS">

I've got to take the dog out for a walk, but I'll respond when I get back about how that could be parsed to get the charset and then you could use:

var text = tp.GetText (charset) + "\nBLABLABLA";
tp.SetText (charset, text);

One problem with using the TextPart.Text property setter is that currently TextPart will always set the text as UTF-8 (hence why it changed in your output).

That may be fine with you, but if it's not, then you can use the SetText() method and either pass a charset string to use or a System.Text.Encoding if you have it.

@oren-boop
Author

  1. I am able to get the encoding from the <meta ...> tag using:
private static readonly Regex EncodingRegex = new Regex(
    @"<meta\s+http-equiv\s*=\s*""Content-Type""\s*content\s*=\s*""[^""]+charset\s*=\s*([^\s;""]+)",
    RegexOptions.Compiled);

...

var match = EncodingRegex.Match(ret.Text, 0, Math.Min(1024, ret.Text.Length));
if (match.Success)
{
    string charset = match.Groups[1].Value;
    try
    {
        ... = Encoding.GetEncoding(charset);
    }
    catch (ArgumentException)
    {
    }
}

but unfortunately, when I try forcing the encoding onto the text, it still doesn't help:

var text = tp.GetText("Shift_JIS") + "blablabla";
tp.SetText("Shift_JIS", text);

I still end up with a garbled file.
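
As a self-contained illustration of the regex approach above, here is a minimal sketch that can be run against a decoded HTML string. `MetaCharsetSketch` and `FindCharset` are hypothetical names, and `RegexOptions.IgnoreCase` is an addition that is not in the original snippet. Like the original, it assumes the `http-equiv` attribute comes before `content` and that both use double quotes.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaCharsetSketch
{
	// Same pattern as in the comment above, with IgnoreCase added (an assumption).
	static readonly Regex MetaCharset = new Regex (
		@"<meta\s+http-equiv\s*=\s*""Content-Type""\s*content\s*=\s*""[^""]+charset\s*=\s*([^\s;""]+)",
		RegexOptions.IgnoreCase | RegexOptions.Compiled);

	// Returns the charset named in the first matching <meta> tag, or null if none is found.
	public static string FindCharset (string html)
	{
		// Only scan the first 1024 characters, as the original snippet does.
		var match = MetaCharset.Match (html, 0, Math.Min (1024, html.Length));

		return match.Success ? match.Groups[1].Value : null;
	}
}
```

Note that this only works on text that has already been decoded from quoted-printable; the raw `=3D` form shown earlier in the thread would not match.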

@jstedfast
Owner

Here's what I came up with to extract the charset:

static bool TryGetAttr (HtmlAttributeCollection attributes, HtmlAttributeId id, out HtmlAttribute attribute)
{
	for (int i = 0; i < attributes.Count; i++) {
		if (attributes[i].Id == id) {
			attribute = attributes[i];
			return true;
		}
	}

	attribute = null;

	return false;
}

static string AutodetectHtmlCharset (TextPart part)
{
	using (var content = part.Content.Open ()) {
		using (var reader = new StreamReader (content, true)) {
			var tokenizer = new HtmlTokenizer (reader);
			var insideHead = false;

			while (tokenizer.ReadNextToken (out var token)) {
				if (token.Kind != HtmlTokenKind.Tag)
					continue;

				var tag = (HtmlTagToken) token;
				if (tag.Id == HtmlTagId.Head) {
					if (tag.IsEndTag || tag.IsEmptyElement) {
						// Stop tokenizing once we've reached </head>.
						break;
					}

					insideHead = true;
				} else if (insideHead && tag.Id == HtmlTagId.Meta && !tag.IsEndTag) {
					if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) && attr.Value != null && attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
						continue;

					if (TryGetAttr (tag.Attributes, HtmlAttributeId.Content, out attr) && attr.Value != null && ContentType.TryParse (attr.Value, out var contentType))
						return contentType.Charset;
				}
			}
		}
	}

	return null;
}

And then my little test case looks like this:

var message = MimeMessage.Load (path);

foreach (var part in message.BodyParts.OfType<TextPart> ()) {
	Encoding encoding;
	string text;

	if (string.IsNullOrEmpty (part.ContentType.Charset)) {
		var charset = AutodetectHtmlCharset (part);
		if (string.IsNullOrEmpty (charset)) {
			// Let MimeKit try to auto-detect
			text = part.GetText (out encoding);
		} else {
			encoding = Encoding.GetEncoding (charset);
			text = part.GetText (encoding);
		}
	} else {
		// A charset is declared, so let MimeKit decode using it
		text = part.GetText (out encoding);
	}

	text += "\nblahblahblah";
	part.SetText (encoding, text);
}

message.WriteTo (updatedPath);

And yes, I still see that the encoded text in the text/html parts changes (beyond the "\nblahblahblah" added at the end).

Okay, so the Shift_JIS Encoding's WebName and BodyName are iso-2022-jp, which is why MimeKit sets that value on the Content-Type MIME header. (MimeKit also has its own logic for that mapping, because I think Mono's implementation didn't have that right or something.)

But I don't understand why the output bytes are different: the same System.Text.Encoding is used in GetText(Encoding) to convert from bytes->string and in SetText(Encoding, string) to convert from string->bytes. The only thing I can think of is that the Encoding has multiple ways of encoding the same text?

@jstedfast
Owner

Actually, on closer inspection, it looks like my code gets it right; the differences are because MimeKit's line-wrapping is using a different margin.

If you want, I bet I could modify my solution above to avoid changing the Content-Type charset value to minimize changes.

foreach (var part in message.BodyParts.OfType<TextPart> ()) {
	string charset, text;
	Encoding encoding;

	charset = part.ContentType.Charset;

	if (string.IsNullOrEmpty (charset)) {
		charset = AutodetectHtmlCharset (part);
		if (string.IsNullOrEmpty (charset)) {
			// Let MimeKit try to auto-detect
			text = part.GetText (out encoding);
		} else {
			encoding = Encoding.GetEncoding (charset);
			text = part.GetText (encoding);
		}

		// Set charset back to null because that's what it originally was
		charset = null;
	} else {
		// A charset is declared, so let MimeKit decode using it
		text = part.GetText (out encoding);
	}

	text += "\nblahblahblah";
	part.SetText (encoding, text);

	// Override the charset value back to what it originally was
	part.ContentType.Charset = charset;
}
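
The point about line-wrapping margins can be illustrated without MimeKit. The sketch below is a deliberately simplified quoted-printable encoder and decoder: the names are hypothetical, it is not MimeKit's encoder, and it treats its input as raw bytes (so spaces and line endings are escaped as well, which real encoders handle differently). It only aims to show that the same bytes wrapped at two different margins produce different encoded text that decodes to identical content.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuotedPrintableSketch
{
	// Escapes '=' and every byte outside 33..126 as =XX, and inserts a soft
	// line break ("=\r\n") so that no encoded line exceeds maxLineLength.
	public static string Encode (byte[] data, int maxLineLength)
	{
		var sb = new StringBuilder ();
		int lineLength = 0;

		foreach (byte b in data) {
			string token = (b >= 33 && b <= 126 && b != (byte) '=')
				? ((char) b).ToString ()
				: "=" + b.ToString ("X2");

			// Reserve one column for the '=' that ends a soft-wrapped line.
			if (lineLength + token.Length > maxLineLength - 1) {
				sb.Append ("=\r\n");
				lineLength = 0;
			}

			sb.Append (token);
			lineLength += token.Length;
		}

		return sb.ToString ();
	}

	// Reverses Encode: soft line breaks contribute no bytes.
	public static byte[] Decode (string text)
	{
		var bytes = new List<byte> ();

		for (int i = 0; i < text.Length; i++) {
			if (text[i] != '=') {
				bytes.Add ((byte) text[i]);
			} else if (i + 2 < text.Length && text[i + 1] == '\r' && text[i + 2] == '\n') {
				i += 2; // skip the soft line break
			} else {
				bytes.Add (Convert.ToByte (text.Substring (i + 1, 2), 16));
				i += 2; // skip the two hex digits
			}
		}

		return bytes.ToArray ();
	}
}
```

Calling `Encode (data, 76)` and `Encode (data, 40)` on the same input gives two different strings, yet `Decode` returns the original bytes for both. A byte-level diff of two MHT files can therefore show changes throughout a part even when the decoded text differs only by the appended line.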

@oren-boop
Author

Thanks! Your solution works perfectly!

Thanks!

@jstedfast
Owner

Awesome! And thanks for the donation! ❤️

@jstedfast
Owner

jstedfast commented Jun 16, 2022

FYI, my code snippet has a bug in it:

if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) && attr.Value != null && attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))

should be:

if (!TryGetAttr (tag.Attributes, HtmlAttributeId.HttpEquiv, out var attr) || attr.Value == null || !attr.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))

@jstedfast
Owner

jstedfast commented Jun 16, 2022

I'm also going to integrate the TryGetAttr method into HtmlAttributeCollection as TryGetValue() (like the Dictionary API). I'm also thinking of adding an HTML charset detection API to TextPart, because I definitely want this to be easier than it is right now.

jstedfast added a commit that referenced this issue Jun 26, 2022
Addresses the complexity of detecting text encodings in cases like the one
encountered in issue #804.
@jstedfast
Owner

I've added a TryDetectEncoding() method to TextPart that simplifies this whole process.

This new API will be included in a future MimeKit v3.4.0 release.

@jstedfast
Owner

MimeKit v3.4.0 has been released with this feature.

@oren-boop
Author

oren-boop commented Oct 11, 2022 via email
