HTML entity for carriage return is being encoded #808

SenaraW · 2022-06-24T12:35:44Z

Describe the bug
When you convert a mimetext containing &#13;, the entity is encoded an additional time, so the output contains &amp;#13;
Other HTML entities do not seem to be affected.

Platform (please complete the following information):

OS: Windows
.NET Runtime: CLR
.NET Framework: .NET 4.8
MimeKit Version: 3.3.0

To Reproduce
Convert the following mimetext to html:

Content-Type: text/html; charset=utf-8

<html>
<body>
I'm on holiday until&nbsp; June 17, 2022.&#13;
</body>
</html>

This will result in the following html:

<html>
<body oncontextmenu="return false;">
I&#39;m on holiday until&#160; June 17, 2022.&amp;#13;
</body>
</html>

Expected behavior
Output should be &#13;


jstedfast · 2022-06-24T13:11:58Z

How are you converting?

If you could write one up, a simple test-case program would be ideal so I can easily reproduce this to see what is going wrong.

My gut instinct is that the HtmlTokenizer's DecodeCharacterReferences property is set to false and then the HtmlCDataToken content is being written via HtmlWriter.WriteText()?

FWIW, if you are doing this manually and the above diagnosis sounds correct, then you want to use HtmlWriter.WriteMarkup() so that the cdata doesn't get encoded.
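
To illustrate the difference between the two write paths, here is a rough sketch in Python. It is not MimeKit code: `write_text` and `write_markup` are hypothetical stand-ins for HtmlWriter.WriteText() and HtmlWriter.WriteMarkup(), under the assumption that the former escapes its input and the latter passes it through unchanged.

```python
import html

def write_text(out, text):
    """Hypothetical text writer: escapes HTML special characters first."""
    out.append(html.escape(text, quote=False))

def write_markup(out, markup):
    """Hypothetical markup writer: appends the input verbatim."""
    out.append(markup)

# Character data whose entity was left undecoded by the tokenizer.
raw = "June 17, 2022.&#13;"

escaped, verbatim = [], []
write_text(escaped, raw)      # the '&' of the entity is escaped again
write_markup(verbatim, raw)   # the entity survives untouched

print("".join(escaped))   # June 17, 2022.&amp;#13;
print("".join(verbatim))  # June 17, 2022.&#13;
```

If the data still contains raw entities, escaping it a second time is what produces &amp;#13; in the output.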

SenaraW · 2022-06-24T14:55:26Z

Here's a small test program to reproduce the bug: HTMLEntityTest.zip
It takes the MIME text above, converts it to HTML, and outputs the result to the textbox.

I hope this helps!

jstedfast · 2022-06-26T02:36:36Z

Okay, so the problem is that, according to the W3C HTML5 specification, &#13; is a parse error.

Note here: https://dev.w3.org/html5/spec-LC/tokenization.html#consume-a-character-reference

... If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF...
then this is a parse error.

When MimeKit's HtmlTokenizer encounters a parse error, it emits the raw entity instead of decoding it. When the HtmlToHtml converter then writes out the tokens it receives, the character data is re-encoded, which turns &#13; into &amp;#13; and creates this issue.

I think the solution is to tell the tokenizer not to decode character references, which will resolve this.
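
A toy model of the round trip, sketched in Python rather than taken from MimeKit. All names here (`decode_refs`, `encode_text`, `convert`, `PARSE_ERROR_CODEPOINTS`) are hypothetical, and treating only code point 13 as a parse error is an assumption made to keep the example small. It also assumes that undecoded data can be written back out verbatim.

```python
import html
import re

# Matches decimal numeric character references such as &#13; or &#160;
NUMERIC_REF = re.compile(r"&#(\d+);")

# Assumption for this sketch: only code point 13 counts as a parse error.
PARSE_ERROR_CODEPOINTS = {13}

def decode_refs(data):
    """Decode numeric references, leaving parse-error references as raw text."""
    def keep_or_decode(match):
        code_point = int(match.group(1))
        if code_point in PARSE_ERROR_CODEPOINTS:
            return match.group(0)  # emitted raw, not decoded
        return chr(code_point)
    return NUMERIC_REF.sub(keep_or_decode, data)

def encode_text(text):
    """Escape HTML special characters and encode non-ASCII as numeric references."""
    escaped = html.escape(text, quote=False)
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in escaped)

def convert(data, decode_character_references):
    """Toy converter: decode and re-encode, or pass undecoded data through."""
    if decode_character_references:
        return encode_text(decode_refs(data))
    return data

source = "until&#160; June 17, 2022.&#13;"
print(convert(source, decode_character_references=True))
# until&#160; June 17, 2022.&amp;#13;
print(convert(source, decode_character_references=False))
# until&#160; June 17, 2022.&#13;
```

With decoding enabled, &#160; survives the round trip because it is decoded and then re-encoded, while the raw &#13; has its ampersand escaped. With decoding disabled, nothing is re-encoded and both entities pass through unchanged.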


jstedfast · 2022-08-20T12:16:30Z

MimeKit v3.4.0 has been released with this fix.

jstedfast added a commit that referenced this issue Jun 26, 2022

Changed HtmlToHtml converter to avoid decoding character references.

9aa69b0

Fixes issue #808

jstedfast closed this as completed Jun 26, 2022

jstedfast added a commit that referenced this issue Jun 26, 2022

Added unit test for issue #808

4a574bd

jstedfast mentioned this issue Jul 1, 2022

There is one more unexpected space(' ') appear in the attachment name after parsing #809

Closed
