HTML entity for carriage return is being encoded #808

Closed
SenaraW opened this issue Jun 24, 2022 · 4 comments

@SenaraW

SenaraW commented Jun 24, 2022

Describe the bug
When you convert a MIME text containing &#13;, the entity is encoded an additional time, resulting in the output being &amp;#13;.
Other HTML entities do not seem to be affected.

Platform (please complete the following information):

  • OS: Windows
  • .NET Runtime: CLR
  • .NET Framework: .NET 4.8
  • MimeKit Version: 3.3.0

To Reproduce
Convert the following mimetext to html:

Content-Type: text/html; charset=utf-8

<html>
<body>
I'm on holiday until&nbsp; June 17, 2022.&#13;
</body>
</html>

This will result in the following html:

<html>
<body oncontextmenu="return false;">
I&#39;m on holiday until&#160; June 17, 2022.&amp;#13;
</body>
</html>

Expected behavior
Output should be &#13;

@jstedfast
Owner

jstedfast commented Jun 24, 2022

How are you converting?

If you could write one up, a simple test-case program would be ideal so I can easily reproduce this to see what is going wrong.

My gut instinct is that the HtmlTokenizer's DecodeCharacterReferences property is set to false and then the HtmlCDataToken content is being written via HtmlWriter.WriteText()?

FWIW, if you are doing this manually and the above diagnosis sounds correct, then you want to use HtmlWriter.WriteMarkup() so that the cdata doesn't get encoded.
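The difference between the two write paths can be illustrated with a minimal sketch. It uses Python's standard `html` module rather than MimeKit, and the helper names `write_text` and `write_markup` are hypothetical stand-ins that only mirror the idea behind the methods named above; they are not MimeKit's implementation.

```python
import html


def write_text(data: str) -> str:
    """Treat data as plain text: escape markup-significant characters."""
    return html.escape(data, quote=False)


def write_markup(data: str) -> str:
    """Treat data as already-encoded markup: emit it unchanged."""
    return data


# A character reference that was left undecoded (assumed input for this sketch).
raw = "June 17, 2022.&#13;"

print(write_text(raw))    # the ampersand is escaped a second time
print(write_markup(raw))  # the reference survives as written
```

If the content still holds raw character references, writing it as text escapes the leading ampersand again, which matches the `&amp;#13;` seen in the report.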

@SenaraW
Author

SenaraW commented Jun 24, 2022

Here's a small test program to reproduce the bug: HTMLEntityTest.zip
It converts the MIME text above to HTML and outputs the result to the text box.

I hope this helps!

@jstedfast
Owner

Okay, so the problem is that, according to the W3C HTML5 specification, &#13; is a parse error.

Note here: https://dev.w3.org/html5/spec-LC/tokenization.html#consume-a-character-reference

... If the number is in the range 0x0001 to 0x0008, 0x000E to 0x001F, 0x007F to 0x009F, 0xFDD0 to 0xFDEF...
then this is a parse error.

When MimeKit's HtmlTokenizer encounters a parse error, it just emits the raw entity instead of decoding it. When the HtmlToHtml converter then writes out the tokens it gets, the cdata is re-encoded, which creates this issue.

I think the solution is to tell the tokenizer not to decode character references, which will resolve this.
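As a rough illustration of why leaving references undecoded avoids the double encoding, here is a minimal Python sketch. It is an assumption-laden stand-in, not MimeKit's code: the regular expression `CHAR_REF` and the function `reencode_preserving_references` are hypothetical names invented for this example. Character references are copied through exactly as written, and only the surrounding plain text is escaped.

```python
import html
import re

# Matches named, decimal, and hexadecimal character references,
# for example &nbsp;, &#13;, and &#x0D; (a simplification for this sketch).
CHAR_REF = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def reencode_preserving_references(text: str) -> str:
    """Escape plain text but copy existing character references verbatim."""
    out = []
    pos = 0
    for match in CHAR_REF.finditer(text):
        # Escape the plain text that precedes this reference.
        out.append(html.escape(text[pos:match.start()], quote=False))
        # Pass the reference through untouched, valid or not.
        out.append(match.group(0))
        pos = match.end()
    # Escape whatever plain text follows the last reference.
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


print(reencode_preserving_references("holiday until&nbsp; June 17, 2022.&#13;"))
print(reencode_preserving_references("Q&A <draft>"))
```

Because the references are never decoded in this sketch, it makes no difference whether a given reference counts as a parse error: it is written back exactly as it appeared in the input, while a bare ampersand in ordinary text is still escaped.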

@jstedfast
Owner

MimeKit v3.4.0 has been released with this fix.
