Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double encoding of HTML entities in attribute values #84

Closed
mlawry opened this issue Aug 19, 2016 · 5 comments
Closed

Double encoding of HTML entities in attribute values #84

mlawry opened this issue Aug 19, 2016 · 5 comments

Comments

@mlawry
Copy link

mlawry commented Aug 19, 2016

Sanitizing the following HTML fragment

<input type="text" name="my_name" value="&lt;insert name&gt;" />

Results in the following output:

<input type="text" name="my_name" value="&amp;lt;insert name&amp;gt;">

The intended text field display is <insert name> but after sanitation the display is &lt;insert name&gt;. There seems to be double encoding of the value= attribute value. I traced the source code and the reason seems to be

  • After parsing but before sanitation (i.e. before calling this method), AngleSharp converts the HTML entities back into text and stores the value= attribute value as <insert name>.
  • Sanitation of attributes converts greater than and less than characters into their equivalent HTML entities. See this line. Converted string is stored back into AngleSharp DOM as the new attribute value. So now the value= attribute value is &lt;insert name&gt;.
  • When AngleSharp DOM is converted to HTML, AngleSharp's HtmlMarkupFormatter will again encode the attribute value, which results in the output value="&amp;lt;insert name&amp;gt;".

To resolve this issue, I think HtmlSanitizer shouldn't encode the greater than and less than characters at this line; AngleSharp will do it later so there is no need. What do you think?

@mlawry mlawry changed the title Double encoding of HTML entities Double encoding of HTML entities in attribute values Aug 19, 2016
@mganss
Copy link
Owner

mganss commented Aug 19, 2016

Thanks for the report and the detailed analysis.

This behavior is a remnant from the time CsQuery was used instead of AngleSharp. The problem is that AngleSharp does not encode < and >, although it does encode &:

var parser = new HtmlParser(new Configuration().WithCss());
var html = @"<input type=""text"" name=""my_name"" value=""&lt;insert name&gt;"" />";
var dom = parser.Parse(html);
var html2 = dom.ToHtml();
 // → <html><head></head><body><input type="text" name="my_name" value="<insert name>"></body></html>

This behavior is by design.

Currently, I have no idea how to fix this using the HtmlMarkupFormatter.

@mlawry
Copy link
Author

mlawry commented Aug 20, 2016

Ahh, OK. I didn't realise AngleSharp does not encode < and >, which is not required in the specs.

However, I don't see why HtmlSanitizer should choose to encode < and > in attributes values. As you've pointed out, it is not required by the specs. If dom.ToHtml() guarantees that attribute values are always surrounded by double quotes, then what's the worry?

It could be something that I don't understand, but what's the reason behind HtmlSanitizer encoding < and > characters?

mlawry pushed a commit to mlawry/HtmlSanitizer that referenced this issue Aug 22, 2016
@mganss
Copy link
Owner

mganss commented Aug 22, 2016

It can be exploited in buggy browsers, e.g. see https://html5sec.org/#59 and https://html5sec.org/#102 (although I must admit I couldn't repro these in IETester).

@mganss
Copy link
Owner

mganss commented Aug 22, 2016

I'm ashamed to say that there's even a test case for this issue (SanitizeEscapeAttrTest()) which I seem to have just adjusted to match the output of AngleSharp (double encoding) when the switch was done from CsQuery 😳

@mganss mganss closed this as completed in 553b8dd Aug 22, 2016
@mlawry
Copy link
Author

mlawry commented Aug 22, 2016

Thanks for the fix @mganss. I was also thinking along the lines of a customised HtmlMarkupFormatter. Although in my case I prefer to encode attributes values using System.Web.Security.AntiXss.AntiXssEncoder, which I can now do using my own IMarkupFormatter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants