Double encoding of HTML entities in attribute values #84

mlawry · 2016-08-19T02:09:17Z

Sanitizing the following HTML fragment

<input type="text" name="my_name" value="&lt;insert name&gt;" />

Results in the following output:

<input type="text" name="my_name" value="&amp;lt;insert name&amp;gt;">

The intended text field display is <insert name>, but after sanitization the display is &lt;insert name&gt;. There seems to be double encoding of the value= attribute value. I traced the source code and the reason seems to be:

1. After parsing but before sanitization (i.e. before calling this method), AngleSharp converts the HTML entities back into text and stores the value= attribute value as <insert name>.
2. Sanitization of attributes converts greater-than and less-than characters into their equivalent HTML entities. See this line. The converted string is stored back into the AngleSharp DOM as the new attribute value, so the value= attribute value is now &lt;insert name&gt;.
3. When the AngleSharp DOM is converted to HTML, AngleSharp's HtmlMarkupFormatter encodes the attribute value again (the & characters become &amp;), which results in the output value="&amp;lt;insert name&amp;gt;".

To resolve this issue, I think HtmlSanitizer shouldn't encode the greater than and less than characters at this line; AngleSharp will do it later so there is no need. What do you think?
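The three steps above can be reproduced without AngleSharp or HtmlSanitizer. The following is only a minimal sketch: the class and method names are hypothetical stand-ins for the parser, the sanitizer's attribute handling, and the serializer, and the serializer stand-in assumes that only & and " are escaped on output.

```csharp
using System;
using System.Net;

// Hypothetical stand-ins for the three stages; none of these are AngleSharp or HtmlSanitizer APIs.
public static class DoubleEncodingSketch
{
    // Stage 1 (parser): entity references in the markup are decoded into plain text.
    public static string ParseAttributeValue(string rawValue)
    {
        return WebUtility.HtmlDecode(rawValue);
    }

    // Stage 2 (sanitizer): angle brackets in the DOM value are replaced by entities.
    public static string EscapeAngleBrackets(string value)
    {
        return value.Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Stage 3 (serializer): assumed to escape only '&' and '"' when writing the attribute.
    public static string SerializeAttributeValue(string value)
    {
        return value.Replace("&", "&amp;").Replace("\"", "&quot;");
    }

    // Runs all three stages in order on the raw attribute value from the markup.
    public static string RoundTrip(string rawValue)
    {
        return SerializeAttributeValue(EscapeAngleBrackets(ParseAttributeValue(rawValue)));
    }
}
```

Calling DoubleEncodingSketch.RoundTrip("&lt;insert name&gt;") walks the value through <insert name>, then &lt;insert name&gt;, and finally &amp;lt;insert name&amp;gt;. The & introduced in stage 2 is what stage 3 encodes a second time.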

mganss · 2016-08-19T13:34:33Z

Thanks for the report and the detailed analysis.

This behavior is a remnant from the time CsQuery was used instead of AngleSharp. The problem is that AngleSharp does not encode < and >, although it does encode &:

var parser = new HtmlParser(new Configuration().WithCss());
var html = @"<input type=""text"" name=""my_name"" value=""&lt;insert name&gt;"" />";
var dom = parser.Parse(html);
var html2 = dom.ToHtml();
 // → <html><head></head><body><input type="text" name="my_name" value="<insert name>"></body></html>

This behavior is by design.

Currently, I have no idea how to fix this using the HtmlMarkupFormatter.

mlawry · 2016-08-20T01:50:30Z

Ahh, OK. I didn't realise AngleSharp does not encode < and >, which is not required in the specs.

However, I don't see why HtmlSanitizer should choose to encode < and > in attribute values. As you've pointed out, it is not required by the specs. If dom.ToHtml() guarantees that attribute values are always surrounded by double quotes, then what's the worry?

It could be something that I don't understand, but what's the reason behind HtmlSanitizer encoding < and > characters?

mganss · 2016-08-22T10:57:30Z

It can be exploited in buggy browsers, e.g. see https://html5sec.org/#59 and https://html5sec.org/#102 (although I must admit I couldn't repro these in IETester).

mganss · 2016-08-22T11:04:15Z

I'm ashamed to say that there's even a test case for this issue (SanitizeEscapeAttrTest()) which I seem to have just adjusted to match the output of AngleSharp (double encoding) when the switch was done from CsQuery 😳

mlawry · 2016-08-22T23:20:36Z

Thanks for the fix @mganss. I was also thinking along the lines of a customised HtmlMarkupFormatter, although in my case I prefer to encode attribute values using System.Web.Security.AntiXss.AntiXssEncoder, which I can now do using my own IMarkupFormatter.
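The idea of moving the encoding into a custom formatter can be illustrated with a single-pass escaping routine: the DOM keeps the decoded text, and &, <, > and " are each escaped exactly once when the attribute is written out. This is only a sketch under that assumption. AttributeEncodingSketch and EncodeAttributeValue are hypothetical names, not the code from the referenced commit and not part of AngleSharp's IMarkupFormatter interface.

```csharp
using System;
using System.Text;

// Hypothetical helper that a custom formatter could call when writing an attribute value.
public static class AttributeEncodingSketch
{
    // Escapes each special character exactly once, in a single pass over the decoded DOM value.
    public static string EncodeAttributeValue(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (var c in value)
        {
            switch (c)
            {
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

With the sanitizer no longer pre-encoding, a DOM value of <insert name> is written as &lt;insert name&gt;, which a browser decodes back to the intended display text.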

mlawry changed the title ~~Double encoding of HTML entities~~ Double encoding of HTML entities in attribute values Aug 19, 2016

mlawry pushed a commit to mlawry/HtmlSanitizer that referenced this issue Aug 22, 2016

Temp fix for mganss/HtmlSanitizer mganss#84

32139d7

mganss closed this as completed in 553b8dd Aug 22, 2016
