Double encoding of HTML entities in attribute values #84
Comments
Thanks for the report and the detailed analysis. This behavior is a remnant from the time CsQuery was used instead of AngleSharp. The problem is that AngleSharp does not encode < and >, although it does encode &:

var parser = new HtmlParser(new Configuration().WithCss());
// The attribute value is entity-encoded in the input markup.
var html = @"<input type=""text"" name=""my_name"" value=""&lt;insert name&gt;"" />";
var dom = parser.Parse(html);
// Serializing the parsed document leaves < and > unencoded in the attribute value.
var html2 = dom.ToHtml();
// → <html><head></head><body><input type="text" name="my_name" value="<insert name>"></body></html>

This behavior is by design. Currently, I have no idea how to fix this using the HtmlMarkupFormatter.
Ahh, OK. I didn't realise AngleSharp does not encode < and >, which is not required by the specs. However, I don't see why HtmlSanitizer should choose to encode < and > in attribute values. As you've pointed out, it is not required by the specs. It could be something that I don't understand, but what's the reason behind HtmlSanitizer encoding the < and > characters?
It can be exploited in buggy browsers, e.g. see https://html5sec.org/#59 and https://html5sec.org/#102 (although I must admit I couldn't repro these in IETester).
I'm ashamed to say that there's even a test case for this issue (
Thanks for the fix @mganss. I was also thinking along the lines of a customised HtmlMarkupFormatter. In my case, though, I prefer to encode attribute values using System.Web.Security.AntiXss.AntiXssEncoder, which I can now do using my own IMarkupFormatter.
Sanitizing the following HTML fragment
Results in the following output:
The intended text field display is <insert name>, but after sanitization the display is &lt;insert name&gt;. There seems to be double encoding of the value= attribute value. I traced the source code and the reason seems to be:

1. […] value= attribute value as <insert name>.
2. […] value= attribute value is &lt;insert name&gt;.
3. HtmlMarkupFormatter will again encode the attribute value, which results in the output value="&amp;lt;insert name&amp;gt;".

To resolve this issue, I think HtmlSanitizer shouldn't encode the greater-than and less-than characters at this line; AngleSharp will do it later, so there is no need. What do you think?
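
The double encoding described above can be reproduced in isolation. The following is only a minimal sketch, not the actual HtmlSanitizer or AngleSharp code: it assumes System.Net.WebUtility as a stand-in for the parser and for both encoding passes, and the class and method names (DoubleEncodingDemo, Parse, Encode) are hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical demo: WebUtility stands in for the parser and for both encoders.
public static class DoubleEncodingDemo
{
    // Parsing decodes the attribute value found in the markup.
    public static string Parse(string rawAttributeValue) =>
        WebUtility.HtmlDecode(rawAttributeValue);

    // Each encoding pass escapes &, < and > again.
    public static string Encode(string value) =>
        WebUtility.HtmlEncode(value);

    public static void Main()
    {
        string parsed = Parse("&lt;insert name&gt;");  // <insert name>
        string sanitized = Encode(parsed);              // &lt;insert name&gt;
        string serialized = Encode(sanitized);          // &amp;lt;insert name&amp;gt;

        Console.WriteLine(parsed);
        Console.WriteLine(sanitized);
        Console.WriteLine(serialized);

        // A browser decodes the serialized value once, so the field shows
        // the encoded text rather than the intended one.
        Console.WriteLine(Parse(serialized));           // &lt;insert name&gt;

        // Encoding exactly once round-trips to the intended display text.
        Console.WriteLine(Parse(Encode(parsed)));       // <insert name>
    }
}
```

The point of the sketch is that the decoded value should pass through exactly one encoding step on its way back to markup; which component owns that step (the sanitizer or the markup formatter) is the design question raised in this issue.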