Keep children of removed tags #75

Closed
waldfee opened this issue Jun 22, 2016 · 8 comments

Comments

@waldfee

waldfee commented Jun 22, 2016

I am trying to keep the children of tags that are removed.
What I am currently doing:

string input = "<p>test <script> test2</p>";

HtmlSanitizer sanitizer = new HtmlSanitizer(new []{"p"}, new string[0], new string[0], new string[0], new string[0]);
sanitizer.RemovingTag += (sender, args) => MoveContentOfCurrentNodeToParent(args);

string result = sanitizer.Sanitize(input);

private static void MoveContentOfCurrentNodeToParent(RemovingTagEventArgs args)
{
    foreach (INode childNode in args.Tag.ChildNodes.ToList())
    {
        args.Tag.Parent.InsertBefore(childNode, args.Tag);
    }
}

This works for most inputs, but not for inputs like the one above.
I expected to get <p>test test2</p>, but instead it is <p>test test2&lt;/p&gt;&lt;/body&gt;</p>.

It seems this is related to the way AngleSharp (and browsers) parse the unclosed <script> tag, which looks like this (note the two closing p tags):

<html>
    <head/>
    <body>
        <p>test <script> test2</p>
        </script>
    </p>
</body>
</html>

In issue #25, @ricardobrandao describes the same use case. Unfortunately, he did not add his solution to that issue, and his (unmerged) PR predates the switch to AngleSharp, so the changes he proposed do not point me in the right direction.

Am I on the completely wrong track here or am I misunderstanding something?

Using HtmlSanitizer 3.2.105

@ricardobrandao
Contributor

I've followed @mganss's suggestion and used the RemovingTag hook like this:

/// <summary>
/// Hook for RemovingTag event. This overrides the default behaviour of removing the tags
/// </summary>
/// <param name="sender"></param>
/// <param name="args"></param>
public void UnwrapTag(object sender, RemovingTagEventArgs args)
{
    args.Cancel = true;
    UnwrapTag(args.Tag);
}

/// <summary>
/// Remove a tag from the document but keep its children.
/// </summary>
/// <param name="tag">to be removed</param>
private static void UnwrapTag(IElement tag)
{
    if (tag.Children.Any())
    {
        tag.Replace(tag.ChildNodes.ToArray());
    }
    else
    {
        tag.OuterHtml = tag.InnerHtml;
    }
}

I'm using HtmlSanitizer 3.2.103.

Hope this helps

@304NotModified
Contributor

Sounds like a FAQ would be nice :)

@waldfee
Author

waldfee commented Jun 23, 2016

Thank you very much for sharing your approach. Unfortunately, for my test case this does not help. An encoded closing p tag and an encoded closing body tag are still kept in the result, which looks like this: <p>test test2&lt;/p&gt;&lt;/body&gt;</p>.

Input is <p>test <script> test2</p> and the only allowed tag is p.

Am I wrong in the assumption that this should work at all?

@mganss
Owner

mganss commented Jun 23, 2016

The encoded body tag is in the output because the input gets wrapped in <body> before sanitization (see #58 and #63).

I think the parser sees an opening script tag, then scans until it finds the corresponding closing tag, which is not there. Instead, the end of the document is encountered. All text up to this point is added as a text node to the script node and the remaining open tags (p and body) are closed.
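The parsing behavior described here can be checked directly against AngleSharp. The following is a minimal sketch, assuming AngleSharp 0.10 or later, where `HtmlParser` lives in `AngleSharp.Html.Parser` and exposes `ParseDocument` (the 0.9.x releases current at the time of this thread used a different namespace and method name). The names `ScriptParseDemo` and `GetScriptText` are hypothetical and exist only for illustration.

```csharp
using System;
using AngleSharp.Html.Parser; // assumption: AngleSharp 0.10+ namespace layout

public static class ScriptParseDemo
{
    // Returns the text that the parser attached to the first <script> element,
    // or an empty string if the document contains no script element.
    public static string GetScriptText(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);
        var script = document.QuerySelector("script");
        return script?.TextContent ?? string.Empty;
    }

    public static void Main()
    {
        // Same fragment as in the report, wrapped in <body> as described above.
        string text = GetScriptText("<body><p>test <script> test2</p></body>");
        Console.WriteLine($"script text: [{text}]");
    }
}
```

If the explanation above holds, the printed text includes the literal closing `p` and `body` tags, which is why they show up HTML-encoded once the script's children are moved into the paragraph.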

Fact is, you've got a broken fragment as input. Except for the encoded body tag, the output is as expected.

I'll try to fix the body issue (not until next week, though). As a workaround you might want to experiment with the SanitizeDocument method which doesn't add the body tag around the input.
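A minimal sketch of the suggested workaround, for anyone who wants to experiment with it. It assumes the `Ganss.XSS` namespace used by the package versions discussed in this thread (newer releases use `Ganss.Xss`). It also adds `html`, `head` and `body` to the allowed tags, on the assumption that `SanitizeDocument` applies the tag whitelist to the whole document; if that assumption is wrong, adding them should be harmless. The names `SanitizeDocumentDemo` and `Run` are hypothetical.

```csharp
using System;
using Ganss.XSS; // assumption: namespace of the 3.x/4.x packages; newer releases use Ganss.Xss

public static class SanitizeDocumentDemo
{
    public static string Run(string html)
    {
        var sanitizer = new HtmlSanitizer();

        // Allow only <p>, plus the document skeleton (see the assumption above).
        sanitizer.AllowedTags.Clear();
        foreach (var tag in new[] { "html", "head", "body", "p" })
        {
            sanitizer.AllowedTags.Add(tag);
        }

        // SanitizeDocument parses the input as a full document and does not
        // wrap it in an extra <body> element first.
        return sanitizer.SanitizeDocument(html);
    }

    public static void Main()
    {
        Console.WriteLine(Run("<p>test <script> test2</p>"));
    }
}
```

Note that the result is a complete document rather than a fragment, so callers that need only the paragraph would still have to extract it from the output.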

@mganss
Owner

mganss commented Jun 28, 2016

Have to wait for AngleSharp/AngleSharp#358 (AngleSharp 0.9.7).

@waldfee
Author

waldfee commented Jun 29, 2016

Thank you very much.

@mganss mganss closed this as completed in f81cb1d Jul 19, 2016
mganss added a commit that referenced this issue Oct 10, 2016
Remove beta label from version
@deap82

deap82 commented May 20, 2022

@ricardobrandao @mganss I tried to use your approach, but writing to .OuterHtml throws an exception:

AngleSharp.Dom.DomException: The operation is not supported.

I'm using HtmlSanitizer 4.0.187.

When debugging, the line that writes to .OuterHtml seems to pass just fine. tag.InnerHtml varies between an empty string and simple text content, and I never hit an exception in the debugger. In the end, however, the above exception is shown on the developer exception page of my .NET MVC app where the code is running.

I also tried @waldfee's approach of looping over the child nodes and moving them, but then the HTML of those child elements isn't sanitized, so any disallowed tags they contain remain in the result of Sanitize... :-/

Any idea how to go about this now?

@mganss
Owner

mganss commented May 20, 2022

@deap82 There is now a property called KeepChildNodes that you can set to true for this use case.
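A minimal sketch of how that property might be used for the scenario in this issue. It assumes a package version that already includes `KeepChildNodes` and the `Ganss.XSS` namespace (newer releases use `Ganss.Xss`). The names `KeepChildNodesDemo` and `Run` are hypothetical.

```csharp
using System;
using Ganss.XSS; // assumption: newer releases use the Ganss.Xss namespace instead

public static class KeepChildNodesDemo
{
    public static string Run(string html)
    {
        var sanitizer = new HtmlSanitizer();

        // Allow only <p>, as in the original report.
        sanitizer.AllowedTags.Clear();
        sanitizer.AllowedTags.Add("p");

        // Keep the children of removed elements instead of dropping them,
        // so no RemovingTag handler is needed.
        sanitizer.KeepChildNodes = true;

        return sanitizer.Sanitize(html);
    }

    public static void Main()
    {
        // <b> is not allowed, so it is removed while its text should be kept.
        Console.WriteLine(Run("<p>test <b>bold</b> text</p>"));
    }
}
```

Since the sanitizer performs the unwrapping itself, this avoids both the `OuterHtml` assignment and the manual moving of child nodes discussed earlier in the thread.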


5 participants