Keep children of removed tags #75
I've followed @mganss's suggestion and used the following handler:
    using System.Linq; // needed for Any() and ToArray()

    /// <summary>
    /// Hook for the RemovingTag event. This overrides the default behaviour of removing the tags.
    /// </summary>
    /// <param name="sender"></param>
    /// <param name="args"></param>
    public void UnwrapTag(object sender, RemovingTagEventArgs args)
    {
        // Cancel the default removal so the children are not discarded with the tag.
        args.Cancel = true;
        UnwrapTag(args.Tag);
    }

    /// <summary>
    /// Remove a tag from the document but keep its children.
    /// </summary>
    /// <param name="tag">to be removed</param>
    private static void UnwrapTag(IElement tag)
    {
        if (tag.Children.Any())
        {
            // Element children exist: replace the tag with all of its child nodes.
            tag.Replace(tag.ChildNodes.ToArray());
        }
        else
        {
            // No element children: replace the tag with its inner markup.
            tag.OuterHtml = tag.InnerHtml;
        }
    }
I'm using HtmlSanitizer 3.2.103. Hope this helps.
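For readers who want to try the handler above, here is a minimal sketch of how it might be wired up. It assumes the `Ganss.XSS` namespace and the `RemovingTag` event as they existed in the 3.x releases mentioned in this thread; the class name `UnwrapExample` and the method `SanitizeKeepingChildren` are hypothetical names introduced only for illustration. Later comments in this thread report problems with this approach on 4.0.187, so treat it as version-dependent.

```csharp
using System.Linq;
using Ganss.XSS; // assumption: namespace used by HtmlSanitizer 3.x

public static class UnwrapExample // hypothetical helper class
{
    public static string SanitizeKeepingChildren(string html)
    {
        var sanitizer = new HtmlSanitizer();

        // Same logic as the handler in the comment above, written inline.
        sanitizer.RemovingTag += (sender, args) =>
        {
            // Stop the sanitizer from removing the tag together with its content.
            args.Cancel = true;

            var tag = args.Tag;
            if (tag.Children.Any())
            {
                // Replace the tag with all of its child nodes.
                tag.Replace(tag.ChildNodes.ToArray());
            }
            else
            {
                // Only text (or nothing) inside: replace the tag with its inner markup.
                tag.OuterHtml = tag.InnerHtml;
            }
        };

        return sanitizer.Sanitize(html);
    }
}
```

Calling `UnwrapExample.SanitizeKeepingChildren(...)` on a fragment that contains a disallowed element should drop the element itself while leaving its children in place, at least for well-formed input.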
Sounds like a FAQ would be nice :)
Thank you very much for sharing your approach. Unfortunately, for my test case this does not help. An encoded closing p tag and an encoded closing body tag are still kept in the result, which looks like this:
Input is
Am I wrong in the assumption that this should work at all?
The encoded body tag is in the output because the input gets wrapped in
I think the parser sees an opening script tag, then scans until it finds the corresponding closing tag, which is not there. Instead, the end of the document is encountered. All text up to this point is added as a text node to the script node, and the remaining open tags (p and body) are closed.
The fact is, you've got a broken fragment as input. Except for the encoded body tag, the output is as expected. I'll try to fix the body issue (not until next week, though). As a workaround you might want to experiment with the
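The scanning behaviour described in this comment can be illustrated with a toy model. This is not AngleSharp's implementation, only a rough sketch of the idea that an unclosed script element swallows everything up to the end of the input; the class `RawTextModel` and the method `SwallowedByScript` are hypothetical names.

```csharp
using System;

public static class RawTextModel // hypothetical toy model, not AngleSharp code
{
    // Returns the text that would end up inside the script element,
    // or null if the input contains no opening <script> tag.
    public static string SwallowedByScript(string html)
    {
        const string open = "<script>";
        const string close = "</script>";

        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0)
        {
            return null;
        }
        start += open.Length;

        int end = html.IndexOf(close, start, StringComparison.OrdinalIgnoreCase);

        // No closing tag found: everything up to the end of the input becomes
        // script text, including markup that was meant to close outer elements.
        return end < 0 ? html.Substring(start) : html.Substring(start, end - start);
    }
}
```

For a made-up fragment such as `<p>a<script>b</p>`, the model returns `b</p>`: the closing p tag is treated as script text rather than markup, which matches the explanation of why it later shows up encoded in the sanitized output.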
Have to wait for AngleSharp/AngleSharp#358 (AngleSharp 0.9.7).
Thank you very much.
@ricardobrandao @mganss I tried to use your approach, but the writing to
I'm using HtmlSanitizer 4.0.187. When debugging the line writing to
I also tried the approach of @waldfee: looping over the child tags and moving them. But then the HTML of those child elements isn't sanitized, so if they contain disallowed tags, those remain in the result from Sanitize... :-/
Any idea how to go about this now?
@deap82 There is now a property called
I am trying to keep the children of tags that are removed.
What I am currently doing:
This works for most inputs, but not for inputs like the one above.
I expected to get <p>test test2</p>, but instead it is <p>test test2</p></body></p>.
It seems this is related to the way AngleSharp (and browsers) parse the unclosed <script> tag, which looks like this (note the two closing p tags):
In issue #25, @ricardobrandao talks about the same use case. Unfortunately, he has not added his solution to the issue, and his (unmerged) PR predates the switch to AngleSharp, so the changes he proposed do not put me on the right path.
Am I on the completely wrong track here or am I misunderstanding something?
Using HtmlSanitizer 3.2.105