[bug] extra </span> inserted after Nokogiri::HTML.parse #2796
Comments
@seanstory Sorry you're having a problem. I'll try to explain what's going on here.
In summary, the HTML you're parsing is not well-formed, so parsers will try to "fix it up". Notably, HTML4 has no specification for how that "fixing up" should be done, so parsers may each do something different. HTML5 does have a "fix up" spec, so if you want to match modern browser behavior you should use Nokogiri::HTML5.
Here's the start of the markup:
<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput"><h1 class="title"><a href ="PFS.aspx"><span style="float:left;"><img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png"></a></span>Alchemist</h1>...
Let me format that better so you can see the structure more clearly:
<html>
<body>
<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
<h1 class="title">
<a href="PFS.aspx">
<span style="float:left;">
<img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
</a>
</span>
Alchemist
</h1>
</body>
</html>
You should be able to see pretty clearly that the opening and closing tags are mismatched. When the parser sees the closing </a> while the inner <span> is still open, it has to repair the nesting.
Here is some working code that demonstrates what's happening:
#! /usr/bin/env ruby
require "bundler/inline"
gemfile do
source "https://rubygems.org"
gem "nokogiri"
end
html = <<~HTML
<html>
<body>
<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
<h1 class="title">
<a href="PFS.aspx">
<span style="float:left;">
<img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
</a>
</span>
Alchemist
</h1>
</body>
</html>
HTML
doc = Nokogiri::HTML4::Document.parse(html)
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
# "<html>\n" +
# " <body>\n" +
# " <span id=\"ctl00_RadDrawer1_Content_MainContent_DetailedOutput\">\n" +
# " <h1 class=\"title\">\n" +
# " <a href=\"PFS.aspx\">\n" +
# " <span style=\"float:left;\">\n" +
# " <img alt=\"PFS Standard\" title=\"PFS Standard\" style=\"height:25px; padding:2px 10px 0px 2px\" src=\"ImagesIconsPFS_Standard.png\">\n" +
# " </span></a>\n" +
# " </h1></span>\n" +
# " Alchemist\n" +
# " \n" +
# " </body>\n" +
# "</html>\n"
doc.errors
# => [#<Nokogiri::XML::SyntaxError: 11:12: ERROR: Unexpected end tag : h1>]
So the final, corrected markup will look like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
<h1 class="title">
<a href="PFS.aspx">
<span style="float:left;">
<img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="ImagesIconsPFS_Standard.png">
</span></a>
</h1></span>
Alchemist
</body>
</html>
But note that libgumbo (Nokogiri::HTML5 on CRuby) corrects this differently, and possibly in the same way your browser fixes it up.
Here is more code demonstrating the HTML5 behavior:
#! /usr/bin/env ruby
require "bundler/inline"
gemfile do
source "https://rubygems.org"
gem "nokogiri"
end
html = <<~HTML
<html>
<body>
<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
<h1 class="title">
<a href="PFS.aspx">
<span style="float:left;">
<img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="Images\Icons\PFS_Standard.png">
</a>
</span>
Alchemist
</h1>
</body>
</html>
HTML
doc = Nokogiri::HTML5::Document.parse(html, max_errors: 10)
doc.to_html
# => "<html><head></head><body>\n" +
# " <span id=\"ctl00_RadDrawer1_Content_MainContent_DetailedOutput\">\n" +
# " <h1 class=\"title\">\n" +
# " <a href=\"PFS.aspx\">\n" +
# " <span style=\"float:left;\">\n" +
# " <img alt=\"PFS Standard\" title=\"PFS Standard\" style=\"height:25px; padding:2px 10px 0px 2px\" src=\"ImagesIconsPFS_Standard.png\">\n" +
# " </span></a>\n" +
# " \n" +
# " Alchemist\n" +
# " </h1>\n" +
# " \n" +
# "\n" +
# "</span></body></html>"
doc.errors
# => [#<Nokogiri::XML::SyntaxError:"1:1: ERROR: Expected a doctype token\n<html>\n^">,
# #<Nokogiri::XML::SyntaxError:"8:11: ERROR: That tag isn't allowed here Currently open tags: html, body, span, h1, a, span.\n </a>\n ^">,
# #<Nokogiri::XML::SyntaxError:"9:9: ERROR: That tag isn't allowed here Currently open tags: html, body, span, h1.\n </span>\n ^">,
# #<Nokogiri::XML::SyntaxError:"12:3: ERROR: That tag isn't allowed here Currently open tags: html, body, span.\n </body>\n ^">]
And the parsed HTML5 DOM looks like:
<html><head></head><body>
<span id="ctl00_RadDrawer1_Content_MainContent_DetailedOutput">
<h1 class="title">
<a href="PFS.aspx">
<span style="float:left;">
<img alt="PFS Standard" title="PFS Standard" style="height:25px; padding:2px 10px 0px 2px" src="ImagesIconsPFS_Standard.png">
</span></a>
Alchemist
</h1>
</span></body></html>
I hope all this makes sense! What questions do you have for me?
@flavorjones thanks for responding so fast! This explanation makes sense, thank you so much for the help. I'm bummed that this solution isn't available for JRuby, but I see there's an open issue for that, so maybe one day. 🤞
Please describe the bug
I'm attempting to parse HTML content from a site I do not control. Specifically, https://2e.aonprd.com.
I'm taking the raw HTML and attempting to extract text content, excluding common header and footer text, to use for a search use case. My plan was to do something like:
However, I've noticed that some pages are getting "" content.
When I go to browser dev tools and access the text by selector with
$$("#ctl00_RadDrawer1_Content_MainContent_DetailedOutput").map(e=>e.textContent)
I get the text I'm expecting, so it's not an issue of having gotten the selector wrong.
When I step through with irb, I can see that an extra </span> is being inserted right before the text content, so that the result of Nokogiri::HTML.parse closes my identified element early. I'll attach the raw HTML for an example page, https://2e.aonprd.com/(X(1)S(jjv5qg45qaziuq55lopb3o45))/Classes.aspx?ID=1
html.zip
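One rough way to flag pages like this before parsing is to scan the tags with a small stack and report closing tags that do not match the innermost open element. The sketch below is an illustration only: `nesting_problems` and `VOID_ELEMENTS` are names invented here, the scan is regex-based rather than a real HTML parser, and it ignores comments, script bodies, and elements whose end tags are optional.

```ruby
# Void elements never take a closing tag, so they are never pushed on the stack.
VOID_ELEMENTS = %w[area base br col embed hr img input link meta source track wbr].freeze

# Hypothetical helper: returns a list of nesting problems found in +html+.
# This is a crude regex scan, not a spec-compliant tokenizer.
def nesting_problems(html)
  stack = []
  problems = []
  html.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |slash, name|
    name = name.downcase
    next if VOID_ELEMENTS.include?(name)

    if slash.empty?
      stack.push(name)                      # opening tag
    elsif stack.last == name
      stack.pop                             # properly nested closing tag
    else
      open_name = stack.last || "nothing"
      problems << "</#{name}> closes while <#{open_name}> is open"
      # Rough recovery: discard everything from the nearest matching opener up.
      index = stack.rindex(name)
      stack.slice!(index..-1) if index
    end
  end
  stack.each { |name| problems << "<#{name}> never closed" }
  problems
end

sample = '<span id="outer"><h1><a href="PFS.aspx"><span><img src="x.png"></a></span>Alchemist</h1></span>'
puts nesting_problems(sample)
```

A page that returns a non-empty list is one where different parsers may repair the markup in different ways, so it is a candidate for closer inspection.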
Help us reproduce what you're seeing
Expected behavior
Nokogiri shouldn't add extra closing tags
Environment
OSX 13.2.1
Platform: arm64-darwin
reproduced in: