-
-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: relink_namespaces handles attributes properly #2310
base: main
Are you sure you want to change the base?
Conversation
5e8f98a
to
764a1f9
Compare
I've added a commit that introduces test coverage for the HTML5 behavior from 7b3c972. This currently fails, will make it pass shortly. |
764a1f9
to
1837c0e
Compare
Should be green now, I'm curious what you think @stevecheckoway |
1837c0e
to
cf0392e
Compare
Hey @flavorjones, first let me apologize if I've missed other PRs or issues in which you've mentioned me. I need to get my GitHub settings squared away. Every time one of my students creates a repository (so once per assignment), GitHub adds it to my watch list. I'm apparently watching 1300 repos right now! The emails are overwhelming. Namespaces in HTML documents is an area that I'm really pretty weak on. I found the HTML standard confusing and I haven't taken the time to dig in to what browsers do. HTML documents have nodes that live in one of six namespaces. HTML elements live in the HTML namespace, MathML elements live in the MathML namespace, and SVG elements live in the SVG namespace. The XLink, XML, and XMLNS namespaces are for attributes. During parsing, Here's an example showing what I think should happen. <!DOCTYPE html>
<svg viewBox="0 0 200 100" xmlns="http://www.w3.org/2000/svg">
<title><span xml:lang="en-US">title</span></title>
<text xml:lang="en-US" foo:bar="blah">This is some English text</text>
</svg>
<p xml:lang="en-US">More text.
I've bolded the parts that might be surprising. As you can see, it's quite complicated. Some foreign elements, and only in some circumstances (namely when one of these conditions is satisfied), can contain HTML elements whose attributes never have namespaces and can contain colons. And only some attributes with colons in the name in foreign elements end up with namespaces. (It's probably worth noting that this impacts XPath and XSLT.) So the question becomes what behavior should the Nokogiri API have when adding nodes whose attributes contain colons? As a point of comparison, I played around with JavaScript a little bit. If I create a 'svg' element, it puts it in the HTML namespace unless I explicitly tell it the namespace to use. If I create an attribute with the name element = document.createElementNS("http://www.w3.org/2000/svg", "svg");
element.namespaceURI; // "http://www.w3.org/2000/svg"
attr = document.createAttribute("xml:lang");
attr.localName; // "xml:lang"
attr.prefix; // null
attr.namespaceURI; // null
element.setAttributeNode(attr)
element.attributes[0].localName; // "xml:lang"
element.attributes[0].prefix; // null
element.attributes[0].namespaceURI; // null I'd be inclined to follow that behavior. If you want an element/attribute in a namespace, you gotta construct it in that namespace. |
@stevecheckoway Thanks for the context dump, and no need to apologize for notification fatigue. I'm going to take some time to read everything you wrote and make sure I understand it, because my lack of knowledge of HTML5 is showing. |
This comment was marked as off-topic.
This comment was marked as off-topic.
This corresponds to the code added in 7b3c972 Note also that this commit adds the same test for HTML4, commented out because it currently fails, that we will make pass in the next few commits.
Also fix both implementations of relink_namespaces to not skip attributes when the node itself doesn't have a namespace. Finally, the private method XML::Node#add_child_node_and_reparent_attrs is no longer needed after fixing relink_namespaces, so this commit essentially reverts 9fd03d8.
cf0392e
to
8dde633
Compare
Rebased onto currrent |
What problem is this PR intended to solve?
See #1790 for context.
Skip relinking namespaces in HTML4/5 documents
Also fix both implementations of relink_namespaces to not skip
attributes when the node itself doesn't have a namespace.
Finally, the private method
XML::Node#add_child_node_and_reparent_attrs is no longer needed after
fixing relink_namespaces, so this commit essentially reverts 9fd03d8.
Have you included adequate test coverage?
I've added tests for HTML reparenting to ensure namespaces aren't linked.
Does this change affect the behavior of either the C or the Java implementations?
Both implementations are kept in sync with each other here.