Parsing with type text/html leads to unwanted default namespace #377

Closed
sdeprez opened this issue Feb 15, 2022 · 3 comments
Labels
awaiting response Maintainers are waiting for information invalid This doesn't seem right spec:DOM Living Standard https://dom.spec.whatwg.org/ spec:HTML spec:Namespaces in XML https://www.w3.org/TR/REC-xml-names/ and https://www.w3.org/TR/xml-names11/

Comments


sdeprez commented Feb 15, 2022

Hi!

When using DOMParser with MIME type text/html, the parser adds the default namespace http://www.w3.org/1999/xhtml because of this line of code. This appears to be intended, since there is even a test for it, but in my opinion it is quite surprising and makes simple XPath queries fail. An example using the xpath library:

const doc = (new DOMParser()).parseFromString('<input value="abc">', 'text/html');
xpath.evaluate('//input', doc, null, 0, null); // No results

I know I could use local-name(), but in this case I don't control the XPath expression. So what's the rationale for adding this default namespace? Can we opt out of that behaviour?

As a current workaround, I pass an xmlns object whose setter ignores updates to the '' key. It works, but it's obviously quite brittle and hacky.

const xmlns={set [''] (_) {}}
const doc = (new DOMParser({xmlns})).parseFromString('<input value="abc">', 'text/html');
xpath.evaluate('//input', doc, null, 0, null); // 1 result
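
For readers wondering why this workaround has any effect: an object literal that defines a setter but no getter for the '' key silently discards every assignment to that key, so a default-namespace entry written to it never sticks. The sketch below shows only that language-level mechanism in plain JavaScript. It does not involve xmldom, and how xmldom reads or writes the xmlns option internally is an assumption here, not something confirmed by this thread.

```javascript
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Same shape as the workaround: a setter for the empty-string key that does nothing.
const xmlns = { set ['']( _value) {} };

// An assignment to the '' key invokes the no-op setter, so nothing is stored...
xmlns[''] = XHTML_NS;
// ...while any other key behaves like an ordinary property.
xmlns.svg = 'http://www.w3.org/2000/svg';

console.log(xmlns['']); // undefined: there is no getter and no stored value
console.log(xmlns.svg); // http://www.w3.org/2000/svg
```

This also shows why the trick is brittle: it only holds as long as the default namespace is written through a plain property assignment on that exact object.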
@karfau karfau added awaiting response Maintainers are waiting for information invalid This doesn't seem right spec:DOM Living Standard https://dom.spec.whatwg.org/ spec:HTML spec:Namespaces in XML https://www.w3.org/TR/REC-xml-names/ and https://www.w3.org/TR/xml-names11/ labels Feb 15, 2022
Member

karfau commented Feb 15, 2022

Thank you for reporting this.

I don't have a lot of experience regarding xpath, but this is what I found out:

According to https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate the code you provided should find something, if it works the same way as the browser. I created the linked stackblitz to reproduce it and am able to verify that the xpath library is not able to find anything (output 4).

https://stackblitz.com/edit/js-xmldom359-fd3yir?devtoolsheight=33&file=index.js

import xmldom from '@xmldom/xmldom';
import xpath from 'xpath';
// See https://github.com/xmldom/xmldom/issues/377
const source = `<input value="abc">`;

const doc = new DOMParser().parseFromString(source, 'text/html');
console.log(
  '1. The namespace of the input element in the browser API to parse source',
  doc.getElementsByTagName('input')[0].namespaceURI
);
const xmldomDoc = new xmldom.DOMParser().parseFromString(source, 'text/html');
console.log(
  '2. The namespace of the input element when using @xmldom/xmldom to parse source',
  xmldomDoc.getElementsByTagName('input')[0].namespaceURI
);

let browserXPathResult = doc.evaluate('//input', doc, null, 0, null);
console.log(
  '3. The xpath result when using the browser API',
  browserXPathResult.iterateNext()
);
let xpathResult = xpath.evaluate('//input', doc, null, 0, null);
console.log(
  '4. The xpath result when using the xpath lib',
  xpathResult.iterateNext()
);


I disagree that this is caused by the namespace added by xmldom (output 2), though, since the same namespace is also present when the browser parses the snippet you provided (output 1). Using the browser API to evaluate your XPath expression also works as described in the MDN docs (output 3).

And I also found the related specs, that state that this is the expected behavior:
https://html.spec.whatwg.org/#interactions-with-xpath-and-xslt

[...]
If the context node is from an HTML DOM, the default element namespace is "http://www.w3.org/1999/xhtml".
[...]
This is equivalent to adding the default element namespace feature of XPath 2.0 to XPath 1.0, and using the HTML namespace as the default element namespace for HTML documents. It is motivated by the desire to have implementations be compatible with legacy HTML content while still supporting the changes that this specification introduces to HTML regarding the namespace used for HTML elements, and by the desire to use XPath 1.0 rather than XPath 2.0.

This change is a willful violation of the XPath 1.0 specification, motivated by desire to have implementations be compatible with legacy content while still supporting the changes that this specification introduces to HTML regarding which namespace is used for HTML elements.
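
To make the quoted rule concrete, here is a minimal sketch of how an unprefixed name test such as `input` could be matched under strict XPath 1.0 and under the HTML rule. The helper `unprefixedNameTestMatches` and the plain-object stand-in for the element are hypothetical and exist only for illustration. This is not code from the xpath library, xmldom, or any browser, and it ignores other details of the HTML rule such as case handling.

```javascript
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: does an unprefixed XPath name test such as `input` match this element?
// `defaultElementNamespace` is null for strict XPath 1.0 (unprefixed names match only
// elements in no namespace) and XHTML_NS when applying the HTML rule for HTML DOMs.
function unprefixedNameTestMatches(element, name, defaultElementNamespace = null) {
  const elementNs = element.namespaceURI || null;
  return element.localName === name && elementNs === defaultElementNamespace;
}

// Stand-in for the element produced by parsing '<input value="abc">' as text/html.
const input = { localName: 'input', namespaceURI: XHTML_NS };

console.log(unprefixedNameTestMatches(input, 'input'));           // false (strict XPath 1.0)
console.log(unprefixedNameTestMatches(input, 'input', XHTML_NS)); // true (HTML rule)
```

The element is identical in both calls. Only the namespace assumed for the unprefixed name in the expression changes, which is the difference the spec text describes.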

The DOM specification also has some places related to that:

So at the moment I don't see how this problem is related to or should be solved by changing how xmldom is adding the HTML namespace in html (and xhtml) documents.

Just an idea: Is it possible that the xpath library is not searching in the default namespace?

I will of course not dig into the code of the xpath library now. But if you have more information, that shows what xmldom is doing differently to cause this issue, feel free to reopen this github issue.

PS: Please be aware that the xmldom library has been published as @xmldom/xmldom since we fixed the security issues in 0.6.0, which is why I filed goto100/xpath#111. But I don't think there is any difference between 0.6.0 and the current 0.8.1 regarding this specific issue.

@karfau karfau closed this as completed Feb 15, 2022
Author

sdeprez commented Feb 16, 2022

Thanks for your thorough answer!

I agree with your analysis and that adding the XHTML namespace is spec compliant. As per the specs you quoted, the issue boils down to the fact that the xpath library strictly follows the XPath 1.0 spec, whereas browsers follow the HTML spec, which modifies XPath 1.0 (a "willful violation", as they say) specifically to handle this case. That is what confused me.
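
For anyone landing here who does control the expression: under strict XPath 1.0 semantics, the usual way to select namespaced elements is to bind a prefix to the XHTML namespace and use it in the query, for example `//h:input`. Below is a minimal sketch of such a resolver. The prefix `h` and the name `htmlResolver` are arbitrary choices for this example. Passing the resolver as the third argument to the `evaluate` calls shown earlier in this thread is an assumption and was not tested against the xpath library.

```javascript
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Prefix-to-namespace bindings; `h` is an arbitrary prefix chosen for this sketch.
const bindings = new Map([['h', XHTML_NS]]);

// Resolver object exposing lookupNamespaceURI(prefix), the shape of a DOM XPathNSResolver.
const htmlResolver = {
  lookupNamespaceURI(prefix) {
    return bindings.has(prefix) ? bindings.get(prefix) : null;
  },
};

console.log(htmlResolver.lookupNamespaceURI('h')); // http://www.w3.org/1999/xhtml
console.log(htmlResolver.lookupNamespaceURI('x')); // null

// Intended use (not run here, it needs a parsed document):
//   evaluate('//h:input', doc, htmlResolver, 0, null);
```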

Also FYI I posted goto100/xpath#27 (comment)

Member

karfau commented Feb 21, 2022

Just for later reference/linking: somebody just helped me understand that a newer @xmldom/xmldom release breaks the "HTML mode" detection of xpath, which previously assumed that DOMImplementation.hasFeature would return false: bbyars/mountebank#660 (comment)
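
To illustrate why feature detection of this kind is fragile, here is a minimal sketch using plain objects as stand-ins for documents. The function `looksLikeHtmlDom` and the feature arguments `'HTML', '2.0'` are hypothetical and are not taken from the xpath library's source. The only fact relied on is that, in the DOM Living Standard, `hasFeature()` always returns true.

```javascript
// Hypothetical detection: treat a document as an HTML DOM when its
// DOMImplementation reports support for the "HTML" feature.
function looksLikeHtmlDom(doc) {
  return Boolean(doc.implementation && doc.implementation.hasFeature('HTML', '2.0'));
}

// Stand-in for an XML document from an implementation whose hasFeature returned false.
const legacyXmlDoc = { implementation: { hasFeature: () => false } };
// Stand-in for an XML document from an implementation that follows the DOM Living
// Standard, where hasFeature() always returns true.
const specCompliantXmlDoc = { implementation: { hasFeature: () => true } };

console.log(looksLikeHtmlDom(legacyXmlDoc));        // false
console.log(looksLikeHtmlDom(specCompliantXmlDoc)); // true, although it is an XML document
```

Once `hasFeature` always returns true, a check like this classifies every document as HTML, so it can no longer distinguish the two cases.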
