Parsing with type `text/html` leads to unwanted default namespace #377

sdeprez · 2022-02-15T12:48:58Z

Hi!

When using DOMParser with mime type text/html, the parser adds the default namespace http://www.w3.org/1999/xhtml, because of this line of code. It looks like this is expected, because there is even a test for it, but in my opinion it's quite surprising and makes simple xpath queries fail. Example using the xpath library:

const doc = (new DOMParser()).parseFromString('<input value="abc">', 'text/html');
xpath.evaluate('//input', doc, null, 0, null); // No results

I know I could use local-name but in that case, I don't control the xpath, so what's the rationale of adding this default namespace? Can we opt-out of that behaviour?

As a current workaround, I pass an xmlns object with a setter that ignores updates to the '' key. It works, but it's obviously quite brittle and hacky.

const xmlns={set [''] (_) {}}
const doc = (new DOMParser({xmlns})).parseFromString('<input value="abc">', 'text/html');
xpath.evaluate('//input', doc, null, 0, null); // 1 result
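
For readers unfamiliar with the trick, here is a minimal sketch of why the setter-only object works. It assumes, as the workaround implies, that the parser registers the default namespace by assigning to the '' key of the xmlns object. The assignment below only imitates that step and is not xmldom's actual code.

```javascript
// An object whose '' key has a setter but no getter: writes are swallowed.
const xmlns = { set ['']( _ignored) {} };

// Imitates the assumed parser step of registering the default namespace.
xmlns[''] = 'http://www.w3.org/1999/xhtml';

// Reading a setter-only property yields undefined, so nothing was stored.
const stored = xmlns[''];
console.log(stored); // undefined
```

Because this depends on how the parser happens to write to the object internally, it can stop working whenever that detail changes.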


karfau · 2022-02-15T20:55:25Z

Thank you for reporting this.

I don't have a lot of experience regarding xpath, but this is what I found out:

According to https://developer.mozilla.org/en-US/docs/Web/API/Document/evaluate the code you provided should find something, if it works the same way as the browser. I created the linked stackblitz to reproduce it and am able to verify that the xpath library is not able to find anything (output 4).

https://stackblitz.com/edit/js-xmldom359-fd3yir?devtoolsheight=33&file=index.js

import xmldom from '@xmldom/xmldom';
import xpath from 'xpath';
// See https://github.com/xmldom/xmldom/issues/377
const source = `<input value="abc">`;

const doc = new DOMParser().parseFromString(source, 'text/html');
console.log(
  '1. The namespace of the input element in the browser API to parse source',
  doc.getElementsByTagName('input')[0].namespaceURI
);
const xmldomDoc = new xmldom.DOMParser().parseFromString(source, 'text/html');
console.log(
  '2. The namespace of the input element when using @xmldom/xmldom to parse source',
  xmldomDoc.getElementsByTagName('input')[0].namespaceURI
);

let browserXPathResult = doc.evaluate('//input', doc, null, 0, null);
console.log(
  '3. The xpath result when using the browser API',
  browserXPathResult.iterateNext()
);
let xpathResult = xpath.evaluate('//input', doc, null, 0, null);
console.log(
  '4. The xpath result when using the xpath lib',
  xpathResult.iterateNext()
);

I disagree that this is caused by the namespace added by xmldom (output 2), though, since the namespace is also present when the browser parses the snippet you provided (output 1). Using the browser API to evaluate your xpath also works as described in the MDN docs (output 3).

And I also found the related specs, that state that this is the expected behavior:
https://html.spec.whatwg.org/#interactions-with-xpath-and-xslt

[...]
If the context node is from an HTML DOM, the default element namespace is "http://www.w3.org/1999/xhtml".
[...]
This is equivalent to adding the default element namespace feature of XPath 2.0 to XPath 1.0, and using the HTML namespace as the default element namespace for HTML documents. It is motivated by the desire to have implementations be compatible with legacy HTML content while still supporting the changes that this specification introduces to HTML regarding the namespace used for HTML elements, and by the desire to use XPath 1.0 rather than XPath 2.0.

This change is a willful violation of the XPath 1.0 specification, motivated by desire to have implementations be compatible with legacy content while still supporting the changes that this specification introduces to HTML regarding which namespace is used for HTML elements.
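
To make the difference concrete, here is a simplified sketch of how an unprefixed name test such as `input` is matched under each rule. The function and the plain-object "element" are hypothetical and for illustration only; they do not come from any library and ignore other details of the spec.

```javascript
const HTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical stand-in for an element node: only the two fields a name test needs.
const input = { localName: 'input', namespaceURI: HTML_NS };

// Decide whether an unprefixed name test matches an element.
// defaultElementNamespace is null under plain XPath 1.0 and HTML_NS under
// the HTML specification's rule for context nodes from an HTML DOM.
function unprefixedNameTestMatches(element, name, defaultElementNamespace) {
  return (
    element.localName === name &&
    element.namespaceURI === defaultElementNamespace
  );
}

const strictXPath1 = unprefixedNameTestMatches(input, 'input', null);
const htmlSpecRule = unprefixedNameTestMatches(input, 'input', HTML_NS);
console.log({ strictXPath1, htmlSpecRule }); // { strictXPath1: false, htmlSpecRule: true }
```

An evaluator that applies only the XPath 1.0 rule therefore finds nothing for `//input` when the element carries the XHTML namespace, while one that applies the HTML rule finds it.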

The DOM specification also has some places related to that.

So at the moment I don't see how this problem is related to or should be solved by changing how xmldom is adding the HTML namespace in html (and xhtml) documents.

Just an idea: Is it possible that the xpath library is not searching in the default namespace?

I will of course not dig into the code of the xpath library now. But if you have more information, that shows what xmldom is doing differently to cause this issue, feel free to reopen this github issue.

PS: Please be aware that the xmldom library is now published under @xmldom/xmldom since we fixed the security issues in 0.6.0, which is why I filed goto100/xpath#111. But I don't think there is any difference between 0.6.0 and the current 0.8.1 regarding this specific issue.

sdeprez · 2022-02-16T10:40:35Z

Thanks for your thorough answer!

I agree with your analysis and that adding the xhtml namespace is spec compliant. As per the specs you quoted, the issue boils down to the fact that the xpath library strictly follows the XPath 1.0 spec while browsers follow the HTML spec, which modifies XPath 1.0 (a "willful violation", as they say) specifically to handle this case. That is what got me confused.

Also FYI I posted goto100/xpath#27 (comment)

karfau · 2022-02-21T09:38:01Z

Just for later reference/linking: Somebody just helped me to understand that a recent @xmldom/xmldom release breaks the "HTML mode" detection of xpath, which was previously assuming DOMImplementation.hasFeature would return false: bbyars/mountebank#660 (comment)
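
As a rough illustration of how such a detection can break, here is a sketch with a hypothetical `probeFeature` helper and two stand-in implementation objects. None of this is the xpath library's or xmldom's actual code, and the 'HTML' / '2.0' arguments are assumptions. The one fact it relies on is that the DOM Living Standard defines hasFeature() as always returning true.

```javascript
// Hypothetical feature probe; NOT the xpath library's actual code.
function probeFeature(implementation, feature, version) {
  return implementation.hasFeature(feature, version);
}

// Stand-in for an older implementation that reports features selectively.
const legacyImplementation = {
  hasFeature: (feature) => feature.toLowerCase() === 'xml',
};

// Stand-in for an implementation following the DOM Living Standard,
// where hasFeature() always returns true.
const livingStandardImplementation = { hasFeature: () => true };

const legacy = probeFeature(legacyImplementation, 'HTML', '2.0');
const current = probeFeature(livingStandardImplementation, 'HTML', '2.0');
console.log({ legacy, current }); // { legacy: false, current: true }
```

Any caller that branches on the `false` answer takes a different path once the implementation follows the Living Standard, even though the parsed document is the same.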

karfau added the labels awaiting response (Maintainers are waiting for information), invalid (This doesn't seem right), spec:DOM (Living Standard https://dom.spec.whatwg.org/), spec:HTML, and spec:Namespaces in XML (https://www.w3.org/TR/REC-xml-names/ and https://www.w3.org/TR/xml-names11/) on Feb 15, 2022

karfau closed this as completed Feb 15, 2022

sdeprez mentioned this issue Feb 16, 2022

No matches when searching by tagName in XMLDOM's text/html DOM goto100/xpath#27

Closed

chris48s mentioned this issue Oct 11, 2024

[DynamicXml] parse doc as html if served with text/html content type badges/shields#10607

Merged
