HTML API: Handle content after BODY, HTML where possible #7312

sirreal · 2024-09-09T09:57:44Z

Improve support for content appearing after BODY, HTML nodes where
possible. Some HTML may be representable by the HTML API without
introducing ordering problems.

When HTML moves to the "after-{body,html,…}" insertion modes, BODY and
HTML remain on the stack of open elements. Certain content is inserted
in place (comments), some content is inserted in BODY (white space
text), and most other content will switch back to the "in body"
insertion mode.
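
As a rough illustration of the dispatch described above, here is a minimal sketch that names what happens to a token arriving in the "after body" mode. The function name and the token-type labels are hypothetical and are not part of the HTML API; the outcomes follow the "after body" rules in the HTML specification, simplified to classify a text token as a whole.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): names what happens to a
 * token that arrives while the parser is in the "after body" mode.
 *
 * Simplification: a text token is classified as a whole, whereas the
 * specification processes characters one at a time.
 */
function example_after_body_action( string $token_type, string $text = '' ): string {
	switch ( $token_type ) {
		case '#comment':
			// Inserted in place, as the last child of the HTML element.
			return 'append to HTML';

		case '#text':
			// strspn() counts the leading run of HTML whitespace bytes.
			$only_whitespace = strspn( $text, " \t\n\f\r" ) === strlen( $text );
			return $only_whitespace ? 'append to BODY' : 'reprocess in body';

		case '#doctype':
			// A DOCTYPE here is a parse error and is ignored.
			return 'ignore';

		case 'html-closer':
			// The HTML closing tag moves on to the "after after body" mode.
			return 'switch to after after body';

		default:
			// Most other content returns to the "in body" mode.
			return 'reprocess in body';
	}
}

echo example_after_body_action( '#comment', ' comment ' ), "\n"; // append to HTML
echo example_after_body_action( '#text', "\n  " ), "\n";          // append to BODY
echo example_after_body_action( '#text', 'FOO' ), "\n";           // reprocess in body
```

The "reprocess in body" outcome is the one that matters for ordering, because it sends later content back into BODY while BODY is still on the stack of open elements.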

The difficulty with the HTML API is that the result in the tree may be
out-of-order compared to the positioning of the corresponding tokens in
the HTML input text. For example:

</body>FOO<!-- comment -->

This produces a tree whose node order matches the order of the input,
even though a BODY closer has been reached:

[DOCUMENT]
    └HTML
     ├HEAD
     ├BODY
     │ └#text: FOO
     └#comment: comment

Compare with this case:

</body><!-- comment -->FOO

The tree is identical, but the FOO text is out of order in the tree
compared to the HTML string. This case continues to be unsupported
because by the time the FOO text is reached, the parser has already
passed a BODY tag closer and content outside of the body:

[DOCUMENT]
    └HTML
     ├HEAD
     ├BODY
     │ └#text: FOO
     └#comment: comment

This approach deviates from the explicit steps in the spec and will pop
the BODY and HTML tags from the stack of open elements when content
should appear outside of them. Flags are set for each "after-" mode to
prevent returning to BODY and then continuing to produce content
outside, which would produce the unsupported out-of-order behavior.
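
The ordering rule can be pictured with a minimal sketch. This is a simplified model based only on the two example trees above, not the code in this PR; the function name and the flag name are hypothetical. It takes the parent each node lands under, listed in source order, and reports whether visiting nodes in source order also visits them in tree order.

```php
<?php
/**
 * Hypothetical check (not the PR's code): given the parent each node lands
 * under, listed in source order, report whether tree order matches.
 *
 * @param string[] $parents 'BODY' for nodes placed inside BODY, anything
 *                          else (e.g. 'HTML') for nodes placed after it.
 */
function example_is_in_source_order( array $parents ): bool {
	$left_body = false; // Flag: content has already been produced outside BODY.

	foreach ( $parents as $parent ) {
		if ( 'BODY' !== $parent ) {
			$left_body = true;
			continue;
		}

		if ( $left_body ) {
			// Returning to BODY now would place this node before nodes already visited.
			return false;
		}
	}

	return true;
}

// </body>FOO<!-- comment -->: the text lands in BODY, then the comment in HTML.
var_dump( example_is_in_source_order( array( 'BODY', 'HTML' ) ) ); // bool(true)

// </body><!-- comment -->FOO: the comment lands in HTML, then the text in BODY.
var_dump( example_is_in_source_order( array( 'HTML', 'BODY' ) ) ); // bool(false)
```

A single boolean is enough in this model because the only unsupported transition is from outside BODY back into it.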

See https://html.spec.whatwg.org/#parsing-main-afterbody.

Trac ticket: Core-61576

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.


github-actions · 2024-09-09T10:03:18Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell, dmsnell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions · 2024-09-09T10:09:50Z

Test using WordPress Playground

The changes in this pull request can be previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

sirreal · 2024-09-09T10:47:14Z

This work was originally included in #7165. It was reverted in the PR to land a simpler initial version and is now proposed on its own as an enhancement.

dmsnell · 2024-09-09T20:00:50Z

I'm wondering if here we could lean on some of the work I've explored for XML well-formedness error-recovery.

Imagine we start with "proper HTML parsing in tree-order node traversal" but you could opt-in to certain errors.

$processor->parser_tolerance( WP_HTML_Processor::VISIT_NODES_OUT_OF_ORDER, true );
$processor->parser_tolerance( WP_HTML_Processor::OVERLOOK_FOSTERED_NODES, true );

I'm concerned with patches like this because I think there's a real semantic challenge when thinking about operations like set_inner_html(). For a chunk of HTML where the content after BODY interleaves IN BODY and AFTER BODY content, can we provide a rational answer to the question: how should the document change when setting inner HTML of the BODY?

sirreal mentioned this pull request Sep 9, 2024

HTML API: Handle after body/html content #7297

Closed

sirreal marked this pull request as ready for review September 9, 2024 10:03

dmsnell mentioned this pull request Sep 11, 2024

HTML API: Plans for 6.8 WordPress/gutenberg#63037

Open

12 tasks
