Fix DocType closing bracket recognition in buffered reader #802

BlueGreenMagick · 2024-09-19T03:43:12Z

Fixes #533, fixes #590, fixes #801

codecov-commenter · 2024-09-19T03:52:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.17%. Comparing base (7558577) to head (3ebe221).
Report is 96 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #802      +/-   ##
==========================================
- Coverage   61.81%   60.17%   -1.65%     
==========================================
  Files          41       41              
  Lines       16798    15958     -840     
==========================================
- Hits        10384     9603     -781     
+ Misses       6414     6355      -59

Flag	Coverage Δ
unittests	`60.17% <100.00%> (-1.65%)`	⬇️

BlueGreenMagick · 2024-09-19T05:10:11Z

Modified the code so that balance_buf is computed only once.

Mingun

This PR will not fix the problem entirely, because a correct DTD can contain an unbalanced number of < and >. I think that is a rare case, though, so until a proper fix is implemented, a partial one is also acceptable. The implementation, however, can still be improved. The problem is that we do not store state between invocations of BangType::parse. The solution is to store the balance inside BangType::DocType; then, I think, it will be unnecessary to calculate it over buf (because everything in buf was previously in chunk).

I also see a problem in the outer for i in memchr::memchr_iter(b'>', chunk) loop. It will skip any < inside chunks that do not contain >, so the balance will be calculated incorrectly. I think a carefully crafted test would catch this situation.
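The failure mode described above can be reproduced with a small standalone model. This is only a sketch: `stateless_close` is a hypothetical simplification written for illustration, not the code from quick-xml, and it uses plain iterators instead of `memchr`. It assumes the leading `<!` has already been consumed, and it judges each `>` using only the bytes that precede it in the same chunk.

```rust
/// Hypothetical stateless scan: returns the index of the first `>` whose
/// preceding bytes *in this chunk* contain equally many `<` and `>`.
fn stateless_close(chunk: &[u8]) -> Option<usize> {
    for (i, &b) in chunk.iter().enumerate() {
        if b != b'>' {
            continue;
        }
        // The balance is recomputed from this chunk only; nothing is remembered
        // from earlier chunks.
        let balance: i32 = chunk[..i]
            .iter()
            .map(|&c| match c {
                b'<' => 1,
                b'>' => -1,
                _ => 0,
            })
            .sum();
        if balance == 0 {
            return Some(i);
        }
    }
    None
}

fn main() {
    // DOCTYPE body (after `<!`) delivered as one chunk: the last `>` is found.
    let whole = b"DOCTYPE a [<!ELEMENT a EMPTY>]>";
    assert_eq!(stateless_close(whole), Some(whole.len() - 1));

    // Same bytes split so that the inner `<` and its `>` land in different chunks.
    let (first, second) = whole.split_at(15);
    assert_eq!(stateless_close(first), None);
    // The `<` from `first` is forgotten, so the inner `>` is taken as the end.
    assert_eq!(stateless_close(second), Some(13));
}
```

In the split case the scan stops at the `>` of the inner declaration, which is the premature close that the review comment warns about.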

So I ask you to do the following changes:

convert BangType::DocType -> BangType::DocType(i32)
start with balance == 1 (it is < in <!DOCTYPE) here

quick-xml/src/reader/mod.rs

Line 1031 in 2f3824a

Some(b'D') | Some(b'd') => Self::DocType,
when calculating the balance, take the saved balance into account
emit the event only when the state is BangType::DocType(0), here

quick-xml/src/reader/state.rs

Line 147 in 2f3824a

BangType::DocType if uncased_starts_with(buf, b"!DOCTYPE") => {
add tests for the mentioned case (a chunk with < followed by a chunk with > that closes the < from the first chunk). It is enough to craft such a string and use an appropriate buffer size to slice the string into the desired chunks
also add a regression test from the comment in #590 (Unexpected Bang Since 0.23.1)
try to check whether this fixes #533 (Deserialization of a doctype with very long content fails). I think it should, because it seems to be a duplicate of these issues, but this needs to be checked
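A minimal sketch of the requested change, under stated assumptions: the `BangType` below is a stand-in with a single variant, and its `parse` signature is simplified for illustration (it is not the signature used in quick-xml). The balance is stored in the variant, starts at 1 for the `<` of `<!DOCTYPE`, and the close is reported only when it reaches 0.

```rust
/// Simplified stand-in for the reader's bang-element state. Only the DOCTYPE
/// variant is modelled; it carries the number of currently open `<`.
#[derive(Debug)]
enum BangType {
    DocType(i32),
}

impl BangType {
    /// Feeds one chunk. Returns the offset of the `>` that closes the DOCTYPE,
    /// or `None` if more input is needed. The balance survives between calls.
    fn parse(&mut self, chunk: &[u8]) -> Option<usize> {
        let BangType::DocType(balance) = self;
        for (i, &b) in chunk.iter().enumerate() {
            match b {
                b'<' => *balance += 1,
                b'>' => {
                    *balance -= 1;
                    if *balance == 0 {
                        return Some(i);
                    }
                }
                _ => {}
            }
        }
        None
    }
}

fn main() {
    // Balance starts at 1: the `<` of `<!DOCTYPE` was consumed before parsing.
    let mut state = BangType::DocType(1);
    // First chunk contains a `<` but no `>`.
    assert_eq!(state.parse(b"DOCTYPE a [<!ELEMENT a"), None);
    // Second chunk: the first `>` closes the inner `<`, the second closes the DOCTYPE.
    assert_eq!(state.parse(b" EMPTY>]>"), Some(8));
}
```

Because every byte of every chunk is inspected, a `<` in a chunk without any `>` is still counted, which addresses the skipped-bracket concern as well.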

BlueGreenMagick · 2024-09-19T17:36:00Z

Thanks for the detailed feedback! I adjusted the PR.

start with balance == 1 (it is < in <!DOCTYPE) here

Since the balance is calculated over the chunk up to, but excluding, the currently found >, I did not add 1 to the balance initially.

add tests for the mentioned case (a chunk with < followed by a chunk with > that closes the < from the first chunk). It is enough to craft such a string and use an appropriate buffer size to slice the string into the desired chunks

Are regression tests for issues 533, 590, 801 enough? Or are additional tests desired?

try to check whether this fixes #533. I think it should, because it seems to be a duplicate of these issues, but this needs to be checked

This PR does indeed fix the issue. I added a regression test for the case and modified the initial post so that the issue will be closed when this PR is merged.

Mingun

Are regression tests for issues 533, 590, 801 enough? Or are additional tests desired?

I'm not sure the tests cover the case where we get four chunks with roughly this content:

1. <!DOCTYPE
2. < without any >
3. >, which would close the < from point 2, not the doctype
4. >, which should close the doctype

If that case is already covered, fine, although I would prefer to have a test that explicitly guarantees such conditions (you can tweak one of the added regression tests for that).
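The four-chunk scenario can be written down as an explicit check against a simplified bracket counter. This is a sketch only: `find_doctype_close` is a hypothetical helper, not part of quick-xml, and the chunk contents are made up for illustration. Here the first chunk includes the `<` of `<!DOCTYPE`, so the counter starts at 0.

```rust
/// Scans the chunks in order and returns `(chunk_index, offset)` of the `>`
/// that brings the bracket balance back to zero.
fn find_doctype_close(chunks: &[&[u8]]) -> Option<(usize, usize)> {
    let mut balance = 0i32; // carried across chunk boundaries
    for (n, chunk) in chunks.iter().enumerate() {
        for (i, &b) in chunk.iter().enumerate() {
            match b {
                b'<' => balance += 1,
                b'>' => {
                    balance -= 1;
                    if balance == 0 {
                        return Some((n, i));
                    }
                }
                _ => {}
            }
        }
    }
    None
}

fn main() {
    let chunks: [&[u8]; 4] = [
        b"<!DOCTYPE a [",     // 1. opens the doctype
        b"<!ELEMENT a EMPTY", // 2. `<` without any `>`
        b">]",                // 3. `>` that closes the `<` from chunk 2
        b"><a/>",             // 4. `>` that closes the doctype
    ];
    // The close must be found in the fourth chunk, not the third.
    assert_eq!(find_doctype_close(&chunks), Some((3, 0)));
}
```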

And please add a changelog entry to Changelog.md so that people know the bug (#533) is fixed. I'll close the others as duplicates.

Mingun · 2024-09-19T17:58:14Z

You also can squash all your changes if you wish

BlueGreenMagick · 2024-09-20T06:03:56Z

Done!

I adjusted the regression test for issue801 so that all angle brackets are explicitly in different buffers (by lowering the buffer size to 2 bytes).
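To see how a 2-byte buffer slices the input, the chunks handed out by `std::io::BufReader` can be collected directly. This is a sketch using only the standard library; `chunks_of` is a hypothetical helper and the input string is made up for illustration, not the one from the actual regression test.

```rust
use std::io::{BufRead, BufReader};

/// Collects the chunks that a `BufReader` with the given capacity hands out
/// through `fill_buf`.
fn chunks_of(input: &[u8], capacity: usize) -> Vec<Vec<u8>> {
    let mut reader = BufReader::with_capacity(capacity, input);
    let mut chunks = Vec::new();
    loop {
        let chunk = reader
            .fill_buf()
            .expect("reading from a slice cannot fail")
            .to_vec();
        if chunk.is_empty() {
            break; // end of input
        }
        reader.consume(chunk.len());
        chunks.push(chunk);
    }
    chunks
}

fn main() {
    let input = b"<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
    let chunks = chunks_of(input, 2);

    // No chunk is longer than the buffer capacity.
    assert!(chunks.iter().all(|c| c.len() <= 2));
    // For this particular input, no chunk holds more than one angle bracket.
    assert!(chunks
        .iter()
        .all(|c| c.iter().filter(|&&b| b == b'<' || b == b'>').count() <= 1));
}
```

A 2-byte buffer does not by itself guarantee separation: two adjacent brackets that start at an even offset would share a chunk, so the test input has to be chosen with the chunk boundaries in mind.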

I think it might be better (and easier with GitHub UI) to squash merge on your end, so others can still follow the conversation in this PR in the future.

Mingun · 2024-09-20T16:42:21Z

Thanks!

Mingun requested changes Sep 19, 2024

Mingun approved these changes Sep 19, 2024

BlueGreenMagick changed the title ~~Fix DocType closing tag recognition in BufRead~~ Fix DocType closing bracket recognition in buffered reader Sep 20, 2024

Fix incorrect DocType closing bracket detection when parsing with buffered reader

3ebe221

Mingun force-pushed the fix-bufread-doctype branch from d66842a to 3ebe221 Compare September 20, 2024 16:41

Mingun merged commit 51d9e23 into tafia:master Sep 20, 2024
7 checks passed
