Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DocType closing bracket recognition in buffered reader #802

Merged
merged 1 commit into from
Sep 20, 2024

Conversation

BlueGreenMagick
Copy link
Contributor

@BlueGreenMagick BlueGreenMagick commented Sep 19, 2024

Fixes #533, fixes #590, fixes #801

@codecov-commenter
Copy link

codecov-commenter commented Sep 19, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.17%. Comparing base (7558577) to head (3ebe221).
Report is 96 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #802      +/-   ##
==========================================
- Coverage   61.81%   60.17%   -1.65%     
==========================================
  Files          41       41              
  Lines       16798    15958     -840     
==========================================
- Hits        10384     9603     -781     
+ Misses       6414     6355      -59     
Flag Coverage Δ
unittests 60.17% <100.00%> (-1.65%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@BlueGreenMagick
Copy link
Contributor Author

Modified the code so that balance_buf is computed only once.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That PR will not fix the problem entirely, because correct DTD can contain unbalanced number of < and >, but I think that is rare case, so until proper fix will be implemented, a partial one also acceptable. The implementation, however, still may be improved. The problem here that we do not store state between invocations of BangType::parse. So the solution is to store balance inside BangType::DocType, then, I think, it will be unnecessary to calculate it over buf (because everything in buf previously was in chunk).

I also see the problem in the external for i in memchr::memchr_iter(b'>', chunk) loop. It will skip <s inside chunks that does not contain >, so the balance will be calculated incorrectly. I think, if create a test carefully, it will catch such situation.

So I ask you to do the following changes:

@BlueGreenMagick
Copy link
Contributor Author

BlueGreenMagick commented Sep 19, 2024

Thanks for the detailed feedback! I adjusted the PR.

  • start with balance == 1 (it is < in <!DOCTYPE) here

As the balance is calculated in up to chunk excluding current found >, I did not add 1 to balance initially.

add tests for the mentioned case (chunk with < + chunk with > which closes < from the first chunk). It is enough just craft such a string and use appropriate buffer size to slice string to desired chunks

Are regression tests for issues 533, 590, 801 enough? Or are additional tests desired?

try to check if this fixes #533. I think it should because it seems to be a duplicate of these issues, but need to check

This PR does indeed fixes the issue. Added a regression test for the case, and modified the initial post so the issue will be closed when this PR is merged.

Copy link
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are regression tests for issues 533, 590, 801 enough? Or are additional tests desired?

I'm not sure that tests still cover the case when we get four chunks with roughly such content:

  1. <!DOCTYPE
  2. < without any >
  3. >, which would close < from point 2, not the doctype
  4. > which should close doctype

If that case already covered, fine, although I would prefer to have test which explicitly guaranties such conditions (can tweak one of added regression tests for that).

And please add changelog entry to Changelog.md that people know that bug (#533) fixed. I'll close other as duplicate.

tests/issues.rs Outdated Show resolved Hide resolved
src/reader/mod.rs Outdated Show resolved Hide resolved
src/reader/mod.rs Outdated Show resolved Hide resolved
@Mingun
Copy link
Collaborator

Mingun commented Sep 19, 2024

You also can squash all your changes if you wish

@BlueGreenMagick BlueGreenMagick changed the title Fix DocType closing tag recognition in BufRead Fix DocType closing bracket recognition in buffered reader Sep 20, 2024
@BlueGreenMagick
Copy link
Contributor Author

BlueGreenMagick commented Sep 20, 2024

Done!

I adjusted regression test for issue801 so that all angle brackets are explicitly in different buffer (by lowering buffer size to 2 bytes).

I think it might be better (and easier with GitHub UI) to squash merge on your end, so others can still follow the conversation in this PR in the future.

@Mingun Mingun force-pushed the fix-bufread-doctype branch from d66842a to 3ebe221 Compare September 20, 2024 16:41
@Mingun Mingun merged commit 51d9e23 into tafia:master Sep 20, 2024
7 checks passed
@Mingun
Copy link
Collaborator

Mingun commented Sep 20, 2024

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants