[Regression] Eagerly fetch/parse the entire /Pages-tree in corrupt documents (issue 14303, PR 14311 follow-up) #14335

Snuffleupagus · 2021-12-02T12:37:07Z

Please note: This is similar to the method that existed prior to PR #3848, but the new method will only be used as a fallback when parsing corrupt PDF documents.

The implementation in PR #14311 unfortunately turned out to be way too simplistic, as evidenced by the recently added test-files in issue #14303, since it may cause infinite loops in PDFDocument.checkLastPage for some corrupt PDF documents.[1]
To avoid this, the easiest solution that I could come up with was to fall back to eagerly parsing the entire /Pages-tree when the /Count-entry validation fails during document initialization.

Fixes at least two of the issues listed in issue #14303, namely the poppler-395-0.pdf... and GHOSTSCRIPT-698804-1.pdf... documents.

[1] The whole point of PR #14311 was obviously to get rid of infinite loops during document initialization, not to introduce any more of those.
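
To illustrate the fallback described above, here is a rough sketch of the idea and not the actual patch. It models the /Pages-tree with plain objects rather than the real `Dict`/`Ref` primitives, and every function name in it (`collectAllPages`, `findPageLazily`, `loadPages`) is hypothetical. The lazy lookup trusts each node's /Count; if the last page cannot be reached that way, the whole tree is walked eagerly, with a visited set so that circular /Kids references cannot loop forever.

```javascript
// Simplified stand-in for a /Pages-tree: plain objects, NOT real pdf.js
// `Dict`/`Ref` instances. All function names below are hypothetical.

// Eagerly walk the whole tree, collecting leaf pages in document order.
function collectAllPages(root) {
  const pages = [];
  const visited = new Set(); // guards against circular /Kids references
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node === null || typeof node !== "object" || visited.has(node)) {
      continue; // skip malformed entries and nodes that were already seen
    }
    visited.add(node);
    if (Array.isArray(node.Kids)) {
      // Push in reverse so that the first kid is processed first.
      for (let i = node.Kids.length - 1; i >= 0; i--) {
        stack.push(node.Kids[i]);
      }
    } else if (node.Type === "Page") {
      pages.push(node);
    }
  }
  return pages;
}

// Lazy lookup that trusts each intermediate node's /Count to skip subtrees.
// Returns the page at `pageIndex`, or null if it cannot be reached.
function findPageLazily(root, pageIndex) {
  const visited = new Set();
  let node = root;
  let remaining = pageIndex;
  while (node !== null && typeof node === "object" && !visited.has(node)) {
    visited.add(node);
    if (!Array.isArray(node.Kids)) {
      return node.Type === "Page" && remaining === 0 ? node : null;
    }
    let next = null;
    for (const kid of node.Kids) {
      if (kid === null || typeof kid !== "object") {
        continue;
      }
      const size = Array.isArray(kid.Kids) ? kid.Count : 1;
      if (!Number.isInteger(size) || size < 0) {
        return null; // unusable /Count, give up on the lazy path
      }
      if (remaining < size) {
        next = kid;
        break;
      }
      remaining -= size;
    }
    node = next;
  }
  return null;
}

// Validate the declared /Count by fetching the last page; on failure,
// fall back to the eager walk and use the number of pages actually found.
function loadPages(root) {
  const declared = root.Count;
  if (
    Number.isInteger(declared) &&
    declared > 0 &&
    findPageLazily(root, declared - 1) !== null
  ) {
    return { numPages: declared, eager: false };
  }
  const pages = collectAllPages(root);
  return { numPages: pages.length, eager: true, pages };
}
```

Note that this sketch only checks that the last declared page is reachable, so a /Count that is too small would still pass validation here.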

Snuffleupagus · 2021-12-02T12:49:17Z

/botio test

pdfjsbot · 2021-12-02T12:49:18Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/f9c3702cc3ab21e/output.txt

pdfjsbot · 2021-12-02T12:49:18Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/d68cdda6296eae1/output.txt

pdfjsbot · 2021-12-02T13:10:51Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/f9c3702cc3ab21e/output.txt

Total script time: 21.54 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: FAILED
Regression tests: FAILED

  different ref/snapshot: 7
  different first/second rendering: 2

Image differences available at: http://54.241.84.105:8877/f9c3702cc3ab21e/reftest-analyzer.html#web=eq.log

pdfjsbot · 2021-12-02T13:31:33Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/d68cdda6296eae1/output.txt

Total script time: 42.24 mins

Font tests: Passed
Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 10
  different first/second rendering: 1

Image differences available at: http://54.193.163.58:8877/d68cdda6296eae1/reftest-analyzer.html#web=eq.log

timvandermeij · 2021-12-02T18:54:38Z

Nice work; thanks!

Snuffleupagus added core regression corrupted-pdf labels Dec 2, 2021

Snuffleupagus force-pushed the Catalog-getAllPageDicts branch from 4993502 to 1fac637 Compare December 2, 2021 13:31

timvandermeij approved these changes Dec 2, 2021

timvandermeij merged commit 4c145fc into mozilla:master Dec 2, 2021

Snuffleupagus deleted the Catalog-getAllPageDicts branch December 2, 2021 19:40

Snuffleupagus mentioned this pull request Dec 2, 2021

Crashes and timeouts on bug tracker corpus files #14303

Closed

Snuffleupagus mentioned this pull request Dec 31, 2021

Convert Catalog.getAllPageDicts to an async method #14411

Merged

Snuffleupagus mentioned this pull request Feb 16, 2022

Get "unnecessary" range on first page #14570

Closed
