Send fetch requests for all page dict lookups in parallel #18627

richard-smith-preservica · 2024-08-19T14:30:44Z

When adding page dict candidates to the lookup tree, also initiate fetching them from xref, so if they are not yet loaded at all, the XHR will be sent
We can then await on the cached Promise without making the requests pipeline
This has a significant performance improvement for load-on-demand (i.e. with auto-fetch turned off) when a PDF has a large number of pages in the top level /Pages collection, and those pages are spread through a file, so every candidate needs to be fetched separately
PDFs with many pages where each page is a big image and all the pages are at the top level are quite a common output for digitisation programmes
I would have liked to do something like "if it's the top level collection and page count = number of kids, then just fetch that page without traversing the tree" but unfortunately I agree with comments on Error when I try to view a pdf ( uncaught exception: Page index 0 not found. ) #8088 that there is no good general solution to allow for /Pages nodes with empty /Kids arrays
The other alternative for fixing this use case is to simply not validate the last page at all, so pages can be loaded on demand. But that validation was added for good reasons, and this would also result in a bad experience if you didn't read the document from the front.

richard-smith-preservica · 2024-08-19T17:26:37Z

I also have a further commit that makes these PDFs render really fast but by using some "looks likely" measures of whether it's a PDF with everything in the top level, I will try and discuss that and see if you want it too.

Snuffleupagus · 2024-08-19T17:31:06Z

Thanks for the patch; based on an initial look I've got some overall questions/comments here.

When adding page dict candidates to the lookup tree, also initiate fetching them from xref, so if they are not yet loaded at all, the XHR will be sent

This patch will thus trigger data-loading eagerly, rather than lazily as currently done, which seems like it might lead to unnecessary data-loading in some cases. Hence my immediate worry is how this will affect things, and thus if this is actually a good idea in the general case!?

This has a significant performance improvement for load-on-demand (i.e. with auto-fetch turned off) when a PDF has a large number of pages in the top level /Pages collection, and those pages are spread through a file, so every candidate needs to be fetched separately

The more important question is how it'll affect performance of the normal use-case, i.e. where the entire PDF document is loaded, since that's what e.g. the Firefox PDF Viewer does and this use-case is thus more important to us?

PDFs with many pages where each page is a big image and all the pages are at the top level are quite a common output for digitisation programmes

Given that no such PDF document was provided here, testing this becomes somewhat difficult.

richard-smith-preservica · 2024-08-20T08:41:59Z

This patch will thus trigger data-loading eagerly, rather than lazily as currently done, which seems like it might lead to unnecessary data-loading in some cases

It looks like the last page is always requested when loading a document (since this change d0c4bbd) which means we're always traversing almost the entire /Pages tree serially at the moment. (Indeed, that was the symptom I was trying to fix with this proposal.)

If the data isn't already loaded then this will cause all those /Pages and /Page nodes to be requested immediately via XHR instead, yes. My understanding is that when using streaming/auto-fetch, the data is already loaded by the time loadDocument is called and xref.fetchAsync won't cause another XHR, but maybe I've missed something.

Given that no such PDF document was provided here, testing this becomes somewhat difficult.

The samples I have are 50MB+ (and customer data, but they're from a public portal, I can probably get permission to share), do you want attachments that large?

richard-smith-preservica · 2024-08-20T09:53:21Z

This is the other alternative
https://github.com/mozilla/pdf.js/compare/master...richard-smith-preservica:pdf.js:rcs/assume-all-pages-in-top-level-when-likely?expand=1
But this is potentially opening up incorrect reading of weird PDFs (that had exactly 27 entries in /Pages.Kids and a /Count of 27 but they aren't actually 1:1) which you might be worried about.

Snuffleupagus · 2024-08-20T11:07:59Z

It looks like the last page is always requested when loading a document (since this change d0c4bbd) which means we're always traversing almost the entire /Pages tree serially at the moment.

Yes, but since the patch wants to change the timing of this data-loading it's consequently important that we consider this properly to avoid any possible surprises later. (Especially since the patch seems targeted at a special case of "bad" PDF documents, i.e. those with large single-level /Pages trees.)

My understanding is that when using streaming/auto-fetch, the data is already loaded by the time loadDocument is called and xref.fetchAsync won't cause another XHR, but maybe I've missed something.

I don't believe that there's necessarily any such guarantees in general, it might not be the case with e.g. linearized PDFs.

This is the other alternative
https://github.com/mozilla/pdf.js/compare/master...richard-smith-preservica:pdf.js:rcs/assume-all-pages-in-top-level-when-likely?expand=1
But this is potentially opening up incorrect reading of weird PDFs (that had exactly 27 entries in /Pages.Kids and a /Count of 27 but they aren't actually 1:1) which you might be worried about.

Sorry, but I don't think that's really an option here:

This doesn't validate the data properly, as you already mentioned. This alone is enough (as far as I'm concerned) for that solution to be undesirable.
It adds more "complexity" to a part of the code-base that's crucial to the correct function of the library.
Finally, it introduces unconditional asynchronicity for the "is page"-check.

Currently it appears that the patch doesn't work though, I'll leave some inline comments below. (Please remember to squash commits when updating the patch.)

src/core/catalog.js

richard-smith-preservica · 2024-08-20T11:51:48Z

This doesn't validate the data properly, as you already mentioned. This alone is enough (as far as I'm concerned) for that solution to be undesirable.

That's fair enough which is why I split the two suggestions.

There is definitely a valuable use case here ("I want to quickly render something for a large PDF with all the /Pages at the top level"). If there's a better way to allow for that and not try to load the last page before rendering anything when disableAutoFetch is specified (for example using the condition in that suggestion as a validation rather than calling getPageDict at all in checkLastPage in that scenario) that would be fine too.

I joined the Matrix room too which might be better for that kind of unstructured chat? Or I could add an issue (I think others in a similar area were closed already) to add a discussion thread on here?

Snuffleupagus · 2024-08-20T13:00:37Z

The risk with this patch, as I see it, is that it's trying to optimize for a "pathological" case of bad PDF documents with a very long single-level /Pages tree, perhaps at the expense of correctly structured PDF documents where the /Pages tree has multiple levels.

This could thus regress performance for proper PDF documents, when disableAutoFetch = true is set, by triggering more data-loading than necessary when trying to fetch a specific page.

If there's a better way to allow for that and not try to load the last page before rendering anything when disableAutoFetch is specified (for example using the condition in that suggestion as a validation rather than calling getPageDict at all in checkLastPage in that scenario) that would be fine too.

I unfortunately cannot imagine any way of doing that that would be generally safe, given all the ways real-world PDF documents can be (and often are) broken.

Snuffleupagus · 2024-08-20T13:14:13Z

/botio test

moz-tools-bot · 2024-08-20T13:14:16Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/d99a4aa1e638338/output.txt

moz-tools-bot · 2024-08-20T13:14:16Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/191f298828b139b/output.txt

richard-smith-preservica · 2024-08-20T13:27:46Z

This could thus regress performance for proper PDF documents, when disableAutoFetch = true is set, by triggering more data-loading than necessary when trying to fetch a specific page.

We're already guaranteed to attempt to load the last page, so all of the entries added to nodesToVisit will be inspected already. I don't think this patch results in any additional being loaded, though it does mean it happens more intensively.

If you want to restrict the scope, it could make sense to only do this preload for items in the top level /Pages container - it's probably reasonable to assume that if your PDF has a /Pages tree, rather than all the pages in the top level, the tree is structured and located in a way that makes the existing one-by-one load okay.

moz-tools-bot · 2024-08-20T13:44:36Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/d99a4aa1e638338/output.txt

Total script time: 30.32 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 11
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/d99a4aa1e638338/reftest-analyzer.html#web=eq.log

moz-tools-bot · 2024-08-20T14:02:02Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/191f298828b139b/output.txt

Total script time: 47.75 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 3

Image differences available at: http://54.193.163.58:8877/191f298828b139b/reftest-analyzer.html#web=eq.log

richard-smith-preservica · 2024-08-20T14:20:09Z

Those tests seem to have failed as a result of different anti-aliasing in rendering 🤔 unsure how this could have changed that

Snuffleupagus · 2024-08-20T14:41:33Z

If you want to restrict the scope, it could make sense to only do this preload for items in the top level /Pages container

That's an excellent idea, since it'd reduce the impact of these changes. Please implement that and I'd be happy to approve the patch (after a final round of testing).

Those tests seem to have failed as a result of different anti-aliasing in rendering 🤔 unsure how this could have changed that

That's all known intermittent failures, so nothing to worry about here.

richard-smith-preservica · 2024-08-20T16:41:48Z

Cool, I'll push something tomorrow

Snuffleupagus

I don't think that we need the following part included in the commit message, since it's not really relevant to the final patch (and it also describes changes that we cannot safely make).

- The other alternative for fixing this use case is to simply not validate the last page at all, so pages can be loaded on demand. But that validation was added for good reasons, and this would also result in a bad experience if you didn't read the document from the front. Or assume in certain conditions that the top level /Pages contains only pages (see https://github.com/mozilla/pdf.js/compare/master...richard-smith-preservica:pdf.js:rcs/assume-all-pages-in-top-level-when-likely?expand=1), but that allows for particular edge case 'bad' PDFs to render incorrectly

eslint

Review - Fix new promise side of fetch; local cache variable; validation on when to prefetch

r=me, the final comments fixed; thank you for the patch!

src/core/catalog.js

- When adding page dict candidates to the lookup tree, also initiate fetching them from xref, so if they are not yet loaded at all, the XHR will be sent - Only at the top level - assume that if there is a /Pages tree, it is sensibly structured and the number of requests won't be too bad - We can then await on the cached Promise without making the requests pipeline - This has a significant performance improvement for load-on-demand (i.e. with auto-fetch turned off) when a PDF has a large number of pages in the top level /Pages collection, and those pages are spread through a file, so every candidate needs to be fetched separately - PDFs with many pages where each page is a big image and all the pages are at the top level are quite a common output for digitisation programmes - I would have liked to do something like "if it's the top level collection and page count = number of kids, then just fetch that page without traversing the tree" but unfortunately I agree with comments on mozilla#8088 that there is no good general solution to allow for /Pages nodes with empty /Kids arrays

Snuffleupagus · 2024-08-21T10:34:15Z

/botio test

moz-tools-bot · 2024-08-21T10:34:17Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/facafc33e1fa7c0/output.txt

moz-tools-bot · 2024-08-21T10:34:17Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/1b256027939ec51/output.txt

moz-tools-bot · 2024-08-21T11:04:47Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/facafc33e1fa7c0/output.txt

Total script time: 30.49 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 14
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/facafc33e1fa7c0/reftest-analyzer.html#web=eq.log

moz-tools-bot · 2024-08-21T11:21:02Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/1b256027939ec51/output.txt

Total script time: 46.73 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 4

Image differences available at: http://54.193.163.58:8877/1b256027939ec51/reftest-analyzer.html#web=eq.log

timvandermeij added core performance labels Aug 19, 2024

Snuffleupagus reviewed Aug 20, 2024

View reviewed changes

src/core/catalog.js Outdated Show resolved Hide resolved

Snuffleupagus reviewed Aug 20, 2024

View reviewed changes

src/core/catalog.js Outdated Show resolved Hide resolved

richard-smith-preservica force-pushed the rcs/send-page-dict-requests-in-parallel branch 2 times, most recently from 348fb11 to 4b36de2 Compare August 20, 2024 11:42

richard-smith-preservica force-pushed the rcs/send-page-dict-requests-in-parallel branch from 4b36de2 to 56358b0 Compare August 21, 2024 09:24

Snuffleupagus approved these changes Aug 21, 2024

View reviewed changes

src/core/catalog.js Outdated Show resolved Hide resolved

src/core/catalog.js Outdated Show resolved Hide resolved

richard-smith-preservica force-pushed the rcs/send-page-dict-requests-in-parallel branch 2 times, most recently from 563e9f0 to 0d32f95 Compare August 21, 2024 10:07

richard-smith-preservica force-pushed the rcs/send-page-dict-requests-in-parallel branch from 0d32f95 to a67b9ae Compare August 21, 2024 10:08

Snuffleupagus merged commit 908f453 into mozilla:master Aug 21, 2024
8 checks passed

richard-smith-preservica mentioned this pull request Aug 21, 2024

Get "unnecessary" range on first page #14570

Closed

YuvrajKaushal mentioned this pull request Sep 30, 2024

[Snyk] Upgrade pdfjs-dist from 4.4.168 to 4.6.82 YuvrajKaushal/lobe-chatSecure#5

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Send fetch requests for all page dict lookups in parallel #18627

Send fetch requests for all page dict lookups in parallel #18627

richard-smith-preservica commented Aug 19, 2024

richard-smith-preservica commented Aug 19, 2024

Snuffleupagus commented Aug 19, 2024

richard-smith-preservica commented Aug 20, 2024

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus commented Aug 20, 2024 •

edited

Loading

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus commented Aug 20, 2024 •

edited

Loading

Snuffleupagus commented Aug 20, 2024

moz-tools-bot commented Aug 20, 2024

moz-tools-bot commented Aug 20, 2024

richard-smith-preservica commented Aug 20, 2024

moz-tools-bot commented Aug 20, 2024

moz-tools-bot commented Aug 20, 2024

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus commented Aug 20, 2024

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus left a comment •

edited

Loading

Snuffleupagus commented Aug 21, 2024

moz-tools-bot commented Aug 21, 2024

moz-tools-bot commented Aug 21, 2024

moz-tools-bot commented Aug 21, 2024

moz-tools-bot commented Aug 21, 2024

Send fetch requests for all page dict lookups in parallel #18627

Send fetch requests for all page dict lookups in parallel #18627

Conversation

richard-smith-preservica commented Aug 19, 2024

richard-smith-preservica commented Aug 19, 2024

Snuffleupagus commented Aug 19, 2024

richard-smith-preservica commented Aug 20, 2024

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus commented Aug 20, 2024 • edited Loading

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus commented Aug 20, 2024 • edited Loading

Snuffleupagus commented Aug 20, 2024

moz-tools-bot commented Aug 20, 2024

From: Bot.io (Linux m4)

Received

moz-tools-bot commented Aug 20, 2024

From: Bot.io (Windows)

Received

richard-smith-preservica commented Aug 20, 2024

moz-tools-bot commented Aug 20, 2024

From: Bot.io (Linux m4)

Failed

moz-tools-bot commented Aug 20, 2024

From: Bot.io (Windows)

Failed

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus commented Aug 20, 2024

richard-smith-preservica commented Aug 20, 2024

Snuffleupagus left a comment • edited Loading

Choose a reason for hiding this comment

Snuffleupagus commented Aug 21, 2024

moz-tools-bot commented Aug 21, 2024

From: Bot.io (Linux m4)

Received

moz-tools-bot commented Aug 21, 2024

From: Bot.io (Windows)

Received

moz-tools-bot commented Aug 21, 2024

From: Bot.io (Linux m4)

Failed

moz-tools-bot commented Aug 21, 2024

From: Bot.io (Windows)

Failed

Snuffleupagus commented Aug 20, 2024 •

edited

Loading

Snuffleupagus commented Aug 20, 2024 •

edited

Loading

Snuffleupagus left a comment •

edited

Loading