Send fetch requests for all page dict lookups in parallel #18627
Conversation
richard-smith-preservica commented on Aug 19, 2024:
- When adding page dict candidates to the lookup tree, also initiate fetching them from the xref, so that if they are not yet loaded at all, the XHR is sent immediately.
- We can then await the cached Promise without the requests being pipelined one after another.
- This gives a significant performance improvement for load-on-demand (i.e. with auto-fetch turned off) when a PDF has a large number of pages in the top-level /Pages collection and those pages are spread through the file, so every candidate needs to be fetched separately.
- PDFs with many pages, where each page is a big image and all the pages are at the top level, are quite a common output of digitisation programmes.
- I would have liked to do something like "if it's the top-level collection and the page count equals the number of kids, then just fetch that page without traversing the tree", but unfortunately I agree with the comments on #8088 ("Error when I try to view a pdf (uncaught exception: Page index 0 not found.)") that there is no good general solution that allows for /Pages nodes with empty /Kids arrays.
- The other alternative for fixing this use case is to simply not validate the last page at all, so pages can be loaded on demand. But that validation was added for good reasons, and this would also result in a bad experience if you didn't read the document from the front.

I also have a further commit that makes these PDFs render really fast by using some "looks likely" measures of whether it's a PDF with everything at the top level; I'll try to discuss that separately and see if you want it too.
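To make the first point above concrete, here's a minimal sketch of the idea (the helper name is hypothetical; only xref.fetchAsync is the actual pdf.js API, and the real logic lives in Catalog.getPageDict):

```js
// Sketch only: when queuing page-dict candidates for later inspection, also
// start fetching them, so that with auto-fetch disabled the XHRs for all
// candidates are in flight at once instead of one per traversal step.
function queuePageDictCandidates(xref, nodesToVisit, kids) {
  for (const kidRef of kids) {
    nodesToVisit.push(kidRef);
    // Fire-and-forget: this just warms the xref's cache. Any error is
    // surfaced later, when the node is actually visited and the cached
    // promise is awaited.
    xref.fetchAsync(kidRef).catch(() => {});
  }
}
```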
Thanks for the patch; based on an initial look I've got some overall questions/comments here.
This patch will thus trigger data-loading eagerly, rather than lazily as currently done, which seems like it might lead to unnecessary data-loading in some cases. Hence my immediate worry is how this will affect things, and whether it's actually a good idea in the general case.
The more important question is how it'll affect performance of the normal use-case, i.e. where the entire PDF document is loaded, since that's what e.g. the Firefox PDF Viewer does and this use-case is thus more important to us.
Given that no such PDF document was provided here, testing this becomes somewhat difficult.
It looks like the last page is always requested when loading a document (since change d0c4bbd), which means we're currently always traversing almost the entire /Pages tree serially. (Indeed, that was the symptom I was trying to fix with this proposal.) If the data isn't already loaded then this will cause all those /Pages and /Page nodes to be requested immediately via XHR instead, yes. My understanding is that when using streaming/auto-fetch, the data is already loaded by the time loadDocument is called and xref.fetchAsync won't cause another XHR, but maybe I've missed something.
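To illustrate the "await on the cached Promise" part: conceptually, the xref keeps a promise per reference, along the lines of this simplified, hypothetical stand-in (not the actual pdf.js implementation):

```js
// Illustrative only: a promise cache of the kind that makes the prefetch pay
// off. The first fetch for a reference stores its promise; later lookups
// await that same in-flight promise instead of issuing a new request.
class PromiseCache {
  #promises = new Map();

  getOrFetch(key, fetchFn) {
    let promise = this.#promises.get(key);
    if (!promise) {
      promise = fetchFn(key);
      this.#promises.set(key, promise);
    }
    return promise;
  }
}
```

With a cache like that, the later await during traversal resolves from the request that's already in flight, rather than starting a fresh serial round trip.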
The samples I have are 50MB+ (and customer data, but they're from a public portal, so I can probably get permission to share them); do you want attachments that large?
This is the other alternative.
Yes, but since the patch wants to change the timing of this data-loading, it's consequently important that we consider this properly to avoid any possible surprises later. (Especially since the patch seems targeted at a special case of "bad" PDF documents, i.e. those with large single-level /Pages trees.)
I don't believe that there are necessarily any such guarantees in general; it might not be the case with e.g. linearized PDFs.
Sorry, but I don't think that's really an option here.
Currently it appears that the patch doesn't work, though; I'll leave some inline comments below. (Please remember to squash the commits when updating the patch.)
Force-pushed from 348fb11 to 4b36de2.
That's fair enough, which is why I split the two suggestions. There is definitely a valuable use case here ("I want to quickly render something from a large PDF with all the /Pages at the top level"). If there's a better way to allow for that, and to not try to load the last page before rendering anything when disableAutoFetch is specified (for example, using the condition in that suggestion as a validation, rather than calling getPageDict at all in checkLastPage in that scenario), that would be fine too. I joined the Matrix room too, which might be better for that kind of unstructured chat? Or I could add an issue (I think others in a similar area were closed already) to serve as a discussion thread on here?
The risk with this patch, as I see it, is that it's trying to optimize for a "pathological" case of bad PDF documents with a very long single-level /Pages tree, perhaps at the expense of correctly structured PDF documents where the /Pages tree has multiple levels; it could thus regress performance for proper PDF documents.
I unfortunately cannot imagine any way of doing that which would be generally safe, given all the ways real-world PDF documents can be (and often are) broken.
/botio test
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/d99a4aa1e638338/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/191f298828b139b/output.txt
We're already guaranteed to attempt to load the last page, so all of the entries added to nodesToVisit will be inspected anyway. I don't think this patch results in any additional data being loaded, though it does mean the loading happens more intensively. If you want to restrict the scope, it could make sense to only do this preload for items in the top-level /Pages container (see the sketch below): it's probably reasonable to assume that if your PDF has a structured /Pages tree, rather than all the pages at the top level, the tree is organised and located in a way that makes the existing one-by-one load okay.
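Concretely, I'm imagining something along these lines (names and the root-detection condition are hypothetical, just to make the suggestion clear):

```js
// Hypothetical sketch of the restriction: only prefetch kids that hang
// directly off the root /Pages node; deeper levels of a structured tree keep
// the existing lazy, one-by-one lookup. `Ref` is pdf.js's indirect-reference
// class; tracking `pagesRootRef` is an assumed detail of the traversal.
function queueKids(xref, nodesToVisit, kids, currentRef, pagesRootRef) {
  for (const kidRef of kids) {
    nodesToVisit.push(kidRef);
    if (currentRef === pagesRootRef && kidRef instanceof Ref) {
      xref.fetchAsync(kidRef).catch(() => {}); // warm the cache in parallel
    }
  }
}
```

Deeper nodes would then keep the current lazy behaviour unchanged.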
From: Bot.io (Linux m4)
Failed. Full output at http://54.241.84.105:8877/d99a4aa1e638338/output.txt
Total script time: 30.32 mins
Image differences available at: http://54.241.84.105:8877/d99a4aa1e638338/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Full output at http://54.193.163.58:8877/191f298828b139b/output.txt
Total script time: 47.75 mins
Image differences available at: http://54.193.163.58:8877/191f298828b139b/reftest-analyzer.html#web=eq.log
Those tests seem to have failed as a result of different anti-aliasing in the rendering 🤔 I'm unsure how this patch could have changed that.
That's an excellent idea, since it'd reduce the impact of these changes. Please implement that and I'd be happy to approve the patch (after a final round of testing).
Those are all known intermittent failures, so there's nothing to worry about here.
Cool, I'll push something tomorrow.
Force-pushed from 4b36de2 to 56358b0.
I don't think that we need the following part included in the commit message, since it's not really relevant to the final patch (and it also describes changes that we cannot safely make).
- The other alternative for fixing this use case is to simply not validate the last page at all, so pages can be loaded on demand. But that validation was added for good reasons, and this would also result in a bad experience if you didn't read the document from the front. Or assume in certain conditions that the top level /Pages contains only pages (see https://github.com/mozilla/pdf.js/compare/master...richard-smith-preservica:pdf.js:rcs/assume-all-pages-in-top-level-when-likely?expand=1), but that allows for particular edge case 'bad' PDFs to render incorrectly
Commits: "eslint"; "Review - Fix new promise side of fetch; local cache variable; validation on when to prefetch"
r=me with the final comments fixed; thank you for the patch!
Force-pushed from 563e9f0 to 0d32f95.
- When adding page dict candidates to the lookup tree, also initiate fetching them from the xref, so that if they are not yet loaded at all, the XHR is sent immediately.
- Only do this at the top level: assume that if there is a /Pages tree, it is sensibly structured and the number of requests won't be too bad.
- We can then await the cached Promise without the requests being pipelined one after another.
- This gives a significant performance improvement for load-on-demand (i.e. with auto-fetch turned off) when a PDF has a large number of pages in the top-level /Pages collection and those pages are spread through the file, so every candidate needs to be fetched separately.
- PDFs with many pages, where each page is a big image and all the pages are at the top level, are quite a common output of digitisation programmes.
- I would have liked to do something like "if it's the top-level collection and the page count equals the number of kids, then just fetch that page without traversing the tree", but unfortunately I agree with the comments on mozilla#8088 that there is no good general solution that allows for /Pages nodes with empty /Kids arrays.
Force-pushed from 0d32f95 to a67b9ae.
/botio test
From: Bot.io (Linux m4)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/facafc33e1fa7c0/output.txt

From: Bot.io (Windows)
Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/1b256027939ec51/output.txt

From: Bot.io (Linux m4)
Failed. Full output at http://54.241.84.105:8877/facafc33e1fa7c0/output.txt
Total script time: 30.49 mins
Image differences available at: http://54.241.84.105:8877/facafc33e1fa7c0/reftest-analyzer.html#web=eq.log

From: Bot.io (Windows)
Failed. Full output at http://54.193.163.58:8877/1b256027939ec51/output.txt
Total script time: 46.73 mins
Image differences available at: http://54.193.163.58:8877/1b256027939ec51/reftest-analyzer.html#web=eq.log