-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[api-minor] Allow specifying custom match logic in PDFFindController #18549
[api-minor] Allow specifying custom match logic in PDFFindController #18549
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on a very quick look, there appears to be some unrelated changes in the patch.
2847d84
to
42b2f48
Compare
Thanks for the first review! I addressed all the comments except for those regarding the changes related to the new The reason I added While this |
Sure, but my issue is that it's completely impossible to understand that from looking only at the code itself. Hence why it feels like increasing the maintenance burden, since you need to either remember or (somehow) figure out why the code has unneeded asynchronicity. (Unless I'm missing things, the async-support also doesn't appear to be tested...)
If, and that's a very big if in my opinion, we should even consider that there needs to be actual users wanting this; not just that it'd be theoretically nice to have. Edit: In the event that my opinion is overruled be a majority wanting to keep the new async-behaviour, I'll however insist on that being properly covered by dedicated unit-tests. |
Well the reason I added it is that the application I'm working on would use it, it's not a theoretical use case. 😛 More specifically, we rely on an external search provider that supports searching semantically based on "similar meaning". This provider is however running asynchronously, and there is no way for me to call it synchronously. However, as I mentioned above, I believe I can workaround a sync-only API as follows: class AsyncPDFFindController extends PDFFindController {
#eventBus;
#currentQuery = null;
#pendingMatches = new Map();
#matchResults = new Map();
constructor(opts) {
super(opts);
this.#eventBus = opts.eventBus;
}
match(query, text, pageIndex) {
if (this.#currentQuery !== query) {
this.#matchResults.clear();
this.#pendingMatches.clear();
this.#currentQuery = query;
}
if (this.#matchResults.has(pageIndex)) {
return this.#matchResults.get(pageIndex);
}
if (!this.#pendingMatches.has(pageIndex)) {
this.#pendingMatches.set(
pageIndex,
this.matchAsync(query, text, pageIndex).then(matches => {
if (this.#currentQuery !== query) return;
this.#matchResults.set(pageIndex, matches);
this.#pendingMatches.delete(pageIndex);
})
);
}
const { state } = this;
if (state.type !== "custom-reloadmatches") {
this.#pendingMatches.get(pageIndex).then(() => {
if (this.#currentQuery !== query) return;
this.#eventBus.dispatch("find", {
...state,
type: "custom-reloadmatches",
});
});
}
return undefined;
}
async matchAsync() {
throw new Error("Must be implemented by a sub-class");
}
} And then I can have my own async search provider by extending this I agree that if we end up having async support I need to add a test for it. |
42b2f48
to
055c4d9
Compare
/botio-linux preview |
From: Bot.io (Linux m4)ReceivedCommand cmd_preview from @timvandermeij received. Current queue size: 0 Live output at: http://54.241.84.105:8877/67527b455647076/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/67527b455647076/output.txt Total script time: 1.14 mins Published |
/botio unittest |
From: Bot.io (Linux m4)ReceivedCommand cmd_unittest from @timvandermeij received. Current queue size: 0 Live output at: http://54.241.84.105:8877/a5b9e383bb6ad15/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_unittest from @timvandermeij received. Current queue size: 0 Live output at: http://54.193.163.58:8877/0932330f74df217/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/a5b9e383bb6ad15/output.txt Total script time: 2.51 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.193.163.58:8877/0932330f74df217/output.txt Total script time: 7.89 mins
|
/botio integrationtest |
From: Bot.io (Windows)ReceivedCommand cmd_integrationtest from @timvandermeij received. Current queue size: 0 Live output at: http://54.193.163.58:8877/7390ed6a542a064/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_integrationtest from @timvandermeij received. Current queue size: 0 Live output at: http://54.241.84.105:8877/b068649f1a0f514/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/b068649f1a0f514/output.txt Total script time: 8.57 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.193.163.58:8877/7390ed6a542a064/output.txt Total script time: 17.68 mins
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, with one comment. Now that the asynchronous bits are removed and it's mainly moving existing code around I think that this refactoring is small and self-contained enough to be accepted.
Before we merge this let's await a check from @Snuffleupagus' too, given the previous questions about the implementation, to make sure we're all aligned. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r=me, with the remaining comments addressed and passing tests; thank you.
Agreed; since thinking more about the suggested async behaviour of the |
This patch allows embedders of PDF.js to provide custom match logic for seaching in PDFs. This is done by subclassing the PDFFindController class and overriding the `match` method. `match` is called once per PDF page, receives as parameters the search query, the page contents, and the page index, and returns an array of { index, length } objects representing the search results.
055c4d9
to
f051597
Compare
Updated — thanks for the reviews! |
/botio unittest |
From: Bot.io (Linux m4)ReceivedCommand cmd_unittest from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/04a79ef1a87555b/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_unittest from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/95bce410b582ce2/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/04a79ef1a87555b/output.txt Total script time: 2.59 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.193.163.58:8877/95bce410b582ce2/output.txt Total script time: 7.18 mins
|
/botio integrationtest |
From: Bot.io (Windows)ReceivedCommand cmd_integrationtest from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.193.163.58:8877/10dc1652d064ee2/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_integrationtest from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.241.84.105:8877/0baa4b24e2add7e/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.241.84.105:8877/0baa4b24e2add7e/output.txt Total script time: 8.60 mins
|
From: Bot.io (Windows)FailedFull output at http://54.193.163.58:8877/10dc1652d064ee2/output.txt Total script time: 18.13 mins
|
Regarding the timeout on Windows for "must check the new alt text flow" and "New alt-text flow", this branch does not include those tests because I have not rebased it. Should I rebase? Or is it a flaky test? |
I ignored that failing test here, since I don't understand how this PR could affect that one. |
Thanks for noticing this! I have included this new one in the list of intermittents at #18396 for our overview. /cc @calixteman in case you might have an idea what could cause this to test to fail. |
This is my proposed API for #18482. It is mostly moving code around, to carve out a (public) method with the minimum possible API that non-Firefox embedders can use to provide their own custom search. More specifically:
#calculateMatch
that builds the RegExp has been moved to#calculateRegExpMatch
, so that#calculateMatch
is agnostic to the matching logic and only takes care of running the matcher and updating the state based on the match result#calculateRegExpMatch
has been renamed tomatch(...)
, that subclasses can override#calculateMatch
supports callingmatch
even when it's an async function. This does not affect PDF.js itself (since#calculateMatch()
is already called in a non-awaited.then(() => ...)
), but makes it possible for consumers to have async match logic.I believe that this API is minimal enough that it won't cause problems if in the future PDFFindController needs to be refactored, as @calixteman mentioned in #18482 (comment).
Some examples of how it can be used:
External search provider
Multi-word search
This is already supported by PDF.js, but as far as I can tell it cannot be used through the Firefox UI. This example is how it would be implemented in an alternative timeline where this PR would have happened before adding support for multi-word search.
This example assumes that in that alternative universe
convertToRegExpString
is not private, and it acceptspageIndex
instead ofthis._hasDiacritics[pageIndex]
.Simple multi-page search
EDIT: This example does not apply anymore now that we only support sync
.match
. See #18549 (comment) for an async matcher example.This implementation uses some
_
-prefixed properties ofPDFFindController
. Assuming that they are meant to be private (I can open a PR to replace_
with#
if needed, after that this PR lands), there is also a second implementation that only uses the real public API.There are two questions that for which I keep swinging back and forth:
PDFFindController
already accepts multiple parameters to control its behavior, but having it as a subclass extension point leads to a cleaner implementation.isEntireWord
word check apply after running the matcher, or as part of the default matcher? Running after means that it would easily be available to custom matching logic (simply by settingeintireWord: true
on the dispatched"find"
event), but on the other hand it feels like its part of the match logic itself.Please let me know what you think about this :)
PS. If this feature gets accepted and it will need any maintenance in the future feel free to ping me, similarly to how I have been keeping an eye on the various bugs related to text selection.