[api-minor] Allow specifying custom match logic in PDFFindController #18549

Merged

Conversation

nicolo-ribaudo
Contributor

@nicolo-ribaudo nicolo-ribaudo commented Aug 2, 2024

Allow specifying custom match logic in PDFFindController

This patch allows embedders of PDF.js to provide custom match logic for searching in PDFs. This is done by subclassing the PDFFindController class and overriding the match method.

match is called once per PDF page, receives as parameters the search query, the page contents, and the page index, and returns an array of { index, length } objects representing the search results.

This is my proposed API for #18482. It is mostly moving code around, to carve out a (public) method with the minimum possible API that non-Firefox embedders can use to provide their own custom search. More specifically:

  • the logic in #calculateMatch that builds the RegExp has been moved to #calculateRegExpMatch, so that #calculateMatch is agnostic to the matching logic and only takes care of running the matcher and updating the state based on the match result
  • #calculateRegExpMatch has been renamed to match(...), which subclasses can override
  • #calculateMatch supports calling match even when it's an async function. This does not affect PDF.js itself (since #calculateMatch() is already called in a non-awaited .then(() => ...)), but makes it possible for consumers to have async match logic.
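The split described in the list above can be sketched roughly as follows. This is a simplified stand-in and not the actual PDF.js source: `FindControllerSketch`, its public `calculateMatch` method, and the `pageMatches` / `pageMatchesLength` fields are names assumed for illustration, and the default matcher here is a plain escaped, case-insensitive RegExp rather than the one PDF.js builds.

```javascript
// Simplified stand-in for PDFFindController; all names here are hypothetical.
class FindControllerSketch {
  constructor(pageContents) {
    this.pageContents = pageContents; // one string per page
    this.pageMatches = [];
    this.pageMatchesLength = [];
  }

  // Overridable matcher: returns an array of { index, length } objects.
  // This default escapes the query and runs a case-insensitive RegExp search.
  match(query, text, pageIndex) {
    const escaped = query.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const matches = [];
    for (const { index, 0: found } of text.matchAll(new RegExp(escaped, "gi"))) {
      matches.push({ index, length: found.length });
    }
    return matches;
  }

  // Agnostic to the matching logic: runs the matcher and updates the state.
  calculateMatch(query, pageIndex) {
    const results = this.match(query, this.pageContents[pageIndex], pageIndex);
    this.pageMatches[pageIndex] = results.map(m => m.index);
    this.pageMatchesLength[pageIndex] = results.map(m => m.length);
  }
}

// A subclass only needs to replace `match`.
class WholePageFindController extends FindControllerSketch {
  match(query, text) {
    // Toy logic: highlight the whole page when it contains the query.
    return text.includes(query) ? [{ index: 0, length: text.length }] : [];
  }
}
```

In this sketch the state update never inspects how the matches were produced, which is the property that lets a subclass swap in arbitrary match logic.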

I believe that this API is minimal enough that it won't cause problems if in the future PDFFindController needs to be refactored, as @calixteman mentioned in #18482 (comment).

Some examples of how it can be used:

External search provider
import fuzzySearch from "some-fuzzy-search-library";

class FuzzyFindController extends PDFFindController {
  // "query" is a string
  match(query, text) {
    const results = fuzzySearch(query, text);
    return results.map(({ index, value }) => ({ index, length: value.length }));
  }
}
Multi-word search

This is already supported by PDF.js, but as far as I can tell it cannot be used through the Firefox UI. This example is how it would be implemented in an alternative timeline where this PR would have happened before adding support for multi-word search.

This example assumes that in that alternative universe convertToRegExpString is not private, and it accepts pageIndex instead of this._hasDiacritics[pageIndex].

class MultiWordFindController extends PDFFindController {
  // "query" is an array of strings
  match(query, text, pageIndex) {
    let isUnicode = false;
    // Words are sorted in reverse order to be sure that "foobar" is matched
    // before "foo" in case the query is "foobar foo".
    const queryStr = query
      .sort()
      .reverse()
      .map(q => {
        const [isUnicodePart, queryPart] = this.convertToRegExpString(
          q,
          pageIndex
        );
        isUnicode ||= isUnicodePart;
        return `(${queryPart})`;
      })
      .join("|");

    const flags = `g${isUnicode ? "u" : ""}${this.state.caseSensitive ? "" : "i"}`;
    query = new RegExp(queryStr, flags);

    const matches = [];
    for (const { index, 0: match } of text.matchAll(query)) {
      matches.push({ index, length: match.length });
    }
    return matches;
  }
}
Simple multi-page search

EDIT: This example does not apply anymore now that we only support sync .match. See #18549 (comment) for an async matcher example.

This implementation uses some _-prefixed properties of PDFFindController. Assuming that they are meant to be private (I can open a PR to replace _ with # if needed, after this PR lands), there is also a second implementation that only uses the real public API.

class MultiPageFindController extends PDFFindController {
  // "query" is a string
  async match(query, text, pageIndex) {
    let prefix = "", suffix = "";
    if (pageIndex > 0) {
      await this._extractTextPromises[pageIndex - 1];
      prefix = this._pageContents[pageIndex - 1].slice(1 - query.length) + " ";
    }
    if (pageIndex + 1 < this._linkService.pagesCount) {
      await this._extractTextPromises[pageIndex + 1];
      suffix = " " + this._pageContents[pageIndex + 1].slice(0, query.length - 1);
    }
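    // Remember where the current page ends inside the combined text.
    const pageEnd = prefix.length + text.length;
    text = prefix + text + suffix;

    const matches = [];
    let index = -1;
    while ((index = text.indexOf(query, index + 1)) !== -1) {
      // Clamp each match to the part that lies on the current page.
      const start = Math.max(prefix.length, index);
      const end = Math.min(index + query.length, pageEnd);
      if (end > start) {
        matches.push({ index: start - prefix.length, length: end - start });
      }
    }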
    return matches;
  }
}
class MultiPageFindController extends PDFFindController {
  #linkService;
  #pageContents = [];
  #pageContentsPromises = [];

  constructor(opts) {
    super(opts);
    this.#linkService = opts.linkService;
  }

  async #getPageContent(index) {
    if (this.#pageContents[index] == null) {
      this.#pageContentsPromises[index] ??= Promise.withResolvers();
      await this.#pageContentsPromises[index].promise;
    }
    return this.#pageContents[index];
  }

  // "query" is a string
  async match(query, text, pageIndex) {
    if (this.#pageContents[pageIndex] == null) {
      this.#pageContents[pageIndex] = text;
      this.#pageContentsPromises[pageIndex]?.resolve();
    }

    const [prevPage, nextPage] = await Promise.all([
      pageIndex > 0 ? this.#getPageContent(pageIndex - 1) : "",
      pageIndex + 1 < this.#linkService.pagesCount ? this.#getPageContent(pageIndex + 1) : "",
    ]);
    const prefix = prevPage.slice(1 - query.length) + " ";
    const suffix = " " + nextPage.slice(0, query.length - 1);
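    // Remember where the current page ends inside the combined text.
    const pageEnd = prefix.length + text.length;
    text = prefix + text + suffix;

    const matches = [];
    let index = -1;
    while ((index = text.indexOf(query, index + 1)) !== -1) {
      // Clamp each match to the part that lies on the current page.
      const start = Math.max(prefix.length, index);
      const end = Math.min(index + query.length, pageEnd);
      if (end > start) {
        matches.push({ index: start - prefix.length, length: end - start });
      }
    }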
    return matches;
  }
}

There are two questions on which I keep swinging back and forth:

  • Should the API be subclass-based, or parameter-based?
    class CustomFindController extends PDFFindController {
      match(...) {}
    }
    vs
    new PDFFindController({
      /* ...various other options..., */
      matcher(...) {}
    });
    PDFFindController already accepts multiple parameters to control its behavior, but having it as a subclass extension point leads to a cleaner implementation.
  • Should the isEntireWord check apply after running the matcher, or as part of the default matcher? Running it afterwards means that it would easily be available to custom matching logic (simply by setting entireWord: true on the dispatched "find" event), but on the other hand it feels like it's part of the match logic itself.
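To make the second question more concrete, applying the entire-word check after the matcher could look roughly like the sketch below. `filterEntireWord` and `isWordChar` are hypothetical helpers, and the Unicode-property test is a simplification assumed for illustration; the check that PDFFindController itself performs is not reproduced here.

```javascript
// Hypothetical helper: treats letters, digits and "_" as word characters.
function isWordChar(ch) {
  return ch !== undefined && /[\p{L}\p{N}_]/u.test(ch);
}

// Keeps only the matches that are not directly adjacent to a word character.
function filterEntireWord(matches, text) {
  return matches.filter(({ index, length }) => {
    const before = text[index - 1];
    const after = text[index + length];
    return !isWordChar(before) && !isWordChar(after);
  });
}

const text = "foo foobar foo";
// Raw results, as any matcher (default or custom) might return them.
const raw = [
  { index: 0, length: 3 },
  { index: 4, length: 3 },
  { index: 11, length: 3 },
];
// The match at index 4 is dropped because it is followed by "b".
const whole = filterEntireWord(raw, text);
```

With this shape the filter only needs the `{ index, length }` results and the page text, so it would work unchanged for custom matchers.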

Please let me know what you think about this :)

PS. If this feature gets accepted and it will need any maintenance in the future feel free to ping me, similarly to how I have been keeping an eye on the various bugs related to text selection.

Collaborator

@Snuffleupagus Snuffleupagus left a comment

Based on a very quick look, there appear to be some unrelated changes in the patch.

@Snuffleupagus Snuffleupagus changed the title Allow specifying custom match logic in PDFFindController [api-minor] Allow specifying custom match logic in PDFFindController Aug 2, 2024
@nicolo-ribaudo nicolo-ribaudo force-pushed the custom-find-matcher-subclass branch from 2847d84 to 42b2f48 Compare August 4, 2024 20:17
@nicolo-ribaudo
Contributor Author

nicolo-ribaudo commented Aug 4, 2024

Thanks for the first review! I addressed all the comments except for those regarding the changes related to the new await.

The reason I added await in front of the this.match call (and thus, for making #calculateMatch async) is so that subclasses can easily use async search logic (mostly, for calling to external services) without being limited by the sync-ness of the API. This causes minimal changes to PDFFindController, since #calculateMatch was already called in a .then callback. I tried to document the need for await in the JSDoc comment of match, which lists Promise<SingleFindMatch[]> | SingleFindMatch[] as the return value.

While this await is a very-nice-to-have, it's not strictly necessary for consumers that need async search logic. Even if PDFFindController only supported sync searches, they would still be able to use async logic by triggering two separate find events with the same query (one that spawns all the search jobs and returns no matches, and then once they are done one to collect all the results) — if that await is a blocking problem for this PR and my explanation isn't convincing, I can remove it.

@Snuffleupagus
Collaborator

Snuffleupagus commented Aug 5, 2024

I tried to document the need for await in the JSDoc comment of match, which lists Promise<SingleFindMatch[]> | SingleFindMatch[] as the return value.

Sure, but my issue is that it's completely impossible to understand that from looking only at the code itself. Hence why it feels like increasing the maintenance burden, since you need to either remember or (somehow) figure out why the code has unneeded asynchronicity. (Unless I'm missing things, the async-support also doesn't appear to be tested...)

While this await is a very-nice-to-have, it's not strictly necessary for consumers that need async search logic.

If, and that's a very big if in my opinion, we should even consider that, there need to be actual users wanting this; not just that it'd be theoretically nice to have.
Can we please skip the extra asynchronicity for now, and wait until actual real-world use-cases (that cannot be solved otherwise) emerge first?

Edit: In the event that my opinion is overruled by a majority wanting to keep the new async-behaviour, I'll however insist on that being properly covered by dedicated unit-tests.

@nicolo-ribaudo
Contributor Author

nicolo-ribaudo commented Aug 5, 2024

Well the reason I added it is that the application I'm working on would use it, it's not a theoretical use case. 😛 More specifically, we rely on an external search provider that supports searching semantically based on "similar meaning". This provider is however running asynchronously, and there is no way for me to call it synchronously.

However, as I mentioned above, I believe I can work around a sync-only API as follows:

class AsyncPDFFindController extends PDFFindController {
  #eventBus;

  #currentQuery = null;
  #pendingMatches = new Map();
  #matchResults = new Map();

  constructor(opts) {
    super(opts);
    this.#eventBus = opts.eventBus;
  }

  match(query, text, pageIndex) {
    if (this.#currentQuery !== query) {
      this.#matchResults.clear();
      this.#pendingMatches.clear();
      this.#currentQuery = query;
    }

    if (this.#matchResults.has(pageIndex)) {
      return this.#matchResults.get(pageIndex);
    }

    if (!this.#pendingMatches.has(pageIndex)) {
      this.#pendingMatches.set(
        pageIndex,
        this.matchAsync(query, text, pageIndex).then(matches => {
          if (this.#currentQuery !== query) return;
          this.#matchResults.set(pageIndex, matches);
          this.#pendingMatches.delete(pageIndex);
        })
      );
    }

    const { state } = this;
    if (state.type !== "custom-reloadmatches") {
      this.#pendingMatches.get(pageIndex).then(() => {
        if (this.#currentQuery !== query) return;
        this.#eventBus.dispatch("find", {
          ...state,
          type: "custom-reloadmatches",
        });
      });
    }

    return undefined;
  }

  async matchAsync() {
    throw new Error("Must be implemented by a sub-class");
  }
}

And then I can have my own async search provider by extending this AsyncPDFFindController class and defining a matchAsync method. It's not as nice as PDFFindController directly supporting an async match, but it should work too. For now I will remove it from this PR, and I will come back in the future if my approach ends up not working.

I agree that if we end up having async support I need to add a test for it.

@timvandermeij
Contributor

/botio-linux preview

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/67527b455647076/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/67527b455647076/output.txt

Total script time: 1.14 mins

Published

@timvandermeij
Contributor

/botio unittest

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/a5b9e383bb6ad15/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.193.163.58:8877/0932330f74df217/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/a5b9e383bb6ad15/output.txt

Total script time: 2.51 mins

  • Unit Tests: Passed

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/0932330f74df217/output.txt

Total script time: 7.89 mins

  • Unit Tests: Passed

@timvandermeij
Contributor

/botio integrationtest

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @timvandermeij received. Current queue size: 0

Live output at: http://54.193.163.58:8877/7390ed6a542a064/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @timvandermeij received. Current queue size: 0

Live output at: http://54.241.84.105:8877/b068649f1a0f514/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/b068649f1a0f514/output.txt

Total script time: 8.57 mins

  • Integration Tests: Passed

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/7390ed6a542a064/output.txt

Total script time: 17.68 mins

  • Integration Tests: Passed

Contributor

@timvandermeij timvandermeij left a comment

Looks good to me, with one comment. Now that the asynchronous bits are removed and it's mainly moving existing code around I think that this refactoring is small and self-contained enough to be accepted.

Before we merge this let's await a check from @Snuffleupagus too, given the previous questions about the implementation, to make sure we're all aligned. Thanks!

Collaborator

@Snuffleupagus Snuffleupagus left a comment

r=me, with the remaining comments addressed and passing tests; thank you.

@Snuffleupagus
Collaborator

Now that the asynchronous bits are removed and it's mainly moving existing code around I think that this refactoring is small and self-contained enough to be accepted.

Agreed; since thinking more about the suggested async behaviour of the match-method, I'm less convinced that it'd have been generally safe and correct unfortunately. With that being async it'd have been possible for e.g. the active search-term to change while a previously pending (and slow) match-call resolves, in which case we'd then update various state with "old" data.
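The hazard described above, and one common way to guard against it, can be illustrated with a small sketch. `LatestQueryGuard` is a hypothetical helper and not part of PDF.js: every search start receives a token, and results arriving with an outdated token are dropped instead of being written into the state. The workaround posted earlier in this thread relies on a similar check when it compares `#currentQuery` with the query of the finished job.

```javascript
// Hypothetical helper that ignores results belonging to a superseded search.
class LatestQueryGuard {
  #token = 0;
  results = null;

  // Called when a new search starts; invalidates every earlier token.
  begin() {
    this.results = null;
    return ++this.#token;
  }

  // Called when a (possibly slow) matcher finishes.
  // Returns true if the results were accepted, false if they were stale.
  accept(token, results) {
    if (token !== this.#token) {
      return false;
    }
    this.results = results;
    return true;
  }
}

const guard = new LatestQueryGuard();
const slowSearch = guard.begin(); // a search for the first query starts
const fastSearch = guard.begin(); // the user changes the query meanwhile

// The newer search finishes first and is accepted.
const acceptedFast = guard.accept(fastSearch, [{ index: 2, length: 3 }]);
// The older search resolves late and must not overwrite the newer state.
const acceptedSlow = guard.accept(slowSearch, [{ index: 9, length: 3 }]);
```

Without such a guard, the late result of the first search would replace the state of the second one, which is the "old data" problem mentioned in the comment.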

This patch allows embedders of PDF.js to provide custom match
logic for searching in PDFs. This is done by subclassing the
PDFFindController class and overriding the `match` method.

`match` is called once per PDF page, receives as parameters the
search query, the page contents, and the page index, and returns
an array of { index, length } objects representing the search
results.
@nicolo-ribaudo nicolo-ribaudo force-pushed the custom-find-matcher-subclass branch from 055c4d9 to f051597 Compare August 13, 2024 08:46
@nicolo-ribaudo
Contributor Author

Updated — thanks for the reviews!

@Snuffleupagus
Collaborator

/botio unittest

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/04a79ef1a87555b/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/95bce410b582ce2/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/04a79ef1a87555b/output.txt

Total script time: 2.59 mins

  • Unit Tests: Passed

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/95bce410b582ce2/output.txt

Total script time: 7.18 mins

  • Unit Tests: Passed

@Snuffleupagus
Collaborator

/botio integrationtest

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/10dc1652d064ee2/output.txt

@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/0baa4b24e2add7e/output.txt

@Snuffleupagus Snuffleupagus linked an issue Aug 13, 2024 that may be closed by this pull request
@moz-tools-bot
Collaborator

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/0baa4b24e2add7e/output.txt

Total script time: 8.60 mins

  • Integration Tests: Passed

@moz-tools-bot
Collaborator

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/10dc1652d064ee2/output.txt

Total script time: 18.13 mins

  • Integration Tests: FAILED

@nicolo-ribaudo
Contributor Author

Regarding the timeout on Windows for "must check the new alt text flow" and "New alt-text flow", this branch does not include those tests because I have not rebased it. Should I rebase? Or is it a flaky test?

@Snuffleupagus Snuffleupagus merged commit a999b34 into mozilla:master Aug 13, 2024
6 checks passed
@nicolo-ribaudo nicolo-ribaudo deleted the custom-find-matcher-subclass branch August 13, 2024 10:23
@Snuffleupagus
Collaborator

Regarding the timeout on Windows for "must check the new alt text flow" and "New alt-text flow", this branch does not include those tests because I have not rebased it. Should I rebase? Or is it a flaky test?

I ignored that failing test here, since I don't understand how this PR could affect that one.
Given that it's a new test I'm guessing that it's got some intermittent problems.

@timvandermeij
Contributor

timvandermeij commented Aug 13, 2024

Thanks for noticing this! I have included this new one in the list of intermittents at #18396 for our overview.

/cc @calixteman in case you might have an idea what could cause this test to fail.

Successfully merging this pull request may close these issues.

[Feature]: Simple API for custom search for PDF.js embedders
4 participants