
Extractors returned too many records #1658

Closed
Zamiell opened this issue Nov 8, 2022 · 16 comments

Zamiell commented Nov 8, 2022

EDIT - See my guide below on how to solve this issue.

Michael King told me to submit a bug report on GitHub.
You can reference our conversation here: https://discourse.algolia.com/t/extractors-returned-757-records-the-maximum-is-750/16493

@shortcuts (Member)

> Michael King told me to submit a bug report on GitHub.
> You can reference our conversation here: https://discourse.algolia.com/t/extractors-returned-757-records-the-maximum-is-750/16493

Hey, yeah this is expected, not a bug. We set this hard limit to prevent crawling/indexing inconsistencies when fetching large pages.

Do you have the aggregateContent option enabled? Another way to reduce the number of records is to ensure you have precise enough selectors.

@Zamiell (Author) commented Nov 8, 2022

Yes, as I explained in the linked thread, my Algolia config looks like this, which does include the aggregateContent option:

    algolia: {
      appId: "ZCC397CSMF", // cspell:disable-line
      apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
      indexName: "isaacscript",
      contextualSearch: false, // Enabled by default; only useful for versioned sites.
      recordExtractor: ({ _, helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
            lvl3: "article h4",
            lvl4: "article h5",
            lvl5: "article h6",
            content: "article p, article li",
          },
          aggregateContent: true,
        }),
    },

However, this seems to have no effect.

> Another way to reduce the number of records is to ensure you have precise enough selectors.

I'm not sure I understand. Do the selectors that you are referring to correspond to the "recordProps" values in the config? I am just using what is recommended by your support article, and they don't seem to be doing anything.

@shortcuts (Member)

> as I explained in the linked thread, my Algolia config looks like this,

Sorry, I was on the phone and missed the link.

> However, this seems to have no effect.

Setting it to true is correct; disabling it creates a record for each content occurrence in the DOM, so that's not what you are looking for here.

> Do the selectors that you are referring to correspond to the "recordProps" values in the config?

Yup, indeed, but they seem correct. Looking at the page that triggers the errors, it seems to index the h4s, which are all equal to "Defined in" (e.g. https://isaacscript.github.io/isaac-typescript-definitions/enums/TrinketType/).

I'd personally remove the lvl3 and higher selectors for those pages, as they do not bring value to the search. It will significantly reduce the number of records.

@Zamiell (Author) commented Nov 8, 2022

Thank you shortcuts,

From what I understand of your reply, you want me to set the config to the following:

    algolia: {
      appId: "ZCC397CSMF", // cspell:disable-line
      apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
      indexName: "isaacscript",
      contextualSearch: false, // Enabled by default; only useful for versioned sites.
      recordExtractor: ({ _, helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
          },
          aggregateContent: true,
        }),
    },

However, my worry is that this would somehow reduce the quality of the search results for pages other than TrinketType (and the other 2-3 "overflow" pages).

Alternatively, is there a way to make the recordExtractor section apply only to the "overflow" pages? Or would this not be relevant for my use case?

@shortcuts (Member)

> Alternatively, is there a way to make the recordExtractor section apply only to the "overflow" pages? Or would this not be relevant for my use case?

Actions can be scoped to a specific path; see https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/. You can use the snippet you shared for only those pages and keep the other pages unchanged.
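For illustration, a minimal sketch of what that scoping could look like in the Crawler editor (the "example" index name, URLs, and selectors here are placeholders, not taken from the real config):

  actions: [
    {
      // Every page except the oversized one keeps the normal extraction.
      indexName: "example",
      pathsToMatch: [
        "https://example.github.io/**",
        "!https://example.github.io/docs/HugePage/",
      ],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
            content: "article p, article li",
          },
        }),
    },
    {
      // Only the oversized page gets the trimmed-down extraction.
      indexName: "example",
      pathsToMatch: ["https://example.github.io/docs/HugePage/"],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            content: "article p, article li",
          },
          aggregateContent: true,
        }),
    },
  ],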

@shortcuts (Member)

Closing until you confirm whether there's any other issue :D

@Zamiell (Author) commented Jan 14, 2023

> Actions can be scoped to a specific path; see https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/. You can use the snippet you shared for only those pages and keep the other pages unchanged.

Thanks, I'll try that out.

> Closing until you confirm whether there's any other issue :D

@shortcuts Even with the config that is specified in my previous post in this issue thread, I still get Algolia crawler errors:

[screenshot of the Algolia crawler errors]

So it looks like the aggregateContent option isn't working properly. Can you confirm?

@shortcuts (Member)

Hey @Zamiell, looking at your Crawler configuration, I don't see the changes above being applied. Do you need help applying the changes?

@Zamiell (Author) commented Feb 6, 2023

@shortcuts What do you mean by "the changes above"? Do you mean making the configuration scoped to a more specific path? I don't want to do that right now because I just want to have the simplest possible configuration, to eliminate any possible bugs. Yet even though aggregateContent applies to the entire site, it still doesn't work at all!

For reference, here is my configuration: https://github.com/IsaacScript/isaacscript/blob/main/packages/docs/docusaurus.config.js#L91-L105

@shortcuts (Member)

Docusaurus does not control the indexing, only the search, so those lines should be updated in the Crawler config directly.
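Roughly, that split looks like this (a sketch, not official wording; the selectors are abbreviated from the configs earlier in this thread):

  // docusaurus.config.js (themeConfig) — search-side settings only:
  algolia: {
    appId: "ZCC397CSMF",
    apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
    indexName: "isaacscript",
    contextualSearch: false,
  },

  // Crawler config (crawler.algolia.com > your crawler > Editor) — indexing-side settings,
  // including recordExtractor, recordProps, and aggregateContent:
  actions: [
    {
      indexName: "isaacscript",
      pathsToMatch: ["https://isaacscript.github.io/**"],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
            content: "article p, article li",
          },
          aggregateContent: true,
        }),
    },
  ],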

@Zamiell (Author) commented Feb 6, 2023

@shortcuts Can you explain how to update the crawler configuration directly?

I assume that you are directing me to this page:

[screenshot of the Algolia Crawler admin page]

But I don't see any boxes to paste in a recordExtractor or a recordProps or anything like that.

@shortcuts (Member)

The editor button on the left of your screenshot will show the config, which you can edit.

From my past comment, #1658 (comment), you can find links to the API reference, guides, and so on.

Let me know if you need further help with this!

@Zamiell (Author) commented Feb 6, 2023

Great, thank you, that got me on to the right track.

I'll write up a guide on how to solve this problem for others who encounter it.

Zamiell changed the title from "Extractors returned 757 records, the maximum is 750" to "Extractors returned too many records" on Feb 6, 2023

@Zamiell (Author) commented Feb 6, 2023

How to Solve "Extractors returned too many records"

Sometimes, when Algolia goes to index a very large page, it will give an error:

Extractors returned too many records

If you must have a very large page like this, you can work around the error by changing the configuration for the Algolia crawler. This guide will walk you through how to do that.

  • First, go to the Crawler homepage: https://crawler.algolia.com/admin/crawlers/
  • Click on the index that is triggering the error in question, which will take you to the "Overview" page.
  • Click on "Editor" from the left menu, which will open up a VSCode-like interface.
  • Scroll down to the actions array that is part of the object passed to the Crawler constructor.

Personally, I'm using Algolia to crawl my Docusaurus website. For reference, the default actions array for a Docusaurus website (as of February 2023) looks something like this:

  actions: [
    {
      indexName: "foo",
      pathsToMatch: ["https://foo.github.io/**"],
      recordExtractor: ({ $, helpers }) => {
        // priority order: deepest active sub list header -> navbar active item -> 'Documentation'
        const lvl0 =
          $(
            ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
          )
            .last()
            .text() || "Documentation";

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
        });
      },
    },
  ],

Step 1 - Create a new action

The first thing we should do is add a new action based on the existing one. For example, let's imagine that we have 3 specific pages that cause the "too many records" error: Foo, Bar, and Baz. First, we need to copy-paste the action into another action, and then modify the pathsToMatch entry for each of the two actions:

  actions: [
    {
      indexName: "foo",
      pathsToMatch: [
        "https://foo.github.io/**",
        "!https://foo.github.io/docs/Foo/",
        "!https://foo.github.io/docs/Bar/",
        "!https://foo.github.io/docs/Baz/",
      ],
      recordExtractor: ({ $, helpers }) => {
        // priority order: deepest active sub list header -> navbar active item -> 'Documentation'
        const lvl0 =
          $(
            ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
          )
            .last()
            .text() || "Documentation";

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
        });
      },
    },
    {
      indexName: "foo",
      pathsToMatch: [
        "https://foo.github.io/docs/Foo/",
        "https://foo.github.io/docs/Bar/",
        "https://foo.github.io/docs/Baz/",
      ],
      recordExtractor: ({ $, helpers }) => {
        // priority order: deepest active sub list header -> navbar active item -> 'Documentation'
        const lvl0 =
          $(
            ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
          )
            .last()
            .text() || "Documentation";

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
        });
      },
    },
  ],

First, note that we create a second action because we don't want to influence or somehow harm the search results of the normal, non-problematic pages on our website. We only want to tinker with the problematic pages specifically.

Second, note that we have to negate the problematic URLs in the first action because the crawler will create a record for each matched action. In other words, there is no ordering of actions: every URL gets processed through every action that matches it.
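To make that concrete, the negation in the first action is what prevents the overlap (a repeat of the relevant lines from above, with comments added):

      pathsToMatch: [
        "https://foo.github.io/**",         // matches every page on the site...
        "!https://foo.github.io/docs/Foo/", // ...except the three oversized pages,
        "!https://foo.github.io/docs/Bar/", // which would otherwise match BOTH
        "!https://foo.github.io/docs/Baz/", // actions and be extracted twice
      ],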

Now we have an action that will only apply to the problematic pages, but right now it is the exact same config (since it was copy-pasted). So, the next step is to modify the second action.

Step 2 - aggregateContent

The first thing you can try is to add aggregateContent: true to the bottom of the second action (below the indexHeadings line). Then, trigger a re-index of the site to see if the error still happens. (You can force a re-index by clicking on the "Restart crawling" button on the "Overview" page.)
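Concretely, the tail of the second action's recordExtractor would end up looking like this (only the aggregateContent line is new; everything else is unchanged from the copy-pasted action):

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
          aggregateContent: true, // <-- the new line
        });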

You can learn more about the aggregateContent option on the recordExtractor documentation page. In short, it reduces the total number of records that are generated for any particular page. For me, doing this slightly changed how the pages showed up in the search, but your mileage may vary.

If that made the error go away, then great! You're done. If not, then go on to the next step.

Step 3 - Customizing the selectors

The "lvl0" through "lvl5" fields are the selector fields. We can try to customize those to reduce the amount of records that are being gathered.

First, remove the lvl5 line from the second action. Save the config and trigger a re-index to see if the problem still happens.

If the problem persists, we can try removing the lvl4 line. And then, if that doesn't work, we can try removing the lvl3 line. And then the lvl2 line. (The idea here is that we are removing more and more content until the page is small enough to be swallowed by the crawler.)
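For example, after removing the lvl5 and lvl4 lines, the second action's helpers.docsearch call would look something like this (a sketch; only trim as many levels as you actually need to):

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
          aggregateContent: true,
        });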

If you only have lvl0 and lvl1 and you still get the error, then you are probably screwed, because lvl1 appears to be a mandatory field in the config.

@shortcuts (Member) commented Feb 7, 2023

Lovely!! Would you mind updating the doc to add a link to your comment?

(thanks a lot for taking the time!)

@Zamiell (Author) commented Feb 7, 2023

No, but feel free to copy-paste the relevant parts.
