
Extractors returned too many records #1658

Closed
Zamiell opened this issue Nov 8, 2022 · 16 comments

Zamiell commented Nov 8, 2022

EDIT - See my guide below on how to solve this issue.

Michael King told me to submit a bug report on GitHub.
You can reference our conversation here: https://discourse.algolia.com/t/extractors-returned-757-records-the-maximum-is-750/16493

@shortcuts (Member)

> Michael King told me to submit a bug report on GitHub.
> You can reference our conversation here: https://discourse.algolia.com/t/extractors-returned-757-records-the-maximum-is-750/16493

Hey, yeah this is expected, not a bug. We set this hard limit to prevent crawling/indexing inconsistencies when fetching large pages.

Do you have the aggregateContent option enabled? Another way to reduce the number of records is to ensure you have precise enough selectors.

@Zamiell (Author) commented Nov 8, 2022

Yes, as I explained in the linked thread, my Algolia config looks like this, which does include the aggregateContent option:

    algolia: {
      appId: "ZCC397CSMF", // cspell:disable-line
      apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
      indexName: "isaacscript",
      contextualSearch: false, // Enabled by default; only useful for versioned sites.
      recordExtractor: ({ _, helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
            lvl3: "article h4",
            lvl4: "article h5",
            lvl5: "article h6",
            content: "article p, article li",
          },
          aggregateContent: true,
        }),
    },

However, this seems to have no effect.

> Another way to reduce the number of records is to ensure you have precise enough selectors.

I'm not sure I understand. Do the selectors that you are referring to correspond to the "recordProps" values in the config? I am just using what is recommended by your support article, and they don't seem to be doing anything.

@shortcuts (Member)

> as I explained in the linked thread, my Algolia config looks like this,

Sorry, I was on the phone and missed the link.

> However, this seems to have no effect.

Setting it to true is correct; disabling it creates a record for each content occurrence in the DOM, so that's not what you are looking for here.

> Do the selectors that you are referring to correspond to the "recordProps" values in the config?

Yup, indeed, but they seem correct. Looking at the page that triggers the errors, it seems to index the h4s, which are all equal to "Defined in" (e.g. https://isaacscript.github.io/isaac-typescript-definitions/enums/TrinketType/).

I'd personally remove the lvl3 and higher selectors for those pages, as they do not bring value to the search. It will significantly reduce the number of records.

@Zamiell (Author) commented Nov 8, 2022

Thank you shortcuts,

From what I understand of your reply, you want me to set the config to the following:

    algolia: {
      appId: "ZCC397CSMF", // cspell:disable-line
      apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
      indexName: "isaacscript",
      contextualSearch: false, // Enabled by default; only useful for versioned sites.
      recordExtractor: ({ _, helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
          },
          aggregateContent: true,
        }),
    },

However, my worry is that this would somehow reduce the quality of the search results for pages other than TrinketType (and the other 2-3 "overflow" pages).

Alternatively, is there a way to make the recordExtractor section apply only to the "overflow" pages? Or would this not be relevant for my use case?

@shortcuts (Member)

> Alternatively, is there a way to make the recordExtractor section apply only to the "overflow" pages? Or would this not be relevant for my use case?

Actions can be scoped to a specific path; see https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/. You can use the snippet you shared for only those pages and keep the other pages unchanged.
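For illustration, a minimal sketch of what that scoping could look like in the Crawler editor (the "example" index name, URLs, and selectors here are placeholders, not taken from the real config):

  actions: [
    {
      // Every page except the oversized one keeps the normal extraction.
      indexName: "example",
      pathsToMatch: [
        "https://example.github.io/**",
        "!https://example.github.io/docs/HugePage/",
      ],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
            content: "article p, article li",
          },
        }),
    },
    {
      // Only the oversized page gets the trimmed-down extraction.
      indexName: "example",
      pathsToMatch: ["https://example.github.io/docs/HugePage/"],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            content: "article p, article li",
          },
          aggregateContent: true,
        }),
    },
  ],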

@shortcuts (Member)

Closing until you confirm whether there's any other issue :D

@Zamiell (Author) commented Jan 14, 2023

> Actions can be scoped to a specific path; see https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/. You can use the snippet you shared for only those pages and keep the other pages unchanged.

Thanks, I'll try that out.

> Closing until you confirm whether there's any other issue :D

@shortcuts Even with the config that is specified in my previous post in this issue thread, I still get Algolia crawler errors:

[screenshot of the Algolia crawler errors]

So it looks like the aggregateContent option isn't working properly. Can you confirm?

@shortcuts (Member)

Hey @Zamiell, looking at your Crawler configuration, I don't see the changes above being applied. Do you need help applying the changes?

@Zamiell (Author) commented Feb 6, 2023

@shortcuts What do you mean by "the changes above"? Do you mean making the configuration scoped to a more specific path? I don't want to do that right now because I just want to have the simplest possible configuration, to eliminate any possible bugs. Yet even though aggregateContent applies to the entire site, it still doesn't work at all!

For reference, here is my configuration: https://github.com/IsaacScript/isaacscript/blob/main/packages/docs/docusaurus.config.js#L91-L105

@shortcuts (Member)

Docusaurus does not control the indexing, only the search, so those lines should be updated in the Crawler config directly.
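Roughly, that split looks like this (a sketch, not official wording; the selectors are abbreviated from the configs earlier in this thread):

  // docusaurus.config.js (themeConfig) — search-side settings only:
  algolia: {
    appId: "ZCC397CSMF",
    apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
    indexName: "isaacscript",
    contextualSearch: false,
  },

  // Crawler config (crawler.algolia.com > your crawler > Editor) — indexing-side settings,
  // including recordExtractor, recordProps, and aggregateContent:
  actions: [
    {
      indexName: "isaacscript",
      pathsToMatch: ["https://isaacscript.github.io/**"],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            lvl0: "header h1",
            lvl1: "article h2",
            lvl2: "article h3",
            content: "article p, article li",
          },
          aggregateContent: true,
        }),
    },
  ],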

@Zamiell (Author) commented Feb 6, 2023

@shortcuts Can you explain how to update the crawler configuration directly?

I assume that you are directing me to this page:

[screenshot of the Algolia Crawler admin page]

But I don't see any boxes to paste in a recordExtractor or a recordProps or anything like that.

@shortcuts (Member)

The editor button on the left of your screenshot will show the config, which you can edit.

From my past comment, #1658 (comment), you can find links to the API reference, guides, and so on.

Let me know if you need further help with this!

@Zamiell (Author) commented Feb 6, 2023

Great, thank you, that got me on to the right track.

I'll write up a guide on how to solve this problem for others who encounter it.

Zamiell changed the title from "Extractors returned 757 records, the maximum is 750" to "Extractors returned too many records" on Feb 6, 2023

@Zamiell (Author) commented Feb 6, 2023

How to Solve "Extractors returned too many records"

Sometimes, when Algolia goes to index a very large page, it will give an error:

Extractors returned too many records

If you must have a very large page like this, you can work around the error by changing the configuration for the Algolia crawler. This guide will walk you through how to do that.

  • First, go to the Crawler homepage: https://crawler.algolia.com/admin/crawlers/
  • Click on the index that is triggering the error in question, which will take you to the "Overview" page.
  • Click on "Editor" from the left menu, which will open up a VSCode-like interface.
  • Scroll down to the actions array that is part of the object passed to the Crawler constructor.

Personally, I'm using Algolia to crawl my Docusaurus website. For reference, the default actions array for a Docusaurus website (as of February 2023) looks something like this:

  actions: [
    {
      indexName: "foo",
      pathsToMatch: ["https://foo.github.io/**"],
      recordExtractor: ({ $, helpers }) => {
        // priority order: deepest active sub list header -> navbar active item -> 'Documentation'
        const lvl0 =
          $(
            ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
          )
            .last()
            .text() || "Documentation";

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
        });
      },
    },
  ],

Step 1 - Create a new action

The first thing we should do is add a new action based on the existing one. For example, let's imagine that we have 3 specific pages that cause the "too many records" error: Foo, Bar, and Baz. First, we need to copy-paste the action into another action, and then modify the pathsToMatch entry for each of the two actions:

  actions: [
    {
      indexName: "foo",
      pathsToMatch: [
        "https://foo.github.io/**",
        "!https://foo.github.io/docs/Foo/",
        "!https://foo.github.io/docs/Bar/",
        "!https://foo.github.io/docs/Baz/",
      ],
      recordExtractor: ({ $, helpers }) => {
        // priority order: deepest active sub list header -> navbar active item -> 'Documentation'
        const lvl0 =
          $(
            ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
          )
            .last()
            .text() || "Documentation";

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
        });
      },
    },
    {
      indexName: "foo",
      pathsToMatch: [
        "https://foo.github.io/docs/Foo/",
        "https://foo.github.io/docs/Bar/",
        "https://foo.github.io/docs/Baz/",
      ],
      recordExtractor: ({ $, helpers }) => {
        // priority order: deepest active sub list header -> navbar active item -> 'Documentation'
        const lvl0 =
          $(
            ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
          )
            .last()
            .text() || "Documentation";

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
        });
      },
    },
  ],

First, note that we create a second action because we don't want to influence or somehow harm the search results of the normal, non-problematic pages on our website. We only want to tinker with the problematic pages specifically.

Second, note that we have to negate the problematic URLs in the first action because the crawler will create a record for each matched action. In other words, there is no ordering of actions: every URL gets processed through every action that matches it.
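To make that concrete, the negation in the first action is what prevents the overlap (a repeat of the relevant lines from above, with comments added):

      pathsToMatch: [
        "https://foo.github.io/**",         // matches every page on the site...
        "!https://foo.github.io/docs/Foo/", // ...except the three oversized pages,
        "!https://foo.github.io/docs/Bar/", // which would otherwise match BOTH
        "!https://foo.github.io/docs/Baz/", // actions and be extracted twice
      ],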

Now we have an action that will only apply to the problematic pages, but right now it is the exact same config (since it was copy-pasted). So, the next step is to modify the second action.

Step 2 - aggregateContent

The first thing you can try is to add aggregateContent: true to the bottom of the second action (below the indexHeadings line). Then, trigger a re-index of the site to see if the error still happens. (You can force a re-index by clicking on the "Restart crawling" button on the "Overview" page.)
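Concretely, the tail of the second action's recordExtractor would end up looking like this (only the aggregateContent line is new; everything else is unchanged from the copy-pasted action):

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            lvl4: "article h4",
            lvl5: "article h5, article td:first-child",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
          aggregateContent: true, // <-- the new line
        });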

You can learn more about the aggregateContent option on the recordExtractor documentation page. In short, it reduces the total number of records that are generated for any particular page. For me, doing this slightly changed how the pages showed up in the search, but your mileage may vary.

If that made the error go away, then great! You're done. If not, then go on to the next step.

Step 3 - Customizing the selectors

The "lvl0" through "lvl5" fields are the selector fields. We can try to customize those to reduce the amount of records that are being gathered.

First, remove the lvl5 line from the second action. Save the config and trigger a re-index to see if the problem still happens.

If the problem persists, we can try removing the lvl4 line. And then, if that doesn't work, we can try removing the lvl3 line. And then the lvl2 line. (The idea here is that we are removing more and more content until the page is small enough to be swallowed by the crawler.)
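For example, after removing the lvl5 and lvl4 lines, the second action's helpers.docsearch call would look something like this (a sketch; only trim as many levels as you actually need to):

        return helpers.docsearch({
          recordProps: {
            lvl0: {
              selectors: "",
              defaultValue: lvl0,
            },
            lvl1: "header h1",
            lvl2: "article h2",
            lvl3: "article h3",
            content: "article p, article li, article td:last-child",
          },
          indexHeadings: true,
          aggregateContent: true,
        });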

If you only have lvl0 and lvl1 and you still get the error, then you are probably screwed, because lvl1 appears to be a mandatory field in the config.

@shortcuts (Member) commented Feb 7, 2023

Lovely!! Would you mind updating the doc to add a link to your comment?

(thanks a lot for taking the time!)

@Zamiell (Author) commented Feb 7, 2023

No, but feel free to copy-paste the relevant parts.
