Extractors returned too many records #1658
Hey, yeah, this is expected, not a bug. We set this hard limit to prevent crawling/indexing inconsistencies when fetching large pages. Do you have the `aggregateContent` option enabled? Another way to reduce the number of records is to ensure you have precise enough selectors.
Yes, as I explained in the linked thread, my Algolia config looks like this, which does include the `aggregateContent` option:

```js
algolia: {
appId: "ZCC397CSMF", // cspell:disable-line
apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
indexName: "isaacscript",
contextualSearch: false, // Enabled by default; only useful for versioned sites.
recordExtractor: ({ _, helpers }) =>
helpers.docsearch({
recordProps: {
lvl0: "header h1",
lvl1: "article h2",
lvl2: "article h3",
lvl3: "article h4",
lvl4: "article h5",
lvl5: "article h6",
content: "article p, article li",
},
aggregateContent: true,
}),
},
```

However, this seems to have no effect.
I'm not sure I understand. Do the selectors that you are referring to correspond to the "recordProps" values in the config? I am just using what is recommended by your support article, and they don't seem to be doing anything.
sorry, was on the phone, I missed the link
Yup indeed, but they seem correct. Looking at the page that triggers the errors, it seems to index a very large amount of content. I'd personally remove the deeper heading selectors (lvl3, lvl4, and lvl5).
Thank you @shortcuts. From what I understand of your reply, you want me to set the config to the following:

```js
algolia: {
appId: "ZCC397CSMF", // cspell:disable-line
apiKey: "212a5e2442aa0e579f2f7bba22ee529a",
indexName: "isaacscript",
contextualSearch: false, // Enabled by default; only useful for versioned sites.
recordExtractor: ({ _, helpers }) =>
helpers.docsearch({
recordProps: {
lvl0: "header h1",
lvl1: "article h2",
lvl2: "article h3",
},
aggregateContent: true,
}),
},
```

However, my worry is that this would somehow reduce the quality of the search results for pages other than the problematic one. Subsequently, is there a way to make the reduced selectors apply only to specific pages?
Actions can be scoped to a specific path, see https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/. You can use the snippet you shared for only those pages, and keep the other pages unchanged.
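For anyone reading along, a minimal sketch of that scoping might look like the following. The index name, URLs, and selectors here are placeholders, not the actual configuration; the only point is that `pathsToMatch` can include and exclude specific pages per action:

```js
actions: [
  {
    // Normal action: every page except the problematic one.
    indexName: "my-index",
    pathsToMatch: [
      "https://example.com/**",
      "!https://example.com/docs/HugePage/",
    ],
    recordExtractor: ({ helpers }) =>
      helpers.docsearch({
        recordProps: {
          lvl0: "header h1",
          lvl1: "article h2",
          lvl2: "article h3",
          content: "article p, article li",
        },
      }),
  },
  {
    // Separate action scoped only to the problematic page, with coarser
    // selectors so that fewer records are produced.
    indexName: "my-index",
    pathsToMatch: ["https://example.com/docs/HugePage/"],
    recordExtractor: ({ helpers }) =>
      helpers.docsearch({
        recordProps: {
          lvl0: "header h1",
          lvl1: "article h2",
          content: "article p",
        },
        aggregateContent: true,
      }),
  },
],
```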
Closing until you answer whether there's any other issue :D
Thanks, I'll try that out.
@shortcuts Even with the config that is specified in my previous post in this issue thread, I still get Algolia crawler errors. So it looks like the suggested changes did not fix the problem.
Hey @Zamiell, looking at your Crawler configuration, I don't see the changes above being applied. Do you need help applying the changes?
@shortcuts What do you mean by "the changes above"? Do you mean making the configuration scoped to a more specific path? I don't want to do that right now because I just want to have the simplest possible configuration, to eliminate any possible bugs. Yet even though the config matches my previous post in this thread, the errors persist.

For reference, here is my configuration: https://github.com/IsaacScript/isaacscript/blob/main/packages/docs/docusaurus.config.js#L91-L105
Docusaurus does not control the indexing, only the search, so those lines should be updated in the Crawler config directly.
@shortcuts Can you explain how to update the crawler configuration directly? I assume that you are directing me to the Crawler dashboard page, but I don't see any boxes there to paste in a configuration.
The editor button on the left of your screenshot will show a JSON config that you can edit. In my past comment (#1658 (comment)), you can find links to the API reference, guides, and so on.

Let me know if you need further help with this!
Great, thank you, that got me onto the right track. I'll write up a guide on how to solve this problem for others that encounter it.
## How to Solve "Extractors returned too many records"

Sometimes, when Algolia goes to index a very large page, it will give an error like the following:

> Extractors returned 757 records. The maximum is 750.
If it is mandatory that you have a very large page like this, then you can work around the error by changing the configuration for the Algolia crawler. This guide will walk you through how to do that.
Personally, I'm using Algolia to crawl my Docusaurus website. For reference, the default crawler configuration looks something like this:

```js
actions: [
{
indexName: "foo",
pathsToMatch: ["https://foo.github.io/**"],
recordExtractor: ({ $, helpers }) => {
// priority order: deepest active sub list header -> navbar active item -> 'Documentation'
const lvl0 =
$(
".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
)
.last()
.text() || "Documentation";
return helpers.docsearch({
recordProps: {
lvl0: {
selectors: "",
defaultValue: lvl0,
},
lvl1: "header h1",
lvl2: "article h2",
lvl3: "article h3",
lvl4: "article h4",
lvl5: "article h5, article td:first-child",
content: "article p, article li, article td:last-child",
},
indexHeadings: true,
});
},
},
],
```

## Step 1 - Create a new action

The first thing we should do is to add a new action that is based on the existing one. For example, let's imagine that we have 3 specific pages that cause the "too many records" error: Foo, Bar, and Baz. First, we need to copy-paste the existing action into a second action, and then modify the `pathsToMatch` of both:

```js
actions: [
{
indexName: "foo",
pathsToMatch: [
"https://foo.github.io/**",
"!https://foo.github.io/docs/Foo/",
"!https://foo.github.io/docs/Bar/",
"!https://foo.github.io/docs/Baz/",
],
recordExtractor: ({ $, helpers }) => {
// priority order: deepest active sub list header -> navbar active item -> 'Documentation'
const lvl0 =
$(
".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
)
.last()
.text() || "Documentation";
return helpers.docsearch({
recordProps: {
lvl0: {
selectors: "",
defaultValue: lvl0,
},
lvl1: "header h1",
lvl2: "article h2",
lvl3: "article h3",
lvl4: "article h4",
lvl5: "article h5, article td:first-child",
content: "article p, article li, article td:last-child",
},
indexHeadings: true,
});
},
},
{
indexName: "foo",
pathsToMatch: [
"https://foo.github.io/docs/Foo/",
"https://foo.github.io/docs/Bar/",
"https://foo.github.io/docs/Baz/",
],
recordExtractor: ({ $, helpers }) => {
// priority order: deepest active sub list header -> navbar active item -> 'Documentation'
const lvl0 =
$(
".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
)
.last()
.text() || "Documentation";
return helpers.docsearch({
recordProps: {
lvl0: {
selectors: "",
defaultValue: lvl0,
},
lvl1: "header h1",
lvl2: "article h2",
lvl3: "article h3",
lvl4: "article h4",
lvl5: "article h5, article td:first-child",
content: "article p, article li, article td:last-child",
},
indexHeadings: true,
});
},
},
],
```

First, note that we are creating a second action because we don't want to influence or somehow harm the search results of the normal, non-problematic pages on our website. We only want to tinker with the problematic pages specifically.

Second, note that we have to negate the problematic URLs in the first action, because the crawler will create a record for each matched action. In other words, there is no ordering of actions: all URLs get processed through all actions.

Now we have an action that will only apply to the problematic pages, but right now it is the exact same config as the first one (since it was copy-pasted). So, the next step is to modify the second action, as sketched below.

## Step 2 - Modify the new action
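What exactly to change in the second action depends on your pages, but based on the suggestions earlier in this thread, a minimal sketch might drop the deeper heading and table-cell selectors and enable `aggregateContent`. The selector values below are illustrative, not a prescription:

```js
{
  indexName: "foo",
  pathsToMatch: [
    "https://foo.github.io/docs/Foo/",
    "https://foo.github.io/docs/Bar/",
    "https://foo.github.io/docs/Baz/",
  ],
  recordExtractor: ({ $, helpers }) => {
    // Same lvl0 fallback as the first action.
    const lvl0 =
      $(
        ".menu__link.menu__link--sublist.menu__link--active, .navbar__item.navbar__link--active"
      )
        .last()
        .text() || "Documentation";

    return helpers.docsearch({
      recordProps: {
        lvl0: {
          selectors: "",
          defaultValue: lvl0,
        },
        lvl1: "header h1",
        lvl2: "article h2",
        lvl3: "article h3",
        // The deeper selectors (h4, h5, table cells) are removed so that
        // fewer records are generated for these very large pages.
        content: "article p, article li",
      },
      // Merge adjacent content into fewer records, as suggested earlier in the thread.
      aggregateContent: true,
      indexHeadings: true,
    });
  },
},
```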
Lovely!! Would you mind updating the doc to add a link to your comment? (Thanks a lot for taking the time!)
No, but feel free to copy-paste the relevant parts.
EDIT - See my guide below on how to solve this issue.
Michael King told me to submit a bug report on GitHub.
You can reference our conversation here: https://discourse.algolia.com/t/extractors-returned-757-records-the-maximum-is-750/16493