Anchors are being stripped out (using `sitemaps`, `linkExtractor` and `externalData`) #1831

bojanrajh · 2023-03-21T11:24:49Z

Description

We are using the Algolia Crawler UI to parse our mixed static HTML & SPA website (which uses a hash router). All URLs are provided through the `sitemaps` option of the Crawler config.

new Crawler({
  startUrls: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  // ...
})

Steps to reproduce

Use a sitemap with the following content:

<!-- ... -->
<url>
  <loc>https://example.com/page.html</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/foo</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<url>
  <loc>https://example.com/subpage.html#/bar</loc>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>
<!-- ... -->

... or using the static linkExtractor:

new Crawler({
  // ...
  linkExtractor: () => {
    return [
      "https://example.com/page.html",
      "https://example.com/subpage.html#/foo",
      "https://example.com/subpage.html#/bar",
    ];
  },
  // ...
})

Then run the URL Tester.

Result:

LINKS
Found 2 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html

Expected behavior

Expected result:

LINKS
Found 3 links matching your configuration 
 - https://example.com/page.html
 - https://example.com/subpage.html#/foo
 - https://example.com/subpage.html#/bar

Note that those are not section anchors. They are actual pages, and the URL Tester parses them correctly with the `renderJavaScript: true` option when given the full URL including the anchor.
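The gap between the actual and expected results is consistent with a normalization step that drops the URL fragment before deduplicating. The sketch below is only an illustration of that effect using the standard `URL` API; `stripFragment` is a hypothetical helper and this is not the Crawler's actual code.

```javascript
// Illustration only: NOT the Crawler's implementation. It shows how any
// step that drops the fragment collapses hash-routed pages into one URL.
const urls = [
  "https://example.com/page.html",
  "https://example.com/subpage.html#/foo",
  "https://example.com/subpage.html#/bar",
];

// Hypothetical helper: remove the fragment, as a fragment-agnostic
// normalizer might do.
function stripFragment(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = "";
  return url.href;
}

// Deduplicate after normalization: the two hash routes become one entry.
const deduped = [...new Set(urls.map(stripFragment))];
console.log(deduped);
```

With the fragment removed, `#/foo` and `#/bar` map to the same string, which matches the "Found 2 links" output above.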

Environment

Algolia Crawler UI

Similar issues:

shortcuts · 2023-03-21T11:28:34Z

Hey, thanks for opening the issue. #1823 seems related.

I'll investigate whether there's a way for us to differentiate hash-routed pages from anchored sections.
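One possible way to tell the two apart, sketched here purely as an assumption and not as anything the Crawler does, is to treat a fragment that looks like a path (`#/foo` or `#!/foo`) as a hash route and everything else as a section anchor. `isHashRoute` and `normalize` are hypothetical names.

```javascript
// Hypothetical heuristic, not an Algolia Crawler feature: a fragment that
// starts with "/" or "!/" is assumed to be a hash route.
function isHashRoute(rawUrl) {
  const { hash } = new URL(rawUrl);
  return /^#!?\//.test(hash);
}

// Keep the fragment only when it looks like a route; otherwise drop it.
function normalize(rawUrl) {
  const url = new URL(rawUrl);
  if (!isHashRoute(rawUrl)) {
    url.hash = "";
  }
  return url.href;
}

console.log(normalize("https://example.com/subpage.html#/foo"));
console.log(normalize("https://example.com/page.html#section-2"));
```

The heuristic is deliberately crude: hash routers that omit the leading slash would be misclassified, so an explicit opt-in setting would probably be more reliable than guessing from the fragment's shape.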

bojanrajh · 2023-03-21T11:37:32Z

Thank you for a quick response!
Just to clarify: we don't mind implementing a custom linkExtractor or a recordExtractor that sets a custom objectID. We just need those URLs to be accepted (crawling works as intended when manually running the crawl from the UI).
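To make the "custom objectID" idea concrete, here is a minimal sketch of how a record could keep the fragment in its objectID so that each hash route yields a distinct record. `toRecord` is a hypothetical helper and the `path` and `route` field names are assumptions; none of this is part of the Crawler API.

```javascript
// Hypothetical helper: build a record whose objectID is the full URL,
// fragment included, so "#/foo" and "#/bar" do not overwrite each other.
function toRecord(rawUrl, fields) {
  const url = new URL(rawUrl);
  return {
    ...fields,
    objectID: url.href,                 // full URL including the fragment
    path: url.pathname,                 // e.g. "/subpage.html"
    route: url.hash.replace(/^#/, ""),  // e.g. "/foo"
  };
}

const records = [
  toRecord("https://example.com/subpage.html#/foo", { title: "Foo" }),
  toRecord("https://example.com/subpage.html#/bar", { title: "Bar" }),
];
console.log(records.map((record) => record.objectID));
```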

bojanrajh · 2023-03-29T14:49:55Z

Hey @shortcuts, any news on this one?

Somewhat related: I tried to provide anchored URLs to the Crawler with externalData: ['myCSV'], as described in your docs, and those URLs were again stripped down to one.

Example CSV:

url;title;content
"https://example.com/subpage.html#/foo";"Foo";"Foo content"
"https://example.com/subpage.html#/bar";"Bar";"Bar content"

Single URL under Crawler admin > External Data: https://example.com/subpage.html

I expected the same issue to appear with your API client (JS), but I've just successfully created 2 objects containing URLs with anchors in our demo app (free plan, app ID BZSKX72NEG). However, I was not able to create an admin API key for our app (DOCSEARCH plan, app ID J1Y01X9HGM) because the "All API Keys" section/tab is missing. By using the Admin API key I received error 400 - Not enough rights to update an object near line:1.

So, technically, my wild guess is that your system supports anchored URLs and they are just not supported by the Crawler?

bojanrajh · 2023-10-16T08:44:58Z

Hey @shortcuts, any news about this one?

bojanrajh changed the title ~~Anchors are being stripped out (using sitemaps and/or linkExtractor)~~ Anchors are being stripped out (using sitemaps, linkExtractor and externalData) Mar 30, 2023

randombeeper added the crawler issue related to the indexing label Jul 12, 2024
