Hi Everyone!
I think I found a bug with the exclude option in the config.ts file (or maybe I'm just using it wrong haha). Here's my current setup:
My Current config.ts:
Problem
When I run this configuration, URLs containing query parameters like hl=, which should be excluded, are still being crawled. Here's an example of some of the URLs that shouldn't appear:
INFO PlaywrightCrawler: Crawling: Page 10 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de...
INFO PlaywrightCrawler: Crawling: Page 11 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419...
INFO PlaywrightCrawler: Crawling: Page 12 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=fr...
INFO PlaywrightCrawler: Crawling: Page 13 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=id...
I've tried modifying the exclude array with different patterns, like:
"**hl\=**"
However, some URLs still make it through (but hey, so much better! So that's a W in my book haha), like:
INFO PlaywrightCrawler: Crawling: Page 926 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=de...
INFO PlaywrightCrawler: Crawling: Page 927 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id...
INFO PlaywrightCrawler: Crawling: Page 928 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-gcp-loadbalancing-logs?hl=id...
Environment
Operating Systems Tested: macOS, Windows, Linux (Ubuntu and Debian in WSL2 and on a VM)
Steps to Reproduce:
npm i
Update config from above.
Run the crawl and observe the URLs being crawled.
Expected Behavior
URLs matching the exclude patterns, especially those with hl=, should not be crawled.
Actual Behavior
URLs with hl= are still being crawled despite being listed in the exclude patterns (with varying degrees of success depending on the config).
Additional Context
I've tried various exclude patterns, but nothing seems to fully exclude these URLs. Has anyone encountered a similar issue or have suggestions on how to resolve this?
Thanks in advance for any help!
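For anyone trying to reproduce this, one way to confirm which URLs ought to be dropped, without depending on how the exclude globs are matched, is to check the query string directly. The following is only a minimal sketch: hasQueryParam is a hypothetical helper and not part of the crawler or its config, the sample URLs are the visible part of the logged URLs above (the logs truncate them with "..."), and the last URL without hl= is an assumed example added for contrast.

```typescript
// Hypothetical helper (not part of any crawler API): returns true when the
// URL carries the given query parameter, for example ?hl=de.
function hasQueryParam(url: string, param: string): boolean {
  try {
    return new URL(url).searchParams.has(param);
  } catch {
    // Not a parseable absolute URL, so there is no query string to inspect.
    return false;
  }
}

// Visible part of URLs from the crawl log, plus one assumed URL without hl=.
const candidates: string[] = [
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de",
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419",
  "https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id",
  "https://cloud.google.com/chronicle/docs/investigation/udm-search",
];

// Keep only the URLs that do not carry an hl= parameter.
const kept: string[] = candidates.filter((u) => !hasQueryParam(u, "hl"));

console.log(kept);
```

Comparing a list like kept against what the crawler actually visits may help tell whether the glob pattern itself is at fault or whether some links are being enqueued through a path that bypasses the exclude list.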