exclude Pattern in config.ts Not Working as Expected #179

Open
ChronicleCoder opened this issue Oct 22, 2024 · 0 comments

ChronicleCoder commented Oct 22, 2024

Hi Everyone!

I think I found a bug with the exclude option in the config.ts file (or maybe I'm just using it wrong haha). Here's my current setup:

My Current config.ts:
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://cloud.google.com/chronicle/docs",
  match: "https://cloud.google.com/chronicle/docs/**",
  exclude: [
    "https://cloud.google.com/chronicle/docs/**hl=**",
    "https://cloud.google.com/chronicle/docs/soar/**",
    "https://cloud.google.com/chronicle/docs/ingestion/parser-list/*-changelog",
  ],
  selector: `.devsite-article-body`,
  maxPagesToCrawl: 50000,
  outputFileName: "ChronicleDocsAll.json",
  maxTokens: 500000,
};

Problem

When I run this configuration, URLs containing query parameters like hl=, which should be excluded, are still being crawled. Here's an example of some of the URLs that shouldn't appear:

INFO  PlaywrightCrawler: Crawling: Page 10 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de...
INFO  PlaywrightCrawler: Crawling: Page 11 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419...
INFO  PlaywrightCrawler: Crawling: Page 12 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=fr...
INFO  PlaywrightCrawler: Crawling: Page 13 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=id...

I've tried modifying the exclude array with different patterns, like:

"**hl\=**"

However, some URLs still make it through, like the following (far fewer than before, though, so that's a W in my book haha):

INFO  PlaywrightCrawler: Crawling: Page 926 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=de...
INFO  PlaywrightCrawler: Crawling: Page 927 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id...
INFO  PlaywrightCrawler: Crawling: Page 928 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-gcp-loadbalancing-logs?hl=id...
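As a workaround while the pattern matching gets sorted out, localized URLs can be detected reliably by parsing the URL instead of globbing it. This is just a sketch (isLocalized is a name I made up, not part of gpt-crawler), but the same check could be applied wherever the crawler decides whether to keep a link:

```typescript
// Hypothetical post-filter: drop any URL carrying an hl= query parameter,
// using the WHATWG URL API (global in Node) instead of glob patterns.
function isLocalized(rawUrl: string): boolean {
  return new URL(rawUrl).searchParams.has("hl");
}

const crawled = [
  "https://cloud.google.com/chronicle/docs/investigation/udm-search",
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de",
  "https://cloud.google.com/chronicle/docs/ingestion/parser-list",
];

const keep = crawled.filter((u) => !isLocalized(u));
console.log(keep.length); // -> 2
```

Parsing the query string sidesteps the whole question of how ? and ** behave inside a glob, which is why I'd lean on it for anything involving query parameters.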

Environment

  • Operating Systems Tested: macOS, Windows, Linux (Ubuntu and Debian, in WSL2 and on a VM)
  • Crawler version: v1.5.0

Steps to Reproduce:

  1. git clone https://github.com/builderio/gpt-crawler

  2. npm i

  3. Update config from above.

  4. Run the crawl and observe the URLs being crawled.

Expected Behavior

URLs matching the exclude patterns, especially those with hl=, should not be crawled.

Actual Behavior

URLs with hl= are still being crawled despite being listed in the exclude patterns (with varying degrees of success depending on which pattern I use).

Additional Context

I've tried various exclude patterns, but nothing seems to fully exclude these URLs. Has anyone encountered a similar issue or have suggestions on how to resolve this?

Thanks in advance for any help!
