exclude Pattern in config.ts Not Working as Expected #179

Open
ChronicleCoder opened this issue Oct 22, 2024 · 0 comments

ChronicleCoder commented Oct 22, 2024

Hi Everyone!

I think I found a bug with the exclude option in the config.ts file (or maybe I'm just using it wrong haha). Here's my current setup:

My Current config.ts:
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://cloud.google.com/chronicle/docs",
  match: "https://cloud.google.com/chronicle/docs/**",
  exclude: [
    "https://cloud.google.com/chronicle/docs/**hl=**",
    "https://cloud.google.com/chronicle/docs/soar/**",
    "https://cloud.google.com/chronicle/docs/ingestion/parser-list/*-changelog",
  ],
  selector: `.devsite-article-body`,
  maxPagesToCrawl: 50000,
  outputFileName: "ChronicleDocsAll.json",
  maxTokens: 500000,
};

Problem

When I run this configuration, URLs containing query parameters like hl=, which should be excluded, are still being crawled. Here's an example of some of the URLs that shouldn't appear:

INFO  PlaywrightCrawler: Crawling: Page 10 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de...
INFO  PlaywrightCrawler: Crawling: Page 11 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=es-419...
INFO  PlaywrightCrawler: Crawling: Page 12 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=fr...
INFO  PlaywrightCrawler: Crawling: Page 13 / 50000 - URL: https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=id...

I've tried modifying the exclude array with different patterns, like:

"**hl\=**"

However, some URLs still make it through, like the following (far fewer than before, though, so that's a W in my book haha):

INFO  PlaywrightCrawler: Crawling: Page 926 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=de...
INFO  PlaywrightCrawler: Crawling: Page 927 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-kubernetes-node-logs?hl=id...
INFO  PlaywrightCrawler: Crawling: Page 928 / 50000 - URL: https://cloud.google.com/chronicle/docs/ingestion/default-parsers/collect-gcp-loadbalancing-logs?hl=id...
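As a workaround while the pattern matching gets sorted out, localized URLs can be detected reliably by parsing the URL instead of globbing it. This is just a sketch (isLocalized is a name I made up, not part of gpt-crawler), but the same check could be applied wherever the crawler decides whether to keep a link:

```typescript
// Hypothetical post-filter: drop any URL carrying an hl= query parameter,
// using the WHATWG URL API (global in Node) instead of glob patterns.
function isLocalized(rawUrl: string): boolean {
  return new URL(rawUrl).searchParams.has("hl");
}

const crawled = [
  "https://cloud.google.com/chronicle/docs/investigation/udm-search",
  "https://cloud.google.com/chronicle/docs/investigation/udm-search?hl=de",
  "https://cloud.google.com/chronicle/docs/ingestion/parser-list",
];

const keep = crawled.filter((u) => !isLocalized(u));
console.log(keep.length); // -> 2
```

Parsing the query string sidesteps the whole question of how ? and ** behave inside a glob, which is why I'd lean on it for anything involving query parameters.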

Environment

  • Operating Systems Tested: macOS, Windows, Linux (Ubuntu and Debian, in WSL2 and on a VM)
  • Crawler version: v1.5.0

Steps to Reproduce:

  1. git clone https://github.com/builderio/gpt-crawler

  2. npm i

  3. Update config from above.

  4. Run the crawl and observe the URLs being crawled.

Expected Behavior

URLs matching the exclude patterns, especially those with hl=, should not be crawled.

Actual Behavior

URLs with hl= are still being crawled despite being listed in the exclude patterns (with varying degrees of success depending on which pattern I use).

Additional Context

I've tried various exclude patterns, but nothing seems to fully exclude these URLs. Has anyone encountered a similar issue or have suggestions on how to resolve this?

Thanks in advance for any help!
