Feat: Multiple Match Pattern Config; Pattern Avoid; Grab Content with innerHTML Compatible #97

Open
wants to merge 6 commits into
base: main

@FTAndy FTAndy commented Nov 30, 2023

Feat: Multiple Match Pattern Config; Pattern Avoid; Grab Content with innerHTML Compatible

Branch Information

Branch: patch/match-pattern

Description of Changes

This pull request introduces several enhancements to the GPT-crawler project:

  1. Customizable Pattern Matching: Allows users to define multiple patterns for matching, providing more flexibility in what content is crawled.

  2. Expanded Match Options: Introduces additional match options to improve compatibility with various content types.

  3. innerHTML Method for Content Compatibility: Implements an innerHTML method as a fallback when innerText does not contain any content. This ensures more robust content scraping, especially in cases where innerText might be empty.
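To make the first two items concrete, here is a minimal sketch of what a multi-pattern `match` value might look like. The type names and the `toRules` helper are hypothetical and only illustrate the idea; the field names (`pattern`, `selector`, `skip`) follow the sample config discussed in this PR, but the actual schema in src/config.ts may differ.

```typescript
// Hypothetical shapes: the real schema lives in src/config.ts and may differ.
type OriginMatch = string | string[];

interface PatternRule {
  pattern: string; // glob that a URL is tested against
  selector?: string; // CSS selector to read content from when the rule matches
  skip?: boolean; // when true, matching URLs are not crawled
}

type Match = OriginMatch | PatternRule[];

// Normalize every accepted form of `match` into one list of rules,
// so downstream code only has to handle a single shape.
function toRules(match: Match): PatternRule[] {
  const items: Array<string | PatternRule> = Array.isArray(match)
    ? match
    : [match];
  return items.map((item) =>
    typeof item === "string" ? { pattern: item } : item,
  );
}

// Example value modeled on the sample config from this PR.
const base = "https://github.com/BuilderIO/gpt-crawler";
const exampleMatch: Match = [
  { pattern: `${base}/tree/main/**`, skip: true },
  { pattern: `${base}/blob/main/**/*.md`, selector: ".markdown-body" },
  { pattern: `${base}/blob/main/**`, selector: "#read-only-cursor-text-area" },
];
```

Normalizing up front keeps the plain string and string[] forms working while letting each rule carry its own selector or skip flag.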

Testing Done

  • Comprehensive compatibility checks have been conducted.
  • Ensured that npm run start operates smoothly without causing any breaks in the existing functionality.

Screenshots or Code Snippets

Changes Preview

Code Changes

  • The config.ts file has been updated with new matching patterns and configurations.
  • Added minimatch package to handle pattern matching, reflected in package.json and package-lock.json.
  • Significant updates in src/config.ts and src/core.ts to implement the new features.

Dependencies

  • Addition of the minimatch package for enhanced pattern matching capabilities.

Checklist

  • Updated README.md to reflect new changes and configurations.

Conclusion

This PR aims to make the GPT-crawler more versatile and robust, catering to a wider range of use cases. The introduction of customizable pattern matching, expanded match options, and innerHTML compatibility marks a significant improvement in the project's functionality. Your review and feedback on these changes would be greatly appreciated.

config.ts Outdated
Comment on lines 11 to 38
// const treeEndPointUrl = 'https://github.com/BuilderIO/gpt-crawler/tree/main'
// const blobEndPointUrl = 'https://github.com/BuilderIO/gpt-crawler/blob/main'

// export const defaultConfig: Config = {
//   url: "https://github.com/BuilderIO/gpt-crawler/tree/main",
//   match: [
//     {
//       // skip the pattern you do not want to crawl
//       // pattern: "https://github.com/BuilderIO/gpt-crawler/tree/main/**",
//       pattern: `${treeEndPointUrl}/**`,
//       skip: true
//     },
//     {
//       // special case for .md
//       // for .md, we need to crawl the raw content in the .markdown-body selector
//       // pattern: 'https://github.com/BuilderIO/gpt-crawler/blob/main/**/*.md',
//       pattern: `${blobEndPointUrl}/**/*.md`,
//       selector: '.markdown-body'
//     },
//     {
//       // other files like .js, .ts, .json, etc
//       pattern: `${blobEndPointUrl}/**`,
//       selector: '#read-only-cursor-text-area'
//     },
//   ],
//   maxPagesToCrawl: 50,
//   outputFileName: "output.json",
// };
Contributor

@marcelovicentegc marcelovicentegc Dec 4, 2023

Hey @FTAndy, thanks for this PR! Is this commented code useful?

Author

This is test code for the new match option; I included it because the project has no test process. Should I delete it, or move the example into the README?

Author

Removed

@@ -24,7 +25,7 @@ export function getPageHtml(page: Page, selector = "body") {
       } else {
         // Handle as a CSS selector
         const el = document.querySelector(selector) as HTMLElement | null;
-        return el?.innerText || "";
+        return el?.innerText || el?.innerHTML || "";
Contributor

@marcelovicentegc marcelovicentegc Dec 4, 2023

It occurs to me that this could introduce undesired content into the crawler output, since the fallback would also cover cases such as grabbing script contents and whitespace. For example:

<div>
  <script>alert("Hello!");</script>
</div>

This would grab <script>alert("Hello!");</script> as output (depending, of course, on the selector config).

I'm likely not seeing the whole picture though. Can you provide some practical examples you had in mind for these changes, @FTAndy?
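The concern can be illustrated with a small sketch that models the changed line outside the browser. `TextSource` and `readContent` are hypothetical names, and the object values are assumptions about what a browser would report for such elements (in particular, that innerText is empty for a div holding only a script).

```typescript
// Structural stand-in for the two DOM properties the changed line reads.
interface TextSource {
  innerText: string;
  innerHTML: string;
}

// Same fallback chain as the changed line in getPageHtml:
// an empty innerText is falsy, so innerHTML is returned instead.
function readContent(el: TextSource | null): string {
  return el?.innerText || el?.innerHTML || "";
}

// Assumed values for a <div> that contains only a <script> tag.
const scriptOnlyDiv: TextSource = {
  innerText: "",
  innerHTML: '<script>alert("Hello!");</script>',
};

// Assumed values for an ordinary paragraph with visible text.
const paragraph: TextSource = {
  innerText: "Hello",
  innerHTML: "<p>Hello</p>",
};

const fromScriptDiv = readContent(scriptOnlyDiv); // raw markup leaks through
const fromParagraph = readContent(paragraph); // visible text only
```

Elements with visible text are unaffected; only elements whose innerText is empty fall through to raw markup.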

Author

@marcelovicentegc Yes. When I grab code-block content from the textarea tag in a GitHub project, I can't get the code through the innerText API, because the element is a form control. Falling back to innerHTML is a compatible way to grab the content in that case.

I think the value API is more precise in this context, because it does not transform symbols such as > into &gt;. What do you think?

Author

I do think the value API is a good idea, although it is limited to input-type elements.
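The value-first idea from this thread could look roughly like the sketch below. It is not the code in this PR: `ContentSource` and `readElementText` are hypothetical names, and the element is modeled as a plain object so the logic can be checked without a browser.

```typescript
// Structural stand-in for the DOM properties this sketch reads.
interface ContentSource {
  value?: unknown; // a string on form controls such as <textarea> and <input>
  innerText?: string;
}

// Prefer `value` when the element exposes a non-empty string value
// (form controls), otherwise fall back to innerText. Markup is never
// returned, so script tags and HTML entities do not leak into the output.
function readElementText(el: ContentSource | null): string {
  if (!el) return "";
  if (typeof el.value === "string" && el.value !== "") return el.value;
  return el.innerText ?? "";
}

// Assumed values for a <textarea> holding source code.
const codeArea: ContentSource = { value: "if (a > b) return;", innerText: "" };

// Assumed values for an ordinary element without a `value` property.
const plainDiv: ContentSource = { innerText: "Hello" };
```

The trade-off noted in the thread still applies: this only helps for elements that actually carry a string `value`.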

- Change the `match` property in the `Config` type to accept an array of string or string[]
- Remove unused code in `config.ts`
- Update the `PatternMatch` schema in `config.ts` to include a new property `skip`
- Modify the default config values in `config.ts`
- Update the `OriginMatch` schema in `config.ts` to accept an array of strings
- Fix a typo in the README.md file

Signed-off-by: FTAndy <[email protected]>
- Modify the `README.md` file:
  - Change the `match` property type to accept an array of strings.
- Modify the `src/config.ts` file:
  - Change the `OriginMatch` property type to accept an array of strings.
  - Change the `PatternMatch` property type to accept an array of objects.
- Modify the `src/core.ts` file:
  - Add import statements for `minimatch`, `Config`, `PatternMatch`, and `OriginMatch`.
  - Modify the `crawl` function:
    - Change the `globs` variable declaration to include a semicolon at the end.
    - Change the condition for checking `matchedPattern` to use the optional chaining operator.
    - Add a missing semicolon in the `page.waitForSelector` call.
    - Move the code inside the `else if` condition to a separate block for better readability.

Signed-off-by: FTAndy <[email protected]>
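
As a rough illustration of the matching step described in these commits, the sketch below resolves a URL against a list of rules. Everything in it is an assumption: `globToRegExp` is a simplified stand-in for minimatch that handles only `*` and `**`, `planFor` is a hypothetical helper, the first matching rule wins, and unmatched URLs are not crawled. The actual logic in src/core.ts may differ on each point.

```typescript
interface PatternRule {
  pattern: string;
  selector?: string;
  skip?: boolean;
}

// Simplified stand-in for minimatch: "**/" spans zero or more path
// segments, "**" matches anything, "*" stays within one segment.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*";
      i += 1;
    } else {
      // Escape regex metacharacters in literal text.
      source += glob.charAt(i).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${source}$`);
}

// Decide whether to crawl a URL and which selector to read from.
// "body" mirrors the default selector of getPageHtml.
function planFor(
  url: string,
  rules: PatternRule[],
): { crawl: boolean; selector: string } {
  const matchedPattern = rules.find((rule) =>
    globToRegExp(rule.pattern).test(url),
  );
  return {
    crawl: matchedPattern !== undefined && !matchedPattern.skip,
    selector: matchedPattern?.selector ?? "body",
  };
}

const repo = "https://github.com/BuilderIO/gpt-crawler";
const sampleRules: PatternRule[] = [
  { pattern: `${repo}/tree/main/**`, skip: true },
  { pattern: `${repo}/blob/main/**/*.md`, selector: ".markdown-body" },
  { pattern: `${repo}/blob/main/**`, selector: "#read-only-cursor-text-area" },
];
```

Because the first match wins in this sketch, more specific rules (such as the `.md` one) need to come before broader ones.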

2 participants