Feat: Multiple Match Pattern Config; Pattern Avoid; Grap Content with innerHTML Compatible #97
base: main
config.ts
Outdated
// const treeEndPointUrl = 'https://github.com/BuilderIO/gpt-crawler/tree/main'
// const blobEndPointUrl = 'https://github.com/BuilderIO/gpt-crawler/blob/main'

// export const defaultConfig: Config = {
//   url: "https://github.com/BuilderIO/gpt-crawler/tree/main",
//   match: [
//     {
//       // skip the pattern you do not want to crawl
//       // pattern: "https://github.com/BuilderIO/gpt-crawler/tree/main/**",
//       pattern: `${treeEndPointUrl}/**`,
//       skip: true
//     },
//     {
//       // special case for .md
//       // for .md, we need to crawl the raw content in the .markdown-body selector
//       // pattern: 'https://github.com/BuilderIO/gpt-crawler/blob/main/**/*.md',
//       pattern: `${blobEndPointUrl}/**/*.md`,
//       selector: '.markdown-body'
//     },
//     {
//       // other files like .js, .ts, .json, etc
//       pattern: `${blobEndPointUrl}/**`,
//       selector: '#read-only-cursor-text-area'
//     },
//   ],
//   maxPagesToCrawl: 50,
//   outputFileName: "output.json",
// };
Hey @FTAndy, thanks for this PR! Is this commented code useful?
This is the test code for the new `match` parameter option, because the project has no test process. Should I delete it or just put this example in the README?
Removed
@@ -24,7 +25,7 @@ export function getPageHtml(page: Page, selector = "body") {
       } else {
         // Handle as a CSS selector
         const el = document.querySelector(selector) as HTMLElement | null;
-        return el?.innerText || "";
+        return el?.innerText || el?.innerHTML || "";
It occurs to me that this could introduce undesired content into the crawler output, since it could end up grabbing script contents and whitespace. For example:
<div>
  <script>alert("Hello!");</script>
</div>
would yield <script>alert("Hello!");</script> as output (depending, of course, on the selector config).
I'm likely not seeing the whole picture though. Can you provide some practical examples you had in mind for these changes, @FTAndy?
@marcelovicentegc Yes. When I grab code block content from the `textarea` tag in a GitHub project, I can't get the code content through the `innerText` API because it is a form control element. This is a compatible way to grab content when the element is a form control.
I think the `value` API is more precise in this context because it will not transform symbols like `>` into `&gt;`. What do you think?
I do think the `value` API is a good idea; it is limited to input-type elements.
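The `value` idea from this thread could be sketched roughly as follows. This is only an illustration, not the code merged in the PR: the `TextLike` interface and the `extractText` helper are hypothetical names, and the interface is a minimal stand-in for a DOM element so the snippet runs outside a browser.

```typescript
// Hypothetical minimal shape of an element; a real HTMLElement,
// HTMLTextAreaElement, or HTMLInputElement is structurally compatible.
interface TextLike {
  tagName: string;
  innerText?: string;
  value?: string;
}

// Assumption: only these form controls are read through `value`.
const FORM_CONTROL_TAGS = new Set(["TEXTAREA", "INPUT"]);

// Prefer `value` for form controls (it is not HTML-escaped),
// otherwise fall back to `innerText`, and finally to an empty string.
function extractText(el: TextLike | null): string {
  if (!el) return "";
  const isFormControl = FORM_CONTROL_TAGS.has(el.tagName.toUpperCase());
  if (isFormControl && typeof el.value === "string") {
    return el.value;
  }
  return el.innerText ?? "";
}
```

Inside the page context this might be called as `extractText(document.querySelector(selector) as HTMLTextAreaElement | null)`. Because it never reads `innerHTML`, a `<script>` child of a non-form element would not leak into the output.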
- Change the `match` property in the `Config` type to accept an array of string or string[]
- Remove unused code in `config.ts`
- Update the `PatternMatch` schema in `config.ts` to include a new property `skip`
- Modify the default config values in `config.ts`
- Update the `OriginMatch` schema in `config.ts` to accept an array of strings
- Fix a typo in the README.md file

Signed-off-by: FTAndy <[email protected]>
- Modify the `README.md` file:
  - Change the `match` property type to accept an array of strings.
- Modify the `src/config.ts` file:
  - Change the `OriginMatch` property type to accept an array of strings.
  - Change the `PatternMatch` property type to accept an array of objects.
- Modify the `src/core.ts` file:
  - Add import statements for `minimatch`, `Config`, `PatternMatch`, and `OriginMatch`.
  - Modify the `crawl` function:
    - Change the `globs` variable declaration to include a semicolon at the end.
    - Change the condition for checking `matchedPattern` to use the optional chaining operator.
    - Add a missing semicolon in the `page.waitForSelector` call.
    - Move the code inside the `else if` condition to a separate block for better readability.

Signed-off-by: FTAndy <[email protected]>
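The commit notes above describe a `match` option that accepts plain strings or pattern objects carrying `selector` and `skip`. A minimal sketch of how such a config might be resolved for a URL is shown below. It makes several assumptions: the first matching pattern wins, the `PatternMatch` shape is inferred from the commented-out example config, and `globToRegExp` is a simplified stand-in for `minimatch` (it handles only `*` and `**`), so the snippet has no dependencies. `findMatch` and `normalizeMatch` are hypothetical helper names, not functions from the PR.

```typescript
// Assumed shape, inferred from the example config in this PR.
type PatternMatch = { pattern: string; selector?: string; skip?: boolean };
type MatchOption = string | (string | PatternMatch)[];

// Simplified glob support (NOT minimatch): "**/" spans zero or more path
// segments, "**" spans anything, "*" stays within one path segment.
function globToRegExp(glob: string): RegExp {
  let re = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      re += "(?:.*/)?";
      i += 3;
    } else if (glob.startsWith("**", i)) {
      re += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      re += "[^/]*";
      i += 1;
    } else {
      // Escape regex metacharacters in literal text.
      re += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${re}$`);
}

// Turn a string or a mixed array into a uniform list of pattern objects.
function normalizeMatch(match: MatchOption): PatternMatch[] {
  const list = Array.isArray(match) ? match : [match];
  return list.map((m) => (typeof m === "string" ? { pattern: m } : m));
}

// Return the first pattern that matches the URL, or undefined if none does.
function findMatch(url: string, match: MatchOption): PatternMatch | undefined {
  return normalizeMatch(match).find((m) => globToRegExp(m.pattern).test(url));
}

const base = "https://github.com/BuilderIO/gpt-crawler";
const match: MatchOption = [
  { pattern: `${base}/tree/main/**`, skip: true },
  { pattern: `${base}/blob/main/**/*.md`, selector: ".markdown-body" },
  { pattern: `${base}/blob/main/**`, selector: "#read-only-cursor-text-area" },
];

const readme = findMatch(`${base}/blob/main/README.md`, match);
console.log(readme?.selector);
```

Under the first-match-wins assumption, the order of entries matters: the more specific `.md` pattern has to come before the catch-all `blob/main/**` pattern, and a caller would skip enqueueing any URL whose matched entry has `skip: true`.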
Feat: Multiple Match Pattern Config; Pattern Avoid; Grap Content with innerHTML Compatible
Branch Information
Branch: `patch/match-pattern`
Description of Changes
This pull request introduces several enhancements to the GPT-crawler project:
- Customizable Pattern Matching: Allows users to define multiple patterns for matching, providing more flexibility in what content is crawled.
- Expanded Match Options: Introduces additional `match` options to improve compatibility with various content types.
- innerHTML Method for Content Compatibility: Implements an `innerHTML` method as a fallback when `innerText` does not contain any content. This ensures more robust content scraping, especially in cases where `innerText` might be empty.
Testing Done
- `npm run start` operates smoothly without causing any breaks in the existing functionality.
Screenshots or Code Snippets
Code Changes
- The `config.ts` file has been updated with new matching patterns and configurations.
- Added the `minimatch` package to handle pattern matching, reflected in `package.json` and `package-lock.json`.
- Updated `src/config.ts` and `src/core.ts` to implement the new features.
Dependencies
- Added the `minimatch` package for enhanced pattern matching capabilities.
Checklist
- Updated `README.md` to reflect new changes and configurations.
Conclusion
This PR aims to make the GPT-crawler more versatile and robust, catering to a wider range of use cases. The introduction of customizable pattern matching, expanded match options, and `innerHTML` compatibility marks a significant improvement in the project's functionality. Your review and feedback on these changes would be greatly appreciated.