Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations #102
…will display the infinity symbol. Default is 50
… between page requests to help with rate limiting
… missing file that isn't created until the Docker is.
…'t overwritten in storage
README.md
Outdated
```ts
type Config = {
  /** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
  /** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
```
What do you think of using tags instead, for example:
- /** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
+ /**
+  * URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap
+  * @required
+  */
This way we could eventually bring jsdoc/typedoc into the mix to generate meaningful documentation from them. Just a suggestion, as even jsdoc/typedoc can infer whether a property is required or not based on its typings 🤗
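To illustrate the point about typings, here is a minimal sketch of a documented type, assuming only the two fields named in this PR (`url` and `maxPagesToCrawl`); the real `Config` has more. A required property is simply one without `?`, so a documentation generator can usually read that from the type itself rather than from a tag.

```typescript
/** Crawler configuration (only two fields shown, for illustration). */
type Config = {
  /**
   * URL to start the crawl. If a sitemap is provided, it is used instead
   * and every page listed in it is downloaded.
   */
  url: string;
  /**
   * Maximum number of pages to crawl.
   * The `?` marks it optional, which doc generators can pick up from the typing.
   */
  maxPagesToCrawl?: number;
};

// `url` must be supplied; `maxPagesToCrawl` may be omitted.
const exampleConfig: Config = { url: "https://www.builder.io/c/docs" };
```

Leaving out `url` here would be a compile-time error, which is the same information the "Required" prefix was carrying in the comment.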
I agree generating meaningful documentation is the way to go. I've updated the code accordingly. I will add more to the documentation as we progress.
config.ts
Outdated
const url_prefix = "https://"
const domain = "www.builder.io";
const url_suffix = "/c/docs";
const base_url = url_prefix + domain;
const match_url_prefix = base_url + url_suffix;
const match_url = match_url_prefix + "/**";

// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
Just a nitpick here to guarantee code consistency, let's use camelCase for these variables:
- const url_prefix = "https://"
- const domain = "www.builder.io";
- const url_suffix = "/c/docs";
- const base_url = url_prefix + domain;
- const match_url_prefix = base_url + url_suffix;
- const match_url = match_url_prefix + "/**";
- // Now date stamp for output file name
- const now = new Date();
- const date = now.toISOString().split('T')[0];
- const time = now.toTimeString().split(' ')[0];
- const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
+ const urlPrefix = "https://"
+ const domain = "www.builder.io";
+ const urlSuffix = "/c/docs";
+ const baseUrl = urlPrefix + domain;
+ const matchUrlPrefix = baseUrl + urlSuffix;
+ const matchUrl = matchUrlPrefix + "/**";
+ // Now date stamp for output file name
+ const now = new Date();
+ const date = now.toISOString().split('T')[0];
+ const time = now.toTimeString().split(' ')[0];
+ const outputsDir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
Requires changes below 👇
I updated to camelCase as you suggested and will continue to use that convention.
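As a side note on the timestamped file name built in this hunk, the following is a minimal sketch of one way to assemble it, not the code in this PR. `buildOutputFileName` is a hypothetical helper, and it assumes both the date and the time are taken from `toISOString()` (UTC). The snippet under review takes the date from `toISOString()` and the time from `toTimeString()` (local time), so the two parts can disagree near midnight.

```typescript
/**
 * Builds "<outputsDir>/<domain>-<YYYY-MM-DD>-<HH:MM:SS>.json".
 * Hypothetical helper; date and time are both read in UTC so they always agree.
 */
function buildOutputFileName(outputsDir: string, domain: string, now: Date): string {
  const iso = now.toISOString(); // e.g. "2023-11-28T12:02:51.000Z"
  const date = iso.slice(0, 10); // "2023-11-28"
  const time = iso.slice(11, 19); // "12:02:51"
  return `${outputsDir}/${domain}-${date}-${time}.json`;
}

// Example call with a fixed timestamp so the result is reproducible.
const exampleOutputFileName = buildOutputFileName(
  "outputs",
  "domain.com",
  new Date("2023-11-28T12:02:51Z"),
);
```

Colons are not allowed in Windows file names, so swapping `:` for `-` in the time part may be worth considering if the crawler is expected to run there.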
I don't mind nitpicks 👍 Establishing a shared nomenclature for the project is ideal, in my opinion. I am always open to feedback and recommendations.
… and config.ts and formatting for jsdoc/typedoc as recommended by @marcelovicentegc in pull request BuilderIO#102, added .prettierignore file
I applied Prettier formatting to the files that failed the check: README.md, src/config.ts, src/core.ts, and config.ts.
@marcelovicentegc does this look good to you to merge?
const matchUrl_prefix = baseUrl + urlSuffix;
const matchUrl = matchUrl_prefix + "/**";

// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split("T")[0];
const time = now.toTimeString().split(" ")[0];
const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";

const outputFileName =
  outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";
Hey @cpdata! Just a couple more nitpicks (replacing snake_case with camelCase) and a rebase and we are good to go!
- const matchUrl_prefix = baseUrl + urlSuffix;
- const matchUrl = matchUrl_prefix + "/**";
- // Now date stamp for output file name
- const now = new Date();
- const date = now.toISOString().split("T")[0];
- const time = now.toTimeString().split(" ")[0];
- const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
- const outputFileName =
-   outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";
+ const matchUrlPrefix = baseUrl + urlSuffix;
+ const matchUrl = matchUrlPrefix + "/**";
+ // Now date stamp for output file name
+ const now = new Date();
+ const date = now.toISOString().split("T")[0];
+ const time = now.toTimeString().split(" ")[0];
+ const outputsDir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
+ const outputFileName =
+   outputsDir + "/" + domain + "-" + date + "-" + time + ".json";
Hey @steve8708! Happy New Year! One rebase and a few nitpicks ☝️ and I think we are good to go 🤗
Please merge this branch ASAP!
Initial Improvements
Main Additions
maxPagesToCrawl: if set to 0, the crawler visits all matching URLs and shows progress as 1/∞.
maxConcurrency: sets the number of concurrent crawl requests. If it is undefined, the crawler uses the maximum number of parallel connections, as the original did by default. It now defaults to 1 to avoid getting IP banned.
waitPerPageCrawlTimeoutRange: defaults to a range of 1 second to 1 second, but can be set to create a random delay between any two values in milliseconds to avoid rate-limit rejection when crawling.
headless: true by default, but can now be set in the config.ts file for situations that require it.
Full Summary
Added *.code-workspace to .gitignore for VS Code workspaces saved in the root of the project.
Add VSCode workspace file in .gitignore
Final output .json files go to the outputs/ folder so they are not overwritten.
Add outputs dir to .gitignore for final outputs
Dynamic domain + date-timestamp final output file name, e.g. outputs/domain.com-2023-11-28-12:02:51.json
Add Dynamic OutputFileName based on date-timestamp
maxPagesToCrawl: if set to 0, crawling continues for all matching URLs and the progress shows the infinity symbol, e.g. 1/∞, 2/∞, 3/∞ (default = 50)
Allow maxPagesToCrawl to be optional and infinite by setting 0 which will display the infinity symbol
maxConcurrency: some sites automatically block connections to prevent DDoS attacks. This config sets how many concurrent requests run at a time (default = 1)
Added maxConcurrency config to set maximum concurrent parallel requests.
Updates to core.ts to add config parameters for maxPagesToCrawl, maxConcurrency, maxRequestsPerCrawl, headless
waitPerPageCrawlTimeoutRange: config added to set a random range in milliseconds between requests. Some sites automatically block connections, so this two-number object introduces a random delay between requests for rate-limit handling (default = 1000)
Update to core.ts for maxPagesToCrawl
headless: now a config option (default = true)
Added headless mode as a configuration parameter
Random rate-limiting range with the waitPerPageCrawlTimeoutRange config.
Added waitPerPageCrawlTimeoutRange for a random range in milliseconds between page requests to help with rate limiting
One-line improvement to prevent a VS Code warning for the non-existent Docker container.
Added ts-ignore for docker config.ts to prevent VSCode from declaring missing file that isn't created until the Docker is.
Chunked data goes into the storage dir. Final compiled JSON file outputs go into the new outputs directory.
Added Output Directory for all outputFileName to go into so they aren't overwritten in storage
Added more variables to the ./config.ts file for setting up the config in a more customized way that also includes the automatic naming convention domain-timestamp.json
Additions to dynamic url and match configurations in config.ts
Added details for waitForSelectorTimeout in the README.md file
Added waitForSelectorTimeout to README.md
Added additional Markdown and TypeScript formatting to the config.ts and README.md files.
Adding details to README.md and config.ts as well as extra formatting.
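The options summarized above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code in core.ts: the helper names (`withDefaults`, `formatProgress`, `limitReached`, `pickDelayMs`) are hypothetical, and the `{ min, max }` field names for the two-number range are assumed, since the description only says it is an object holding two numbers.

```typescript
// Assumed shape for the two-number delay range (field names are hypothetical).
type DelayRange = { min: number; max: number };

type CrawlOptions = {
  maxPagesToCrawl?: number;
  maxConcurrency?: number;
  headless?: boolean;
  waitPerPageCrawlTimeoutRange?: DelayRange;
};

/** Fills in the defaults listed in the PR description. */
function withDefaults(options: CrawlOptions): Required<CrawlOptions> {
  return {
    maxPagesToCrawl: options.maxPagesToCrawl ?? 50,
    maxConcurrency: options.maxConcurrency ?? 1,
    headless: options.headless ?? true,
    waitPerPageCrawlTimeoutRange:
      options.waitPerPageCrawlTimeoutRange ?? { min: 1000, max: 1000 },
  };
}

/** Progress label: a limit of 0 means "no limit" and is shown as ∞. */
function formatProgress(pageNumber: number, maxPagesToCrawl: number): string {
  return `${pageNumber}/${maxPagesToCrawl === 0 ? "∞" : maxPagesToCrawl}`;
}

/** True once the page limit is hit; never true when the limit is 0 (unlimited). */
function limitReached(pagesCrawled: number, maxPagesToCrawl: number): boolean {
  return maxPagesToCrawl !== 0 && pagesCrawled >= maxPagesToCrawl;
}

/** Picks a whole-millisecond delay in [min, max]; `random` is injectable for tests. */
function pickDelayMs(range: DelayRange, random: () => number = Math.random): number {
  const low = Math.min(range.min, range.max);
  const high = Math.max(range.min, range.max);
  return Math.floor(low + random() * (high - low + 1));
}
```

With the default range of 1000 to 1000 the delay is always 1000 ms; widening it, for example to `{ min: 1000, max: 3000 }`, spreads requests out unevenly, which is the behaviour the rate-limiting option is after.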
Splitting this into 13 commits hopefully makes review a little easier.
I would like to contribute to this project on a regular basis. I have a lot of experience with web scraping, AI/LLMs, CI/CD, and automation, and would like to discuss with the main collaborators where I can be of the most use.