Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations #102

cpdata · 2023-12-04T13:06:55Z

Initial Improvements

Main Additions

maxPagesToCrawl: if set to 0, the crawler crawls all matching URLs and shows progress as 1/∞.
maxConcurrency: sets the number of concurrent crawl requests. If left undefined, the crawler opens the maximum number of parallel connections, as the original did by default. It now defaults to 1 to avoid getting IP banned.
waitPerPageCrawlTimeoutRange: defaults to a range of 1 second to 1 second, but can be set to create a random delay between any two numbers, in milliseconds, to avoid rate-limit rejection when crawling.
headless: true by default, but can now be configured in the config.ts file for situations that require it.
Improved README.md and config.ts documentation. (More to be done.)
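
As a rough illustration of how the four options above could sit together, here is a minimal sketch. The `CrawlOptions` type, the `min`/`max` field names, and the `withDefaults` helper are assumptions made for this example and are not taken from the PR's src/config.ts; only the option names and default values come from the description above.

```typescript
// Hypothetical shape for the options described in this PR.
// The real Config type in src/config.ts may differ.
type CrawlOptions = {
  /** Page limit; 0 is treated as "no limit". */
  maxPagesToCrawl?: number;
  /** Number of concurrent crawl requests. */
  maxConcurrency?: number;
  /** Random delay bounds in milliseconds (field names are assumed). */
  waitPerPageCrawlTimeoutRange?: { min: number; max: number };
  /** Whether the browser runs without a visible window. */
  headless?: boolean;
};

// Defaults as stated in the PR description.
const defaults: Required<CrawlOptions> = {
  maxPagesToCrawl: 50,
  maxConcurrency: 1,
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 1000 },
  headless: true,
};

// Fill in any option the caller left undefined.
// `??` keeps an explicit 0 or false instead of replacing it with the default.
function withDefaults(options: CrawlOptions): Required<CrawlOptions> {
  return {
    maxPagesToCrawl: options.maxPagesToCrawl ?? defaults.maxPagesToCrawl,
    maxConcurrency: options.maxConcurrency ?? defaults.maxConcurrency,
    waitPerPageCrawlTimeoutRange:
      options.waitPerPageCrawlTimeoutRange ?? defaults.waitPerPageCrawlTimeoutRange,
    headless: options.headless ?? defaults.headless,
  };
}
```

Using `??` rather than `||` matters here because `maxPagesToCrawl: 0` and `headless: false` are both meaningful values that should not fall back to the defaults.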

Full Summary

Added *.code-workspace to .gitignore for VSCODE workspaces saved in the root of the project.
Add VSCode workspace file in .gitignore
Final output .json files go to outputs/ folder so they are not overwritten.
Add outputs dir to .gitignore for final outputs
Dynamic domain + date-timestamp final output file name, e.g. outputs/domain.com-2023-11-28-12:02:51.json
Add Dynamic OutputFileName based on date-timestamp
maxPagesToCrawl: if set to 0, crawling continues for all matching URLs and progress displays the infinity symbol, e.g. 1/∞, 2/∞, 3/∞ (default = 50)
Allow maxPagesToCrawl to be optional and infinite by setting 0 which will display the infinity symbol
maxConcurrency: Some sites automatically block connections to prevent DDoS attacks. This config sets how many concurrent requests run at a time. (default = 1)
Added maxConcurrency config to set maximum concurrent parallel requests.
Updates to core.ts to add config parameters for maxPagesToCrawl, maxConcurrency, maxRequestsPerCrawl, headless
waitPerPageCrawlTimeoutRange config added to set a random delay range, in milliseconds, between requests. Some sites automatically block connections, so this is a two-number object that introduces a random delay between requests for rate-limit handling (default = 1000)
Update to core.ts for maxPagesToCrawl
headless is now a config option (default = true)
Added headless mode as a configuration parameter
Random Rate Limiting Range with waitPerPageCrawlTimeoutRange config.
Added waitPerPageCrawlTimeoutRange for a random range in milliseconds between page requests to help with rate limiting
One-line improvement to prevent a VS Code warning for the non-existent Docker container.
Added ts-ignore for docker config.ts to prevent VSCode from declaring missing file that isn't created until the Docker is.
Chunked data goes into the storage dir. Final compiled JSON file outputs go into the new outputs directory.
Added Output Directory for all outputFileName to go into so they aren't overwritten in storage
Added more variables to the ./config.ts file for setting up the config in a more customized way that also includes the automatic naming convention domain-timestamp.json
Additions to dynamic url and match configurations in config.ts
Added details for waitForSelectorTimeout in the README.md file
Added waitForSelectorTimeout to README.md
Added additional Markdown and Typescript formatting to the config.ts and README.md files.
Adding details to README.md and config.ts as well as extra formatting.
13 commits, which hopefully makes review a little easier.
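
The random delay and the infinity-symbol progress display described in the summary can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR's core.ts: the `DelayRange` field names and the helper names `randomDelayMs`, `sleep`, and `progressLabel` are all hypothetical.

```typescript
// Hypothetical two-number range object; field names are assumed.
type DelayRange = { min: number; max: number };

// Pick a whole number of milliseconds in [min, max], inclusive.
// `random` is injectable so the function can be checked deterministically.
function randomDelayMs(range: DelayRange, random: () => number = Math.random): number {
  const low = Math.min(range.min, range.max);
  const high = Math.max(range.min, range.max);
  return Math.floor(low + random() * (high - low + 1));
}

// Resolve after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Format crawl progress; a limit of 0 is shown as the infinity symbol.
function progressLabel(crawled: number, maxPagesToCrawl: number): string {
  const total = maxPagesToCrawl === 0 ? "∞" : String(maxPagesToCrawl);
  return `${crawled}/${total}`;
}
```

A request handler could then call something like `await sleep(randomDelayMs({ min: 1000, max: 3000 }))` before each page. With the default range of 1000 to 1000, the delay is always exactly one second.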

I would like to contribute to this project on a regular basis. I have a lot of experience in web scraping, AI/LLMs, CI/CD, and automation, and would like to discuss with the main collaborators where I can be of the most use.

marcelovicentegc · 2023-12-05T03:04:06Z

README.md


 ```ts
 type Config = {
-  /** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
+
+  /** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */


What do you think of using tags instead, for example:

Suggested change

-/** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
+/**
+ * URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap
+ * @required
+ */

This way we could eventually bring jsdoc/typedoc into the mix to generate meaningful documentation from them. Just a suggestion, as even jsdoc/typedoc can infer whether a property is required or not based on its typings 🤗

I agree generating meaningful documentation is the way to go. I've updated the code accordingly. I will add more to the documentation as we progress.
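
To illustrate the point about required-ness being inferred from typings, here is a small sketch of doc comments on a config type. The `DocumentedConfig` type and `exampleConfig` value are hypothetical and not taken from this PR. `@defaultValue` is a TSDoc tag; whether a given documentation generator recognizes a custom tag such as `@required` is something to verify against that tool's documentation.

```typescript
// Hypothetical excerpt of a documented config type.
type DocumentedConfig = {
  /**
   * URL to start the crawl. If a sitemap is provided, it is used instead and
   * every page in the sitemap is downloaded.
   */
  url: string; // no `?`, so the typings already mark this property as required

  /**
   * Maximum number of pages to crawl; 0 means no limit.
   * @defaultValue 50
   */
  maxPagesToCrawl?: number; // `?` marks this property as optional
};

// Omitting `url` here would be a compile-time error; omitting
// `maxPagesToCrawl` is allowed.
const exampleConfig: DocumentedConfig = {
  url: "https://www.builder.io/c/docs",
};
```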

marcelovicentegc · 2023-12-05T03:06:03Z

config.ts

+const url_prefix = "https://"
+const domain = "www.builder.io";
+const url_suffix = "/c/docs";
+const base_url = url_prefix + domain;
+const match_url_prefix = base_url + url_suffix;
+const match_url = match_url_prefix + "/**";
+
+// Now date stamp for output file name
+const now = new Date();
+const date = now.toISOString().split('T')[0];
+const time = now.toTimeString().split(' ')[0];
+const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';


Just a nitpick here to guarantee code consistency, let's use camelCase for these variables:

Suggested change

-const url_prefix = "https://"
-const domain = "www.builder.io";
-const url_suffix = "/c/docs";
-const base_url = url_prefix + domain;
-const match_url_prefix = base_url + url_suffix;
-const match_url = match_url_prefix + "/**";
-
-// Now date stamp for output file name
-const now = new Date();
-const date = now.toISOString().split('T')[0];
-const time = now.toTimeString().split(' ')[0];
-const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
+const urlPrefix = "https://"
+const domain = "www.builder.io";
+const urlSuffix = "/c/docs";
+const baseUrl = urlPrefix + domain;
+const matchUrlPrefix = baseUrl + urlSuffix;
+const matchUrl = matchUrlPrefix + "/**";
+
+// Now date stamp for output file name
+const now = new Date();
+const date = now.toISOString().split('T')[0];
+const time = now.toTimeString().split(' ')[0];
+const outputsDir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';

Requires changes below 👇

I updated to camelCase as you suggested and will continue to use that convention.

I don't mind nitpicks 👍 Establishing a shared nomenclature for the project is ideal in my opinion. I am always open to feedback and recommendations.
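
For reference, the domain-plus-timestamp naming discussed in this thread could also be written as a pure function. This is only a sketch under assumptions: `buildOutputFileName` and `pad2` are hypothetical names, and the choices to take both date and time from local-time getters and to use hyphens instead of colons are this example's, not the PR's. The snippet in the thread takes the date from `toISOString()` (UTC) and the time from `toTimeString()` (local time), and colons are not allowed in Windows file names.

```typescript
// Left-pad a number to two digits, e.g. 2 -> "02".
function pad2(n: number): string {
  return ("0" + n).slice(-2);
}

// Build "<outputsDir>/<domain>-YYYY-MM-DD-HH-MM-SS.json" from local time.
// Hypothetical helper; hyphens replace the colons of the time portion.
function buildOutputFileName(outputsDir: string, domain: string, now: Date): string {
  const date = [now.getFullYear(), pad2(now.getMonth() + 1), pad2(now.getDate())].join("-");
  const time = [pad2(now.getHours()), pad2(now.getMinutes()), pad2(now.getSeconds())].join("-");
  return `${outputsDir}/${domain}-${date}-${time}.json`;
}
```

Passing `now` in as a parameter keeps the function deterministic, so the same call always yields the same file name for a given timestamp.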

cpdata · 2023-12-06T23:58:29Z

I updated the files that failed (README.md, src/config.ts, src/core.ts, and config.ts) with Prettier formatting.
I also added the formatting for jsdoc/typedoc as recommended by @marcelovicentegc in response to my original pull request #102. Additionally, I added a .prettierignore file.

steve8708 · 2023-12-22T19:19:33Z

@marcelovicentegc this look good to you to merge?

marcelovicentegc · 2024-01-04T14:06:38Z

config.ts

+const matchUrl_prefix = baseUrl + urlSuffix;
+const matchUrl = matchUrl_prefix + "/**";
+
+// Now date stamp for output file name
+const now = new Date();
+const date = now.toISOString().split("T")[0];
+const time = now.toTimeString().split(" ")[0];
+const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
+
+const outputFileName =
+  outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";


Hey @cpdata! Just a couple more nitpicks (replacing snake_case with camelCase) and a rebase and we are good to go!

Suggested change

-const matchUrl_prefix = baseUrl + urlSuffix;
-const matchUrl = matchUrl_prefix + "/**";
-
-// Now date stamp for output file name
-const now = new Date();
-const date = now.toISOString().split("T")[0];
-const time = now.toTimeString().split(" ")[0];
-const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
-
-const outputFileName =
-  outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";
+const matchUrlPrefix = baseUrl + urlSuffix;
+const matchUrl = matchUrlPrefix + "/**";
+
+// Now date stamp for output file name
+const now = new Date();
+const date = now.toISOString().split("T")[0];
+const time = now.toTimeString().split(" ")[0];
+const outputsDir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
+
+const outputFileName =
+  outputsDir + "/" + domain + "-" + date + "-" + time + ".json";

marcelovicentegc · 2024-01-04T14:07:51Z

@marcelovicentegc this look good to you to merge?

Hey @steve8708! Happy New Year! One rebase and a few nitpicks ☝️ and it occurs to me that we are good to go 🤗

Ademrobert · 2024-01-06T19:30:31Z

Please merge this branch ASAP!

cpdata and others added 13 commits December 4, 2023 04:39

dbb6a21 Add VSCode workspace file in .gitignore
60ec188 Add outputs dir to .gitignore for final outputs
14eb9fa Add Dynamic OutputFileName based on date-timestamp
e700f6e Allow maxPagesToCrawl to be optional and infinite by setting 0 which will display the infinity symbol. Default is 50
ac0ac25 Added maxConcurrency config to set maximum concurrent parallel requests.
a6b4b1f Update to core.ts for maxPagesToCrawl
c6b6303 Addded headless mode as a configuration parameter
b427b25 Added waitPerPageCrawlTimeoutRange for a random range in milliseconds between page requests to help with rate limiting
35ea95b Added ts-ignore for docker config.ts to prevent VSCode from declaring missing file that isn't created until the Docker is.
a996ab1 Added Output Directory for all outputFileName to go into so they aren't overwritten in storage
33faaf5 Additions to dynamic url and match configurations in config.ts
83a5b0c Added waitForSelectorTimeout to README.md
401fa9b Adding details to README.md and config.ts as well as extra formatting.

marcelovicentegc reviewed Dec 5, 2023

marcelovicentegc assigned cpdata Dec 5, 2023

marcelovicentegc added the documentation and enhancement labels Dec 5, 2023

62521b7 Update prettier formatting for README.md, src/config.ts, src/core.ts, and config.ts and formatting for jsdoc/typedoc as recommened by @marcelovicentegc in pull request BuilderIO#102, added .prettierignore file

cpdata requested a review from marcelovicentegc December 7, 2023 00:03

marcelovicentegc reviewed Jan 4, 2024