Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations #102

Open · wants to merge 14 commits into base: main (changes from 13 commits shown)
6 changes: 6 additions & 0 deletions .gitignore
@@ -15,3 +15,9 @@ storage
# any output from the crawler
*.json
pnpm-lock.yaml

# Final outputs folder
outputs

# VS Code workspace files
*.code-workspace
38 changes: 30 additions & 8 deletions README.md
@@ -64,22 +64,31 @@ export const defaultConfig: Config = {
};
```

See [config.ts](src/config.ts) for all available options. Here is a sample of the common configu options:
See [config.ts](src/config.ts) for all available options. Here is a sample of the common config options:

```ts
type Config = {
/** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */

/** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
Contributor:

What do you think of using tags instead, for example:

Suggested change
/** Required - URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
/**
* URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap
* @required
*/

This way we could eventually bring jsdoc/typedoc into the mix to generate meaningful documentation from them. Just a suggestion, as even jsdoc/typedoc can infer whether a property is required or not based on its typings 🤗

Author:

I agree generating meaningful documentation is the way to go. I've updated the code accordingly. I will add more to the documentation as we progress.

url: string;
/** Pattern to match against for links on a page to subsequently crawl */

/** Required - Pattern to match against for links on a page to subsequently crawl */
match: string;
/** Selector to grab the inner text from */

/** Optional - Selector to grab the inner text from */
selector: string;
/** Don't crawl more than this many pages */

/** Optional - Don't crawl more than this many pages (0 = crawl all, default = 50) */
maxPagesToCrawl: number;
/** File name for the finished data */

/** Optional - File name for the finished data */
outputFileName: string;
/** Optional resources to exclude
*

/** Optional - Timeout for waiting for a selector to appear */
waitForSelectorTimeout: number;

/** Optional - Resource file extensions to exclude from crawl
*
* @example
* ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
*/
@@ -88,6 +97,19 @@ type Config = {
maxFileSize?: number;
/** Optional maximum number tokens to include in the output file */
maxTokens?: number;
/** Optional - Maximum number of concurrent parallel requests at a time */
maxConcurrency?: number;

/** Optional - waitPerPageCrawlTimeoutRange is an object with min and max values for the number of milliseconds to wait after each page crawl.
* Use waitPerPageCrawlTimeoutRange to handle rate limiting.
*/
waitPerPageCrawlTimeoutRange?: {
min: number,
max: number,
};

/** Optional - Boolean parameter to run Playwright with a visible browser or headless (default: headless = true). */
headless?: boolean;
};
```
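
For illustration, here is a hedged sketch of a `config.ts` that exercises the new options above (rate limiting via `waitPerPageCrawlTimeoutRange`, bounded concurrency, headless mode, and `maxPagesToCrawl: 0` for an infinite crawl). The URLs and values are placeholders, not part of this PR:

```ts
import { Config } from "./src/config";

// Sample configuration exercising the options added in this PR.
// Adjust url/match for your own site.
export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  maxPagesToCrawl: 0, // 0 = crawl until no more matching links are found
  outputFileName: "output.json",
  // Wait a random 1–3 seconds between page crawls to avoid rate limiting
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 3000 },
  maxConcurrency: 2, // at most 2 parallel requests
  headless: true, // set to false to watch the browser
};
```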

30 changes: 27 additions & 3 deletions config.ts
@@ -1,8 +1,32 @@
import { Config } from "./src/config";
import { fileURLToPath } from 'url';
import { dirname } from 'path';

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

const starting_url = "https://www.builder.io/c/docs/developers";
const url_prefix = "https://"
const domain = "www.builder.io";
const url_suffix = "/c/docs";
const base_url = url_prefix + domain;
const match_url_prefix = base_url + url_suffix;
const match_url = match_url_prefix + "/**";

// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
Contributor:

Just a nitpick here to guarantee code consistency, let's use camelCase for these variables:

Suggested change
const url_prefix = "https://"
const domain = "www.builder.io";
const url_suffix = "/c/docs";
const base_url = url_prefix + domain;
const match_url_prefix = base_url + url_suffix;
const match_url = match_url_prefix + "/**";
// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputs_dir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';
const urlPrefix = "https://"
const domain = "www.builder.io";
const urlSuffix = "/c/docs";
const baseUrl = urlPrefix + domain;
const matchUrlPrefix = baseUrl + urlSuffix;
const matchUrl = matchUrlPrefix + "/**";
// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split('T')[0];
const time = now.toTimeString().split(' ')[0];
const outputsDir = __dirname.split('/').slice(0, -1).join('/') + '/outputs';

Requires changes below 👇

Author:

I updated to camelCase as you suggested and will continue to use that convention.

Author:

I don't mind nitpicks 👍 Establishing a shared nomenclature for the project is ideal, in my opinion. I am always open to feedback and recommendations.


const outputFileName = outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";

export const defaultConfig: Config = {
url: "https://www.builder.io/c/docs/developers",
match: "https://www.builder.io/c/docs/**",
url: starting_url,
match: match_url,
maxPagesToCrawl: 50,
outputFileName: "output.json",
outputFileName: outputFileName,
waitPerPageCrawlTimeoutRange: {min:1000, max:1000},
headless: true,
maxConcurrency: 1,
};
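
For context, a minimal sketch of the file-name shape the timestamp logic above produces (run in isolation, assuming the same `Date` helpers; the exact value depends on when the crawl starts):

```ts
// Reproduces the date/time-stamped output path built in config.ts above.
const now = new Date();
const date = now.toISOString().split("T")[0]; // e.g. "2024-01-15"
const time = now.toTimeString().split(" ")[0]; // e.g. "14:30:45"
const outputFileName = `outputs/www.builder.io-${date}-${time}.json`;
// => "outputs/www.builder.io-2024-01-15-14:30:45.json"
```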
5 changes: 3 additions & 2 deletions containerapp/Dockerfile
@@ -28,8 +28,9 @@ RUN cd /home && git clone https://github.com/builderio/gpt-crawler && cd gpt-cra
npx playwright install && \
npx playwright install-deps

# Directory to mount in the docker container to get the output.json data
# Directories to mount in the docker container to get the output json data
RUN cd /home && mkdir data

# Final output directory
RUN cd /home && mkdir outputs

WORKDIR /home
1 change: 1 addition & 0 deletions containerapp/data/config.ts
@@ -1,3 +1,4 @@
// @ts-ignore
import { Config } from "./src/config";

export const defaultConfig: Config = {
54 changes: 47 additions & 7 deletions src/config.ts
@@ -6,43 +6,54 @@ const Page: z.ZodType<Page> = z.any();

export const configSchema = z.object({
/**
* **Required:**
* URL to start the crawl, if url is a sitemap, it will crawl all pages in the sitemap
* @example "https://www.builder.io/c/docs/developers"
* @example "https://www.builder.io/sitemap.xml"
* @default ""
*/
url: z.string(),
/**
* **Required:**
* Pattern to match against for links on a page to subsequently crawl
* @example "https://www.builder.io/c/docs/**"
* @default ""
*/
match: z.string().or(z.array(z.string())),

/**
* **Optional:**
* Selector to grab the inner text from
* @example ".docs-builder-container"
* @default ""
*/
selector: z.string().optional(),
/**
* **Optional:**
* Don't crawl more than this many pages
* @default 50
*/
maxPagesToCrawl: z.number().int().positive(),
maxPagesToCrawl: z.number().int().nonnegative().or(z.undefined()).optional(),
/**
* **Optional:**
* File name for the finished data
* @default "output.json"
*/
outputFileName: z.string(),
/** Optional cookie to be set. E.g. for Cookie Consent */
/**
* **Optional:**
* Cookie to be set. E.g. for Cookie Consent
* */
cookie: z
.object({
name: z.string(),
value: z.string(),
})
.optional(),
/** Optional function to run for each page found */
/**
* **Optional:**
* Function to run for each page found
* */
onVisitPage: z
.function()
.args(
@@ -55,21 +66,50 @@
.optional(),
/** Optional timeout for waiting for a selector to appear */
waitForSelectorTimeout: z.number().int().nonnegative().optional(),
/** Optional resources to exclude
/**
* **Optional:**
* Resources to exclude
*
* @example
* ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
*/
resourceExclusions: z.array(z.string()).optional(),

/** Optional maximum file size in megabytes to include in the output file

/**
* **Optional:**
* Maximum file size in megabytes to include in the output file
* @example 1
*/
maxFileSize: z.number().int().positive().optional(),
/** Optional maximum number tokens to include in the output file

/**
* **Optional:**
* The maximum number of tokens to include in the output file
* @example 5000
*/
maxTokens: z.number().int().positive().optional(),
/**
* **Optional:**
* Range for random number of milliseconds between **min** and **max** to wait after each page crawl
* @default {min:1000,max:1000}
* */
waitPerPageCrawlTimeoutRange: z.object({
min: z.number().int().nonnegative(),
max: z.number().int().nonnegative(),
}).optional(),
/**
* **Optional:**
* Headless mode
* @default true
*/
headless: z.boolean().optional(),
/**
* **Optional:**
* Maximum number of concurrent page crawls
* (0 = unlimited, doesn't stop until cancelled; undefined = as many parallel requests as possible)
* @default 1
* */
maxConcurrency: z.number().int().nonnegative().optional(),
});

export type Config = z.infer<typeof configSchema>;
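
As a quick illustration (not part of the diff), the extended schema can be checked with zod's `safeParse`; a minimal sketch, assuming it is run from the repository root where `configSchema` is exported from `src/config.ts`:

```ts
import { configSchema } from "./src/config";

// Verify that the new optional fields pass validation.
const result = configSchema.safeParse({
  url: "https://example.com/docs",
  match: "https://example.com/docs/**",
  outputFileName: "output.json",
  maxPagesToCrawl: 0, // now allowed: 0 means "crawl everything"
  waitPerPageCrawlTimeoutRange: { min: 1000, max: 3000 },
  maxConcurrency: 1,
  headless: true,
});

if (result.success) {
  console.log("Config is valid:", result.data);
} else {
  console.error(result.error.issues);
}
```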
38 changes: 33 additions & 5 deletions src/core.ts
@@ -47,6 +47,14 @@ export async function waitForXPath(page: Page, xpath: string, timeout: number) {
}

export async function crawl(config: Config) {

// Function to delay the next crawl
function delay(time: number) {
return new Promise(function(resolve) {
setTimeout(resolve, time)
});
}

configSchema.parse(config);

if (process.env.NO_CRAWL !== "true") {
Expand All @@ -55,6 +63,12 @@ export async function crawl(config: Config) {
const crawler = new PlaywrightCrawler({
// Use the requestHandler to process each of the crawled pages.
async requestHandler({ request, page, enqueueLinks, log, pushData }) {
// Warn if unlimited crawling is enabled
if (config.maxPagesToCrawl == 0) {
log.warningOnce(`maxPagesToCrawl is set to 0, which means the crawl will continue until it cannot find any more links matching: ${config.match}`);
config.maxPagesToCrawl = undefined;
}

if (config.cookie) {
// Set the cookie for the specific URL
const cookie = {
Expand All @@ -66,9 +80,11 @@ export async function crawl(config: Config) {
}

const title = await page.title();
// Display the pageCounter/maxPagesToCrawl number or pageCounter/∞ if maxPagesToCrawl=0
const maxPagesToCrawlDisplay = config.maxPagesToCrawl == undefined ? "∞" : config.maxPagesToCrawl;
pageCounter++;
log.info(
`Crawling: Page ${pageCounter} / ${config.maxPagesToCrawl} - URL: ${request.loadedUrl}...`,
`Crawling: Page ${pageCounter} / ${maxPagesToCrawlDisplay} - URL: ${request.loadedUrl}...`
);

// Use custom handling for XPath selector
@@ -101,11 +117,23 @@
globs:
typeof config.match === "string" ? [config.match] : config.match,
});
// Use waitPerPageCrawlTimeoutRange to handle rate limiting
if (config.waitPerPageCrawlTimeoutRange) {
// Create a random number between min and max
const randomTimeout = Math.floor(Math.random() * (config.waitPerPageCrawlTimeoutRange.max - config.waitPerPageCrawlTimeoutRange.min + 1) + config.waitPerPageCrawlTimeoutRange.min);
log.info(
`Waiting ${randomTimeout} milliseconds before next crawl to avoid rate limiting...`
);
// Wait for the random amount of time before crawling the next page
await delay(randomTimeout);
} else {
// Wait for 1 second before crawling the next page
await delay(1000);
}
},
// Comment this option to scrape the full website.
maxRequestsPerCrawl: config.maxPagesToCrawl,
// Uncomment this option to see the browser window.
// headless: false,
maxConcurrency: config.maxConcurrency || 1, // Set the max concurrency
maxRequestsPerCrawl: config.maxPagesToCrawl, // Set the max pages to crawl or set to 0 to scrape the full website.
headless: config.headless ?? true, // Set to false to see the browser in action
preNavigationHooks: [
// Abort requests for certain resource types
async ({ page, log }) => {