Rate Limiting, Max Concurrency, Infinite Crawl & Additional Configurations #102

Open · wants to merge 14 commits into `main`
6 changes: 6 additions & 0 deletions .gitignore
@@ -15,3 +15,9 @@ storage
# any output from the crawler
*.json
pnpm-lock.yaml

# Final outputs folder
outputs

# VS Code workspace files
*.code-workspace
30 changes: 30 additions & 0 deletions .prettierignore
@@ -0,0 +1,30 @@
# Ignore artifacts

node_modules
.github
storage
outputs
*.code-workspace

## This file tells Prettier which files should not be formatted

.idea
dist
node_modules
apify_storage
crawlee_storage
storage
.DS_Store

## any output from the crawler

*.json
pnpm-lock.yaml

## Final outputs folder

outputs

## VS Code workspace files

*.code-workspace
120 changes: 108 additions & 12 deletions README.md
@@ -64,32 +64,112 @@ export const defaultConfig: Config = {
};
```

See [config.ts](src/config.ts) for all available options. Here is a sample of the common configu options:
See [config.ts](src/config.ts) for all available options. Here is a sample of the common config options:

```ts
````ts
type Config = {
/** URL to start the crawl, if sitemap is provided then it will be used instead and download all pages in the sitemap */
/**
* URL to start the crawl, if url is a sitemap, it will crawl all pages in the sitemap
* @example "https://www.builder.io/c/docs/developers"
* @example "https://www.builder.io/sitemap.xml"
* @default ""
* @required
*/
url: string;
/** Pattern to match against for links on a page to subsequently crawl */
/**
* Pattern to match against for links on a page to subsequently crawl
* @example "https://www.builder.io/c/docs/**"
* @default ""
*/
match: string;
/** Selector to grab the inner text from */
/**
* Selector to grab the inner text from
* @example ".docs-builder-container"
* @default ""
* @required
*/
selector: string;
/** Don't crawl more than this many pages */
/**
* Don't crawl more than this many pages
* @default 50
*/
maxPagesToCrawl: number;
/** File name for the finished data */
/**
* File name for the finished data
* @example "output.json"
*/
outputFileName: string;
/** Optional resources to exclude
*
/**
* Cookie to be set. E.g. for Cookie Consent
*/
cookie?: {
name: string,
value: string,
url: string,
};
/**
* Function to run for each page found
*/
onVisitPage?: (page: object, data: string) => Promise<void>;
/**
* Timeout to wait for a selector to appear
*/
waitForSelectorTimeout?: number;
/**
* Resource file extensions to exclude from crawl
* @example
* ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
*/
resourceExclusions?: string[];
/** Optional maximum file size in megabytes to include in the output file */
/**
* Maximum file size in megabytes to include in the output file
* @example 1
*/
maxFileSize?: number;
/** Optional maximum number tokens to include in the output file */
/**
* The maximum number tokens to include in the output file
* @example 5000
*/
maxTokens?: number;
/**
* Maximum number of concurrent parallel requests at a time
* @example
* A specific number of parallel requests
* ```ts
* maxConcurrency: 2;
* ```
* @example
* 0 = unlimited; doesn't stop until cancelled
* ```ts
* maxConcurrency: 0;
* ```
* @example
* undefined = the maximum number of parallel requests possible
* ```ts
* maxConcurrency: undefined;
* ```
* @default 1
*/
maxConcurrency?: number;
/**
* Range for random number of milliseconds between **min** and **max** to wait after each page crawl
* @default {min:1000,max:1000}
* @example {min:1000, max:2000}
*/
waitPerPageCrawlTimeoutRange?: {
min: number,
max: number,
};

/** Optional - Boolean parameter to use PlayWright with displayed browser or headless ( default headless=True ). */
/**
* Headless mode
* @default true
*/
headless?: boolean;
};
```
````
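The `waitPerPageCrawlTimeoutRange` option above implies a random delay drawn between `min` and `max` after each page crawl. A minimal sketch of how such a delay could be computed — an illustration under that reading of the docs, not the crawler's actual implementation:

```typescript
// Illustrative sketch: pick a random wait (in ms) within the configured range.
// `WaitRange` mirrors the shape of `waitPerPageCrawlTimeoutRange` documented above.
type WaitRange = { min: number; max: number };

function pickWaitMs(range: WaitRange): number {
  // Uniformly choose an integer in [min, max].
  return range.min + Math.floor(Math.random() * (range.max - range.min + 1));
}

// With min === max (the default { min: 1000, max: 1000 }) the wait is fixed.
const fixed = pickWaitMs({ min: 1000, max: 1000 }); // always 1000
const varied = pickWaitMs({ min: 1000, max: 2000 }); // somewhere in 1000..2000
```

With the default range the crawl is evenly throttled; widening the range makes request timing less regular.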

#### Run your crawler

@@ -103,6 +183,22 @@ npm start

To obtain the `output.json` with a containerized execution, go into the `containerapp` directory. Modify the `config.ts` the same as above; the `output.json` file should be generated in the data folder. Note: the `outputFileName` property in the `config.ts` file in the containerapp folder is configured to work with the container.

#### [Running as a CLI](#running-as-a-cli)

To run the `./dist/cli.ts` command line interface, follow these instructions:

1. Open a terminal.
2. Navigate to the root directory of the project.
3. Run the following command: `./dist/cli.ts [arguments]`
Replace `[arguments]` with the appropriate command line arguments for your use case.
4. The CLI will execute the specified command and display the output in the terminal.

> Note: Make sure you have the necessary dependencies installed and the project has been built before running the CLI.

#### [Development](#development)

> Instructions for Development will go here...

### Upload your data to OpenAI

The crawl will generate a file called `output.json` at the root of this project. Upload that [to OpenAI](https://platform.openai.com/docs/assistants/overview) to create your custom assistant or custom GPT.
31 changes: 28 additions & 3 deletions config.ts
@@ -1,8 +1,33 @@
import { Config } from "./src/config";
import { fileURLToPath } from "url";
import { dirname } from "path";

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

const startingUrl = "https://www.builder.io/c/docs/developers";
const urlPrefix = "https://";
const domain = "www.builder.io";
const urlSuffix = "/c/docs";
const baseUrl = urlPrefix + domain;
const matchUrl_prefix = baseUrl + urlSuffix;
const matchUrl = matchUrl_prefix + "/**";

// Now date stamp for output file name
const now = new Date();
const date = now.toISOString().split("T")[0];
const time = now.toTimeString().split(" ")[0];
const outputs_dir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";

const outputFileName =
outputs_dir + "/" + domain + "-" + date + "-" + time + ".json";
> **Contributor** commented on lines +13 to +23:
>
> Hey @cpdata! Just a couple more nitpicks (replacing snake_case for camelCase) and a rebase and we are good to go!
>
> Suggested change:
>
> ```ts
> const matchUrlPrefix = baseUrl + urlSuffix;
> const matchUrl = matchUrlPrefix + "/**";
>
> // Now date stamp for output file name
> const now = new Date();
> const date = now.toISOString().split("T")[0];
> const time = now.toTimeString().split(" ")[0];
> const outputsDir = __dirname.split("/").slice(0, -1).join("/") + "/outputs";
>
> const outputFileName =
>   outputsDir + "/" + domain + "-" + date + "-" + time + ".json";
> ```


export const defaultConfig: Config = {
url: "https://www.builder.io/c/docs/developers",
match: "https://www.builder.io/c/docs/**",
url: startingUrl,
match: matchUrl,
maxPagesToCrawl: 50,
outputFileName: "output.json",
outputFileName: outputFileName,
waitPerPageCrawlTimeoutRange: { min: 1000, max: 1000 },
headless: true,
maxConcurrency: 1,
};
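The derived constants in this config compose into the match pattern and a date-stamped output path. A self-contained sketch of that composition, reusing the names above purely for illustration:

```typescript
// Sketch of how the derived constants in config.ts compose.
const urlPrefix = "https://";
const domain = "www.builder.io";
const urlSuffix = "/c/docs";
const baseUrl = urlPrefix + domain;
const matchUrl = baseUrl + urlSuffix + "/**";

// Date-stamped output file name, e.g. "outputs/www.builder.io-2024-01-31-12:34:56.json"
const now = new Date();
const date = now.toISOString().split("T")[0]; // YYYY-MM-DD (UTC)
const time = now.toTimeString().split(" ")[0]; // HH:MM:SS (local time)
const outputFileName = "outputs/" + domain + "-" + date + "-" + time + ".json";
```

One caveat worth noting: `toTimeString()` yields `HH:MM:SS`, so the generated name contains colons, which are not valid in file names on Windows.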
5 changes: 3 additions & 2 deletions containerapp/Dockerfile
@@ -28,8 +28,9 @@ RUN cd /home && git clone https://github.com/builderio/gpt-crawler && cd gpt-cra
npx playwright install && \
npx playwright install-deps

# Directory to mount in the docker container to get the output.json data
# Directories to mount in the docker container to get the output json data
RUN cd /home && mkdir data

# Final output directory
RUN cd /home && mkdir outputs

WORKDIR /home
1 change: 1 addition & 0 deletions containerapp/data/config.ts
@@ -1,3 +1,4 @@
// @ts-ignore
import { Config } from "./src/config";

export const defaultConfig: Config = {
66 changes: 54 additions & 12 deletions src/config.ts
@@ -10,15 +10,16 @@ export const configSchema = z.object({
* @example "https://www.builder.io/c/docs/developers"
* @example "https://www.builder.io/sitemap.xml"
* @default ""
* @required
*/
url: z.string(),
/**
* Pattern to match against for links on a page to subsequently crawl
* @example "https://www.builder.io/c/docs/**"
* @default ""
* @required
*/
match: z.string().or(z.array(z.string())),

/**
* Selector to grab the inner text from
* @example ".docs-builder-container"
@@ -29,20 +30,24 @@
* Don't crawl more than this many pages
* @default 50
*/
maxPagesToCrawl: z.number().int().positive(),
maxPagesToCrawl: z.number().int().nonnegative().or(z.undefined()).optional(),
/**
* File name for the finished data
* @default "output.json"
* @example "output.json"
*/
outputFileName: z.string(),
/** Optional cookie to be set. E.g. for Cookie Consent */
/**
* Cookie to be set. E.g. for Cookie Consent
*/
cookie: z
.object({
name: z.string(),
value: z.string(),
})
.optional(),
/** Optional function to run for each page found */
/**
* Function to run for each page found
*/
onVisitPage: z
.function()
.args(
@@ -53,23 +58,60 @@
)
.returns(z.promise(z.void()))
.optional(),
/** Optional timeout for waiting for a selector to appear */
waitForSelectorTimeout: z.number().int().nonnegative().optional(),
/** Optional resources to exclude
*
/**
* Resources to exclude
* @example
* ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
*/
resourceExclusions: z.array(z.string()).optional(),

/** Optional maximum file size in megabytes to include in the output file
/**
* Maximum file size in megabytes to include in the output file
* @example 1
*/
maxFileSize: z.number().int().positive().optional(),
/** Optional maximum number tokens to include in the output file
/**
* The maximum number tokens to include in the output file
* @example 5000
*/
maxTokens: z.number().int().positive().optional(),
/**
* Maximum number of concurrent parallel requests at a time
* @example
* A specific number of parallel requests
* ```ts
* maxConcurrency: 2;
* ```
* @example
* 0 = unlimited; doesn't stop until cancelled
* ```ts
* maxConcurrency: 0;
* ```
* @example
* undefined = the maximum number of parallel requests possible
* ```ts
* maxConcurrency: undefined;
* ```
* @default 1
*/
maxConcurrency: z.number().int().nonnegative().optional(),
/**
* Range for random number of milliseconds between **min** and **max** to wait after each page crawl
* @default {min:1000,max:1000}
* @example {min:1000,max:2000}
*/
waitForSelectorTimeout: z.number().int().nonnegative().optional(),
waitPerPageCrawlTimeoutRange: z
.object({
min: z.number().int().nonnegative(),
max: z.number().int().nonnegative(),
})
.optional(),
/**
* Headless mode
* @default true
*/
headless: z.boolean().optional(),
});

export type Config = z.infer<typeof configSchema>;
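The documented `maxConcurrency` semantics (a number for a fixed limit, `0` for unlimited, `undefined` for the maximum possible) can be read as a small mapping. The helper below is hypothetical — a sketch of that documented contract, not part of the crawler's code:

```typescript
// Hypothetical helper illustrating the documented maxConcurrency semantics:
// 0 => unlimited (doesn't stop until cancelled),
// undefined => as many parallel requests as possible,
// n => exactly n concurrent requests.
function effectiveConcurrency(maxConcurrency?: number): number {
  if (maxConcurrency === undefined) return Number.MAX_SAFE_INTEGER; // "max possible"
  if (maxConcurrency === 0) return Infinity; // unlimited
  return maxConcurrency; // a specific number of parallel requests
}
```

The zod schema above (`z.number().int().nonnegative().optional()`) permits exactly these three cases: a positive integer, zero, or an omitted value.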