Custom JSON name and latest dependencies #121

Open: wants to merge 1 commit into base: main
31 changes: 25 additions & 6 deletions README.md
@@ -13,8 +13,6 @@ Crawl a site to generate knowledge files to create your own custom GPT from one
- [Run your crawler](#run-your-crawler)
- [Alternative methods](#alternative-methods)
- [Running in a container with Docker](#running-in-a-container-with-docker)
- [Running as a CLI](#running-as-a-cli)
- [Development](#development)
- [Upload your data to OpenAI](#upload-your-data-to-openai)
- [Create a custom GPT](#create-a-custom-gpt)
- [Create a custom assistant](#create-a-custom-assistant)
@@ -32,6 +30,8 @@ This project crawled the docs and generated the file that I uploaded as the basi

## Get started

This update adds custom JSON file names and the latest npm libraries for 2024 support. Run `npm i` or `pnpm i`, then `npm start` or `pnpm start`, to get going.

### Running locally

#### Clone the repository
@@ -48,19 +48,38 @@ git clone https://github.com/builderio/gpt-crawler
```sh
npm i
```

or

```sh
pnpm i
```

#### Configure the crawler

Open [config.ts](config.ts) and edit the `url` and `selector` properties to match your needs.

E.g. to crawl the Builder.io docs to make our custom GPT you can use:
Composing the config this way names the output JSON file after the domain, so each crawled site gets its own file:

```ts
import { Config } from "./src/config";

// protocol
let protocol = "https://www.";
// Important parts, e.g. for https://www.builder.io/c/docs/**
let domain = "builder";
let tld = ".io";
// url
let extra = "/c/docs/developers";
// match
let content = "/c/docs";
let rest = "/**";
```

Contributor:

I'm not sure this is beneficial; it is quite verbose. I think the way configuration is currently handled is intuitive and easy for individuals to structure in any way they want.

Author:

> I'm not sure this is beneficial; it is quite verbose. I think the way configuration is currently handled is intuitive and easy for individuals to structure in any way they want.

The idea is that, as a user, I cannot see the difference between the two links, so this may help me improve the tool in a great way! <3 Thanks for your time, by the way!

Author:

And Merry Christmas @steve8708

```ts
 export const defaultConfig: Config = {
-  url: "https://www.builder.io/c/docs/developers",
-  match: "https://www.builder.io/c/docs/**",
-  selector: `.docs-builder-container`,
+  url: protocol + domain + tld + extra,
+  match: protocol + domain + tld + content + rest,
   maxPagesToCrawl: 50,
-  outputFileName: "output.json",
+  outputFileName: domain + ".json",
 };
```

13 changes: 10 additions & 3 deletions config.ts
@@ -1,8 +1,15 @@
 import { Config } from "./src/config";

+let protocol = "https://www.";
+let domain = "builder";
+let tld = ".io";
+let extra = "/c/docs/developers";
+let content = "/c/docs"
+let rest = "/**";
+
 export const defaultConfig: Config = {
-  url: "https://www.builder.io/c/docs/developers",
-  match: "https://www.builder.io/c/docs/**",
+  url: protocol + domain + tld + extra,
+  match: protocol + domain + tld + content + rest,
   maxPagesToCrawl: 50,
-  outputFileName: "output.json",
+  outputFileName: domain + ".json",
 };
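
One possible way to reconcile the two positions in the review (keep the plain `url` and `match` strings, but still get a per-site output name) might be to derive the file name from the URL itself. The following is only a sketch under stated assumptions: `outputFileNameFor` and `exampleConfig` are hypothetical names that are not part of gpt-crawler or of this PR, and the sketch relies only on the standard `URL` class. Note that it produces `builder-io.json`, not the `builder.json` that this PR's `domain + ".json"` would produce.

```typescript
// Hypothetical helper (an assumption for illustration, not gpt-crawler API):
// derive a per-site output file name from the crawl URL.
function outputFileNameFor(rawUrl: string): string {
  // The global URL class throws a TypeError on malformed input.
  const { hostname } = new URL(rawUrl);
  // Drop a leading "www." and replace the remaining dots with dashes.
  const site = hostname.replace(/^www\./, "").replace(/\./g, "-");
  return `${site}.json`;
}

const startUrl = "https://www.builder.io/c/docs/developers";

// Plain object mirroring the fields shown in the diff; in config.ts it
// would be typed as Config from "./src/config".
const exampleConfig = {
  url: startUrl,
  match: "https://www.builder.io/c/docs/**",
  maxPagesToCrawl: 50,
  outputFileName: outputFileNameFor(startUrl),
};
```

With this shape the `url` and `match` values stay readable as whole links, and only the file name is computed.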