Adding pagination option #76
The changes you've implemented are intriguing and could be particularly beneficial for handling projects with a thousand or more pages. Overall, the modifications seem comprehensive. However, I have reservations about the effectiveness of dividing the output into multiple paginated files. This seems more tailored towards human usability rather than GPT optimization. According to GPT itself, the training efficacy remains consistent whether using one large file or several smaller ones. Therefore, while this approach may not necessarily hinder GPT, it also doesn't guarantee an improvement in its performance.
  outputFileName,
} = options;

// @ts-ignore
const maxPagesToCrawl = parseInt(maxPagesToCrawlStr, 10);

// @ts-ignore
const pagesPerPagination = parseInt(pagesPerPaginationStr, 10);
Are you forcing pagination?
export async function write(config: Config, paginationCounter: number = 0) {
  configSchema.parse(config);
  let fileNameParts = config.outputFileName.split('.');
  if (paginationCounter) {
    fileNameParts.splice(fileNameParts.length - 1, 0, `${paginationCounter}`);
  }
  const outputFilePath = fileNameParts.join('.');
I approve of the change that gives crawl the responsibility to write, but I'm not sure the pagination has any real use for GPT. For human readability, yes; for GPT, not really. It might, it might not... I don't have the real answer, but GPT itself says that if the content is the same, multiple files or one big file will have the same effectiveness for the training.
@maxime4000 you are right. It used to be that the size of the document would make a difference. But with GPT they have obviously handled this better. Thanks for reviewing the PR, but I will withdraw it as I don't think it will add value now.
Adding pagination to the crawler. This allows a user to configure how many pages are crawled per pagination, and then saves them to a modified version of the output file.
The output file name will have a number added, which is the paginationCounter; it is inserted just before the file extension.
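To make the naming scheme concrete, here is a minimal sketch. `paginatedFileName` mirrors the splice logic from the `write` diff above; `chunkPages` is a hypothetical helper that only illustrates grouping pages by `pagesPerPagination`, and it is an assumption about how batches could be formed, not code taken from this PR.

```typescript
// Hypothetical helper mirroring the splice in write(): the counter is inserted
// before the last dot-separated part. A counter of 0 leaves the name unchanged.
function paginatedFileName(
  outputFileName: string,
  paginationCounter: number = 0,
): string {
  const parts = outputFileName.split(".");
  if (paginationCounter) {
    parts.splice(parts.length - 1, 0, `${paginationCounter}`);
  }
  return parts.join(".");
}

// Hypothetical helper (an assumption, not part of the PR): split a list of
// crawled pages into groups of at most `pagesPerPagination` items.
function chunkPages<T>(pages: T[], pagesPerPagination: number): T[][] {
  if (!Number.isInteger(pagesPerPagination) || pagesPerPagination <= 0) {
    return [pages]; // no valid page size: keep everything in one group
  }
  const chunks: T[][] = [];
  for (let i = 0; i < pages.length; i += pagesPerPagination) {
    chunks.push(pages.slice(i, i + pagesPerPagination));
  }
  return chunks;
}

// Usage: name one output file per group, counting from 1.
const groups = chunkPages(["a", "b", "c", "d", "e"], 2);
const names = groups.map((_, i) => paginatedFileName("output.json", i + 1));
console.log(names); // [ 'output.1.json', 'output.2.json', 'output.3.json' ]
```

One side effect of splicing before the last part: a file name without an extension, such as `output`, would become `1.output` rather than `output.1`.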
This PR includes the following changes: