Adding pagination option #76

Closed
kanehooper wants to merge 2 commits from feature/pagination

Conversation

kanehooper

This PR adds pagination to the crawler. It allows a user to configure how many pages are crawled per pagination batch, and each batch is then saved to a modified version of the output file.

Each output file has a number appended to its name, which is the paginationCounter.

This PR includes the following changes:

  1. Moved all write calls into the crawl function so it can handle pagination.
  2. Updated the filename generation within the write function to handle pagination.
  3. Added pagesPerPagination as an optional configuration option (see the example config below).
  4. Updated the CLI commands with the pagesPerPagination option.
  5. Updated the README to reflect the new config option.
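
For illustration, a config along these lines would enable the new option. This is only a sketch: pagesPerPagination is the field added here, the other fields and the import path follow the existing Config shape, and the URLs are placeholder values.

import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  maxPagesToCrawl: 500,
  outputFileName: "output.json",
  // New and optional: start a new output file every 50 crawled pages,
  // producing output.1.json, output.2.json, ...
  pagesPerPagination: 50,
};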

@maxime4000 left a comment


The changes you've implemented are intriguing and could be particularly beneficial for handling projects with a thousand or more pages. Overall, the modifications seem comprehensive. However, I have reservations about the effectiveness of dividing the output into multiple paginated files. This seems more tailored towards human usability rather than GPT optimization. According to GPT itself, the training efficacy remains consistent whether using one large file or several smaller ones. Therefore, while this approach may not necessarily hinder GPT, it also doesn't guarantee an improvement in its performance.


  outputFileName,
} = options;

// CLI options arrive as strings, so coerce the numeric ones.
// @ts-ignore
const maxPagesToCrawl = parseInt(maxPagesToCrawlStr, 10);

// @ts-ignore
const pagesPerPagination = parseInt(pagesPerPaginationStr, 10);


Are you forcing pagination?

Comment on lines +159 to +165
export async function write(config: Config, paginationCounter: number = 0) {
  configSchema.parse(config);
  // Insert the pagination counter before the file extension,
  // e.g. "output.json" becomes "output.1.json" when paginationCounter is 1.
  let fileNameParts = config.outputFileName.split('.');
  if (paginationCounter) {
    fileNameParts.splice(fileNameParts.length - 1, 0, `${paginationCounter}`);
  }
  const outputFilePath = fileNameParts.join('.');
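
For reference, the filename logic above behaves like this small standalone sketch (paginatedFileName is a hypothetical helper name, not part of the PR):

function paginatedFileName(outputFileName: string, paginationCounter: number = 0): string {
  const parts = outputFileName.split('.');
  if (paginationCounter) {
    parts.splice(parts.length - 1, 0, `${paginationCounter}`);
  }
  return parts.join('.');
}

console.log(paginatedFileName('output.json'));    // "output.json" (a counter of 0 leaves the name unchanged)
console.log(paginatedFileName('output.json', 2)); // "output.2.json"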


While crawl should have the responsibility to write (a change I approve of), I'm not sure the pagination has any real use for GPT. For human readability, yes; for GPT, not really. It might help, it might not... I don't have a definitive answer, but GPT itself says that if the content is the same, multiple smaller files or one big file will be equally effective for training.

kanehooper (Author)

@maxime4000 you are right. It used to be that document size made a difference, but GPT has clearly handled this better now. Thanks for reviewing the PR; I will withdraw it, as I don't think it adds value at this point.

@kanehooper kanehooper closed this Nov 29, 2023
@kanehooper kanehooper deleted the feature/pagination branch November 29, 2023 09:08