
Batch Viewer : Why Sequence Length 2049? #123

Closed
prakharg24 opened this issue Oct 25, 2023 · 15 comments

@prakharg24

prakharg24 commented Oct 25, 2023

Hi,
I am using utils/batch_viewer.py to iterate through Pythia's training data and calculate some batch-level statistics.
First, there are some gaps between the actual code in batch_viewer.py and the usage described in the README (for example, it doesn't take a 'config file' as input, the 'load file' name needs to be supplied separately, etc.). But these differences were obvious enough that I could fix them on my end and run the code.

However, it's the final step of saving the data after loading the buffer that I'm a bit confused about. I have two questions:

  1. Given that each 'sequence' in the dataset is of a different length, can someone confirm that training is performed by simply concatenating the whole dataset into a single stream of tokens and then dividing it into context windows and batches? This would mean that some dataset 'sequences' are broken across context windows or batches, and a single 2048-token context window might contain multiple actual dataset sequences. I believe this is how most LLMs are trained, but I couldn't find the exact details in the paper.
  2. The MMapDataset function attempts to reshape the final concatenated sequence into (-1, 2049). I don't understand why 2049. Isn't the context length supposed to be 2048? I'm new to the specifics of how LLMs are trained, so I may be missing some trivial detail here, but I don't understand how 2048 became 2049.
@haileyschoelkopf
Collaborator

Hi, thanks for your interest!

As a sanity check of the new batch_viewer.py, you can also use the older version here: 899add0. @uSaiPrashanth will be bringing the README documentation in line with this updated version.

  1. Yes, this is correct! We tokenize all documents, shuffle them, and concatenate them, separating documents with a single EOD token. Thus a sample or batch may not start or end with an EOD token, sample boundaries do not respect document boundaries, and we do not avoid cross-attending to different documents within a context window. This is standard for many public LLM training codebases. For the ground truth on this, v1.0 of the NeoX repository is a good reference: https://github.com/EleutherAI/gpt-neox/tree/v1.0

  2. The sequence length is 2049 because the target tokens are the input tokens shifted left by one position (so tokens [0 1 2 3] are seen and used to predict token 4 as the target, and so on). The first 2048 tokens of the 2049-token window (everything except the last token) are used as inputs to the model, and the last 2048 tokens (everything except the first token) are used as targets for calculating loss. Thus we calculate loss on 2048 tokens per sample.
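
In code, the split looks roughly like this (an illustrative sketch, not lifted from the NeoX codebase; the variable names and vocab size are just for the example):

```python
import torch

seq_length = 2048
vocab_size = 50_000  # arbitrary, just for the example

# one length-2049 window as stored in the dataset
sample = torch.randint(0, vocab_size, (seq_length + 1,))

inputs = sample[:-1]   # tokens 0..2047: everything except the last token
targets = sample[1:]   # tokens 1..2048: the inputs shifted left by one position

assert inputs.shape[0] == targets.shape[0] == seq_length
# loss is computed on all 2048 (input, target) pairs in the window
```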

@prakharg24
Author

Hi @haileyschoelkopf, thank you for the response!

One more thing I'd like to clarify: am I correct to assume that the tokenized data downloaded according to the instructions here (https://github.com/EleutherAI/pythia#reproducing-training) is already shuffled?

Simply put, to reproduce the exact batches used during training, I need to:

  1. Load all tokenized documents sequentially from the unsharded dataset created using the instructions in https://github.com/EleutherAI/pythia#reproducing-training
  2. Concatenate them one after the other with EOD tokens in between.
  3. Divide them into sequences of 2049 tokens, and then batches of 1024 sequences.
  4. Finally, if I were to, for example, collect 100,000 such batches in order from the start, I would get the exact data seen by the model checkpoint at step 100000. (Understandably, for the deduplicated data, I'll need to start a 'second epoch' at some point, but I assume that is also handled by simply wrapping around from the end of the dataset back to the start.)

Thank you!!

@itsnamgyu

itsnamgyu commented Dec 30, 2023

I have the same questions as @prakharg24. Specifically:

  1. There are two versions of the deduped, pre-shuffled datasets mentioned in the README. As I understand it:
  • EleutherAI/pythia_deduped_pile_idxmaps contains tokenized documents without any EOD tokens. I've looked at the data and this seems to be the case.
  • EleutherAI/pile-deduped-pythia-preshuffled contains tokenized documents with EOD tokens.

Is this correct?

  2. If the data is divided into 2049-sized sequences naively, then the first token of each sequence will not be seen (as a label) by the model. Is this intended?

Can someone please help with this? @haileyschoelkopf

@sujantkumarkv

sujantkumarkv commented Jan 2, 2024

I have similar doubts regarding the nature of the data as @itsnamgyu and @prakharg24.

If I want to only use a subset (say arXiv only) to train a pythia model, how do I download only those pretokenized data (including EOD tokens)?

Any input is appreciated. cc @haileyschoelkopf @crowsonkb @joshlk

@haileyschoelkopf
Collaborator

Hi @itsnamgyu @prakharg24 hopefully I can answer some of your dataset questions here:

If the data is divided into 2049-sized sequences naively, then the first token of each sequence will not be seen (as a label) by the model. Is this intended?

This is correct. Because we do not train with any BOS token, there is no way for the model to see the first token of a sequence as a label: one cannot feed the empty string into the model (unless it was trained with a BOS token that can act as such; you could attempt to simulate this by passing EOD into the Pythia models, but I am unsure what behavior would result).

If I want to only use a subset (say arXiv only) to train a pythia model, how do I download only those pretokenized data (including EOD tokens)?

@sujantkumarkv unfortunately, when tokenizing the Pile dataset, metadata about subsets is not retained. We don't currently have an easy way to train on, say, only the arXiv subset, and would recommend retokenizing that subset separately using GPT-NeoX's prepare_data.py.

Regarding how to replicate training order:

If using EleutherAI/pile-deduped-pythia-preshuffled: once you've downloaded and combined the shards, loading them with MMapIndexedDataset via the script at https://github.com/EleutherAI/pythia/blob/dc24af59cff8c8159a1d4b106393b39c39a1ef2e/utils/batch_viewer.py will provide a dataset where dataset[i] is a length-2049 sequence of tokens, with EOD tokens separating the end of one document from the beginning of the next.

To access the j-th batch item at step k of training, you can look up dataset[(k * 1024) + j], which should give the context window seen at that position during training.
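
As a rough sketch (assuming dataset has been loaded with MMapIndexedDataset as in the linked batch_viewer.py, and using Pythia's batch size of 1024 sequences):

```python
BATCH_SIZE = 1024  # training batch size, in sequences

def context_window(dataset, step, j):
    """Length-2049 token window that was the j-th item in the batch at training step `step`."""
    return dataset[step * BATCH_SIZE + j]

def batch_at_step(dataset, step):
    """All 1024 context windows consumed at training step `step`."""
    return [dataset[step * BATCH_SIZE + j] for j in range(BATCH_SIZE)]
```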


EleutherAI/pythia_deduped_pile_idxmaps contains binidx files that can be used with the script at https://github.com/EleutherAI/pythia/blob/899add0f1c71cb27dbf5a7594202584416c0b424/utils/batch_viewer.py.

These binidx files contain the tokenized documents, prior to chopping them into the context windows seen during training.

Here, these binidx files must be loaded using megatron.data.gpt2_dataset.GPT2Dataset with the appropriate arguments, in order to perform shuffling via megatron's dataset code (as was done during training) and chop the documents appropriately into context windows.
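
Roughly, that loading looks like the sketch below (argument names follow the GPT2Dataset constructor in GPT-NeoX v1.0 as used by the old batch_viewer.py, but treat the path, seed, and sample count as placeholders to be taken from the actual Pythia config; this is not a drop-in script):

```python
import numpy as np
from megatron.data.indexed_dataset import MMapIndexedDataset
from megatron.data.gpt2_dataset import GPT2Dataset

# placeholder prefix of the downloaded binidx files (path without the .bin/.idx extension)
data_prefix = "path/to/pile_0.87_deduped_text_document"

indexed_dataset = MMapIndexedDataset(data_prefix, skip_warmup=True)
documents = np.arange(len(indexed_dataset), dtype=np.int32)

dataset = GPT2Dataset(
    name="train",
    data_prefix=data_prefix,
    documents=documents,
    indexed_dataset=indexed_dataset,
    num_samples=143000 * 1024,  # total context windows seen in training (placeholder)
    seq_length=2048,            # GPT2Dataset yields seq_length + 1 = 2049 tokens per sample
    seed=1234,                  # must match the seed in the Pythia training config
    build_index_mappings=True,
)
# each item of `dataset` now corresponds to one shuffled 2049-token context window
```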


We've updated the README to hopefully make it clearer how to use the preshuffled binidx files!

If you're looking to reproduce the Pythia training order:

  1. To view the training data contents, we recommend using the preshuffled binidx files with the most up-to-date README and batch_viewer.py. This lets you dump the context windows seen by Pythia directly to disk and is significantly faster than the old batch_viewer.py.

  2. To re-train Pythia, we recommend doing so with v1.0 of the GPT-NeoX library, taking care to use the exact config files we provide for the Pythia models.

I hope that this is helpful!

@itsnamgyu

itsnamgyu commented Jan 9, 2024

Thanks so much for the detailed answer!

Just to clarify for other readers, I've confirmed that EleutherAI/pythia_deduped_pile_idxmaps does not have any EOD tokens (but please comment if I'm wrong).

@pietrolesci

Further to @itsnamgyu's comment, I confirm that pile-deduped-pythia-preshuffled does not have any EOD tokens (I checked a 100k-sample subset; let me know if I missed anything).
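
For anyone wanting to reproduce the check, a sketch along these lines works (the path is a placeholder, and it assumes the merged .bin/.idx files are loaded with the MMapIndexedDataset class from GPT-NeoX, as batch_viewer.py does):

```python
import numpy as np
from transformers import AutoTokenizer
from megatron.data.indexed_dataset import MMapIndexedDataset  # from GPT-NeoX

# <|endoftext|> id for the Pythia / GPT-NeoX-20B tokenizer
eod_id = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m").eos_token_id

# placeholder path prefix for the merged preshuffled .bin/.idx files
dataset = MMapIndexedDataset("path/to/pile-deduped-pythia-preshuffled/document", skip_warmup=True)

# count EOD occurrences in the first 100k context windows
n_eod = sum(int(np.sum(dataset[i] == eod_id)) for i in range(100_000))
print(f"EOD tokens in the first 100k samples: {n_eod}")  # comes out to 0, i.e. no EOD tokens found
```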

@itsnamgyu

@pietrolesci Actually, according to the comment above, pile-deduped-pythia-preshuffled should have EOD tokens while EleutherAI/pythia_deduped_pile_idxmaps does not, so that is contradictory. Are you sure you are referring to pile-deduped-pythia-preshuffled?

If using EleutherAI/pile-deduped-pythia-preshuffled: once you've downloaded and combined the shards, loading them with MMapIndexedDataset via the script at https://github.com/EleutherAI/pythia/blob/dc24af59cff8c8159a1d4b106393b39c39a1ef2e/utils/batch_viewer.py will provide a dataset where dataset[i] is a length-2049 sequence of tokens, with EOD tokens separating the end of one document from the beginning of the next.

Note, batch_viewer.py does not have any code to add EOD tokens.

@itsnamgyu

itsnamgyu commented Jan 10, 2024

@haileyschoelkopf sorry for the trouble, but is EleutherAI/pythia_deduped_pile_idxmaps also pre-shuffled?

You mentioned in your comment:

Here, these binidx files must be loaded using megatron.data.gpt2_dataset.GPT2Dataset with the appropriate arguments, in order to perform shuffling via megatron's dataset code (as was done during training) and chop the documents appropriately into context windows.

whereas https://github.com/EleutherAI/pythia#reproducing-training says about EleutherAI/pythia_deduped_pile_idxmaps

We recommend downloading this rather than retokenizing the Pile from scratch in order to guarantee preservation of the data order seen by the Pythia models

I'm training on EleutherAI/pythia_deduped_pile_idxmaps (while manually injecting EOD), and both (1) some manual inspection and (2) the training loss suggest that it is in fact pre-shuffled.

Related to #127

@pietrolesci

pietrolesci commented Jan 10, 2024

@pietrolesci Actually, according to the comment above, pile-deduped-pythia-preshuffled should have EOD tokens while EleutherAI/pythia_deduped_pile_idxmaps does not, so that is contradictory. Are you sure you are referring to pile-deduped-pythia-preshuffled?

Hi @itsnamgyu, I confirm that -- contrary to what is expected and described in the README -- the pile-deduped-pythia-preshuffled does NOT have an EOD token.

@norabelrose
Member

I also ran into the absence of EOD tokens just now 👀

will ping @haileyschoelkopf

@M-HuangX

I have checked the preshuffled dataset and found that the actual seq_length is 2050, not 2049 as described.

@markschoene

markschoene commented Sep 30, 2024

Any updates on the missing EOD tokens in the past months? Are the https://huggingface.co/datasets/EleutherAI/pile-deduped-pythia-preshuffled files safe to use for pretraining?
@haileyschoelkopf

@markschoene

@pietrolesci Did you make any progress on the EOD tokens, or have any advice to share for successful Pile training?

@pietrolesci

Hi @markschoene,

Thanks for pinging me. I think that for the Pythia runs it is confirmed that no EOD token was added.

Looking at the Pythias, you can have a successful training run with the current pre-tokenised Pile: they are not the best LMs out there, but they work. They underperform models of similar size, but (I think) that's because they are undertrained, especially the big ones, not because of the EOD issue.

Nowadays, depending on your budget, I would train a Llama-style LM (GQA, rotary positional embeddings, etc.) on the FineWeb dataset. If your budget is small, consider the MiniPile or a sample (10BT) from the FineWeb-Edu dataset. In both cases, you can process the dataset from scratch so you have full control. If you really want the Pile with an EOD token, I think your only option is to tokenise it from scratch.

I hope this helps!
