Batch Viewer: Why Sequence Length 2049? #123
Comments
Hi, thanks for your interest! As a sanity check of the new batch_viewer.py, you can also use the older version here: 899add0. @uSaiPrashanth will be bringing the README documentation in line with this updated version.
Hi @haileyschoelkopf, thank you for the response! One more thing I'd like to clarify: am I correct to assume that the tokenized data downloaded according to the instructions here (https://github.com/EleutherAI/pythia#reproducing-training) is already shuffled? Simply put, what exactly do I need to do to reproduce the exact batches used during training?
Thank you!!
I have the same questions as @prakharg24. Specifically:
Is this correct?
Can someone please help with this? @haileyschoelkopf
I have similar doubts regarding the nature of the data as @itsnamgyu @prakharg24. If I want to use only a subset (say arXiv only) to train a Pythia model, how do I download only that pretokenized data (including EOD tokens)? Any input is appreciated. cc @haileyschoelkopf @crowsonkb @joshlk
Hi @itsnamgyu @prakharg24, hopefully I can answer some of your dataset questions here:
This is correct. Because we do not train using any BOS tokens, the model never sees the first token of a sequence as a label: one cannot feed the empty string into a model (unless it was trained with a BOS token that can act as such; you could attempt to simulate this by passing EOD into the Pythia models, but I am unsure what behavior that would produce).
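As a rough illustration of why each stored sequence is 2049 tokens long when the model's context length is 2048, here is a minimal sketch of the one-token shift used for next-token prediction. The variable names are illustrative only, not taken from batch_viewer.py:

```python
import numpy as np

# A stored training sequence: 2049 token ids (2048-token context + 1).
seq = np.arange(2049)  # stand-in for real token ids

# Next-token prediction uses a one-token shift, so 2049 stored tokens
# yield exactly 2048 (input, label) positions.
inputs = seq[:-1]   # tokens 0 .. 2047, fed to the model
labels = seq[1:]    # tokens 1 .. 2048, what the model is trained to predict

assert len(inputs) == len(labels) == 2048
# Note: seq[0] appears only as an input, never as a label -- which is why,
# without a BOS token, the first token of a sequence is never predicted.
```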
@sujantkumarkv unfortunately, when tokenizing the Pile dataset, metadata about subsets is not retained. We don't currently have an easy way to train on only, say, the arXiv subset, and would recommend retokenizing that subset separately using GPT-NeoX's prepare_data.py.

Regarding how to replicate the training order: the binidx files contain the tokenized documents prior to chopping them into the context windows seen during training, and they must be loaded with the memory-mapped dataset loader used by batch_viewer.py rather than read as raw binary. We've updated the README to hopefully make it clearer how to use the preshuffled binidx files if you're looking to reproduce the Pythia training order.
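A minimal sketch of reading a merged, preshuffled token file sequentially to recover training batches. It assumes the merged file is a flat array of uint16 token ids laid out as consecutive 2049-token sequences in training order, and that the batch size was 1024 sequences per step; the file name is a placeholder, and the real files may need to go through GPT-NeoX's MMapIndexedDataset loader instead of a raw memmap:

```python
import numpy as np

SEQ_LEN = 2049       # stored sequence length (2048-token context + 1)
BATCH_SIZE = 1024    # sequences per training step (Pythia's documented batch size)

# Placeholder path for the merged preshuffled token file
# (e.g. the output of utils/unshard_memmap.py; exact name may differ).
TOKEN_FILE = "pile_deduped_preshuffled_document.bin"

# Assumption: token ids are stored as uint16 (the GPT-NeoX vocabulary fits
# in 16 bits). If the data carries an .idx header, use the indexed-dataset
# loader from the repo instead of a raw memmap.
tokens = np.memmap(TOKEN_FILE, dtype=np.uint16, mode="r")

n_seqs = len(tokens) // SEQ_LEN
seqs = tokens[: n_seqs * SEQ_LEN].reshape(n_seqs, SEQ_LEN)

def get_batch(step: int) -> np.ndarray:
    """Return the (BATCH_SIZE, SEQ_LEN) slice seen at a given training step,
    assuming the preshuffled file is already in training order."""
    start = step * BATCH_SIZE
    return seqs[start : start + BATCH_SIZE]

first_batch = get_batch(0)
print(first_batch.shape)  # (1024, 2049) if the assumptions hold
```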
I hope that this is helpful!
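For the earlier question about training on only one Pile subset (e.g. arXiv): a rough filtering sketch over the raw Pile jsonl before retokenizing. It assumes each raw record carries a meta.pile_set_name field with the value "ArXiv" for that subset; check the actual field names and values before relying on this, and the paths are placeholders:

```python
import json

# Placeholder paths; the raw Pile is sharded into many .jsonl files in practice.
RAW_PILE = "00.jsonl"
ARXIV_ONLY = "pile_arxiv_only.jsonl"

# Assumed record shape: {"text": "...", "meta": {"pile_set_name": "ArXiv"}}
with open(RAW_PILE) as src, open(ARXIV_ONLY, "w") as dst:
    for line in src:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == "ArXiv":
            dst.write(line)

# The filtered jsonl can then be retokenized separately (e.g. with
# GPT-NeoX's prepare_data.py) to produce bin/idx files for an
# arXiv-only training run.
```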
Thanks so much for the detailed answer! Just to clarify for other readers, I've confirmed that
Further to @itsnamgyu's comment, I confirm that
@pietrolesci Actually, according to the comment above,
Note,
@haileyschoelkopf sorry for the trouble, but is
You mentioned in your comment,
whereas https://github.com/EleutherAI/pythia#reproducing-training says
I'm training on
Related to #127
Hi @itsnamgyu, I confirm that -- contrary to what is expected and described in the README -- the pre-tokenized data does not contain EOD tokens.
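For anyone wanting to reproduce this check, a small sketch that scans a token dump for EOD tokens. It assumes the data can be memmapped as uint16 and that EOD is the Pythia tokenizer's <|endoftext|> token; the file name is a placeholder:

```python
import numpy as np
from transformers import AutoTokenizer

# The Pythia tokenizer; its EOD token is <|endoftext|>.
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
eod_id = tok.eos_token_id

# Placeholder path; assumes a flat uint16 token dump (see earlier caveats).
tokens = np.memmap("pile_deduped_preshuffled_document.bin", dtype=np.uint16, mode="r")

# Count EOD occurrences in a manageable prefix to avoid scanning the whole file.
sample = tokens[:100_000_000]
n_eod = int((sample == eod_id).sum())
print(f"EOD tokens in first {len(sample):,} tokens: {n_eod}")
# If this prints 0, the dump indeed has no document separators.
```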
I also ran into the absence of EOD tokens just now 👀 will ping @haileyschoelkopf |
I have checked the preshuffled dataset and found that the actual seq_length is 2050, not 2049 as described.
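One way to sanity-check the stored sequence length is to see which candidate length divides the total token count evenly. A rough sketch, under the same flat-uint16 assumption as above and with a placeholder file name:

```python
import os
import numpy as np

TOKEN_FILE = "pile_deduped_preshuffled_document.bin"  # placeholder

n_bytes = os.path.getsize(TOKEN_FILE)
n_tokens = n_bytes // np.dtype(np.uint16).itemsize  # 2 bytes per token id

for seq_len in (2048, 2049, 2050):
    n_seqs, remainder = divmod(n_tokens, seq_len)
    print(f"seq_len={seq_len}: {n_seqs:,} sequences, {remainder} tokens left over")
# A zero remainder is consistent with (but does not prove) that sequence length.
```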
Any updates on the missing EOD tokens in the past months? Are the https://huggingface.co/datasets/EleutherAI/pile-deduped-pythia-preshuffled files safe to use for pretraining?
@pietrolesci Did you make any progress on the EOD tokens, or have advice to share for successful Pile training?
Hi @markschoene, thanks for pinging me. I think that for the Pythia runs it is confirmed that no EOD token was added. Looking at the Pythias, you can have a successful training run with the current pre-tokenised Pile: they are not the best LMs out there, but they work. They underperform models of similar size, but (I think) that is because they are undertrained, especially the big ones, not because of the EOD issue.

Depending on your budget, nowadays I would train a Llama-based LM (GQA, rotary positional embeddings, etc.) on the fineweb dataset. If your budget is small, consider the minipile or the 10BT sample of the fineweb-edu dataset. In both cases, you can process the dataset from scratch, so you have full control. If you really want the Pile and an EOD token, I think your only option is to tokenise it from scratch. I hope this helps!
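If you do go the tokenise-from-scratch route, here is a minimal sketch of appending an EOD token at document boundaries and packing into 2049-token sequences. It uses the Pythia tokenizer from Hugging Face and an in-memory list of documents purely for illustration; a real run would stream the Pile jsonl and go through GPT-NeoX's preprocessing scripts:

```python
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
eod_id = tok.eos_token_id  # <|endoftext|> in the GPT-NeoX tokenizer

SEQ_LEN = 2049  # 2048-token context + 1 for the shifted labels

# Illustrative documents; in practice, stream the Pile jsonl instead.
documents = ["first document text ...", "second document text ..."]

token_stream = []
for doc in documents:
    token_stream.extend(tok(doc)["input_ids"])
    token_stream.append(eod_id)  # the document separator the original dump lacks

# Pack the stream into fixed-length training sequences, dropping the remainder.
# With real data n_seqs will be large; with these toy documents it is 0.
n_seqs = len(token_stream) // SEQ_LEN
packed = np.array(token_stream[: n_seqs * SEQ_LEN], dtype=np.uint16).reshape(n_seqs, SEQ_LEN)
```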
Hi,
I am using utils/batch_viewer.py to iterate through Pythia's training data and calculate some batch-level statistics.
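For context, a rough sketch of the kind of batch-level statistics pass this describes, assuming batches are already available as integer arrays of shape (1024, 2049); how they are loaded is covered elsewhere in this thread, and the statistics chosen here are just examples:

```python
import numpy as np

def batch_stats(batch: np.ndarray) -> dict:
    """Example statistics for one (batch_size, seq_len) array of token ids."""
    return {
        "shape": batch.shape,
        "unique_tokens": int(np.unique(batch).size),
        "mean_token_id": float(batch.mean()),
    }

# Example: two fake batches standing in for real training batches.
rng = np.random.default_rng(0)
fake_batches = [rng.integers(0, 50_000, size=(1024, 2049)) for _ in range(2)]

for step, batch in enumerate(fake_batches):
    print(step, batch_stats(batch))
```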
First, there are some gaps between the actual code in batch_viewer.py and what the README describes (for example, it doesn't take any 'config file' as input, and the 'load file' name needs to be supplied separately). These differences were obvious enough that I could fix them on my end and run the code.
However, it's the final step of saving the data after loading the buffer that I'm a bit confused about. I have two questions: