This repository showcases language model pretraining with the awesome TensorFlow Model Garden library.
The following LMs are currently supported (a minimal experiment config sketch follows this list):
- BERT Pretraining - see pretraining instructions
- Token Dropping for efficient BERT Pretraining - see pretraining instructions
- Training ELECTRA Augmented with Multi-word Selection (TEAMS) - see pretraining instructions
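For orientation, here is a minimal sketch of how a pretraining experiment configuration can be instantiated with the TF Model Garden library. This is not the exact entry point used by the linked pretraining instructions: it assumes the `tf-models-official` package is installed and that the `bert/pretraining` experiment name is registered, and the input path is a placeholder.

```python
from official.common import registry_imports  # noqa: F401 -- registers the bundled experiment configs
from official.core import exp_factory

# Build the default BERT pretraining experiment config, then override a few fields.
config = exp_factory.get_exp_config("bert/pretraining")  # assumed experiment name
config.task.train_data.input_path = "gs://my-bucket/pretraining/*.tfrecord"  # placeholder path
config.trainer.train_steps = 1_000_000

print(config.as_dict())
```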
Additionally, the following features are provided:
- A cheatsheet for TPU VM creation (including all necessary dependencies to pretrain models with the TF Model Garden library), which can be found here.
- An extended pretraining data generation script that allows, for example, the use of tokenizers from the Hugging Face Model Hub or different data packing strategies (original BERT packing or RoBERTa-like packing), which can be found here; a tokenizer loading sketch follows this list.
- Conversion scripts that convert TF Model Garden weights to Hugging Face Transformers-compatible models, which can be found here.
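As a rough illustration of the tokenizer side of the data generation script, the sketch below loads a WordPiece tokenizer from the Hugging Face Model Hub; the `bert-base-uncased` model ID is only a stand-in, not necessarily what the script uses by default.

```python
from transformers import AutoTokenizer

# Any tokenizer hosted on the Hugging Face Model Hub can be pulled by its model ID.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in model ID

# Token IDs produced this way are what the packing strategies operate on.
ids = tokenizer("Pretraining data generation with a Hub tokenizer.")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```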
The following LMs were pretrained on the 10BT subset of the famous FineWeb and FineWeb-Edu datasets (a loading example follows this list):
- BERT-based - find the best model checkpoint here
- Token Dropping BERT-based - find the best model checkpoint here
- TEAMS-based - find the best model checkpoint here
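Because the conversion scripts export Hugging Face Transformers-compatible weights, these checkpoints can be loaded directly with Transformers. In the sketch below the model ID is a placeholder; the exact repository names are listed in the Model Hub organization and collection mentioned below.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "model-garden-lms/bert-base-finewebs"  # placeholder ID -- check the Model Hub collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Quick sanity check: run a masked-token input through the model.
inputs = tokenizer("TensorFlow Model Garden is a [MASK] library.", return_tensors="pt")
outputs = model(**inputs)
```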
All models can be found in the TensorFlow Model Garden LMs organization on the Model Hub and in this collection.
Detailed evaluation results with the ScandEval library are available in this repository.
This repository is the outcome of the last two years of working with TPUs from the awesome TRC program and the TensorFlow Model Garden library.
Made from Bavarian Oberland with ❤️ and 🥨.