Can I use real dataset for GPT2-gemini training? #2649
-
Hi, I want to use a real dataset for GPT-2 Gemini training. Can I use wikitext, or a sample dataset like the one Titans uses? If so, can I use this code? I think I need to change gemini/commons/utils.py in order to use other datasets, but I'm new to preprocessing language data... Thank you!
Replies: 2 comments 8 replies
-
Why doesn't this work for me?
Hi, of course you can use a real dataset for training. The code under the example directory only demonstrates how to use Gemini in your applications. Because dataset loading differs from project to project, we use dummy data there.
You can refer to the following instructions to prepare a Webtext dataset.
https://github.com/hpcaitech/ColossalAI-Examples/blob/main/language/gpt/README.md#how-to-prepare-webtext-dataset
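For anyone new to language-data preprocessing, here is a minimal sketch of the general idea, not the code from the example itself. It assumes the dummy-data helper in gemini/commons/utils.py yields `(input_ids, attention_mask)` batches of a fixed sequence length, so a real dataset only has to produce batches of the same shape. `toy_tokenize`, `chunk_tokens` and `make_batches` are hypothetical names used for illustration; in practice you would replace the toy tokenizer with a real GPT-2 tokenizer and convert the lists to tensors on your device.

```python
from typing import Dict, Iterator, List, Tuple


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Stand-in tokenizer (assumption): one integer id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def chunk_tokens(token_ids: List[int], seq_len: int) -> List[List[int]]:
    """Cut one long token stream into full blocks of seq_len; the remainder is dropped."""
    n_blocks = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_blocks)]


def make_batches(
    blocks: List[List[int]], batch_size: int
) -> Iterator[Tuple[List[List[int]], List[List[int]]]]:
    """Yield (input_ids, attention_mask) pairs; only full batches are produced."""
    for start in range(0, len(blocks) - batch_size + 1, batch_size):
        input_ids = blocks[start:start + batch_size]
        # Every block is full length, so no padding is needed and the mask is all ones.
        attention_mask = [[1] * len(row) for row in input_ids]
        yield input_ids, attention_mask


# Tiny in-memory corpus standing in for wikitext or Webtext lines.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps while the fox runs away",
]

vocab: Dict[str, int] = {}
stream: List[int] = []
for line in corpus:
    stream.extend(toy_tokenize(line, vocab))

blocks = chunk_tokens(stream, seq_len=4)
all_batches = list(make_batches(blocks, batch_size=2))
```

Concatenating all documents into one stream and cutting it into equal blocks avoids padding entirely, which is why the attention mask here is all ones, the same as it would be for randomly generated dummy data.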