Make llama dataset preparation more generic. #32
base: rocm_dev
Conversation
if [[ $MODEL_NAME == "llama2" ]]; then
    TOKENIZER_MODEL_PATH=https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
elif [[ $MODEL_NAME == "llama3" ]]; then
    TOKENIZER_MODEL_PATH=meta-llama/Llama-3.1-8B
Do we need the full path here (like for llama2)?
Could you address the comments?
fi

python3 prepare_bookcorpus_megatron_dataset.py --out-dir ${DATA_PATH}
python3 tools/preprocess_data.py --input ${DATA_PATH}/bookcorpus_megatron.json --tokenizer-type GPTSentencePieceTokenizer \
python3 ../../tools/preprocess_data.py --input ${DATA_PATH}/bookcorpus_megatron.json --tokenizer-type GPTSentencePieceTokenizer \
This assumes the script is run from its own directory. A user might also run it from another directory, such as the root of the repo. Could you get the directory of this script, use it as the base directory, and make the paths relative to that base directory?
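For illustration, a minimal sketch of that suggestion. It assumes the script sits two levels below the repo root, as the `../../` in the diff implies; the names `SCRIPT_DIR`, `REPO_ROOT`, and `PREPROCESS_SCRIPT` are hypothetical and not taken from the PR.

```bash
#!/usr/bin/env bash
# Directory that contains this script, independent of the caller's working directory.
# "$0" works when the script is executed; "${BASH_SOURCE[0]}" is a common bash-only
# alternative that also works when the script is sourced.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

# Assumption: the script lives two levels below the repo root (mirrors ../../ in the diff).
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"

# Hypothetical variable: the path is built from the repo root, not from the current directory.
PREPROCESS_SCRIPT="${REPO_ROOT}/tools/preprocess_data.py"

echo "script dir: ${SCRIPT_DIR}"
echo "preprocess script: ${PREPROCESS_SCRIPT}"
```

The `python3` call could then use `"${PREPROCESS_SCRIPT}"` in place of the relative `../../tools/preprocess_data.py`, so it would behave the same from the repo root or from the script's own directory.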
No description provided.