Make llama dataset preparation more generic. #32

Open
wants to merge 4 commits into
base: rocm_dev

Conversation

lcskrishna
Collaborator

No description provided.

gargrahul changed the title from "Make llama dataset preparatio n more generic." to "Make llama dataset preparation more generic." on Dec 10, 2024
if [[ $MODEL_NAME == "llama2" ]]; then
    TOKENIZER_MODEL_PATH=https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
elif [[ $MODEL_NAME == "llama3" ]]; then
    TOKENIZER_MODEL_PATH=meta-llama/Llama-3.1-8B
Collaborator


Do we need the full path here (like for llama2)?
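The two branches above hold different kinds of value: a direct file URL for llama2 and what looks like a Hugging Face repo ID for llama3. The sketch below shows one way a script could tell them apart. It assumes, without confirmation from this PR, that downstream code needs to treat the two differently; `tokenizer_path_kind` is a hypothetical helper, not part of the repository.

```bash
# Hypothetical helper (not in the PR): classify a tokenizer path value.
tokenizer_path_kind() {
  case "$1" in
    http://*|https://*) echo "url" ;;         # direct download link, as in the llama2 branch
    /*|./*|../*)        echo "local-path" ;;  # file already on disk
    */*)                echo "repo-id" ;;     # assumption: "org/name" is a Hugging Face repo ID
    *)                  echo "unknown" ;;
  esac
}

tokenizer_path_kind "https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model"
tokenizer_path_kind "meta-llama/Llama-3.1-8B"
```

With the two values from the diff, this prints `url` and then `repo-id`, which makes the inconsistency the reviewer is asking about explicit.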

Collaborator

wenchenvincent left a comment


Could you address the comments?

fi

python3 prepare_bookcorpus_megatron_dataset.py --out-dir ${DATA_PATH}
- python3 tools/preprocess_data.py --input ${DATA_PATH}/bookcorpus_megatron.json --tokenizer-type GPTSentencePieceTokenizer \
+ python3 ../../tools/preprocess_data.py --input ${DATA_PATH}/bookcorpus_megatron.json --tokenizer-type GPTSentencePieceTokenizer \
Collaborator


This assumes that the script is run from its own directory. Users may also run it from other directories, such as the root of the repo. Could you get the directory of this script, use it as the base directory, and make the paths relative to that base directory?
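A minimal sketch of what this suggestion could look like in bash. The variable names `SCRIPT_DIR`, `REPO_ROOT`, and `PREPROCESS_SCRIPT` are illustrative, and the assumption that the script sits two levels below the repo root is inferred from the `../../tools` path in the diff, not confirmed by the PR.

```bash
#!/usr/bin/env bash
# Sketch only: resolve paths relative to this script, not the caller's working directory.
set -euo pipefail

# Absolute path of the directory containing this script.
# Falls back to $0 when BASH_SOURCE is unset (e.g. code piped to bash).
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"

# Assumption: the script lives two levels below the repo root.
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"

# Build the tool path from the repo root so the call works from any directory.
PREPROCESS_SCRIPT="${REPO_ROOT}/tools/preprocess_data.py"
echo "Would run: python3 ${PREPROCESS_SCRIPT}"
```

Because both paths are derived from the script's own location, the same command resolves to the same file whether the user starts from the repo root or from the script's directory.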

2 participants