Make llama dataset preparation more generic. #32

lcskrishna · 2024-12-10T09:37:16Z

No description provided.

wenchenvincent · 2024-12-10T15:44:28Z

examples/llama/prepare_dataset.sh

+if [[ $MODEL_NAME == "llama2" ]]; then
+    TOKENIZER_MODEL_PATH=https://huggingface.co/NousResearch/Llama-2-7b-chat-hf/resolve/main/tokenizer.model
+elif [[ $MODEL_NAME == "llama3" ]]; then
+    TOKENIZER_MODEL_PATH=meta-llama/Llama-3.1-8B


Do we need the full path here (like for llama2)?

wenchenvincent

Could you address the comments?

wenchenvincent · 2024-12-10T15:46:25Z

examples/llama/prepare_dataset.sh

 fi

 python3 prepare_bookcorpus_megatron_dataset.py --out-dir ${DATA_PATH}
-python3 tools/preprocess_data.py --input ${DATA_PATH}/bookcorpus_megatron.json  --tokenizer-type GPTSentencePieceTokenizer \
+python3 ../../tools/preprocess_data.py --input ${DATA_PATH}/bookcorpus_megatron.json  --tokenizer-type GPTSentencePieceTokenizer \


This assumes the script is run from this directory. Users may also run it from other directories, such as the root directory of the repo. Could you get the directory of this script as the base directory and then use paths relative to that base directory?
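
A minimal sketch of this suggestion, assuming the script is run with bash and stays at examples/llama/ (two levels below the repo root). The variable names SCRIPT_DIR, REPO_ROOT, and PREPROCESS_SCRIPT are illustrative and do not come from the PR:

```shell
#!/usr/bin/env bash
# Resolve the directory that contains this script, independent of the
# caller's current working directory.
SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"

# Assumption: the script lives in examples/llama/, so the repo root is
# two levels up from SCRIPT_DIR.
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../.." && pwd)"

# Hypothetical variable holding the absolute path to the preprocessing tool.
PREPROCESS_SCRIPT="${REPO_ROOT}/tools/preprocess_data.py"

echo "script dir:        ${SCRIPT_DIR}"
echo "preprocess script: ${PREPROCESS_SCRIPT}"

# The existing invocations could then use these paths, for example:
#   python3 "${SCRIPT_DIR}/prepare_bookcorpus_megatron_dataset.py" --out-dir "${DATA_PATH}"
#   python3 "${PREPROCESS_SCRIPT}" --input "${DATA_PATH}/bookcorpus_megatron.json" ...
```

With paths anchored this way, the script should behave the same whether it is launched from examples/llama/ or from the repo root.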

lcskrishna added 4 commits December 10, 2024 00:55

make the dataset prep of llama as generic

5969e31

update data path for prepare dataset

6b93069

make prepare dataset path more generic

d32e112

update preprocess path

6da7091

lcskrishna requested review from gargrahul and wenchenvincent December 10, 2024 09:37

gargrahul changed the title ~~Make llama dataset preparatio n more generic.~~ Make llama dataset preparation more generic. Dec 10, 2024

wenchenvincent reviewed Dec 10, 2024

wenchenvincent requested changes Dec 10, 2024
