When I use the k8s sample example for LoRA fine-tuning with the Llama 3 8B model, it works fine. But with the 70B model it fails with an out-of-memory (OOM) error.

Total number of GPUs: 8 x Gaudi3
Dataset: databricks-dolly-15k

Error:
```
wnloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Downloading shards: 100%|██████████| 30/30 [1:32:34<00:00, 185.16s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [00:02<00:00, 12.13it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:04<00:00, 6.99it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:04<00:00, 7.43it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:03<00:00, 7.51it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:03<00:00, 7.53it/s]
Loading checkpoint shards: 100%|██████████| 30/30 [00:04<00:00, 6.27it/s]
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:01:54,538] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2831
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:01:54,700] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2832
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3323.03 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3233.18 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3234.02 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3215.13 examples/s]
Map: 100%|██████████| 14411/14411 [00:04<00:00, 3205.00 examples/s]
Map: 100%|██████████| 600/600 [00:00<00:00, 4528.23 examples/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 9.71MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 23.9MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 19.7MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 22.7MB/s]
Downloading builder script: 100%|██████████| 4.20k/4.20k [00:00<00:00, 21.8MB/s]
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:10,224] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2833
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: trainable params: 16,384,000 || all params: 70,570,090,496 || trainable%: 0.0232
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:24,041] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2834
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:39,007] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2835
gaudi-llm-ds-ft-worker-0: [2024-09-11 23:02:39,008] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2836
gaudi-llm-ds-ft-worker-0: [rank7]: Traceback (most recent call last):
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 935, in <module>
gaudi-llm-ds-ft-worker-0: [rank7]:     main()
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 891, in main
gaudi-llm-ds-ft-worker-0: [rank7]:     trainer = GaudiTrainer(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 216, in __init__
gaudi-llm-ds-ft-worker-0: [rank7]:     super().__init__(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 535, in __init__
gaudi-llm-ds-ft-worker-0: [rank7]:     self._move_model_to_device(model, args.device)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 299, in _move_model_to_device
gaudi-llm-ds-ft-worker-0: [rank7]:     model = model.to(device)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 179, in wrapped_to
gaudi-llm-ds-ft-worker-0: [rank7]:     result = self.original_to(*args, **kwargs)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
gaudi-llm-ds-ft-worker-0: [rank7]:     return self._apply(convert)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank7]:   [Previous line repeated 4 more times]
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
gaudi-llm-ds-ft-worker-0: [rank7]:     param_applied = fn(param)
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
gaudi-llm-ds-ft-worker-0: [rank7]:     return t.to(
gaudi-llm-ds-ft-worker-0: [rank7]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 57, in __torch_function__
gaudi-llm-ds-ft-worker-0: [rank7]:     return super().__torch_function__(func, types, new_args, kwargs)
gaudi-llm-ds-ft-worker-0: [rank7]: RuntimeError: [Rank:7] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::469762048 (448)MB
gaudi-llm-ds-ft-worker-0: Internal Error: Received signal - Segmentation fault
gaudi-llm-ds-ft-worker-0: [rank6]: Traceback (most recent call last):
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 935, in <module>
gaudi-llm-ds-ft-worker-0: [rank6]:     main()
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/optimum-habana/examples/language-modeling/run_lora_clm.py", line 891, in main
gaudi-llm-ds-ft-worker-0: [rank6]:     trainer = GaudiTrainer(
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 216, in __init__
gaudi-llm-ds-ft-worker-0: [rank6]:     super().__init__(
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 535, in __init__
gaudi-llm-ds-ft-worker-0: [rank6]:     self._move_model_to_device(model, args.device)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 299, in _move_model_to_device
gaudi-llm-ds-ft-worker-0: [rank6]:     model = model.to(device)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 179, in wrapped_to
gaudi-llm-ds-ft-worker-0: [rank6]:     result = self.original_to(*args, **kwargs)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1176, in to
gaudi-llm-ds-ft-worker-0: [rank6]:     return self._apply(convert)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 779, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     module._apply(fn)
gaudi-llm-ds-ft-worker-0: [rank6]:   [Previous line repeated 4 more times]
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 804, in _apply
gaudi-llm-ds-ft-worker-0: [rank6]:     param_applied = fn(param)
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1162, in convert
gaudi-llm-ds-ft-worker-0: [rank6]:     return t.to(
gaudi-llm-ds-ft-worker-0: [rank6]:   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 57, in __torch_function__
gaudi-llm-ds-ft-worker-0: [rank6]:     return super().__torch_function__(func, types, new_args, kwargs)
gaudi-llm-ds-ft-worker-0: [rank6]: RuntimeError: [Rank:6] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::469762048 (448)MB
gaudi-llm-ds-ft-worker-0: Internal Error: Received signal - Segmentation fault
```
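For context on why this fails inside `GaudiTrainer.__init__` at `model.to(device)`: a rough back-of-the-envelope estimate (my own sketch; the 70B parameter count is taken from the log above, while bf16 weights, a round ~8e9 parameters for the 8B model, and ~128 GB of HBM per Gaudi3 card are assumptions) suggests the unsharded 70B weights alone do not fit on a single card, whereas the 8B weights do:

```python
# Back-of-the-envelope weight-memory estimate (sketch, not taken from the repo).
# Assumptions: bf16 weights (2 bytes/param), ~128 GB of HBM per Gaudi3 card.
# The 70B parameter count comes from the log above; ~8e9 is used for the 8B model.
BYTES_PER_PARAM_BF16 = 2
HBM_PER_CARD_GB = 128  # approximate Gaudi3 HBM capacity (assumption)

def weight_memory_gb(num_params: int) -> float:
    """GB needed just to hold the weights, before activations or optimizer state."""
    return num_params * BYTES_PER_PARAM_BF16 / 1e9

for name, params in [("Llama 3 8B (approx.)", 8_000_000_000),
                     ("Llama 3.1 70B (from log)", 70_570_090_496)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights "
          f"vs ~{HBM_PER_CARD_GB} GB HBM per card")

# ~16 GB for 8B fits on one card, but ~141 GB for 70B does not, so each rank
# OOMs when the trainer moves the full, unsharded model to its device. The 70B
# weights would need to be sharded across the 8 cards (e.g. DeepSpeed ZeRO-3)
# before that move.
```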
Expected behavior

Successfully run model customization of the Llama 3.1 70B model.
@premmotgi If I understand correctly, you're trying to replicate https://github.com/huggingface/optimum-habana/blob/main/examples/kubernetes/ci/multi-card-lora-clm-values.yaml with Llama 3.1 70B, right?