AMD (MI250X) support #1775
Conversation
(force-pushed from d8e45cf to 03b33db)
Thanks for the PR. Just reading through your comments, the premise is great actually. Before merging, we need to check (with some concrete numbers) that this doesn't impact CUDA performance though, but it sounds like it may actually even be a positive impact here 😊.
Yes, could you please revert those style edits back? It makes reviewing the code a bit harder because there are so many additional changes. I also had many discussions with colleagues on that internally, and the general preference seems to be not to use a linter like black.
Sounds good, I will reproduce some of the config hub fine-tunes on CUDA and A6000s, but I don't have any hardware running that would be a good test for multi-GPU and multi-node FSDP at the moment.
Thanks! And no worries, I can help with multi-GPU comparisons on CUDA.
(force-pushed from 34d8cfc to 89aa792)
Rebased and reverted the import style back to single-line, as discussed. Please have a look at the perf benchmarks below (NVIDIA GeForce RTX 4090, with 0.5.0). New SOTA clearly established 😆
(force-pushed from 3a6462c to feabc9a)
Nice, thanks for the numbers! I will also do a run on multi-GPU to confirm, but this looks awesome!
May I ask what the accelerated FSDP method is? Is this vanilla FSDP from PyTorch or some other method? Just curious if there's maybe some trick that we can add here to make it even faster.
Is FSDP ever vanilla? 😅 I cannot share the accelerate configs nor more details on that unfortunately, but I will share some stats from Llama 70B and 405B fine-tunes, if I get them to run with litgpt on larger datasets.
(force-pushed from f94ed3a to feabc9a)
- …clarification comment in test
- Revert failover to cpu in build_rope_cache when device is None
- Add test fixture for amd multigpu xgmi and nvidia dualgpu nvlink. Update tests to use fixtures
- Use fixtures for device properties. Follow existing style in fixture order.
- Mock subprocess.run for new tests
- Use real device names in mocks
- Remove redundant mocks
- Remove warning print, revert import sorting to previous style
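The fixture-and-mock approach described in these commits can be sketched roughly as follows; the helper `detect_interconnect`, the fixture name, and the `rocm-smi --showtopotype` output string are assumptions for illustration, not the PR's actual code:

```python
# Minimal sketch: mock subprocess.run so interconnect-topology tests run
# without real AMD (XGMI) or NVIDIA (NVLink) hardware. Names and output
# formats below are illustrative assumptions, not litgpt's actual code.
import subprocess
from unittest import mock

import pytest


def detect_interconnect() -> str:
    """Hypothetical helper: ask rocm-smi which link type connects the GPUs."""
    result = subprocess.run(
        ["rocm-smi", "--showtopotype"], capture_output=True, text=True, check=True
    )
    return "xgmi" if "XGMI" in result.stdout.upper() else "pcie"


@pytest.fixture
def amd_multigpu_xgmi(monkeypatch):
    """Pretend rocm-smi reports an XGMI-linked multi-GPU node (made-up output)."""
    fake = subprocess.CompletedProcess(
        args=["rocm-smi", "--showtopotype"],
        returncode=0,
        stdout="GPU0 GPU1 XGMI\nGPU1 GPU0 XGMI\n",
    )
    monkeypatch.setattr(subprocess, "run", mock.Mock(return_value=fake))


def test_detects_xgmi(amd_multigpu_xgmi):
    assert detect_interconnect() == "xgmi"
```

Mocking `subprocess.run` this way lets the XGMI/NVLink detection paths be exercised on any machine, which matches the intent of the commits above.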
(force-pushed from c1bae5e to c29666a)
@TensorTemplar I reverted all the style changes in the non-relevant code and just saw that your push overrode these again 😅. Could you please roll those back to make the reviewing easier and focus the PR just on the RoPE improvements and NVLink/AMD tests?
Sorry, I thought I'd squash some of the commits and merge the new tests, and overrode your changes without checking that you had committed into my branch. Style changes are now reverted.
No worries, and thanks for updating that! Actually, I was trying to run the code before and after your PR on an 8xA100 machine and noticed issues with the code before your PR:
adjusted_theta[mask_low_freq] = theta[mask_low_freq] / factor
File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/torch/_meta_registrations.py", line 2883, in meta_index_Tensor
nonzero = index.nonzero()
File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
return func(*args, **kwargs)
NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no abstract impl or Meta kernel registered.
I must admit that I didn't run the updated RoPE code in multi-GPU settings, only single GPU training. So yes, we should definitely merge your PR 😊
Oh, I speculated this was an issue with missing custom CUDA kernels on AMD, since I didn't see this on my local CUDA test machines.
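For what it's worth, the failure above should be reproducible without any AMD hardware: boolean-mask indexing lowers to `aten::nonzero`, which has no meta-device kernel. A minimal sketch, with illustrative names and shapes rather than the actual litgpt code:

```python
# Sketch: why the pre-PR boolean-mask indexing fails when the RoPE cache is
# built under the meta device (e.g. when device=None falls through to meta).
import torch

with torch.device("meta"):
    theta = torch.arange(8, dtype=torch.float32)
    mask_low_freq = theta < 4.0
    adjusted_theta = theta.clone()
    # Boolean indexing needs aten::nonzero, which has no meta kernel,
    # so this line raises NotImplementedError as in the traceback above.
    adjusted_theta[mask_low_freq] = theta[mask_low_freq] / 8.0
```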
Looks all good to me now! Thanks so much again for the PR!
PR adds AMD support via the following:
- Uses `rocm-smi --showtopotype` instead of the nvidia-smi equivalent for topology detection on AMD.
- Rewrote `build_rope_cache` to use a vectorized approach instead of indexing into the nonzero mask (see the sketch after this description). I did not bench performance vs. main, but it should be quicker as well, as a bonus. The previous code did not work on my hardware, probably due to device detection issues, and resulted in `build_rope_cache` receiving `device=None`, which then tries the nonzero indexing on the `meta` device and fails because of missing custom kernels 🤷‍♂️
- Added debug prints in `build_rope_cache`, since there was no logger (will remove once reviewed; if you have preferences on how to log, let me know).
- Reformatted the edited files via isort and black - let me know if you prefer the unlinted versions. Alternatively, we could add the optional pre-commit in a separate PR.
- Test results on my machine (Linux with a 4090):

Testing `finetune_lora` with Llama 3.1 8B on an 8xMI250X node, and it seems to run so far with ~65-75% utilization. I am frankly quite shocked this runs quicker than our accelerate FSDP out of the box :)
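As a concrete illustration of the vectorized approach mentioned above, here is a minimal sketch; the variable names follow the traceback discussed in the review thread, and this is not necessarily the exact code that was merged:

```python
import torch


def adjust_theta(theta: torch.Tensor, factor: float, mask_low_freq: torch.Tensor) -> torch.Tensor:
    # Old approach (fails on the meta device, since boolean indexing needs aten::nonzero):
    #     adjusted_theta[mask_low_freq] = theta[mask_low_freq] / factor
    # Vectorized: elementwise select with static shapes, so it also works on meta tensors.
    return torch.where(mask_low_freq, theta / factor, theta)
```

Because `torch.where` never materializes a data-dependent index set, the same code path works on CUDA, ROCm, CPU, and meta tensors.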