AMD (MI250X) support #1775
Conversation
(force-pushed from d8e45cf to 03b33db)
Thanks for the PR. Just reading through your comments, the premise is great actually. Before merging, we need to check (with some concrete numbers) that this doesn't impact CUDA performance though, but it sounds like it may actually even be a positive impact here 😊.
Yes, could you please revert those style edits back? It makes reviewing the code a bit harder because there are so many additional changes. I also had many discussions with colleagues on that internally, and the general preference seems to be not to use a linter like black.
Sounds good, I will reproduce some of the config hub fine-tunes on CUDA and A6000s, but I don't have any hardware running that would be a good test for multi-GPU and multi-node FSDP at the moment.
Thanks! And no worries, I can help with multi-GPU comparisons on CUDA.
(force-pushed from 34d8cfc to 89aa792)
Rebased and reverted the import style back to single-line, as discussed. Please have a look at the perf benchmarks below (NVIDIA GeForce RTX 4090, with 0.5.0). New SOTA clearly established 😆
(force-pushed from 3a6462c to feabc9a)
Nice, thanks for the numbers! I will also do a run on multi-GPU to confirm, but this looks awesome!
May I ask what the accelerated FSDP method is? Is this vanilla FSDP from PyTorch or some other method? Just curious if there's maybe some trick that we can add here to make it even faster.
Is FSDP ever vanilla? 😅 I cannot share the accelerate configs nor more details on that unfortunately, but I will share some stats from Llama 70B and 405B fine-tunes, if I get them to run with litgpt on larger datasets.
(force-pushed from f94ed3a to feabc9a)
- …clarification comment in test
- Revert failover to cpu in build_rope_cache when device is None
- Add test fixture for amd multigpu xgmi and nvidia dualgpu nvlink. Update tests to use fixtures
- Use fixtures for device properties. Follow existing style in fixture order.
- Mock subprocess.run for new tests
- Use real device names in mocks
- Remove redundant mocks
- Remove warning print, revert import sorting to previous style
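The fixture-and-mock approach described in these commits can be sketched roughly as follows; the helper `detect_interconnect`, the fixture name, and the `rocm-smi --showtopotype` output string are assumptions for illustration, not the PR's actual code:

```python
# Minimal sketch: mock subprocess.run so interconnect-topology tests run
# without real AMD (XGMI) or NVIDIA (NVLink) hardware. Names and output
# formats below are illustrative assumptions, not litgpt's actual code.
import subprocess
from unittest import mock

import pytest


def detect_interconnect() -> str:
    """Hypothetical helper: ask rocm-smi which link type connects the GPUs."""
    result = subprocess.run(
        ["rocm-smi", "--showtopotype"], capture_output=True, text=True, check=True
    )
    return "xgmi" if "XGMI" in result.stdout.upper() else "pcie"


@pytest.fixture
def amd_multigpu_xgmi(monkeypatch):
    """Pretend rocm-smi reports an XGMI-linked multi-GPU node (made-up output)."""
    fake = subprocess.CompletedProcess(
        args=["rocm-smi", "--showtopotype"],
        returncode=0,
        stdout="GPU0 GPU1 XGMI\nGPU1 GPU0 XGMI\n",
    )
    monkeypatch.setattr(subprocess, "run", mock.Mock(return_value=fake))


def test_detects_xgmi(amd_multigpu_xgmi):
    assert detect_interconnect() == "xgmi"
```

Mocking `subprocess.run` this way lets the XGMI/NVLink detection paths be exercised on any machine, which matches the intent of the commits above.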
(force-pushed from c1bae5e to c29666a)
@TensorTemplar I reverted all the style changes in the non-relevant code and just saw that your push overrode these again 😅. Could you please roll those back to make the reviewing easier and focus the PR just on the RoPE improvements and NVLink/AMD tests?
Sorry, I thought I'd squash some of the commits and merge the new tests, and overrode your changes without checking that you had committed into my branch. Style changes are now reverted.
No worries, and thanks for updating that! Actually, I was trying to run the code before and after your PR on an 8xA100 machine and noticed issues with the code before your PR:
adjusted_theta[mask_low_freq] = theta[mask_low_freq] / factor
File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/torch/_meta_registrations.py", line 2883, in meta_index_Tensor
nonzero = index.nonzero()
File "/home/sebastian/miniforge3/envs/litgp2/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
return func(*args, **kwargs)
NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no abstract impl or Meta kernel registered.
I must admit that I didn't run the updated RoPE code in multi-GPU settings, only single GPU training. So yes, we should definitely merge your PR 😊
Oh, I speculated this was an issue with missing custom CUDA kernels on AMD, since I didn't see this on my local CUDA test machines.
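For what it's worth, the failure above should be reproducible without any AMD hardware: boolean-mask indexing lowers to `aten::nonzero`, which has no meta-device kernel. A minimal sketch, with illustrative names and shapes rather than the actual litgpt code:

```python
# Sketch: why the pre-PR boolean-mask indexing fails when the RoPE cache is
# built under the meta device (e.g. when device=None falls through to meta).
import torch

with torch.device("meta"):
    theta = torch.arange(8, dtype=torch.float32)
    mask_low_freq = theta < 4.0
    adjusted_theta = theta.clone()
    # Boolean indexing needs aten::nonzero, which has no meta kernel,
    # so this line raises NotImplementedError as in the traceback above.
    adjusted_theta[mask_low_freq] = theta[mask_low_freq] / 8.0
```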
Looks all good to me now! Thanks so much again for the PR!
PR adds AMD support via the following:
- Uses `rocm-smi --showtopotype` instead of the nvidia-smi equivalent for topology detection on AMD.
- Rewrote `build_rope_cache` to use a vectorized approach instead of indexing into the nonzero mask (see the sketch after this description). I did not bench performance vs. main, but it should be quicker as well, as a bonus. The previous code did not work on my hardware, probably due to device detection issues, and resulted in `build_rope_cache` receiving `device=None`, which then tries the nonzero indexing on the `meta` device and fails because of missing custom kernels 🤷‍♂️
- Added debug prints in `build_rope_cache`, since there was no logger (will remove once reviewed; if you have preferences on how to log, let me know).
- Reformatted the edited files via isort and black - let me know if you prefer the unlinted versions. Alternatively, we could add the optional pre-commit in a separate PR.
- Test results on my machine (Linux with a 4090):

Testing `finetune_lora` with Llama 3.1 8B on an 8xMI250X node, and it seems to run so far with ~65-75% utilization. I am frankly quite shocked this runs quicker than our accelerate FSDP out of the box :)
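As a concrete illustration of the vectorized approach mentioned above, here is a minimal sketch; the variable names follow the traceback discussed in the review thread, and this is not necessarily the exact code that was merged:

```python
import torch


def adjust_theta(theta: torch.Tensor, factor: float, mask_low_freq: torch.Tensor) -> torch.Tensor:
    # Old approach (fails on the meta device, since boolean indexing needs aten::nonzero):
    #     adjusted_theta[mask_low_freq] = theta[mask_low_freq] / factor
    # Vectorized: elementwise select with static shapes, so it also works on meta tensors.
    return torch.where(mask_low_freq, theta / factor, theta)
```

Because `torch.where` never materializes a data-dependent index set, the same code path works on CUDA, ROCm, CPU, and meta tensors.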