-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU support #69
base: dev
Are you sure you want to change the base?
GPU support #69
Conversation
bf607f4
to
fb8dee1
Compare
Rewrite momenta function without broadcast, because GPUs do not like broadcasts in their kernels. This is essentially ported from QEDjl-project/QEDprocesses.jl#69 since this implementation moved here since that PR was opened.
This builds in the CI with 1.10 and rc, while not executing the GPU tests (because there's no GPU on the runners):
This is by design. To actually use these tests, we would need runners that actually have GPUs, then they would automatically run the tests too. However, versions 1.6 - 1.9 fail because of dependency issues. Currently, the GPU tests are just normal tests and in the test environment, only packages that are installed in the project can be loaded. This means that whether or not the tests will be run, the dependencies are in the Project.toml and have to be installed when running any tests at all. I'm not really sure how to solve this. We could add another file such as |
Can we maybe disable tests and the import command via environment variable, like I did it in this Python project: https://github.com/alpaka-group/bashi/blob/c0b673eb1ecff92bde3c3bb89c277104cdbedde8/tests/test_generate_combination_list.py#L186 CUDA.jl also provides a method to detect GPU's: https://cuda.juliagpu.org/stable/installation/conditional/ |
No, we can't use either of these, because the problem doesn't happen while executing the tests (the tests already are skipped when the libraries are non-functional for whatever reason). |
Looks like CUDA.jl and AMDGPU.jl are hard dependencies for the test, or isn't it? If yes, can remove it from the |
At least as far as I'm aware this does not work. I tried having AMDGPU.jl globally installed but not in the test dependencies, and then load it inside the test. But it just says it's not installed. |
For reference, the part that actually fails is not the
|
It seems like the compat entry for AMDGPU is too restrictive. Anyway, since AMDGPU itself only supports Julia versions >=1.10, I suggest to drop the support for AMDGPU for all julia versions below 1.10 as well. Therefore, we only need to test versions, which already pass the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far, here are some comments from my side.
return nothing | ||
end | ||
|
||
PROC_DEF_TUPLES = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you iterate over all spin/pol combinations here using Iterators.product
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can easily do that, the problem just becomes execution time. Testing so many GPU kernels takes a long time, so even with just the 17 total cases it now already takes ~2.5 minutes on my machine. This is fine for now I think, but we might have to reconsider some numbers in the future when the tests get even more extensive.
I opened a discourse thread about our testing problem, but there doesn't really seem to be an answer (yet) It seems the only real option is to more or less setup GPU tests with PackageExtensions manually in the CI by loading the necessary packages only on julia versions and runners/architectures where they will compile. |
Since we have now dropped Julia versions 1.9 and lower, we can properly use package extensions and load AMDGPU and CUDA properly, even when no supported GPU is available. Currently, the tests will simply not run when no GPU is found. So if we get a runner with a GPU and run unit tests on that, it should run the tests. |
What do you think how we should proceed with this PR from here @szabo137 ? |
As discussed offline, we should keep this, at least, as a testing field for the integration of GPU tests. However, maybe we should think about having such a testing branch upstream with less actual functionality. Maybe GPU-tests for CuArrays of |
I'm a bit confused why the GPU tests fail. The AMDGPU tests crash violently in the CI, but pass fine on my own AMDGPU. The CUDA tests seem to compute incorrect results, which also doesn't really make sense to me. |
It looks like a compiler and not a rutime error. This is strange. The only explanation which I have is, that it cannot correctly detect the GPU maybe some features are not enabled. Can you reproduce it in a container? Have a look in the generated yaml code to reproduce it locally (search for the job unit_test_julia_amdgpu_1_11): https://gitlab.com/hzdr/qedjl-project/QEDprocesses-jl/-/jobs/8517785749 |
I'm not sure how to do that, maybe you could show me some time? It still seems strange, because if the problem is in the runner, then shouldn't the GPU tests on QEDbase have failed too? |
309fa8c
to
2de3cf7
Compare
This PR adds GPU support. It adds tests that run conditionally on AMDGPU.jl and/or CUDA.jl, depending on whether they are functional on the machine we're running on.
This PR depends on the fix QEDjl-project/QEDbase.jl#64 and is rebased to #68.
Left to do:
_total_probability
function which currently does not work on GPU because ofquadgk