CUDA Error 717: Invalid Address Space -- Atomic operation not supported on stack memory. #428
Shared memory is fine; the warning (which we should probably make optional or disable) is that you now need twice the shared memory, to hold the derivatives as well. |
Great! Thanks for the clarification! |
Alright, working with the code now, I found 3 errors:
Stacktrace for 1:
Removing the sqrt gives the following error:
After halving the shmem size, I then got an issue with
Just tracking my progress a bit |
Thanks for looking at this. The
I guess this means that shared memory should be created over the function boundary, like the following (which works):

```julia
using CUDA, Enzyme, Test

function mul_kernel(A, shared)
    i = threadIdx().x
    if i <= length(A)
        shared[i] = A[i] * A[i]
        A[i] = shared[i]
    end
    return nothing
end

function grad_mul_kernel(A, dA)
    # Allocate the shared memory and its shadow outside the differentiated
    # function, then pass both through as a Duplicated argument.
    shared = CuStaticSharedArray(Float32, 64)
    d_shared = CuStaticSharedArray(Float32, 64)
    Enzyme.autodiff_deferred(mul_kernel, Const, Duplicated(A, dA), Duplicated(shared, d_shared))
    return nothing
end

A = CUDA.ones(64)
@cuda threads=length(A) mul_kernel(A, similar(A))

A = CUDA.ones(64)
dA = similar(A)
dA .= 1
@cuda threads=length(A) grad_mul_kernel(A, dA)
@test all(dA .== 2)
```
|
Shared memory should also work if declared within the kernel. |
Yeah static shared memory should be fine, dynamic shared memory is problematic since that is a launch parameter |
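For reference, a minimal sketch of the two forms, assuming CUDA.jl's `CuStaticSharedArray`/`CuDynamicSharedArray` API; the kernel names here are illustrative only:

```julia
using CUDA

function static_kernel(A)
    s = CuStaticSharedArray(Float32, 64)  # size fixed at compile time
    i = threadIdx().x
    s[i] = A[i] * A[i]
    A[i] = s[i]
    return nothing
end

function dynamic_kernel(A)
    s = CuDynamicSharedArray(Float32, length(A))  # size set at launch
    i = threadIdx().x
    s[i] = A[i] * A[i]
    A[i] = s[i]
    return nothing
end

A = CUDA.ones(64)
@cuda threads=64 static_kernel(A)
# The dynamic size is a launch parameter, which is why Enzyme cannot
# see how much shadow shared memory to allocate for the derivatives:
@cuda threads=64 shmem=64*sizeof(Float32) dynamic_kernel(A)
```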
After messing around with this some more, I don't really think the first 2 errors I listed are all that important.
After messing with the kernel a bit more, I found the following lines to be problematic:
When commenting out everything past these lines in the kernel, I get the following error:
This has the same segfault as before, so it might be the cause of the other error. Changing the lines as below fixes the issue:
Still working on it / trying to make a Molly-free MWE |
Addressing the
These work:
These do not:
Reminder, this is the original code block:
Note that the kernels without Enzyme work with no problems and also work if |
Yeah, this sounds like an LLVM codegen bug where we synthesize an instruction that is not valid in this context. Can you post the MWE with the minimized example? |
Still working on that... I'll post it as soon as possible |
Ok, it's not perfect, but I stripped the code down a bit to generate a similar (but different) error:
Note that:
The error this time is:
With the following stacktrace upon leaving Julia (CUDA finalization):
|
```julia
using CUDA, Enzyme

Enzyme.API.printall!(true)

struct Atom
    ϵ::Float32
end

function kernel!(atoms_var, n)
    atoms = atoms_var # CUDA.Const(atoms_var)
    # Core.Intrinsics.llvmcall(("declare void @llvm.assume(i1)
    # define void @f(i64 %a) alwaysinline {
    #     %z = icmp sgt i64 %a, 0
    #     call void @llvm.assume(i1 %z)
    #     ret void
    # }", "f"), Nothing, Tuple{Int64}, n)
    j = 0::Int64
    # for j = 1:length(atoms)
    while j < n
        a1 = @inbounds atoms[1].ϵ
        a2 = @inbounds atoms[j+1].ϵ
        o = a1 * a2
        @inbounds atoms[j+1] = Atom(o)
        j += 1
    end
    # sync_threads()
    return nothing
end

function grad_kernel!(atoms, datoms)
    Enzyme.autodiff_deferred(
        kernel!,
        Duplicated(atoms, datoms),
        Const(length(atoms)::Int64)
    )
    return
end

n_atoms = 1024
atoms = [Atom(Float32(0.2 + rand() * 0.1)) for _ in 1:n_atoms]
cu_atoms = CuArray(atoms)
d_cu_atoms = copy(cu_atoms)

n_threads = 256
n_blocks = 4

CUDA.@sync @cuda threads=n_threads blocks=n_blocks kernel!(cu_atoms, length(atoms)::Int64)
println("Kernel worked")

CUDA.@sync @cuda threads=n_threads blocks=n_blocks grad_kernel!(cu_atoms, d_cu_atoms)
println("Grad kernel worked")
```
|
Looks like you're exhausting the ability of malloc on the GPU to provide cache memory for your code. |
Brief update from last week. As it turns out, I created a MWE for the wrong error. To fix error 700, you need to increase the malloc heap size with:
Putting this in your code before the kernel launch will fix the MWE posted before. The more concerning issue is error 717, which I am still attempting to create a MWE for. |
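The exact line used in the thread is not shown above; a sketch of what such a call looks like, assuming CUDA.jl's `limit!` API:

```julia
using CUDA

# Raise the device-side malloc heap before any kernel launch.
# 1 GiB here is an illustrative value, not the size used in the thread.
CUDA.limit!(CUDA.LIMIT_MALLOC_HEAP_SIZE, 1 * 1024^3)
```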
Well, good news: 717 will actually trigger on the previous MWE once the line from the previous comment is added to it:
|
@leios I cannot reproduce the error with that code locally. What version of Julia and Enzyme are you on (commit as relevant)? Furthermore, would you be able to simplify the code in
Perhaps also get rid of const? Finally, for debugging's sake, can you post the log when you add Enzyme.API.printall!(true)? |
Right, sorry for the missing information. GPUs I have tested on:
I'll focus on the P100 error for now, or maybe try to find a case that also errors on the V100? I was able to get consistent errors on the P100 and GTX 1080 with the following example:
Here is the Enzyme print and error for the P100:
|
Can you attempt to get rid of the broadcasts (those hit recursive and allocating functions)? |
Ah, it seems to be the broadcast causing the error. This kernel errors:
This one does not:
So I guess the short-term fix here is to just avoid broadcasts and set the malloc heap size to be high? |
I mean, we still want to figure out the root cause (avoiding broadcasts isn't necessarily a fix, though since broadcasting allocates, avoiding it can be good practice). What if you manually create a new array of the appropriate size and use a for loop to set it with the elems of c2-c1? |
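A minimal sketch of that workaround, assuming the c1/c2 in question are small fixed-size coordinate vectors (the exact types in the original kernel are not shown in this thread):

```julia
using StaticArrays

# Illustrative stand-ins for the c1/c2 coordinates from the comment above.
c1 = SVector{3, Float32}(1.0f0, 2.0f0, 3.0f0)
c2 = SVector{3, Float32}(4.0f0, 5.0f0, 6.0f0)

# Instead of the allocating broadcast `dr = c2 .- c1`, create the array
# up front and fill it element-wise with a plain for loop:
dr = MVector{3, Float32}(undef)
for k in 1:3
    @inbounds dr[k] = c2[k] - c1[k]
end
```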
Alternatively, if that doesn't work, you can perhaps inline the definition of the various broadcast calls (I find @edit / @which, etc. very useful in those circumstances)? |
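For instance, a quick way to locate the broadcast methods to inline by hand (the c1/c2 values here are placeholders):

```julia
using InteractiveUtils  # provides @which and @edit

c1 = rand(Float32, 3)
c2 = rand(Float32, 3)

# Show which method the lazy broadcast call dispatches to:
@which Base.Broadcast.broadcasted(-, c2, c1)
# Open the source of the materialization step in an editor:
@edit Base.Broadcast.materialize(Base.Broadcast.broadcasted(-, c2, c1))
```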
Ok, working on it... From what I can tell, it's breaking on
Relevant functions in Base.Broadcast:
Relevant functions in
Attempted functions:
Smallest force function:
Enzyme print:
Segfault:
What happens on the cpu:
|
What version of Enzyme/Julia are you on? Can you retry with main? |
Well, just updated to the new main (again) and now things work for the MWE posted above. My current `st` (package status):
Sorry for the spam. I'll keep messing with it. Thanks for the patience here! |
Thank you for your patience! |
Ok I finally managed to reproduce... This didn't occur on the A100 I was trying earlier. |
Rather weird... |
Ah without
|
Today I learned about:
Which gives a device core dump:
So:
|
According to https://www.olcf.ornl.gov/wp-content/uploads/2019/12/05_Atomics_Reductions_Warp_Shuffle.pdf page 7
So the atomic add targets offsets from a base ptr passed into the arguments.
So what is
So my hypothesis here is that CUDA does not allow atomic_add on stack pointers on Pascal (but it may be fine on later generations?). Enzyme unconditionally emits an atomic add since we don't know that the value is a stack pointer. When everything is inlined, all is good. Side-note
Billy and I confirmed the hypothesis. We
In newer HW generations this operation is legal... The conclusion here is that this is very hard to fix in Enzyme. There is a potential fix to propagate information about whether pointers are thread-local, but that is rather speculative and a large change. Workarounds are:
Inlining the function is probably the best option anyway, since that will drastically improve performance. |
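A sketch of the shape of the problem and the inlining workaround, with illustrative names (this is not the actual generated code from the thread):

```julia
# When a helper like this is NOT inlined, Enzyme cannot prove that `buf`
# points at thread-local memory, so the reverse pass accumulates into the
# shadow of `buf` with an atomic add; on Pascal, an atomic on an address
# that turns out to be stack memory raises CUDA error 717.
@noinline function accumulate_noinline!(buf, i, v)
    @inbounds buf[i] += v
    return nothing
end

# Forcing inlining keeps the pointer's origin visible to the compiler,
# so no atomic through an opaque (possibly stack) pointer is generated.
@inline function accumulate_inline!(buf, i, v)
    @inbounds buf[i] += v
    return nothing
end
```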
Ok, I think we have partial solutions to everything.
If I am missing any highlights, be sure to add them, but I think this issue can be closed for now? |
Let's leave the issue open though, to remind us and in case we can implement a proper fix eventually |
Thanks everyone for your work on this. I have the kernel working with the suggestions above. |
I am on Julia 1.8.0, Enzyme main (71ebce9) and CUDA 3.12.0. The following CUDA kernel runs but the gradient kernel with Enzyme errors.
The output is:
Changing the

```julia
dx, dy, dz = shared_fs[1, si], shared_fs[2, si], shared_fs[3, si]
```

line to

```julia
dx, dy, dz = 1.0f0, 1.0f0, 1.0f0
```

makes this not error, suggesting a problem when updating an array using the shared memory. I realise that I should be using atomics at the

```julia
forces[1, i] -= dx
```

lines, but that errored due to #421. This may be a separate error though, since the following simple test works, suggesting that the above should also work.
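For reference, a minimal sketch of the atomic form of that force update, assuming `forces` is a 3×N array and each thread handles one column; this is the pattern that hit #421 at the time:

```julia
using CUDA

function force_kernel!(forces, dxs)
    i = threadIdx().x
    if i <= size(forces, 2)
        # Atomic subtraction into global memory; with overlapping indices
        # this avoids the race that a plain `forces[1, i] -= dx` would have.
        CUDA.@atomic forces[1, i] -= dxs[i]
    end
    return nothing
end

forces = CUDA.zeros(Float32, 3, 64)
dxs = CUDA.ones(Float32, 64)
@cuda threads=64 force_kernel!(forces, dxs)
```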