Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--check-bounds=no is broken on Julia 1.9.0-beta3 #354

Closed
pxl-th opened this issue Jan 20, 2023 · 19 comments · Fixed by #358
Closed

--check-bounds=no is broken on Julia 1.9.0-beta3 #354

pxl-th opened this issue Jan 20, 2023 · 19 comments · Fixed by #358

Comments

@pxl-th
Copy link
Member

pxl-th commented Jan 20, 2023

--check-bounds=no seems to be broken on Julia 1.9.0-beta3.

MWE:

Run below script with --check-bounds=no

using AMDGPU

function memset_kerr!(x, v::Float32)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    x[i] = v
    return nothing
end

function main()
    x = ROCArray{Float32}(undef, 1024 * 1024)
    wait(@roc groupsize=256 gridsize=length(x) memset_kerr!(x, 1f0))
    return nothing
end
main()

Error:

ERROR: LoadError: InvalidIRError: compiling kernel #memset_kerr!(AMDGPU.Device.ROCDeviceVector{Float32, 1}, Float32) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_apply_type)
Stacktrace:
 [1] Val
   @ ./essentials.jl:801
 [2] #setindex!
   @ ~/.julia/dev/AMDGPU/src/device/gcn/array.jl:85
 [3] memset_kerr!
   @ ~/.julia/dev/gcn_test.jl:5
Reason: unsupported call to an unknown function (call to ijl_new_structv)
Stacktrace:
 [1] Val
   @ ./essentials.jl:799
 [2] Val
   @ ./essentials.jl:801
 [3] #setindex!
   @ ~/.julia/dev/AMDGPU/src/device/gcn/array.jl:85
 [4] memset_kerr!
   @ ~/.julia/dev/gcn_test.jl:5
Reason: unsupported dynamic function invocation (call to unsafe_store!)
Stacktrace:
 [1] #setindex!
   @ ~/.julia/dev/AMDGPU/src/device/gcn/array.jl:85
 [2] memset_kerr!
   @ ~/.julia/dev/gcn_test.jl:5
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(memset_kerr!), Tuple{AMDGPU.Device.ROCDeviceVector{Float32, 1}, Float32}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/qdoh1/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/qdoh1/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/qdoh1/src/driver.jl:416 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/qdoh1/src/utils.jl:83
  [6] (::AMDGPU.Compiler.var"#33#36"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(memset_kerr!), Tuple{AMDGPU.Device.ROCDeviceVector{Float32, 1}, Float32}}}, Core.MethodInstance})(ctx::LLVM.ThreadSafeContext)
    @ AMDGPU.Compiler ~/.julia/dev/AMDGPU/src/compiler.jl:182
  [7] LLVM.ThreadSafeContext(f::AMDGPU.Compiler.var"#33#36"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(memset_kerr!), Tuple{AMDGPU.Device.ROCDeviceVector{Float32, 1}, Float32}}}, Core.MethodInstance})
    @ LLVM ~/.julia/packages/LLVM/qc3sa/src/executionengine/ts_module.jl:14
  [8] JuliaContext(f::AMDGPU.Compiler.var"#33#36"{GPUCompiler.CompilerJob{GPUCompiler.GCNCompilerTarget, AMDGPU.Compiler.ROCCompilerParams, GPUCompiler.FunctionSpec{typeof(memset_kerr!), Tuple{AMDGPU.Device.ROCDeviceVector{Float32, 1}, Float32}}}, Core.MethodInstance})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/qdoh1/src/driver.jl:74
  [9] rocfunction_compile(job::GPUCompiler.CompilerJob)
    @ AMDGPU.Compiler ~/.julia/dev/AMDGPU/src/compiler.jl:178
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(AMDGPU.Compiler.rocfunction_compile), linker::typeof(AMDGPU.Compiler.rocfunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/qdoh1/src/cache.jl:90
 [11] rocfunction(f::typeof(memset_kerr!), tt::Type; name::Nothing, device::ROCDevice, global_hooks::NamedTuple{(), Tuple{}})
    @ AMDGPU.Compiler ~/.julia/dev/AMDGPU/src/compiler.jl:164
 [12] rocfunction
    @ ~/.julia/dev/AMDGPU/src/compiler.jl:154 [inlined]
 [13] macro expansion
    @ ~/.julia/dev/AMDGPU/src/highlevel.jl:430 [inlined]
 [14] memset!(x::ROCVector{Float32}, v::Float32)
    @ Main ~/.julia/dev/gcn_test.jl:10
 [15] main()
    @ Main ~/.julia/dev/gcn_test.jl:25
 [16] top-level scope
    @ ~/.julia/dev/gcn_test.jl:35
@pxl-th pxl-th changed the title --check-bounds=no is broken --check-bounds=no is broken on Julia 1.9.0-beta3 Jan 20, 2023
@luraess
Copy link
Contributor

luraess commented Jan 23, 2023

Is only the check-bounds breaking or @inbounds as well?

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

Is only the check-bounds breaking or @inbounds as well?

From the quick testing I've done, no. Only with --check-bounds=no.

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

I think I'll move this issue to GPUCompiler, as it is probably its responsibility

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

Turns out, this is probably AMDGPU's fault, as it passes alignment argument to Base.unsafe_store! which it does not accept:

Base.unsafe_store!(pointer(A), x, index, Val(align))

Removing Val(align) from Base.unsafe_store! & Base.unsafe_load fixes the issue.

@jpsamaroo is it the right move?

@maleadt
Copy link
Member

maleadt commented Jan 23, 2023

Turns out, this is probably AMDGPU's fault, as it passes alignment argument to Base.unsafe_store! which it does not accept:

That is not true, as explained in JuliaGPU/GPUCompiler.jl#387 (comment).

Just don't use --check-bounds=no, see JuliaLang/julia#48245.

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

see JuliaLang/julia#48245.

Yes, I've seen it. For me, however, it consistently results in better performance (especially in things like Nerf.jl).
Especially for GPU kernels. Even this example above results in ~2x improvement:
@btime memset!($x, $v)

  • with bounds checking: 219.725 μs (91 allocations: 2.94 KiB)
  • without bounds checking: 101.210 μs (119 allocations: 3.89 KiB)

If there were a way to switch between that for GPU kernels it'd be great. Or is there?
Like for development you'd use boundschecking, but then disable it for max performance.

@maleadt
Copy link
Member

maleadt commented Jan 23, 2023

Yeah, GPUs are sensitive to the branches introduced by bounds checking. We'll probably need a GPUCompiler.jl-based solution, but that may take a while. For the time being, --check-bounds=no on 1.9+ sadly means that constant propagation gets significantly worse, resulting in it being unusable for GPU code.

That said, if you write GPU kernels you should generally be using @inbounds a lot, because yo pass bounds to the launch configuration, and/or your kernels starts with a manual bounds check. So in that regard, a large performance difference when disabling bounds checks globally may indicate that there's something wrong with your kernels.

@luraess
Copy link
Contributor

luraess commented Jan 23, 2023

Is the perf improvement you see between --check-bounds=no and @inbounds or vs keeping bound check @pxl-th ?

I really like that approach also:

Like for development you'd use boundschecking, but then disable it for max performance.

Sure, one could rather use @inbounds but this may lead to add bugs especially when writing a lot of new code all the time. Maybe there could be a way to activate bound check while having @inbounds for debug and dev purpose?

@maleadt
Copy link
Member

maleadt commented Jan 23, 2023

Maybe there could be a way to activate bound check while having @inbounds for debug and dev purpose?

That's --check-bounds=yes, which Pkg.test sets.

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

Is the perf improvement you see between --check-bounds=no and @inbounds or vs keeping bound check @pxl-th ?

This is between check-bounds=no and --check-bounds=auto, the kernel did not have @inbounds.

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

That said, if you write GPU kernels you should generally be using @inbounds a lot

Agree, but then the code becomes kind of polluted with them, especially if the kernel is more complicated.

Much easier is to have some kind of switch to disable them entirely, like -O3 flag.

@maleadt
Copy link
Member

maleadt commented Jan 23, 2023

Agree, but then the code becomes kind of polluted with them, especially if the kernel is more complicated.

What's wrong with:

@inbounds if is_in_bounds(i)
...
end

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

Marking alignment function as @generated solves the issue.

What's wrong with:

Nothing, but also requires one to think if it is propagated properly. And still is more complex than having a switch for that.

@omlins
Copy link
Contributor

omlins commented Jan 23, 2023

I would totally agree with @pxl-th on this.
The feature check-bounds=no is one of the beautiful things of Julia in my opinion: it allows to develop code with the benefit of bounds checking, and once everything works as it should, with a single switch we can remove the bounds checking not necessary anymore to have it run faster. I think there is a large community that will strongly appreciate if this feature can be conserved. :)

@maleadt
Copy link
Member

maleadt commented Jan 23, 2023

And still is more complex than having a switch for that.

A switch that introduces the risk of your entire session crashing when there's an inadvertent OOB access somewhere.

The argument is that it doesn't make sense for the entire session, and instead should be a fine-grained flag. For example, @cuda checkbounds=false; I'd be happy to accept a PR adding such a mode.

@pxl-th
Copy link
Member Author

pxl-th commented Jan 23, 2023

For example, @cuda checkbounds=false

Yes, something like that would be great.

@omlins
Copy link
Contributor

omlins commented Jan 23, 2023

A switch that introduces the risk of your entire session crashing when there's an inadvertent OOB access somewhere.

A switch that you only use if you're sure that your code will not crash (as for example in a production run of a scientific application)! :) Different people have different ways of working!

@vchuravy
Copy link
Member

How can you be sure that there is no oob memory access anywhere in the code? You might hit a new code path or another part of the code correctly handles bounds checking and starts doing the wrong thing when you turn it off globally.

I am strong proponent that like fast math flags like this need to be locally scoped.

@omlins
Copy link
Contributor

omlins commented Jan 23, 2023

How can you be sure that there is no oob memory access anywhere in the code? You might hit a new code path

Of course you can never be totally sure; so whether it makes sense or not to have a global switch to remove bounds checking comes exactly down to what you are bringing up here: do we have a high chance to hit a new code path that might fail?
... and the answer is "no" for the way we are developing scientific applications. It is clear that for other people the way of developing applications is different, and in particular for most computer scientists as you and @maleadt. So I understand very much that for you personally it doesn't make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants