Unsupported Val in CUDA kernel - Enzyme v0.10.0 - GPUCompiler v0.15 #358

Closed · luciano-drozda opened this issue Jun 5, 2022 · 3 comments

luciano-drozda commented:

MWE:

using CUDA
using Enzyme

if has_cuda()
  @info "CUDA is on"
  CUDA.allowscalar(false)
end

# Empty device kernel taking a compile-time size via Val.
function kernel!(u, ::Val{n}) where {n}
  return nothing
end # kernel!

# Device-side reverse-mode differentiation of kernel!.
function dkernel!(du, ::Val{n}) where {n}
  Enzyme.autodiff_deferred(kernel!, Const, du, Val(n))
  return nothing
end # dkernel!

function call_dkernel()
  n    = 10
  u    = rand(n) |> cu
  dzdu = rand(n) |> cu
  du   = Duplicated(u, dzdu)  # primal and shadow (adjoint seed) arrays
  @cuda threads=4 dkernel!(du, Val(n))
end # call_dkernel

call_dkernel()

The output:

[ Info: CUDA is on
ERROR: LoadError: InvalidIRError: compiling kernel #dkernel!(Duplicated{CuDeviceVector{Float32, 1}}, Val{10}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_getfield)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:29
 [2] iterate
   @ ./tuple.jl:69
 [3] same_or_one
   @ /scratch/drozda/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:203
 [4] autodiff_deferred
   @ /scratch/drozda/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:429
 [5] dkernel!
   @ /scratch/drozda/test.jl:16
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(dkernel!), Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}}}, args::LLVM.Module)
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
  [2] macro expansion
    @ /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
  [3] macro expansion
    @ /scratch/drozda/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
  [7] #224
    @ /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(dkernel!), Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}}}})
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [11] cufunction(f::typeof(dkernel!), tt::Type{Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(dkernel!), tt::Type{Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}})
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
 [13] macro expansion
    @ /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [14] call_dkernel()
    @ Main /scratch/drozda/test.jl:27
 [15] top-level scope
    @ /scratch/drozda/test.jl:31
 [16] include(fname::String)
    @ Base.MainInclude ./client.jl:451
 [17] top-level scope
    @ REPL[2]:1
 [18] top-level scope
    @ /scratch/drozda/.julia/packages/CUDA/GGwVa/src/initialization.jl:52
in expression starting at /scratch/drozda/test.jl:31
wsmoses (Member) commented Jun 6, 2022:

@vchuravy it looks like the same_or_one function used to deduce the BatchDuplicated width isn't actually getting inlined and is type-unstable. Can you look into it?
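
For context, a minimal sketch of the failing pattern (a hypothetical reduction, not Enzyme's actual source): indexing a heterogeneous tuple with a runtime integer makes getindex return a Union, which lowers to the dynamic jl_f_getfield call that GPUCompiler rejects.

# Hypothetical reduction of the instability; AbstractFloat stands in
# for the real isa Duplicated / isa BatchDuplicated checks.
function count_matches(args::Tuple)
  n = 0
  for i in 1:length(args)
    x = args[i]  # Union-typed load -> jl_f_getfield in device IR
    if x isa AbstractFloat
      n += 1
    end
  end
  return n
end

count_matches((1, 2.0, "three"))  # fine on the CPU; invalid IR inside a kernel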

luciano-drozda changed the title from "Unsupported Val in CUDA kernel - Enzyme v0.10.0 - GPUCompiler v.0.15" to "Unsupported Val in CUDA kernel - Enzyme v0.10.0 - GPUCompiler v0.15" on Jun 7, 2022
vchuravy (Member) commented:

In JuliaGPU/KernelAbstractions.jl#307 (comment), @pxl-th shared the output of @device_code:

CodeInfo(
     @ /home/pxl-th/.julia/dev/KernelAbstractions/lib/KernelGradients/src/KernelGradients.jl:9 within `df`
1 ── %1  = Core.getfield(args, 1)::Duplicated{CuDeviceMatrix{Float32, 1}}
│    %2  = Core.getfield(args, 2)::Duplicated{CuDeviceMatrix{Float32, 1}}
│   ┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:429 within `autodiff_deferred`
│   │┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:190 within `same_or_one`
│   ││ %3  = Core.tuple(ctx, %1, %2)::Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(128,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, Duplicated{CuDeviceMatrix{Float32, 1}}, Duplicated{CuDeviceMatrix{Float32, 1}}}
│   ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:192 within `same_or_one`
└───││       goto #10 if not true
2 ┄─││ %5  = φ (#1 => 2, #9 => %18)::Int64
│   ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:206 within `same_or_one`
│   ││┌ @ tuple.jl:68 within `iterate`
│   │││┌ @ int.jl:481 within `<=`
│   ││││ %6  = Base.sle_int(1, %5)::Bool
│   │││└
└───│││       goto #4 if not %6
    │││┌ @ int.jl:481 within `<=`
3 ──││││ %8  = Base.sle_int(%5, 3)::Bool
│   │││└
└───│││       goto #5
4 ──│││       nothing::Nothing
5 ┄─│││ %11 = φ (#3 => %8, #4 => false)::Bool
└───│││       goto #7 if not %11
    │││┌ @ tuple.jl:29 within `getindex`
6 ──││││       Base.getfield(%3, %5, false)::Union{Duplicated{CuDeviceMatrix{Float32, 1}}, KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(128,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}}
│   │││└
│   │││┌ @ int.jl:87 within `+`
│   ││││ %14 = Base.add_int(%5, 1)::Int64
│   │││└
└───│││       goto #8
7 ──│││       Base.nothing::Nothing
└───│││       goto #8
    ││└
8 ┄─││ %18 = φ (#6 => %14)::Int64
│   ││ %19 = φ (#6 => false, #7 => true)::Bool
│   ││ %20 = Base.not_int(%19)::Bool
└───││       goto #10 if not %20
9 ──││       goto #2
    ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:208 within `same_or_one`
10 ┄││       goto #12 if not true
11 ─││       nothing::Nothing
    ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:212 within `same_or_one`
12 ┄││       goto #13
    │└
    │ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:444 within `autodiff_deferred`
    │┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4712 within `deferred_codegen`
    ││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4693 within `gendeferred_codegen`
    │││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4706 within `macro expansion`
13 ─││││ %26 = $(Expr(:foreigncall, "extern deferred_codegen", Ptr{Nothing}, svec(Ptr{Nothing}), 0, :(:llvmcall), :($(QuoteNode(Ptr{Nothing} @0x00007f4221e39ff8))), :($(QuoteNode(Ptr{Nothing} @0x00007f4221e39ff8)))))::Ptr{Nothing}
│   │└└└
│   │ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:451 within `autodiff_deferred`
│   │┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4260 within `CombinedAdjointThunk`
│   ││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4278 within `enzyme_call`
│   │││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4512 within `macro expansion`
│   ││││ %27 = Base.llvmcall::Core.IntrinsicFunction
│   ││││ %28 = Core.tuple("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine void @entry(i64 %0, { [1 x [1 x [1 x i64]]], { [1 x [1 x [1 x i64]]] } } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, { i8 addrspace(1)*, i64, [2 x i64], i64 } %4, { i8 addrspace(1)*, i64, [2 x i64], i64 } %5) #0 {\nentry:\n  %6 = inttoptr i64 %0 to void ({ [1 x [1 x [1 x i64]]], { [1 x [1 x [1 x i64]]] } }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 })*\n  call void %6({ [1 x [1 x [1 x i64]]], { [1 x [1 x [1 x i64]]] } } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, { i8 addrspace(1)*, i64, [2 x i64], i64 } %4, { i8 addrspace(1)*, i64, [2 x i64], i64 } %5)\n  ret void\n}\n\nattributes #0 = { alwaysinline }\n", "entry")::Tuple{String, String}
│   ││││┌ @ Base.jl:38 within `getproperty`
│   │││││ %29 = Base.getfield(%1, :val)::CuDeviceMatrix{Float32, 1}
│   │││││ %30 = Base.getfield(%1, :dval)::CuDeviceMatrix{Float32, 1}
│   │││││ %31 = Base.getfield(%2, :val)::CuDeviceMatrix{Float32, 1}
│   │││││ %32 = Base.getfield(%2, :dval)::CuDeviceMatrix{Float32, 1}
│   ││││└
│   ││││       (%27)(%28, Enzyme.Compiler.Cvoid, Tuple{Ptr{Nothing}, KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(128,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}}, %26, ctx, %29, %30, %31, %32)::Nothing
│   │└└└
└───│       goto #14
    └
     @ /home/pxl-th/.julia/dev/KernelAbstractions/lib/KernelGradients/src/KernelGradients.jl:10 within `df`
14 ─       return KernelGradients.nothing
) => Nothing

@wsmoses you are using a for-loop to iterate over a heterogeneous tuple, so the getindex operation is type-unstable. You either have to write the function recursively or use ntuple with an inlined closure; see the sketch below.
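
A hedged sketch of both alternatives (illustrative stand-in logic, not the actual patch):

# Recursive form: each call peels one argument off the tuple, so the
# compiler sees concrete types and no tuple getindex is needed.
@inline same_width_rec(width) = width
@inline function same_width_rec(width, arg, rest...)
  w = arg isa AbstractFloat ? 2 : 1  # stand-in for the real width logic
  return same_width_rec(max(width, w), rest...)
end

# ntuple form: with Val(N) the call is unrolled at compile time, so each
# index into args is a literal and getindex stays type-stable (the
# closure must be small enough to inline).
arg_widths(args::Tuple) =
  ntuple(i -> args[i] isa AbstractFloat ? 2 : 1, Val(length(args)))

same_width_rec(1, 2.0, "three", 4.0)  # == 2, fully inferred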

vchuravy (Member) commented:

Should now be fixed across the stack.
