Unsupported Val in CUDA kernel - Enzyme v0.10.0 - GPUCompiler v0.15 #358

Closed · luciano-drozda opened this issue Jun 5, 2022 · 3 comments

luciano-drozda commented:

MWE:

using CUDA
using Enzyme

if has_cuda()
  @info "CUDA is on"
  CUDA.allowscalar(false)
end

# Empty device kernel taking a compile-time size via Val.
function kernel!(u, ::Val{n}) where {n}
  return nothing
end # kernel!

# Device-side reverse-mode differentiation of kernel!.
function dkernel!(du, ::Val{n}) where {n}
  Enzyme.autodiff_deferred(kernel!, Const, du, Val(n))
  return nothing
end # dkernel!

function call_dkernel()
  n    = 10
  u    = rand(n) |> cu
  dzdu = rand(n) |> cu
  du   = Duplicated(u, dzdu)  # primal and shadow (adjoint seed) arrays
  @cuda threads=4 dkernel!(du, Val(n))
end # call_dkernel

call_dkernel()

The output:

[ Info: CUDA is on
ERROR: LoadError: InvalidIRError: compiling kernel #dkernel!(Duplicated{CuDeviceVector{Float32, 1}}, Val{10}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f_getfield)
Stacktrace:
 [1] getindex
   @ ./tuple.jl:29
 [2] iterate
   @ ./tuple.jl:69
 [3] same_or_one
   @ /scratch/drozda/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:203
 [4] autodiff_deferred
   @ /scratch/drozda/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:429
 [5] dkernel!
   @ /scratch/drozda/test.jl:16
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(dkernel!), Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}}}, args::LLVM.Module)
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
  [2] macro expansion
    @ /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
  [3] macro expansion
    @ /scratch/drozda/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
  [7] #224
    @ /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(dkernel!), Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}}}})
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler /scratch/drozda/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [11] cufunction(f::typeof(dkernel!), tt::Type{Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(dkernel!), tt::Type{Tuple{Duplicated{CuDeviceVector{Float32, 1}}, Val{10}}})
    @ CUDA /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
 [13] macro expansion
    @ /scratch/drozda/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [14] call_dkernel()
    @ Main /scratch/drozda/test.jl:27
 [15] top-level scope
    @ /scratch/drozda/test.jl:31
 [16] include(fname::String)
    @ Base.MainInclude ./client.jl:451
 [17] top-level scope
    @ REPL[2]:1
 [18] top-level scope
    @ /scratch/drozda/.julia/packages/CUDA/GGwVa/src/initialization.jl:52
in expression starting at /scratch/drozda/test.jl:31
wsmoses (Member) commented Jun 6, 2022:

@vchuravy it looks like the same_or_one function used to deduce the BatchDuplicated width isn't actually getting inlined and is type-unstable. Can you look into it?
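
For context, a minimal sketch of the failing pattern (a hypothetical reduction, not Enzyme's actual source): indexing a heterogeneous tuple with a runtime integer makes getindex return a Union, which lowers to the dynamic jl_f_getfield call that GPUCompiler rejects.

# Hypothetical reduction of the instability; AbstractFloat stands in
# for the real isa Duplicated / isa BatchDuplicated checks.
function count_matches(args::Tuple)
  n = 0
  for i in 1:length(args)
    x = args[i]  # Union-typed load -> jl_f_getfield in device IR
    if x isa AbstractFloat
      n += 1
    end
  end
  return n
end

count_matches((1, 2.0, "three"))  # fine on the CPU; invalid IR inside a kernel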

luciano-drozda changed the title from "Unsupported Val in CUDA kernel - Enzyme v0.10.0 - GPUCompiler v.0.15" to "Unsupported Val in CUDA kernel - Enzyme v0.10.0 - GPUCompiler v0.15" on Jun 7, 2022
vchuravy (Member) commented:

In JuliaGPU/KernelAbstractions.jl#307 (comment), @pxl-th shared the output of @device_code:

CodeInfo(
     @ /home/pxl-th/.julia/dev/KernelAbstractions/lib/KernelGradients/src/KernelGradients.jl:9 within `df`
1 ── %1  = Core.getfield(args, 1)::Duplicated{CuDeviceMatrix{Float32, 1}}
│    %2  = Core.getfield(args, 2)::Duplicated{CuDeviceMatrix{Float32, 1}}
│   ┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:429 within `autodiff_deferred`
│   │┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:190 within `same_or_one`
│   ││ %3  = Core.tuple(ctx, %1, %2)::Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(128,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, Duplicated{CuDeviceMatrix{Float32, 1}}, Duplicated{CuDeviceMatrix{Float32, 1}}}
│   ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:192 within `same_or_one`
└───││       goto #10 if not true
2 ┄─││ %5  = φ (#1 => 2, #9 => %18)::Int64
│   ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:206 within `same_or_one`
│   ││┌ @ tuple.jl:68 within `iterate`
│   │││┌ @ int.jl:481 within `<=`
│   ││││ %6  = Base.sle_int(1, %5)::Bool
│   │││└
└───│││       goto #4 if not %6
    │││┌ @ int.jl:481 within `<=`
3 ──││││ %8  = Base.sle_int(%5, 3)::Bool
│   │││└
└───│││       goto #5
4 ──│││       nothing::Nothing
5 ┄─│││ %11 = φ (#3 => %8, #4 => false)::Bool
└───│││       goto #7 if not %11
    │││┌ @ tuple.jl:29 within `getindex`
6 ──││││       Base.getfield(%3, %5, false)::Union{Duplicated{CuDeviceMatrix{Float32, 1}}, KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(128,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}}
│   │││└
│   │││┌ @ int.jl:87 within `+`
│   ││││ %14 = Base.add_int(%5, 1)::Int64
│   │││└
└───│││       goto #8
7 ──│││       Base.nothing::Nothing
└───│││       goto #8
    ││└
8 ┄─││ %18 = φ (#6 => %14)::Int64
│   ││ %19 = φ (#6 => false, #7 => true)::Bool
│   ││ %20 = Base.not_int(%19)::Bool
└───││       goto #10 if not %20
9 ──││       goto #2
    ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:208 within `same_or_one`
10 ┄││       goto #12 if not true
11 ─││       nothing::Nothing
    ││ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:212 within `same_or_one`
12 ┄││       goto #13
    │└
    │ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:444 within `autodiff_deferred`
    │┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4712 within `deferred_codegen`
    ││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4693 within `gendeferred_codegen`
    │││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4706 within `macro expansion`
13 ─││││ %26 = $(Expr(:foreigncall, "extern deferred_codegen", Ptr{Nothing}, svec(Ptr{Nothing}), 0, :(:llvmcall), :($(QuoteNode(Ptr{Nothing} @0x00007f4221e39ff8))), :($(QuoteNode(Ptr{Nothing} @0x00007f4221e39ff8)))))::Ptr{Nothing}
│   │└└└
│   │ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/Enzyme.jl:451 within `autodiff_deferred`
│   │┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4260 within `CombinedAdjointThunk`
│   ││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4278 within `enzyme_call`
│   │││┌ @ /home/pxl-th/.julia/packages/Enzyme/7MHm8/src/compiler.jl:4512 within `macro expansion`
│   ││││ %27 = Base.llvmcall::Core.IntrinsicFunction
│   ││││ %28 = Core.tuple("; ModuleID = 'llvmcall'\nsource_filename = \"llvmcall\"\n\n; Function Attrs: alwaysinline\ndefine void @entry(i64 %0, { [1 x [1 x [1 x i64]]], { [1 x [1 x [1 x i64]]] } } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, { i8 addrspace(1)*, i64, [2 x i64], i64 } %4, { i8 addrspace(1)*, i64, [2 x i64], i64 } %5) #0 {\nentry:\n  %6 = inttoptr i64 %0 to void ({ [1 x [1 x [1 x i64]]], { [1 x [1 x [1 x i64]]] } }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 }, { i8 addrspace(1)*, i64, [2 x i64], i64 })*\n  call void %6({ [1 x [1 x [1 x i64]]], { [1 x [1 x [1 x i64]]] } } %1, { i8 addrspace(1)*, i64, [2 x i64], i64 } %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, { i8 addrspace(1)*, i64, [2 x i64], i64 } %4, { i8 addrspace(1)*, i64, [2 x i64], i64 } %5)\n  ret void\n}\n\nattributes #0 = { alwaysinline }\n", "entry")::Tuple{String, String}
│   ││││┌ @ Base.jl:38 within `getproperty`
│   │││││ %29 = Base.getfield(%1, :val)::CuDeviceMatrix{Float32, 1}
│   │││││ %30 = Base.getfield(%1, :dval)::CuDeviceMatrix{Float32, 1}
│   │││││ %31 = Base.getfield(%2, :val)::CuDeviceMatrix{Float32, 1}
│   │││││ %32 = Base.getfield(%2, :dval)::CuDeviceMatrix{Float32, 1}
│   ││││└
│   ││││       (%27)(%28, Enzyme.Compiler.Cvoid, Tuple{Ptr{Nothing}, KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(128,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}, CuDeviceMatrix{Float32, 1}}, %26, ctx, %29, %30, %31, %32)::Nothing
│   │└└└
└───│       goto #14
    └
     @ /home/pxl-th/.julia/dev/KernelAbstractions/lib/KernelGradients/src/KernelGradients.jl:10 within `df`
14 ─       return KernelGradients.nothing
) => Nothing

@wsmoses you are using a for-loop to iterate over a heterogeneous tuple, so the getindex operation is type-unstable. You either have to write the function recursively or use ntuple with an inlined closure; see the sketch below.
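
A hedged sketch of both alternatives (illustrative stand-in logic, not the actual patch):

# Recursive form: each call peels one argument off the tuple, so the
# compiler sees concrete types and no tuple getindex is needed.
@inline same_width_rec(width) = width
@inline function same_width_rec(width, arg, rest...)
  w = arg isa AbstractFloat ? 2 : 1  # stand-in for the real width logic
  return same_width_rec(max(width, w), rest...)
end

# ntuple form: with Val(N) the call is unrolled at compile time, so each
# index into args is a literal and getindex stays type-stable (the
# closure must be small enough to inline).
arg_widths(args::Tuple) =
  ntuple(i -> args[i] isa AbstractFloat ? 2 : 1, Val(length(args)))

same_width_rec(1, 2.0, "three", 4.0)  # == 2, fully inferred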

vchuravy (Member) commented:

Should now be fixed across the stack.
