Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SROA changes cause memory corruption in HDF5.jl Blosc decompression #43402

Closed
mkitti opened this issue Dec 12, 2021 · 12 comments · Fixed by #43408
Closed

SROA changes cause memory corruption in HDF5.jl Blosc decompression #43402

mkitti opened this issue Dec 12, 2021 · 12 comments · Fixed by #43408

Comments

@mkitti
Copy link
Contributor

mkitti commented Dec 12, 2021

SROA changes in #43239 causes memory corruption in HDF5.jl Blosc decompression. The issue was git bisected to 1a1f3d7 .

Background

HDF5.jl version 0.15.7 implements a Blosc compression plugin. The plugin is a Julia function blosc_filter which is passed to the HDF5 C library as a C function pointer via @cfunction. This filter implements both compression and decompression. The issue occurs during decompression.

https://github.com/JuliaIO/HDF5.jl/blob/af25a33934623e0e18ec25c2711e42be725ab5fe/src/blosc_filter.jl#L81-L85

function blosc_filter(flags::Cuint, cd_nelmts::Csize_t,
                      cd_values::Ptr{Cuint}, nbytes::Csize_t,
                      buf_size::Ptr{Csize_t}, buf::Ptr{Ptr{Cvoid}})
...
        outbuf_size, cbytes, blocksize = Blosc.cbuffer_sizes(unsafe_load(buf))
        outbuf = Libc.malloc(outbuf_size)
        outbuf == C_NULL && return Csize_t(0)
        status = Blosc.blosc_decompress(unsafe_load(buf), outbuf, outbuf_size)
        status <= 0 && (Libc.free(outbuf); return Csize_t(0))
...
end

During decompression, blosc_filter calls Blosc.cbuffer_sizes which is reproduced below.

function cbuffer_sizes(buf)
    s1 = Ref{Csize_t}()
    s2 = Ref{Csize_t}()
    s3 = Ref{Csize_t}()
    ccall((:blosc_cbuffer_sizes,libblosc), Cvoid,
          (Ptr{UInt8}, Ref{Csize_t}, Ref{Csize_t}, Ref{Csize_t}),
          buf, s1, s2, s3)
    return (s1[], s2[], s3[])
end

The source code for blosc_cbuffer_sizes is located here:
https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.c#L2101-L2117

What appears to be happening after 1a1f3d7 is that the memory for outbuf_size is being corrupted. Blosc.blosc_decompress detects the corruption and returns an error code. The issue can be mitigated by modifying blosc_filter in HDF5.jl to preserve the outputs cbytes and blocksize until after the execution of Blosc.blosc_decompress. Perhaps this preserves the Refs long enough so that the underlying memory they are associated with is not modified. This doesn't make any sense to me given Julia's semantics.

function blosc_filter(flags::Cuint, cd_nelmts::Csize_t,
                      cd_values::Ptr{Cuint}, nbytes::Csize_t,
                      buf_size::Ptr{Csize_t}, buf::Ptr{Ptr{Cvoid}})
...
        outbuf_size, cbytes, blocksize = Blosc.cbuffer_sizes(unsafe_load(buf))
        GC.@preserve outbuf_size cbytes blocksize
            outbuf = Libc.malloc(outbuf_size)
            outbuf == C_NULL && return Csize_t(0)
            status = Blosc.blosc_decompress(unsafe_load(buf), outbuf, outbuf_size)
            status <= 0 && (Libc.free(outbuf); return Csize_t(0))
        end
...
end

This mitigation is deployed in pull request 880 to HDF5.jl which is expected to be merged as HDF5.jl 0.16.0.
JuliaIO/HDF5.jl@740983e#diff-8ec4f00da08b40e9b455e957caf58d485f8389d51504c1f4f9fcf1685c270b90R99

Minimum working example

HDF5 is at 0.15.7. Derived from https://github.com/JuliaIO/HDF5.jl/blob/v0.15.7/test/plain.jl

using HDF5, Test

function mwe()
    exit_code = 0

    # Create a new file
    fn = tempname()
    f = h5open(fn, "w")
    # Group
    g = create_group(f, "mygroup")
    # Test dataset with compression
    R = rand(1:20, 20, 40);
    # HDF5.jl v0.16
    #g["BloscA", chunk=(5, 6), shuffle=true, blosc=9] = R
    g["BloscA", chunk=(5, 6), shuffle=(), blosc=9] = R
    close(g)
    close(f)

    # Read the file back in
    fr = h5open(fn)
    try
        Rr4 = read(fr, "mygroup/BloscA")
        @test Rr4 == R
    catch err
        @info "Error" err
        exit_code = 5
    finally
        close(fr)
        HDF5.h5_close()
    end
    exit(exit_code)

end

mwe()

Execution under 1a1f3d7 (#43239)

$ usr/bin/julia mwe.jl
HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
  #000: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5Dio.c line 192 in H5Dread(): can't read data
    major: Dataset
    minor: Read failed
  #001: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5VLcallback.c line 2080 in H5VL_dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #002: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5VLcallback.c line 2046 in H5VL__dataset_read(): dataset read failed
    major: Virtual Object Layer
    minor: Read failed
  #003: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5VLnative_dataset.c line 167 in H5VL__native_dataset_read(): can't read data
    major: Dataset
    minor: Read failed
  #004: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5Dio.c line 567 in H5D__read(): can't read data
    major: Dataset
    minor: Read failed
  #005: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5Dchunk.c line 2594 in H5D__chunk_read(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #006: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5Dchunk.c line 3957 in H5D__chunk_lock(): data pipeline read failed
    major: Dataset
    minor: Filter operation failed
  #007: D:/mingwbuild/mingw-w64-hdf5/src/hdf5-1.12.0/src/H5Z.c line 1336 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
┌ Info: Error
└   err = Error reading dataset /mygroup/BloscA

julia> versioninfo()
Julia Version 1.8.0-DEV.1080
Commit 1a1f3d7073* (2021-11-29 02:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

Execution under c624d4f (previous commit, #43208)

$ usr/bin/julia mwe.jl

No error. No output.

Environment

(@v1.8) pkg> st
Status `C:\Users\kittisopikulm\.julia\environments\v1.8\Project.toml`
  [f67ccb44] HDF5 v0.15.7

(@v1.8) pkg> st -m
Status `C:\Users\kittisopikulm\.julia\environments\v1.8\Manifest.toml`
  [a74b3585] Blosc v0.7.1
  [34da2185] Compat v3.40.0
  [f67ccb44] HDF5 v0.15.7
  [692b3bcd] JLLWrappers v1.3.0
  [21216c6a] Preferences v1.2.2
  [ae029012] Requires v1.1.3
  [0b7ba130] Blosc_jll v1.21.1+0
  [0234f1f7] HDF5_jll v1.12.0+1
  [5ced341a] Lz4_jll v1.9.3+0
  [458c3c95] OpenSSL_jll v1.1.10+0
  [3161d3a3] Zstd_jll v1.5.0+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.5.1
  [7b1f6079] FileWatching
  [b77e0a4c] InteractiveUtils
  [b27032c2] LibCURL v0.6.3
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML v1.0.0
  [a4e569a6] Tar v1.10.0
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v0.5.0+0
  [deac9b47] LibCURL_jll v7.73.0+4
  [29816b5a] LibSSH2_jll v1.9.1+2
  [c8ffd9c3] MbedTLS_jll v2.24.0+2
  [14a3606d] MozillaCACerts_jll v2020.7.22
  [4536629a] OpenBLAS_jll v0.3.17+2
  [83775a58] Zlib_jll v1.2.12+1
  [8e850b90] libblastrampoline_jll v3.1.0+0
  [8e850ede] nghttp2_jll v1.41.0+1
  [3f19e933] p7zip_jll v16.2.1+1
@mkitti
Copy link
Contributor Author

mkitti commented Dec 12, 2021

Ping @aviatesk

@mkitti
Copy link
Contributor Author

mkitti commented Dec 12, 2021

Doing GC.@preserve outbuf_size cbytes is sufficient to avoid the error. Reducing the line to GC.@preserve outbuf_size is not sufficient.

@mkitti
Copy link
Contributor Author

mkitti commented Dec 12, 2021

The value of outbuf_size should be 0x00000000000000f0 (240). On my machine it gets turned into 0x0000000000000058 (88) with the 5 being consistent and the last hexadecimal digit, 8 varying.

@mkitti mkitti changed the title SROA changes causes memory corruption in HDF5.jl Blosc decompression SROA changes cause memory corruption in HDF5.jl Blosc decompression Dec 12, 2021
@aviatesk
Copy link
Member

This doesn't make any sense to me given Julia's semantics.

It seems like cbytes, blocksize can be accessed implicitly in that block? Then the usage of GC.@preserve is required (we need to give the compiler additional information that can't be derived from Julia semantics)

@mkitti
Copy link
Contributor Author

mkitti commented Dec 12, 2021

Why is the value of outbuf_size dependent on preserving cbytes, blocksize though? In this case, that's the only variable we need.

@aviatesk
Copy link
Member

aviatesk commented Dec 12, 2021

If these memories can be accessed in C lib, we need to protect them from GC. cbuffer_sizes dereferences the pointers. So something seems to happen there.

@vchuravy
Copy link
Member

So what I find suspicous is:

julia> function cbuffer_sizes(buf)
           s1 = Ref{Csize_t}()
           s2 = Ref{Csize_t}()
           s3 = Ref{Csize_t}()
           ccall(:blosc_cbuffer_sizes, Cvoid,
                 (Ptr{UInt8}, Ref{Csize_t}, Ref{Csize_t}, Ref{Csize_t}),
                 buf, s1, s2, s3)
           return (s1[], s2[], s3[])
       end
cbuffer_sizes (generic function with 1 method)

julia> function f(buf)
         outbuf_size, cbytes, blocksize = cbuffer_sizes(buf)
         ptr = Libc.malloc(outbuf_size)
         Libc.free(ptr)
       end
f (generic function with 1 method)

julia> @code_llvm optimize=false f(C_NULL)
;  @ REPL[15]:1 within `f`
define void @julia_f_394(i64 zeroext %0) #0 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = bitcast {}*** %1 to {}**
  %current_task = getelementptr inbounds {}*, {}** %2, i64 -12
  %3 = bitcast {}** %current_task to i64*
  %world_age = getelementptr inbounds i64, i64* %3, i64 13
;  @ REPL[15]:2 within `f`
; ┌ @ REPL[14]:2 within `cbuffer_sizes`
; │┌ @ refpointer.jl:135 within `Ref`
; ││┌ @ refvalue.jl:7 within `RefValue`
     %4 = bitcast {}*** %1 to {}**
     %current_task1 = getelementptr inbounds {}*, {}** %4, i64 -12
     %5 = call noalias nonnull {}* @julia.gc_alloc_obj({}** %current_task1, i64 8, {}* inttoptr (i64 139845884220736 to {}*)) #2
; │└└
; │ @ REPL[14]:3 within `cbuffer_sizes`
; │┌ @ refpointer.jl:135 within `Ref`
; ││┌ @ refvalue.jl:7 within `RefValue`
     %6 = bitcast {}*** %1 to {}**
     %current_task2 = getelementptr inbounds {}*, {}** %6, i64 -12
     %7 = call noalias nonnull {}* @julia.gc_alloc_obj({}** %current_task2, i64 8, {}* inttoptr (i64 139845884220736 to {}*)) #2
; │└└
; │ @ REPL[14]:4 within `cbuffer_sizes`
; │┌ @ refpointer.jl:135 within `Ref`
; ││┌ @ refvalue.jl:7 within `RefValue`
     %8 = bitcast {}*** %1 to {}**
     %current_task3 = getelementptr inbounds {}*, {}** %8, i64 -12
     %9 = call noalias nonnull {}* @julia.gc_alloc_obj({}** %current_task3, i64 8, {}* inttoptr (i64 139845884220736 to {}*)) #2
; │└└
; │ @ REPL[14]:5 within `cbuffer_sizes`
; │┌ @ refpointer.jl:101 within `unsafe_convert` @ refvalue.jl:40
; ││┌ @ pointer.jl:147 within `pointer_from_objref`
     %10 = call nonnull {}* @julia.pointer_from_objref({}* %5) #3
     %11 = ptrtoint {}* %10 to i64
     %12 = call nonnull {}* @julia.pointer_from_objref({}* %7) #3
     %13 = ptrtoint {}* %12 to i64
     %14 = call nonnull {}* @julia.pointer_from_objref({}* %9) #3
     %15 = ptrtoint {}* %14 to i64
; │└└
   %16 = bitcast {}* %10 to i8*
   %17 = bitcast {}* %12 to i8*
   %18 = bitcast {}* %14 to i8*
   %19 = load atomic void ()*, void ()** @ccall_blosc_cbuffer_sizes_396 unordered, align 8
   %20 = icmp ne void ()* %19, null
   br i1 %20, label %ccall, label %dlsym

dlsym:                                            ; preds = %top
   %21 = call void ()* @ijl_load_and_lookup(i8* null, i8* getelementptr inbounds ([20 x i8], [20 x i8]* @_j_str1, i32 0, i32 0), i8** @jl_RTLD_DEFAULT_handle)
   store atomic void ()* %21, void ()** @ccall_blosc_cbuffer_sizes_396 release, align 8
   br label %ccall

ccall:                                            ; preds = %dlsym, %top
   %22 = phi void ()* [ %19, %top ], [ %21, %dlsym ]
   %23 = bitcast void ()* %22 to void (i64, i8*, i8*, i8*)*
   call void %23(i64 %0, i8* %16, i8* %17, i8* %18) [ "jl_roots"({}* %5) ]
; │ @ REPL[14]:8 within `cbuffer_sizes`
; │┌ @ refvalue.jl:56 within `getindex`
; ││┌ @ Base.jl:38 within `getproperty`
     %24 = bitcast {}* %5 to i64*
     %25 = load i64, i64* %24, align 8
; └└└
;  @ REPL[15]:3 within `f`
; ┌ @ libc.jl:355 within `malloc`
   %26 = call i64 inttoptr (i64 139846208852768 to i64 (i64)*)(i64 %25)
; └
;  @ REPL[15]:4 within `f`
; ┌ @ libc.jl:348 within `free`
   call void inttoptr (i64 139846208854400 to void (i64)*)(i64 %26)
; └
  ret void
}

The ccall is supposed to root it's arguments, but it is not. It is only rooting %5 which is the first RefValue, but not Refvalue %7 and %8

@vchuravy
Copy link
Member

On 1.7 we indeed root all the relevant values. So I don't think it's the SROA change, it's probably just surfaced it.

@vchuravy
Copy link
Member

Okay probably due to inlining cbuffer_sizes on it's own codegens correctly. Inlined into f, things go awry.

@vchuravy
Copy link
Member

After inlining:

...
│         $(Expr(:foreigncall, :(:blosc_cbuffer_sizes), Nothing, svec(Ptr{UInt8}, Ptr{UInt64}, Ptr{UInt64}, Ptr{UInt64}), 0, :(:ccall), :(%4), :(%6), :(%8), :(%10), :(%1), Core.Argument(2)))::Nothing
...

What it should have been:

│         $(Expr(:foreigncall, :(:blosc_cbuffer_sizes), Nothing, svec(Ptr{UInt8}, Ptr{UInt64}, Ptr{UInt64}, Ptr{UInt64}), 0, :(:ccall), :(%4), :(%6), :(%8), :(%10), :(%3), :(%2), :(%1), Core.Argument(2)))::Nothing

Note the missing: :(%3), :(%2) in the after inlining.

@vchuravy
Copy link
Member

@mkitti did you check if this fixes the problem in HDF5 you observed?

@mkitti
Copy link
Contributor Author

mkitti commented Dec 15, 2021

@mkitti did you check if this fixes the problem in HDF5 you observed?

#43408 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants