bug on aarch64-apple-ios: Buffer Validation Illegal MTLStorageMode 0x10 #2322

evilsocket · 2024-07-08T11:06:21Z

I'm running candle with metal acceleration on iOS via uniffi, specifically:

candle-core = { version = "0.6.0", features = ["metal"] }
candle-nn = { version = "0.6.0", features = ["metal"] }
candle-transformers = { version = "0.6.0", features = ["metal"] }

But the call to Device::new_metal panics with:

-[MTLDebugDevice newBufferWithBytes:length:options:]:670: failed assertion `Buffer Validation
Illegal MTLStorageMode 0x10'

After some debugging, I figured this happens on this line due to iOS rejecting the metal::MTLResourceOptions::StorageModeManaged option.

I could replicate with this simple test:

#[uniffi::export]
pub fn test_metal() {
    let device = metal::Device::all().swap_remove(0);

    println!("device: {:?}", &device);

    println!(
        "MTLResourceOptions::StorageModeManaged = 0x{:x}",
        metal::MTLResourceOptions::StorageModeManaged
    );

    // it panics here
    let seed = device.new_buffer_with_data(
        [299792458].as_ptr() as *const std::ffi::c_void,
        4,
        metal::MTLResourceOptions::StorageModeManaged, // <-- 0x10
    );

    println!("seed: {:?}", &seed);
}

Which prints and then panics:

device: <CaptureMTLDevice: 0x105207820> -> <MTLDebugDevice: 0x105207580> -> <AGXG16Device: 0x10800f400>
    name = Apple A17 Pro GPU
MTLResourceOptions::StorageModeManaged = 0x10
-[MTLDebugDevice newBufferWithBytes:length:options:]:670: failed assertion `Buffer Validation
Illegal MTLStorageMode 0x10
'

Other values such as StorageModeShared do not panic, but i'm not sure of the implications also because StorageModeManaged is used in several other places in the metal implementation.

The text was updated successfully, but these errors were encountered:

evilsocket · 2024-07-08T11:07:42Z

also mentioned in #1841

evilsocket · 2024-07-10T12:50:38Z

After a bit of digging into Apple MLX and especially how they handle buffer allocation on both macOS and iOS, I found this https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/allocator.cpp#L207

You will see that all allocations are centralized there and they always and only use ResourceStorageModeShared (you won't find references to other storage modes in their metal backend). So it seems like on iOS at least, managed buffers are not needed. It makes sense if we think about it as Metal on macOS must support both Intel (where GPU and RAM are not unified) and Apple Silicon, while Metal on iOS only cares about unified memory, hence no syncing needed / supported.

Unless I've got it all wrong, which is possible.

evilsocket · 2024-07-10T13:53:00Z

Also related, only the precompiled macOS library is present in this repo #1759

working on it on my fork

evilsocket · 2024-07-10T14:09:42Z

My branch compiles and runs on iOS:

I'm still getting weird errors when I try to have my macOS M1 doing computations with an iPhone (Error: A weight is invalid in distribution), but it works.

evilsocket · 2024-07-10T14:10:28Z

Fixes happening here: https://github.com/evilsocket/candle
Integration happening here: https://github.com/evilsocket/llama3-cake

LaurentMazare · 2024-07-10T14:12:34Z

"weight is invalid in distribution" usually means that the model generated a nan, so somehow one of the metal kernel probably did not work in the same way on iphone as it would have on macos. It might be good to print the tensors after each op to get a sense of where the nan is coming from.

evilsocket · 2024-07-10T14:14:08Z

I have a feeling it comes from this https://codebrowser.dev/tokio/crates/rand-0.8.5/src/distributions/weighted_index.rs.html#454 .. i'm trying to narrow it down but to be honest I have very little experience with Candle so it's taking a while ...

LaurentMazare · 2024-07-10T14:23:56Z

Yes it's coming from the sampling, and the input of the sampling is based on the logits generated on the model, I would think that it's almost always caused by one of these values being a nan (or an inf).

evilsocket · 2024-07-10T14:26:43Z

tensor printf debugging ftw

evilsocket · 2024-07-10T16:19:55Z

@LaurentMazare you were 100% right, the error is in the logits output vector being full of NaN, the sampling doesn't like that ... disabling the kv_cache on both workers and master seem to be improve things, the point being, this is a problem of my tool, the porting of candle by itself seems to work

evilsocket · 2024-07-11T02:19:17Z

@LaurentMazare it seems to depend on some (version?) discrepancy between the libMetalFlashAttention.metallib bundled in this repo vs the iOS and macOS libMetalFlashAttention.metallib bundled here https://github.com/philipturner/metal-flash-attention/releases/tag/v1.0.1

In this issue #1759 @ivarflakstad suggests to use the iOS version from that release - however I found out that if I replace the the macOS metallib in this repo with that release, these tests fail https://github.com/evilsocket/candle-metal-tester/blob/main/core/src/lib.rs (even on macOS), suggesting that candle-metal-kernels makes some assumptions on the implementation of that library, that clearly changed over time.

Bottom line, it would be great if the last person to update the lib on this repo ( @FL33TW00D via b6afb46 ) could document which specific version/commit that library was compiled from and if possible also compile the same version for iOS (or point to the right precompiled release). I suspect that aligning both macOS and iOS libMetalFlashAttention versions would solve the issue.

ivarflakstad · 2024-07-11T04:58:18Z

We're currently using @FL33TW00D 's fork of mfa which iirc has a tiny change in how buffers are accessed.

ivarflakstad · 2024-07-11T05:03:18Z

My plan was actually to fix all these issues as soon as I'm back from vacation.
No need to use xcode 14. No longer be fully dependant on mfa. Introduce candle.metallib.

evilsocket · 2024-07-11T09:23:12Z

@ivarflakstad understood, i'll try to compile this for iOS https://github.com/FL33TW00D/metal-flash-attention and wait for proper fix, thanks

evilsocket · 2024-07-11T09:31:41Z

@ivarflakstad just tried to compile @FL33TW00D fork (last commit), that is not the one:

found <AGXG13XDevice: 0x13300ca00>
    name = Apple M1 Max

using <AGXG13XDevice: 0x13300ca00>
    name = Apple M1 Max
thread 'main' panicked at core/src/lib.rs:80:5:
assertion `left == right` failed
  left: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
 right: [20.0, 23.0, 26.0, 29.0, 56.0, 68.0, 80.0, 92.0]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

evilsocket · 2024-07-11T09:39:05Z

I've literally tried to compile all branches of that repo and all of them fail the tests

evilsocket · 2024-07-13T10:32:00Z

@ivarflakstad i became a little bit obsessed with this so I tried to compile and test every single commit of every branch of both forks of libMetalFlashAttention.metallib, both for macOS and iOS. My results are a bit discouraging:

macOS:

0.1.0 (precompiled from main repo): fails tests and hangs bad the gpu state
0.2.0 (precompiled from main repo): fails tests and hangs bad the gpu state
1.0.0 (precompiled from main repo): fails tests and hangs bad the gpu state
1.0.1 (precompiled from main repo): fails tests
batches-2 (branch from FL33TW00D's fork): fails tests
larger-batches: (branch from FL33TW00D's fork): fails tests
original bundled binary: passes all tests, indicating this version comes from somewhere else.

iOS:

0.1.0 (precompiled from main repo): fails and errors called Result::unwrap() on an Err value: LoadFunctionError("Constant K_splits (212) value is required by function sgemm")
0.2.0 (precompiled from main repo): fails and errors called Result::unwrap() on an Err value: LoadFunctionError("Constant K_splits (212) value is required by function sgemm")
1.0.0 (precompiled from main repo): fails and errors called Result::unwrap() on an Err value: LoadFunctionError("Constant K_splits (212) value is required by function sgemm")
1.0.1 (precompiled from main repo): fails tests
batches-2 (branch from FL33TW00D's fork): fails tests
larger-batches: (branch from FL33TW00D's fork): fails tests
original bundled binary: does not exist :/

It's so frustrating because this single library is the only obstacle to iOS compatibility.

ivarflakstad · 2024-07-13T13:01:34Z

Right. I tried to nudge you away from the rabbit hole, but I get the curiosity hehe.
We won’t be dependent on the precompiled MFA library for much longer.

evilsocket · 2024-07-13T13:11:13Z

@ivarflakstad i am so looking forward to it - i don't think i have to tell you guys that there's a lot of potential in unlocking iOS. To my knowledge, Candle is currently the closest thing to a single framework supporting pretty much anything in a single codebase.

sinkingsugar · 2024-12-09T08:36:58Z

So this is still broken?

-[MTLDebugDevice newBufferWithBytes:length:options:]:723: failed assertion `Buffer Validation
Illegal MTLStorageMode 0x10

on 0.8.1

evilsocket mentioned this issue Jul 8, 2024

Metal iOS #1841

Open

evilsocket added a commit to evilsocket/cake that referenced this issue Jul 8, 2024

fix: using BLAS on iOS because of huggingface/candle#2322

9dc320f

evilsocket added a commit to evilsocket/candle that referenced this issue Jul 10, 2024

new: first attempt to handle huggingface#2322

73feed4

evilsocket added a commit to evilsocket/candle that referenced this issue Jul 10, 2024

fix: added libMetalFlashAttention metalib for iOS (see huggingface#2322)

2c0978c

zackangelo mentioned this issue Jul 15, 2024

WIP: Precompile metal kernels into .metallib files #2335

Closed

OptimusLime mentioned this issue Jan 9, 2025

Adding iOS Support to Mistral.rs EricLBuehler/candle#63

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug on aarch64-apple-ios: Buffer Validation Illegal MTLStorageMode 0x10 #2322

bug on aarch64-apple-ios: Buffer Validation Illegal MTLStorageMode 0x10 #2322

evilsocket commented Jul 8, 2024

evilsocket commented Jul 8, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

LaurentMazare commented Jul 10, 2024

evilsocket commented Jul 10, 2024

LaurentMazare commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 11, 2024

ivarflakstad commented Jul 11, 2024

ivarflakstad commented Jul 11, 2024

evilsocket commented Jul 11, 2024

evilsocket commented Jul 11, 2024

evilsocket commented Jul 11, 2024

evilsocket commented Jul 13, 2024

ivarflakstad commented Jul 13, 2024

evilsocket commented Jul 13, 2024

sinkingsugar commented Dec 9, 2024

bug on aarch64-apple-ios: Buffer Validation Illegal MTLStorageMode 0x10 #2322

bug on aarch64-apple-ios: Buffer Validation Illegal MTLStorageMode 0x10 #2322

Comments

evilsocket commented Jul 8, 2024

evilsocket commented Jul 8, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

LaurentMazare commented Jul 10, 2024

evilsocket commented Jul 10, 2024

LaurentMazare commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 10, 2024

evilsocket commented Jul 11, 2024

ivarflakstad commented Jul 11, 2024

ivarflakstad commented Jul 11, 2024

evilsocket commented Jul 11, 2024

evilsocket commented Jul 11, 2024

evilsocket commented Jul 11, 2024

evilsocket commented Jul 13, 2024

ivarflakstad commented Jul 13, 2024

evilsocket commented Jul 13, 2024

sinkingsugar commented Dec 9, 2024