Enable the use of PyTorch 'mps' device for inference on Apple Silicon #5
Comments
Related: when I try to run the
I tried getting past the OP issue (I found and fixed huggingface/accelerate#1297 along the way).
But then I get this puzzling error:
I couldn't find much by googling for this, not helped by the fact that PyTorch itself seems to be part of Meta (Facebook) now. There is this: https://pytorch.org/docs/stable/generated/torch.Tensor.is_meta.html and https://pytorch.org/torchdistx/latest/fake_tensor.html
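For anyone else landing here: the `is_meta` docs linked above refer to PyTorch's "meta" device, which stores only shape/dtype metadata and no actual data. A minimal sketch (my own illustration, not code from this repo) of why computations fail when weights are accidentally left on it:

```python
import torch

# A tensor on the "meta" device carries shape and dtype metadata but has
# no backing storage, so any real computation on it will raise an error.
# This is typically what happens when checkpoint weights never get
# materialized onto a real device during dispatch.
t = torch.empty(3, 4, device="meta")
print(t.is_meta)   # True
print(t.shape)     # torch.Size([3, 4])
```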
Not sure why we are on the … It sounds like maybe … Adding an explicit device map in …:

```python
device = torch.device("cpu")
if torch.has_cuda:
    device = torch.device("cuda")
elif torch.has_mps:
    device = torch.device("mps")

if device_map is None:
    modules = (
        "transformer",
        "tok_embeddings",
        "layers",
        "norm",
        "output",
    )
    device_map = {module: device for module in modules}
```

...this gets further! Now I get this error:
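As a side note, `torch.has_cuda` and `torch.has_mps` are deprecated in recent PyTorch releases; an equivalent device check using the current API (a sketch, not this repo's code) would be:

```python
import torch

def pick_device() -> torch.device:
    # Prefer CUDA, then Apple's MPS backend, falling back to CPU.
    # hasattr() guards against PyTorch builds predating torch.backends.mps.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(pick_device())
```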
Finally I tried with the …, which starts to look like running on the MPS device is a dead end.
Some great sleuthing there @anentropic 👌 that last error is mentioned in this thread
Others have got this working, it seems:
The LLaMA.cpp project enables LLaMA inference on Apple Silicon devices by using the CPU, but faster inference should be possible by supporting the M1/Pro/Max GPU in vanilla-llama, given that PyTorch is now M1 compatible using the 'mps' device.

I'm new to Python, but my observations:

In both `generation.py` and `model.py` there are uses of the function `.cuda()`, which can be replaced with … When attempting to run `example.py` after this, it's the Accelerate framework which throws an error with `RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'` - something is trying to use `cpu` instead of `mps`.

I wonder if this is because the call into accelerate is `load_checkpoint_and_dispatch` with `auto` provided as the device map - is PyTorch preferring `cpu` over `mps` here for some reason? Edit: This …