[BOUNTY - $500] Llama.cpp inference engine #167

AlexCheema · 2024-08-22T14:58:15Z

It should automatically detect the best device to run on.
We should require zero manual configuration from the user. By default, llama.cpp, for example, requires specifying the device.
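As a rough illustration of the zero-configuration goal, here is a minimal sketch of backend auto-detection using only the Python standard library. The function name `detect_backend` and the returned labels (`"metal"`, `"cuda"`, `"rocm"`, `"cpu"`) are hypothetical and not part of llama.cpp or exo. The heuristic itself (Apple Silicon implies Metal; the presence of `nvidia-smi` or `rocminfo` on `PATH` implies an NVIDIA or AMD GPU) is an assumption, not a guaranteed check.

```python
import platform
import shutil
from typing import Callable, Optional


def detect_backend(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Guess a compute backend label without any user configuration.

    `system`, `machine`, and `which` are injectable so the heuristic can be
    exercised without the real hardware; by default they come from the host.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    # Assumption: Apple Silicon Macs should use the Metal backend.
    if system == "Darwin" and machine == "arm64":
        return "metal"
    # Assumption: an NVIDIA driver install puts `nvidia-smi` on PATH.
    if which("nvidia-smi"):
        return "cuda"
    # Assumption: a ROCm install puts `rocminfo` on PATH.
    if which("rocminfo"):
        return "rocm"
    # Fall back to CPU when nothing else is detected.
    return "cpu"


if __name__ == "__main__":
    print(detect_backend())
```

A real engine would likely confirm the guess by actually initializing the backend and falling back to CPU on failure, since a tool being on `PATH` does not prove that a usable GPU is present.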

danny-avila · 2024-08-25T23:27:19Z

Many others and I likely have only Windows systems, and llama.cpp is practically the only option.

MLX is macOS-only, and tinygrad states:

> Windows support has been dropped to focus on Linux and Mac OS.
> Some functionality may work on Windows but no support will be provided, use WSL instead.
> source: https://github.com/tinygrad/tinygrad/releases/tag/v0.7.0

For this opening statement to be true, it would need to include Windows-based systems, especially old gaming rigs.

> Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

At the very least, a thorough guide on setting up tinygrad via WSL/WSL2 would be appreciated, because this is your only documentation:

> Example Usage on Multiple MacOS Devices

bayedieng · 2024-08-27T11:07:59Z

I'd like to look into this. Adjacently, llamafiles might be worth looking into, as they are binaries able to run on multiple desktop OSes without any configuration. Though I'm not sure about Android or iOS support.

AlexCheema · 2024-08-27T11:10:36Z

> I'd like to look into this. Adjacently, llamafiles might be worth looking into, as they are binaries able to run on multiple desktop OSes without any configuration. Though I'm not sure about Android or iOS support.

Go for it!

AlexCheema · 2024-08-27T11:55:06Z

@bayedieng I'd recommend looking at https://github.com/abetlen/llama-cpp-python -- it should hopefully be low-level enough for what we need to do. Also, I'd recommend looking at #139 for a minimal implementation of an inference engine that doesn't require explicitly defining every model -- it's a general solution.

bayedieng · 2024-08-27T12:37:36Z

Thanks for the suggestion. Yeah, I had already seen the Python bindings and went ahead and started a draft PR.

thegodone · 2024-09-14T05:18:00Z

I wonder whether WebGPU can be plugged in on top of llama.cpp via the https://github.com/AnswerDotAI/gpu.cpp wrapper?

bayedieng · 2024-09-21T13:35:57Z

It would seem that the llama.cpp API is too high-level to perform sharded inference, as it doesn't provide access to individual layers. To do so, one would have to use the GGML bindings directly to create a suitable inference engine compatible with Exo.

It was stated that it would be nice to have an API similar to the PyTorch engine's, where a base model is defined once and reused everywhere, as provided by AutoModelForCausalLM. However, GGML does not provide a similar API for inference, so each model would likely have to be implemented individually, the same way tinygrad and MLX do it.
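To make the layer-access requirement concrete: in sharded inference each node runs only a contiguous slice of the model's layers, so the engine must be able to execute an arbitrary layer range rather than the whole model end-to-end. Below is a minimal sketch of the partitioning step only. The helper `partition_layers` and the memory-proportional split are assumptions for illustration and do not describe Exo's actual partitioning code or any GGML API.

```python
from typing import List, Tuple


def partition_layers(n_layers: int, memory_gb: List[float]) -> List[Tuple[int, int]]:
    """Split layer indices [0, n_layers) into contiguous half-open ranges.

    One (start, end) range is returned per node, sized roughly in proportion
    to that node's memory. A very small node may receive an empty range.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    if not memory_gb or any(m <= 0 for m in memory_gb):
        raise ValueError("memory_gb must be a non-empty list of positive values")

    total = sum(memory_gb)
    ranges: List[Tuple[int, int]] = []
    start = 0
    cumulative = 0.0
    for i, mem in enumerate(memory_gb):
        cumulative += mem
        if i == len(memory_gb) - 1:
            # The last node takes whatever remains, so rounding never leaves a gap.
            end = n_layers
        else:
            end = round(n_layers * cumulative / total)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # Three hypothetical nodes with 16 GB, 8 GB, and 8 GB of memory.
    for node, (lo, hi) in enumerate(partition_layers(32, [16, 8, 8])):
        print(f"node {node}: layers [{lo}, {hi})")
```

An engine that only exposes whole-model calls cannot evaluate just `[lo, hi)` and hand the intermediate activations to the next node, which is the gap described above.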

@AlexCheema please let me know if you'd like to go forward and build the inference engine with ggml bindings.

danny-avila · 2024-09-21T13:44:11Z

Heads up, LocalAI now has distributed inference:

https://localai.io/features/distribute/

bayedieng · 2024-09-21T13:55:00Z

> Heads up, LocalAI now has distributed inference:
>
> https://localai.io/features/distribute/

I'm not familiar with them, but they likely just use the underlying GGML API as well, considering llama.cpp does inference end-to-end. llama.cpp also has a distributed inference example using GGML:

https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc

bayedieng · 2024-10-25T19:24:15Z

I've been caught up in a whole bunch of work recently and have had little time to work on the llama.cpp support. I wouldn't want to hold the PR hostage, so if someone's capable of completing it, go ahead!

AlexCheema assigned bayedieng Aug 27, 2024

bayedieng mentioned this issue Aug 27, 2024

Add Llama.cpp Support #183

Closed

AlexCheema mentioned this issue Aug 27, 2024

[BOUNTY - $200] Windows native support #186

Open

bayedieng mentioned this issue Oct 12, 2024

Add LLama CPP Support #335

Closed

3 tasks

bayedieng removed their assignment Oct 25, 2024
