[Bounty] PyTorch & HuggingFace Interface #139

Open · wants to merge 510 commits into main

Conversation

risingsunomi

Hello all,

I’ve made some updates to the exo library based on the bounty mentioned in this tweet/X post. These changes aim to integrate PyTorch and expand access to various language models through Hugging Face’s AutoModelForCausalLM.

What's New?

  • ShardedHuggingFaceModel: Adds sharding support for Hugging Face models.
  • PyTorchDynamicShardInferenceEngine: A new inference engine that uses PyTorch tensors for dynamic sharding.

These updates enable the exo library to use PyTorch, allowing access to a broader range of language models.

Limitations and Bugs

Right now, ShardedHuggingFaceModel is focused on LlamaForCausalLM from the Hugging Face transformers library. From that model we access the underlying LlamaModel and the decoder layers it contains, select the layers assigned to a node, and run the PyTorch tensors through them as needed. I focused on Llama 3.1 8B, as that was the largest model I could even partially run on my hardware.
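Roughly, the layer selection looks like the sketch below. The shard bounds and variable names are illustrative, not the exact attributes in the PR, and the decoder-layer call signature varies a bit across transformers versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical shard bounds for this node; the PR's actual attribute names differ.
START_LAYER, END_LAYER = 0, 15

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

llama = model.model                                    # LlamaModel inside LlamaForCausalLM
shard_layers = llama.layers[START_LAYER:END_LAYER + 1]  # only this node's decoder layers

# Embed the prompt, then push hidden states through only this node's layers.
input_ids = tokenizer("Hello", return_tensors="pt").input_ids
hidden_states = llama.embed_tokens(input_ids)
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)
# Recent transformers versions expect the rotary embeddings precomputed at the model level.
position_embeddings = llama.rotary_emb(hidden_states, position_ids)
for layer in shard_layers:
    hidden_states = layer(hidden_states, position_embeddings=position_embeddings)[0]
```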

Due to my current hardware limitations (specifically GPU and VRAM), I wasn’t able to fully test this across multiple nodes. The model currently takes about 30 seconds per token to generate for me (I have slow GPUs), which might be related to the absence of KV caching (not implemented due to VRAM constraints). Generation also runs without ever reaching an EOT token, and the outputs look random.
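For reference, HF-style KV caching (which is what's missing here) looks roughly like the sketch below; with a cache, only the newest token is fed each step instead of re-running the whole sequence:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"  # example model id from this thread
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

input_ids = tokenizer("Hello", return_tensors="pt").input_ids
past_key_values = None
for _ in range(20):
    # First step feeds the full prompt; later steps feed only the newest token.
    step_ids = input_ids if past_key_values is None else input_ids[:, -1:]
    out = model(step_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decode
    input_ids = torch.cat([input_ids, next_id], dim=-1)
```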

Request for Feedback

I’m sharing this in the hope that others can test it on more capable setups and provide feedback on how to enhance performance and stability.

Important Note on Meta LLaMA 3.1 Model

If you plan to test with the official Meta LLaMA 3.1 model, please note:

  • Access: You’ll need to request access and authenticate using huggingface-cli to download it.
  • Command: Run the following command before using the model:
    huggingface-cli login
    
    I’m exploring ways to simplify this process, but for now, it’s necessary.
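If you'd rather authenticate programmatically (for example in a setup script), huggingface_hub also exposes a login() helper; the HF_TOKEN variable name below is just an example:

```python
# Programmatic alternative to `huggingface-cli login`.
import os
from huggingface_hub import login

# Reads a token exported in the environment (example variable name).
login(token=os.environ["HF_TOKEN"])
```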

Chat API Update

  • Added an option to select the LLaMA 3.1 model in the chat API.

Looking forward to any feedback or suggestions you might have.

Thank you

@AlexCheema
Contributor

It generates now but I got some errors when running an inference on llama-3.2-1b:

error loading and splitting model: 'ShardedHuggingFaceModel' object has no attribute 'llm_model_config'
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 729, in grpc._cython.cygrpc._handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 185, in _send_error_status_from_server
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 99, in execute_batch
grpc._cython.cygrpc.ExecuteBatchError: Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x127e02660>, 
<grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x125c0f850>)

@risingsunomi
Author

> It generates now but I got some errors when running an inference on llama-3.2-1b:
>
> error loading and splitting model: 'ShardedHuggingFaceModel' object has no attribute 'llm_model_config'
> Traceback (most recent call last):
>   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 729, in grpc._cython.cygrpc._handle_exceptions
>   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 185, in _send_error_status_from_server
>   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 99, in execute_batch
> grpc._cython.cygrpc.ExecuteBatchError: Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x127e02660>,
> <grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x125c0f850>)

You might have to run

pip install -e .

as I'm not finding any reference to llm_model_config in my newest code. Sometimes I have to reinstall after I push some updates.

@thangpnb

@risingsunomi Have you loaded and run this branch successfully with the llama-3.1-8b model:
"TorchDynamicShardInferenceEngine": Shard(model_id="unsloth/Meta-Llama-3.1-8B-Instruct", start_layer=0, end_layer=0, n_layers=32),

@risingsunomi
Author

> @risingsunomi Have you loaded and run this branch successfully with the llama-3.1-8b model:
> "TorchDynamicShardInferenceEngine": Shard(model_id="unsloth/Meta-Llama-3.1-8B-Instruct", start_layer=0, end_layer=0, n_layers=32),

Working on that more tonight. I need to shard the tensors better; it's just awkward to do with transformers and PyTorch. I have a solution in place, just going to hit the gym and then get back at it tonight. Will keep you all updated.

@risingsunomi
Author

risingsunomi commented Oct 20, 2024

Added sharding of the safetensors files so a node only loads the weights for the layers it needs. Unfortunately, the way transformers works, it will still ask for double or more VRAM/RAM to load the model (see the Hugging Face docs on loading big models).
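The general idea is to read the safetensors index, keep only the weight files and keys for this node's layers, and load those tensors directly. A rough sketch, where the snapshot path and layer bounds are illustrative:

```python
import json
from pathlib import Path
from safetensors.torch import load_file

# Illustrative: local snapshot dir of a downloaded HF model and this node's layer range.
model_dir = Path("path/to/model/snapshot")
START_LAYER, END_LAYER = 0, 15

index = json.loads((model_dir / "model.safetensors.index.json").read_text())
weight_map = index["weight_map"]  # maps parameter name -> safetensors shard file

def needed(name: str) -> bool:
    # Keep this node's decoder layers; embeddings/norm/lm_head handling is simplified
    # here (in practice only the first/last node needs them).
    if not name.startswith("model.layers."):
        return True
    layer_idx = int(name.split(".")[2])
    return START_LAYER <= layer_idx <= END_LAYER

# Open each shard file once and keep only the tensors this node needs.
state_dict = {}
for shard_file in {f for n, f in weight_map.items() if needed(n)}:
    tensors = load_file(str(model_dir / shard_file))
    state_dict.update({n: t for n, t in tensors.items() if needed(n)})
```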

@AlexCheema
Contributor

[screenshot: IMG_0209]

Not possible to use this option?

@risingsunomi
Author

> [screenshot: IMG_0209]
>
> Not possible to use this option?

Even using that doesn't help. In the Discord I explained why in more detail than here, but it doesn't seem to have an effect on its own. When I pair it with setting the PyTorch device to "auto", it seems to work, but that might cause issues on some systems and still takes a fair bit of VRAM.
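For reference, the kind of loading options being discussed look roughly like this; the exact flags in my code may differ:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch of the memory-related loading options discussed above, not the exact PR call.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,   # avoid building a second full copy in CPU RAM during load
    device_map="auto",        # let accelerate place weights across GPU/CPU
)
```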

@AlexCheema
Contributor

> [screenshot: IMG_0209]
> Not possible to use this option?
>
> Even using that doesn't help. In the Discord I explained why in more detail than here, but it doesn't seem to have an effect on its own. When I pair it with setting the PyTorch device to "auto", it seems to work, but that might cause issues on some systems and still takes a fair bit of VRAM.

That's frustrating - this is not useful if it uses double the VRAM.
Would this require going a bit deeper into PyTorch?

@risingsunomi
Author

> [screenshot: IMG_0209]
> Not possible to use this option?
>
> Even using that doesn't help. In the Discord I explained why in more detail than here, but it doesn't seem to have an effect on its own. When I pair it with setting the PyTorch device to "auto", it seems to work, but that might cause issues on some systems and still takes a fair bit of VRAM.
>
> That's frustrating - this is not useful if it uses double the VRAM. Would this require going a bit deeper into PyTorch?

Yes, we will need to write models in pure PyTorch to fully utilize it for exo. Going to work on one for Llama 3 and possibly Qwen2 to remove this transformers limitation.
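A pure-PyTorch shard would roughly be a small nn.Module that owns only its slice of decoder blocks, along the lines of this sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class ShardedDecoder(nn.Module):
    """Illustrative sketch: a node owns only its slice of decoder blocks."""

    def __init__(self, blocks: nn.ModuleList, start: int, end: int):
        super().__init__()
        # Keep only this node's blocks; weights for other layers are never loaded.
        self.blocks = nn.ModuleList(blocks[start:end + 1])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Hidden states arrive from the previous node and are forwarded to the next.
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return hidden_states
```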

@risingsunomi
Author

risingsunomi commented Oct 23, 2024

Still working on this and testing a pure-PyTorch, non-transformers implementation of the Llama model. Working through some bugs to get it loaded, but by tomorrow or this weekend I should have a version up for testing with sharding and without the double VRAM load.

Current WIP: https://github.com/risingsunomi/exo-nvidia/blob/pr139-dev-oct24/exo/inference/torch/models/llama3.py

@risingsunomi
Author

I'm finding that building a pure PyTorch implementation isn't working, even following all the examples. Going to try to use the official Meta code and hack it for the sharding we need. I'll keep trying my method, but I'm not making much progress: I'm able to shard the safetensors and everything, but inference isn't working at all. Still hitting at it.

Any other eyes on this would be appreciated. Right now it's in shambles, but I am using the torchtune approach as opposed to fairscale. I think I might switch to fairscale, though, as the official Meta Llama model is looking better.

My WIP code

Sorry again for the delay on this; my regular job has me swamped, but I'm going to try to push this faster before the month is out.

Thank you again

@risingsunomi
Author

> I'm finding that building a pure PyTorch implementation isn't working, even following all the examples. Going to try to use the official Meta code and hack it for the sharding we need. I'll keep trying my method, but I'm not making much progress: I'm able to shard the safetensors and everything, but inference isn't working at all. Still hitting at it.
>
> Any other eyes on this would be appreciated. Right now it's in shambles, but I am using the torchtune approach as opposed to fairscale. I think I might switch to fairscale, though, as the official Meta Llama model is looking better.
>
> My WIP code
>
> Sorry again for the delay on this; my regular job has me swamped, but I'm going to try to push this faster before the month is out.
>
> Thank you again

Spoke too soon, I think I can get this working.
[screenshot]
