[Bounty] PyTorch & HuggingFace Interface #139

Open · wants to merge 510 commits into main

Conversation

risingsunomi

Hello all,

I’ve made some updates to the exo library based on the bounty mentioned in this tweet/X post. These changes aim to integrate PyTorch and expand access to various language models through Hugging Face’s AutoModelForCausalLM.

What's New?

  • ShardedHuggingFaceModel: Adds sharding support for Hugging Face models.
  • PyTorchDynamicShardInferenceEngine: A new inference engine that uses PyTorch tensors for dynamic sharding.

These updates enable the exo library to use PyTorch, allowing access to a broader range of language models.

Limitations and Bugs

Right now, ShardedHuggingFaceModel is focused on LlamaForCausalLM from the Hugging Face transformers library. From that model we access the underlying LlamaModel and the decoder layers it contains, select the layers assigned to a node, and run the PyTorch tensors through them as needed. I focused on Llama 3.1 8B, as that was the largest model I could even partially run on my hardware.
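Roughly, the layer selection looks like the sketch below. The shard bounds and variable names are illustrative, not the exact attributes in the PR, and the decoder-layer call signature varies a bit across transformers versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical shard bounds for this node; the PR's actual attribute names differ.
START_LAYER, END_LAYER = 0, 15

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

llama = model.model                                    # LlamaModel inside LlamaForCausalLM
shard_layers = llama.layers[START_LAYER:END_LAYER + 1]  # only this node's decoder layers

# Embed the prompt, then push hidden states through only this node's layers.
input_ids = tokenizer("Hello", return_tensors="pt").input_ids
hidden_states = llama.embed_tokens(input_ids)
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)
# Recent transformers versions expect the rotary embeddings precomputed at the model level.
position_embeddings = llama.rotary_emb(hidden_states, position_ids)
for layer in shard_layers:
    hidden_states = layer(hidden_states, position_embeddings=position_embeddings)[0]
```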

Due to my current hardware limitations (specifically GPU and VRAM), I wasn’t able to fully test this across multiple nodes. The model currently takes about 30 seconds per token to generate for me (I have slow GPUs), which might be related to the absence of KV caching (not implemented due to VRAM constraints). Generation also runs without ever reaching an EOT token, and the outputs look random.
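For reference, HF-style KV caching (which is what's missing here) looks roughly like the sketch below; with a cache, only the newest token is fed each step instead of re-running the whole sequence:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Meta-Llama-3.1-8B-Instruct"  # example model id from this thread
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

input_ids = tokenizer("Hello", return_tensors="pt").input_ids
past_key_values = None
for _ in range(20):
    # First step feeds the full prompt; later steps feed only the newest token.
    step_ids = input_ids if past_key_values is None else input_ids[:, -1:]
    out = model(step_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decode
    input_ids = torch.cat([input_ids, next_id], dim=-1)
```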

Request for Feedback

I’m sharing this in the hope that others can test it on more capable setups and provide feedback on how to enhance performance and stability.

Important Note on Meta LLaMA 3.1 Model

If you plan to test with the official Meta LLaMA 3.1 model, please note:

  • Access: You’ll need to request access and authenticate using huggingface-cli to download it.
  • Command: Run the following command before using the model:
    huggingface-cli login
    
    I’m exploring ways to simplify this process, but for now, it’s necessary.
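If you'd rather authenticate programmatically (for example in a setup script), huggingface_hub also exposes a login() helper; the HF_TOKEN variable name below is just an example:

```python
# Programmatic alternative to `huggingface-cli login`.
import os
from huggingface_hub import login

# Reads a token exported in the environment (example variable name).
login(token=os.environ["HF_TOKEN"])
```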

Chat API Update

  • Added an option to select the LLaMA 3.1 model in the chat API.

Looking forward to any feedback or suggestions you might have.

Thank you

@AlexCheema
Contributor

It generates now but I got some errors when running an inference on llama-3.2-1b:

error loading and splitting model: 'ShardedHuggingFaceModel' object has no attribute 'llm_model_config'
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 729, in grpc._cython.cygrpc._handle_exceptions
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 185, in _send_error_status_from_server
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 99, in execute_batch
grpc._cython.cygrpc.ExecuteBatchError: Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x127e02660>, 
<grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x125c0f850>)

@risingsunomi
Author

> It generates now but I got some errors when running an inference on llama-3.2-1b:
>
> error loading and splitting model: 'ShardedHuggingFaceModel' object has no attribute 'llm_model_config'
> Traceback (most recent call last):
>   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 729, in grpc._cython.cygrpc._handle_exceptions
>   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 185, in _send_error_status_from_server
>   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/callback_common.pyx.pxi", line 99, in execute_batch
> grpc._cython.cygrpc.ExecuteBatchError: Failed "execute_batch": (<grpc._cython.cygrpc.SendInitialMetadataOperation object at 0x127e02660>,
> <grpc._cython.cygrpc.SendStatusFromServerOperation object at 0x125c0f850>)

You might have to run

pip install -e .

as I'm not finding any reference to llm_model_config in my newest code. Sometimes I have to reinstall after I push some updates.

@thangpnb

@risingsunomi Have you loaded and run this branch successfully with the llama-3.1-8b model:
"TorchDynamicShardInferenceEngine": Shard(model_id="unsloth/Meta-Llama-3.1-8B-Instruct", start_layer=0, end_layer=0, n_layers=32),

@risingsunomi
Author

> @risingsunomi Have you loaded and run this branch successfully with the llama-3.1-8b model:
> "TorchDynamicShardInferenceEngine": Shard(model_id="unsloth/Meta-Llama-3.1-8B-Instruct", start_layer=0, end_layer=0, n_layers=32),

Working on that more tonight. I need to shard the tensors better; it's just awkward to do with transformers and PyTorch. I have a solution in place, just going to hit the gym and then get back at it tonight. Will keep you all updated.

@risingsunomi
Author

risingsunomi commented Oct 20, 2024

Added sharding of the safetensors files so a node only loads the weights for the layers it needs. Unfortunately, the way transformers works, it will still ask for double or more VRAM/RAM to load the model (see the Hugging Face docs on loading big models).
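The general idea is to read the safetensors index, keep only the weight files and keys for this node's layers, and load those tensors directly. A rough sketch, where the snapshot path and layer bounds are illustrative:

```python
import json
from pathlib import Path
from safetensors.torch import load_file

# Illustrative: local snapshot dir of a downloaded HF model and this node's layer range.
model_dir = Path("path/to/model/snapshot")
START_LAYER, END_LAYER = 0, 15

index = json.loads((model_dir / "model.safetensors.index.json").read_text())
weight_map = index["weight_map"]  # maps parameter name -> safetensors shard file

def needed(name: str) -> bool:
    # Keep this node's decoder layers; embeddings/norm/lm_head handling is simplified
    # here (in practice only the first/last node needs them).
    if not name.startswith("model.layers."):
        return True
    layer_idx = int(name.split(".")[2])
    return START_LAYER <= layer_idx <= END_LAYER

# Open each shard file once and keep only the tensors this node needs.
state_dict = {}
for shard_file in {f for n, f in weight_map.items() if needed(n)}:
    tensors = load_file(str(model_dir / shard_file))
    state_dict.update({n: t for n, t in tensors.items() if needed(n)})
```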

@AlexCheema
Contributor

[screenshot: IMG_0209]

Not possible to use this option?

@risingsunomi
Author

> [screenshot: IMG_0209]
>
> Not possible to use this option?

Even using that doesn't help. In the Discord I explained why in more detail than here, but it doesn't seem to have an effect on its own. When I pair it with setting the PyTorch device to "auto", it seems to work, but that might cause issues on some systems and still takes a fair bit of VRAM.
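For reference, the kind of loading options being discussed look roughly like this; the exact flags in my code may differ:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch of the memory-related loading options discussed above, not the exact PR call.
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Meta-Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,   # avoid building a second full copy in CPU RAM during load
    device_map="auto",        # let accelerate place weights across GPU/CPU
)
```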

@AlexCheema
Contributor

> [screenshot: IMG_0209]
> Not possible to use this option?
>
> Even using that doesn't help. In the Discord I explained why in more detail than here, but it doesn't seem to have an effect on its own. When I pair it with setting the PyTorch device to "auto", it seems to work, but that might cause issues on some systems and still takes a fair bit of VRAM.

That's frustrating - this is not useful if it uses double the VRAM.
Would this require going a bit deeper into PyTorch?

@risingsunomi
Author

> [screenshot: IMG_0209]
> Not possible to use this option?
>
> Even using that doesn't help. In the Discord I explained why in more detail than here, but it doesn't seem to have an effect on its own. When I pair it with setting the PyTorch device to "auto", it seems to work, but that might cause issues on some systems and still takes a fair bit of VRAM.
>
> That's frustrating - this is not useful if it uses double the VRAM. Would this require going a bit deeper into PyTorch?

Yes, we will need to write models in pure PyTorch to fully utilize it for exo. Going to work on one for Llama 3 and possibly Qwen2 to remove this transformers limitation.
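A pure-PyTorch shard would roughly be a small nn.Module that owns only its slice of decoder blocks, along the lines of this sketch (names are illustrative):

```python
import torch
import torch.nn as nn

class ShardedDecoder(nn.Module):
    """Illustrative sketch: a node owns only its slice of decoder blocks."""

    def __init__(self, blocks: nn.ModuleList, start: int, end: int):
        super().__init__()
        # Keep only this node's blocks; weights for other layers are never loaded.
        self.blocks = nn.ModuleList(blocks[start:end + 1])

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Hidden states arrive from the previous node and are forwarded to the next.
        for block in self.blocks:
            hidden_states = block(hidden_states)
        return hidden_states
```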

@risingsunomi
Author

risingsunomi commented Oct 23, 2024

Still working on this and testing a pure-PyTorch, non-transformers implementation of the Llama model. Working through some bugs to get it loaded, but by tomorrow or this weekend I should have a version up for testing with sharding and without the double VRAM load.

Current WIP: https://github.com/risingsunomi/exo-nvidia/blob/pr139-dev-oct24/exo/inference/torch/models/llama3.py

@risingsunomi
Author

I'm finding that building a pure PyTorch implementation isn't working, even following all the examples. Going to try to use the official Meta code and hack it for the sharding we need. I'll keep trying my method, but I'm not making much progress: I'm able to shard the safetensors and everything, but inference isn't working at all. Still hitting at it.

Any other eyes on this would be appreciated. Right now it's in shambles, but I am using the torchtune approach as opposed to fairscale. I think I might switch to fairscale, though, as the official Meta Llama model is looking better.

My WIP code

Sorry again for the delay on this; my regular job has me swamped, but I'm going to try to push this faster before the month is out.

Thank you again

@risingsunomi
Author

> I'm finding that building a pure PyTorch implementation isn't working, even following all the examples. Going to try to use the official Meta code and hack it for the sharding we need. I'll keep trying my method, but I'm not making much progress: I'm able to shard the safetensors and everything, but inference isn't working at all. Still hitting at it.
>
> Any other eyes on this would be appreciated. Right now it's in shambles, but I am using the torchtune approach as opposed to fairscale. I think I might switch to fairscale, though, as the official Meta Llama model is looking better.
>
> My WIP code
>
> Sorry again for the delay on this; my regular job has me swamped, but I'm going to try to push this faster before the month is out.
>
> Thank you again

Spoke too soon, I think I can get this working.
[screenshot]
