[Bounty] PyTorch & HuggingFace Interface #139
base: main
Conversation
It generates now but I got some errors when running an inference on
you might have to do
as I am not finding a reference to llm_model_config in my newest code. Sometimes I have to do that when I do some updates.
… from single safetensor, starting safetensor sharding test
@risingsunomi Have you loaded and run this branch successfully with the model llama-3.1-8b:
Working on that more tonight. I need to shard the tensors better; it's just weird how to do it with transformers and pytorch. I have a solution in place, just going to hit the gym and then get back at it tonight. Will keep you all updated.
…e process for sharding HF safetensors
Pr139 dev oct24
Added sharding of the safetensor files so a node only loads the weights for the layers it needs. Unfortunately, the way transformers works, it will still ask for double or more the VRAM/RAM to load the model hf doc.
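For readers following along, here is a minimal sketch of that idea (not the code in this branch), assuming the safetensors package and HF-style weight key names; it only materializes the tensors for the layers a node owns:

```python
# Illustrative sketch (not the code in this branch): load only the weights for
# the layers this node owns from a safetensors shard, instead of materializing
# the whole checkpoint. Assumes HF-style key names such as
# "model.layers.<i>.self_attn.q_proj.weight".
from safetensors import safe_open

def load_layer_weights(shard_path: str, start_layer: int, end_layer: int) -> dict:
    wanted = tuple(f"model.layers.{i}." for i in range(start_layer, end_layer))
    weights = {}
    with safe_open(shard_path, framework="pt", device="cpu") as f:
        for key in f.keys():
            if key.startswith(wanted):
                weights[key] = f.get_tensor(key)  # only these tensors hit RAM
    return weights

# e.g. weights for layers 0-7 of a hypothetical shard file:
# partial = load_layer_weights("model-00001-of-00004.safetensors", 0, 8)
```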
Even using that doesn't help. In the Discord I explained why in more detail, but it doesn't seem to have an effect on its own. When I pair it with setting the pytorch device to auto, it seems to work, but again it might cause issues on some systems and still takes a fair bit of VRAM.
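For context, the "auto" device setting referred to here is the standard transformers/accelerate loading path; a hedged sketch, with the model id assumed:

```python
# Sketch of the device_map="auto" path mentioned above (requires the
# `accelerate` package). Placement behaviour differs per system, which is the
# caveat being discussed; the model id is assumed.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",   # assumed model id
    torch_dtype=torch.float16,
    device_map="auto",                # let accelerate spread layers across GPU/CPU
    low_cpu_mem_usage=True,           # avoid building a second full copy in RAM
)
```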
That's frustrating - this is not useful if it uses double the VRAM.
Yes, we will need to write models in pure pytorch to fully utilize it for exo. Going to work on one for llama3 and possibly qwen2 to remove this transformers limitation.
Still working on this and testing a pytorch, non-transformers implementation of the llama model. Working through some bugs to get it loaded, but by tomorrow or this weekend I will have a version up for testing with sharding and without the double VRAM loading. Current WIP: https://github.com/risingsunomi/exo-nvidia/blob/pr139-dev-oct24/exo/inference/torch/models/llama3.py
Finding that building a pure pytorch implementation isn't working, even following all the examples. Going to try to use the official Meta code and hack it for the sharding we need. I will keep trying my method, but I'm not making much progress: I am able to shard the safetensors and everything, but inference is not working at all. Still hitting at it; any other eyes on this would be appreciated. Right now it's in shambles, but I am using the torchtune method as opposed to using fairscale. I think I might switch to fairscale, though, as the official Meta llama model is looking better. Sorry again for the delay on this, as my regular job has me swamped, but I'm going to try to hit this harder before the month is out. Thank you again.
Hello all,
I’ve made some updates to the exo library based on the bounty mentioned in this tweet/X post. These changes aim to integrate PyTorch and expand access to various language models through Hugging Face’s AutoModelForCausalLM.

What's New?
These updates enable the exo library to use PyTorch, allowing access to a broader range of language models.
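For readers unfamiliar with it, this is roughly the generic AutoModelForCausalLM flow the integration builds on (not exo's own code; the model id below is just an assumed example):

```python
# Generic AutoModelForCausalLM usage the integration builds on (not exo code).
# Any causal LM hosted on Hugging Face could be swapped in for the assumed id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"   # assumed example id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello from exo", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```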
Limitations and Bugs
Right now the ShardedHuggingFaceModel is focused on using LlamaForCausalLM from the huggingface transformers library. From that model we break it up using LlamaModel and the layers it contains. We can then select the layers and run the pytorch tensors over them as needed. I focused on using llama3.1 8B, as that was the only model I could even partially run.
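As a rough illustration of that layer-slicing idea (this is not the PR's actual ShardedHuggingFaceModel, and decoder-layer forward signatures vary across transformers releases, so treat the layer call as version-dependent):

```python
# Rough sketch of the layer-slicing idea (NOT the PR's ShardedHuggingFaceModel).
# Decoder-layer forward signatures differ between transformers releases, so
# treat the layer call below as version-dependent.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B"              # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inner = model.model                                     # the wrapped LlamaModel
input_ids = tokenizer("hello", return_tensors="pt").input_ids
hidden = inner.embed_tokens(input_ids)                  # token embeddings
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)
pos_emb = inner.rotary_emb(hidden, position_ids)        # (cos, sin) on recent releases

start_layer, end_layer = 0, 8                           # this node's slice of layers
for layer in inner.layers[start_layer:end_layer]:
    out = layer(hidden, position_ids=position_ids, position_embeddings=pos_emb)
    hidden = out[0] if isinstance(out, tuple) else out  # older releases return a tuple
# On the final node, hidden would then go through inner.norm and model.lm_head.
```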
Due to my current hardware limitations (specifically GPU and VRAM), I wasn’t able to fully test this across multiple nodes. The model currently takes about 30 seconds per token to generate for me (I have slow GPUs), which might be related to the absence of caching (not implemented due to VRAM constraints). It runs without ever reaching an EOT, and the outputs seem random.
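For reference, this is roughly what per-token KV caching looks like in plain transformers, i.e. the piece that was left out here to save VRAM (illustrative only; model id assumed):

```python
# Illustrative only: per-token KV caching in plain transformers, i.e. the piece
# left out of this PR to save VRAM. Model id is assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B"   # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

ids = tok("Hello", return_tensors="pt").input_ids
past = None
for _ in range(8):  # greedy-decode a few tokens
    out = model(ids if past is None else ids[:, -1:],
                past_key_values=past, use_cache=True)
    past = out.past_key_values              # reuse cached K/V instead of recomputing
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```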
Request for Feedback
I’m sharing this in the hope that others can test it on more capable setups and provide feedback on how to enhance performance and stability.
Important Note on Meta LLaMA 3.1 Model
If you plan to test with the official Meta LLaMA 3.1 model, please note:
You will need to use huggingface-cli to download it.

Chat API Update
Looking forward to any feedback or suggestions you might have.
Thank you