Is split offloading supported for 8bit mode? #193
Comments
It is possible. I got LLaMA 13B to work on my 3080Ti with both RAM offload and 8-bit, and it's 3 times faster than staying in fp16, since more modules can be loaded into VRAM. But this requires some hacking (details below).
That's new to me. @sgsdxzy, can you copy and paste the exact modifications that you had to make to get this working, for clarity?
Related to #190
I did a lot of hacky work on modeling_utils.py. I initially tried inserting ... So what worked for me:
If this solution can be integrated here... the possibilities!
@oobabooga I was originally preparing a PR for you... now that the transformers part has been rejected, things get a bit difficult.
Copy the code that converts the device map from 'auto' to a dict and paste it before the `# Extend the modules to not convert to keys that are supposed to be offloaded to ...` block, so the offloaded keys are excluded from 8-bit conversion. Then pass --load-in-8bit --auto-devices to server.py.
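For readers landing here later, here is a minimal sketch of the same idea using the public transformers API instead of hand-editing modeling_utils.py. It assumes a recent transformers with bitsandbytes and accelerate installed; the checkpoint name and memory limits are placeholders, and `llm_int8_enable_fp32_cpu_offload` is used to keep CPU-offloaded modules out of the 8-bit conversion, which is what the edit described above accomplishes.

```python
# Minimal sketch: 8-bit weights on the GPU, layers that do not fit offloaded to CPU RAM.
# The checkpoint name and memory limits are placeholders; adjust them for your hardware.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "huggyllama/llama-13b"  # placeholder LLaMA 13B checkpoint

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Keep modules that land on the CPU in fp32 rather than int8,
    # the same effect as extending modules_to_not_convert by hand.
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # let accelerate build the GPU/CPU split
    max_memory={0: "10GiB", "cpu": "30GiB"},  # cap VRAM so some layers offload to RAM
    quantization_config=quant_config,
)

print(model.hf_device_map)  # inspect which modules ended up on cuda:0 vs cpu
```

If everything fits in VRAM, `hf_device_map` will show only GPU entries; anything mapped to `cpu` is an offloaded module running in fp32.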
Oh my frick... thank you so much, kind internet stranger. I was able to use your code and can now split 8-bit between CPU and GPU. Oh my god, wow, absolutely incredible, thank you so much!
Thanks a lot for this!
Fixed in #358. Just use ...
It would be really great to run the LLaMA 30B model in 8-bit mode, but right now I can't get the memory to split between GPU and CPU in 8-bit mode.
I feel like if this were possible, it would be revolutionary!