Add --int8-threshold argument #198
Conversation
Has anyone done any testing of whether this negatively impacts model "accuracy"? According to the documentation, it may degrade performance, and as far as I've heard, balancing int8 is not trivial. Maybe we could do some fixed-seed testing and run a couple of benchmarks?
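A fixed-seed comparison along those lines could be as simple as the sketch below (the model path and prompt are placeholders; it assumes transformers with bitsandbytes installed, and reloads the model per run because the threshold is fixed at load time):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "models/llama-13b"  # placeholder path
PROMPT = "Common sense questions and answers\n\nQuestion: What is the capital of France?"

def run(threshold: float) -> tuple[str, float]:
    config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=threshold)
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, device_map="auto", quantization_config=config
    )
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
    torch.manual_seed(0)  # fixed seed so sampling is reproducible across runs
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=200, do_sample=True)
    return tokenizer.decode(output[0]), time.time() - start

# Compare the bitsandbytes default (6.0) against the value proposed here (0.0).
for t in (6.0, 0.0):
    text, seconds = run(t)
    print(f"threshold={t}: {seconds:.1f}s\n{text}\n")
```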
No benchmarks, but I can confirm it's working well.
I find that this reduces generation time by about 25-30%, but if I compare the results for the same prompt with and without it, the outputs differ. Here is a definition for this parameter: https://huggingface.co/docs/transformers/main/main_classes/quantization#play-with-llmint8threshold

I am not sure how to interpret it.
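For interpretation: per the linked docs, hidden-state values whose magnitude exceeds the threshold are treated as outliers, and the matmul is decomposed so the outlier columns run in fp16 while the rest run in int8. A toy illustration of that split (not the real bitsandbytes kernel, just the idea behind the parameter):

```python
import torch

def split_by_threshold(x: torch.Tensor, threshold: float):
    """Split activation columns into an fp16 outlier path and an int8 path."""
    # A column is an outlier if any value in it exceeds the threshold in magnitude.
    outlier_cols = (x.abs() > threshold).any(dim=0)
    return x[:, outlier_cols], x[:, ~outlier_cols]

x = torch.randn(4, 8) * 2
outliers, regular = split_by_threshold(x, threshold=6.0)
# threshold=0.0 makes essentially every column an outlier, i.e. the whole
# matmul falls back to fp16 -- which matches the "faster on new cards,
# OOM/NaN on older cards" behaviour described later in this thread.
print(outliers.shape, regular.shape)
```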
I want to try it and see what it does on Pascal. I have to set mine to 0.8 or I get errors. Maybe 0 works and doesn't balloon up the memory while generating.

Ok, I found out: 0.8 is slow and OOMs. 0 will probably be terrible too; it in essence turns off 8-bit while generating. 0 gives me the NaN error. Hence it's "faster" for you. BUT using a threshold of 1 allows me to generate on older cards just fine and keep the low memory.
I think I should turn …
That would be the perfect solution.
I've been using this PR for a while :) Any reasons not to merge?
Here are the results of a more careful test, using a deterministic prompt.
The conclusion is that this parameter can improve generation performance somewhat, but it degrades accuracy very noticeably and is hard to interpret. I feel like it would be a confusing addition to the web UI.
That's because this parameter is only useful for pre-Ampere cards. The default value is 6. A value of 0 in essence uses FP16 for inference. Look at nvtop while generating and you should see memory balloon up. Regular users shouldn't have to set this every time they use 8-bit, but someone like me gets NaN errors and can't use 8-bit at all without it.

What was the effect on accuracy? I don't notice anything when using 1 or 1.5 on LLaMA 13B.
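To see that ballooning without nvtop, peak allocation can also be read straight from PyTorch; a rough sketch (numbers will vary by card and model):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) here with the threshold under test ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM during generation: {peak_gib:.2f} GiB")
```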
@Ph0rk0z what is the benefit of this on pre-Ampere cards? I have tried using it on my GTX 1650 and it didn't solve the cublasLt error that I get there.

I haven't measured the performance degradation objectively, but the outputs when I use this parameter are significantly different in the examples that I have tried.
What values did you use? Did you try between 0.5 and 1.5? People using kobold picked 0.8. There is no benefit besides being able to use int8 at all; I can run llama-13b but not pythia, which goes OOM. Performance was better than int4, but I did not use deterministic prompts like you to check if the output is affected.

Value too low = OOM during generation. Value too high = NaN. I don't have a smaller card to test on, unfortunately, so I don't know what the sweet spot for a 1650 would be.
I have just made tests with 0.5, 0.8, and 1.5, and in all cases I got the same cublasLt error.

For clarity: GTX 1650 GPU.
Your card should not be using cublasLt at all; it doesn't support it. Perhaps bitsandbytes fixed it in a hacky way. The function he changed should return false for compute capability < 7.5. This is what he did: bitsandbytes-foundation/bitsandbytes@ec5fbf4. I basically did this:
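The change being described amounts to a guard like the following (a paraphrase of the linked commit, not the exact bitsandbytes code; the helper name mirrors the one in newer bitsandbytes releases):

```python
import torch

def supports_igemmlt(device: torch.device) -> bool:
    """Return False on cards below compute capability 7.5, which lack
    the cublasLt int8 tensor-core kernels used by the LLM.int8() path."""
    return torch.cuda.get_device_capability(device=device) >= (7, 5)
```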
I never got your error. I installed bitsandbytes from the GitHub repo and edited it. https://github.com/TimDettmers/bitsandbytes/releases/download/0.37.0/bitsandbytes-0.37.0-py3-none-any.whl
~40% inference speed gain in 8-bit quantized models (7.6 tokens/s vs 5.2 tokens/s with LLaMA 13B) on an RTX 4090 with the `--int8-threshold 0` startup argument. May increase VRAM usage during inference. Related to: #190

You may get a different performance boost on your configuration; try experimenting with different threshold values like 0, 1, 5, 6, 60, 1000, etc.
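For reference, wiring a flag like this into a transformers-based loader takes only a few lines; a sketch of the idea (not the exact diff in this PR, and the model path is a placeholder):

```python
import argparse
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

parser = argparse.ArgumentParser()
parser.add_argument("--load-in-8bit", action="store_true")
parser.add_argument("--int8-threshold", type=float, default=6.0,
                    help="llm_int8_threshold passed to bitsandbytes (default: 6.0)")
args = parser.parse_args()

# Only build a quantization config when 8-bit loading is requested.
quant_config = None
if args.load_in_8bit:
    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=args.int8_threshold,
    )

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b",  # placeholder path
    device_map="auto",
    quantization_config=quant_config,
)
```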