
Add --int8-threshold argument #198

Closed
wants to merge 3 commits

Conversation

pamparamm

@pamparamm pamparamm commented Mar 8, 2023

~40% inference speed gain on 8-bit quantized models (7.6 tokens/s vs. 5.2 tokens/s with LLaMA 13B) on an RTX 4090 with the --int8-threshold 0 startup argument. May increase VRAM usage during inference. Related to: #190
You may get a different performance boost on your configuration; try experimenting with different threshold values such as 0, 1, 5, 6, 60, 1000, etc.
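For context, a minimal sketch of how a value like this can be forwarded to transformers' 8-bit loading path (assuming a transformers version that provides BitsAndBytesConfig; the model path below is only a placeholder, not the PR's actual code):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# llm_int8_threshold is the value the --int8-threshold flag would control.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=0.0,  # e.g. 0, as in the numbers quoted above
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-13b",  # placeholder model path
    quantization_config=quant_config,
    device_map="auto",
)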

@CypherNaught-0x
Contributor

Has anyone done any testing on whether this negatively impacts model "accuracy"? According to the documentation, it may degrade quality, and from what I've heard the int8 outlier balance is not trivial. Maybe we could do some fixed-seed testing and run a couple of benchmarks?

@lxe
Contributor

lxe commented Mar 9, 2023

No benchmarks, but I can confirm it's working well.

@oobabooga
Owner

I find that this reduces generation time by about 25-30%, but if I compare the results for the same prompt using the Debug (deterministic) preset, they seem to be richer without llm_int8_threshold=0 and also closer to the regular fp16 results.

Here is a definition for this parameter:

https://huggingface.co/docs/transformers/main/main_classes/quantization#play-with-llmint8threshold

Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16.

I am not sure how to interpret llm_int8_threshold=0.
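A toy illustration may help interpret it (this is not bitsandbytes code; it only mimics the outlier rule quoted above, with made-up numbers): any hidden-state value whose magnitude exceeds the threshold is routed to fp16, so a threshold of 0 sends essentially every value through the fp16 path.

import torch

hidden = torch.tensor([0.3, -0.7, 8.5, 0.1, -12.0])  # made-up hidden-state values

for threshold in (6.0, 0.0):
    outliers = hidden.abs() > threshold
    print(f"threshold={threshold}: {int(outliers.sum())}/{hidden.numel()} values treated as fp16 outliers")

# threshold=6.0: 2/5 values treated as fp16 outliers
# threshold=0.0: 5/5 values treated as fp16 outliers (int8 effectively bypassed)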

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 9, 2023

I want to try it and see what it does on Pascal. I have to set mine to 0.8 or I get errors. Maybe 0 works and doesn't balloon the memory while generating.

Ok, I found out.

0.8 is slow and OOMs; 0 will probably be terrible too, since it essentially turns off 8-bit while generating. 0 gives me the NaN error, hence it's "faster" for you. BUT using a threshold of 1 allows me to generate on older cards just fine and keeps the memory low.

@pamparamm
Author

I think I should turn llm_int8_threshold into a startup argument so everyone can experiment with the threshold, since it has a varying impact on performance/memory across different configurations.
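Something like this, as a rough sketch (argparse-style; the option name matches the PR title, but the rest of the wiring is illustrative rather than the actual webui code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--int8-threshold",
    type=float,
    default=6.0,  # bitsandbytes' documented default
    help="llm_int8_threshold to pass to the 8-bit quantization config.",
)

args = parser.parse_args(["--int8-threshold", "1.5"])
print(args.int8_threshold)  # 1.5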

@oobabooga
Owner

That would be the perfect solution.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 9, 2023

Proof:

[screenshot: 8bitPascal]

@pamparamm pamparamm changed the title Int 8 performance boost Add --int8-threshold argument Mar 9, 2023
@lxe
Contributor

lxe commented Mar 13, 2023

I've been using this PR for a while :) Any reasons not to merge?

@oobabooga
Owner

Here are the results of a more careful test, using the Debug-deterministic preset to generate 1000 tokens:

  • --int8-threshold 0:

Output generated in 75.20 seconds (13.28 tokens/s, 999 tokens)

  • --int8-threshold 2:

Output generated in 88.33 seconds (11.31 tokens/s, 999 tokens)

  • --int8-threshold 6:

Output generated in 75.30 seconds (13.27 tokens/s, 999 tokens)

  • --int8-threshold 1000:

Output generated in 58.35 seconds (17.12 tokens/s, 999 tokens)

The conclusion is that this parameter can improve generation performance somewhat, but it degrades accuracy very noticeably and is hard to interpret. I feel like it would be a confusing addition to the web UI.

@oobabooga oobabooga closed this Mar 14, 2023
@Ph0rk0z
Contributor

Ph0rk0z commented Mar 14, 2023

That's because this parameter is only useful for pre-Ampere cards. The default value is 6. A value of 0 essentially uses FP16 for inference; look at nvtop while generating and you should see memory balloon up.

Regular users should by no means have to set this every time they use 8-bit, but someone like me gets NaN errors and can't use 8-bit at all without it.

What was the effect on accuracy? I don't notice anything when using 1 or 1.5 on LLaMA 13B.

@oobabooga oobabooga reopened this Mar 14, 2023
@oobabooga
Owner

@Ph0rk0z what is the benefit of this on pre-Ampere cards? I have tried using it on my GTX 1650 and it didn't solve the inf/nan generation error that it gets for being an older GPU not supported by bitsandbytes.

I haven't measured the quality degradation objectively, but the outputs when I use this parameter are significantly different in the examples that I have tried.

@oobabooga oobabooga closed this Mar 14, 2023
@Ph0rk0z
Contributor

Ph0rk0z commented Mar 14, 2023

What values did you use? Did you try between 0.5 and 1.5? People using Kobold picked 0.8.

There is no benefit besides being able to use int8 at all. I can run LLaMA 13B but not Pythia, which goes OOM. Performance was better than int4, but I did not use deterministic prompts like you did to check whether the output is affected.

Value too low = OOM during generation. Value too high = NaN.

Unfortunately, I don't have a smaller card to test on, so I don't know what the sweet spot for a 1650 would be.

@oobabooga
Owner

I have just run tests with 0.5, 0.8, and 1.5, and in all cases I got:

  File "/home/user/.miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

For clarity: GTX 1650 GPU.

@Ph0rk0z
Contributor

Ph0rk0z commented Mar 14, 2023

Your card should not be using cublasLt at all; it doesn't support it. Perhaps bitsandbytes "fixed" it in a broken way. The function he changed should return False for compute capability < 7.5.

This is what he did: bitsandbytes-foundation/bitsandbytes@ec5fbf4

I basically did this (forcing the check to always report no cublasLt support):

def is_cublasLt_compatible(cc):
    # Ignore the compute capability entirely and always report cuBLASLt as
    # unavailable, so int8 matmul falls back to the non-cublasLt path.
    has_cublaslt = False
    return has_cublaslt

I never got your error. I installed bitsandbytes from the GitHub repo and edited it: https://github.com/TimDettmers/bitsandbytes/releases/download/0.37.0/bitsandbytes-0.37.0-py3-none-any.whl
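For contrast, the compute-capability gate that this hack short-circuits looks roughly like the sketch below (an approximation, not the exact upstream bitsandbytes code; it assumes cc arrives as a "major.minor" string):

def is_cublasLt_compatible(cc):
    has_cublaslt = False
    if cc is not None:
        major, minor = map(int, cc.split('.'))
        # cuBLASLt int8 kernels need compute capability 7.5 (Turing) or newer.
        has_cublaslt = (major, minor) >= (7, 5)
    return has_cublaslt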
