Add --int8-threshold argument #198
Conversation
Has anyone done any testing of whether this negatively impacts model "accuracy"? According to the documentation, it may degrade performance, and as far as I've heard, balancing int8 is not trivial. Maybe we could do some fixed-seed testing and run a couple of benchmarks?
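A fixed-seed comparison along those lines could be as simple as the sketch below (the model path and prompt are placeholders; it assumes transformers with bitsandbytes installed, and reloads the model per run because the threshold is fixed at load time):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "models/llama-13b"  # placeholder path
PROMPT = "Common sense questions and answers\n\nQuestion: What is the capital of France?"

def run(threshold: float) -> tuple[str, float]:
    config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=threshold)
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, device_map="auto", quantization_config=config
    )
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model.device)
    torch.manual_seed(0)  # fixed seed so sampling is reproducible across runs
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=200, do_sample=True)
    return tokenizer.decode(output[0]), time.time() - start

# Compare the bitsandbytes default (6.0) against the value proposed here (0.0).
for t in (6.0, 0.0):
    text, seconds = run(t)
    print(f"threshold={t}: {seconds:.1f}s\n{text}\n")
```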
No benchmarks, but I can confirm it's working well.
I find that this reduces generation time by about 25-30%, but if I compare the results for the same prompt with and without it, the outputs differ. Here is a definition for this parameter: https://huggingface.co/docs/transformers/main/main_classes/quantization#play-with-llmint8threshold

I am not sure how to interpret it.
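For interpretation: per the linked docs, hidden-state values whose magnitude exceeds the threshold are treated as outliers, and the matmul is decomposed so the outlier columns run in fp16 while the rest run in int8. A toy illustration of that split (not the real bitsandbytes kernel, just the idea behind the parameter):

```python
import torch

def split_by_threshold(x: torch.Tensor, threshold: float):
    """Split activation columns into an fp16 outlier path and an int8 path."""
    # A column is an outlier if any value in it exceeds the threshold in magnitude.
    outlier_cols = (x.abs() > threshold).any(dim=0)
    return x[:, outlier_cols], x[:, ~outlier_cols]

x = torch.randn(4, 8) * 2
outliers, regular = split_by_threshold(x, threshold=6.0)
# threshold=0.0 makes essentially every column an outlier, i.e. the whole
# matmul falls back to fp16 -- which matches the "faster on new cards,
# OOM/NaN on older cards" behaviour described later in this thread.
print(outliers.shape, regular.shape)
```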
I want to try it and see what it does on Pascal. I have to set mine to 0.8 or I get errors. Maybe 0 works and doesn't balloon up the memory while generating.

Ok, I found out: 0.8 is slow and OOMs. 0 will probably be terrible too; it in essence turns off 8-bit while generating. 0 gives me the NaN error. Hence it's "faster" for you. BUT using a threshold of 1 allows me to generate on older cards just fine and keep the low memory.
I think I should turn …
That would be the perfect solution.
I've been using this PR for a while :) Any reasons not to merge?
Here are the results of a more careful test, using a deterministic prompt.
The conclusion is that this parameter can improve generation performance somewhat, but it degrades accuracy very noticeably and is hard to interpret. I feel like it would be a confusing addition to the web UI.
That's because this parameter is only useful for pre-Ampere cards. The default value is 6. A value of 0 in essence uses FP16 for inference. Look at nvtop while generating and you should see memory balloon up. Regular users shouldn't have to set this every time they use 8-bit, but someone like me gets NaN errors and can't use 8-bit at all without it.

What was the effect on accuracy? I don't notice anything when using 1 or 1.5 on LLaMA 13B.
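To see that ballooning without nvtop, peak allocation can also be read straight from PyTorch; a rough sketch (numbers will vary by card and model):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) here with the threshold under test ...
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM during generation: {peak_gib:.2f} GiB")
```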
@Ph0rk0z what is the benefit of this on pre-Ampere cards? I have tried using it on my GTX 1650 and it didn't solve the cublasLt error that I get there.

I haven't measured the performance degradation objectively, but the outputs when I use this parameter are significantly different in the examples that I have tried.
What values did you use? Did you try between 0.5 and 1.5? People using kobold picked 0.8. There is no benefit besides being able to use int8 at all; I can run llama-13b but not pythia, which goes OOM. Performance was better than int4, but I did not use deterministic prompts like you to check if the output is affected.

Value too low = OOM during generation. Value too high = NaN. I don't have a smaller card to test on, unfortunately, so I don't know what the sweet spot for a 1650 would be.
I have just made tests with 0.5, 0.8, and 1.5, and in all cases I got the same cublasLt error.

For clarity: GTX 1650 GPU.
Your card should not be using cublasLt at all; it doesn't support it. Perhaps bitsandbytes fixed it in a hacky way. The function he changed should return false for compute capability < 7.5. This is what he did: bitsandbytes-foundation/bitsandbytes@ec5fbf4. I basically did this:
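The change being described amounts to a guard like the following (a paraphrase of the linked commit, not the exact bitsandbytes code; the helper name mirrors the one in newer bitsandbytes releases):

```python
import torch

def supports_igemmlt(device: torch.device) -> bool:
    """Return False on cards below compute capability 7.5, which lack
    the cublasLt int8 tensor-core kernels used by the LLM.int8() path."""
    return torch.cuda.get_device_capability(device=device) >= (7, 5)
```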
I never got your error. I installed bitsandbytes from the GitHub repo and edited it. https://github.com/TimDettmers/bitsandbytes/releases/download/0.37.0/bitsandbytes-0.37.0-py3-none-any.whl
~40% inference speed gain in 8-bit quantized models (7.6 tokens/s vs 5.2 tokens/s with LLaMA 13B) on an RTX 4090 with the `--int8-threshold 0` startup argument. May increase VRAM usage during inference. Related to: #190

You may get a different performance boost on your configuration; try experimenting with different threshold values like 0, 1, 5, 6, 60, 1000, etc.
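For reference, wiring a flag like this into a transformers-based loader takes only a few lines; a sketch of the idea (not the exact diff in this PR, and the model path is a placeholder):

```python
import argparse
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

parser = argparse.ArgumentParser()
parser.add_argument("--load-in-8bit", action="store_true")
parser.add_argument("--int8-threshold", type=float, default=6.0,
                    help="llm_int8_threshold passed to bitsandbytes (default: 6.0)")
args = parser.parse_args()

# Only build a quantization config when 8-bit loading is requested.
quant_config = None
if args.load_in_8bit:
    quant_config = BitsAndBytesConfig(
        load_in_8bit=True,
        llm_int8_threshold=args.int8_threshold,
    )

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b",  # placeholder path
    device_map="auto",
    quantization_config=quant_config,
)
```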