Replies: 2 comments
-
I, too, would love to know this.
-
Are there any solutions?
-
I've looked through a number of GPTQ forks and so far found nothing on this, so I thought I'd ask here. The quantization (compression) examples all set e.g. CUDA_VISIBLE_DEVICES=0 for that step, then use multiple devices for benchmarking and inference. See, for example, the language generation section of https://github.com/qwopqwop200/GPTQ-for-LLaMa
I have plenty of CPU RAM, yet I seemingly can't quantize llama-30b on just one of my 3060 (12 GB) GPUs. If that process could be split across GPUs, I think it would fit on two of them, much as --auto-devices makes almost all models fit for inference.
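For a rough sense of the numbers, here is a back-of-the-envelope sketch. It assumes about 32.5 billion parameters for llama-30b (an approximate figure, not taken from the repo) and counts weights only, ignoring activations and any working buffers the quantization step needs, so real usage will be higher. The helper name weight_gib is made up for illustration.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumption: approximate parameter count for llama-30b.
N_PARAMS = 32.5e9
# One RTX 3060 has 12 GB of VRAM (treated as 12 GiB here for simplicity).
GPU_GIB = 12

fp16_gib = weight_gib(N_PARAMS, 16)  # unquantized half-precision weights
int4_gib = weight_gib(N_PARAMS, 4)   # weights after 4-bit quantization

for label, size in [("fp16", fp16_gib), ("4-bit", int4_gib)]:
    print(f"{label}: {size:.1f} GiB | "
          f"fits on one GPU: {size <= GPU_GIB} | "
          f"fits on two GPUs: {size <= 2 * GPU_GIB}")
```

By this estimate the fp16 weights come to roughly 60 GiB and the 4-bit weights to roughly 15 GiB, so the quantized model is too big for one 12 GB card but within reach of two, which is why splitting seems worth asking about.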
Is there a fundamental limitation that forces quantization to run on just one GPU?