Codellama 7B OOM on A30 #541
Comments
Hi, could you share the command you used to start Tabby?
To install and run CodeLlama directly using Docker, I add the --model TabbyML/CodeLlama-7B flag to the docker run -it command.
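For reference, this is roughly what such a command looks like; a minimal sketch based on the standard Tabby Docker instructions, where the port mapping and data directory are assumptions rather than details from this thread:

# Sketch only: port and data directory are assumed defaults, not taken from this issue.
docker run -it --gpus all \
  -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model TabbyML/CodeLlama-7B --device cuda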
On a GPU with compute capability >= 8.0 (which the A30 has), the inference engine will attempt to load the model in int8 mode. In that mode, CodeLlama-7B requires roughly 8 GB of VRAM. Could you please confirm that you have sufficient VRAM to run such a model? Additionally, could you share the output of
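The request above appears truncated; presumably it asks for GPU memory information. One common way to check available VRAM (an assumption, not something confirmed by the thread) is:

# Assumed diagnostic, not necessarily the exact output requested above.
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv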
I experience the same problem on an L40 GPU. It happens with Llama, but also with Mistral.
@JohanVer, have you tried version v0.2.2? Does it also exhibit the issue you described? There are a few significant changes between v0.2.0 and v0.2.2 that might affect VRAM usage, so I'd like to confirm.
When deploying the CodeLlama 7B model on a Windows system, VRAM usage increases continuously during inference, whereas on a Linux system it remains stable. On the other hand, when calling the CodeLlama 7B model on Linux, CPU usage increases significantly, while with the StarCoder model it remains stable.
Hello @mydipan396, given @JohanVer's comments, could you also please test the CodeLlama 7B model on Linux with versions 0.2.0 and 0.2.2 to determine whether the out-of-memory (OOM) issue still persists?
After re-testing, it is indeed a version issue, independent of the operating system. The problem occurs with version 0.2.2, but updating to the latest version, 0.3.0, resolves it.
To confirm, does the OOM problem on CodeLlama-7B disappear after upgrading to v0.3.0? Is that correct?
Hi @wsxiaoys, by the way: I tested it with Mistral 7B and StarCoder 1B.
Hello, @JohanVer. Thank you for conducting these experiments. I have identified a potential culprit commit between versions 0.2.0 and 0.2.1, which may increase GPU RAM usage when setting a higher parallelism for GPU replicas (b3b4986). As a temporary workaround, please continue using v0.2.0 for your setup. I will investigate the issue further and provide an update once it's resolved.
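A minimal sketch of that temporary workaround, assuming the standard tabbyml/tabby image tags: pin the image to 0.2.0 instead of using the latest tag.

# Sketch: pins the image tag to v0.2.0 (tag name assumed to follow the release version).
docker run -it --gpus all \
  -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:0.2.0 serve --model TabbyML/CodeLlama-7B --device cuda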
Thanks @wsxiaoys for investigating the problem and for the awesome project :)
Here is the Docker image 0.3.1-rc.0. Please give it a try and see if the issue is resolved!
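Assuming the release candidate is published under the usual image name, trying it would look roughly like this:

# Sketch: pulls and runs the release-candidate tag mentioned above.
docker pull tabbyml/tabby:0.3.1-rc.0
docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:0.3.1-rc.0 serve --model TabbyML/CodeLlama-7B --device cuda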
@wsxiaoys
Hi @mydipan396 @ClarkWain, could you also test with 0.3.1-rc.0 to see whether it fixes the issue for you?
I have tried it; the memory usage of CodeLlama-7B stays within the range of 7 GB to 8 GB, and there is no OOM anymore. Thanks.
Released in v0.3.1 https://github.com/TabbyML/tabby/releases/tag/v0.3.1 |
When installing CodeLlama 7B on a device with 24 GB of VRAM such as the A30, it may exceed the available VRAM capacity and result in an error. In such cases, you can try using CTranslate2 instead of the ggml model.
Error Info:
2023-10-12 20:37:08 2023-10-12T12:37:08.293510Z INFO tabby::serve: crates/tabby/src/serve/mod.rs:183: Listening at 0.0.0.0:8080
2023-10-12 20:39:00 terminate called after throwing an instance of 'std::runtime_error'
2023-10-12 20:39:00 what(): CUDA failed with error out of memory