Codellama 7B OOM on A30 #541

Closed · mydipan396 opened this issue Oct 12, 2023 · 17 comments
Labels: enhancement (New feature or request)

@mydipan396 (Author)

When running CodeLlama-7B on a device with 24GB of VRAM such as the A30, it can exceed the available VRAM capacity and fail with an error. In such cases, you can try using CTranslate2 instead of the ggml model.

Error info:

```
2023-10-12 20:37:08 2023-10-12T12:37:08.293510Z INFO tabby::serve: crates/tabby/src/serve/mod.rs:183: Listening at 0.0.0.0:8080
2023-10-12 20:39:00 terminate called after throwing an instance of 'std::runtime_error'
2023-10-12 20:39:00 what(): CUDA failed with error out of memory
```

mydipan396 added the enhancement (New feature or request) label on Oct 12, 2023
wsxiaoys (Member) commented Oct 12, 2023

Hi, could you share the command you use to start Tabby?

@mydipan396 (Author)

> Hi, could you share the command you use to start Tabby?

I install and run CodeLlama directly with Docker, passing the --model TabbyML/CodeLlama-7B flag:

```
docker run -it \
  --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model TabbyML/CodeLlama-7B --device cuda
```

@wsxiaoys (Member)

On a GPU with compute capability >= 8.0 (which the A30 has), the inference engine will attempt to load the model in int8 mode. For CodeLlama-7B, that requires around 8GB of VRAM.

Could you please confirm that you have sufficient VRAM to run such a model? Additionally, could you share the output of nvidia-smi before and after running the above Docker command?
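
For anyone following along, a minimal sketch of capturing that output (these are standard nvidia-smi query flags; the log filename is just a placeholder):

```
# One-off snapshot before starting the container
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv

# Log VRAM usage every 5 seconds while Tabby serves requests
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
  --format=csv -l 5 | tee vram_log.csv
```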

wsxiaoys changed the title from "How to use ggml models?" to "Codellama 7B oom on A30" on Oct 13, 2023
wsxiaoys changed the title from "Codellama 7B oom on A30" to "Codellama 7B OOM on A30" on Oct 13, 2023
@JohanVer

I'm experiencing the same problem on an L40 GPU. It happens with Llama but also with Mistral.
I noticed that after startup the GPU memory is at the expected <8GB, but with every Tabby completion request the memory grows. It feels like there is a memory leak somewhere.
When I downgrade to v0.2.0, I do not see the problem.
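
A sketch of how such per-request growth can be surfaced, assuming Tabby's /v1/completions endpoint and this payload shape (both are assumptions here; check the API docs for your version):

```
# Fire repeated completion requests and print VRAM after each one.
# Endpoint path and JSON payload are assumptions; adjust for your Tabby version.
for i in $(seq 1 100); do
  curl -s -X POST http://localhost:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"language": "python", "segments": {"prefix": "def fib(n):"}}' > /dev/null
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
done
```

If the reported memory climbs monotonically with the loop index, that points at a per-request leak rather than normal warm-up allocation.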

@wsxiaoys (Member)

@JohanVer, have you tried v0.2.2? Does it also exhibit the issue you described?

There are a few significant changes between v0.2.0 and v0.2.2 that might affect VRAM usage, so I'd like to confirm.

mydipan396 (Author) commented Oct 18, 2023

When deploying the CodeLlama-7B model on Windows, there is a continuous increase in VRAM usage during inference; when deploying on Linux, VRAM usage remains stable. On the other hand, when calling the CodeLlama-7B model on Linux, there is a significant increase in CPU usage, whereas with the StarCoder model CPU usage remains stable.
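
One way to watch both symptoms at once is to pair docker stats (container CPU) with nvidia-smi (VRAM); a sketch, where <container> is a placeholder for your container name or ID:

```
# Container CPU/RAM snapshot (replace <container> with your container name or ID)
docker stats <container> --no-stream

# GPU memory at the same moment
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
```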

wsxiaoys (Member) commented Oct 18, 2023

Hello @mydipan396, given @JohanVer's comments, could you also test the CodeLlama 7B model on Linux with versions 0.2.0 and 0.2.2 to determine whether the out-of-memory (OOM) issue persists?

@mydipan396 (Author)

After re-testing, it is indeed a version issue, independent of the operating system. The problem occurs with version 0.2.2, but updating to the latest version, 0.3.0, resolves it.

@wsxiaoys (Member)

To confirm: the OOM problem with CodeLlama-7B disappears after upgrading to v0.3.0, is that correct?

JohanVer commented Oct 19, 2023

Hi @wsxiaoys,
I have now tried all the versions on an L40 GPU.
Every version above 0.2.0 (0.2.1, 0.2.2, and also 0.3.0) leads to growing memory consumption (higher with every completion prompt).
Version 0.3.0 seems to let the memory grow more slowly at first, but eventually the problem remains.

Btw: I tested with Mistral-7B and StarCoder-1B.

@wsxiaoys (Member)

Hello, @JohanVer. Thank you for running these experiments. I have identified a potential culprit commit between versions 0.2.0 and 0.2.1 that may increase GPU RAM usage when setting a higher parallelism for GPU replicas (b3b4986).

As a temporary workaround, please continue using v0.2.0 for your setup. I will investigate the issue further and provide an update once it's resolved.
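
For reference, pinning the image tag looks like this (assuming the usual tabbyml/tabby:<version> tag scheme on Docker Hub):

```
docker run -it \
  --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:0.2.0 \
  serve --model TabbyML/CodeLlama-7B --device cuda
```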

@JohanVer

Thanks @wsxiaoys for investigating the problem, and for the awesome project :)
Let me know if I can help with testing.

wsxiaoys (Member) commented Oct 19, 2023

Here is the Docker image 0.3.1-rc.0. Please give it a try and see if the issue is resolved!
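
For anyone else testing, a sketch of pulling and running the RC tag (assuming it follows the same tagging scheme as the tagged releases):

```
docker pull tabbyml/tabby:0.3.1-rc.0

docker run -it \
  --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:0.3.1-rc.0 \
  serve --model TabbyML/CodeLlama-7B --device cuda
```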

JohanVer commented Oct 19, 2023

@wsxiaoys
That seems to have worked! 🎉
Memory with Mistral-7B now stays between 7832MiB and 8120MiB :)
Thanks a lot!

wsxiaoys (Member) commented Oct 20, 2023

Hi @mydipan396 @ClarkWain, could you also test with 0.3.1-rc.0 to see if it fixes the issue for you?

@ClarkWain

> Hi @mydipan396 @ClarkWain, could you also test with 0.3.1-rc.0 to see if it fixes the issue for you?

I have tried it; the memory usage of CodeLlama-7B stays within the 7GB~8GB range, and there is no OOM anymore. Thanks!

