Codellama 7B OOM on A30 #541

Closed · mydipan396 opened this issue Oct 12, 2023 · 17 comments
Labels: enhancement (New feature or request)

@mydipan396 (Author)

When running CodeLlama-7B on a device with 24GB of VRAM such as the A30, it can exceed the available VRAM capacity and fail with an error. In such cases, you can try using CTranslate2 instead of the ggml model.

Error info:

```
2023-10-12 20:37:08 2023-10-12T12:37:08.293510Z INFO tabby::serve: crates/tabby/src/serve/mod.rs:183: Listening at 0.0.0.0:8080
2023-10-12 20:39:00 terminate called after throwing an instance of 'std::runtime_error'
2023-10-12 20:39:00 what(): CUDA failed with error out of memory
```

mydipan396 added the enhancement (New feature or request) label on Oct 12, 2023
wsxiaoys (Member) commented Oct 12, 2023

Hi, could you share the command you use to start Tabby?

@mydipan396 (Author)

> Hi, could you share the command you use to start Tabby?

I install and run CodeLlama directly with Docker, passing the --model TabbyML/CodeLlama-7B flag:

```
docker run -it \
  --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby \
  serve --model TabbyML/CodeLlama-7B --device cuda
```

@wsxiaoys (Member)

On a GPU with compute capability >= 8.0 (which the A30 has), the inference engine will attempt to load the model in int8 mode. For CodeLlama-7B, that requires around 8GB of VRAM.

Could you please confirm that you have sufficient VRAM to run such a model? Additionally, could you share the output of nvidia-smi before and after running the above Docker command?
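
For anyone following along, a minimal sketch of capturing that output (these are standard nvidia-smi query flags; the log filename is just a placeholder):

```
# One-off snapshot before starting the container
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv

# Log VRAM usage every 5 seconds while Tabby serves requests
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
  --format=csv -l 5 | tee vram_log.csv
```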

wsxiaoys changed the title from "How to use ggml models?" to "Codellama 7B oom on A30" on Oct 13, 2023
wsxiaoys changed the title from "Codellama 7B oom on A30" to "Codellama 7B OOM on A30" on Oct 13, 2023
@JohanVer

I'm experiencing the same problem on an L40 GPU. It happens with Llama but also with Mistral.
I noticed that after startup the GPU memory is at the expected <8GB, but with every Tabby completion request the memory grows. It feels like there is a memory leak somewhere.
When I downgrade to v0.2.0, I do not see the problem.
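
A sketch of how such per-request growth can be surfaced, assuming Tabby's /v1/completions endpoint and this payload shape (both are assumptions here; check the API docs for your version):

```
# Fire repeated completion requests and print VRAM after each one.
# Endpoint path and JSON payload are assumptions; adjust for your Tabby version.
for i in $(seq 1 100); do
  curl -s -X POST http://localhost:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"language": "python", "segments": {"prefix": "def fib(n):"}}' > /dev/null
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
done
```

If the reported memory climbs monotonically with the loop index, that points at a per-request leak rather than normal warm-up allocation.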

@wsxiaoys (Member)

@JohanVer, have you tried v0.2.2? Does it also exhibit the issue you described?

There are a few significant changes between v0.2.0 and v0.2.2 that might affect VRAM usage, so I'd like to confirm.

mydipan396 (Author) commented Oct 18, 2023

When deploying the CodeLlama-7B model on Windows, there is a continuous increase in VRAM usage during inference; when deploying on Linux, VRAM usage remains stable. On the other hand, when calling the CodeLlama-7B model on Linux, there is a significant increase in CPU usage, whereas with the StarCoder model CPU usage remains stable.
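
One way to watch both symptoms at once is to pair docker stats (container CPU) with nvidia-smi (VRAM); a sketch, where <container> is a placeholder for your container name or ID:

```
# Container CPU/RAM snapshot (replace <container> with your container name or ID)
docker stats <container> --no-stream

# GPU memory at the same moment
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
```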

wsxiaoys (Member) commented Oct 18, 2023

Hello @mydipan396, given @JohanVer's comments, could you also test the CodeLlama 7B model on Linux with versions 0.2.0 and 0.2.2 to determine whether the out-of-memory (OOM) issue persists?

@mydipan396 (Author)

After re-testing, it is indeed a version issue, independent of the operating system. The problem occurs with version 0.2.2, but updating to the latest version, 0.3.0, resolves it.

@wsxiaoys (Member)

To confirm: the OOM problem with CodeLlama-7B disappears after upgrading to v0.3.0, is that correct?

JohanVer commented Oct 19, 2023

Hi @wsxiaoys,
I have now tried all the versions on an L40 GPU.
Every version above 0.2.0 (0.2.1, 0.2.2, and also 0.3.0) leads to growing memory consumption (higher with every completion prompt).
Version 0.3.0 seems to let the memory grow more slowly at first, but eventually the problem remains.

Btw: I tested with Mistral-7B and StarCoder-1B.

@wsxiaoys (Member)

Hello, @JohanVer. Thank you for running these experiments. I have identified a potential culprit commit between versions 0.2.0 and 0.2.1 that may increase GPU RAM usage when setting a higher parallelism for GPU replicas (b3b4986).

As a temporary workaround, please continue using v0.2.0 for your setup. I will investigate the issue further and provide an update once it's resolved.
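
For reference, pinning the image tag looks like this (assuming the usual tabbyml/tabby:<version> tag scheme on Docker Hub):

```
docker run -it \
  --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:0.2.0 \
  serve --model TabbyML/CodeLlama-7B --device cuda
```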

@JohanVer

Thanks @wsxiaoys for investigating the problem, and for the awesome project :)
Let me know if I can help with testing.

wsxiaoys (Member) commented Oct 19, 2023

Here is the Docker image 0.3.1-rc.0. Please give it a try and see if the issue is resolved!
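
For anyone else testing, a sketch of pulling and running the RC tag (assuming it follows the same tagging scheme as the tagged releases):

```
docker pull tabbyml/tabby:0.3.1-rc.0

docker run -it \
  --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby:0.3.1-rc.0 \
  serve --model TabbyML/CodeLlama-7B --device cuda
```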

JohanVer commented Oct 19, 2023

@wsxiaoys
That seems to have worked! 🎉
Memory with Mistral-7B now stays between 7832MiB and 8120MiB :)
Thanks a lot!

wsxiaoys (Member) commented Oct 20, 2023

Hi @mydipan396 @ClarkWain, could you also test with 0.3.1-rc.0 to see if it fixes the issue for you?

@ClarkWain

> Hi @mydipan396 @ClarkWain, could you also test with 0.3.1-rc.0 to see if it fixes the issue for you?

I have tried it; the memory usage of CodeLlama-7B stays within the 7GB~8GB range, and there is no OOM anymore. Thanks!

