1. Quick Debug Information

root@bcm10-headnode:~/nim-operator-workspace# kubectl create -f nv-llama3-8b-instruct-nim-service-nimcache.yaml
nimservice.apps.nvidia.com/nv-llama3-8b-instruct created
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME                    STATUS     AGE
nv-llama3-8b-instruct   NotReady   2s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME                    STATUS     AGE
nv-llama3-8b-instruct   NotReady   12s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get pod -n nim-service
NAME                                     READY   STATUS    RESTARTS   AGE
nv-llama3-8b-instruct-65bcb494c5-rhfll   0/1     Running   0          19s
root@bcm10-headnode:~/nim-operator-workspace# kubectl logs -f -n nim-service nv-llama3-8b-instruct-65bcb494c5-rhfll
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.
ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.
2025-01-20 06:55:58,149 [INFO] PyTorch version 2.2.2 available.
2025-01-20 06:55:59,070 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-20 06:55:59,070 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
2025-01-20 06:55:59,300 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-20 06:55:59.944 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-20 06:55:59.945 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-20 06:55:59.945 ngc_profile.py:220] Detected 1 compatible profile(s).
INFO 01-20 06:55:59.945 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 01-20 06:55:59.946 ngc_injector.py:142] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 01-20 06:55:59.946 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: tp: 1
INFO 01-20 06:55:59.947 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-20 06:55:59.949 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
INFO 01-20 06:55:59.951 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-1qw4xz20', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-1qw4xz20', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-20 06:56:00.215 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-20 06:56:00.229 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 01-20 06:56:01 selector.py:28] Using FlashAttention backend.
INFO 01-20 06:56:04 model_runner.py:173] Loading model weights took 14.9595 GB
INFO 01-20 06:56:06.85 gpu_executor.py:119] # GPU blocks: 35035, # CPU blocks: 2048
INFO 01-20 06:56:07 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-20 06:56:07 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
[nv-llama3-8b-instruct-65bcb494c5-rhfll:00031] *** Process received signal ***
2. Issue or feature description

I tried setting NIM_MAX_MODEL_LEN to reduce GPU memory usage, but it does not seem to take effect.
Also, according to https://docs.nvidia.com/nim/large-language-models/latest/configuration.html, I cannot find gpu_memory_utilization or enforce_eager, so I don't know how to set these two parameters.
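For reference, this is roughly how I am passing the variable. It is only a minimal sketch, assuming the NIMService spec forwards a standard Kubernetes env list to the NIM container; the image fields and the value are illustrative, not my exact manifest:

```yaml
# Minimal sketch of the NIMService manifest (illustrative names/values).
# Assumption: spec.env is a standard Kubernetes env list that the operator
# passes through to the NIM container.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct   # illustrative
    tag: "1.0.3"
  env:
    - name: NIM_MAX_MODEL_LEN   # documented NIM variable; does not seem to take effect here
      value: "4096"             # illustrative value, lower than the 8192 max_seq_len shown in the logs
```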
3. Something else

If I use the vLLM image directly, I can pass these parameters as container args, and that works.

Update: I upgraded my NIM image to llama-3.1-8b-instruct:1.3.3, and it works now. But I would still like to know how to pass vLLM parameters such as gpu_memory_utilization and enforce_eager to the container when deploying a NIM with the vLLM backend. The default gpu_memory_utilization of 0.9 is too high; for my test case with an 8B LLM, 0.9 reserves far more GPU memory than the model needs.
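To illustrate the comparison under "Something else": with the plain vLLM image, the engine options simply go in as CLI args (vLLM's OpenAI server accepts --gpu-memory-utilization, --enforce-eager, and --max-model-len). A minimal sketch; the image tag, model name, and values are illustrative, not my exact manifest:

```yaml
# Minimal sketch of a plain vLLM container where the engine options are CLI args.
# Illustrative only: image tag, model, and values are placeholders.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model=meta-llama/Meta-Llama-3-8B-Instruct"
      - "--gpu-memory-utilization=0.5"   # lower than the 0.9 default
      - "--enforce-eager"                # skip CUDA graph capture and its extra 1-3 GiB per GPU
      - "--max-model-len=4096"
    resources:
      limits:
        nvidia.com/gpu: 1
```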