Passing env params into NIM does not seem to work (how do I set params?) #297

Open
RandyChen1985 opened this issue Jan 20, 2025 · 2 comments

RandyChen1985 commented Jan 20, 2025

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s, Server Version v1.28.15
  • NIM Operator Version: 1.0.1
root@bcm10-headnode:~/nim-operator-workspace# helm list -A
NAME                   	NAMESPACE       	REVISION	UPDATED                                	STATUS  	CHART                        	APP VERSION
gpu-operator-1734685148	gpu-operator    	1       	2024-12-20 16:59:09.420816993 +0800 CST	deployed	gpu-operator-v24.9.1         	v24.9.1
k8s-nim-operator       	nim-operator    	1       	2025-01-10 10:01:29.551913422 +0800 CST	deployed	k8s-nim-operator-1.0.1       	1.0.1
local-path-provisioner 	cm              	1       	2024-12-20 16:02:55.986116685 +0800 CST	deployed	local-path-provisioner-0.0.30	v0.0.30
network-operator       	network-operator	1       	2024-12-20 16:05:57.780143109 +0800 CST	deployed	network-operator-24.7.0      	v24.7.0

2. Issue or feature description

I tried to set NIM_MAX_MODEL_LEN to reduce GPU memory usage, but it does not seem to take effect.

Also, according to this page https://docs.nvidia.com/nim/large-language-models/latest/configuration.html, I cannot find gpu_memory_utilization or enforce_eager, so I don't know how to set those two parameters.

In short, passing NIM_MAX_MODEL_LEN does not work.
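For reference, this is a minimal sketch of how I believe an environment variable like NIM_MAX_MODEL_LEN can be set through the NIMService spec. The image repository, tag, NIMCache name, and value here are placeholders (not my actual manifest), and the field names are as I understand them from the 1.0.1 CRD, so they may need adjustment against the operator's reference docs:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nv-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct   # placeholder image
    tag: "1.0.3"
    pullPolicy: IfNotPresent
  storage:
    nimCache:
      name: meta-llama3-8b-instruct                   # placeholder NIMCache name
  env:                                 # standard name/value pairs passed to the NIM container
    - name: NIM_MAX_MODEL_LEN
      value: "4096"                    # example value; must be a quoted string
  resources:
    limits:
      nvidia.com/gpu: 1

Even so, in the pod log below the engine still initializes with max_seq_len=8192 and enforce_eager=False.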

root@bcm10-headnode:~/nim-operator-workspace# kubectl create -f nv-llama3-8b-instruct-nim-service-nimcache.yaml
nimservice.apps.nvidia.com/nv-llama3-8b-instruct created
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME                    STATUS     AGE
nv-llama3-8b-instruct   NotReady   2s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get nimservices.apps.nvidia.com -n nim-service
NAME                    STATUS     AGE
nv-llama3-8b-instruct   NotReady   12s
root@bcm10-headnode:~/nim-operator-workspace# kubectl get pod -n nim-service
NAME                                     READY   STATUS    RESTARTS   AGE
nv-llama3-8b-instruct-65bcb494c5-rhfll   0/1     Running   0          19s
root@bcm10-headnode:~/nim-operator-workspace# kubectl logs -f -n nim-service nv-llama3-8b-instruct-65bcb494c5-rhfll

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.3
Model: meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2025-01-20 06:55:58,149 [INFO] PyTorch version 2.2.2 available.
2025-01-20 06:55:59,070 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2025-01-20 06:55:59,070 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
2025-01-20 06:55:59,300 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
INFO 01-20 06:55:59.944 api_server.py:489] NIM LLM API version 1.0.0
INFO 01-20 06:55:59.945 ngc_profile.py:218] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 01-20 06:55:59.945 ngc_profile.py:220] Detected 1 compatible profile(s).
INFO 01-20 06:55:59.945 ngc_injector.py:107] Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]
INFO 01-20 06:55:59.946 ngc_injector.py:142] Selected profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1)
INFO 01-20 06:55:59.946 ngc_injector.py:147] Profile metadata: feat_lora: false
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: llm_engine: vllm
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: precision: fp16
INFO 01-20 06:55:59.947 ngc_injector.py:147] Profile metadata: tp: 1
INFO 01-20 06:55:59.947 ngc_injector.py:167] Preparing model workspace. This step might download additional files to run the model.
INFO 01-20 06:55:59.949 ngc_injector.py:173] Model workspace is now ready. It took 0.002 seconds
INFO 01-20 06:55:59.951 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/tmp/meta--llama3-8b-instruct-1qw4xz20', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-1qw4xz20', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
WARNING 01-20 06:56:00.215 logging.py:314] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 01-20 06:56:00.229 utils.py:609] Found nccl from library /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
INFO 01-20 06:56:01 selector.py:28] Using FlashAttention backend.
INFO 01-20 06:56:04 model_runner.py:173] Loading model weights took 14.9595 GB
INFO 01-20 06:56:06.85 gpu_executor.py:119] # GPU blocks: 35035, # CPU blocks: 2048
INFO 01-20 06:56:07 model_runner.py:973] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-20 06:56:07 model_runner.py:977] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
[nv-llama3-8b-instruct-65bcb494c5-rhfll:00031] *** Process received signal ***

3. Something else

If I use the vLLM image directly, I can pass these parameters as command-line args, and that works:

containers:
      - name: qwen-72b
        image: vllm/vllm-openai:latest
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve /model-cache/modelscope/hub/Qwen/Qwen2___5-72B-Instruct --trust-remote-code --enable-chunked-prefill --max_num_batc
hed_tokens 1024 --served-model-name qwen-72b  --gpu_memory_utilization 0.7 --tensor_parallel_size 4 --enforce-eager"
        ]
        ports:
        - containerPort: 8000
        env:
        - name: PYTORCH_CUDA_ALLOC_CONF
          value: "expandable_segments:True"
        - name: VLLM_USE_MODELSCOPE
          value: "True"
        resources:
          limits:
            nvidia.com/gpu: "8"
RandyChen1985 (Author) commented:

I updated my NIM image to llama-3.1-8b-instruct:1.3.3, and now it works.

BUT I still want to know: when deploying a NIM that uses the vLLM backend, how do I set vLLM parameters on the container, such as gpu_memory_utilization and enforce_eager? The default value of 0.9 for gpu_memory_utilization is too high.


For my test case (an 8B LLM), a gpu_memory_utilization of 0.9 consumes far too much GPU memory.
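One way I can at least check whether the variable reaches the container (using the pod name from the log above):

kubectl exec -n nim-service nv-llama3-8b-instruct-65bcb494c5-rhfll -- env | grep NIM_

If NIM_MAX_MODEL_LEN shows up there but the engine still logs max_seq_len=8192 and enforce_eager=False (as in the llm_engine.py line above), then the variable reaches the pod but is not being honored by that NIM version.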

RandyChen1985 (Author) commented:

Also, according to this page https://docs.nvidia.com/nim/vision-language-models/latest/configuration.html, the VLM NIMs have a NIM_KVCACHE_PERCENT parameter, which is exactly what I need. Why is it not available for the LLM NIMs?
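If the LLM NIM supported the same knob, I assume it would be set like any other entry in the NIMService env list (hypothetical sketch; NIM_KVCACHE_PERCENT is only documented for VLM NIMs on that page):

  env:
    - name: NIM_KVCACHE_PERCENT   # documented for VLM NIMs; hypothetical here for LLM NIMs
      value: "0.6"                # per the linked VLM configuration page, controls the share of GPU memory used for KV cache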
