This repository contains the code to run NoxtuaCompliance with vLLM. A Gradio application is included for quick testing via a chat interface.
- Install Docker and Python (tested with version 3.11.2).
- Run vLLM:

  ```bash
  docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:v0.6.6.post1 \
    --model xaynetwork/NoxtuaCompliance \
    --tensor-parallel-size=8 \
    --disable-log-requests \
    --max-model-len 120000 \
    --gpu-memory-utilization 0.95
  ```

  Adjust `--tensor-parallel-size` to the number of available GPUs; it must match the GPU count made available to the Docker container via `--gpus`. A quick way to check the count is shown below.
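  To see how many GPUs are visible before picking that value, a minimal sketch using PyTorch (assuming `torch` with CUDA support is installed):

  ```python
  import torch

  # Number of CUDA devices visible to this process; use this value
  # for --tensor-parallel-size (and for the docker --gpus setting).
  print(torch.cuda.device_count())
  ```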
- Validate the hosted model:

  ```bash
  curl http://0.0.0.0:8000/v1/models
  ```
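  Since vLLM exposes an OpenAI-compatible API, you can also send a test request directly, e.g. with the `openai` Python client (a minimal sketch; the prompt is illustrative, and the placeholder API key is a vLLM convention):

  ```python
  from openai import OpenAI

  # vLLM's OpenAI-compatible server; no real API key is required.
  client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

  response = client.chat.completions.create(
      model="xaynetwork/NoxtuaCompliance",
      messages=[{"role": "user", "content": "Hello!"}],  # illustrative prompt
  )
  print(response.choices[0].message.content)
  ```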
- Install the Python requirements and start the Gradio application:

  ```bash
  pip install -r requirements.txt
  python app.py
  ```
This starts the Gradio chat application on localhost at the configured port 8020. Open the displayed link in the browser, e.g. "http://0.0.0.0:8020" or "http://localhost:8020".
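For reference, a minimal Gradio chat wired to the vLLM endpoint could look like the sketch below. This is not the repository's `app.py`; the endpoint, model name, and port are taken from the commands above, and everything else is an assumption:

```python
import gradio as gr
from openai import OpenAI

# Endpoint and model as started by the docker command above (assumption).
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

def respond(message, history):
    # Rebuild the conversation from Gradio's (user, assistant) history pairs.
    messages = []
    for user_msg, assistant_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": message})
    reply = client.chat.completions.create(
        model="xaynetwork/NoxtuaCompliance", messages=messages
    )
    return reply.choices[0].message.content

gr.ChatInterface(respond).launch(server_port=8020)
```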