Upstream merge 25 01 13 #358

Status: Merged (116 commits, Jan 14, 2025)

Commits
32c9eff
[Bugfix][V1] Fix molmo text-only inputs (#11676)
jeejeelee Jan 6, 2025
e20c92b
[Kernel] Move attn_type to Attention.__init__() (#11690)
heheda12345 Jan 6, 2025
91b361a
[V1] Extend beyond image modality and support mixed-modality inferenc…
ywang96 Jan 6, 2025
08fb75c
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (#11772)
DarkLight1337 Jan 7, 2025
d0169e1
[Model] Future-proof Qwen2-Audio multi-modal processor (#11776)
DarkLight1337 Jan 7, 2025
d93d2d7
[XPU] Make pp group initilized for pipeline-parallelism (#11648)
ys950902 Jan 7, 2025
8ceffbf
[Doc][3/N] Reorganize Serving section (#11766)
DarkLight1337 Jan 7, 2025
b278557
[Kernel][LoRA]Punica prefill kernels fusion (#11234)
jeejeelee Jan 7, 2025
0f3f3c8
[Bugfix] Update attention interface in `Whisper` (#11784)
ywang96 Jan 7, 2025
898cdf0
[CI] Fix neuron CI and run offline tests (#11779)
liangfu Jan 7, 2025
e512f76
fix init error for MessageQueue when n_local_reader is zero (#11768)
XiaobingSuper Jan 7, 2025
ce1917f
[Doc] Create a vulnerability management team (#9925)
russellb Jan 7, 2025
1e4ce29
[CI][CPU] adding build number to docker image name (#11788)
zhouyuan Jan 7, 2025
8082ad7
[V1][Doc] Update V1 support for `LLaVa-NeXT-Video` (#11798)
ywang96 Jan 7, 2025
8f37be3
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calcula…
DarkLight1337 Jan 7, 2025
869e829
[doc] add doc to explain how to use uv (#11773)
youkaichao Jan 7, 2025
2de197b
[V1] Support audio language models on V1 (#11733)
ywang96 Jan 7, 2025
d9fa1c0
[doc] update how pip can install nightly wheels (#11806)
youkaichao Jan 7, 2025
c0efe92
[Doc] Add note to `gte-Qwen2` models (#11808)
DarkLight1337 Jan 7, 2025
869579a
[optimization] remove python function call for custom op (#11750)
youkaichao Jan 7, 2025
c994223
[Bugfix] update the prefix for qwen2 (#11795)
jiangjiadi Jan 7, 2025
973f5dc
[Doc]Add documentation for using EAGLE in vLLM (#11417)
sroy745 Jan 7, 2025
a4e2b26
[Bugfix] Significant performance drop on CPUs with --num-scheduler-st…
DamonFool Jan 8, 2025
5950f55
[Doc] Group examples into categories (#11782)
hmellor Jan 8, 2025
91445c7
[Bugfix] Fix image input for Pixtral-HF (#11741)
DarkLight1337 Jan 8, 2025
4d29e91
[Misc] sort torch profiler table by kernel timing (#11813)
divakar-amd Jan 8, 2025
dc71af0
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange…
WangErXiao Jan 8, 2025
b640b19
Fixed docker build for ppc64le (#11518)
npanpaliya Jan 8, 2025
f4923cb
[OpenVINO] Fixed Docker.openvino build (#11732)
ilya-lavrenov Jan 8, 2025
f645eb6
[Bugfix] Add checks for LoRA and CPU offload (#11810)
jeejeelee Jan 8, 2025
259abd8
[Docs] reorganize sponsorship page (#11639)
simon-mo Jan 8, 2025
ef68eb2
[Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used…
DarkLight1337 Jan 8, 2025
889e662
[misc] improve memory profiling (#11809)
youkaichao Jan 8, 2025
ad9f1aa
[doc] update wheels url (#11830)
youkaichao Jan 8, 2025
a1b2b86
[Docs] Update sponsor name: 'Novita' to 'Novita AI' (#11833)
simon-mo Jan 8, 2025
cfd3219
[Hardware][Apple] Native support for macOS Apple Silicon (#11696)
wallashss Jan 8, 2025
f121411
[torch.compile] consider relevant code in compilation cache (#11614)
youkaichao Jan 8, 2025
2a0596b
[VLM] Reorganize profiling/processing-related code (#11812)
DarkLight1337 Jan 8, 2025
aba8d6e
[Doc] Move examples into categories (#11840)
hmellor Jan 8, 2025
6cd40a5
[Doc][4/N] Reorganize API Reference (#11843)
DarkLight1337 Jan 8, 2025
2f70249
[CI/Build][Bugfix] Fix CPU CI image clean up (#11836)
bigPYJ1151 Jan 8, 2025
78f4590
[Bugfix][XPU] fix silu_and_mul (#11823)
yma11 Jan 8, 2025
ca47e17
[Misc] Move some model utils into vision file (#11848)
DarkLight1337 Jan 8, 2025
5984499
[Doc] Expand Multimodal API Reference (#11852)
DarkLight1337 Jan 8, 2025
47de882
[Misc]add some explanations for BlockHashType (#11847)
WangErXiao Jan 8, 2025
56fe4c2
[TPU][Quantization] TPU `W8A8` (#11785)
robertgshaw2-redhat Jan 8, 2025
526de82
[Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup f…
rasmith Jan 8, 2025
3db0caf
[Docs] Add Google Cloud Meetup (#11864)
simon-mo Jan 8, 2025
615e4a5
[CI] Turn on basic correctness tests for V1 (#10864)
tlrmchlsmth Jan 9, 2025
1fe554b
treat do_lower_case in the same way as the sentence-transformers libr…
maxdebayser Jan 9, 2025
730e959
[Doc] Recommend uv and python 3.12 for quickstart guide (#11849)
mgoin Jan 9, 2025
d848800
[Misc] Move `print_*_once` from utils to logger (#11298)
DarkLight1337 Jan 9, 2025
a732900
[Doc] Intended links Python multiprocessing library (#11878)
guspan-tanadi Jan 9, 2025
310aca8
[perf]fix current stream (#11870)
youkaichao Jan 9, 2025
0bd1ff4
[Bugfix] Override dunder methods of placeholder modules (#11882)
DarkLight1337 Jan 9, 2025
1d967ac
[Bugfix] fix beam search input errors and latency benchmark script (#…
yeqcharlotte Jan 9, 2025
65097ca
[Doc] Add model development API Reference (#11884)
DarkLight1337 Jan 9, 2025
405eb8e
[platform] Allow platform specify attention backend (#11609)
wangxiyuan Jan 9, 2025
bd82872
[ci]try to fix flaky multi-step tests (#11894)
youkaichao Jan 9, 2025
9a22834
[Misc] Provide correct Pixtral-HF chat template (#11891)
DarkLight1337 Jan 9, 2025
36f5303
[Docs] Add Modal to deployment frameworks (#11907)
charlesfrye Jan 9, 2025
c3cf54d
[Doc][5/N] Move Community and API Reference to the bottom (#11896)
DarkLight1337 Jan 10, 2025
b844b99
[VLM] Enable tokenized inputs for merged multi-modal processor (#11900)
DarkLight1337 Jan 10, 2025
3de2b1e
[Doc] Show default pooling method in a table (#11904)
DarkLight1337 Jan 10, 2025
cf5f000
[torch.compile] Hide KV cache behind torch.compile boundary (#11677)
heheda12345 Jan 10, 2025
ac2f3f7
[Bugfix] Validate lora adapters to avoid crashing server (#11727)
joerunde Jan 10, 2025
61af633
[BUGFIX] Fix `UnspecifiedPlatform` package name (#11916)
jikunshang Jan 10, 2025
d53575a
[ci] fix gh200 tests (#11919)
youkaichao Jan 10, 2025
d907be7
[misc] remove python function call for custom activation op (#11885)
cennn Jan 10, 2025
ef725fe
[platform] support pytorch custom op pluggable (#11328)
wangxiyuan Jan 10, 2025
d85c47d
Replace "online inference" with "online serving" (#11923)
hmellor Jan 10, 2025
241ad7b
[ci] Fix sampler tests (#11922)
youkaichao Jan 10, 2025
12664dd
[Doc] [1/N] Initial guide for merged multi-modal processor (#11925)
DarkLight1337 Jan 10, 2025
20410b2
[platform] support custom torch.compile backend key (#11318)
wangxiyuan Jan 10, 2025
482cdc4
[Doc] Rename offline inference examples (#11927)
hmellor Jan 10, 2025
f33e033
[Docs] Fix docstring in `get_ip` function (#11932)
KuntaiDu Jan 10, 2025
5959564
Doc fix in `benchmark_long_document_qa_throughput.py` (#11933)
KuntaiDu Jan 10, 2025
aa1e77a
[Hardware][CPU] Support MOE models on x86 CPU (#11831)
bigPYJ1151 Jan 10, 2025
46fa98c
[Misc] Clean up debug code in Deepseek-V3 (#11930)
Isotr0py Jan 10, 2025
8a57940
[Misc] Update benchmark_prefix_caching.py fixed example usage (#11920)
remimin Jan 10, 2025
d45cbe7
[Bugfix] Check that number of images matches number of <|image|> toke…
tjohnson31415 Jan 10, 2025
c9f09a4
[mypy] Fix mypy warnings in api_server.py (#11941)
frreiss Jan 11, 2025
899136b
[ci] fix broken distributed-tests-4-gpus (#11937)
youkaichao Jan 11, 2025
2118d05
[Bugfix][SpecDecode] Adjust Eagle model architecture to align with in…
llsj14 Jan 11, 2025
c32a7c7
[Bugfix] fused_experts_impl wrong compute type for float32 (#11921)
shaochangxu Jan 11, 2025
7a3a83e
[CI/Build] Move model-specific multi-modal processing tests (#11934)
DarkLight1337 Jan 11, 2025
a991f7d
[Doc] Basic guide for writing unit tests for new models (#11951)
DarkLight1337 Jan 11, 2025
d697dc0
[Bugfix] Fix RobertaModel loading (#11940)
NickLucche Jan 11, 2025
4b657d3
[Model] Add cogagent model support vLLM (#11742)
sixsixcoder Jan 11, 2025
b25cfab
[V1] Avoid sending text prompt to core engine (#11963)
ywang96 Jan 12, 2025
43f3d9e
[CI/Build] Add markdown linter (#11857)
rafvasq Jan 12, 2025
f967e51
[Model] Initialize support for Deepseek-VL2 models (#11578)
Isotr0py Jan 12, 2025
8bddb73
[Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
Akshat-Tripathi Jan 12, 2025
263a870
[Hardware][TPU] workaround fix for MoE on TPU (#11764)
avshalomman Jan 12, 2025
9597a09
[V1][Core][1/n] Logging and Metrics (#11962)
robertgshaw2-redhat Jan 12, 2025
d14e98d
[Model] Support GGUF models newly added in `transformers` 4.46.0 (#9685)
Isotr0py Jan 13, 2025
619ae26
[V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction (#11973)
robertgshaw2-redhat Jan 13, 2025
f7b3ba8
[MISC] fix typo in kv transfer send recv test (#11983)
yyccli Jan 13, 2025
9dd02d8
[Bug] Fix usage of `.transpose()` and `.view()` consecutively. (#11979)
liaoyanqing666 Jan 13, 2025
80ea3af
[CI][Spec Decode] fix: broken test for EAGLE model (#11972)
llsj14 Jan 13, 2025
cf6bbcb
[Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
Concurrensee Jan 13, 2025
c3f05b0
[Misc]Minor Changes about Worker (#11555)
noemotiovon Jan 13, 2025
89ce62a
[platform] add ray_device_key (#11948)
youkaichao Jan 13, 2025
5340a30
Fix Max Token ID for Qwen-VL-Chat (#11980)
alex-jw-brooks Jan 13, 2025
0f8cafe
[Kernel] unified_attention for Attention.forward (#11967)
heheda12345 Jan 13, 2025
cd82499
[Doc][V1] Update model implementation guide for V1 support (#11998)
ywang96 Jan 13, 2025
e8c23ff
[Doc] Organise installation documentation into categories and tabs (#…
hmellor Jan 13, 2025
458e63a
[platform] add device_control env var (#12009)
youkaichao Jan 13, 2025
a7d5968
[Platform] Move get_punica_wrapper() function to Platform (#11516)
shen-shanshan Jan 13, 2025
c6db213
bugfix: Fix signature mismatch in benchmark's `get_tokenizer` functio…
e1ijah1 Jan 13, 2025
ce53f46
Merge remote-tracking branch 'upstream/main'
gshtras Jan 13, 2025
5a51290
Using list
gshtras Jan 13, 2025
079750e
Revert "[misc] improve memory profiling (#11809)"
gshtras Jan 13, 2025
043c93d
Trying to make scales work with compileable attention
gshtras Jan 13, 2025
16f8680
Docs lint
gshtras Jan 14, 2025
eb4abfd
Merge remote-tracking branch 'origin/main' into upstream_merge_25_01_13
gshtras Jan 14, 2025
37 changes: 20 additions & 17 deletions .buildkite/run-cpu-test.sh
@@ -9,63 +9,60 @@ CORE_RANGE=${CORE_RANGE:-48-95}
 NUMA_NODE=${NUMA_NODE:-1}
 
 # Try building the docker image
-numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.cpu .
-numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
+numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
+numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .
 
 # Setup cleanup
-remove_docker_container() { docker rm -f cpu-test-"$NUMA_NODE" cpu-test-avx2-"$NUMA_NODE" || true; }
+remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
 trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image, setting --shm-size=4g for tensor parallel.
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
-  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test
+  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
 docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
-  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2-"$NUMA_NODE" cpu-test-avx2
+  --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2
 
 function cpu_tests() {
   set -e
   export NUMA_NODE=$2
 
   # offline inference
-  docker exec cpu-test-avx2-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
     set -e
-    python3 examples/offline_inference.py"
+    python3 examples/offline_inference/basic.py"
 
   # Run basic model test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
-    pip install pytest pytest-asyncio \
-      decord einops librosa peft Pillow sentence-transformers soundfile \
-      transformers_stream_generator matplotlib datamodel_code_generator
-    pip install torchvision --index-url https://download.pytorch.org/whl/cpu
+    pip install -r vllm/requirements-test.txt
     pytest -v -s tests/models/decoder_only/language -m cpu_model
     pytest -v -s tests/models/embedding/language -m cpu_model
     pytest -v -s tests/models/encoder_decoder/language -m cpu_model
     pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
     pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"
 
   # Run compressed-tensor test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
     pytest -s -v \
       tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
       tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"
 
   # Run AWQ test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
     pytest -s -v \
       tests/quantization/test_ipex_quant.py"
 
   # Run chunked-prefill and prefix-cache test
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
     pytest -s -v -k cpu_model \
      tests/basic_correctness/test_chunked_prefill.py"
 
-  # online inference
-  docker exec cpu-test-"$NUMA_NODE" bash -c "
+  # online serving
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
     set -e
     export VLLM_CPU_KVCACHE_SPACE=10
     export VLLM_CPU_OMP_THREADS_BIND=$1
@@ -78,6 +75,12 @@ function cpu_tests() {
       --num-prompts 20 \
       --endpoint /v1/completions \
       --tokenizer facebook/opt-125m"
+
+  # Run multi-lora tests
+  docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
+    set -e
+    pytest -s -v \
+      tests/lora/test_qwen2vl.py"
 }
 
 # All of CPU tests are expected to be finished less than 25 mins.
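Read in isolation, the renaming above follows one pattern: every image tag and container name is scoped by the Buildkite build number, so concurrent CI runs on the same host cannot clobber each other's containers. A minimal sketch of that pattern, with a "local" fallback and a "demo-test" image name that are illustrative assumptions, not part of this PR:

    #!/bin/bash
    # Scope every Docker artifact to this CI build; fall back to "local"
    # when running outside Buildkite (assumed convention).
    BUILD_ID="${BUILDKITE_BUILD_NUMBER:-local}"
    IMAGE="demo-test-${BUILD_ID}"

    # Remove this build's container on exit, success or failure alike.
    remove_container() { docker rm -f "${IMAGE}" 2>/dev/null || true; }
    trap remove_container EXIT
    remove_container   # also clear leftovers from a previous aborted run

    docker build -t "${IMAGE}" -f Dockerfile.cpu .
    docker run -itd --name "${IMAGE}" "${IMAGE}"

Because the trap fires on any exit path, a failed test run still cleans up its own container while leaving other builds' containers untouched.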
2 changes: 1 addition & 1 deletion .buildkite/run-gh200-test.sh
@@ -24,5 +24,5 @@ remove_docker_container
 
 # Run the image and test offline inference
 docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
-python3 examples/offline_inference.py
+python3 examples/offline_inference/basic.py
 '
2 changes: 1 addition & 1 deletion .buildkite/run-hpu-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and launch offline inference
-docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
+docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
53 changes: 27 additions & 26 deletions .buildkite/run-neuron-test.sh
@@ -3,6 +3,18 @@
 # This script build the Neuron docker image and run the API server inside the container.
 # It serves a sanity check for compilation and basic model usage.
 set -e
+set -v
+
+image_name="neuron/vllm-ci"
+container_name="neuron_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
+
+HF_CACHE="$(realpath ~)/huggingface"
+mkdir -p "${HF_CACHE}"
+HF_MOUNT="/root/.cache/huggingface"
+
+NEURON_COMPILE_CACHE_URL="$(realpath ~)/neuron_compile_cache"
+mkdir -p "${NEURON_COMPILE_CACHE_URL}"
+NEURON_COMPILE_CACHE_MOUNT="/root/.cache/neuron_compile_cache"
 
 # Try building the docker image
 aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
@@ -13,41 +25,30 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
     last_build=$(cat /tmp/neuron-docker-build-timestamp)
     current_time=$(date +%s)
     if [ $((current_time - last_build)) -gt 86400 ]; then
-        docker image prune -f
+        docker system prune -f
+        rm -rf "${HF_MOUNT:?}/*"
+        rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
         echo "$current_time" > /tmp/neuron-docker-build-timestamp
     fi
 else
     date "+%s" > /tmp/neuron-docker-build-timestamp
 fi
 
-docker build -t neuron -f Dockerfile.neuron .
+docker build -t "${image_name}" -f Dockerfile.neuron .
 
 # Setup cleanup
-remove_docker_container() { docker rm -f neuron || true; }
+remove_docker_container() {
+    docker image rm -f "${image_name}" || true;
+}
 trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image
-docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
-    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &
-
-# Wait for the server to start
-wait_for_server_to_start() {
-    timeout=300
-    counter=0
-
-    while [ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/health)" != "200" ]; do
-        sleep 1
-        counter=$((counter + 1))
-        if [ $counter -ge $timeout ]; then
-            echo "Timeout after $timeout seconds"
-            break
-        fi
-    done
-}
-wait_for_server_to_start
-
-# Test a simple prompt
-curl -X POST -H "Content-Type: application/json" \
-    localhost:8000/generate \
-    -d '{"prompt": "San Francisco is a"}'
+docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
+    -v "${HF_CACHE}:${HF_MOUNT}" \
+    -e "HF_HOME=${HF_MOUNT}" \
+    -v "${NEURON_COMPILE_CACHE_URL}:${NEURON_COMPILE_CACHE_MOUNT}" \
+    -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
+    --name "${container_name}" \
+    ${image_name} \
+    /bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py"
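The new script persists two caches on the host so that repeated CI runs skip model downloads and Neuron recompilation. A stripped-down sketch of just that mount pattern, where the image name and script are placeholders rather than anything from this PR:

    #!/bin/bash
    # Persist Hugging Face downloads and Neuron compile artifacts across runs.
    HF_CACHE="$(realpath ~)/huggingface"
    NEURON_CACHE="$(realpath ~)/neuron_compile_cache"
    mkdir -p "${HF_CACHE}" "${NEURON_CACHE}"

    docker run --rm \
        -v "${HF_CACHE}:/root/.cache/huggingface" \
        -e HF_HOME=/root/.cache/huggingface \
        -v "${NEURON_CACHE}:/root/.cache/neuron_compile_cache" \
        -e NEURON_COMPILE_CACHE_URL=/root/.cache/neuron_compile_cache \
        my-neuron-image \
        python3 my_script.py

The first run populates both host directories; later runs mount them back in, which is what makes the daily `rm -rf` of the mount points in the prune branch above necessary to bound disk usage.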
2 changes: 1 addition & 1 deletion .buildkite/run-openvino-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
 remove_docker_container
 
 # Run the image and launch offline inference
-docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference.py
+docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic.py
11 changes: 10 additions & 1 deletion .buildkite/run-tpu-test.sh
@@ -14,4 +14,13 @@ remove_docker_container
 # For HF_TOKEN.
 source /etc/environment
 # Run a simple end-to-end example.
-docker run --privileged --net host --shm-size=16G -it -e "HF_TOKEN=$HF_TOKEN" --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"
+docker run --privileged --net host --shm-size=16G -it \
+    -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
+    vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
+    && python3 -m pip install pytest \
+    && python3 -m pip install lm_eval[api]==0.4.4 \
+    && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py \
+    && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
+    && python3 /workspace/vllm/tests/tpu/test_compilation.py \
+    && python3 /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
+    && python3 /workspace/vllm/examples/offline_inference/tpu.py"
4 changes: 2 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -14,6 +14,6 @@ remove_docker_container
 
 # Run the image and test offline inference/tensor parallel
 docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
-python3 examples/offline_inference.py
-python3 examples/offline_inference_cli.py -tp 2
+python3 examples/offline_inference/basic.py
+python3 examples/offline_inference/cli.py -tp 2
 '
38 changes: 22 additions & 16 deletions .buildkite/test-pipeline.yaml
@@ -38,7 +38,7 @@ steps:
     - pip install -r requirements-docs.txt
     - SPHINXOPTS=\"-W\" make html
     # Check API reference (if it fails, you may have missing mock imports)
-    - grep \"sig sig-object py\" build/html/dev/sampling_params.html
+    - grep \"sig sig-object py\" build/html/api/inference_params.html
 
 - label: Async Engine, Inputs, Utils, Worker Test # 24min
   fast_check: true
@@ -52,6 +52,7 @@ steps:
   - tests/worker
   - tests/standalone_tests/lazy_torch_compile.py
   commands:
+  - pip install git+https://github.com/Isotr0py/DeepSeek-VL2.git # Used by multimoda processing test
   - python3 standalone_tests/lazy_torch_compile.py
   - pytest -v -s mq_llm_engine # MQLLMEngine
   - pytest -v -s async_engine # AsyncLLMEngine
@@ -187,19 +188,19 @@ steps:
   - examples/
   commands:
   - pip install tensorizer # for tensorizer test
-  - python3 offline_inference.py
-  - python3 cpu_offload.py
-  - python3 offline_inference_chat.py
-  - python3 offline_inference_with_prefix.py
-  - python3 llm_engine_example.py
-  - python3 offline_inference_vision_language.py
-  - python3 offline_inference_vision_language_multi_image.py
-  - python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
-  - python3 offline_inference_encoder_decoder.py
-  - python3 offline_inference_classification.py
-  - python3 offline_inference_embedding.py
-  - python3 offline_inference_scoring.py
-  - python3 offline_profile.py --model facebook/opt-125m run_num_steps --num-steps 2
+  - python3 offline_inference/basic.py
+  - python3 offline_inference/cpu_offload.py
+  - python3 offline_inference/chat.py
+  - python3 offline_inference/prefix_caching.py
+  - python3 offline_inference/llm_engine_example.py
+  - python3 offline_inference/vision_language.py
+  - python3 offline_inference/vision_language_multi_image.py
+  - python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
+  - python3 offline_inference/encoder_decoder.py
+  - python3 offline_inference/classification.py
+  - python3 offline_inference/embedding.py
+  - python3 offline_inference/scoring.py
+  - python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2
 
 - label: Prefix Caching Test # 9min
   mirror_hardwares: [amd]
@@ -214,6 +215,7 @@ steps:
   - vllm/model_executor/layers
   - vllm/sampling_metadata.py
   - tests/samplers
+  - tests/conftest.py
   commands:
   - pytest -v -s samplers
   - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
@@ -229,20 +231,22 @@ steps:
   - pytest -v -s test_logits_processor.py
   - pytest -v -s model_executor/test_guided_processors.py
 
-- label: Speculative decoding tests # 30min
+- label: Speculative decoding tests # 40min
   source_file_dependencies:
   - vllm/spec_decode
   - tests/spec_decode
+  - vllm/model_executor/models/eagle.py
   commands:
   - pytest -v -s spec_decode/e2e/test_multistep_correctness.py
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py
+  - pytest -v -s spec_decode/e2e/test_eagle_correctness.py
 
 - label: LoRA Test %N # 15min each
   mirror_hardwares: [amd]
   source_file_dependencies:
   - vllm/lora
   - tests/lora
-  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py
+  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py
   parallelism: 4
 
 - label: "PyTorch Fullgraph Smoke Test" # 9min
@@ -367,6 +371,7 @@ steps:
   - tests/models/encoder_decoder/vision_language
   commands:
   - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
+  - pytest -v -s models/multimodal
   - pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
   - pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
   - pytest -v -s models/embedding/vision_language -m core_model
@@ -535,6 +540,7 @@ steps:
     # requires multi-GPU testing for validation.
     - pytest -v -s -x lora/test_chatglm3_tp.py
     - pytest -v -s -x lora/test_llama_tp.py
+    - pytest -v -s -x lora/test_minicpmv_tp.py
 
 
 - label: Weight Loading Multiple GPU Test # 33min
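For reference, the sharded "LoRA Test %N" step above can be reproduced outside Buildkite by pinning the shard variables by hand. In CI they arrive via $$BUILDKITE_PARALLEL_JOB and $$BUILDKITE_PARALLEL_JOB_COUNT; the values below are assumptions for a local run, not something this PR defines:

    # One shard (0 of 4) of the LoRA test matrix, run from the tests/ directory:
    pytest -v -s lora --shard-id=0 --num-shards=4 \
        --ignore=lora/test_long_context.py \
        --ignore=lora/test_chatglm3_tp.py \
        --ignore=lora/test_llama_tp.py \
        --ignore=lora/test_minicpmv_tp.py

The ignored files are exactly the ones run separately in the multi-GPU step further down, so each test executes in only one place.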
@@ -13,7 +13,7 @@ on:
       - "docs/**"
 
 jobs:
-  sphinx-lint:
+  doc-lint:
     runs-on: ubuntu-latest
     strategy:
       matrix:
@@ -29,4 +29,4 @@ jobs:
           python -m pip install --upgrade pip
           pip install -r requirements-lint.txt
       - name: Linting docs
-        run: tools/sphinx-lint.sh
+        run: tools/doc-lint.sh
5 changes: 1 addition & 4 deletions .gitignore
@@ -79,10 +79,7 @@ instance/
 
 # Sphinx documentation
 docs/_build/
-docs/source/getting_started/examples/*.rst
-!**/*.template.rst
-docs/source/getting_started/examples/*.md
-!**/*.template.md
+docs/source/getting_started/examples/
 
 # PyBuilder
 .pybuilder/
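The four per-extension patterns collapse into one directory rule, so the whole generated-examples tree is now ignored, template files included. One way to sanity-check the new rule against a hypothetical path (the matched line number in the output will vary):

    # Ask git which .gitignore rule matches a generated example file:
    git check-ignore -v docs/source/getting_started/examples/anything.md
    # expected output, roughly:
    #   .gitignore:82:docs/source/getting_started/examples/  docs/source/getting_started/examples/anything.md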
6 changes: 3 additions & 3 deletions Dockerfile
@@ -2,8 +2,8 @@
 # to run the OpenAI compatible server.
 
 # Please update any changes made here to
-# docs/source/dev/dockerfile/dockerfile.md and
-# docs/source/assets/dev/dockerfile-stages-dependency.png
+# docs/source/contributing/dockerfile/dockerfile.md and
+# docs/source/assets/contributing/dockerfile-stages-dependency.png
 
 ARG CUDA_VERSION=12.4.1
 #################### BASE BUILD IMAGE ####################
@@ -250,7 +250,7 @@ ENV VLLM_USAGE_SOURCE production-docker-image
 # define sagemaker first, so it is not default from `docker build`
 FROM vllm-openai-base AS vllm-sagemaker
 
-COPY examples/sagemaker-entrypoint.sh .
+COPY examples/online_serving/sagemaker-entrypoint.sh .
 RUN chmod +x sagemaker-entrypoint.sh
 ENTRYPOINT ["./sagemaker-entrypoint.sh"]
8 changes: 6 additions & 2 deletions Dockerfile.neuron
@@ -15,8 +15,8 @@ RUN apt-get update && \
     ffmpeg libsm6 libxext6 libgl1
 
 ### Mount Point ###
-# When launching the container, mount the code directory to /app
-ARG APP_MOUNT=/app
+# When launching the container, mount the code directory to /workspace
+ARG APP_MOUNT=/workspace
 VOLUME [ ${APP_MOUNT} ]
 WORKDIR ${APP_MOUNT}/vllm
 
@@ -25,6 +25,7 @@ RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
 RUN python3 -m pip install sentencepiece transformers==4.45.2 -U
 RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
 RUN python3 -m pip install neuronx-cc==2.16.345.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
+RUN python3 -m pip install pytest
 
 COPY . .
 ARG GIT_REPO_CHECK=0
@@ -42,4 +43,7 @@ RUN --mount=type=bind,source=.git,target=.git \
 # install development dependencies (for testing)
 RUN python3 -m pip install -e tests/vllm_test_utils
 
+# overwrite entrypoint to run bash script
+RUN echo "import subprocess; import sys; subprocess.check_call(sys.argv[1:])" > /usr/local/bin/dockerd-entrypoint.py
+
 CMD ["/bin/bash"]
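The overwritten /usr/local/bin/dockerd-entrypoint.py deserves a note. Assuming the Neuron base image wires that script in as its entrypoint, the replacement simply forwards the container's arguments to a subprocess, so whatever command is passed at `docker run` time executes directly. A hypothetical invocation, using the image tag from the CI script above:

    # The rewritten entrypoint runs its arguments as a subprocess, so this:
    docker run --rm neuron/vllm-ci python3 -c 'print("ok")'
    # behaves as if the container executed:
    #   subprocess.check_call(["python3", "-c", 'print("ok")'])

This is what lets run-neuron-test.sh pass `/bin/bash -c "python3 ..."` straight through without fighting the base image's default daemon startup.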
1 change: 1 addition & 0 deletions Dockerfile.openvino
@@ -14,6 +14,7 @@ ARG GIT_REPO_CHECK=0
 RUN --mount=type=bind,source=.git,target=.git \
     if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
 
+RUN python3 -m pip install -U pip
 # install build requirements
 RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/requirements-build.txt
 # build vLLM with OpenVINO backend