## Highlights
The SGLang team is excited to announce the release of v0.4.3, and we will keep improving DeepSeek V3/R1 performance. Over the last six weeks, SGLang has been the fastest open-source LLM inference engine for DeepSeek V3/R1, and we stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!
## Performance Improvements
### DeepSeek V3/R1 Optimizations
- Pioneering integration of FlashInfer MLA Attention delivers 4x performance improvement for long-context scenarios (Special thanks to the FlashInfer team @yzh119 ) #3550
- Added torch.compile support for FP8, achieving 50 tokens/s for online inference #3232
- Implemented CUTLASS block-wise FP8 for enhanced efficiency
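These optimizations are enabled through server launch flags. A minimal sketch (flag names follow the v0.4.3 CLI; verify against `python -m sglang.launch_server --help` for your version, and adjust the model path and tensor-parallel size for your hardware):

```shell
# Launch DeepSeek V3 with FlashInfer MLA attention and torch.compile.
# 8-way tensor parallelism is illustrative; size it to your GPUs.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code \
  --enable-flashinfer-mla \
  --enable-torch-compile
```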
### Architecture Enhancements
- Upgraded to FlashInfer v0.2
- Enabled Flash Attention 3 by default for prefill
- Extended EAGLE 2 support:
  - Enhanced integration with FlashInfer backend
  - Added support in Triton backend
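EAGLE 2 speculative decoding is configured at launch time. A hypothetical sketch (the flag names follow the v0.4.x CLI, and both the target and draft model paths are illustrative; check `--help` for your version):

```shell
# EAGLE 2 speculative decoding with the Triton attention backend.
# Draft model path and tuning values (steps, topk, draft tokens) are examples only.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-chat-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path yuhuili/EAGLE-llama2-chat-7B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64 \
  --attention-backend triton
```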
## New Features
- Introduced Function Calling capabilities
- Added regex pattern support in XGrammar backend
- Implemented custom sampling processor for flexible inference control
- Integrated LoRA support in Triton backend
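Regex constraints in the XGrammar backend are driven through `sampling_params` on a request. A minimal sketch of a native `/generate` request body (the prompt, regex, and local server URL are illustrative):

```python
import json

# Illustrative /generate request body: constrain the output to an
# IPv4-shaped string via the "regex" sampling parameter, which is
# enforced by the grammar backend during decoding.
payload = {
    "text": "The IP address of localhost is ",
    "sampling_params": {
        "max_new_tokens": 32,
        "temperature": 0.0,
        "regex": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
    },
}

body = json.dumps(payload)
# Send with e.g. requests.post("http://localhost:30000/generate", json=payload)
print(body)
```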
## What's Changed
- docs: add deepseek v3 launch instructions by @zhyncs in #2589
- fix: only enable moe_align_block_size for now by @zhyncs in #2590
- docs: update deepseek v3 example by @zhyncs in #2592
- h100 tuning fused_moe_triton for qwen2 moe by @BBuf in #2560
- Fix cache hit rate when chunked prefill by @hnyls2002 in #2555
- Update README.md by @merrymercy in #2594
- Error occurs when loading the gemma model in bitsandbytes format. by @upskyy in #2557
- [Feature] Support new parameter - EBNF in xgrammar by @adarshxs in #2526
- update readme of DeepSeek V3 by @fsygd in #2596
- Fix logprob_start_len for multi modal models by @merrymercy in #2597
- Fix duplicated handling of GetWeightsByNameReqInput by @fzyzcjy in #2565
- [unittest] add unit test to test quant args of srt engine by @JamesSand in #2574
- Fix test and benchmark scripts by @merrymercy in #2598
- fix: package data missing by @yudian0504 in #2521
- [UTILS] improve makefile a bit by adding help info by @kzhou003 in #2570
- Super tiny typo fix by @fzyzcjy in #2564
- Update contributor_guide.md by @merrymercy in #2603
- Update README.md by @merrymercy in #2605
- Tiny code cleanup in tokenizer_manager.py by @fzyzcjy in #2586
- Regression fix to AMD/ROCm from recent change by @HaiShaw in #2606
- Update CODEOWNERS by @merrymercy in #2608
- Fused moe triton cfg opt for rocm by @kkHuang-amd in #2612
- Fix triton kernel performance regression by @kkHuang-amd in #2611
- Change extend attention kernel launch parameter for ROCm platform to … by @kkHuang-amd in #2610
- fix moe_align_block_size by @HandH1998 in #2615
- update sgl_moe_align_block_size usage by @HandH1998 in #2617
- chore: bump v0.4.1.post1 by @zhyncs in #2616
- docs: update README by @zhyncs in #2618
- [FIX] Update EOS from config by @zhengy001 in #2475
- [minor] clean up docs and eos id by @merrymercy in #2622
- Add more supporting organizations by @merrymercy in #2623
- Update readme by @ispobock in #2625
- avoid fused_moe_triton `padding` circular import by @BBuf in #2624
- [CI] Fix nightly test and raise better error message by @merrymercy in #2626
- Docs: Add constrained decoding tutorial by @shuaills in #2614
- [docs]Refactor constrained decoding tutorial by @shuaills in #2633
- add configs for block fp8 related kernels by @zhyncs in #2628
- Add `update_weights_from_tensor` by @fzyzcjy in #2631
- [Feature] Function Calling by @Tushar-ml in #2544
- [Docs] Add EBNF to sampling params docs by @adarshxs in #2609
- Clean up wrapper in flashinfer backend by @merrymercy in #2638
- minor: add nsys cli for docker dev by @zhyncs in #2639
- Add llama_eagle.py by @merrymercy in #2640
- [Session] Update session control interface by @Ying1123 in #2635
- AMD: set weights and scaling numbers properly for block FP8 by @HaiShaw in #2637
- Update Triton configs for block fp8 kernels by @HandH1998 in #2641
- chore: bump v0.4.1.post2 by @zhyncs in #2643
- docs: update README by @zhyncs in #2644
- docs: add development guide using docker by @zhyncs in #2645
- [Feature] Get Token IDs with Engine.generate() by @shuaills in #2636
- Fix unittest for input tokens by @shuaills in #2646
- skip special token for unit test by @zhaochenyang20 in #2648
- Release 0.4.1.post3 - upload the config.json to PyPI by @merrymercy in #2647
- Update the timeout in nightly-test.yml by @merrymercy in #2649
- add 2*h20 node serving example for deepseek v3 by @Lzhang-hub in #2650
- docs: update README by @zhyncs in #2651
- [feat] Add math eval to CI by @XiaotongJiang in #2652
- Revert "[feat] Add math eval to CI" by @merrymercy in #2656
- fix typo by @HaiShaw in #2655
- [Docs] clean up structured outputs docs by @merrymercy in #2654
- Update structured_outputs.ipynb by @merrymercy in #2666
- Refactor sgl-kernel build by @ispobock in #2642
- Refactor logprob computation to return the real logprob used in sampling by @merrymercy in #2664
- Add GemLite caching after each capture by @mobicham in #2669
- AMD DeepSeek_V3 FP8 Numerical fix by @HaiShaw in #2667
- Minor follow-up fixes for the logprob refactor by @merrymercy in #2670
- Tiny update scripts to fail fast by @fzyzcjy in #2672
- Improve the computation for time_per_output_token Prometheus metrics by @merrymercy in #2674
- Add cutlass submodule for sgl-kernel by @ispobock in #2676
- minor: cleanup sgl-kernel by @zhyncs in #2679
- Eagle speculative decoding part 1: Support target model verification in the attention backend by @merrymercy in #2678
- misc: update CODEOWNERS by @zhyncs in #2680
- feat: use CUDA 12.4 by default (for FA3) by @zhyncs in #2682
- Update README.md by @merrymercy in #2683
- Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging by @merrymercy in #2684
- [Fix] fix openai adapter by @Ying1123 in #2685
- h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2 57BA14B by @BBuf in #2689
- [Docs] refactor Contribution Guide by @shuaills in #2690
- Doc: Rename contribution_guide.md by @zhaochenyang20 in #2691
- ROCm base image update by @kkHuang-amd in #2692
- [Docs] Add Support for Structured Output Format by @shuaills in #2697
- [feat] Add math eval to CI nightly run by @XiaotongJiang in #2663
- Improve moe reduce sum kernel performance by @kkHuang-amd in #2705
- Speed up `update_weights_from_tensor` by @fzyzcjy in #2695
- Eagle speculative decoding part 3: small modifications to the general scheduler by @merrymercy in #2709
- Eagle speculative decoding part 4: Add EAGLE2 worker by @yukavio in #2150
- feat: support moe_align_block_size_triton by @zhyncs in #2712
- Included multi-node DeepSeekv3 example by @roG0d in #2707
- Update documentation workflow and contribution guide by @shuaills in #2704
- [Fix] fix incorrectly overwriting the port specified in ServerArgs by @mickqian in #2714
- [Fix] fix retract error in eagle speculative decoding by @yukavio in #2711
- Support loading pre-sharded moe weights by @merrymercy in #2716
- [Feature, Hardware] Enable DeepseekV3 on AMD GPUs by @BruceXcluding in #2601
- Update README.md by @merrymercy in #2722
- [Docs] fix 404 - Contributor Guide, again by @gaocegege in #2727
- feat: Support VLM in reference_hf by @gaocegege in #2726
- Refactor SchedulePolicy to improve code organization by @libratiger in #2571
- Revert the GLOO_SOCKET_IFNAME change by @merrymercy in #2731
- fix lint by @zhyncs in #2733
- improve moe_align_kernel for deepseek v3 by @BBuf in #2735
- Support twoshot kernel by @yizhang2077 in #2688
- chore: bump v0.4.1.post4 by @zhyncs in #2713
- Fix sgl-kernel cu118 compile issue by @ispobock in #2750
- Remove unused var in moe_align_kernel by @ispobock in #2751
- Support cutlass Int8 gemm by @ispobock in #2752
- Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied by @Xu-Chen in #2748
- feat: add devcontainer.json for VSCode development by @observerw in #2745
- Clean up eagle code by @merrymercy in #2756
- Enable Nvidia's ModelOpt fp8 quantized models by @Edwardf0t1 in #2535
- Add generator-style run_batch function by @xingyaoww in #2513
- Update README.md by @merrymercy in #2757
- Remove --modelopt-config in server_args by @merrymercy in #2758
- add benchmark_moe_align_blocks by @BBuf in #2767
- Use Optional with None default by @HaiShaw in #2770
- Misc fix for min_p_sampling, --cuda-graph-bs by @merrymercy in #2761
- Update int8 gemm config by @ispobock in #2774
- Host memory pool for hierarchical caching by @xiezhq-hermann in #2771
- Disable math eval on nightly CI temporarily by @merrymercy in #2779
- Fix nightly accuracy tests by @merrymercy in #2780
- [eagle2] fix end check when target model verify by @jjjjohnson in #2723
- Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm by @merrymercy in #2784
- Docs: Rewrite docs for LLama 405B and ModelSpace by @minleminzui in #2773
- Update the style of llma 3.1 405B docs by @zhaochenyang20 in #2789
- Update modelopt config and fix running issue by @ispobock in #2792
- Remove vllm dependency in model config by @cermeng in #2809
- Fix typo in cuda_graph_bs by @merrymercy in #2813
- minor: support specifying local dataset path for gsm8k and hellaswag by @sleepcoo in #2816
- [Doc] Deepseek reference docs by @XiaotongJiang in #2787
- Doc: add block-wise FP8 in dpsk model reference by @zhaochenyang20 in #2830
- Update README.md by @merrymercy in #2833
- Add more metrics to serving benchmark. by @Mutinifni in #2819
- [Bugfix] Fix embedding model hangs with `--enable-metrics` by @CatherineSue in #2822
- [Bugfix] Fix bug in fork logic caused by null text_ by @Muqi1029 in #2835
- Fix port number overflow by @gty111 in #2826
- [Eagle2]Fix multiple concurrent request crashes by @coolhok in #2730
- Cache controller for hierarchical caching by @xiezhq-hermann in #2804
- Update threshold in test_nightly_gsm8k_eval.py by @merrymercy in #2836
- [HotFix] fix fp8 scale load failed in tp>1 by @BBuf in #2837
- chore: bump v0.4.1.post5 by @zhyncs in #2840
- docs: update README by @zhyncs in #2841
- Improve: Token-In Token-Out Usage for RLHF by @shuaills in #2843
- add sampling_scaling_penalties kernel by @BBuf in #2846
- fix sgl-kernel build by @zhyncs in #2850
- Add int8 quant kernel by @ispobock in #2848
- Support FP8 E4M3 KV Cache by @bjmsong in #2786
- Update base image for ROCm by @sogalin in #2852
- Integrate ROCm ater package for ck moe function feasibility by @kkHuang-amd in #2854
- [Fix]eagle2 health_generate is first request,apiserver will core by @coolhok in #2853
- Fix linear.py and improve weight loading by @merrymercy in #2851
- Unify sglang coding style by @kkHuang-amd in #2856
- fix: not delete CNAME by @zhyncs in #2860
- docs: update link by @zhyncs in #2857
- minor: use ubuntu-latest instead of self-hosted runner for amd build by @zhyncs in #2861
- Use only one GPU for MLA CI tests by @merrymercy in #2858
- Collect more metrics: num_requests_total by @merrymercy in #2859
- Integration of TurboMind AWQ by @bjmsong in #2828
- Fix quant kernel accuracy issue by @ispobock in #2865
- Revert "Integration of TurboMind AWQ" by @merrymercy in #2866
- Dump requests to a folder by @merrymercy in #2862
- Fix typos in io_struct.py by @merrymercy in #2867
- minor: fix release docs by @zhyncs in #2868
- add qwen2 eagle model by @Lzhang-hub in #2863
- Revert "Dump requests to a folder" by @merrymercy in #2869
- Sampling penalties memory interface by @BBuf in #2870
- CUDA-graph-compatible releasing and resuming KV cache and model weight memory by @fzyzcjy in #2630
- Add a new api configure_logging to allow dumping the requests by @merrymercy in #2875
- docs: update README by @zhyncs in #2878
- Adjust flashinfer workspace size for Qwen2 models by @ispobock in #2879
- update ROCm docker for layernorm kernel optimization by @kkHuang-amd in #2885
- Support w8a8 int8 quantization config by @ispobock in #2881
- feat: support internlm 3 dense by @zhyncs in #2888
- introduce CUB in sgl-kernel by @BBuf in #2887
- chore: bump v0.4.1.post6 by @zhyncs in #2899
- Add ut for w8a8 int8 quantization by @ispobock in #2897
- Disable graceful shutdown of tokenizer manager when not in the main thread by @comaniac in #2872
- optimize custom allreduce kernel by @yizhang2077 in #2904
- fix: sgl-kernel link cuda by @zhyncs in #2906
- adapt custom allreduce for tensorrt llm by @yizhang2077 in #2511
- minor: update pr test by @zhyncs in #2908
- minor: rename bench for sgl kernel by @zhyncs in #2909
- [kernel] MiniMax-Text-01 prefill lightning_attn with triton by @BBuf in #2911
- feat: patch linear base by @zhyncs in #2915
- fix setup for sgl kernel by @zhyncs in #2917
- minor: use bear for compilation database by @zhyncs in #2919
- Improve benchmark scripts and error message printing by @merrymercy in #2922
- fixed lm_head.weight error for quantized qwen by @RinRin-32 in #2910
- add profiling to bench_one_batch script by @yundai424 in #2821
- Simplify the process launch code in server.py by @merrymercy in #2923
- Add CI for sgl-kernel by @ispobock in #2924
- Support multi-node DP attention by @merrymercy in #2925
- Update release-docker-amd.yml to run on amd docker runner. by @saienduri in #2927
- Improve type annotation and styles by @merrymercy in #2926
- [kernel] MiniMax-Text-01 decode lightning_attn with triton by @BBuf in #2920
- Update pull_request_template.md by @zhaochenyang20 in #2928
- Fix zmq binding by @merrymercy in #2930
- [Frontend] Fix request length check and add option to disallow auto truncation in scheduler by @CatherineSue in #2876
- Enable CPU device on SGLang by @chunyuan-w in #2806
- Update release-docs.yml by @merrymercy in #2937
- Fix sgl-kernel ci by @ispobock in #2938
- feat: remove vllm distributed by @zhyncs in #2907
- Fix qwen accuracy issue by @ispobock in #2945
- docs: add Cursor for adoption and sponsorship by @zhyncs in #2950
- update ci install dependency by @zhyncs in #2949
- cleanup models dependencies 1/n by @zhyncs in #2948
- Add ut for qwen model by @ispobock in #2947
- Update pr template by @ispobock in #2951
- cleanup models unused import 2/n by @zhyncs in #2952
- feat: use get_rope for gemma2 by @zhyncs in #2954
- Fix Llama-3.1-405B References Docs by @HermitSun in #2944
- Multi-turn benchmark for hierarchical caching by @xiezhq-hermann in #2942
- support e4m3 kvcache in qwen2 & add kv scaling facotr json by @bjmsong in #2894
- Query remaining memory dynamically for PrefillAdder by @xiezhq-hermann in #2941
- Remove fp8 monkey patch by @ispobock in #2960
- fix sgl-kernel setup.py by @sleepcoo in #2963
- feat: remove vllm get_rope by @zhyncs in #2964
- upgrade cutlass v3.7.0 by @zhyncs in #2967
- optimize MiniMax-Text-01 lightning_attn_decode triton by @BBuf in #2966
- [Feature] Support minicpmv v2.6 by @mickqian in #2785
- fix file name spelling mistake and useless variable in minmax-text-01-lightning_attention by @BBuf in #2971
- Memory pool: Minor optimize to avoid to by @zhengy001 in #2901
- Frontend: better error message handling for FINISH_ABORT in scheduler.py by @CatherineSue in #2956
- Refactor to add TypeBasedDispatcher to simplify dispatching by @fzyzcjy in #2958
- Remove the unused write_with_records by @merrymercy in #2972
- Fix the request loggings to make it fully able to be easily replayed by @merrymercy in #2973
- Simplify logits processor by @merrymercy in #2974
- remove cub and add cccl by @zhyncs in #2976
- [devcontainer] Fix mount and GPU & Support rust dev by @ByronHsu in #2978
- [router] Allow empty worker list for sglang.launch_router by @ByronHsu in #2979
- [router] Fix sgl router path for release by @ByronHsu in #2980
- fix deepseek v2 with cpu device by @zhyncs in #2975
- add config to swtich from vllm custom allreduce to sgl_kernel custom allreduce by @yizhang2077 in #2981
- feat: check for is_cuda for sgl_kernel import by @zhyncs in #2984
- update docker dev image by @zhyncs in #2985
- docs: update supported_models by @zhyncs in #2987
- cleanup unused header in sgl_kernel by @zhyncs in #2986
- fix missing revision arg when loading tokenizer by @giorgiopiatti-dfinity in #2982
- [#2812] Make the decode status dict capcity adjustable by a CLI param by @seungduk-yanolja in #2839
- fix custom op version compatibility by @zhyncs in #2988
- support regex in xgrammar backend by @qeternity in #2983
- [Feature] Add sampler custom logits processor by @hongpeng-guo in #2396
- Move sgl.Runtime under sglang/lang by @merrymercy in #2990
- Improve metrics, logging, and importing orders by @merrymercy in #2992
- Docs: Only use X-Grammar in structed output by @zhaochenyang20 in #2991
- Remove dependency of pynvml on ROCm by @lcskrishna in #2995
- keep rotary_embedding only by @zhyncs in #2997
- Separate two entry points: Engine and HTTP server by @merrymercy in #2996
- Update TypeBasedDispatcher and balance CI tests by @merrymercy in #3001
- Skip flaky custom_logit_processor tests by @merrymercy in #3004
- add performance pic for dpa by @zhaochenyang20 in #3005
- [Enhancement] Custom Logit Processor Improvement by @hongpeng-guo in #2998
- fix deepseekv3 moe align blocks benchmark by @yiakwy-xpu-ml-framework-team in #3003
- Fix perf regression on small batch sizes due to kv cache scale by @merrymercy in #3008
- Roll back to use vllm custom allreduce by @merrymercy in #3006
- Sync distributed package from vllm 0.6.4.post1 by @merrymercy in #3010
- [kernel] port rope cuda kernel to sgl-kernel by @ByronHsu in #2993
- chore: bump v0.4.1.post7 by @zhyncs in #3009
- Add clang-format check to sgl-kernel ci by @ispobock in #3012
- Add compile flags for cutlass 3.x by @ispobock in #3013
- [router] Expose worker startup secs & Return error instead of panic for router init by @ByronHsu in #3016
- [router] Expose worker startup interval by @ByronHsu in #3019
- bump router to 0.1.3 by @ByronHsu in #3020
- deepseek v3 and r1 chat template by @qeternity in #3015
- enable kv_scale remap by @hliuca in #3017
- [Doc] Update doc of custom logit processor by @hongpeng-guo in #3021
- Fix flaky tests in test_programs.py by @merrymercy in #3022
- [EAGLE] Fix some boundary situation when retract reqs and req's max token = 1 by @josephydu in #2939
- Enable Cohere2 Models by @hliuca in #3018
- minor: update Makefile for sgl-kernel by @zhyncs in #3025
- upgrade torch version for sgl-kernel by @zhyncs in #3026
- Add accuracy and latency tests of eagle into CI by @merrymercy in #3027
- feat: add flashinfer as 3rdparty and use rmsnorm as example by @zhyncs in #3033
- Support sm90 Int8 gemm by @ispobock in #3035
- fix pr-test-sgl-kernel by @zhyncs in #3036
- Use int64 as indices for set_kv_buffer by @merrymercy in #3039
- Fix sgl-kernel compile for sm80 by @ispobock in #3046
- update norm cu by @zhyncs in #3048
- sync the upstream updates of flashinfer by @zhyncs in #3051
- feat: integrate norm kernels into sgl-kernel by @zhyncs in #3052
- feat: integrate activation kernels into sgl-kernel by @zhyncs in #3053
- minor: update header and use pytest by @zhyncs in #3054
- feat: integrate bmm_fp8 kernel into sgl-kernel by @zhyncs in #3056
- fix rotary_embedding rope_scaling for phi by @sudo-root-ns in #3055
- add notice about flashinfer in sgl-kernel by @zhyncs in #3057
- disable custom allreduce on HIP by @hliuca in #3058
- [Doc]Update doc of profiling with PyTorch Profiler by @Fridge003 in #3038
- Fix the FP8 E4M3 parsing offline scales failure bug by @sleepcoo in #3045
- Add some flags to allow sync token ids across TP ranks by @merrymercy in #3060
- [devcontainer] add non-root user by @ByronHsu in #2989
- [router] make error actionable by @ByronHsu in #3063
- Fix tp token sync for dp attention by @merrymercy in #3062
- Support loading of larger models with on-the-fly quantization by @kwen2501 in #3061
- Revert "disable custom allreduce on HIP" by @merrymercy in #3067
- docs: add developer guide for sgl-kernel by @zhyncs in #3068
- docs: update developer guide for sgl-kernel by @zhyncs in #3069
- use v0.6.4.post1 for sgl-kernel ci by @zhyncs in #3071
- support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 by @BBuf in #3030
- Remove torch dependency in sgl-kernel by @merrymercy in #3074
- fix build error for sgl-kernel by @zhyncs in #3078
- update version setup for sgl-kernel by @zhyncs in #3079
- use env variable to control the build conf on the CPU build node by @zhyncs in #3080
- sync flashinfer and update sgl-kernel tests by @zhyncs in #3081
- Use flashinfer vec_dtypes in sgl_kernel by @BBuf in #3083
- [hotfix] fix test_sampling_scaling_penalties.py ci test by @BBuf in #3084
- feat: integrate sampling kernels into sgl-kernel by @zhyncs in #3086
- chore: bump sgl-kernel 0.0.2.post16 by @zhyncs in #3087
- Update doc for server arguments by @simveit in #2742
- Add shapes for int8 gemm benchmark by @ispobock in #3093
- [router] Forward all request headers from router to workers by @ByronHsu in #3070
- bump router to 0.1.4 by @ByronHsu in #3094
- [router] Fix twine uploading by @ByronHsu in #3095
- Fix cu118 group gemm compile issue by @ispobock in #3097
- minor: sync flashinfer and add turbomind as 3rdparty by @zhyncs in #3105
- Allow local cutlass directory to be used in sgl-kernel build by @trevor-m in #3037
- [Docs] minor update for phi-3 and phi-4 by @adarshxs in #3096
- minor: update sgl-kernel setup by @zhyncs in #3107
- Add workflow for sgl-kernel cu118 release by @ispobock in #3109
- Add step to update sgl-kernel whl index by @ispobock in #3110
- support fp32 in sampling_scaling_penalties kernel by @BBuf in #3121
- mirror fix for custom allreduce by @yizhang2077 in #3124
- chore: bump v0.0.2.post17 for sgl-kernel by @zhyncs in #3125
- speedup pr test for sgl-kernel by @zhyncs in #3126
- Update tag name for whl release by @ispobock in #3127
- Update whl index path by @ispobock in #3128
- update installation doc for sgl-kernel by @zhyncs in #3129
- feat: refactor sgl-kernel and use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by @yinfan98 in #3130
- Fix CI tests by @merrymercy in #3132
- Use torch.compile for scaling penalty by @merrymercy in #3133
- enable kv_scale for Gemma2 by @hliuca in #3113
- feat: cross python wheel for sgl-kernel by @zhyncs in #3138
- [Fix] Not skip NVML Check on AMD Platform by @BruceXcluding in #3135
- Fix repetition penalty by @merrymercy in #3139
- minor: cleanup sgl-kernel by @zhyncs in #3143
- support w8a8 fp8 kernel with CUTLASS by @HandH1998 in #3047
- Add CPU affinity setting to latency benchmark by @hubertlu-tw in #3085
- Simplify the computation of cached_tokens by @merrymercy in #3145
- Do not load OPENAI_KEY from secrets by @merrymercy in #3147
- chore: bump 0.0.2.post18 for sgl-kernel by @zhyncs in #3149
- Temporarily skip the openai frontend tests by @merrymercy in #3151
- udpate sgl-kernel version for srt by @zhyncs in #3150
- Return more infos for computing average acceptance length by @merrymercy in #3152
- fix link in README by @zhyncs in #3153
- use self-hosted to build sgl-kernel by @zhyncs in #3154
- Feature/function calling update by @YAMY1234 in #2700
- Add function calling in index.rst by @zhaochenyang20 in #3155
- Doc: Add Docs about EAGLE speculative decoding by @jhinpan in #3144
- Add more logprob tests by @merrymercy in #3162
- [kernel] Integrate flashinfer's rope with higher precision and better perf by @ByronHsu in #3134
- add unit test for block wise fp8 by @yizhang2077 in #3156
- Bump sgl kernel to 0.0.2.post19 by @ByronHsu in #3167
- Add activation parameters to fused_moe by @merrymercy in #3170
- [kernel] Fix position ids in rope by @ByronHsu in #3173
- add dsv3 mi300 triton config for block scale by @BruceXcluding in #3146
- Improve weight loading and code style by @merrymercy in #3174
- Update thresholds in test_nightly_gsm8k_eval.py by @merrymercy in #3176
- cleanup sgl-kernel kernels by @zhyncs in #3175
- chore: bump 0.0.3 for sgl-kernel by @zhyncs in #3178
- feat: use sgl-kernel 0.0.3 in sglang by @zhyncs in #3179
- chore: bump v0.4.2 by @zhyncs in #3180
- fix: update Dockerfile for cu118 by @zhyncs in #3181
- Sanity check to prevent performance regression by @xiezhq-hermann in #3171
- Docs fix about EAGLE and streaming output by @jhinpan in #3166
- [test] deduplicate test_session_control by @ByronHsu in #3183
- clean up useless file by @BBuf in #3192
- [kernel] Use sgl_kernel rope by @ByronHsu in #3169
- Fix typo in README by @falegh in #3190
- [Fix] Address remaining issues of supporting MiniCPMV by @mickqian in #2977
- [test] Lower number of top logprobs to get rid of `-inf` by @ByronHsu in #3212
- update 3rdparty and rms norm for sgl-kernel by @zhyncs in #3213
- update setup for sgl-kernel by @zhyncs in #3214
- add tensorrt_llm common and cutlass_extensions as 3rdparty by @zhyncs in #3216
- add tensorrt_llm moe_gemm as 3rdparty by @zhyncs in #3217
- keep the parts needed for moe_kernels by @zhyncs in #3218
- docs: add Novita for adoption and sponsorship by @Ying1123 in #3227
- Update supported models with Mistral 3 by @ravi03071991 in #3229
- revert the MoE dependence by @zhyncs in #3230
- [fix] Clamp logprob with dtype min to prevent `-inf` by @ByronHsu in #3224
- Fix block wise fp8 torch compile by @ispobock in #3232
- support 12.5 CUDA runtime by @zhyncs in #3231
- chore: bump v0.4.2.post1 by @zhyncs in #3233
- Quick fix for Speculative_decoding doc by @jhinpan in #3228
- compatible with flashinfer v0.2 by @zhyncs in #3235
- Optimize MoE topk with torch compile by @ispobock in #3236
- update sgl-kernel version for sglang by @zhyncs in #3238
- update cutlass dependency by @zhyncs in #3240
- add tuning block wise fp8 by @zhyncs in #3242
- [Docs] Add more details to profiling docs by @Edenzzzz in #3221
- Add test for fp8 torch compile by @ispobock in #3246
- update ENV to ROCm dockers by @HaiShaw in #3248
- update and simplify CustomOp by @zhyncs in #3249
- support QuickGELU by @zhyncs in #3250
- add contact us in README by @zhyncs in #3251
- use srt VocabParallelEmbedding by @zhyncs in #3252
- Tune paged attention parameters for AMD GPU. by @whchung in #3255
- docs/accuracy evaluation by @simveit in #3114
- Docs: Update accuracy evaluation by @zhaochenyang20 in #3261
- ROCm: bump 6.3.0 by @HaiShaw in #3259
- Fix min_p sampling crash when using flashinfer backend by @zifeitong in #3207
- Add a Doc about guide on nvidia jetson #3182 by @lycanlancelot in #3205
- optimize test_fused_moe style by @BBuf in #3268
- refactor EAGLE 2 by @zhyncs in #3269
- add copyright for sgl-kernel by @zhyncs in #3270
- adding Triton configs for DeepSeekV3 on Blackwell by @kushanam in #3272
- add Nebius for Adoption and Sponsorship by @zhyncs in #3274
- add Atlas Cloud for Adoption and Sponsorship by @zhyncs in #3276
- Update server args doc by @simveit in #3273
- [Feature] Define backends and add Triton backend for Lora by @Fridge003 in #3161
- upgrade flashinfer v0.2.0.post2 by @zhyncs in #3288
- ROCm: sgl-kernel enablement starting with sgl_moe_align_block by @HaiShaw in #3287
- Update Triton decode backend interface by @ispobock in #3292
- update flashinfer install index url by @zhyncs in #3293
- [ROCm] Add tuning configs for AMD Radeon Graphics. by @whchung in #3294
- [ROCm] Manually unroll _w8a8_block_fp8_matmul kernel on AMD GPU. by @whchung in #3299
- Use forward_cuda to execute custom op for hip platform by @kkHuang-amd in #3305
- [ROCm] Logic to decide whether to used manually unrolled kernel. by @whchung in #3306
- Fix lora flashinfer import bug on ROCM by @Fridge003 in #3312
- chore: bump v0.4.2.post2 by @zhyncs in #3313
- Update Triton extend backend interface by @ispobock in #3309
- Support custom mask for Triton attention by @ispobock in #3317
- Initial Enablement of CI on MI300 by @saienduri in #3168
- update README by @zhyncs in #3324
- Docker switch on mi300 CI. by @saienduri in #3327
- [ROCm] Fix fp8 unrolledx4 matmul kernel. by @whchung in #3325
- clean moe align block kernel code and add acc test by @BBuf in #3332
- Add sgl-kernel to MI300 CI paths tested. by @saienduri in #3335
- update pull request template by @zhyncs in #3337
- add AMD guide for DeepSeek-R1 by @zhyncs in #3338
- [Doc] Add optimization option guide for deepseek v3 by @ispobock in #3349
- fix sgl-kernel build failure on AMD by @zhyncs in #3352
- optimize moe_align_kernel cuda by @BBuf in #3347
- enable fake finish for docs PR by @zhaochenyang20 in #3350
- Feature/docs deepseek usage and add multi-node by @lycanlancelot in #3314
- Feature: Fix the binding error in Llama by @zhaochenyang20 in #3355
- Fix: Runtime error for function calling by @shuaills in #3300
- update waves_per_eu to 1 by @lizamd in #3356
- update unit test in AMD CI by @zhyncs in #3366
- fix undefined symbol cudaGetDriverEntryPointByVersion by @zhyncs in #3372
- support speculative decoding kernel in sgl-kernel by @zhyncs in #3373
- update sgl-kernel version by @zhyncs in #3374
- update pr-test ci by @zhyncs in #3376
- fix EagleVerifyInput by @zhyncs in #3378
- chore: bump v0.4.2.post3 by @zhyncs in #3369
- added amd_configure.md to references by @zstreet87 in #3275
- Add H20 fp8 w8a8 gemm config by @sleepcoo in #3386
- [BUG] fix moe benchmark when bs*seq is small by @yiakwy-xpu-ml-framework-team in #3382
- Update fused_moe's benchmark by @WhatGhost in #3346
- Add deepseek-v3 a100 serving example by @ispobock in #3404
- fix EAGLE 2 non greedy case by @zhyncs in #3407
- add disable cuda graph unit test for eagle 2 by @zhyncs in #3412
- [Fix] Fix eagle with disable cuda graph by @Ying1123 in #3411
- minor: cleanup test_eagle_infer by @zhyncs in #3415
- [docs] Add multi-node inference example for SLURM in documentation by @shuaills in #3408
- fix cu118 link issue by @zhyncs in #3421
- remove cutex dependency by @zhyncs in #3422
- update forward_return_lse by @zhyncs in #3425
- add cuda graph capture failure possible solution by @zhyncs in #3430
- fix draft cuda graph capture failure by @zhyncs in #3431
- remove activation dependency in fused_moe by @zhyncs in #3433
- compatible with new outlines by @zhyncs in #3435
- [Docs] Add quantization docs by @Edenzzzz in #3410
- [docs] Update quantization documentation by @shuaills in #3437
- support version in sgl-kernel by @zhyncs in #3439
- chore: bump sgl-kernel v0.0.3.post3 by @zhyncs in #3440
- fix ci by @zhyncs in #3441
- feat: enable ragged fa3 by default on hopper 12.4+ by @zhyncs in #3442
- Update contribution_guide.md by @Ying1123 in #3452
- remove _grouped_size_compiled_for_decode_kernels by @zhyncs in #3453
- [Fix] Fix accuracy bug and refactor codes for lora by @Fridge003 in #3413
- use nvcr.io/nvidia/tritonserver:24.04-py3-min as base image by @zhyncs in #3457
- chore: bump v0.4.2.post4 by @zhyncs in #3459
- Support Eagle2 for Triton backend by @ispobock in #3466
- [Eagle] reduce one draft forward by @Ying1123 in #3468
- fix mla test by @zhyncs in #3469
- refine some typo by @BBuf in #3473
- [Feat] return hidden states by @Jackmin801 in #3364
- [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics by @BruceXcluding in #3418
- optimize per token group quant fp8 by @BBuf in #3490
- Tune MI300X fused MoE Triton kernel JSON config. by @whchung in #3492
- Support Eagle cuda graph for Triton backend by @ispobock in #3500
- fix deepseek_v3 typo by @didier-durand in #3497
- fix supported_models Qwen typo by @didier-durand in #3498
- fix server_arguments typo by @didier-durand in #3499
- fix router typo by @didier-durand in #3496
- add deepseek-v3 amd docker command by @zstreet87 in #3495
- MI30x: More graph captures for larger batch sizes and concurrencies by @HaiShaw in #3420
- Make NCCL NVLS configurable by @MrAta in #3502
- doc: Support a new vLM by @mickqian in #3405
- refine deepseek_v3 launch server doc by @BBuf in #3522
- chore: bump 0.0.3.post4 sgl-kernel by @zhyncs in #3523
- use sgl_per_token_group_quant_fp8 kernel by @BBuf in #3493
- added llama and cleaned up by @zstreet87 in #3503
- Fix deepseek awq v3 by @hnyls2002 in #3450
- support blockwise fp8 matmul kernel by @yizhang2077 in #3267
- chore: bump 0.0.3.post5 sgl-kernel by @zhyncs in #3530
- integrate blockwise fp8 kernel by @yizhang2077 in #3529
- [ROCm] Add ROCm tuning configs for AMD Instinct MI325X. by @whchung in #3536
- Update DeepSeek V3 Doc by @jhinpan in #3541
- fix moe_align_kernel shm init not sync bug by @BBuf in #3534
- update README by @zhyncs in #3543
- Update install docs by @simveit in #3553
- feat: support flashinfer mla attention for deepseek v3 by @zhyncs in #3550
- chore: bump 0.0.3.post6 sgl-kernel by @zhyncs in #3555
- chore: bump v0.4.3 by @zhyncs in #3556
## New Contributors
- @fsygd made their first contribution in #2596
- @fzyzcjy made their first contribution in #2565
- @JamesSand made their first contribution in #2574
- @yudian0504 made their first contribution in #2521
- @kzhou003 made their first contribution in #2570
- @XiaotongJiang made their first contribution in #2652
- @mobicham made their first contribution in #2669
- @roG0d made their first contribution in #2707
- @mickqian made their first contribution in #2714
- @BruceXcluding made their first contribution in #2601
- @gaocegege made their first contribution in #2727
- @libratiger made their first contribution in #2571
- @observerw made their first contribution in #2745
- @Edwardf0t1 made their first contribution in #2535
- @xingyaoww made their first contribution in #2513
- @jjjjohnson made their first contribution in #2723
- @minleminzui made their first contribution in #2773
- @sleepcoo made their first contribution in #2816
- @Mutinifni made their first contribution in #2819
- @CatherineSue made their first contribution in #2822
- @Muqi1029 made their first contribution in #2835
- @gty111 made their first contribution in #2826
- @coolhok made their first contribution in #2730
- @sogalin made their first contribution in #2852
- @yundai424 made their first contribution in #2821
- @saienduri made their first contribution in #2927
- @chunyuan-w made their first contribution in #2806
- @HermitSun made their first contribution in #2944
- @giorgiopiatti-dfinity made their first contribution in #2982
- @seungduk-yanolja made their first contribution in #2839
- @hongpeng-guo made their first contribution in #2396
- @lcskrishna made their first contribution in #2995
- @yiakwy-xpu-ml-framework-team made their first contribution in #3003
- @josephydu made their first contribution in #2939
- @sudo-root-ns made their first contribution in #3055
- @Fridge003 made their first contribution in #3038
- @simveit made their first contribution in #2742
- @trevor-m made their first contribution in #3037
- @yinfan98 made their first contribution in #3130
- @hubertlu-tw made their first contribution in #3085
- @YAMY1234 made their first contribution in #2700
- @jhinpan made their first contribution in #3144
- @falegh made their first contribution in #3190
- @ravi03071991 made their first contribution in #3229
- @whchung made their first contribution in #3255
- @lycanlancelot made their first contribution in #3205
- @kushanam made their first contribution in #3272
- @lizamd made their first contribution in #3356
- @zstreet87 made their first contribution in #3275
- @WhatGhost made their first contribution in #3346
- @Jackmin801 made their first contribution in #3364
- @didier-durand made their first contribution in #3497
**Full Changelog**: v0.4.1...v0.4.3