## Highlights
The SGLang team is excited to announce the release of v0.4.3, and we will keep improving DeepSeek V3/R1 performance. Over the last six weeks, SGLang has been the fastest open-source LLM inference engine for DeepSeek V3/R1, and we stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!
## Performance Improvements
### DeepSeek V3/R1 Optimizations
- Pioneering integration of FlashInfer MLA Attention delivers 4x performance improvement for long-context scenarios (Special thanks to the FlashInfer team @yzh119 ) #3550
- Added torch.compile support for FP8, achieving 50 tokens/s for online inference #3232
- Implemented CUTLASS block-wise FP8 for enhanced efficiency
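These optimizations are enabled through server launch flags. A minimal sketch (flag names follow the v0.4.3 CLI; verify against `python -m sglang.launch_server --help` for your version, and adjust the model path and tensor-parallel size for your hardware):

```shell
# Launch DeepSeek V3 with FlashInfer MLA attention and torch.compile.
# 8-way tensor parallelism is illustrative; size it to your GPUs.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code \
  --enable-flashinfer-mla \
  --enable-torch-compile
```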
### Architecture Enhancements
- Upgraded to FlashInfer v0.2
- Enabled Flash Attention 3 by default for prefill
- Extended EAGLE 2 support:
  - Enhanced integration with FlashInfer backend
  - Added support in Triton backend
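EAGLE 2 speculative decoding is configured at launch time. A hypothetical sketch (the flag names follow the v0.4.x CLI, and both the target and draft model paths are illustrative; check `--help` for your version):

```shell
# EAGLE 2 speculative decoding with the Triton attention backend.
# Draft model path and tuning values (steps, topk, draft tokens) are examples only.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-chat-hf \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path yuhuili/EAGLE-llama2-chat-7B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64 \
  --attention-backend triton
```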
## New Features
- Introduced Function Calling capabilities
- Added regex pattern support in XGrammar backend
- Implemented custom sampling processor for flexible inference control
- Integrated LoRA support in Triton backend
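Regex constraints in the XGrammar backend are driven through `sampling_params` on a request. A minimal sketch of a native `/generate` request body (the prompt, regex, and local server URL are illustrative):

```python
import json

# Illustrative /generate request body: constrain the output to an
# IPv4-shaped string via the "regex" sampling parameter, which is
# enforced by the grammar backend during decoding.
payload = {
    "text": "The IP address of localhost is ",
    "sampling_params": {
        "max_new_tokens": 32,
        "temperature": 0.0,
        "regex": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
    },
}

body = json.dumps(payload)
# Send with e.g. requests.post("http://localhost:30000/generate", json=payload)
print(body)
```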
## What's Changed
- docs: add deepseek v3 launch instructions by @zhyncs in #2589
- fix: only enable moe_align_block_size for now by @zhyncs in #2590
- docs: update deepseek v3 example by @zhyncs in #2592
- h100 tuning fused_moe_triton for qwen2 moe by @BBuf in #2560
- Fix cache hit rate when chunked prefill by @hnyls2002 in #2555
- Update README.md by @merrymercy in #2594
- Error occurs when loading the gemma model in bitsandbytes format. by @upskyy in #2557
- [Feature] Support new parameter - EBNF in xgrammar by @adarshxs in #2526
- update readme of DeepSeek V3 by @fsygd in #2596
- Fix logprob_start_len for multi modal models by @merrymercy in #2597
- Fix duplicated handling of GetWeightsByNameReqInput by @fzyzcjy in #2565
- [unittest] add unit test to test quant args of srt engine by @JamesSand in #2574
- Fix test and benchmark scripts by @merrymercy in #2598
- fix: package data missing by @yudian0504 in #2521
- [UTILS] improve makefile a bit by adding help info by @kzhou003 in #2570
- Super tiny typo fix by @fzyzcjy in #2564
- Update contributor_guide.md by @merrymercy in #2603
- Update README.md by @merrymercy in #2605
- Tiny code cleanup in tokenizer_manager.py by @fzyzcjy in #2586
- Regression fix to AMD/ROCm from recent change by @HaiShaw in #2606
- Update CODEOWNERS by @merrymercy in #2608
- Fused moe triton cfg opt for rocm by @kkHuang-amd in #2612
- Fix triton kernel performance regression by @kkHuang-amd in #2611
- Change extend attention kernel launch parameter for ROCm platform to … by @kkHuang-amd in #2610
- fix moe_align_block_size by @HandH1998 in #2615
- update sgl_moe_align_block_size usage by @HandH1998 in #2617
- chore: bump v0.4.1.post1 by @zhyncs in #2616
- docs: update README by @zhyncs in #2618
- [FIX] Update EOS from config by @zhengy001 in #2475
- [minor] clean up docs and eos id by @merrymercy in #2622
- Add more supporting organizations by @merrymercy in #2623
- Update readme by @ispobock in #2625
- avoid fused_moe_triton `padding` circular import by @BBuf in #2624
- [CI] Fix nightly test and raise better error message by @merrymercy in #2626
- Docs: Add constrained decoding tutorial by @shuaills in #2614
- [docs]Refactor constrained decoding tutorial by @shuaills in #2633
- add configs for block fp8 related kernels by @zhyncs in #2628
- Add `update_weights_from_tensor` by @fzyzcjy in #2631
- [Feature] Function Calling by @Tushar-ml in #2544
- [Docs] Add EBNF to sampling params docs by @adarshxs in #2609
- Clean up wrapper in flashinfer backend by @merrymercy in #2638
- minor: add nsys cli for docker dev by @zhyncs in #2639
- Add llama_eagle.py by @merrymercy in #2640
- [Session] Update session control interface by @Ying1123 in #2635
- AMD: set weights and scaling numbers properly for block FP8 by @HaiShaw in #2637
- Update Triton configs for block fp8 kernels by @HandH1998 in #2641
- chore: bump v0.4.1.post2 by @zhyncs in #2643
- docs: update README by @zhyncs in #2644
- docs: add development guide using docker by @zhyncs in #2645
- [Feature] Get Token IDs with Engine.generate() by @shuaills in #2636
- Fix unittest for input tokens by @shuaills in #2646
- skip special token for unit test by @zhaochenyang20 in #2648
- Release 0.4.1.post3 - upload the config.json to PyPI by @merrymercy in #2647
- Update the timeout in nightly-test.yml by @merrymercy in #2649
- add 2*h20 node serving example for deepseek v3 by @Lzhang-hub in #2650
- docs: update README by @zhyncs in #2651
- [feat] Add math eval to CI by @XiaotongJiang in #2652
- Revert "[feat] Add math eval to CI" by @merrymercy in #2656
- fix typo by @HaiShaw in #2655
- [Docs] clean up structured outputs docs by @merrymercy in #2654
- Update structured_outputs.ipynb by @merrymercy in #2666
- Refactor sgl-kernel build by @ispobock in #2642
- Refactor logprob computation to return the real logprob used in sampling by @merrymercy in #2664
- Add GemLite caching after each capture by @mobicham in #2669
- AMD DeepSeek_V3 FP8 Numerical fix by @HaiShaw in #2667
- Minor follow-up fixes for the logprob refactor by @merrymercy in #2670
- Tiny update scripts to fail fast by @fzyzcjy in #2672
- Improve the computation for time_per_output_token Prometheus metrics by @merrymercy in #2674
- Add cutlass submodule for sgl-kernel by @ispobock in #2676
- minor: cleanup sgl-kernel by @zhyncs in #2679
- Eagle speculative decoding part 1: Support target model verification in the attention backend by @merrymercy in #2678
- misc: update CODEOWNERS by @zhyncs in #2680
- feat: use CUDA 12.4 by default (for FA3) by @zhyncs in #2682
- Update README.md by @merrymercy in #2683
- Eagle speculative decoding part 2: Fix cuda graph + DP attention hanging by @merrymercy in #2684
- [Fix] fix openai adapter by @Ying1123 in #2685
- h200 tuning fused_moe_triton config for Mixtral 8x7B/8x22B and Qwen2 57BA14B by @BBuf in #2689
- [Docs] refactor Contribution Guide by @shuaills in #2690
- Doc: Rename contribution_guide.md by @zhaochenyang20 in #2691
- ROCm base image update by @kkHuang-amd in #2692
- [Docs] Add Support for Structured Output Format by @shuaills in #2697
- [feat] Add math eval to CI nightly run by @XiaotongJiang in #2663
- Improve moe reduce sum kernel performance by @kkHuang-amd in #2705
- Speed up `update_weights_from_tensor` by @fzyzcjy in #2695
- Eagle speculative decoding part 3: small modifications to the general scheduler by @merrymercy in #2709
- Eagle speculative decoding part 4: Add EAGLE2 worker by @yukavio in #2150
- feat: support moe_align_block_size_triton by @zhyncs in #2712
- Included multi-node DeepSeekv3 example by @roG0d in #2707
- Update documentation workflow and contribution guide by @shuaills in #2704
- [Fix] fix incorrectly overwriting the port specified in ServerArgs by @mickqian in #2714
- [Fix] fix retract error in eagle speculative decoding by @yukavio in #2711
- Support loading pre-sharded moe weights by @merrymercy in #2716
- [Feature, Hardware] Enable DeepseekV3 on AMD GPUs by @BruceXcluding in #2601
- Update README.md by @merrymercy in #2722
- [Docs] fix 404 - Contributor Guide, again by @gaocegege in #2727
- feat: Support VLM in reference_hf by @gaocegege in #2726
- Refactor SchedulePolicy to improve code organization by @libratiger in #2571
- Revert the GLOO_SOCKET_IFNAME change by @merrymercy in #2731
- fix lint by @zhyncs in #2733
- improve moe_align_kernel for deepseek v3 by @BBuf in #2735
- Support twoshot kernel by @yizhang2077 in #2688
- chore: bump v0.4.1.post4 by @zhyncs in #2713
- Fix sgl-kernel cu118 compile issue by @ispobock in #2750
- Remove unused var in moe_align_kernel by @ispobock in #2751
- Support cutlass Int8 gemm by @ispobock in #2752
- Support llamafy/Qwen-Qwen2.5-7B-Instruct-llamafied by @Xu-Chen in #2748
- feat: add devcontainer.json for VSCode development by @observerw in #2745
- Clean up eagle code by @merrymercy in #2756
- Enable Nvidia's ModelOpt fp8 quantized models by @Edwardf0t1 in #2535
- Add generator-style run_batch function by @xingyaoww in #2513
- Update README.md by @merrymercy in #2757
- Remove --modelopt-config in server_args by @merrymercy in #2758
- add benchmark_moe_align_blocks by @BBuf in #2767
- Use Optional with None default by @HaiShaw in #2770
- Misc fix for min_p_sampling, --cuda-graph-bs by @merrymercy in #2761
- Update int8 gemm config by @ispobock in #2774
- Host memory pool for hierarchical caching by @xiezhq-hermann in #2771
- Disable math eval on nightly CI temporarily by @merrymercy in #2779
- Fix nightly accuracy tests by @merrymercy in #2780
- [eagle2] fix end check when target model verify by @jjjjohnson in #2723
- Improve linear.py to load sharded weights & remove the dependency of Parameters from vllm by @merrymercy in #2784
- Docs: Rewrite docs for LLama 405B and ModelSpace by @minleminzui in #2773
- Update the style of llma 3.1 405B docs by @zhaochenyang20 in #2789
- Update modelopt config and fix running issue by @ispobock in #2792
- Remove vllm dependency in model config by @cermeng in #2809
- Fix typo in cuda_graph_bs by @merrymercy in #2813
- minor: support specifying local dataset path for gsm8k and hellaswag by @sleepcoo in #2816
- [Doc] Deepseek reference docs by @XiaotongJiang in #2787
- Doc: add block-wise FP8 in dpsk model reference by @zhaochenyang20 in #2830
- Update README.md by @merrymercy in #2833
- Add more metrics to serving benchmark. by @Mutinifni in #2819
- [Bugfix] Fix embedding model hangs with `--enable-metrics` by @CatherineSue in #2822
- [Bugfix] Fix bug in fork logic caused by null text_ by @Muqi1029 in #2835
- Fix port number overflow by @gty111 in #2826
- [Eagle2]Fix multiple concurrent request crashes by @coolhok in #2730
- Cache controller for hierarchical caching by @xiezhq-hermann in #2804
- Update threshold in test_nightly_gsm8k_eval.py by @merrymercy in #2836
- [HotFix] fix fp8 scale load failed in tp>1 by @BBuf in #2837
- chore: bump v0.4.1.post5 by @zhyncs in #2840
- docs: update README by @zhyncs in #2841
- Improve: Token-In Token-Out Usage for RLHF by @shuaills in #2843
- add sampling_scaling_penalties kernel by @BBuf in #2846
- fix sgl-kernel build by @zhyncs in #2850
- Add int8 quant kernel by @ispobock in #2848
- Support FP8 E4M3 KV Cache by @bjmsong in #2786
- Update base image for ROCm by @sogalin in #2852
- Integrate ROCm ater package for ck moe function feasibility by @kkHuang-amd in #2854
- [Fix]eagle2 health_generate is first request,apiserver will core by @coolhok in #2853
- Fix linear.py and improve weight loading by @merrymercy in #2851
- Unify sglang coding style by @kkHuang-amd in #2856
- fix: not delete CNAME by @zhyncs in #2860
- docs: update link by @zhyncs in #2857
- minor: use ubuntu-latest instead of self-hosted runner for amd build by @zhyncs in #2861
- Use only one GPU for MLA CI tests by @merrymercy in #2858
- Collect more metrics: num_requests_total by @merrymercy in #2859
- Integration of TurboMind AWQ by @bjmsong in #2828
- Fix quant kernel accuracy issue by @ispobock in #2865
- Revert "Integration of TurboMind AWQ" by @merrymercy in #2866
- Dump requests to a folder by @merrymercy in #2862
- Fix typos in io_struct.py by @merrymercy in #2867
- minor: fix release docs by @zhyncs in #2868
- add qwen2 eagle model by @Lzhang-hub in #2863
- Revert "Dump requests to a folder" by @merrymercy in #2869
- Sampling penalties memory interface by @BBuf in #2870
- CUDA-graph-compatible releasing and resuming KV cache and model weight memory by @fzyzcjy in #2630
- Add a new api configure_logging to allow dumping the requests by @merrymercy in #2875
- docs: update README by @zhyncs in #2878
- Adjust flashinfer workspace size for Qwen2 models by @ispobock in #2879
- update ROCm docker for layernorm kernel optimization by @kkHuang-amd in #2885
- Support w8a8 int8 quantization config by @ispobock in #2881
- feat: support internlm 3 dense by @zhyncs in #2888
- introduce CUB in sgl-kernel by @BBuf in #2887
- chore: bump v0.4.1.post6 by @zhyncs in #2899
- Add ut for w8a8 int8 quantization by @ispobock in #2897
- Disable graceful shutdown of tokenizer manager when not in the main thread by @comaniac in #2872
- optimize custom allreduce kernel by @yizhang2077 in #2904
- fix: sgl-kernel link cuda by @zhyncs in #2906
- adapt custom allreduce for tensorrt llm by @yizhang2077 in #2511
- minor: update pr test by @zhyncs in #2908
- minor: rename bench for sgl kernel by @zhyncs in #2909
- [kernel] MiniMax-Text-01 prefill lightning_attn with triton by @BBuf in #2911
- feat: patch linear base by @zhyncs in #2915
- fix setup for sgl kernel by @zhyncs in #2917
- minor: use bear for compilation database by @zhyncs in #2919
- Improve benchmark scripts and error message printing by @merrymercy in #2922
- fixed lm_head.weight error for quantized qwen by @RinRin-32 in #2910
- add profiling to bench_one_batch script by @yundai424 in #2821
- Simplify the process launch code in server.py by @merrymercy in #2923
- Add CI for sgl-kernel by @ispobock in #2924
- Support multi-node DP attention by @merrymercy in #2925
- Update release-docker-amd.yml to run on amd docker runner. by @saienduri in #2927
- Improve type annotation and styles by @merrymercy in #2926
- [kernel] MiniMax-Text-01 decode lightning_attn with triton by @BBuf in #2920
- Update pull_request_template.md by @zhaochenyang20 in #2928
- Fix zmq binding by @merrymercy in #2930
- [Frontend] Fix request length check and add option to disallow auto truncation in scheduler by @CatherineSue in #2876
- Enable CPU device on SGLang by @chunyuan-w in #2806
- Update release-docs.yml by @merrymercy in #2937
- Fix sgl-kernel ci by @ispobock in #2938
- feat: remove vllm distributed by @zhyncs in #2907
- Fix qwen accuracy issue by @ispobock in #2945
- docs: add Cursor for adoption and sponsorship by @zhyncs in #2950
- update ci install dependency by @zhyncs in #2949
- cleanup models dependencies 1/n by @zhyncs in #2948
- Add ut for qwen model by @ispobock in #2947
- Update pr template by @ispobock in #2951
- cleanup models unused import 2/n by @zhyncs in #2952
- feat: use get_rope for gemma2 by @zhyncs in #2954
- Fix Llama-3.1-405B References Docs by @HermitSun in #2944
- Multi-turn benchmark for hierarchical caching by @xiezhq-hermann in #2942
- support e4m3 kvcache in qwen2 & add kv scaling facotr json by @bjmsong in #2894
- Query remaining memory dynamically for PrefillAdder by @xiezhq-hermann in #2941
- Remove fp8 monkey patch by @ispobock in #2960
- fix sgl-kernel setup.py by @sleepcoo in #2963
- feat: remove vllm get_rope by @zhyncs in #2964
- upgrade cutlass v3.7.0 by @zhyncs in #2967
- optimize MiniMax-Text-01 lightning_attn_decode triton by @BBuf in #2966
- [Feature] Support minicpmv v2.6 by @mickqian in #2785
- fix file name spelling mistake and useless variable in minmax-text-01-lightning_attention by @BBuf in #2971
- Memory pool: Minor optimize to avoid to by @zhengy001 in #2901
- Frontend: better error message handling for FINISH_ABORT in scheduler.py by @CatherineSue in #2956
- Refactor to add TypeBasedDispatcher to simplify dispatching by @fzyzcjy in #2958
- Remove the unused write_with_records by @merrymercy in #2972
- Fix the request loggings to make it fully able to be easily replayed by @merrymercy in #2973
- Simplify logits processor by @merrymercy in #2974
- remove cub and add cccl by @zhyncs in #2976
- [devcontainer] Fix mount and GPU & Support rust dev by @ByronHsu in #2978
- [router] Allow empty worker list for sglang.launch_router by @ByronHsu in #2979
- [router] Fix sgl router path for release by @ByronHsu in #2980
- fix deepseek v2 with cpu device by @zhyncs in #2975
- add config to swtich from vllm custom allreduce to sgl_kernel custom allreduce by @yizhang2077 in #2981
- feat: check for is_cuda for sgl_kernel import by @zhyncs in #2984
- update docker dev image by @zhyncs in #2985
- docs: update supported_models by @zhyncs in #2987
- cleanup unused header in sgl_kernel by @zhyncs in #2986
- fix missing revision arg when loading tokenizer by @giorgiopiatti-dfinity in #2982
- [#2812] Make the decode status dict capcity adjustable by a CLI param by @seungduk-yanolja in #2839
- fix custom op version compatibility by @zhyncs in #2988
- support regex in xgrammar backend by @qeternity in #2983
- [Feature] Add sampler custom logits processor by @hongpeng-guo in #2396
- Move sgl.Runtime under sglang/lang by @merrymercy in #2990
- Improve metrics, logging, and importing orders by @merrymercy in #2992
- Docs: Only use X-Grammar in structed output by @zhaochenyang20 in #2991
- Remove dependency of pynvml on ROCm by @lcskrishna in #2995
- keep rotary_embedding only by @zhyncs in #2997
- Separate two entry points: Engine and HTTP server by @merrymercy in #2996
- Update TypeBasedDispatcher and balance CI tests by @merrymercy in #3001
- Skip flaky custom_logit_processor tests by @merrymercy in #3004
- add performance pic for dpa by @zhaochenyang20 in #3005
- [Enhancement] Custom Logit Processor Improvement by @hongpeng-guo in #2998
- fix deepseekv3 moe align blocks benchmark by @yiakwy-xpu-ml-framework-team in #3003
- Fix perf regression on small batch sizes due to kv cache scale by @merrymercy in #3008
- Roll back to use vllm custom allreduce by @merrymercy in #3006
- Sync distributed package from vllm 0.6.4.post1 by @merrymercy in #3010
- [kernel] port rope cuda kernel to sgl-kernel by @ByronHsu in #2993
- chore: bump v0.4.1.post7 by @zhyncs in #3009
- Add clang-format check to sgl-kernel ci by @ispobock in #3012
- Add compile flags for cutlass 3.x by @ispobock in #3013
- [router] Expose worker startup secs & Return error instead of panic for router init by @ByronHsu in #3016
- [router] Expose worker startup interval by @ByronHsu in #3019
- bump router to 0.1.3 by @ByronHsu in #3020
- deepseek v3 and r1 chat template by @qeternity in #3015
- enable kv_scale remap by @hliuca in #3017
- [Doc] Update doc of custom logit processor by @hongpeng-guo in #3021
- Fix flaky tests in test_programs.py by @merrymercy in #3022
- [EAGLE] Fix some boundary situation when retract reqs and req's max token = 1 by @josephydu in #2939
- Enable Cohere2 Models by @hliuca in #3018
- minor: update Makefile for sgl-kernel by @zhyncs in #3025
- upgrade torch version for sgl-kernel by @zhyncs in #3026
- Add accuracy and latency tests of eagle into CI by @merrymercy in #3027
- feat: add flashinfer as 3rdparty and use rmsnorm as example by @zhyncs in #3033
- Support sm90 Int8 gemm by @ispobock in #3035
- fix pr-test-sgl-kernel by @zhyncs in #3036
- Use int64 as indices for set_kv_buffer by @merrymercy in #3039
- Fix sgl-kernel compile for sm80 by @ispobock in #3046
- update norm cu by @zhyncs in #3048
- sync the upstream updates of flashinfer by @zhyncs in #3051
- feat: integrate norm kernels into sgl-kernel by @zhyncs in #3052
- feat: integrate activation kernels into sgl-kernel by @zhyncs in #3053
- minor: update header and use pytest by @zhyncs in #3054
- feat: integrate bmm_fp8 kernel into sgl-kernel by @zhyncs in #3056
- fix rotary_embedding rope_scaling for phi by @sudo-root-ns in #3055
- add notice about flashinfer in sgl-kernel by @zhyncs in #3057
- disable custom allreduce on HIP by @hliuca in #3058
- [Doc]Update doc of profiling with PyTorch Profiler by @Fridge003 in #3038
- Fix the FP8 E4M3 parsing offline scales failure bug by @sleepcoo in #3045
- Add some flags to allow sync token ids across TP ranks by @merrymercy in #3060
- [devcontainer] add non-root user by @ByronHsu in #2989
- [router] make error actionable by @ByronHsu in #3063
- Fix tp token sync for dp attention by @merrymercy in #3062
- Support loading of larger models with on-the-fly quantization by @kwen2501 in #3061
- Revert "disable custom allreduce on HIP" by @merrymercy in #3067
- docs: add developer guide for sgl-kernel by @zhyncs in #3068
- docs: update developer guide for sgl-kernel by @zhyncs in #3069
- use v0.6.4.post1 for sgl-kernel ci by @zhyncs in #3071
- support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 by @BBuf in #3030
- Remove torch dependency in sgl-kernel by @merrymercy in #3074
- fix build error for sgl-kernel by @zhyncs in #3078
- update version setup for sgl-kernel by @zhyncs in #3079
- use env variable to control the build conf on the CPU build node by @zhyncs in #3080
- sync flashinfer and update sgl-kernel tests by @zhyncs in #3081
- Use flashinfer vec_dtypes in sgl_kernel by @BBuf in #3083
- [hotfix] fix test_sampling_scaling_penalties.py ci test by @BBuf in #3084
- feat: integrate sampling kernels into sgl-kernel by @zhyncs in #3086
- chore: bump sgl-kernel 0.0.2.post16 by @zhyncs in #3087
- Update doc for server arguments by @simveit in #2742
- Add shapes for int8 gemm benchmark by @ispobock in #3093
- [router] Forward all request headers from router to workers by @ByronHsu in #3070
- bump router to 0.1.4 by @ByronHsu in #3094
- [router] Fix twine uploading by @ByronHsu in #3095
- Fix cu118 group gemm compile issue by @ispobock in #3097
- minor: sync flashinfer and add turbomind as 3rdparty by @zhyncs in #3105
- Allow local cutlass directory to be used in sgl-kernel build by @trevor-m in #3037
- [Docs] minor update for phi-3 and phi-4 by @adarshxs in #3096
- minor: update sgl-kernel setup by @zhyncs in #3107
- Add workflow for sgl-kernel cu118 release by @ispobock in #3109
- Add step to update sgl-kernel whl index by @ispobock in #3110
- support fp32 in sampling_scaling_penalties kernel by @BBuf in #3121
- mirror fix for custom allreduce by @yizhang2077 in #3124
- chore: bump v0.0.2.post17 for sgl-kernel by @zhyncs in #3125
- speedup pr test for sgl-kernel by @zhyncs in #3126
- Update tag name for whl release by @ispobock in #3127
- Update whl index path by @ispobock in #3128
- update installation doc for sgl-kernel by @zhyncs in #3129
- feat: refactor sgl-kernel and use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by @yinfan98 in #3130
- Fix CI tests by @merrymercy in #3132
- Use torch.compile for scaling penalty by @merrymercy in #3133
- enable kv_scale for Gemma2 by @hliuca in #3113
- feat: cross python wheel for sgl-kernel by @zhyncs in #3138
- [Fix] Not skip NVML Check on AMD Platform by @BruceXcluding in #3135
- Fix repetition penalty by @merrymercy in #3139
- minor: cleanup sgl-kernel by @zhyncs in #3143
- support w8a8 fp8 kernel with CUTLASS by @HandH1998 in #3047
- Add CPU affinity setting to latency benchmark by @hubertlu-tw in #3085
- Simplify the computation of cached_tokens by @merrymercy in #3145
- Do not load OPENAI_KEY from secrets by @merrymercy in #3147
- chore: bump 0.0.2.post18 for sgl-kernel by @zhyncs in #3149
- Temporarily skip the openai frontend tests by @merrymercy in #3151
- udpate sgl-kernel version for srt by @zhyncs in #3150
- Return more infos for computing average acceptance length by @merrymercy in #3152
- fix link in README by @zhyncs in #3153
- use self-hosted to build sgl-kernel by @zhyncs in #3154
- Feature/function calling update by @YAMY1234 in #2700
- Add function calling in index.rst by @zhaochenyang20 in #3155
- Doc: Add Docs about EAGLE speculative decoding by @jhinpan in #3144
- Add more logprob tests by @merrymercy in #3162
- [kernel] Integrate flashinfer's rope with higher precision and better perf by @ByronHsu in #3134
- add unit test for block wise fp8 by @yizhang2077 in #3156
- Bump sgl kernel to 0.0.2.post19 by @ByronHsu in #3167
- Add activation parameters to fused_moe by @merrymercy in #3170
- [kernel] Fix position ids in rope by @ByronHsu in #3173
- add dsv3 mi300 triton config for block scale by @BruceXcluding in #3146
- Improve weight loading and code style by @merrymercy in #3174
- Update thresholds in test_nightly_gsm8k_eval.py by @merrymercy in #3176
- cleanup sgl-kernel kernels by @zhyncs in #3175
- chore: bump 0.0.3 for sgl-kernel by @zhyncs in #3178
- feat: use sgl-kernel 0.0.3 in sglang by @zhyncs in #3179
- chore: bump v0.4.2 by @zhyncs in #3180
- fix: update Dockerfile for cu118 by @zhyncs in #3181
- Sanity check to prevent performance regression by @xiezhq-hermann in #3171
- Docs fix about EAGLE and streaming output by @jhinpan in #3166
- [test] deduplicate test_session_control by @ByronHsu in #3183
- clean up useless file by @BBuf in #3192
- [kernel] Use sgl_kernel rope by @ByronHsu in #3169
- Fix typo in README by @falegh in #3190
- [Fix] Address remaining issues of supporting MiniCPMV by @mickqian in #2977
- [test] Lower number of top logprobs to get rid of `-inf` by @ByronHsu in #3212
- update 3rdparty and rms norm for sgl-kernel by @zhyncs in #3213
- update setup for sgl-kernel by @zhyncs in #3214
- add tensorrt_llm common and cutlass_extensions as 3rdparty by @zhyncs in #3216
- add tensorrt_llm moe_gemm as 3rdparty by @zhyncs in #3217
- keep the parts needed for moe_kernels by @zhyncs in #3218
- docs: add Novita for adoption and sponsorship by @Ying1123 in #3227
- Update supported models with Mistral 3 by @ravi03071991 in #3229
- revert the MoE dependence by @zhyncs in #3230
- [fix] Clamp logprob with dtype min to prevent `-inf` by @ByronHsu in #3224
- Fix block wise fp8 torch compile by @ispobock in #3232
- support 12.5 CUDA runtime by @zhyncs in #3231
- chore: bump v0.4.2.post1 by @zhyncs in #3233
- Quick fix for Speculative_decoding doc by @jhinpan in #3228
- compatible with flashinfer v0.2 by @zhyncs in #3235
- Optimize MoE topk with torch compile by @ispobock in #3236
- update sgl-kernel version for sglang by @zhyncs in #3238
- update cutlass dependency by @zhyncs in #3240
- add tuning block wise fp8 by @zhyncs in #3242
- [Docs] Add more details to profiling docs by @Edenzzzz in #3221
- Add test for fp8 torch compile by @ispobock in #3246
- update ENV to ROCm dockers by @HaiShaw in #3248
- update and simplify CustomOp by @zhyncs in #3249
- support QuickGELU by @zhyncs in #3250
- add contact us in README by @zhyncs in #3251
- use srt VocabParallelEmbedding by @zhyncs in #3252
- Tune paged attention parameters for AMD GPU. by @whchung in #3255
- docs/accuracy evaluation by @simveit in #3114
- Docs: Update accuracy evaluation by @zhaochenyang20 in #3261
- ROCm: bump 6.3.0 by @HaiShaw in #3259
- Fix min_p sampling crash when using flashinfer backend by @zifeitong in #3207
- Add a Doc about guide on nvidia jetson #3182 by @lycanlancelot in #3205
- optimize test_fused_moe style by @BBuf in #3268
- refactor EAGLE 2 by @zhyncs in #3269
- add copyright for sgl-kernel by @zhyncs in #3270
- adding Triton configs for DeepSeekV3 on Blackwell by @kushanam in #3272
- add Nebius for Adoption and Sponsorship by @zhyncs in #3274
- add Atlas Cloud for Adoption and Sponsorship by @zhyncs in #3276
- Update server args doc by @simveit in #3273
- [Feature] Define backends and add Triton backend for Lora by @Fridge003 in #3161
- upgrade flashinfer v0.2.0.post2 by @zhyncs in #3288
- ROCm: sgl-kernel enablement starting with sgl_moe_align_block by @HaiShaw in #3287
- Update Triton decode backend interface by @ispobock in #3292
- update flashinfer install index url by @zhyncs in #3293
- [ROCm] Add tuning configs for AMD Radeon Graphics. by @whchung in #3294
- [ROCm] Manually unroll _w8a8_block_fp8_matmul kernel on AMD GPU. by @whchung in #3299
- Use forward_cuda to execute custom op for hip platform by @kkHuang-amd in #3305
- [ROCm] Logic to decide whether to used manually unrolled kernel. by @whchung in #3306
- Fix lora flashinfer import bug on ROCM by @Fridge003 in #3312
- chore: bump v0.4.2.post2 by @zhyncs in #3313
- Update Triton extend backend interface by @ispobock in #3309
- Support custom mask for Triton attention by @ispobock in #3317
- Initial Enablement of CI on MI300 by @saienduri in #3168
- update README by @zhyncs in #3324
- Docker switch on mi300 CI. by @saienduri in #3327
- [ROCm] Fix fp8 unrolledx4 matmul kernel. by @whchung in #3325
- clean moe align block kernel code and add acc test by @BBuf in #3332
- Add sgl-kernel to MI300 CI paths tested. by @saienduri in #3335
- update pull request template by @zhyncs in #3337
- add AMD guide for DeepSeek-R1 by @zhyncs in #3338
- [Doc] Add optimization option guide for deepseek v3 by @ispobock in #3349
- fix sgl-kernel build failure on AMD by @zhyncs in #3352
- optimize moe_align_kernel cuda by @BBuf in #3347
- enable fake finish for docs PR by @zhaochenyang20 in #3350
- Feature/docs deepseek usage and add multi-node by @lycanlancelot in #3314
- Feature: Fix the binding error in Llama by @zhaochenyang20 in #3355
- Fix: Runtime error for function calling by @shuaills in #3300
- update waves_per_eu to 1 by @lizamd in #3356
- update unit test in AMD CI by @zhyncs in #3366
- fix undefined symbol cudaGetDriverEntryPointByVersion by @zhyncs in #3372
- support speculative decoding kernel in sgl-kernel by @zhyncs in #3373
- update sgl-kernel version by @zhyncs in #3374
- update pr-test ci by @zhyncs in #3376
- fix EagleVerifyInput by @zhyncs in #3378
- chore: bump v0.4.2.post3 by @zhyncs in #3369
- added amd_configure.md to references by @zstreet87 in #3275
- Add H20 fp8 w8a8 gemm config by @sleepcoo in #3386
- [BUG] fix moe benchmark when bs*seq is small by @yiakwy-xpu-ml-framework-team in #3382
- Update fused_moe's benchmark by @WhatGhost in #3346
- Add deepseek-v3 a100 serving example by @ispobock in #3404
- fix EAGLE 2 non greedy case by @zhyncs in #3407
- add disable cuda graph unit test for eagle 2 by @zhyncs in #3412
- [Fix] Fix eagle with disable cuda graph by @Ying1123 in #3411
- minor: cleanup test_eagle_infer by @zhyncs in #3415
- [docs] Add multi-node inference example for SLURM in documentation by @shuaills in #3408
- fix cu118 link issue by @zhyncs in #3421
- remove cutex dependency by @zhyncs in #3422
- update forward_return_lse by @zhyncs in #3425
- add cuda graph capture failure possible solution by @zhyncs in #3430
- fix draft cuda graph capture failure by @zhyncs in #3431
- remove activation dependency in fused_moe by @zhyncs in #3433
- compatible with new outlines by @zhyncs in #3435
- [Docs] Add quantization docs by @Edenzzzz in #3410
- [docs] Update quantization documentation by @shuaills in #3437
- support version in sgl-kernel by @zhyncs in #3439
- chore: bump sgl-kernel v0.0.3.post3 by @zhyncs in #3440
- fix ci by @zhyncs in #3441
- feat: enable ragged fa3 by default on hopper 12.4+ by @zhyncs in #3442
- Update contribution_guide.md by @Ying1123 in #3452
- remove _grouped_size_compiled_for_decode_kernels by @zhyncs in #3453
- [Fix] Fix accuracy bug and refactor codes for lora by @Fridge003 in #3413
- use nvcr.io/nvidia/tritonserver:24.04-py3-min as base image by @zhyncs in #3457
- chore: bump v0.4.2.post4 by @zhyncs in #3459
- Support Eagle2 for Triton backend by @ispobock in #3466
- [Eagle] reduce one draft forward by @Ying1123 in #3468
- fix mla test by @zhyncs in #3469
- refine some typo by @BBuf in #3473
- [Feat] return hidden states by @Jackmin801 in #3364
- [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics by @BruceXcluding in #3418
- optimize per token group quant fp8 by @BBuf in #3490
- Tune MI300X fused MoE Triton kernel JSON config. by @whchung in #3492
- Support Eagle cuda graph for Triton backend by @ispobock in #3500
- fix deepseek_v3 typo by @didier-durand in #3497
- fix supported_models Qwen typo by @didier-durand in #3498
- fix server_arguments typo by @didier-durand in #3499
- fix router typo by @didier-durand in #3496
- add deepseek-v3 amd docker command by @zstreet87 in #3495
- MI30x: More graph captures for larger batch sizes and concurrencies by @HaiShaw in #3420
- Make NCCL NVLS configurable by @MrAta in #3502
- doc: Support a new vLM by @mickqian in #3405
- refine deepseek_v3 launch server doc by @BBuf in #3522
- chore: bump 0.0.3.post4 sgl-kernel by @zhyncs in #3523
- use sgl_per_token_group_quant_fp8 kernel by @BBuf in #3493
- added llama and cleaned up by @zstreet87 in #3503
- Fix deepseek awq v3 by @hnyls2002 in #3450
- support blockwise fp8 matmul kernel by @yizhang2077 in #3267
- chore: bump 0.0.3.post5 sgl-kernel by @zhyncs in #3530
- integrate blockwise fp8 kernel by @yizhang2077 in #3529
- [ROCm] Add ROCm tuning configs for AMD Instinct MI325X. by @whchung in #3536
- Update DeepSeek V3 Doc by @jhinpan in #3541
- fix moe_align_kernel shm init not sync bug by @BBuf in #3534
- update README by @zhyncs in #3543
- Update install docs by @simveit in #3553
- feat: support flashinfer mla attention for deepseek v3 by @zhyncs in #3550
- chore: bump 0.0.3.post6 sgl-kernel by @zhyncs in #3555
- chore: bump v0.4.3 by @zhyncs in #3556
## New Contributors
- @fsygd made their first contribution in #2596
- @fzyzcjy made their first contribution in #2565
- @JamesSand made their first contribution in #2574
- @yudian0504 made their first contribution in #2521
- @kzhou003 made their first contribution in #2570
- @XiaotongJiang made their first contribution in #2652
- @mobicham made their first contribution in #2669
- @roG0d made their first contribution in #2707
- @mickqian made their first contribution in #2714
- @BruceXcluding made their first contribution in #2601
- @gaocegege made their first contribution in #2727
- @libratiger made their first contribution in #2571
- @observerw made their first contribution in #2745
- @Edwardf0t1 made their first contribution in #2535
- @xingyaoww made their first contribution in #2513
- @jjjjohnson made their first contribution in #2723
- @minleminzui made their first contribution in #2773
- @sleepcoo made their first contribution in #2816
- @Mutinifni made their first contribution in #2819
- @CatherineSue made their first contribution in #2822
- @Muqi1029 made their first contribution in #2835
- @gty111 made their first contribution in #2826
- @coolhok made their first contribution in #2730
- @sogalin made their first contribution in #2852
- @yundai424 made their first contribution in #2821
- @saienduri made their first contribution in #2927
- @chunyuan-w made their first contribution in #2806
- @HermitSun made their first contribution in #2944
- @giorgiopiatti-dfinity made their first contribution in #2982
- @seungduk-yanolja made their first contribution in #2839
- @hongpeng-guo made their first contribution in #2396
- @lcskrishna made their first contribution in #2995
- @yiakwy-xpu-ml-framework-team made their first contribution in #3003
- @josephydu made their first contribution in #2939
- @sudo-root-ns made their first contribution in #3055
- @Fridge003 made their first contribution in #3038
- @simveit made their first contribution in #2742
- @trevor-m made their first contribution in #3037
- @yinfan98 made their first contribution in #3130
- @hubertlu-tw made their first contribution in #3085
- @YAMY1234 made their first contribution in #2700
- @jhinpan made their first contribution in #3144
- @falegh made their first contribution in #3190
- @ravi03071991 made their first contribution in #3229
- @whchung made their first contribution in #3255
- @lycanlancelot made their first contribution in #3205
- @kushanam made their first contribution in #3272
- @lizamd made their first contribution in #3356
- @zstreet87 made their first contribution in #3275
- @WhatGhost made their first contribution in #3346
- @Jackmin801 made their first contribution in #3364
- @didier-durand made their first contribution in #3497
**Full Changelog**: v0.4.1...v0.4.3