v0.4.3

Released by @zhyncs on 14 Feb 02:50 · commit e0b9a42

Highlights

The SGLang team is excited to announce the release of v0.4.3. We will keep improving DeepSeek V3/R1 performance: over the last six weeks, SGLang has been the fastest open-source LLM inference engine for running DeepSeek V3/R1, and we stay ahead by integrating FlashInfer MLA and optimizing further. Look out for new optimizations coming soon! Please feel free to join our Slack channel at https://slack.sglang.ai. Cheers!

Performance Improvements

DeepSeek V3/R1 Optimizations

  • Pioneered the integration of FlashInfer MLA attention, delivering a 4x performance improvement in long-context scenarios (special thanks to the FlashInfer team, @yzh119) #3550
  • Added torch.compile support for FP8, achieving 50 tokens/s for online inference #3232 (a launch sketch combining these options follows this list)
  • Implemented CUTLASS block-wise FP8 kernels for enhanced efficiency
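
As a minimal sketch of putting these pieces together, the snippet below launches SGLang's offline engine for DeepSeek-V3 with the FlashInfer attention backend and torch.compile enabled. The keyword arguments mirror the v0.4.x server arguments, but verify the names against your installed version; the prompt and sampling parameters are purely illustrative.

```python
import sglang as sgl

# Offline engine using the FlashInfer MLA attention path, with torch.compile
# enabled for the FP8 decode path. DeepSeek V3/R1 requires multi-GPU tensor
# parallelism; tp_size=8 assumes an 8-GPU node.
llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V3",  # FP8 checkpoint
    tp_size=8,
    attention_backend="flashinfer",
    enable_torch_compile=True,
    trust_remote_code=True,
)

# A single-prompt generate call returns a dict containing the generated text.
out = llm.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 8})
print(out["text"])
```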

Architecture Enhancements

  • Upgraded to FlashInfer v0.2
  • Enabled Flash Attention 3 by default for prefill
  • Extended EAGLE 2 support (a configuration sketch follows this list):
    • Enhanced integration with the FlashInfer backend
    • Added support in the Triton backend
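
The following is a hedged sketch of turning on EAGLE 2 speculative decoding through the offline engine. The draft-model path is an illustrative placeholder, and the tuning values (draft steps, tree top-k, draft tokens) are examples rather than recommendations; the kwarg names follow the v0.4.x server arguments.

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_algorithm="EAGLE",
    # Placeholder path; substitute a real EAGLE draft checkpoint for your model.
    speculative_draft_model_path="lmsys/sglang-EAGLE-LLaMA3-Instruct-8B",
    speculative_num_steps=5,          # draft steps per verification round
    speculative_eagle_topk=8,         # branching factor of the draft tree
    speculative_num_draft_tokens=64,  # draft tokens verified per round
)

out = llm.generate("Explain speculative decoding in one sentence.", {"max_new_tokens": 64})
print(out["text"])
```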

New Features

  • Introduced function calling capabilities (sketched after this list)
  • Added regex pattern support in the XGrammar backend (sketched after this list)
  • Implemented a custom sampling processor for flexible inference control
  • Integrated LoRA support in the Triton backend (sketched after this list)
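
First, a sketch of exercising function calling through SGLang's OpenAI-compatible endpoint. It assumes a server is already running locally on port 30000; the get_weather tool schema is invented for illustration.

```python
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="default",  # the server serves the launched model; the name may vary
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```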
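Next, a sketch of regex-constrained generation via the native sampling parameters, routed through the XGrammar backend. The grammar_backend kwarg name follows the v0.4.x server arguments, and the pattern is illustrative.

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    grammar_backend="xgrammar",  # route structured-output constraints through XGrammar
)

# The "regex" sampling parameter constrains decoding to strings matching the pattern.
out = llm.generate(
    "Give me an IPv4 address: ",
    {"regex": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", "temperature": 0, "max_new_tokens": 16},
)
print(out["text"])
```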
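Finally, a hedged sketch of the Triton LoRA path. The adapter path is a placeholder, and the lora_paths, lora_backend, and per-request lora_path names follow the v0.4.x API as best understood here; confirm them against your installed version.

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    lora_paths=["path/to/my-adapter"],  # placeholder adapter path
    lora_backend="triton",              # run LoRA kernels on the Triton backend
)

# Each request selects an adapter by path.
out = llm.generate(
    "Hello, my fine-tuned friend:",
    {"max_new_tokens": 16},
    lora_path="path/to/my-adapter",
)
print(out["text"])
```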

What's Changed

New Contributors

Full Changelog: v0.4.1...v0.4.3