Support SGLang as Potential Backend for Evaluation #2703
base: main
Conversation
* loglikelihood_rolling
[Speed compared with vLLM] A6000 (48 GB) as our test bed. [vLLM + disable CUDA graph] speed = 20.90 it/s.
[vLLM + default setting] speed = 22.28 it/s. @mgoin Thanks for your comment :) I reran with your command on the A6000 again.
[SGLang + default setting] speed = 62.92 it/s
@Monstertail SGLang is a beast! One quick note: right now the test uses …
@Qubitium Thanks! We have the world's most talented and diligent team. Enjoy the collaboration!
Thanks for your comment! I will take a look when I am more available :)
Hey @Monstertail, congrats on the PR. If you want to compare the speed of vLLM vs SGLang, please use the default arguments in both cases equally, and do not disable CUDA graphs and lower the batch size for vLLM to artificially reduce its performance. These parameters, set only in your vLLM command, are not the defaults and lower performance:
For your SGLang command you stick to the defaults, so I think it is only fair to compare the defaults in both cases to avoid a misleading comparison. Here are my results on an H100 with commands normalized to use the defaults:
vLLM speed = 74.87 it/s
SGLang speed = 69.01 it/s
Hey @mgoin, thanks for pointing this out. We used your command on our H100 and got these results:
```
CUDA_VISIBLE_DEVICES=0 lm_eval --model vllm --model_args pretrained=Qwen/Qwen2-1.5B-Instruct,dtype=auto --tasks gsm8k_cot --device cuda:0 --apply_chat_template --fewshot_as_multiturn --num_fewshot 8 --gen_kwargs temperature=0 --batch_size auto --seed 123
```

```
Running generate_until requests: 100%|███████████████████| 1319/1319 [00:25<00:00, 51.17it/s]
2025-02-16:18:48:53,326 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
```

vllm (pretrained=Qwen/Qwen2-1.5B-Instruct,dtype=auto), gen_kwargs: (temperature=0), limit: None, num_fewshot: 8, batch_size: auto

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 8|exact_match|↑ |0.5580|± |0.0137|
| | |strict-match | 8|exact_match|↑ |0.5064|± |0.0138|

```
CUDA_VISIBLE_DEVICES=0 lm_eval --model sglang --model_args pretrained=Qwen/Qwen2-1.5B-Instruct,dtype=auto --tasks gsm8k_cot --device "cuda" --apply_chat_template --fewshot_as_multiturn --num_fewshot 8 --gen_kwargs temperature=0 --batch_size auto --seed 123
```
```
Running generate_until requests: 100%|██████████████████| 1319/1319 [00:12<00:00, 108.58it/s]
[2025-02-16 18:51:55] Output path not provided, skipping saving results aggregated
```

sglang (pretrained=Qwen/Qwen2-1.5B-Instruct,dtype=auto), gen_kwargs: (temperature=0), limit: None, num_fewshot: 8, batch_size: auto

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot| 3|flexible-extract| 8|exact_match|↑ |0.5527|± |0.0137|
| | |strict-match | 8|exact_match|↑ |0.5011|± |0.0138|

We can also let community users evaluate this, and we will test on our A100 today.
Hi @mgoin, thanks for your comment. We also used your command on our A100 and got these results. We ran it 5 times and compared the performance of the two backends (all runs use your command):
I updated to the latest vLLM 0.7.2 (it/s): [71.87, 73.51, 71.03, 71.77], avg = 72.045 it/s. Note: if you really want to go for performance here, I would recommend enabling vLLM V1 on the same release; it can get over 200 iterations per second on this workload! vLLM 0.7.2 with …
Here is my full system configuration for additional clarity: the output of `python collect_env.py`.
Hi, I confirmed that what I tested before and now was based on … Second, I tested vLLM V1 on an A100 locally, and it is much faster than without V1: (with V1) vLLM speed = 130 it/s. Great! I bet users will like these improvements!
Third, I noticed the SGLang speed still has a gap on your side. It may be caused by FlashInfer not being installed correctly, or other issues. Anyway, I do think providing a new option is not a bad thing, and we can make the community better together :) Last, I tested with a larger model (Qwen2-7B-Instruct) and noticed that SGLang is faster than vLLM (with V1). Testbed: a single A100 with Qwen2-7B-Instruct.
[vLLM with V1] speed = 45.93 it/s (an awesome 4x improvement with V1)
[SGLang] speed = 55.04 it/s
We also noticed that there is some accuracy gap between the HF model card and lm-eval-harness due to the answer parser. See here for details; we provide a simple solution to bridge the gap.
@baberabb @lintangsutawika Our PR is nearly ready. Thanks so much for the help!
Todos to make the PR better (I will do 1, 2, and 4 later today, but 3 needs discussion):
```python
)
if self.data_parallel_size <= 1:
    self.model = sgl.Engine(**self.model_args)
else:
```
should we skip data parallel?
Hi, I think this can be simplified by merging the two `self.model = sgl.Engine(**self.model_args)` assignments into one, and then throwing a warning if `self.data_parallel_size > 1`,
because dp might be replaced by sglang_router in the future. See here. Shall we keep the warning?
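For illustration, here is a minimal sketch of that simplification, reusing the attribute names from the diff above; the standalone function, logger setup, and warning text are assumptions rather than the PR's actual code:

```python
import logging

import sglang as sgl

eval_logger = logging.getLogger(__name__)


def create_engine(model_args: dict, data_parallel_size: int = 1):
    """Build a single sgl.Engine; warn instead of branching on data parallelism."""
    if data_parallel_size > 1:
        # DP may be handled by sglang_router in the future, so only warn for now.
        eval_logger.warning(
            "data_parallel_size > 1 is not handled by this backend yet; "
            "running a single engine (consider sglang_router for data parallelism)."
        )
    return sgl.Engine(**model_args)
```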
"Either context_length or max_model_len may be provided, but not both" | ||
) | ||
# Initialize your sglang model here | ||
self._max_length = ( |
I was thinking we could also pass a `max_length` kwarg, as that's used throughout the library.
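As a sketch of that idea (the helper name, precedence, and default value are hypothetical; only the mutual-exclusion rule comes from the diff above):

```python
def resolve_max_length(max_length=None, context_length=None, max_model_len=None, default=2048):
    """Return a single maximum sequence length from mutually exclusive aliases."""
    provided = [v for v in (max_length, context_length, max_model_len) if v is not None]
    if len(provided) > 1:
        raise ValueError(
            "Only one of max_length, context_length, or max_model_len may be provided."
        )
    return provided[0] if provided else default
```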
Hi! Thanks very much for the PR. This is excellent! Just a couple of nits in the comments. Could you also add a section in the main README, like all the other backends, mainly providing the command and identifying any footguns? :) With respect to …
Hi @baberabb, thanks for your quick reply! Appreciate your comments :)
What do you think? I will list how I plan to modify the dependencies below.
```diff
@@ -78,6 +78,7 @@ zeno = ["pandas", "zeno-client"]
 wandb = ["wandb>=0.16.3", "pandas", "numpy"]
 gptqmodel = ["gptqmodel>=1.0.9"]
 japanese_leaderboard = ["emoji==2.14.0", "neologdn==0.5.3", "fugashi[unidic-lite]", "rouge_score>=0.1.2"]
+sglang = ["sglang>=0.4.2.post2"]
```
I am thinking about deleting these partial dependencies. Instead, I will remind users in the README to install sglang in advance.
```python
):
    super().__init__()

    if not find_spec("sglang"):
```
Also, modify the error message here to remind the user a second time (apart from the README):
```python
raise ModuleNotFoundError(
    "attempted to use 'sglang' LM type, but package `sglang` is not installed. "
    "Please install sglang via official document here: https://docs.sglang.ai/start/install.html#method-1-with-pip"
)
```
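Putting the check and the message together, a minimal self-contained sketch (the surrounding class body is omitted; `find_spec` comes from `importlib.util`):

```python
from importlib.util import find_spec


def require_sglang() -> None:
    """Fail early with an actionable message if sglang is not importable."""
    if not find_spec("sglang"):
        raise ModuleNotFoundError(
            "attempted to use 'sglang' LM type, but package `sglang` is not installed. "
            "Please install sglang via official document here: "
            "https://docs.sglang.ai/start/install.html#method-1-with-pip"
        )
```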
Yeah. SGLang should be installed separately, but it's easy.
Hi! These seem reasonable. And yeah, I agree it will be best to add the installation instructions in `__init__`, in case it's not installed.
Slightly off-topic: what is the benefit of an offline engine backend over the OpenAI-compatible API that both vLLM and SGLang expose? Are there certain evals that cannot be done with the API server?
@fxmarty-amd Hi, it's true that both vLLM and SGLang support OpenAI-like APIs, but that route would be slower than offline batch inference (based on my previous tests). As the vLLM/SGLang docs note, different server implementations are provided to satisfy different requirements, and we believe offline batch inference is needed as an option here as well. As for completeness of functionality, I think the API server is basically complete for both vLLM and SGLang.
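For context, offline batch inference with SGLang looks roughly like the sketch below, following SGLang's offline engine examples; the model choice mirrors this thread, and the exact output fields may vary across versions:

```python
import sglang as sgl

# Offline batch inference: no HTTP server; prompts are batched inside one engine.
llm = sgl.Engine(model_path="Qwen/Qwen2-1.5B-Instruct")

prompts = [
    "Question: What is 12 * 7? Answer:",
    "Question: Name the capital of France. Answer:",
]
sampling_params = {"temperature": 0, "max_new_tokens": 64}

outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()
```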
As some arguments of the SGLang engine differ from other backends, we are thinking about providing a simple doc in the future that tells users how to run it. Here is a simple example:
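A hedged sketch of what such an example might look like via the harness's Python entry point (`lm_eval.simple_evaluate`); whether engine-specific arguments such as `tp_size` or `mem_fraction_static` are forwarded to `sgl.Engine` depends on the wrapper in this PR, so treat them as illustrative:

```python
import lm_eval

# pretrained/dtype and the task settings mirror the CLI commands earlier in the thread;
# tp_size and mem_fraction_static are SGLang engine options shown only as an illustration.
results = lm_eval.simple_evaluate(
    model="sglang",
    model_args={
        "pretrained": "Qwen/Qwen2-1.5B-Instruct",
        "dtype": "auto",
        "tp_size": 1,
        "mem_fraction_static": 0.8,
    },
    tasks=["gsm8k_cot"],
    num_fewshot=8,
    batch_size="auto",
    apply_chat_template=True,
    fewshot_as_multiturn=True,
    gen_kwargs="temperature=0",
)
print(results["results"]["gsm8k_cot"])
```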
Please tell us what we need to provide or modify! Thanks to our SGLang team manager Chenyang @zhaochenyang20 and co-contributor Xiaotong @XiaotongJiang.