Provide an offline engine API #1567
I think |
@zhyncs oh i mean |
@zhyncs I can do the async gen in the next PR.
Please don't merge now. The consistency test is failing on H100 in CI, but passing on my A100.
Hey guys, I'm still getting AttributeError: module 'sglang' has no attribute 'Engine' on sgl 0.3.2. Is this not in a release yet?
@imadoualid the changes should be in the main HEAD if you can build from source.
Motivation
This PR supports "Add APIs for using the inference engine in a single script without launching a separate server" from #1487. It is a simplified version of #1127 by @JianyuZhan that reuses most of the existing code.
Modifications
Context
The current SRT server consists of an HTTP server and the SRT engine.
The HTTP server and the Tokenizer Manager both run in the main process, and there is currently no way to decouple them and instantiate only the Tokenizer Manager.
Decouple SRT engine and HTTP server
This PR introduces the SRT engine by splitting launch_server into launch_server and launch_engine.
launch_server: launch_engine + HTTP server creation, used by SRT Runtime and the standalone server.
launch_engine: SRT engine creation, used by the SRT engine.
New public API: Engine
Lift Engine to the top level of the package, so users can call it directly as sgl.Engine.
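The layering described above can be pictured with a small toy model. This is only a sketch of the idea, not the code in this PR: the ServerArgs fields, ToyTokenizerManager, and the return values are hypothetical stand-ins chosen so the example runs on its own without SGLang installed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServerArgs:
    """Hypothetical, trimmed-down stand-in for the real ServerArgs."""
    model_path: str
    host: str = "127.0.0.1"  # HTTP-only field; launch_engine never reads it
    port: int = 30000        # HTTP-only field; launch_engine never reads it


class ToyTokenizerManager:
    """Placeholder for the component that accepts generation requests."""

    def __init__(self, model_path: str) -> None:
        self.model_path = model_path

    def generate(self, prompt: str) -> Dict[str, str]:
        # A real engine would run inference; the toy just echoes the prompt.
        return {"text": f"[{self.model_path}] echo: {prompt}"}


def launch_engine(server_args: ServerArgs) -> ToyTokenizerManager:
    """Engine creation only. No HTTP server is started here."""
    return ToyTokenizerManager(server_args.model_path)


def launch_server(server_args: ServerArgs) -> Dict[str, object]:
    """Engine creation plus HTTP server creation (the HTTP part is faked)."""
    engine = launch_engine(server_args)
    http_endpoint = f"http://{server_args.host}:{server_args.port}"
    return {"engine": engine, "http_endpoint": http_endpoint}


class Engine:
    """Offline entry point: builds ServerArgs and calls launch_engine directly."""

    def __init__(self, **kwargs) -> None:
        self.server_args = ServerArgs(**kwargs)
        self._tokenizer_manager = launch_engine(self.server_args)

    def generate(self, prompts: List[str]) -> List[Dict[str, str]]:
        return [self._tokenizer_manager.generate(p) for p in prompts]


# Offline path: no HTTP server involved.
engine = Engine(model_path="toy-model")
outputs = engine.generate(["hello", "world"])

# Server path: same engine creation, plus the HTTP layer on top.
server = launch_server(ServerArgs(model_path="toy-model"))
```

In this toy version the host and port fields exist on ServerArgs but are read only by launch_server, which mirrors the point that the engine can be built from the same argument object while ignoring the HTTP-related settings.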
Engine Usage Example
Same settings as vLLM, but using the SRT Engine.
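The original snippet did not survive on this page, so the following is a hedged sketch of what offline batch generation might look like. It assumes the engine exposes a generate(prompts, sampling_params) method that returns one dict with a "text" key per prompt; the prompts, the sampling values, and the model path in the comments are illustrative assumptions. A FakeEngine stands in for the real one so the block runs without a GPU or SGLang installed.

```python
from typing import Any, Dict, List


def run_batch(engine: Any, prompts: List[str],
              sampling_params: Dict[str, Any]) -> List[str]:
    """Run one batch through any engine-like object and return the texts.

    Assumption: engine.generate(prompts, sampling_params) returns a list of
    dicts, one per prompt, each carrying the generated string under "text".
    """
    outputs = engine.generate(prompts, sampling_params)
    return [out["text"] for out in outputs]


class FakeEngine:
    """Hypothetical stand-in so the example is runnable anywhere."""

    def generate(self, prompts: List[str],
                 sampling_params: Dict[str, Any]) -> List[Dict[str, str]]:
        # Upper-casing replaces real inference; sampling_params is unused here.
        return [{"text": p.upper()} for p in prompts]


prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = {"temperature": 0.8, "top_p": 0.95}  # illustrative values

# With a source build that includes this PR, the real engine would presumably
# be created along these lines (assumed usage, model path is a placeholder):
#   import sglang as sgl
#   llm = sgl.Engine(model_path="<path-or-hub-id-of-your-model>")
llm = FakeEngine()

texts = run_batch(llm, prompts, sampling_params)
for prompt, text in zip(prompts, texts):
    print(f"Prompt: {prompt}\nGenerated text: {text}")
```

Keeping the batch logic in a function that accepts any engine-like object makes the script easy to exercise in tests with a fake before pointing it at a real model.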
Discussion
One caveat is that we construct ServerArgs, but the HTTP-server-related args will not be used. I think this is OK because ServerArgs is a superset of the engine args, so it covers everything.
Testing
Add test_srt_engine.py, which runs batch inference and asserts the answer.
TODO
Checklist