[Bug]: LoRA adapter responses not matching peft/transformers responses #10798
Comments
I've only taken a preliminary look at your issue; I suggest first upgrading Triton to around version 3.1.0.
Thanks @jeejeelee, I just tried with Triton 3.0.0 and got the same result. I can try 3.1.0 if you think it's worth it.
No need, I'll spend some time looking into this issue later.
Any other ways I can dig in to help here?
@RonanKMcGovern I think the cause is rslora. Currently, we do not support this feature.
Thanks @jeejeelee, do you recommend just setting use_rslora to false in the adapter_config.json then? Anything else you recommend changing? Having used rslora just means that a scaling factor was applied to the learning rate during training; I don't believe it should affect how weights are loaded at inference.
No, if you set […]
See: #6909 (comment)
Confirming this issue is caused by rslora being used. The hacky fix is to multiply alpha in the LoRA config by the rank. See #6909.
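As a rough illustration of this workaround, here is a minimal sketch that copies an adapter and patches its `adapter_config.json` before pointing vLLM at it. The paths are placeholders, and the rescaling factor is an assumption based on peft applying `alpha/sqrt(r)` when `use_rslora` is true versus `alpha/r` otherwise; check the discussion in #6909 for the exact factor used there.

```python
# Sketch only: paths and the rescaling factor are assumptions.
import json
import math
import shutil
from pathlib import Path

src = Path("my-lora-adapter")           # adapter trained with use_rslora=True (placeholder path)
dst = Path("my-lora-adapter-rescaled")  # patched copy to point vLLM at (placeholder path)
shutil.copytree(src, dst, dirs_exist_ok=True)

cfg_path = dst / "adapter_config.json"
cfg = json.loads(cfg_path.read_text())

r, alpha = cfg["r"], cfg["lora_alpha"]

# peft applies alpha / sqrt(r) when use_rslora is true, while a loader that only
# implements standard LoRA applies alpha / r. Solving
# alpha_new / r == alpha / sqrt(r) gives alpha_new = alpha * sqrt(r).
cfg["lora_alpha"] = alpha * math.sqrt(r)
cfg["use_rslora"] = False

cfg_path.write_text(json.dumps(cfg, indent=2))
print(f"Patched {cfg_path}: r={r}, lora_alpha {alpha} -> {cfg['lora_alpha']}")
```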
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
Issue: LoRA adapter responses with vLLM do not match peft/transformers responses.
The reproduction involves running inference with (a) vLLM and comparing it against (b) inference with transformers and peft. I have run both on A40 machines on RunPod.
vLLM approach
vLLM server startup with a statically loaded LoRA adapter
vLLM script to call the server endpoint, aka vllm_replication.py
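The actual script is collapsed above; the following is a minimal sketch of this kind of client call, assuming the server was started with LoRA enabled (e.g. `vllm serve <base-model> --enable-lora --lora-modules my-adapter=<adapter-path>`). The adapter name, port, and prompt are placeholders.

```python
# Sketch only: adapter name, port, and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="my-adapter",  # the LoRA adapter name registered at server startup
    messages=[{"role": "user", "content": "Which planets are in our solar system?"}],
    temperature=0.0,     # greedy decoding so the comparison with peft is deterministic
    max_tokens=128,
)
print(response.choices[0].message.content)
```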
Transformers / PEFT script
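The comparison script is likewise collapsed; below is a minimal sketch of the transformers + peft side, with the base model and adapter path as placeholder assumptions, and greedy decoding to match the vLLM call above.

```python
# Sketch only: base model, adapter path, and prompt are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
adapter = "my-lora-adapter"                   # placeholder adapter path

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)  # applies the LoRA (and rsLoRA scaling, if configured)

messages = [{"role": "user", "content": "Which planets are in our solar system?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)  # greedy, to match vLLM
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```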
Results / Output
vLLM Response
Transformers / peft Response
Additional Notes
Questions