Add support for Phi3V #2383
NOTE: Testing is still pending.
I added you as a co-author in #2500, so your future commits will trigger CI automatically.
How did you test the correctness of this model locally? Did you compare the logits against the HF implementation, similar to this one? #2365 (comment)
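A logit comparison of the kind asked about here can be sketched in a few lines. This is only an illustration in plain Python: the helper names `max_abs_diff` and `logits_close` and the tolerance `atol=1e-2` are assumptions made for this sketch, not part of sglang or transformers, and the logits are assumed to have already been extracted as flat lists of floats from both implementations.

```python
def max_abs_diff(ref_logits, test_logits):
    """Largest element-wise absolute difference between two logit vectors."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("logit vectors must have the same length")
    # default=0.0 makes two empty vectors compare as identical.
    return max((abs(r - t) for r, t in zip(ref_logits, test_logits)), default=0.0)


def logits_close(ref_logits, test_logits, atol=1e-2):
    """True when every logit agrees with the reference within atol (hypothetical threshold)."""
    return max_abs_diff(ref_logits, test_logits) <= atol
```

In practice the reference vector would come from the HF model and the test vector from the new implementation, both for the same prompt and image; a suitable tolerance depends on dtype and kernels.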
Thanks for the pointer, @merrymercy. I'll test it out. Currently, Phi3VConfig and Phi3VModel do not seem to be part of the transformers library; they live in the model files instead. As a result, I need to add them to the transformers library first to make this PR functional. In other words, the following import currently throws an error: from transformers import Phi3VConfig, Phi3VModel
@ravi03071991 Can you fix the error in CI? https://github.com/sgl-project/sglang/actions/runs/12452217969/job/34760948018?pr=2383#step:4:984 Please make sure you can run it locally.
@merrymercy Yeah, sure. I am still working on it.
Updates:
TODO: The logic for combining text and image embeddings remains unclear. I have imported the logic from Qwen2 VL, but it differs from the approach in the Hugging Face Phi3V code base. Specifically, I am struggling to understand the rationale behind combining the embeddings using the image offset and padding. Can someone help me here?
@yizhang2077 may help take a look. Thanks.
The pad_input_ids function pads the original input_ids with image tokens and records the image offsets (the positions where each image embedding starts). I think you can make some modifications there, and in forward you can replace the text embeddings with the image embeddings at those offsets.
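The padding-and-offset idea described in this comment can be sketched as follows. This is a simplified illustration in plain Python, not the sglang implementation: the function names `pad_input_ids_sketch` and `merge_embeddings_sketch`, their signatures, and the assumption that every image expands to the same fixed number of tokens are all hypothetical. Real models compute the per-image token count from the image size, and the padding rules are model-specific.

```python
def pad_input_ids_sketch(input_ids, image_token_id, num_image_tokens, pad_value):
    """Expand each image placeholder into num_image_tokens pad tokens.

    Returns the padded ids and the offsets at which each image's
    embedding block starts in the padded sequence.
    """
    padded, offsets = [], []
    for tok in input_ids:
        if tok == image_token_id:
            offsets.append(len(padded))  # start position of this image
            padded.extend([pad_value] * num_image_tokens)
        else:
            padded.append(tok)
    return padded, offsets


def merge_embeddings_sketch(text_embeds, image_embeds, offsets):
    """Overwrite the padded positions with the image embeddings."""
    merged = list(text_embeds)
    for offset, img in zip(offsets, image_embeds):
        # Same-length slice assignment keeps the sequence length unchanged.
        merged[offset:offset + len(img)] = img
    return merged
```

For example, with placeholder id 99 and three tokens per image, `pad_input_ids_sketch([1, 99, 2], 99, 3, -1)` yields the padded ids `[1, -1, -1, -1, 2]` and the offset list `[1]`; the forward pass would then write the three image embedding rows starting at position 1.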
Thanks @yizhang2077. The
It is used in
Oh, I see. I missed that. The padding function seems different for qwen2_vl and llava. Is it specific to the model, and should it be checked against HF?
Looks like the model provider has some padding logic here.
I think it is specific to the model, since how the padding is done depends on the model implementation.
@yizhang2077 / @zhyncs I think I'm kind of stuck here:
I'm a bit confused about how to proceed from here. Could you help me?
Okay. I figured out the best way to solve this is by moving the image_processing step into
PR to add support for Phi3V.
Fixes #1108