
Add support for Phi3V #2383

Open · wants to merge 12 commits into base: main

Conversation

ravi03071991 (Contributor)

PR to add support for Phi3V.

Fixes #1108

ravi03071991 marked this pull request as ready for review December 7, 2024 09:26
@ravi03071991 (Contributor, Author)

NOTE: Testing is still pending.

ravi03071991 marked this pull request as draft December 9, 2024 18:58
@merrymercy (Contributor) commented Dec 17, 2024

I added you as a co-author in #2500, so your future commits will trigger CI automatically.

@merrymercy (Contributor)

How did you test the correctness of this model locally? Did you compare the logits against the HF implementation, similar to this one? #2365 (comment)
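
For reference, a minimal sketch of such a logit comparison; the checkpoint id, prompt, tolerances, and the `sgl_logits` variable are placeholders, not the exact procedure from #2365:

```python
# Hedged sketch: compare last-token logits from a reference Hugging Face run
# against the implementation under test. The checkpoint name and prompt are
# assumptions; a full Phi-3-vision check would also feed image inputs through
# the processor.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-vision-128k-instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
)

inputs = tokenizer("Describe the image.", return_tensors="pt")
with torch.no_grad():
    hf_logits = hf_model(**inputs).logits[:, -1, :]

# `sgl_logits` would be produced by the SGLang implementation for the same
# inputs; the check itself is an element-wise closeness comparison.
# torch.testing.assert_close(sgl_logits, hf_logits, rtol=1e-2, atol=1e-2)
```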

@ravi03071991 (Contributor, Author)

> How did you test the correctness of this model locally? Did you compare the logits against the HF implementation, similar to this one? #2365 (comment)

Thanks for the pointer @merrymercy. I'll test it out.

Currently, it seems that Phi3VConfig and Phi3VModel are not part of the transformers library. Instead, they reside in the model files. As a result, I need to add them to the transformers library first to make this PR functional.

In other words, the following imports currently throw an error:

from transformers import Phi3VConfig, Phi3VModel
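
Until those classes land in transformers, one possible workaround (a sketch, assuming the checkpoint ships its own remote-code model definitions) is to resolve them through the Auto classes:

```python
# Sketch of a workaround while Phi3VConfig / Phi3VModel are not part of
# transformers: the Auto classes can load the class definitions from the
# checkpoint's remote code instead of from the library itself.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "microsoft/Phi-3-vision-128k-instruct"  # assumed checkpoint
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
print(type(config).__name__, type(model).__name__)  # classes resolved remotely
```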

@merrymercy (Contributor)

@ravi03071991 Can you fix the error in CI? https://github.com/sgl-project/sglang/actions/runs/12452217969/job/34760948018?pr=2383#step:4:984 Please make sure you can run it locally.

merrymercy marked this pull request as ready for review December 26, 2024 16:02
merrymercy self-assigned this Dec 26, 2024
@ravi03071991 (Contributor, Author)

@merrymercy yeah sure. I am still working on it.

@ravi03071991 (Contributor, Author)

Updates:

  1. Added an image processing step prior to image embedding.
  2. Implemented image embedding.
  3. Integrated Phi3VForCausalLM functionalities:
    - Loading weights
    - Embedding text and images
    - Combining text and image embeddings and passing them to the LLM.

TODO:

The logic for combining text and image embeddings remains unclear. I have imported the logic from Qwen2-VL, but it differs from the approach suggested in the Hugging Face Phi3V code base. Specifically, I am struggling to understand the rationale behind combining the embeddings using image offsets and padding.

Can someone help me here?

@zhyncs (Member) commented Jan 8, 2025

> Can someone help me here?

@yizhang2077 may help take a look. Thanks.

@yizhang2077 (Collaborator)

> Can someone help me here?

In the function pad_input_ids, it pads the original input_ids with image tokens and records the image offsets here (where an image embedding starts from). I think you can make some modifications here, and in forward you can replace the embeddings using the image embeddings and image offsets.
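
A minimal sketch of that padding/offset pattern, with purely illustrative names (a single image placeholder and a made-up token id) rather than the actual sglang signatures:

```python
# Illustrative sketch of the padding/offset pattern described above; the
# token id, signature, and shapes are assumptions, not sglang's real API.
from typing import List, Tuple

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder id used for image positions

def pad_input_ids(input_ids: List[int], num_img_tokens: int,
                  image_pos: int) -> Tuple[List[int], List[int]]:
    """Expand the single image placeholder at `image_pos` into
    `num_img_tokens` image tokens and record where the image embedding
    will start."""
    padded = (input_ids[:image_pos]
              + [IMAGE_TOKEN_ID] * num_img_tokens
              + input_ids[image_pos + 1:])
    image_offsets = [image_pos]
    return padded, image_offsets

# Later, in forward(), those positions are overwritten with image embeddings:
#   for offset in image_offsets:
#       inputs_embeds[offset:offset + num_img_tokens] = image_embeds
```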

@ravi03071991 (Contributor, Author) commented Jan 8, 2025

> Can someone help me here?
>
> In the function pad_input_ids, it pads the original input_ids with image tokens and records the image offsets here (where an image embedding starts from). I think you can make some modifications here, and in forward you can replace the embeddings using the image embeddings and image offsets.

Thanks @yizhang2077. pad_input_ids seems to be defined in qwen2_vl, but it does not seem to be used?

@yizhang2077 (Collaborator) commented Jan 8, 2025

> Thanks @yizhang2077. pad_input_ids seems to be defined in qwen2_vl, but it does not seem to be used?

It is used in pad_input_ids_func here.

@ravi03071991 (Contributor, Author)

> It is used in pad_input_ids_func here.

Oh, I see. I missed that. The padding function seems to differ between qwen2_vl and llava. Is it specific to the model, and should it be checked against the HF implementation?

@ravi03071991 (Contributor, Author)

Looks like the model provider has some padding logic here.

@yizhang2077 (Collaborator)

> Oh, I see. I missed that. The padding function seems to differ between qwen2_vl and llava. Is it specific to the model, and should it be checked against the HF implementation?

I think it is specific to the model, since how the padding is done depends on the model implementation.

@ravi03071991 (Contributor, Author)

@yizhang2077 / @zhyncs I think I'm kind of stuck here:

  1. pad_input_ids requires num_img_tokens.
  2. The current implementation computes num_img_tokens based on the image processor, which is called inside the forward method.

I'm a bit confused about how to proceed from here. Could you help me out?

@ravi03071991 (Contributor, Author)

> @yizhang2077 / @zhyncs I think I'm kind of stuck here:
>
>   1. pad_input_ids requires num_img_tokens.
>   2. The current implementation computes num_img_tokens based on the image processor, which is called inside the forward method.
>
> I'm a bit confused about how to proceed from here. Could you help me out?

Okay, I figured out that the best way to solve this is to move the image-processing step into pad_input_ids and store the necessary information in image_inputs so that it can be used later in forward.
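
A rough sketch of that idea; the token id, processor output field, and image_inputs attributes below are assumptions for illustration, not the actual sglang ImageInputs API:

```python
# Hedged sketch: run the image processor inside pad_input_ids so the number
# of image tokens is known before forward(), then stash everything on
# image_inputs for later reuse. All attribute/field names are hypothetical.
IMAGE_TOKEN_ID = 32000  # hypothetical placeholder id

def pad_input_ids(input_ids, image_inputs, image_processor):
    # 1) Process the images up front instead of inside forward().
    processed = image_processor(images=image_inputs.images, return_tensors="pt")
    num_img_tokens = int(processed["num_img_tokens"][0])    # assumed output field
    image_inputs.pixel_values = processed["pixel_values"]   # reused in forward()

    # 2) Pad the prompt with image tokens and record where the image starts.
    image_pos = input_ids.index(IMAGE_TOKEN_ID)
    padded = (input_ids[:image_pos]
              + [IMAGE_TOKEN_ID] * num_img_tokens
              + input_ids[image_pos + 1:])
    image_inputs.image_offsets = [image_pos]
    return padded
```

The idea is that everything forward() later needs (pixel values, offsets, token count) is computed once here and simply read back from image_inputs.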

Successfully merging this pull request may close the following issue: [Feature] Do we have any plan for supporting Phi3V?