
_transform_ndarray vs _transform_blob #904

Closed
mrjackbo opened this issue Apr 5, 2023 · 2 comments · Fixed by #910
mrjackbo commented Apr 5, 2023

Hey, thanks for the great project!

I noticed that you always use _transform_ndarray when encoding an image, while _transform_blob seems to be more in line with the original code (e.g. here and here).

Unfortunately they give quite different results, as explained in the warning here: _transform_blob applies Resize to a PIL Image and therefore uses anti-aliasing, while _transform_ndarray applies Resize to an ndarray and does not. If you plot the two results, they look quite different. In terms of CLIP embeddings, for my example images the two paths give cosine similarities of around 0.94 (ViT-H14::laion2b-s32b-b79k), which is lower than I would have expected.
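The difference can be reproduced without any CLIP code at all. The sketch below is an assumption-laden illustration, not the project's actual implementation: it uses `torch.nn.functional.interpolate` (which accepts an `antialias` flag since PyTorch 1.11, and which torchvision's tensor Resize path uses under the hood) to downsample a stripe pattern with and without anti-aliasing:

```python
import torch
import torch.nn.functional as F

# Period-2 vertical stripes: a worst case for aliasing when downscaling.
img = torch.zeros(1, 1, 512, 512)
img[..., ::2] = 1.0  # every even column is white

# Roughly the PIL-image code path: low-pass filter, then sample.
smooth = F.interpolate(img, size=(96, 96), mode="bilinear",
                       align_corners=False, antialias=True)

# Roughly the ndarray/tensor code path: sample without low-pass filtering.
aliased = F.interpolate(img, size=(96, 96), mode="bilinear",
                        align_corners=False, antialias=False)

# The anti-aliased result averages the stripes toward 0.5; the aliased one
# keeps large spurious swings (a moiré pattern), visible as a much larger std.
print(float(smooth.std()), float(aliased.std()))
```

Pixel-level divergence of this size is exactly what then shows up downstream as embedding drift.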

Am I doing something wrong? Are the models you provide with clip-as-a-service trained with a different preprocessing function than the ones I located in the original repos? Are the text embeddings now slightly misaligned to the image embeddings?
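A suspected misalignment like this can be checked directly by embedding the same image through both preprocessing paths and comparing the results; the ~0.94 figure above is exactly this kind of measurement. A minimal cosine-similarity helper (numpy only; the embedding vectors themselves would come from the model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical embeddings score 1.0; for the same image pushed through the two
# transforms, anything noticeably below 1.0 signals preprocessing drift.
```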


numb3r3 commented Apr 7, 2023

@mrjackbo Thank you for pointing that out. Previously, we ran a simple experiment showing that _transform_ndarray did not hurt downstream tasks (retrieval and zero-shot classification). We therefore concluded that embeddings produced with the same transform operation would be acceptable.

However, your question:

> Are the text embeddings now slightly misaligned to the image embeddings?

made me reconsider. I think you are right; we did not consider this use case. We should switch to _transform_blob, which could improve text-image retrieval quality.
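A minimal sketch of such a switch, with a hypothetical helper name: it assumes the input ndarray is HWC uint8, round-trips it through PIL so that Pillow's resize (which applies its filter with scaled support on downscale, i.e. anti-aliasing) is used, and picks BICUBIC because that is what the original CLIP preprocessing specifies:

```python
import numpy as np
from PIL import Image

def resize_like_blob_path(arr: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize an HWC uint8 ndarray by round-tripping through PIL, so the
    result matches the anti-aliased resize of the blob/PIL code path."""
    img = Image.fromarray(arr).resize((size, size), Image.BICUBIC)
    return np.asarray(img)
```

With this, the ndarray path and the blob path would produce the same pixels before normalization, keeping image embeddings aligned with the text tower.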


saahil commented Apr 12, 2023

TODO:

  • Change the pre-processing implementation in CLIP Executor to address the above bug

@ZiniuYu ZiniuYu linked a pull request Apr 12, 2023 that will close this issue