Hey, thanks for the great project!

I noticed that you always use `_transform_ndarray` when encoding an image, while `_transform_blob` seems to be more in line with the original code (e.g. here and here).
Unfortunately they give quite different results, as explained in the warning here: `_transform_blob` applies `Resize` to a PIL Image and therefore uses anti-aliasing, while `_transform_ndarray` applies `Resize` to an ndarray and does not. If you plot the results, they look quite different. In terms of CLIP embeddings, on my example images I get cosine similarities of around 0.94 (`ViT-H14::laion2b-s32b-b79k`), which is lower than I would have expected.
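A minimal sketch of what I mean, assuming a torchvision version where `Resize` on a tensor does not anti-alias by default (newer versions changed this default); `example.jpg` is just a placeholder:

```python
# Minimal sketch of the anti-aliasing gap: the same Resize behaves
# differently on a PIL Image vs. on a float tensor/ndarray.
# Assumes a torchvision version where tensor Resize does not anti-alias
# by default; "example.jpg" is a placeholder image.
import numpy as np
import torch
from PIL import Image
from torchvision import transforms as T
from torchvision.transforms import InterpolationMode

img = Image.open("example.jpg").convert("RGB")
resize = T.Resize(224, interpolation=InterpolationMode.BICUBIC)

# PIL path (what _transform_blob does): PIL resampling anti-aliases.
pil_out = T.ToTensor()(resize(img))

# Tensor/ndarray path (what _transform_ndarray does): no anti-aliasing
# unless antialias=True is passed explicitly.
arr = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0
tensor_out = resize(arr)

print((pil_out - tensor_out).abs().mean())  # noticeably larger than 0
```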
Am I doing something wrong? Are the models you provide with clip-as-a-service trained with a different preprocessing function than the ones I located in the original repos? Are the text embeddings now slightly misaligned with the image embeddings?
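For reference, a rough way to reproduce the embedding comparison with open_clip directly, outside clip-as-service. The model/pretrained names (my mapping of `ViT-H14::laion2b-s32b-b79k` to open_clip tags), the 224 input size, and the normalization constants are assumptions on my part, not the clip-as-service internals:

```python
# Rough sketch: compare embeddings from the anti-aliased (PIL) preprocessing
# against a tensor-based preprocessing without anti-aliasing, using open_clip
# directly. Model/pretrained tags, the 224 input size and the normalization
# constants are assumptions; "example.jpg" is a placeholder.
import numpy as np
import torch
import open_clip
from PIL import Image
from torchvision import transforms as T
from torchvision.transforms import InterpolationMode

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
model.eval()

img = Image.open("example.jpg").convert("RGB")

# Path 1: the stock PIL-based preprocess shipped with the model (anti-aliased).
x_pil = preprocess(img).unsqueeze(0)

# Path 2: the same steps on a float tensor, where Resize does not anti-alias
# by default (mirrors the ndarray-based transform).
arr = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0
tensor_pre = T.Compose([
    T.Resize(224, interpolation=InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.Normalize((0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711)),
])
x_nd = tensor_pre(arr).unsqueeze(0)

with torch.no_grad():
    sim = torch.nn.functional.cosine_similarity(
        model.encode_image(x_pil), model.encode_image(x_nd)
    )
print(sim.item())  # noticeably below 1.0 (around 0.94 in my case)
```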
@mrjackbo Thank you for pointing that out. Previously, we ran a simple experiment showing that `_transform_ndarray` does not hurt downstream tasks (including retrieval and zero-shot classification). From that, we concluded that embeddings produced with the same transform operation would be acceptable.
However, based on your question:
> Are the text embeddings now slightly misaligned with the image embeddings?
I think you are right; we did not consider this use case. We should use `_transform_blob`, which could potentially improve the text-image retrieval quality.
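For reference, the PIL-based preprocessing in the original OpenAI CLIP repo, which the blob-based transform is meant to mirror, looks roughly like the sketch below; treat it as a reference, not the exact clip-as-service implementation. `Resize` here operates on a PIL Image, so it anti-aliases.

```python
# Reference sketch of the PIL-based CLIP preprocessing (original-repo style).
# Not necessarily the exact clip-as-service code; shown for comparison only.
from PIL import Image
from torchvision.transforms import (CenterCrop, Compose, InterpolationMode,
                                    Normalize, Resize, ToTensor)


def clip_style_transform(n_px: int = 224) -> Compose:
    return Compose([
        Resize(n_px, interpolation=InterpolationMode.BICUBIC),  # PIL resize, anti-aliased
        CenterCrop(n_px),
        lambda image: image.convert("RGB"),
        ToTensor(),
        Normalize((0.48145466, 0.4578275, 0.40821073),
                  (0.26862954, 0.26130258, 0.27577711)),
    ])


# Usage: tensor = clip_style_transform()(Image.open("example.jpg"))
```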