feat: add three new open clip roberta base models #860
Conversation
Commits:
- feat: bump openclip to v2.5.0
- fix: conflicts
- fix: default fp32 on cpu and fp16 on gpu
- feat: add two new models
- fix: remove debug
- fix: add roberta models (test)
- fix: model name xlm
- fix: (wip)
This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
Codecov Report
```
@@            Coverage Diff             @@
##             main     #860      +/-   ##
==========================================
+ Coverage   80.28%   80.38%   +0.10%
==========================================
  Files          22       22
  Lines        1633     1448     -185
==========================================
- Hits         1311     1164     -147
+ Misses        322      284      -38
==========================================
```

Flags with carried forward coverage won't be shown.
Force-pushed from 1ebe483 to c4beeca.
```python
assert not need_weights, "not allowed to return weights."
assert q.dtype in [
    torch.float16,
    torch.bfloat16,
], f"flash attention only support torch.float16 or torch.bfloat16 but got {q.dtype}."
assert q.is_cuda, "flash attention only support cuda."
```
I would suggest removing these asserts. Keeping them is safer, but it degrades performance a bit.
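One way to keep the safety without asserting in the hot path is to check the preconditions once and dispatch. A minimal sketch, assuming `flash_attn_fn` and `standard_attn_fn` are hypothetical placeholder callables, not functions from this PR:

```python
import torch


def attend(q, k, v, flash_attn_fn, standard_attn_fn):
    # Hypothetical dispatcher: take the flash-attention path only when its
    # preconditions hold (half-precision dtype, CUDA tensor), and fall back
    # to the standard attention implementation otherwise instead of raising.
    if q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        return flash_attn_fn(q, k, v)
    return standard_attn_fn(q, k, v)
```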
What's more, judging from the function's seq_len parameter, it seems the flash-attention implementation can only be used for the text encoder. Can it also be applied to a vision transformer?
Yes, it can. Every image tensor is first converted to a sentence-like token sequence before being fed into the model.
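A minimal sketch of that conversion, assuming a standard ViT patch-embedding front end (the shapes are illustrative, not taken from this PR):

```python
import torch
import torch.nn as nn

# A ViT front end flattens the image into a sequence of patch tokens,
# so the same attention path used for text applies afterwards.
patch_embed = nn.Conv2d(3, 768, kernel_size=32, stride=32)

images = torch.randn(2, 3, 224, 224)        # (batch, channels, H, W)
tokens = patch_embed(images)                # (2, 768, 7, 7)
tokens = tokens.flatten(2).transpose(1, 2)  # (2, 49, 768): sentence-like sequence
```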
Force-pushed from 460622f to cf9595d.
Force-pushed from 88fd2cf to 1bf83ad.
LGTM
Goals: align with open_clip v2.7.0

Changes:
- Add three new models: roberta-ViT-B-32::laion2b-s12b-b32k, xlm-roberta-base-ViT-B-32::laion5b-s13b-b90k, and xlm-roberta-large-ViT-H-14::frozen_laion5b_s13b_b90k (a loading sketch follows the list);
- Add LayerNormFp32 (the original LayerNorm handles fp16); default precision is fp32 on CPU and fp16 on GPU;
- Split CLIP into TextTransformer and VisionTransformer, and add _build_text_tower and _build_vision_tower so the two towers can be built separately.
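As a quick sanity check, one of the new checkpoints can be loaded through open_clip's standard API. A sketch; the pretrained tag below is an assumption derived from the serving alias above and may differ from the exact tag shipped in this PR:

```python
import torch
import open_clip

# Load one of the newly added multilingual checkpoints via open_clip.
# The model name and pretrained tag are assumed, not confirmed by this PR.
model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-base-ViT-B-32", pretrained="laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-base-ViT-B-32")

# Default precision per this PR: fp32 on CPU, fp16 on GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```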