Release Note

This release contains 1 new feature, 1 performance improvement, 2 bug fixes and 4 documentation improvements.

🆕 Features

Allow custom callback in clip_client (#849)

This feature allows clip-client users to send a request to a server and then process the response with a custom callback function. Users can attach custom functions to three callbacks: on_done, on_error and on_always.
The following code snippet shows how to send a request to a server and save the response to a database. For more details, please refer to the CLIP client documentation.
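The original snippet is not reproduced here; the sketch below shows the shape such callbacks might take. The response and document fields (resp.docs, doc.id, doc.embedding) and the server address are assumptions for illustration, not confirmed API details:

```python
db = {}  # stand-in for a real database

def save_to_db(resp):
    # on_done callback: persist each document's embedding, keyed by id
    for doc in resp.docs:
        db[doc.id] = doc.embedding

def log_error(resp):
    # on_error callback: report the failed request
    print(f'request failed: {resp}')

# With a clip_server running (address is an assumption), the callbacks
# would be passed along with the encode request, e.g.:
#
#   from clip_client import Client
#   c = Client('grpc://0.0.0.0:51000')
#   c.encode(['hello', 'world'], on_done=save_to_db, on_error=log_error)
```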
🚀 Performance

Integrate flash attention (#853)

We have integrated the flash attention module as a faster replacement for nn.MultiHeadAttention. To take advantage of this feature, you will need to install the flash attention module manually.
If flash attention is present, clip_server will automatically try to use it.
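A minimal sketch of that "use it if it's installed" behavior, using the standard optional-import pattern (this is an illustration, not clip_server's actual code):

```python
# Try to import the optional flash attention module; fall back to the
# standard attention implementation if it is not installed.
try:
    import flash_attn  # noqa: F401 -- availability check only
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

# Downstream code can then pick the faster path when available:
attention_impl = 'flash' if HAS_FLASH_ATTN else 'nn.MultiHeadAttention'
```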
The table below compares CLIP performance with and without the flash attention module. We conducted all tests on a Tesla T4 GPU and measured how long it took to encode a batch of documents 100 times.
| Model    | Input data | Input shape       | w/o flash attention | w/ flash attention | Speedup |
|----------|------------|-------------------|---------------------|--------------------|---------|
| ViT-B-32 | text       | (1, 77)           | 0.42692             | 0.37867            | 1.1274  |
| ViT-B-32 | text       | (8, 77)           | 0.48738             | 0.45324            | 1.0753  |
| ViT-B-32 | text       | (16, 77)          | 0.4764              | 0.44315            | 1.07502 |
| ViT-B-32 | image      | (1, 3, 224, 224)  | 0.4349              | 0.40392            | 1.0767  |
| ViT-B-32 | image      | (8, 3, 224, 224)  | 0.47367             | 0.45316            | 1.04527 |
| ViT-B-32 | image      | (16, 3, 224, 224) | 0.51586             | 0.50555            | 1.0204  |
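For reference, the Speedup column is simply the ratio of the two timings. For the (1, 77) text row:

```python
# Timings as reported in the table above
without_flash = 0.42692
with_flash = 0.37867

# Speedup = time without flash attention / time with flash attention
speedup = without_flash / with_flash
print(round(speedup, 4))  # → 1.1274, matching the table
```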
Based on our experiments, the performance improvement varies with the model and GPU, but in general the flash attention module improves performance; on the Tesla T4 above, speedups range from roughly 2% to 13%.
🐞 Bug Fixes
Increase timeout at startup for Executor docker images (#854)
During Executor initialization, downloading model parameters can take a long time. If a model is very large and downloads slowly, the Executor may time out and fail before it even starts. We have increased the timeout to 3,000,000 ms (50 minutes).
Install transformers for Executor docker images (#851)
We have added the transformers package to the Executor docker images to support the multilingual CLIP model.
📗 Documentation Improvements

🤟 Contributors

We would like to thank all contributors to this release: