# docs: update finetuner docs #843

**Merged** (9 commits, Oct 21, 2022)

`docs/user-guides/finetuner.md`: 55 changes (50 additions, 5 deletions)
This guide will show you how to use [Finetuner](https://finetuner.jina.ai) to fine-tune a CLIP model and serve it with `clip_server`.
For installation and basic usage of Finetuner, please refer to [Finetuner documentation](https://finetuner.jina.ai).
You can also [learn more details about fine-tuning CLIP](https://finetuner.jina.ai/tasks/text-to-image/).

This tutorial requires `finetuner>=0.6.4` and `clip_server>=0.6.0`.

## Prepare Training Data

Finetuner accepts training data and evaluation data in the form of {class}`~docarray.array.document.DocumentArray`.
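For orientation, here is a rough sketch of what such a DocumentArray can look like for text-to-image fine-tuning. The image URIs and captions below are made up, and the name passed to `push()` is the one later referenced by `finetuner.fit()`; see the Finetuner documentation for the exact data format expected by your Finetuner version.

```python
from docarray import Document, DocumentArray

# Hypothetical (image URI, caption) pairs -- replace with your own dataset.
pairs = [
    ('https://example.com/dress.jpg', 'red floral summer dress'),
    ('https://example.com/shoes.jpg', 'white leather sneakers'),
]

train_da = DocumentArray()
for uri, caption in pairs:
    # One root Document per pair, holding an image chunk and a text chunk.
    train_da.append(Document(chunks=[
        Document(uri=uri, modality='image'),
        Document(text=caption, modality='text'),
    ]))

# Push to Jina AI Cloud so finetuner.fit() can reference the data by name.
train_da.push('clip-fashion-train-data')
```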
```python
import finetuner

finetuner.login()
run = finetuner.fit(
    model='ViT-B-32::openai',  # changed from 'openai/clip-vit-base-patch32' to match Finetuner's open_clip naming
    run_name='clip-fashion',
    train_data='clip-fashion-train-data',
    eval_data='clip-fashion-eval-data',  # optional
    epochs=5,
    learning_rate=1e-5,
    loss='CLIPLoss',
    to_onnx=True,  # new in this update (cpu=False removed); exports the model to ONNX for the clip_onnx executor
)
```
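The run executes in the cloud, so before wiring the result into `clip_server` you need the fine-tuned artifact on disk. Below is a minimal sketch, assuming the Finetuner run API (`get_run`, `status`, `logs`, `save_artifact`); the directory name `clip-fashion-cas` is chosen here to match the `model_path` used in the Flow configuration further down.

```python
import finetuner

finetuner.login()

# Reconnect to the run started above and check on its progress.
run = finetuner.get_run('clip-fashion')
print(run.status())
print(run.logs())

# Once the run has finished, download the fine-tuned (ONNX) model
# so that clip_server can load it locally.
run.save_artifact('clip-fashion-cas')
```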

```yaml
executors:
  # ... (executor name and `uses`/`metas` sections elided in this diff)
        py_modules:
          - clip_server.executors.clip_onnx
      with:
        name: ViT-B-32::openai  # changed from ViT-B/32; must match the model you fine-tuned
        model_path: 'clip-fashion-cas' # path to clip-fashion-cas
    replicas: 1
```

```{warning}
The model name in the Flow configuration must match the model you fine-tuned, otherwise you will get incorrect output. Note that Finetuner supports only a subset of the CLIP models available in `clip_server`.
```

You can use `finetuner.describe_models()` to check which models Finetuner supports; you should see:

```bash
Finetuner backbones
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ name ┃ task ┃ output_dim ┃ architecture ┃ description ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ bert-base-cased │ text-to-text │ 768 │ transformer │ BERT model pre-trained on BookCorpus and English Wikipedia │
│ openai/clip-vit-base-patch16 │ text-to-image │ 512 │ transformer │ CLIP base model with patch size 16 │
│ openai/clip-vit-base-patch32 │ text-to-image │ 512 │ transformer │ CLIP base model │
│ openai/clip-vit-large-patch14-336 │ text-to-image │ 768 │ transformer │ CLIP large model for 336x336 images │
│ openai/clip-vit-large-patch14 │ text-to-image │ 1024 │ transformer │ CLIP large model with patch size 14 │
│ efficientnet_b0 │ image-to-image │ 1280 │ cnn │ EfficientNet B0 pre-trained on ImageNet │
│ efficientnet_b4 │ image-to-image │ 1792 │ cnn │ EfficientNet B4 pre-trained on ImageNet │
│ RN101::openai │ text-to-image │ 512 │ transformer │ Open CLIP "RN101::openai" model │
│ RN101-quickgelu::openai │ text-to-image │ 512 │ transformer │ Open CLIP "RN101-quickgelu::openai" model │
│ RN101-quickgelu::yfcc15m │ text-to-image │ 512 │ transformer │ Open CLIP "RN101-quickgelu::yfcc15m" model │
│ RN101::yfcc15m │ text-to-image │ 512 │ transformer │ Open CLIP "RN101::yfcc15m" model │
│ RN50::cc12m │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50::cc12m" model │
│ RN50::openai │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50::openai" model │
│ RN50-quickgelu::cc12m │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50-quickgelu::cc12m" model │
│ RN50-quickgelu::openai │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50-quickgelu::openai" model │
│ RN50-quickgelu::yfcc15m │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50-quickgelu::yfcc15m" model │
│ RN50x16::openai │ text-to-image │ 768 │ transformer │ Open CLIP "RN50x16::openai" model │
│ RN50x4::openai │ text-to-image │ 640 │ transformer │ Open CLIP "RN50x4::openai" model │
│ RN50x64::openai │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50x64::openai" model │
│ RN50::yfcc15m │ text-to-image │ 1024 │ transformer │ Open CLIP "RN50::yfcc15m" model │
│ ViT-B-16::laion400m_e31 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-16::laion400m_e31" model │
│ ViT-B-16::laion400m_e32 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-16::laion400m_e32" model │
│ ViT-B-16::openai │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-16::openai" model │
│ ViT-B-16-plus-240::laion400m_e31 │ text-to-image │ 640 │ transformer │ Open CLIP "ViT-B-16-plus-240::laion400m_e31" model │
│ ViT-B-16-plus-240::laion400m_e32 │ text-to-image │ 640 │ transformer │ Open CLIP "ViT-B-16-plus-240::laion400m_e32" model │
│ ViT-B-32::laion2b_e16 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32::laion2b_e16" model │
│ ViT-B-32::laion400m_e31 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32::laion400m_e31" model │
│ ViT-B-32::laion400m_e32 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32::laion400m_e32" model │
│ ViT-B-32::openai │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32::openai" model │
│ ViT-B-32-quickgelu::laion400m_e31 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32-quickgelu::laion400m_e31" model │
│ ViT-B-32-quickgelu::laion400m_e32 │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32-quickgelu::laion400m_e32" model │
│ ViT-B-32-quickgelu::openai │ text-to-image │ 512 │ transformer │ Open CLIP "ViT-B-32-quickgelu::openai" model │
│ ViT-L-14-336::openai │ text-to-image │ 768 │ transformer │ Open CLIP "ViT-L-14-336::openai" model │
│ ViT-L-14::openai │ text-to-image │ 768 │ transformer │ Open CLIP "ViT-L-14::openai" model │
│ resnet152 │ image-to-image │ 2048 │ cnn │ ResNet152 pre-trained on ImageNet │
│ resnet50 │ image-to-image │ 2048 │ cnn │ ResNet50 pre-trained on ImageNet │
│ sentence-transformers/msmarco-distilbert-base-v3 │ text-to-text │ 768 │ transformer │ Pretrained BERT, fine-tuned on MS Marco │
└──────────────────────────────────────────────────┴────────────────┴────────────┴──────────────┴────────────────────────────────────────────────────────────┘
```
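To check this list from your own environment, run the call mentioned above:

```python
import finetuner

# Prints the table of supported backbones shown above; make sure the name you
# pass to finetuner.fit() (here 'ViT-B-32::openai') appears in the `name` column.
finetuner.describe_models()
```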


You can now start `clip_server` with the fine-tuned model to get a performance boost:

```bash
python -m clip_server flow.yml  # path to the Flow YAML edited above; filename assumed
```
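Once the server is running, you can query it with `clip_client` just like a stock deployment. A small sketch, assuming the server listens on the default gRPC port 51000 (adjust to whatever your Flow YAML exposes):

```python
from clip_client import Client

# Connect to the clip_server instance started above (port is an assumption).
client = Client('grpc://0.0.0.0:51000')

# Texts and images are now encoded with the fine-tuned model.
embeddings = client.encode(['a red floral summer dress'])
print(embeddings.shape)
```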