Text-to-Image Search via CLIP#
This guide showcases how to fine-tune a CLIP model for text-to-image retrieval.
Task#
We’ll be fine-tuning CLIP on the fashion captioning dataset, which contains information about fashion products.
For each product, the dataset contains a title and images of multiple variants of the product. We constructed a parent Document for each picture; it contains two chunks: an image document and a text document holding the description of the product.
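If you want to prepare a similar dataset yourself, the sketch below shows one way to build such a parent Document with DocArray; the image path and description are purely illustrative.

```python
from docarray import Document, DocumentArray

# a minimal sketch: one parent Document per product image, holding an
# image chunk and a text chunk (values are illustrative, not from the dataset)
product = Document(
    chunks=[
        Document(uri='path/to/variant-0.jpg', modality='image'),   # image chunk
        Document(text='dark blue denim jacket', modality='text'),  # description chunk
    ]
)

train_data = DocumentArray([product])
```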
Data#
Our journey starts locally: we prepare the data and push it to the cloud, and Finetuner can then fetch the dataset by its name. For this example,
we have already prepared the data, so we’ll pass the names of the training and evaluation datasets (clip-fashion-train-data and clip-fashion-eval-data) directly to Finetuner.
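If you build your own dataset instead, you can push a locally prepared DocumentArray to the cloud under a name of your choice. The sketch below assumes parent Documents structured as in the Task section; the file paths, descriptions, and the dataset name `my-clip-train-data` are all chosen for illustration.

```python
from docarray import Document, DocumentArray

# sketch: push a locally prepared DocumentArray to the cloud so that
# Finetuner can later resolve it by name (all values are illustrative)
train_data = DocumentArray(
    Document(
        chunks=[
            Document(uri=f'images/product-{i}.jpg', modality='image'),
            Document(text=f'description of product {i}', modality='text'),
        ]
    )
    for i in range(3)
)
train_data.push('my-clip-train-data')
```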
Backbone model#
Currently, we only support openai/clip-vit-base-patch32 for text-to-image retrieval tasks. However, you can see all available models either in the choose backbone section or by calling describe_models().
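For example, you can list all supported backbones directly from Python:

```python
import finetuner

# prints an overview of every backbone model Finetuner currently supports
finetuner.describe_models()
```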
Fine-tuning#
From now on, all the action happens in the cloud!
First, you need to log in to the Jina ecosystem:
```python
import finetuner

finetuner.login()
```
Now that everything’s ready, let’s create a fine-tuning run!
```python
import finetuner

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',
    run_name='clip-fashion',
    train_data='clip-fashion-train-data',
    eval_data='clip-fashion-eval-data',
    epochs=5,
    learning_rate=1e-5,
    loss='CLIPLoss',
    cpu=False,
)
```
Let’s understand what this piece of code does:
finetuner.fit parameters
The only required arguments are `model` and `train_data`. We provide default values for the others. Here is the full list of parameters.
- We start by providing the `model`, the `run_name`, and the names of the training and evaluation data.
- We also provide some hyper-parameters, such as the number of `epochs` and a `learning_rate`.
- We don’t attach any callbacks to this run; see the note on evaluating CLIP below.
Monitoring#
We created a run! Now let’s see its status.
```python
print(run.status())
```

```
{'status': 'CREATED', 'details': 'Run submitted and awaits execution'}
```
Since some runs might take several hours or even days, it’s important to know how to reconnect to Finetuner and retrieve your run.
```python
import finetuner

finetuner.login()
run = finetuner.get_run('clip-fashion')
```
You can continue monitoring the run by checking its status with `status()` or its logs with `logs()`.
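For instance, a simple way to wait for a long run is to poll its status periodically; the sketch below assumes the terminal states are reported as FINISHED and FAILED.

```python
import time

import finetuner

finetuner.login()
run = finetuner.get_run('clip-fashion')

# sketch: poll the run until it reaches a terminal state
# (the exact status strings are an assumption)
while run.status()['status'] not in ('FINISHED', 'FAILED'):
    time.sleep(60)  # check once per minute

print(run.status())
```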
Evaluating#
Currently, we don’t have a user-friendly way to get evaluation metrics from the `EvaluationCallback`.
What you can do for now is to call `logs()` at the end of the run and inspect the results.
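For example, once the run has finished:

```python
print(run.logs())
```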
```
           INFO     Done ✨                                                  __main__.py:219
           INFO     Saving fine-tuned models ...                             __main__.py:222
           INFO     Saving model 'model' in /usr/src/app/tuned-models/model ... __main__.py:233
           INFO     Pushing saved model to Hubble ...                        __main__.py:240
[10:38:14] INFO     Pushed model artifact ID: '62a1af491597c219f6a330fe'     __main__.py:246
           INFO     Finished 🚀                                              __main__.py:248
```
Evaluation of CLIP
In this example, we did not plug in an `EvaluationCallback`, since the callback can only evaluate one model at a time.
In most cases, we want to evaluate two models: i.e., use the `CLIPTextEncoder` to encode the textual Documents as `query_data`, and use the `CLIPImageEncoder` to encode the image Documents as `index_data`.
Then we use the textual Documents to search the image Documents.
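If you want to run such an evaluation yourself, the sketch below illustrates the idea with DocArray’s `match`. The `encode_text` and `encode_image` helpers are hypothetical placeholders for the fine-tuned text and image encoders; here they just assign random embeddings so the example runs end-to-end.

```python
import numpy as np
from docarray import Document, DocumentArray


def encode_text(docs: DocumentArray) -> DocumentArray:
    # placeholder for the fine-tuned CLIP text encoder:
    # a real encoder would compute meaningful embeddings here
    docs.embeddings = np.random.rand(len(docs), 512).astype('float32')
    return docs


def encode_image(docs: DocumentArray) -> DocumentArray:
    # placeholder for the fine-tuned CLIP image encoder
    docs.embeddings = np.random.rand(len(docs), 512).astype('float32')
    return docs


# textual Documents act as queries, image Documents as the index
query_data = encode_text(DocumentArray([Document(text='dark blue denim jacket')]))
index_data = encode_image(DocumentArray([Document(uri=f'img-{i}.jpg') for i in range(100)]))

# use the textual Documents to search the image Documents
query_data.match(index_data, limit=10, metric='cosine')

for query in query_data:
    print(query.text, '->', [m.uri for m in query.matches])
```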
We have done this evaluation for you; the results are shown in the table below.
| Metric | Before Finetuning | After Finetuning |
|---|---|---|
| average_precision | 0.253423 | 0.415924 |
| dcg_at_k | 0.902417 | 2.14489 |
| f1_score_at_k | 0.0831918 | 0.241773 |
| hit_at_k | 0.611976 | 0.856287 |
| ndcg_at_k | 0.350172 | 0.539948 |
| precision_at_k | 0.0994012 | 0.256587 |
| r_precision | 0.231756 | 0.35847 |
| recall_at_k | 0.108982 | 0.346108 |
| reciprocal_rank | 0.288791 | 0.487505 |
Saving#
After the run has finished successfully, you can download the tuned model to your local machine:
```python
run.save_artifact('clip-model')
```
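If you have disconnected in the meantime, reconnect first and check that the run has completed before downloading; the terminal status string used below is an assumption.

```python
import finetuner

finetuner.login()
run = finetuner.get_run('clip-fashion')

# download the tuned model only once the run has finished successfully
# (the exact terminal status string is an assumption)
if run.status()['status'] == 'FINISHED':
    run.save_artifact('clip-model')
```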