Text-to-Image Search via CLIP#
This guide showcases how to fine-tune a CLIP model for text-to-image retrieval.
Task#
We’ll be fine-tuning CLIP on the fashion captioning dataset, which contains information about fashion products.
For each product, the dataset contains a title and images of multiple variants of the product. We constructed a parent Document for each picture, which contains two chunks: an image Document and a text Document holding the description of the product.
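To make this structure concrete, here is a minimal sketch of how one such parent Document could be built with docarray; the image path and description are placeholders, and the exact docarray API may vary between versions:

```python
from docarray import Document, DocumentArray

# hypothetical product entry: one image variant plus the product description
product = Document(
    chunks=[
        Document(uri='images/dress-variant-1.jpg'),    # image chunk (placeholder path)
        Document(text='Red sleeveless summer dress'),  # text chunk (placeholder description)
    ]
)

# one parent Document per picture goes into the training DocumentArray
train_da = DocumentArray([product])
```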
Data#
Our journey starts locally. We have to prepare the data and push it to the cloud, so that Finetuner can get the dataset by its name. For this example, we already prepared the data, and we’ll provide the names of the training and evaluation data (clip-fashion-train-data and clip-fashion-eval-data) directly to Finetuner.
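If you want to prepare and upload your own dataset instead, a rough sketch could look like the following; it assumes docarray's cloud push API, whose behaviour may differ between versions, and my-clip-train-data is just a placeholder name:

```python
from docarray import Document, DocumentArray

# build a DocumentArray of parent Documents, structured as in the Task section
train_da = DocumentArray(
    [
        Document(
            chunks=[
                Document(uri='images/dress-variant-1.jpg'),    # placeholder image
                Document(text='Red sleeveless summer dress'),  # placeholder description
            ]
        )
    ]
)

# push the dataset to the cloud under a name (placeholder) that can later be
# passed to Finetuner, e.g. train_data='my-clip-train-data'
train_da.push('my-clip-train-data')
```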
Backbone model#
Currently, we only support openai/clip-vit-base-patch32 for text-to-image retrieval tasks. However, you can see all available models either in the choose backbone section or by calling describe_models().
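To check the list programmatically, you can call the helper mentioned above; this short snippet simply lists the supported backbones:

```python
import finetuner

# list all backbone models Finetuner currently supports
finetuner.describe_models()
```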
Fine-tuning#
From now on, all the action happens in the cloud!
First you need to log in to the Jina ecosystem:
```python
import finetuner
finetuner.login()
```
Now that everything’s ready, let’s create a fine-tuning run!
```python
import finetuner

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',
    run_name='clip-fashion',
    train_data='clip-fashion-train-data',
    eval_data='clip-fashion-eval-data',
    epochs=5,
    learning_rate=1e-5,
    loss='CLIPLoss',
    cpu=False,
)
```
Let’s understand what this piece of code does:
finetuner.fit parameters
The only required arguments are model and train_data. We provide default values for the others. Here is the full list of the parameters.
- We start by providing the model, a run_name, and the names of the training and evaluation data.
- We also set some hyper-parameters, such as the number of epochs and the learning_rate.
- Finally, we choose CLIPLoss as the loss function and set cpu=False so that training runs on a GPU. We do not attach a BestModelCheckpoint or EvaluationCallback here; as explained in the Evaluating section below, the EvaluationCallback can only evaluate one model at a time, which does not fit CLIP's separate text and image encoders.
Monitoring#
We created a run! Now let’s see its status.
```python
print(run.status())
```
```text
{'status': 'CREATED', 'details': 'Run submitted and awaits execution'}
```
Since some runs might take several hours or even days, it’s important to know how to reconnect to Finetuner and retrieve your run.
```python
import finetuner
finetuner.login()
run = finetuner.get_run('clip-fashion')
```
You can continue monitoring the run by checking its status with status() or its logs with logs().
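For long-running jobs, you may want to poll the status until the run reaches a terminal state. A minimal sketch, assuming 'FINISHED' and 'FAILED' are the terminal status values (only 'CREATED' appears in the output above, so these names may differ):

```python
import time

# poll once a minute until the run leaves its intermediate states;
# 'FINISHED' and 'FAILED' are assumed terminal status values
while run.status()['status'] not in ('FINISHED', 'FAILED'):
    time.sleep(60)

print(run.status())
print(run.logs())
```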
Evaluating#
Currently, we don’t have a user-friendly way to get evaluation metrics from an EvaluationCallback.
What you can do for now is call logs() at the end of the run and inspect the output:
```text
INFO Done ✨ __main__.py:219
INFO Saving fine-tuned models ... __main__.py:222
INFO Saving model 'model' in /usr/src/app/tuned-models/model ... __main__.py:233
INFO Pushing saved model to Hubble ... __main__.py:240
[10:38:14] INFO Pushed model artifact ID: '62a1af491597c219f6a330fe' __main__.py:246
INFO Finished 🚀 __main__.py:248
```
Evaluation of CLIP
In this example, we did not plug in an EvaluationCallback, since the callback can only evaluate one model at a time. In most cases, we want to evaluate two models: i.e. use the CLIPTextEncoder to encode textual Documents as query_data and the CLIPImageEncoder to encode image Documents as index_data.
Then use the textual Documents to search the image Documents.
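If you want to reproduce such an evaluation yourself, a rough sketch could look like the following. The get_model and encode calls, the select_model values ('clip-text', 'clip-vision'), and the evaluation Documents are assumptions about Finetuner's inference API and may differ between versions; match() is docarray's built-in nearest-neighbour search:

```python
import finetuner
from docarray import Document, DocumentArray

finetuner.login()
run = finetuner.get_run('clip-fashion')

# hypothetical: load the two towers of the tuned CLIP model from the run artifact;
# get_model/encode and the select_model values are assumptions and may differ by version
text_model = finetuner.get_model(run.artifact_id, select_model='clip-text')
image_model = finetuner.get_model(run.artifact_id, select_model='clip-vision')

# placeholder evaluation data: text Documents as queries, image Documents as index
query_data = DocumentArray([Document(text='Red sleeveless summer dress')])
index_data = DocumentArray([Document(uri='images/dress-variant-1.jpg')])

# embed each side with the matching encoder
finetuner.encode(model=text_model, data=query_data)
finetuner.encode(model=image_model, data=index_data)

# search the image Documents with the textual queries and inspect the matches
query_data.match(index_data, limit=10)
print(query_data[0].matches)
```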
We have already done this evaluation for you; the results are shown in the table below.

| Metric | Before Finetuning | After Finetuning |
|---|---|---|
| average_precision | 0.253423 | 0.415924 |
| dcg_at_k | 0.902417 | 2.14489 |
| f1_score_at_k | 0.0831918 | 0.241773 |
| hit_at_k | 0.611976 | 0.856287 |
| ndcg_at_k | 0.350172 | 0.539948 |
| precision_at_k | 0.0994012 | 0.256587 |
| r_precision | 0.231756 | 0.35847 |
| recall_at_k | 0.108982 | 0.346108 |
| reciprocal_rank | 0.288791 | 0.487505 |
Saving#
After the run has finished successfully, you can download the tuned model to your local machine:
```python
run.save_artifact('clip-model')
```
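Since a run may finish while you are disconnected, you can also reconnect later and download the artifact in one step, combining the calls shown earlier; 'clip-model' is simply a local name of our choosing for the downloaded artifact:

```python
import finetuner

finetuner.login()

# reconnect to the run by its name and download the tuned model locally
run = finetuner.get_run('clip-fashion')
run.save_artifact('clip-model')
```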