Backbone Model
Finetuner provides several widely used backbone models, including resnet, efficientnet, clip and bert.
Finetuner converts these backbone models into embedding models by removing the head or applying pooling, then fine-tunes them to produce the final embedding model. The embedding model can be fine-tuned for text-to-text, image-to-image or text-to-image search tasks.
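To make "applying pooling" concrete: a transformer backbone produces one vector per token, and a pooling step collapses them into a single fixed-size embedding. The sketch below shows masked mean pooling in plain Python. It is an illustration under assumptions, not Finetuner's implementation: the function name `mean_pool` and the toy data are hypothetical, and the source does not state which pooling strategy Finetuner applies.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the vectors of non-padding tokens into one embedding (hypothetical helper)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("need one mask entry per token vector")
    # Keep only the tokens the mask marks as real (1), dropping padding (0).
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    # Average each dimension across the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors of dimension 2; the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = mean_pool(tokens, mask)
print(embedding)
```

Whatever the number of input tokens, the result has the same dimensionality as a single token vector, which is what makes it usable as a search embedding.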
To get a list of supported models, call:
import finetuner
finetuner.describe_models()
This prints a table of the available backbones:
Finetuner backbones
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ model ┃ task ┃ output_dim ┃ architecture ┃ description ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ resnet50 │ image-to-image │ 2048 │ CNN │ Pretrained on ImageNet │
│ resnet152 │ image-to-image │ 2048 │ CNN │ Pretrained on ImageNet │
│ efficientnet_b0 │ image-to-image │ 1280 │ CNN │ Pretrained on ImageNet │
│ efficientnet_b4 │ image-to-image │ 1280 │ CNN │ Pretrained on ImageNet │
│ openai/clip-vit-base-patch32 │ text-to-image │ 768 │ transformer │ Pretrained on text image pairs by OpenAI │
│ bert-base-cased │ text-to-text │ 768 │ transformer │ Pretrained on BookCorpus and English Wikipedia │
│ sentence-transformers/msmarco-distilbert-base-v3 │ text-to-text │ 768 │ transformer │ Pretrained on BERT, fine-tuned on MS Marco │
└──────────────────────────────────────────────────┴────────────────┴────────────┴──────────────┴────────────────────────────────────────────────┘
ResNets are suitable for image-to-image search tasks with high performance requirements.
EfficientNets are suitable for image-to-image search tasks that need fast training and inference. They are more lightweight than ResNets.
CLIP is suited to text-to-image search, where the images do not need to have any text descriptors.
BERT is generally suitable for text-to-text search tasks.
msmarco-distilbert-base-v3 is suitable for short text-to-text search.
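The guidance above amounts to picking a backbone by search task. As a rough sketch, the lookup below transcribes the model names, tasks and output dimensions from the table printed by `finetuner.describe_models()`. The `BACKBONES` dictionary and the `backbones_for` helper are hypothetical names used for illustration only; they are not part of the Finetuner API.

```python
from typing import Dict, List, Tuple

# Transcribed by hand from the describe_models() table: name -> (task, output_dim).
BACKBONES: Dict[str, Tuple[str, int]] = {
    "resnet50": ("image-to-image", 2048),
    "resnet152": ("image-to-image", 2048),
    "efficientnet_b0": ("image-to-image", 1280),
    "efficientnet_b4": ("image-to-image", 1280),
    "openai/clip-vit-base-patch32": ("text-to-image", 768),
    "bert-base-cased": ("text-to-text", 768),
    "sentence-transformers/msmarco-distilbert-base-v3": ("text-to-text", 768),
}


def backbones_for(task: str) -> List[str]:
    """Return the names of all listed backbones that match the given task."""
    return [name for name, (model_task, _) in BACKBONES.items() if model_task == task]


print(backbones_for("text-to-text"))
```

The returned name is the string listed in the `model` column of the table.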
Note that:
resnet and efficientnet models are loaded from the torchvision library.
Transformer-based models are loaded from the Hugging Face transformers library.
msmarco-distilbert-base-v3 has been fine-tuned once by sentence-transformers on the MS MARCO dataset on top of BERT.