CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

Aman Shrivastava; Ramprasaath R. Selvaraju; Nikhil Naik; Vicente Ordonez

publication

Int. Conf. on Artificial Intelligence and Statistics AISTATS 2023. Valencia, Spain / Hybrid.

abstract

We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample when optimizing its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and smaller batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains an absolute performance gain of +14.0% mAP on Pascal VOC classification and a +22.1% gain in top-1 accuracy on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
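
To make the one-negative-per-positive idea concrete, here is a minimal sketch of a mutual information lower bound estimated with a single mismatched caption per image. The abstract does not name the bound, so this sketch assumes a Jensen-Shannon style estimator; the dot-product critic, the "next caption in the batch" negative sampling, and the function names `softplus`, `dot`, `jsd_mi_lower_bound`, and `clip_lite_style_loss` are all illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    # Assumed critic: a plain dot product between feature vectors.
    return sum(a * b for a, b in zip(u, v))


def jsd_mi_lower_bound(image_feats, text_feats):
    """Jensen-Shannon style MI estimate using ONE negative per positive.

    image_feats[i] and text_feats[i] form the positive pair; the negative
    for image i is the caption of the next sample in the batch (assumption).
    """
    n = len(image_feats)
    assert n == len(text_feats) and n >= 2
    pos_term = 0.0
    neg_term = 0.0
    for i in range(n):
        j = (i + 1) % n  # index of the single mismatched caption
        pos_term += -softplus(-dot(image_feats[i], text_feats[i]))
        neg_term += softplus(dot(image_feats[i], text_feats[j]))
    return (pos_term - neg_term) / n


def clip_lite_style_loss(image_feats, text_feats):
    # Maximizing the bound is the same as minimizing its negative.
    return -jsd_mi_lower_bound(image_feats, text_feats)


if __name__ == "__main__":
    images = [[3.0, 0.0], [0.0, 3.0]]
    matched = [[3.0, 0.0], [0.0, 3.0]]
    shuffled = [[0.0, 3.0], [3.0, 0.0]]
    print(clip_lite_style_loss(images, matched))
    print(clip_lite_style_loss(images, shuffled))
```

Under these assumptions the loss is lower when each image is paired with its own caption than when captions are swapped, and the cost per positive pair stays constant in the batch size because only one negative score is computed.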

citation

@inproceedings{shrivastava2023clip,
  title = {CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations},
  author = {Shrivastava, Aman and Selvaraju, Ramprasaath R. and Naik, Nikhil and Ordonez, Vicente},
  year = {2023},
  booktitle = {Int. Conf. on Artificial Intelligence and Statistics AISTATS 2023},
  url = {https://arxiv.org/abs/2112.07133},
}