On the Transferability of Visual Features in Generalized Zero-Shot Learning
publication

On the Transferability of Visual Features in Generalized Zero-Shot Learning

Paola Cascante-Bonilla, Leonid Karlinsky, James Seale Smith, Yanjun Qi, Vicente Ordonez.
arXiv:2211.12494 November 2022.
Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers at Rice University, MIT-IBM Watson AI Lab, Georgia Tech, and the University of Virginia have taken a systematic look at a question that the Generalized Zero-Shot Learning (GZSL) field had largely ignored: does it matter which visual feature extractor you use? GZSL is the problem of training an image classifier that can recognize both familiar categories and entirely new ones it has never seen, relying on attribute descriptions as a bridge. Most prior work in the field had simply plugged in features from a ResNet101 network trained on ImageNet and moved on. The team instead ran a large-scale experiment swapping in a wide range of modern feature extractors — including convolutional networks, Vision Transformers, and MLP-Mixers trained with supervised, self-supervised, and contrastive objectives — across three standard benchmark datasets. They found that the choice of feature extractor matters quite a lot. Models trained using DINO, a self-supervised method combining contrastive learning with self-distillation, produced feature representations that boosted performance by as much as 15 percentage points compared to standard supervised models on fine-grained datasets. Counterintuitively, training on larger datasets like ImageNet-21K did not reliably improve results. They also tested CLIP, the large multimodal model trained on 400 million image-text pairs, and found that while CLIP is strong out of the box, pairing its visual features with generative-based GZSL methods still pushes performance higher on fine-grained tasks, suggesting that the architectural advances of GZSL are not yet obsolete. The work is relevant because it gives practitioners concrete guidance on feature selection and challenges the field's long-standing reliance on a single backbone.

abstract

Generalized Zero-Shot Learning (GZSL) aims to train a classifier that can generalize to unseen classes, using a set of attributes as auxiliary information, and the visual features extracted from a pre-trained convolutional neural network. While recent GZSL methods have explored various techniques to leverage the capacity of these features, there has been an extensive growth of representation learning techniques that remain under-explored. In this work, we investigate the utility of different GZSL methods when using different feature extractors, and examine how these models' pre-training objectives, datasets, and architecture design affect their feature representation ability. Our results indicate that 1) methods using generative components for GZSL provide more advantages when using recent feature extractors; 2) feature extractors pre-trained using self-supervised learning objectives and knowledge distillation provide better feature representations, increasing up to 15% performance when used with recent GZSL techniques; 3) specific feature extractors pre-trained with larger datasets do not necessarily boost the performance of GZSL methods. In addition, we investigate how GZSL methods fare against CLIP, a more recent multi-modal pre-trained model with strong zero-shot performance. We found that GZSL tasks still benefit from generative-based GZSL methods along with CLIP's internet-scale pre-training to achieve state-of-the-art performance in fine-grained datasets. We release a modular framework for analyzing representation learning issues in GZSL here: https://github.com/uvavision/TV-GZSL

citation

@article{cascantebonilla2022transferability,
  title = {On the Transferability of Visual Features in Generalized Zero-Shot Learning},
  author = {Cascante-Bonilla, Paola and Karlinsky, Leonid and Smith, James Seale and Qi, Yanjun and Ordonez, Vicente},
  year = {2022},
  journal = {arXiv preprint arXiv:2211.12494},
  url = {https://arxiv.org/abs/2211.12494},
}