preprint

Generative Visual Instruction Tuning

Jefferson Hernandez, Ruben Villegas, Vicente Ordonez.
arXiv:2406.11262, June 2024.

abstract

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model, with additional support for generative and image-editing tasks. We achieve this by curating a new multimodal instruction-following dataset using GPT-4V and existing datasets for image generation and editing. Using this instruction set together with the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models via instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and Stable Diffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and achieves results competitive with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively reusing existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
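
One plausible way to wire the three pretrained models together, and the one this sketch assumes, is a LLaVA-style connector that projects SigLIP image features into the language model's embedding space, plus a second projection that maps the LM's hidden states at designated generation tokens into the conditioning space of a diffusion decoder. The PyTorch sketch below is a hypothetical illustration of that composition, not the paper's released code: GenLLaVASketch, the feature dimensions, and the gen_token_mask wiring are all assumptions.

import torch
import torch.nn as nn

VISION_DIM, LM_DIM, COND_DIM = 1152, 4096, 768  # assumed feature sizes

class GenLLaVASketch(nn.Module):
    def __init__(self, vision_encoder, language_model, image_decoder):
        super().__init__()
        self.vision_encoder = vision_encoder  # stands in for SigLIP
        self.language_model = language_model  # stands in for Mistral
        self.image_decoder = image_decoder    # stands in for Stable Diffusion
        # LLaVA-style connector: vision features -> LM embedding space.
        self.in_proj = nn.Linear(VISION_DIM, LM_DIM)
        # LM hidden states -> diffusion conditioning space.
        self.out_proj = nn.Linear(LM_DIM, COND_DIM)

    def forward(self, pixel_values, text_embeds, gen_token_mask):
        img_feats = self.vision_encoder(pixel_values)      # (B, P, VISION_DIM)
        img_embeds = self.in_proj(img_feats)               # (B, P, LM_DIM)
        seq = torch.cat([img_embeds, text_embeds], dim=1)  # visual tokens first
        hidden = self.language_model(seq)                  # (B, P+T, LM_DIM)
        # Hidden states at generation tokens condition the image decoder.
        cond = self.out_proj(hidden[gen_token_mask])       # (N_gen, COND_DIM)
        image = self.image_decoder(cond)                   # generated/edited image
        return hidden, image

if __name__ == "__main__":
    B, P, T = 2, 16, 8
    model = GenLLaVASketch(
        vision_encoder=nn.Linear(32, VISION_DIM),  # toy stand-in for SigLIP
        language_model=nn.Linear(LM_DIM, LM_DIM),  # toy stand-in for Mistral
        image_decoder=nn.Linear(COND_DIM, 64),     # toy stand-in for Stable Diffusion
    )
    pixels = torch.randn(B, P, 32)                 # pre-patchified image features
    text = torch.randn(B, T, LM_DIM)               # token embeddings from the LM
    mask = torch.zeros(B, P + T, dtype=torch.bool)
    mask[:, -1] = True                             # last position acts as a <gen> token
    hidden, image = model(pixels, text, mask)
    print(hidden.shape, image.shape)               # (2, 24, 4096) and (2, 64)

The toy linear layers in the __main__ block only verify that the shapes compose; in the real system each would be replaced by the corresponding pretrained model and finetuned jointly on the instruction data.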

details

comment
Added more results using task tokens; expanded the introduction and related work. Fix: corrected an error in the LLM-as-judge evaluation that was over-inflating the results.

citation

@article{hernandez2024generative,
  title = {Generative Visual Instruction Tuning},
  author = {Hernandez, Jefferson and Villegas, Ruben and Ordonez, Vicente},
  year = {2024},
  journal = {arXiv preprint arXiv:2406.11262},
  url = {https://arxiv.org/abs/2406.11262},
}