Improving Large Vision and Language Models by Learning from a Panel of Peers
News Release Summary
Researchers from Rice University and Adobe Research have developed a training technique for AI vision-language models that sidesteps the expensive, time-consuming collection of human-labeled feedback data. The system, called Panel-of-Peers (PoP), assembles a small group of similar AI models (in this case, three variants of the LLaVA model built on different underlying language models) and has them grade each other's answers to visual questions, rather than relying on human annotators or a single more powerful "teacher" model. Each model in the group generates candidate responses to image-question pairs drawn from an unlabeled dataset, and the other models score those responses along dimensions such as helpfulness and correctness. The resulting ranked pairs are used to fine-tune all the models in the group, and the whole loop is repeated three times.
Across 15 standard vision-language benchmarks, covering tasks from chart reading and OCR to math reasoning and hallucination detection, the approach lifted the panel's average score from 48% to 57%. The researchers note that this gain exceeds what comparable methods using human-curated or machine-generated preference data have achieved at similar data scales. The team also showed that a model deliberately crippled by removing most of its OCR training data could recover that capability by learning from peers that retained it, which suggests the framework can transfer specific skills between models, not just improve general performance.
The work matters because producing human preference data for multimodal AI remains costly and difficult to scale, and self-improvement methods in which a single model evaluates its own outputs tend to reinforce existing errors. Having a diverse group of roughly equal-strength models cross-evaluate each other appears to reduce that problem without requiring a much larger, more expensive frontier model as a supervisor.
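The peer-grading loop described in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `Peer`, `PreferencePair`, `build_preference_pairs`, `run_rounds`, `finetune`, and `min_margin` are hypothetical, and averaging the peers' scores and pairing the best-rated answer against the worst-rated one are assumptions, since the summary does not say how scores are aggregated or how pairs are selected. The fine-tuning step is left as a caller-supplied function because the training objective is not described here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: a real LVLM would take an image, not an image id.
Generate = Callable[[str, str], str]        # (image_id, question) -> response
Score = Callable[[str, str, str], float]    # (image_id, question, response) -> score


@dataclass
class Peer:
    name: str
    generate: Generate
    score: Score


@dataclass
class PreferencePair:
    image_id: str
    question: str
    chosen: str
    rejected: str


def build_preference_pairs(
    panel: Sequence[Peer],
    prompts: Sequence[Tuple[str, str]],
    min_margin: float = 0.0,
) -> List[PreferencePair]:
    """Have every peer answer each prompt and let the *other* peers grade it."""
    if len(panel) < 2:
        raise ValueError("a panel needs at least two peers")
    pairs: List[PreferencePair] = []
    for image_id, question in prompts:
        scored = []
        for author in panel:
            response = author.generate(image_id, question)
            # Assumption: the author does not grade its own answer, and the
            # panel score is the mean of the remaining peers' scores.
            graders = [p for p in panel if p.name != author.name]
            panel_score = mean(p.score(image_id, question, response) for p in graders)
            scored.append((panel_score, response))
        scored.sort(key=lambda item: item[0], reverse=True)
        (best_score, best), (worst_score, worst) = scored[0], scored[-1]
        # Assumption: keep only pairs whose score gap exceeds min_margin,
        # so ties do not produce a preference pair.
        if best_score - worst_score > min_margin:
            pairs.append(PreferencePair(image_id, question, best, worst))
    return pairs


def run_rounds(
    panel: Sequence[Peer],
    prompts: Sequence[Tuple[str, str]],
    finetune: Callable[[Sequence[Peer], List[PreferencePair]], Sequence[Peer]],
    rounds: int = 3,
) -> Sequence[Peer]:
    """Repeat generate -> grade -> fine-tune; the summary reports three rounds."""
    for _ in range(rounds):
        pairs = build_preference_pairs(panel, prompts)
        panel = finetune(panel, pairs)  # hypothetical training step
    return panel
```

In this sketch each round rebuilds the preference pairs from the panel returned by the previous round, which is one way to read "iterative loop": the models that generate and grade in round two are the ones fine-tuned in round one.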
Abstract
Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and self-supervised preference data often introduces hallucinations. To overcome these limitations, we propose a novel Panel-of-Peers learning framework inspired by collaborative learning among humans. This approach leverages a panel of LVLMs, each evaluating and learning from their collective outputs through an iterative self-improvement process. By simulating a peer review system, our models generate, assess, and refine outputs in response to a curated set of prompts, mimicking a classroom learning environment. We demonstrate that this methodology enhances model performance without requiring extensive human-labeled datasets. Our experiments show significant improvement across multiple benchmarks, demonstrating the potential of peer evaluations as a scalable alternative to self-supervised alignment. Notably, we show that Panel-of-Peers increases the average score on fifteen benchmarks from 48% to 57%.
Details
Citation
@inproceedings{hernandez2025improving,
title = {Improving Large Vision and Language Models by Learning from a Panel of Peers},
author = {Hernandez, Jefferson and Shi, Jing and Jenni, Simon and Ordonez, Vicente and Kafle, Kushal},
year = {2025},
booktitle = {International Conference on Computer Vision. ICCV 2025},
url = {https://arxiv.org/abs/2509.01610},
}