Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance

Jaywon Koo; Jefferson Hernandez; Moayed Haji-Ali; Ziyan Yang; Vicente Ordonez

publication

IEEE Winter Conference on Applications of Computer Vision. WACV 2026. Tucson, AZ.

Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers at Rice University have developed a new metric called cFreD (conditional Fréchet Distance) to better evaluate AI systems that generate images from text descriptions. Current evaluation methods struggle because they either measure image quality while ignoring how well the image matches the text prompt, or measure prompt match while ignoring image quality. The team's approach combines both assessments into a single score by incorporating the text prompt directly into the distance calculation. Testing across multiple datasets showed cFreD correlates much more strongly with human judgments than existing metrics like FID and CLIPScore, achieving up to 97% correlation in some cases. The researchers released their evaluation toolkit as open-source software, potentially providing the AI community with a more reliable way to benchmark text-to-image generation models without requiring expensive human evaluations.

abstract

Evaluating text-to-image and text-to-video models is challenging due to a fundamental disconnect: established metrics fail to jointly measure visual quality and semantic alignment with text, leading to a poor correlation with human judgments. To address this critical issue, we propose cFreD, a general metric based on a Conditional Fréchet Distance that unifies the assessment of visual fidelity and text-prompt consistency into a single score. Existing metrics such as Fréchet Inception Distance (FID) capture image quality but ignore text conditioning, while alignment scores such as CLIPScore are insensitive to visual quality. Furthermore, learned preference models require constant retraining and are unlikely to generalize to novel architectures or out-of-distribution prompts. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-conditioned models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark.

details

comment: Added new video experiments and more image experiments to validate the method

citation

@inproceedings{koo2026evaluating,
  title = {Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance},
  author = {Koo, Jaywon and Hernandez, Jefferson and Haji-Ali, Moayed and Yang, Ziyan and Ordonez, Vicente},
  year = {2026},
  booktitle = {IEEE Winter Conference on Applications of Computer Vision. WACV 2026},
  url = {https://arxiv.org/abs/2503.21721},
}

automatically generated questions, main contributions and limitations of this paper

Questions this paper helps answer

What is cFreD and what problem does it address? cFreD is a Conditional Fréchet Distance metric designed to evaluate text-conditioned generation by measuring both visual fidelity and alignment with the input prompt.
Why are FID and CLIPScore insufficient for text-to-image evaluation? FID can reward realistic image distributions even when images do not match their prompts, while CLIPScore focuses on image-text similarity without fully capturing visual quality.
How well does cFreD correlate with human preferences for text-to-image generation? Across HPDv2, Gen-AI Bench, PartiPrompts, and COCO evaluations, cFreD achieves the strongest average correlation and rank accuracy among the statistical metrics compared in the paper.
Does cFreD extend beyond text-to-image generation? Yes, the paper applies the same conditional formulation to text-to-video evaluation and reports the highest average rank accuracy across T2VQA-DB and EvalCrafter among the statistical metrics tested.
What makes cFreD practical for future benchmarks? It requires no human-preference training, can use modern vision and text encoders, and is released as an open-source toolkit, making it a plug-and-play evaluation option for new text-conditioned generative models.
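
To make the idea of a conditional Fréchet distance concrete, here is a minimal NumPy sketch. It assumes prompt features and image features are jointly Gaussian and uses the standard closed form for the expected Fréchet distance between the two conditional Gaussians. It is not the authors' released toolkit: the function names, the `eps` regularizer, and the synthetic features are illustrative assumptions, and in practice the features would come from text and image encoders.

```python
import numpy as np


def _psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    sym = (mat + mat.T) / 2.0
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)  # clamp tiny negative eigenvalues from round-off
    return (vecs * np.sqrt(vals)) @ vecs.T


def conditional_frechet_distance(x, y_ref, y_gen, eps=1e-6):
    """Sketch of a conditional Frechet distance (hypothetical helper).

    x:     (n, dx) prompt features
    y_ref: (n, dy) reference image features, row i paired with x[i]
    y_gen: (n, dy) generated image features, row i paired with x[i]
    eps:   assumed ridge term that keeps the prompt covariance invertible
    """
    n = x.shape[0]
    mu_x, mu_r, mu_g = x.mean(axis=0), y_ref.mean(axis=0), y_gen.mean(axis=0)
    xc, rc, gc = x - mu_x, y_ref - mu_r, y_gen - mu_g

    # Unbiased covariance and cross-covariance estimates.
    cov_xx = xc.T @ xc / (n - 1) + eps * np.eye(x.shape[1])
    cov_rr = rc.T @ rc / (n - 1)
    cov_gg = gc.T @ gc / (n - 1)
    cov_rx = rc.T @ xc / (n - 1)
    cov_gx = gc.T @ xc / (n - 1)

    # Conditional covariances of image features given prompt features.
    inv_xx = np.linalg.inv(cov_xx)
    cond_r = cov_rr - cov_rx @ inv_xx @ cov_rx.T
    cond_g = cov_gg - cov_gx @ inv_xx @ cov_gx.T

    # Difference of marginal means.
    mean_term = np.sum((mu_r - mu_g) ** 2)
    # Difference in how image features co-vary with the prompt.
    diff = cov_rx - cov_gx
    cross_term = np.trace(diff @ inv_xx @ diff.T)
    # Frechet (2-Wasserstein) term between the two conditional covariances.
    sqrt_r = _psd_sqrt(cond_r)
    cov_term = (
        np.trace(cond_r)
        + np.trace(cond_g)
        - 2.0 * np.trace(_psd_sqrt(sqrt_r @ cond_g @ sqrt_r))
    )
    return float(mean_term + cross_term + cov_term)


# Synthetic demo: image features that depend linearly on prompt features.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(500, 4))
real = prompts @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(500, 3))
shuffled = real[rng.permutation(500)]  # same images, wrong prompts

print("matched pairs: ", conditional_frechet_distance(prompts, real, real))
print("shuffled pairs:", conditional_frechet_distance(prompts, real, shuffled))
```

In the demo, the shuffled set contains exactly the same image features as the reference set, so any metric that looks only at image statistics cannot tell the two apart. The conditional distance grows because the pairing between prompts and images has been broken.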

Main contributions

The paper adapts Conditional Fréchet Distance to text-to-image and text-to-video synthesis, giving the community a unified statistical metric that accounts for conditioning information.
cFreD consistently outperforms FID, CLIPScore, CMMD, and FD-DINOv2 in average human-preference correlation and rank accuracy across the paper's text-to-image benchmark suite.
The text-to-video results show that cFreD generalizes to temporal generation, matching or exceeding established video metrics in rank accuracy without requiring task-specific human-preference training.
Robustness experiments show that cFreD responds sensibly to image corruptions and text perturbations, while FID can miss prompt-image misalignment because it observes only image statistics.
The paper includes a broad backbone analysis showing that modern transformer-based encoders improve alignment with human judgments and that InceptionV3 is no longer the best default choice for this kind of evaluation.
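
Rank accuracy, mentioned in several of the points above, can be illustrated with a short sketch. This version uses a common pairwise definition: the fraction of model pairs that a metric orders the same way as human raters. The paper's exact definition may differ, and the function name, model names, and scores below are hypothetical.

```python
from itertools import combinations


def rank_accuracy(metric_scores, human_scores, lower_is_better=True):
    """Fraction of model pairs ordered the same way by a metric and by humans.

    metric_scores: dict mapping model name to metric value
    human_scores:  dict mapping model name to human preference (higher is better)
    lower_is_better: True for distance-style metrics such as FID or cFreD
    """
    models = sorted(set(metric_scores) & set(human_scores))
    agree, total = 0, 0
    for a, b in combinations(models, 2):
        human_diff = human_scores[a] - human_scores[b]
        if human_diff == 0:
            continue  # skip pairs that humans rate as tied
        metric_diff = metric_scores[a] - metric_scores[b]
        if lower_is_better:
            metric_diff = -metric_diff  # flip so that larger means better
        total += 1
        if metric_diff * human_diff > 0:
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical scores for three hypothetical models.
distance = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
human = {"model_a": 0.9, "model_b": 0.5, "model_c": 0.7}
print(rank_accuracy(distance, human))
```

Here the metric agrees with humans on two of the three pairs and disagrees on the pair (`model_b`, `model_c`), so the pairwise rank accuracy is 2/3.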

Limitations and cautions

cFreD is still a statistical proxy for human judgment rather than a replacement for carefully designed human studies, but its strong rank accuracy makes it a valuable scalable screening tool when human evaluation is costly.
The metric depends on the choice of image and text encoders, so future work can continue improving cFreD as stronger multimodal backbones become available; the paper's ablations already provide useful guidance for selecting those encoders.
The reported evaluations focus on available image and video preference datasets, leaving specialized domains such as medical, satellite, and scientific imagery as promising next areas to validate the same conditional formulation.
cFreD summarizes distribution-level behavior rather than providing detailed per-sample explanations of every failure, which makes it best suited for benchmark-level comparison while complementary diagnostic tools can inspect individual examples.
The formulation assumes useful paired conditioning information, so extensions to multi-condition settings such as ControlNet or audio-to-video generation are natural follow-up directions; the paper explicitly points to this broader applicability.

How to read this result

This paper is best read as a strong practical contribution to generative-model evaluation: cFreD preserves the simplicity and scalability of statistical metrics while tracking human judgments far more closely on whether generated images and videos are both high quality and faithful to their prompts.