SimVQA: Exploring Simulated Environments for Visual Question Answering.
publication

SimVQA: Exploring Simulated Environments for Visual Question Answering.

Paola Cascante-Bonilla, Hui Wu, Letao Wang, Rogerio Feris, Vicente Ordonez.
Conf. on Computer Vision and Pattern Recognition CVPR 2022. New Orleans, LA.
Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers from Rice University, MIT-IBM Watson AI Lab, and the University of Virginia have found a way to use computer-generated synthetic imagery to teach visual question-answering (VQA) systems skills they struggle to learn from real-world photographs alone. The core problem the team tackled is that building large VQA datasets from real images is expensive, raises privacy concerns, and limits the variety of scenarios a model can learn from. To work around this, the researchers built two new synthetic datasets — Hypersim-VQA and ThreeDWorld-VQA — by extending an existing photorealistic 3D scene dataset and using a physics simulation platform to automatically generate images paired with question-and-answer sets covering counting, color, object existence, and spatial relationships. Their experiments showed that a VQA model trained entirely without counting questions from real data could still learn to count objects in real images when given only synthetic counting examples during training, demonstrating meaningful transfer across the significant visual gap between rendered and photographic images. The team also developed a technique called Feature Swapping (F-SWAP), which sidesteps traditional domain-adaptation approaches like adversarial training by simply swapping object-level feature representations between real and synthetic images during training. This method outperformed more complex alternatives, including adversarial domain adaptation and Maximum Mean Discrepancy alignment, while avoiding the instability associated with generative adversarial training. The work matters because it offers a relatively low-cost, privacy-safe path for expanding AI training data and suggests that synthetic environments could play a practical role in filling gaps in real-world datasets for multimodal AI systems.

abstract

Existing work on VQA explores data augmentation to achieve better generalization by perturbing the images in the dataset or modifying the existing questions and answers. While these methods exhibit good performance, the diversity of the questions and answers are constrained by the available image set. In this work we explore using synthetic computer-generated data to fully control the visual and language space, allowing us to provide more diverse scenarios. We quantify the effect of synthetic data in real-world VQA benchmarks and to which extent it produces results that generalize to real data. By exploiting 3D and physics simulation platforms, we provide a pipeline to generate synthetic data to expand and replace type-specific questions and answers without risking the exposure of sensitive or personal data that might be present in real images. We offer a comprehensive analysis while expanding existing hyper-realistic datasets to be used for VQA. We also propose Feature Swapping (F-SWAP) -- where we randomly switch object-level features during training to make a VQA model more domain invariant. We show that F-SWAP is effective for enhancing a currently existing VQA dataset of real images without compromising on the accuracy to answer existing questions in the dataset.

details

comment
Accepted to CVPR 2022. Camera-Ready version. Project page: https://simvqa.github.io/

citation

@inproceedings{cascantebonilla2022simvqa,
  title = {SimVQA: Exploring Simulated Environments for Visual Question Answering.},
  author = {Cascante-Bonilla, Paola and Wu, Hui and Wang, Letao and Feris, Rogerio and Ordonez, Vicente},
  year = {2022},
  booktitle = {Conf. on Computer Vision and Pattern Recognition CVPR 2022},
  url = {https://arxiv.org/abs/2203.17219},
}