Chat-crowd: A Dialog-based Platform for Visual Layout Composition

Paola Cascante-Bonilla; Xuwang Yin; Vicente Ordonez; Song Feng

publication

Chat-crowd: A Dialog-based Platform for Visual Layout Composition

Paola Cascante-Bonilla, Xuwang Yin, Vicente Ordonez, Song Feng.

North American Chapter of the Association for Computational Linguistics. NAACL 2019. System Demonstrations. Minneapolis, MN. June 2019.


Lab News Desk

News Release Summary


Researchers from the University of Virginia and IBM have built a data collection tool called Chat-crowd that lets pairs of human workers reconstruct visual layouts through back-and-forth conversation, with the goal of generating training data for AI systems that need to understand spatial language. The setup assigns one worker the role of "director," who can see a reference image containing shapes or real-world objects, and another the role of "designer," who manipulates an editable canvas based only on the director's text instructions. A notable engineering choice is that the two workers do not need to be online simultaneously: different people can pick up either role mid-conversation, which lowers the cost and complexity of crowdsourced data collection. The system also injects synthetic messages from a bot to provoke less common conversation moves, such as clarification questions, and uses those injections to quietly assess worker quality.

The researchers tested the platform on simple geometric shape layouts and on object arrangements from the COCO image dataset. Directors described objects using location, color, and shape in over 90 percent of instructions, while designers asked clarifying questions only about 40 percent of the time and usually just modified the canvas directly. More complex scenes, those with six to eight objects, required more than twice as many conversational rounds to complete as simpler ones, underscoring how scene complexity drives language demand. The work matters because datasets pairing natural language with spatial reasoning remain scarce, and Chat-crowd offers a practical, scalable way to produce them for training future vision-and-language AI systems.
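For readers who want a concrete picture of the asynchronous hand-off between director and designer described above, the following is a minimal sketch in Python. It is not the Chat-crowd implementation: the `Turn` and `Session` classes, their field names, the strict alternation of roles, and the rule that only designers change the canvas are all assumptions made for illustration. The point it shows is that a stored conversation plus a canvas snapshot is enough for a different worker to pick up the next turn.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    role: str           # "director" or "designer" (hypothetical labels)
    worker_id: str      # whoever happened to take this turn
    message: str        # the text sent in this turn
    canvas: List[dict]  # canvas snapshot after this turn


@dataclass
class Session:
    """Hypothetical record of one layout-composition conversation."""
    turns: List[Turn] = field(default_factory=list)

    def next_role(self) -> str:
        # Assumption: the director opens and the roles then alternate.
        if not self.turns:
            return "director"
        return "designer" if self.turns[-1].role == "director" else "director"

    def current_canvas(self) -> List[dict]:
        # Latest snapshot, copied so callers cannot mutate stored history.
        return list(self.turns[-1].canvas) if self.turns else []

    def submit(self, worker_id: str, role: str, message: str,
               canvas: Optional[List[dict]] = None) -> None:
        # Any worker may take the turn, as long as the role is the one due.
        if role != self.next_role():
            raise ValueError(f"expected a {self.next_role()} turn, got {role}")
        if role == "designer" and canvas is not None:
            snapshot = list(canvas)
        else:
            # Assumption: directors only talk; the canvas carries over.
            snapshot = self.current_canvas()
        self.turns.append(Turn(role, worker_id, message, snapshot))


# Three different workers contribute to the same conversation.
session = Session()
session.submit("worker-a", "director", "Put a red circle in the top left.")
session.submit("worker-b", "designer", "Done. How big should it be?",
               canvas=[{"shape": "circle", "color": "red", "x": 0, "y": 0}])
session.submit("worker-c", "director", "Medium is fine.")
```

Because each turn stores its own canvas snapshot, a worker arriving later needs only the message history and the latest snapshot to continue, which is one plausible way to support workers who are never online at the same time.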

abstract

In this paper we introduce Chat-crowd, an interactive environment for visual layout composition via conversational interactions. Chat-crowd supports multiple agents with two conversational roles: agents who play the role of a designer are in charge of placing objects in an editable canvas according to instructions or commands issued by agents with a director role. The system can be integrated with crowdsourcing platforms for both synchronous and asynchronous data collection and is equipped with comprehensive quality controls on the performance of both types of agents. We expect that this system will be useful to build multimodal goal-oriented dialog tasks that require spatial and geometric reasoning.

citation

@inproceedings{cascantebonilla2019chat,
  title = {Chat-crowd: A Dialog-based Platform for Visual Layout Composition},
  author = {Cascante-Bonilla, Paola and Yin, Xuwang and Ordonez, Vicente and Feng, Song},
  year = {2019},
  booktitle = {North American Chapter of the Association for Computational Linguistics. NAACL 2019},
  url = {https://chatcrowd.github.io/},
}