Text2Scene
Text2Scene was proposed by our group in the CVPR 2019 paper Text2Scene: Generating Compositional Scenes from Textual Descriptions. The model takes a textual description of a scene as input and generates the scene graphically, one object at a time, using a recurrent neural network, highlighting the ability of such networks to learn complex and seemingly non-sequential tasks. A more advanced, more compute-intensive version of the model produces real images by stitching together segments taken from images in the COCO dataset. Read more about Text2Scene in the research blogs of IBM and NVIDIA, and download the full source code from https://github.com/uvavision/Text2Scene. This demo generates cartoon-like images using the vocabulary and graphics from the Abstract Scenes dataset proposed by Zitnick and Parikh in 2013.
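To make the object-by-object generation described above concrete, here is a minimal, self-contained PyTorch sketch of that kind of decoding loop: a text encoder summarizes the sentence, and a recurrent decoder emits one object per step (its category and a coarse grid position) until it predicts a stop token. This is an illustrative sketch only, not the released Text2Scene code; all class and function names (ToySceneDecoder, generate), layer sizes, vocabulary counts, and the greedy decoding scheme are assumptions, and with untrained weights the output is arbitrary, so only the control flow mirrors the description.

import torch
import torch.nn as nn

VOCAB_SIZE = 1000      # size of the word vocabulary (illustrative)
NUM_OBJECTS = 32 + 1   # clip-art object categories + a stop token (illustrative)
NUM_CELLS = 28 * 28    # coarse spatial grid for object placement (illustrative)
HIDDEN = 256

class ToySceneDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        self.text_enc = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.obj_emb = nn.Embedding(NUM_OBJECTS, HIDDEN)
        self.decoder = nn.GRUCell(HIDDEN, HIDDEN)
        self.obj_head = nn.Linear(HIDDEN, NUM_OBJECTS)   # what to draw next
        self.pos_head = nn.Linear(HIDDEN, NUM_CELLS)     # where to draw it

    @torch.no_grad()
    def generate(self, word_ids, max_steps=10):
        # Encode the input sentence into a single hidden state.
        _, h = self.text_enc(self.word_emb(word_ids))
        h = h.squeeze(0)                                  # (batch, HIDDEN)
        prev = torch.zeros(word_ids.size(0), dtype=torch.long)  # "start" object id
        scene = []
        for _ in range(max_steps):
            # One recurrent step per object: update the state, then pick
            # the next object category and a grid cell to place it in.
            h = self.decoder(self.obj_emb(prev), h)
            obj = self.obj_head(h).argmax(-1)
            pos = self.pos_head(h).argmax(-1)
            if obj.item() == NUM_OBJECTS - 1:             # stop token ends the scene
                break
            scene.append((obj.item(), pos.item()))
            prev = obj
        return scene

model = ToySceneDecoder()
sentence = torch.randint(0, VOCAB_SIZE, (1, 12))          # a fake tokenized sentence
print(model.generate(sentence))                            # list of (object, cell) pairs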