Moviescope: Large-scale Analysis of Movies using Multiple Modalities
publication

Moviescope: Large-scale Analysis of Movies using Multiple Modalities

Paola Cascante-Bonilla, Kalpathy Sitaraman, Mengjia Luo, Vicente Ordonez.
arXiv:1908.03180. August 2019.
Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers at the University of Virginia and Microsoft have released Moviescope, a dataset of 5,000 movies pairing video trailers, audio, movie posters, text plot summaries, and metadata pulled from sources including YouTube, Wikipedia, and IMDb, in order to systematically test how well different types of data can predict high-level movie attributes like genre and production budget. The team found that simple averaging operations over word or frame embeddings — methods they call fastText and fastVideo — consistently outperformed more computationally expensive approaches like LSTM recurrent networks and action-recognition models designed for short clips, suggesting that for holistic movie-level classification, preserving temporal order matters less than researchers might expect. Text-based plot summaries turned out to be the strongest single predictor of genre, edging out video and even structured metadata, while audio proved surprisingly useful for estimating budget — outperforming the video signal from the same trailer. A human study using Amazon Mechanical Turk showed that people performed only marginally better than the models, with humans doing best when reading plot text and struggling most with raw video frames. Combining all five modalities together yielded the best overall results, confirming that each data type captures something the others miss. The work matters because most existing video datasets focus on short, isolated action clips, whereas Moviescope is designed for the kind of long-range, narrative-level understanding that movies demand, and the authors are releasing their dataset, pretrained embeddings, and code to give other researchers a practical benchmark for multimodal video analysis.

abstract

Film media is a rich form of artistic expression. Unlike photography, and short videos, movies contain a storyline that is deliberately complex and intricate in order to engage its audience. In this paper we present a large scale study comparing the effectiveness of visual, audio, text, and metadata-based features for predicting high-level information about movies such as their genre or estimated budget. We demonstrate the usefulness of content-based methods in this domain in contrast to human-based and metadata-based predictions in the era of deep learning. Additionally, we provide a comprehensive study of temporal feature aggregation methods for representing video and text and find that simple pooling operations are effective in this domain. We also show to what extent different modalities are complementary to each other. To this end, we also introduce Moviescope, a new large-scale dataset of 5,000 movies with corresponding movie trailers (video + audio), movie posters (images), movie plots (text), and metadata.

citation

@article{cascantebonilla2019moviescope,
  title = {Moviescope: Large-scale Analysis of Movies using Multiple Modalities},
  author = {Cascante-Bonilla, Paola and Sitaraman, Kalpathy and Luo, Mengjia and Ordonez, Vicente},
  year = {2019},
  journal = {arXiv preprint arXiv:1908.03180},
  url = {https://arxiv.org/abs/1908.03180},
}