Unsupervised Semantic Parsing of Video Collections


User-generated videos have an underlying structure: a starting point, an ending, and a number of objective steps in between. In this project, we aim to discover this underlying structure with no labels or supervision, simply by watching a large collection of videos. We accomplish this with a joint unsupervised model that uses both visual (video) and language (speech) information.

In a nutshell, our algorithm starts with a single YouTube query such as "How to make an omelette?" and downloads a large number of videos. It then discovers activities and parses each video in terms of these activities. We call the resulting joint parse a storyline. For 5 videos, it looks like the figure above. For another query, "How to make a milkshake?", the resulting storyline is below. We visualize the storylines as temporal segmentations of the videos alongside the ground-truth segmentation. We also color-code the discovered activity steps and visualize their key-frames and the automatically generated captions.
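The core idea of grouping clips from many videos into shared activity steps can be illustrated with a toy sketch. This is not the paper's actual model (which is considerably more sophisticated and multimodal); it is a minimal, self-contained illustration assuming each video has already been cut into clips with feature vectors. All names (`k_means`, `discover_storylines`, `clip features`) are illustrative, not the released code's API.

```python
# Toy sketch: discover shared "activity steps" by clustering clip features
# pooled across videos, then read off each video's storyline as its
# per-clip sequence of cluster labels. Hypothetical names throughout.

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iters=20):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep old center if the cluster went empty
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

def discover_storylines(videos, k):
    """videos: list of videos, each a list of clip feature vectors.
    Returns one label sequence (storyline) per video."""
    all_clips = [clip for video in videos for clip in video]
    labels = k_means(all_clips, k)
    storylines, i = [], 0
    for video in videos:
        storylines.append(labels[i:i + len(video)])
        i += len(video)
    return storylines
```

For example, two omelette videos whose clips happen to have similar features at the same steps would be parsed into the same label sequence, exposing the shared step structure. The real model additionally uses speech to name each step and handles videos whose steps appear in different orders or are missing.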

How Does It Work?

In this video, we show a collection of discovered activities and their descriptions. Our algorithm generates these activities and descriptions in a fully unsupervised way from a large collection of YouTube videos.

Paper, Code and Data

Unsupervised Semantic Parsing of Video Collections
Ozan Sener, Amir Zamir, Silvio Savarese, Ashutosh Saxena
In International Conference on Computer Vision (ICCV), 2015
[PDF] [Supp.PDF] [Dataset] [GitHub]

Related papers: RoboBrain@ISRR 2015 and rCRF@RSS 2015.


We also thank Bart Selman for useful discussions and Jay Hack for building ModalDB.

Technical Queries: Ozan Sener and Ashutosh Saxena