For decades, multimedia researchers have mainly evaluated visual systems on a set of application-driven tasks, such as cross-modal retrieval and concept annotation. Although recent advances in computer vision have effectively boosted the performance of visual systems on these tasks, a core question still cannot be explicitly answered: does the machine understand what is happening in a video, and can the results of its analysis be interpreted by human users? Another way to frame this limitation is to ask how many facts the machine can recognize from a video.
This new grand challenge encourages researchers to explore a key aspect of recognizing facts from a video: relation understanding. In many AI and knowledge-based systems, a fact is represented as a relation between a subject entity and an object entity, i.e. a <subject,predicate,object> triplet, which forms the fundamental building block for high-level inference and decision-making tasks. The challenge is based on a large-scale user-generated video dataset with densely annotated objects and relations. We announce three pivotal tasks, video object detection, action detection, and visual relation detection, to push the limits of relation understanding.
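The triplet representation described above can be sketched as a simple record. A minimal illustration in Python follows; the category names ("person", "ride", "bicycle") are illustrative examples, not an excerpt of the challenge vocabulary.

```python
from collections import namedtuple

# A fact is a relation between a subject entity and an object entity,
# expressed as a <subject, predicate, object> triplet.
RelationTriplet = namedtuple("RelationTriplet", ["subject", "predicate", "object"])

# Example fact (hypothetical category names): a person riding a bicycle.
fact = RelationTriplet(subject="person", predicate="ride", object="bicycle")
```

Treating facts as immutable triplets like this makes them easy to count, compare, and match against ground-truth annotations during evaluation.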
As the first step in relation understanding, this task is to detect objects of certain categories and spatio-temporally localize each detected object with a bounding-box trajectory in videos. For each object category, we compute Average Precision (AP) to evaluate detection performance, and rank submissions by the mean AP over all categories.
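The exact matching protocol between predicted and ground-truth trajectories is not spelled out here. A common choice for trajectory-level overlap is voluminal IoU (vIoU), which extends box IoU over time; the following is a minimal sketch under that assumption, representing a trajectory as a dict mapping frame index to an (x1, y1, x2, y2) box.

```python
def _area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes yield 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def _inter(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih

def trajectory_viou(ta, tb):
    """Voluminal IoU of two trajectories (dicts: frame index -> box).

    Sums per-frame box intersections over frames where both trajectories
    exist, divided by the per-frame union summed over frames where either
    trajectory exists, so temporal misalignment lowers the score.
    """
    inter = union = 0.0
    for f in set(ta) | set(tb):
        i = _inter(ta[f], tb[f]) if f in ta and f in tb else 0.0
        a = _area(ta[f]) if f in ta else 0.0
        b = _area(tb[f]) if f in tb else 0.0
        inter += i
        union += a + b - i
    return inter / union if union > 0 else 0.0
```

A detection would then typically count as a true positive when its vIoU with an unmatched ground-truth trajectory of the same category exceeds a threshold such as 0.5.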
Actions are another important semantic element of videos. This task is to detect actions of certain categories and spatio-temporally localize the subject of each detected action with a bounding-box trajectory. For each action category, we compute AP to evaluate detection performance, and rank submissions by the mean AP over all categories.
Beyond recognizing objects and actions individually, this task is to detect relation triplets (i.e. <subject,predicate,object>) of interest and spatio-temporally localize both the subject and the object of each detected triplet with bounding-box trajectories. The predicate categories include spatial relations in addition to actions. For each testing video, we compute AP to evaluate detection performance, and rank submissions by the mean AP over all testing videos.
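All three tasks rank submissions by AP. The standard uninterpolated AP over a ranked detection list can be sketched as follows, assuming each detection has already been matched (e.g. greedily, by trajectory overlap) against ground truth; the matching step itself is outside this sketch.

```python
def average_precision(ranked_correct, num_gt):
    """Uninterpolated AP of a ranked detection list.

    ranked_correct: booleans for detections sorted by descending
        confidence; True where the detection matched a previously
        unmatched ground-truth instance.
    num_gt: total number of ground-truth instances, so that missed
        ground truth still penalizes the score.
    """
    if num_gt <= 0:
        return 0.0
    tp = 0
    precision_sum = 0.0
    for rank, correct in enumerate(ranked_correct, start=1):
        if correct:
            tp += 1
            precision_sum += tp / rank  # precision at each recall point
    return precision_sum / num_gt
```

Mean AP is then the unweighted mean of this quantity over categories (Tasks 1 and 2) or over testing videos (Task 3).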
The challenge is a team-based contest. Each team can have one or more members, and an individual cannot be a member of multiple teams. At the end of the challenge, all teams will be ranked based on the objective evaluation above. The top three performing teams will receive award certificates. At the same time, all accepted submissions are eligible for the conference’s Grand Challenge Award competition.
Organization team: Xindi Shang, Donglin Di, Junbin Xiao and Tat-Seng Chua.
For general information or inquiry about the challenge, please contact: