Abstract

Perception Test is a novel multimodal benchmark that evaluates the perception and reasoning skills of pre-trained multimodal models (e.g., Gemini, GPT-4V) across video, audio, and text modalities, focusing on four skill areas (Memory, Abstraction, Physics, Semantics) and four types of reasoning (descriptive, explanatory, predictive, counterfactual). The benchmark probes pre-trained models for their generalization capabilities in a zero-shot, few-shot, or limited fine-tuning regime. Perception Test introduces 11.6k real-world videos, up to 35s long, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), and a challenge server hosts a held-out test split. Human baseline results compared to state-of-the-art video QA models show a substantial gap in performance (91.4% vs 55%), suggesting significant room for improvement in multimodal video understanding.

Dataset, baseline code, and challenge server are available at https://github.com/google-deepmind/perception_test
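For illustration, below is a minimal sketch of how multiple-choice video QA accuracy (the metric behind the 91.4% human vs 55% model comparison above) might be computed against the released annotations. The file names and the JSON layout used here (a mapping from video id to a list of questions with a ground-truth `answer_id`, plus per-video lists of predicted option indices) are assumptions made for this sketch, not the repository's actual format.

```python
import json


def mc_qa_accuracy(annotations_path: str, predictions_path: str) -> float:
    """Top-1 accuracy for multiple-choice video QA.

    Assumed (hypothetical) layout:
      annotations: {video_id: [{"answer_id": int, ...}, ...], ...}
      predictions: {video_id: [chosen_option_index, ...], ...}
    Questions with no prediction are counted as incorrect.
    """
    with open(annotations_path) as f:
        annotations = json.load(f)
    with open(predictions_path) as f:
        predictions = json.load(f)

    correct, total = 0, 0
    for video_id, questions in annotations.items():
        predicted = predictions.get(video_id, [])
        for i, question in enumerate(questions):
            total += 1
            if i < len(predicted) and predicted[i] == question["answer_id"]:
                correct += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # File names are placeholders for this sketch.
    acc = mc_qa_accuracy("mc_question_valid_annotations.json",
                         "my_model_predictions.json")
    print(f"Multiple-choice video QA accuracy: {acc:.1%}")
```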