AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning

Pre-Print



1University of Maryland, 2Adobe Research

TL;DR

We introduce AURA, a benchmark that forces models to reason over both audio and video, and AuraScore, a metric that evaluates whether that reasoning is factually grounded and logically sound. Our findings reveal a major "reasoning gap": even the best models achieve high answer accuracy (~92%) yet fail on our reasoning evaluation (~45%).

We present AURA, the first question-answering (QA) benchmark designed to evaluate state-of-the-art Audio-Visual Large Language Models (AV-LLMs) and Omni-Modal Language Models (OLMs) on fine-grained cognitive tasks, including Cross-Modal Causal Reasoning, Timbre/Pitch Reasoning, Tempo/AV Synchronization Analysis, Unanswerability, Implicit Distractions, and Performer Skill Profiling.

Abstract

Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through fallacies or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions spanning six challenging cognitive domains (causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling), each explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. It decomposes reasoning into two aspects: (i) Factual Consistency, whether the reasoning is grounded in perceptual evidence, and (ii) Core Inference, the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy shows that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.
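To make the decomposition concrete, below is a minimal sketch of how an AuraScore-style evaluation could be wired up. This is not the released implementation: judge_fcs and judge_cis are hypothetical placeholders standing in for an LLM judge that scores each reasoning step for grounding in the AV evidence and for logical validity, and the per-step scores are simply averaged.

from typing import Callable, List

def aura_score(
    steps: List[str],                              # model's reasoning trace, split into steps
    evidence: str,                                 # ground-truth AV captions / transcripts
    judge_fcs: Callable[[str, str], float],        # (step, evidence) -> score in [0, 1]
    judge_cis: Callable[[List[str], int], float],  # (trace, step index) -> score in [0, 1]
) -> dict:
    """Average per-step judgments into Factual Consistency (FCS) and Core Inference (CIS)."""
    if not steps:
        return {"FCS": 0.0, "CIS": 0.0}
    fcs = sum(judge_fcs(step, evidence) for step in steps) / len(steps)
    cis = sum(judge_cis(steps, i) for i in range(len(steps))) / len(steps)
    return {"FCS": fcs, "CIS": cis}

# Toy usage with stand-in judges (a real setup would call an LLM judge here).
trace = [
    "The drummer switches to brushes at 0:42.",
    "The audio becomes noticeably softer at the same moment.",
    "So the drop in volume is caused by the change of sticks, not by the mix.",
]
print(aura_score(trace, "caption + transcript text",
                 lambda step, ev: 1.0, lambda t, i: 1.0))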

AURA Benchmark Statistics


Dataset statistics of AURA. The plots above show the data distribution in AURA. (a) The number of MCQ questions in each QA/task category. (b) Percentage distribution of video samples by duration. (c) Distribution of video samples by source of origin.

Automatic QA Generation Pipeline


The automated QA generation pipeline for the AURA benchmark. The process begins with (1) Captioning and Transcription, where an input video is decomposed into text descriptions using specialized models. These multi-modal annotations are then inserted into a (2) Prompt Template for GPT-4o. The complete prompt is then used to (3) Generate a structured JSON output containing the question, multiple-choice options, the correct answer, and a "gold standard" reasoning trace.
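A minimal sketch of this three-stage pipeline is shown below, assuming the OpenAI chat completions API with JSON-mode output for stage (3); caption_video, transcribe_audio, and the prompt wording are hypothetical placeholders rather than the exact models and template used to build AURA.

import json
from openai import OpenAI

PROMPT_TEMPLATE = """You are given dense annotations of a video.
Visual caption: {caption}
Audio transcript / description: {transcript}

Write one multiple-choice question that can only be answered by combining the
audio and visual evidence. Return JSON with the keys "question", "options",
"answer", and "reasoning_trace"."""

def caption_video(video_path: str) -> str:
    # Placeholder for a specialized video-captioning model.
    raise NotImplementedError

def transcribe_audio(video_path: str) -> str:
    # Placeholder for a specialized ASR / audio-captioning model.
    raise NotImplementedError

def generate_qa(video_path: str, client: OpenAI) -> dict:
    # (1) Captioning and transcription, (2) prompt assembly, (3) structured generation.
    prompt = PROMPT_TEMPLATE.format(
        caption=caption_video(video_path),
        transcript=transcribe_audio(video_path),
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request structured JSON output
    )
    return json.loads(response.choices[0].message.content)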

Quantitative Results


Evaluation scores for Omni-modal LLMs. We report ACC (Accuracy) alongside the two components of our proposed AuraScore, FCS (Factual Consistency Score) and CIS (Core Inference Score), providing insight into the performance of SOTA models across the tasks in AURA. Closed-source models are listed at the bottom.

Qualitative Results


Qualitative results on tasks. We show qualitative results on our six tasks: CR (Cross-Modal Causal Reasoning), TPR (Timbre/Pitch Reasoning), TSA (Tempo/AV Synchronization Analysis), UANS (Unanswerability), ID (Implicit Distractions), and PSP (Performer Skill Profiling).

BibTeX