AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3
The pith
A new benchmark shows that text-to-audio-video generators produce attractive clips but fail to deliver consistent semantic accuracy in on-screen text, speech, physical logic, and musical pitch.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that existing evaluation methods for text-to-audio-video generation are too coarse or too isolated, and that a multi-granular framework applied to a new set of realistic prompts exposes a clear separation between strong perceptual quality and weak semantic reliability, with specific breakdowns in text rendering, speech coherence, physical reasoning, and musical pitch control across current systems.
What carries the argument
The multi-granular evaluation framework that pairs lightweight specialist models with multimodal large language models to score outputs from basic perceptual quality up to fine-grained semantic controllability across joint audio-video content.
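A minimal sketch, assuming a Python harness, of how such a two-tier router could look. Every function here is a hypothetical stub standing in for a real model call; none of the names come from the paper.

```python
# Hypothetical sketch of a multi-granular evaluator: lightweight specialist
# models score perceptual quality, an MLLM judge grades semantic criteria.
# All bodies are stubs; a real system would wrap actual model inference.

def specialist_visual_quality(video_path: str) -> float:
    """Stub for a lightweight perceptual scorer (e.g., an aesthetic predictor)."""
    return 0.0  # placeholder

def specialist_audio_quality(audio_path: str) -> float:
    """Stub for a lightweight audio-quality model (e.g., a MOS predictor)."""
    return 0.0  # placeholder

def mllm_judge(video_path: str, audio_path: str, prompt: str, criterion: str) -> float:
    """Stub for an MLLM asked to grade one semantic criterion of the joint clip."""
    return 0.0  # placeholder

def score_clip(video_path: str, audio_path: str, prompt: str) -> dict:
    # Coarse perceptual level: cheap specialist models.
    perceptual = {
        "visual_quality": specialist_visual_quality(video_path),
        "audio_quality": specialist_audio_quality(audio_path),
    }
    # Fine-grained semantic level: one MLLM judgment per criterion,
    # mirroring the failure modes the benchmark reports.
    semantic = {
        c: mllm_judge(video_path, audio_path, prompt, criterion=c)
        for c in ("text rendering", "speech coherence",
                  "physical reasoning", "musical pitch control")
    }
    return {"perceptual": perceptual, "semantic": semantic}
```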
If this is right
- Developers must prioritize reliable text rendering inside generated video frames as a core requirement.
- Speech coherence across audio tracks needs targeted improvements to match visual content.
- Models require better mechanisms for physical reasoning to avoid implausible object or action sequences.
- Musical pitch control must be addressed as a distinct failure mode rather than a side effect of general audio quality.
- Evaluation protocols should move beyond isolated audio or video metrics to joint semantic checks.
Where Pith is reading between the lines
- The benchmark's category-based prompt set could serve as a standard test suite for tracking progress in integrated audio-video models over time.
- Persistent failures suggest that training objectives focused mainly on aesthetic metrics may need explicit semantic alignment terms added.
- If the gap persists in newer models, it may indicate a need to revisit how audio and video streams are jointly conditioned on text prompts during generation.
Load-bearing premise
The multi-granular evaluation framework that combines specialist models with multimodal language models accurately measures fine-grained joint correctness and semantic controllability without introducing its own biases or blind spots.
What would settle it
A controlled study in which human raters score the same set of generated audio-video clips on the benchmark's semantic criteria and the scores diverge substantially from the automated framework's results.
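A hedged sketch of that settling experiment: collect human ratings and automated scores for the same clips and test how strongly they agree. The arrays below are synthetic stand-ins; only numpy and scipy.stats.spearmanr are assumed.

```python
# Toy check of human-vs-framework agreement on semantic criteria.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human = rng.uniform(1, 5, size=200)           # placeholder human semantic ratings
automated = human + rng.normal(0, 1.5, 200)   # placeholder framework scores

rho, p = spearmanr(human, automated)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
# On real data, a low rho would support the evaluator-limitations reading;
# a high rho would support the framework's reliability.
```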
Original abstract
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AVGen-Bench, a task-driven benchmark for text-to-audio-video (T2AV) generation consisting of high-quality prompts across 11 real-world categories. It proposes a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs) to assess outputs ranging from perceptual quality to fine-grained semantic controllability. The evaluation of existing T2AV models reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, with persistent failures in text rendering, speech coherence, physical reasoning, and musical pitch control.
Significance. If the multi-granular framework can be validated as reliable, the benchmark would address a clear gap in T2AV evaluation by moving beyond isolated audio/video metrics or coarse embeddings toward joint semantic correctness. This could help the community track progress on controllability issues that current generators struggle with.
major comments (1)
- [Evaluation Framework and Results] The central claim of a gap between aesthetics and semantic reliability (including failures in pitch control, speech coherence, and physical reasoning) rests on MLLM judgments for fine-grained semantic controllability. However, the manuscript provides no reported validation of these MLLM scores against human experts, objective ground truth, or inter-rater agreement metrics, nor any ablation isolating the MLLM component. This leaves open the possibility that observed failures reflect evaluator limitations rather than generator deficiencies.
minor comments (2)
- [Benchmark Construction] The abstract and high-level description mention 11 categories and the availability of code/resources, but the manuscript should include a table or appendix explicitly listing the categories, prompt counts, and example prompts to allow reproducibility.
- [Evaluation Framework] Clarify the exact division of labor between the lightweight specialist models and the MLLMs (e.g., which metrics each handles) to avoid ambiguity in the multi-granular framework description.
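One lightweight way to answer that request, sketched here as an assumption rather than the paper's documented design, is a declarative routing table from metric to evaluator type; the metric names are drawn from the abstract.

```python
# Hypothetical routing table making the division of labor explicit.
EVALUATOR_ROUTING = {
    # Perceptual metrics -> lightweight specialist models
    "visual_quality":     "specialist",
    "audio_quality":      "specialist",
    "audio_video_sync":   "specialist",
    # Fine-grained semantic metrics -> MLLM judges
    "text_rendering":     "mllm",
    "speech_coherence":   "mllm",
    "physical_reasoning": "mllm",
    "musical_pitch":      "mllm",
}
```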
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment on validation of the MLLM component below and will revise the manuscript accordingly to strengthen the evaluation framework.
Point-by-point responses
Referee: [Evaluation Framework and Results] The central claim of a gap between aesthetics and semantic reliability (including failures in pitch control, speech coherence, and physical reasoning) rests on MLLM judgments for fine-grained semantic controllability. However, the manuscript provides no reported validation of these MLLM scores against human experts, objective ground truth, or inter-rater agreement metrics, nor any ablation isolating the MLLM component. This leaves open the possibility that observed failures reflect evaluator limitations rather than generator deficiencies.
Authors: We agree that the absence of explicit validation for the MLLM judgments represents a limitation in the current manuscript. While the multi-granular framework combines specialist models for perceptual aspects with MLLMs for semantic controllability, and the MLLM prompts were designed to target specific failure modes identified in preliminary checks, we did not report human agreement or ablation studies. In the revised manuscript, we will add a human evaluation on a representative subset of prompts (e.g., 200 samples across categories), reporting inter-rater agreement metrics and correlation with MLLM scores. We will also include an ablation isolating the MLLM component to demonstrate its contribution to detecting the reported semantic gaps. This will directly address the concern that failures may reflect evaluator limitations.
Revision: yes
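A minimal sketch of the promised agreement analysis, assuming a 1-5 rating scale and two human raters; the ratings are invented placeholders, and quadratic-weighted Cohen's kappa from scikit-learn is one standard choice of metric.

```python
# Toy inter-rater agreement on ordinal 1-5 semantic ratings.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 2, 3, 1, 4, 5, 2, 3, 4]
rater_b = [5, 3, 2, 3, 2, 4, 4, 2, 3, 5]

kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa = {kappa:.2f}")  # values above ~0.6 are usually read as substantial
```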
Circularity Check
No circularity: empirical benchmark with independent evaluation framework
Full rationale
The paper proposes AVGen-Bench as a new task-driven benchmark and multi-granular evaluation framework combining specialist models with MLLMs. No derivations, equations, fitted parameters, or predictions are present that reduce to prior quantities by construction. The central claim (gap between aesthetics and semantic reliability) is an empirical observation from applying the framework to existing generators, not a self-referential result. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The evaluation methodology is presented as a novel contribution without reducing to its own inputs.
Reference graph
Works this paper leans on
- [1] OpenAI. Video generation models as world simulators. URL: https://openai.com/research/video-generation-models-as-world-simulators. Accessed 2024-02-15. OpenAI. Sora 2 System Card, 2025. URL: https://cdn.openai.com/pdf/50d5973c-c4ff-4c2d-986f-c72b5d0ff069/sora_2_system_card.pdf. Accessed 2026-01-22. OpenAI. Introducing GPT-5.2, 2026. URL: https://openai.com/index/introducing-gpt-5-2/. Accessed 2026-01...
- [2] Google Blog. Accessed 2026-03-16. Runway. Introducing Gen-3 Alpha, 2024. URL: https://runwayml.com/research/introducing-gen-3-alpha. Accessed 2026-01-23. Seedance, T., Chen, H., Chen, S., Chen, X., Chen, Y., Chen, Y., Chen, Z., Cheng, F., Cheng, T., Cheng, X., Chi, X., et al. Seedance 1.5 Pro: A native audio-visual joint generation foundation model...
- [3] Pairwise Comparison for Subjective Quality (Speech & Semantic). For dimensions where quality is often relative or nuanced, such as Speech Quality and Holistic Semantic Alignment, we utilized a Blind A/B Testing protocol (Figure 11a). Rationale: determining "which voice sounds more natural" is cognitively easier and more consistent via side-by-side comparison than absolute scoring.
- [4] Pointwise Scoring for Objective Correctness (Text Rendering). Conversely, text rendering requires an absolute assessment of legibility and spelling correctness. A pairwise comparison might result in a "Tie" if both models produce gibberish, failing to capture the absolute failure. Therefore, we adopted a Pointwise Protocol (Figure 11b). Rationale: Text qual...
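The two protocols in [3] and [4] also aggregate differently, and the toy sketch below, with invented verdicts, shows why that matters: a clip pair where both models fail yields an uninformative tie under A/B testing but still registers as a failure under pointwise scoring.

```python
# Toy aggregation: pairwise win rate vs. pointwise accuracy.
ab_outcomes = ["A", "B", "A", "tie", "A", "B", "A"]  # blind A/B verdicts per clip pair
pointwise = [1, 0, 0, 1, 0]                          # 1 = legible text, 0 = gibberish

decided = [o for o in ab_outcomes if o != "tie"]
win_rate_a = decided.count("A") / len(decided)       # ties drop out of the win rate
text_accuracy = sum(pointwise) / len(pointwise)      # absolute failures still count

print(f"Model A win rate: {win_rate_a:.2f}, text accuracy: {text_accuracy:.2f}")
```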