citation dossier
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
why this work matters in Pith
Pith has found this work cited in 19 reviewed papers. Its strongest current cluster is cs.SD (8 papers), and the largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
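A minimal sketch of this bounded-explorer behavior, assuming a simple paginated slice over an in-memory list; the page size, function name, and return shape are illustrative guesses, not Pith's actual implementation.

PAGE_SIZE = 20  # assumed per-view bound; Pith's real limit is not stated

def explorer_page(citing_papers, page=0, page_size=PAGE_SIZE):
    """Return one bounded slice of citing papers plus paging metadata,
    so the explorer never renders every citing paper at once."""
    start = page * page_size
    chunk = citing_papers[start:start + page_size]
    return {
        "papers": chunk,
        "page": page,
        "has_more": start + page_size < len(citing_papers),
    }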
representative citing papers
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs, raising jailbreak success rates to as high as 87%; the dominant risk axis depends on model architecture and embedding proximity to harmful content.
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
In zero-shot use, MLLMs recognize seizure semiological features better than fine-tuned vision models on most tested features, benefit from signal enhancement, and provide faithful explanations.
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models (a rough sketch of the pruning idea follows this list).
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while preserving general task performance.
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.
A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.
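For the HeadRouter entry above, here is a minimal sketch of head-weighted audio token pruning; the mean-attention importance score, the tensor shapes, and the assumption that per-head weights are given are all illustrative guesses, not the paper's actual method, with only the 70% retention figure taken from the summary.

import numpy as np

def prune_audio_tokens(attn, head_weights, keep_ratio=0.7):
    # attn: (heads, queries, audio_tokens) attention map (assumed shape)
    # head_weights: (heads,) task-adaptive importance per head (assumed given)
    per_head = attn.mean(axis=1)              # (heads, audio_tokens)
    token_score = head_weights @ per_head     # weight heads, score each token
    n_keep = max(1, int(keep_ratio * attn.shape[-1]))
    kept = np.argsort(token_score)[-n_keep:]  # indices of top-scoring tokens
    return np.sort(kept)                      # keep original temporal order

# toy usage: 8 heads, 4 text queries, 100 audio tokens -> 70 kept indices
rng = np.random.default_rng(0)
kept = prune_audio_tokens(rng.random((8, 4, 100)), rng.random(8))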
citing papers explorer
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs, raising jailbreak success rates to as high as 87%; the dominant risk axis depends on model architecture and embedding proximity to harmful content.
- Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
- Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
In zero-shot use, MLLMs recognize seizure semiological features better than fine-tuned vision models on most tested features, benefit from signal enhancement, and provide faithful explanations.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.
- SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while preserving general task performance.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.
- GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.
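For the last entry, a minimal sketch of what a hybrid retrieval loss could look like, assuming a symmetric InfoNCE term plus a triplet term over the hardest in-batch negatives; the paper's actual loss and its cross-modal attention refinement module are not reproduced here.

import torch
import torch.nn.functional as F

def hybrid_loss(audio_emb, text_emb, temp=0.07, margin=0.2, alpha=0.5):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = a @ t.T                              # (batch, batch) cosine sims
    labels = torch.arange(sim.size(0), device=sim.device)
    # symmetric InfoNCE: matched audio-text pairs sit on the diagonal
    nce = 0.5 * (F.cross_entropy(sim / temp, labels)
                 + F.cross_entropy(sim.T / temp, labels))
    # triplet term against the hardest in-batch negative text per clip
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_neg = sim.masked_fill(eye, -1.0).max(dim=1).values
    triplet = F.relu(margin - sim.diag() + hardest_neg).mean()
    return alpha * nce + (1 - alpha) * triplet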