Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian · 2025 · cs.CL · arXiv 2507.16632

Canonical reference. 82% of citing Pith papers cite this work as background.

48 Pith papers citing it

Background 82% of classified citations

abstract

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

citation-role summary

background 9 · baseline 1 · method 1

citation-polarity summary

background 9 · baseline 1 · use method 1

representative citing papers

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

cs.SD · 2026-04-21 · unverdicted · novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.

Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.

AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries?

cs.SD · 2026-06-19 · unverdicted · novelty 7.0

Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

cs.SD · 2026-06-09 · unverdicted · novelty 7.0

Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult.

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

eess.AS · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

cs.MM · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.

Liberating LLM Capabilities in Full-Duplex Speech Models

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

eess.AS · 2026-04-03 · unverdicted · novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

cs.CL · 2026-01-14 · unverdicted · novelty 7.0

MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.

Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

cs.CL · 2025-12-29 · accept · novelty 7.0

Spoken language models exhibit style amnesia and fail to maintain instructed paralinguistic styles across multi-turn conversations, with explicit recall offering partial mitigation.

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

eess.AS · 2026-06-29 · unverdicted · novelty 6.0

PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.

MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios

eess.AS · 2026-06-22 · unverdicted · novelty 6.0

MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

cs.SD · 2026-06-09 · unverdicted · novelty 6.0

Introduces RAIL, a CHC-grounded benchmark with five core auditory capabilities to assess LALMs beyond task-centric metrics, showing uneven model performance.

Audio Interaction Model

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.

citing papers explorer

Showing 46 of 46 citing papers after filters.

HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models cs.SD · 2026-04-21 · unverdicted · none · ref 19 · internal anchor
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
Unlocking Speech-Text Compositional Powers: Instruction-Following Speech Language Models without Instruction Tuning cs.CL · 2026-07-02 · unverdicted · none · ref 30 · internal anchor
SpeechCombine produces instruction-following SLMs via speech pre-training followed by direct weight combination with the text LLM instruction delta, without any speech instruction tuning.
RedVox: Safety and Fairness Gaps in Speech Models Across Languages cs.CL · 2026-06-25 · unverdicted · none · ref 164 · internal anchor
RedVox benchmark shows speech model safety and fairness vulnerabilities persist under non-adversarial conditions, worsen in non-English languages, and increase with spoken inputs.
AOR-Bench: Do Large Audio Language Models Over-Refuse Pseudo-Harmful Queries? cs.SD · 2026-06-19 · unverdicted · none · ref 48 · internal anchor
Introduces the first benchmark for over-refusal in large audio language models using 3,000 pseudo-harmful audio samples and evaluates 12 models across six families, finding widespread over-refusal.
Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models cs.SD · 2026-06-09 · unverdicted · none · ref 10 · internal anchor
Instruction-based vector steering redirects temporal attention in LALMs to acoustically relevant regions, recovering queried sound event locations with 60.87-68.72% overlap accuracy without training.
Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering cs.CL · 2026-06-09 · unverdicted · none · ref 42 · internal anchor
FD-SLMs exhibit state inertia during abrupt interruptions that a training-free perception-vector steering intervention mitigates, lifting correctness from 28% to 45% and IWOR from 40% to 72% on the Zero-Buffer Benchmark.
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing eess.AS · 2026-06-01 · unverdicted · none · ref 33 · internal anchor
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult.
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects cs.CL · 2026-05-31 · unverdicted · none · ref 87 · internal anchor
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action eess.AS · 2026-05-20 · unverdicted · none · ref 20 · 2 links · internal anchor
DuplexSLA introduces a three-channel full-duplex architecture that synchronizes continuous user audio, discrete assistant audio, and rate-limited textual actions inside a single backbone for native turn-taking and in-conversation tool use.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation cs.MM · 2026-05-12 · unverdicted · none · ref 24 · 2 links · internal anchor
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue cs.CL · 2026-05-11 · unverdicted · none · ref 43 · internal anchor
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of weaker integration.
Liberating LLM Capabilities in Full-Duplex Speech Models cs.CL · 2026-05-04 · unverdicted · none · ref 15 · internal anchor
LWS is a text-first paradigm for full-duplex speech LLMs that treats visible writing as a primary output channel alongside audio input and spoken response, implemented via token schema and synthetic per-second annotations.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation cs.CL · 2026-04-22 · unverdicted · none · ref 30 · internal anchor
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models eess.AS · 2026-04-13 · unverdicted · none · ref 36 · internal anchor
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation cs.SD · 2026-04-09 · unverdicted · none · ref 41 · internal anchor
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR eess.AS · 2026-04-03 · unverdicted · none · ref 18 · internal anchor
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.
TiCo: Time-Controllable Spoken Dialogue Model cs.CL · 2026-03-23 · unverdicted · none · ref 69 · internal anchor
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus cs.CL · 2026-01-14 · unverdicted · none · ref 14 · internal anchor
MCGA is a new 119-hour multi-task audio corpus for classical Chinese literary genres that shows current MLLMs face substantial challenges on its test set.
Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation eess.AS · 2026-06-29 · unverdicted · none · ref 30 · internal anchor
PRIME-Speech adds low-latency speech output to frozen S2T LLMs by synchronizing a causal post-decoder with intermediate hidden states and using mixed conditioning plus turn-level KV-cache packing, preserving original S2T performance across translation, QA, and dialogue tasks.
MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios eess.AS · 2026-06-22 · unverdicted · none · ref 32 · internal anchor
MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models cs.CL · 2026-06-09 · unverdicted · none · ref 46 · internal anchor
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark cs.SD · 2026-06-09 · unverdicted · none · ref 37 · internal anchor
Introduces RAIL, a CHC-grounded benchmark with five core auditory capabilities to assess LALMs beyond task-centric metrics, showing uneven model performance.
Audio Interaction Model cs.SD · 2026-06-03 · unverdicted · none · ref 35 · internal anchor
Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.
LaSR: Context-Aware Speech Recognition via Latent Reasoning cs.CL · 2026-05-30 · unverdicted · none · ref 6 · internal anchor
LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model eess.AS · 2026-05-12 · unverdicted · none · ref 26 · internal anchor
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs cs.CL · 2026-05-11 · unverdicted · none · ref 50 · internal anchor
EchoDistill applies noisy-to-clean self-distillation with GRPO to boost Audio LLM robustness, reporting 4.18% average GSR gains under strong noise.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models cs.SD · 2026-05-06 · unverdicted · none · ref 37 · internal anchor
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech eess.AS · 2026-04-19 · unverdicted · none · ref 47 · internal anchor
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use cs.SD · 2026-04-17 · unverdicted · none · ref 31 · internal anchor
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization eess.AS · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs eess.AS · 2026-04-09 · unverdicted · none · ref 24 · internal anchor
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
Mind-Paced Speaking: A Dual-Brain Approach to Real-Time Reasoning in Spoken Language Models cs.CL · 2025-10-10 · unverdicted · none · ref 2 · internal anchor
MPS proposes a dual-brain architecture separating formulation reasoning from articulation to achieve real-time CoT in SLMs with accuracy comparable to full pre-computation but much lower latency.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 76 · internal anchor
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models cs.SD · 2026-06-27 · unverdicted · none · ref 12 · internal anchor
ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization cs.CL · 2026-05-26 · unverdicted · none · ref 24 · internal anchor
MAPO is a dual-branch RL framework using modality relevance masks from cross-modal differential entropy and auxiliary attention losses to reduce late-stage modality collapse in audio reasoning models and improve benchmark results.
StepAudio 2.5 Technical Report eess.AS · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.
A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook cs.SD · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models cs.SD · 2026-04-20 · unverdicted · none · ref 14 · internal anchor
A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models eess.AS · 2026-04-14 · unverdicted · none · ref 16 · internal anchor
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving cs.CL · 2026-07-02 · unverdicted · none · ref 11 · internal anchor
JSTIP interleaves speech and text sequences during pretraining on 38k hours of ASR data to improve entity accuracy over ASR-only and simple joint-training baselines while matching performance from domain text.
Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models eess.AS · 2026-06-05 · unverdicted · none · ref 37 · internal anchor
CogAudio-LLM introduces LIME-440K dataset, EIPS chain-of-thought reasoning, and DR-SAPO optimization to address semantic dominance and improve affective responses in audio language models.
Audio-Mind: An Auditable Agentic Framework for Audio Understanding eess.AS · 2026-05-27 · unverdicted · none · ref 43 · internal anchor
Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.
Step-Audio-R1.5 Technical Report eess.AS · 2026-04-28 · unverdicted · none · ref 5 · internal anchor
Step-Audio-R1.5 applies RLHF to audio reasoning models to escape the verifiable reward trap of RLVR, preserving analytical ability while restoring prosodic naturalness and immersion in long dialogues.
OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization cs.CV · 2026-02-05 · unverdicted · none · ref 11 · internal anchor
OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
A Survey of Audio Reasoning in Multimodal Foundation Models eess.AS · 2026-05-20 · unverdicted · none · ref 61 · internal anchor
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models eess.AS · 2026-05-15 · unverdicted · none · ref 64 · internal anchor
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
