LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
Omni-LLMs show systematic failures at cross-modal coreference; a new dataset and both training-free and training-based fixes produce substantial gains on 13 models.
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
citing papers explorer
-
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
-
Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs
Omni-LLMs show systematic failures at cross-modal coreference; a new dataset and both training-free and training-based fixes produce substantial gains on 13 models.
-
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
-
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
-
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.