pith. machine review for the scientific record.

arxiv: 2602.07064 · v2 · submitted 2026-02-05 · 💻 cs.CV

Recognition: no theorem link

OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 07:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords OmniFysics · omni-modal network · physical intelligence · signal processing · FysicsAny · FysicsOmniCap · multimodal benchmarks · physics data engine

The pith

OmniFysics unifies omni-modal signals with physics laws to evolve AI physical intelligence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI models struggle with physical understanding because key signals are ambiguous and sparse in web-scale data. The paper presents OmniFysics, a compact network that processes images, audio, video, and text in one system. It adds a dynamic physical data engine containing FysicsAny, which maps salient objects to verified physical attributes through hierarchical retrieval and physics-law constraints, plus FysicsOmniCap, which distills web videos into high-fidelity pairs focused on dynamic physical cues. The network trains via staged multimodal alignment and evolutive instruction tuning that includes latent-space flow matching. A sympathetic reader would care because the approach aims to move AI from brittle data-driven perception toward reliable physical reasoning that could matter for real-world deployment.

Core claim

OmniFysics is a compact omni-modal network that unifies signal processing across images, audio, video, and text. Through a dynamic physical data engine built on the FysicsAny and FysicsOmniCap mechanisms, together with staged optimization and evolutive tuning, it injects explicit physical knowledge and achieves competitive performance on standard multimodal benchmarks while advancing physics-oriented evaluations.

What carries the argument

FysicsAny, the adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification.
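
To make this load-bearing mechanism concrete, here is a minimal, hypothetical sketch of what hierarchical retrieval plus physics-law-constrained verification could look like. The taxonomy paths, attribute ranges, and the density-consistency check are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical attribute table: taxonomy path -> plausible physical ranges (SI units).
# All categories and numbers are illustrative placeholders, not taken from the paper.
ATTRIBUTE_TABLE = {
    "object/container/bottle": {"mass_kg": (0.02, 2.0), "density_kg_m3": (800, 1400)},
    "object/sports/basketball": {"mass_kg": (0.5, 0.65), "density_kg_m3": (80, 120)},
}

@dataclass
class Candidate:
    category_path: str   # e.g. "object/sports/basketball", from salient-object perception
    mass_kg: float
    volume_m3: float

def hierarchical_retrieve(category_path: str) -> Optional[dict]:
    """Walk the taxonomy from most to least specific until a table entry is found."""
    parts = category_path.split("/")
    for depth in range(len(parts), 0, -1):
        entry = ATTRIBUTE_TABLE.get("/".join(parts[:depth]))
        if entry is not None:
            return entry
    return None

def law_constrained_verify(cand: Candidate, entry: dict, tol: float = 0.2) -> bool:
    """Accept a candidate only if the retrieved ranges and a simple physics law agree:
    density = mass / volume must land inside the retrieved density range (within tol)."""
    density = cand.mass_kg / max(cand.volume_m3, 1e-9)
    lo, hi = entry["density_kg_m3"]
    in_mass_range = entry["mass_kg"][0] <= cand.mass_kg <= entry["mass_kg"][1]
    return in_mass_range and lo * (1 - tol) <= density <= hi * (1 + tol)

def build_supervision(cand: Candidate) -> Optional[dict]:
    """Emit a physics-grounded supervision record, or None if verification fails."""
    entry = hierarchical_retrieve(cand.category_path)
    if entry is None or not law_constrained_verify(cand, entry):
        return None
    return {"category": cand.category_path, "mass_kg": cand.mass_kg,
            "density_kg_m3": cand.mass_kg / cand.volume_m3}

# A basketball-like candidate passes; an implausibly heavy one is rejected (None).
print(build_supervision(Candidate("object/sports/basketball", 0.62, 0.0071)))
print(build_supervision(Candidate("object/sports/basketball", 5.00, 0.0071)))
```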

If this is right

  • The system achieves competitive performance on standard multimodal benchmarks.
  • It significantly advances results on physics-oriented evaluations.
  • Latent-space flow matching integrates into the optimization for improved generation.
  • An adaptive intent router enables more efficient execution of the network (a hedged routing sketch follows this list).
  • The overall paradigm supports autonomous optimization of networked AI systems.
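
On the intent-router point above, the following is a minimal sketch of what modality- and task-based routing could look like; the intent names, required modalities, and pipeline labels are illustrative assumptions, not the paper's SA-IAR.

```python
# Hypothetical intent router: dispatch each request to the cheapest sub-pipeline that
# can serve it, falling back to the full omni-modal path. Purely illustrative.
INTENTS = {
    "speech_chat":   {"needs": {"audio"},         "pipeline": "audio_llm"},
    "image_qa":      {"needs": {"image", "text"}, "pipeline": "vision_llm"},
    "video_physics": {"needs": {"video"},         "pipeline": "full_omni"},
    "text_to_image": {"needs": {"text"},          "pipeline": "flow_generator"},
}

def route(request: dict) -> str:
    """Pick the declared intent whose required modalities are present in the request."""
    present = {m for m in ("text", "image", "audio", "video") if request.get(m)}
    for name, spec in INTENTS.items():
        if request.get("task") == name and spec["needs"] <= present:
            return spec["pipeline"]
    return "full_omni"  # default: run the full network

print(route({"task": "image_qa", "image": "frame.png", "text": "what material is this?"}))
```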

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same retrieval-plus-law-verification pattern could be adapted to other domains such as chemistry or biology where domain rules are well known.
  • Replacing the external retrieval step with an internal learned module might increase scalability while preserving the physics constraints.
  • Stronger physical intelligence could directly benefit downstream tasks like robotic planning or physics simulation where current models fail on basic dynamics.

Load-bearing premise

The FysicsAny mechanism can reliably map salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification without introducing systematic errors or biases from the retrieval process.

What would settle it

A controlled ablation that disables the physics-law-constrained verification step inside FysicsAny and then re-runs the physics-oriented evaluation benchmarks to check whether the reported performance gains disappear.
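
A self-contained toy illustration of that ablation logic follows: the same training and scoring code runs once with and once without a law-verification filter on the supervision pool. The linear "model", the corruption process, and the 10% accuracy threshold are stand-ins for exposition, not the paper's pipeline.

```python
import random

random.seed(0)

def make_supervision(n, law_check):
    """Simulate physics supervision where y = 2x is the true relation; without the
    law check, roughly half of the records keep corrupted attribute values."""
    records = []
    for _ in range(n):
        x = random.random()
        y = 2.0 * x
        if not law_check and random.random() < 0.5:
            y = random.uniform(1.0, 5.0)   # corrupted retrieval that was never filtered
        records.append((x, y))
    return records

def train_model(pairs):
    """'Model' = least-squares slope through the origin fitted to the supervision."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

def evaluate(slope, benchmark, tol=0.10):
    """Score = fraction of benchmark items predicted within 10% relative error."""
    return sum(abs(slope * x - y) / y < tol for x, y in benchmark) / len(benchmark)

benchmark = [(x, 2.0 * x) for x in (random.uniform(0.1, 1.0) for _ in range(200))]
for law_check in (True, False):
    score = evaluate(train_model(make_supervision(1000, law_check)), benchmark)
    print(f"law verification={law_check}: physics score={score:.2f}")
```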

Figures

Figures reproduced from arXiv: 2602.07064 by Dingkang Yang, Lihua Zhang, Minghao Han, Yizhou Liu, Yue Jiang.

Figure 1
Figure 1: FysicsAny pipeline. Overview of the pipeline for constructing physics-aware supervision from web-scale data. It integrates object-centric perception, hierarchical knowledge retrieval, and physical-law-constrained verification to generate diverse instruction-image pairs for physical attribute supervision. view at source ↗
Figure 2
Figure 2: Quantitative comparison of Mean Relative Accuracy (MRA) in real-world physical attribute estimation. The chart evaluates the predictive performance of GPT-5 and the proposed FysicsAny pipeline across 11 diverse physical properties against instrument-measured ground truths. Higher MRA percentages indicate closer alignment with actual real-world physical values. FysicsAny consistently and significantly outperforms… (a hedged sketch of one common MRA formulation follows the figure list) view at source ↗
Figure 3
Figure 3: Overview of OmniFysics and training data distribution. (a) Model architecture. The model employs Temporal Multimodal Rotary Position Embedding to process interleaved sequences of images, audio, and text. For understanding, the Vision and Audio Encoders extract features to feed the LLM backbone. For the generation task, the Codec and VAE Encoder are utilized to assist the SpokenVoxer and Flow Head in synthesizing… view at source ↗
Figure 4
Figure 4: Training pipeline of OmniFysics. We employ a four-stage training strategy for OmniFysics to progressively enhance its omni-modal perception and physical understanding, including speech and text-to-image generation. view at source ↗
Figure 5
Figure 5: OmniFysics on Vision Understanding vs. Leading… view at source ↗
Figure 6
Figure 6: Physics-aware Generation. Mapping Physical Parameters to Faithful Materials and Simulating Accurate Scientific Phenomena. (Adjacent in the source: Table V, performance of OmniFysics on image generation benchmarks GenEval, DPG-Bench, and Science-T2I-S, compared to leading expert and omni models such as SDv1.5 [75], Hunyuan-DiT [76], Janus [77], and Ovis-U1 [78].) view at source ↗
Figure 7
Figure 7: Routing accuracy of the intent-aware router (SA-IAR) and its ablated variants across four distinct interaction scenarios, demonstrating the specific… view at source ↗
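
Figure 2 reports Mean Relative Accuracy (MRA). The paper's exact formula is not visible here, but one common formulation for physical-attribute estimation averages a relative-error indicator over a sweep of thresholds; the sketch below assumes that formulation and is not necessarily the paper's definition.

```python
def mean_relative_accuracy(preds, gts, thresholds=None):
    """MRA under one common (assumed) definition: at each threshold theta, a prediction
    counts as correct if its relative error is below 1 - theta; average over thresholds."""
    if thresholds is None:
        thresholds = [0.50 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95
    scores = []
    for theta in thresholds:
        hits = sum(abs(p - g) / abs(g) < (1 - theta) for p, g in zip(preds, gts))
        scores.append(hits / len(gts))
    return sum(scores) / len(scores)

# Example: three mass estimates (kg) against instrument-measured ground truth.
print(mean_relative_accuracy([0.60, 1.10, 7.5], [0.62, 1.00, 5.0]))   # -> 0.6
```
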
read the original abstract

The autonomous evolution of networked AI systems relies heavily on robust environmental perception. However, physical understanding remains brittle in current models because key physical signals are visually ambiguous and sparsely represented in web-scale data. To bridge the gap between data-centric learning and knowledge-based physical rules, we present OmniFysics, a compact omni-modal network that unifies signal processing and understanding across images, audio, video, and text. To enable autonomous optimization and inject explicit physical knowledge, we construct a dynamic physical data engine. Within this engine, FysicsAny acts as an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification. Concurrently, FysicsOmniCap distills web videos utilizing advanced audio-visual cross-modal signal processing, generating high-fidelity data pairs that emphasize dynamic physical cues. We optimize the OmniFysics network through staged multimodal alignment and evolutive instruction tuning, integrating latent-space flow matching for generation and an adaptive intent router for efficient execution. Experiments demonstrate that this evolutive optimization paradigm not only achieves competitive performance on standard multimodal benchmarks but also significantly advances physics-oriented evaluations.
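
The abstract's "latent-space flow matching for generation" points at a standard objective from the flow-matching literature: regress a velocity field along interpolation paths between noise and data latents. A minimal PyTorch sketch of that standard objective follows; the network shape, latent dimensions, and conditioning are placeholder assumptions, not the paper's Flow Head.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Placeholder velocity field v_theta(x_t, t, cond); not the paper's architecture."""
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t[:, None]], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching with linear paths: x_t = (1 - t) * x0 + t * x1,
    regression target x1 - x0. Here x1 stands in for data latents (e.g. VAE codes)."""
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # random time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # point on the straight path
    v_target = x1 - x0                              # constant velocity along that path
    return ((model(x_t, t, cond) - v_target) ** 2).mean()

# Toy usage with random latents and conditioning vectors.
model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(8, 64), torch.randn(8, 32))
loss.backward()
```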

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents OmniFysics, a compact omni-modal network unifying signal processing and understanding across images, audio, video, and text. It introduces a dynamic physical data engine containing FysicsAny, which produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification, and FysicsOmniCap, which distills web videos into high-fidelity data pairs emphasizing dynamic physical cues. The network is optimized through staged multimodal alignment, evolutive instruction tuning, latent-space flow matching for generation, and an adaptive intent router. The central claim is that this evolutive optimization paradigm achieves competitive performance on standard multimodal benchmarks while significantly advancing physics-oriented evaluations.

Significance. If the empirical claims are substantiated with quantitative evidence, the work could advance integration of explicit physical rules into multimodal AI systems, addressing brittleness in physical understanding from web-scale data. The combination of custom data engines for physics supervision and evolutive tuning offers a novel direction for autonomous optimization, though its broader impact hinges on demonstrating that the physics gains are not artifacts of the internal pipeline.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: The claim that experiments demonstrate 'competitive performance on standard multimodal benchmarks' and 'significantly advances physics-oriented evaluations' supplies no quantitative metrics, baselines, error bars, ablation studies, or specific benchmark scores. This absence leaves the central empirical claim without visible support and prevents assessment of the magnitude of any physics gains.
  2. [FysicsAny mechanism] FysicsAny mechanism (system description): The hierarchical retrieval plus physics-law-constrained verification is presented as producing reliable physics-grounded supervision, yet no error rates, bias measurements, retrieval accuracy ablations, or failure-case analysis are reported. Without these controls it is impossible to determine whether reported physics advances reflect genuine improvements or systematic artifacts from the retrieval process.
  3. [Evaluation] Evaluation setup: Performance on physics-oriented evaluations is measured using the custom FysicsAny and FysicsOmniCap engines that are defined, trained, and evaluated within the same manuscript. This creates a circularity problem; independent external physics benchmarks or cross-validation against established datasets are not shown, weakening the claim that the advances are generalizable.
minor comments (1)
  1. [Introduction / System Overview] The acronyms 'FysicsAny' and 'FysicsOmniCap' are introduced without explicit expansion or relation to prior terminology, which may hinder readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important gaps in the empirical support and evaluation rigor of our work. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that experiments demonstrate 'competitive performance on standard multimodal benchmarks' and 'significantly advances physics-oriented evaluations' supplies no quantitative metrics, baselines, error bars, ablation studies, or specific benchmark scores. This absence leaves the central empirical claim without visible support and prevents assessment of the magnitude of any physics gains.

    Authors: We agree that the current manuscript version presents the experimental claims at a high level without the supporting numerical evidence. In the revised version we will expand the Experiments section to report concrete scores on standard multimodal benchmarks (including VQAv2, AudioCaps, and video understanding tasks), direct comparisons to relevant baselines, standard error bars computed over multiple random seeds, and full ablation tables isolating the dynamic physical data engine and evolutive tuning components. These additions will make the magnitude of both the competitive multimodal results and the physics-oriented gains quantitatively verifiable. revision: yes

  2. Referee: [FysicsAny mechanism] FysicsAny mechanism (system description): The hierarchical retrieval plus physics-law-constrained verification is presented as producing reliable physics-grounded supervision, yet no error rates, bias measurements, retrieval accuracy ablations, or failure-case analysis are reported. Without these controls it is impossible to determine whether reported physics advances reflect genuine improvements or systematic artifacts from the retrieval process.

    Authors: The referee is correct that validation metrics for FysicsAny are currently missing. The revised manuscript will include a dedicated analysis subsection reporting retrieval error rates, bias measurements stratified by object category and physical attribute, ablation results comparing accuracy with versus without the physics-law constraints, and a failure-case study with representative examples and quantitative breakdown of error types. These controls will allow readers to assess whether the physics supervision is reliable. revision: yes

  3. Referee: [Evaluation] Evaluation setup: Performance on physics-oriented evaluations is measured using the custom FysicsAny and FysicsOmniCap engines that are defined, trained, and evaluated within the same manuscript. This creates a circularity problem; independent external physics benchmarks or cross-validation against established datasets are not shown, weakening the claim that the advances are generalizable.

    Authors: We acknowledge the circularity concern. The revision will add evaluations on independent external physics benchmarks (e.g., Physion and established physical-reasoning datasets) that were not generated by our engines. We will also report results on held-out test splits and cross-validation protocols that separate data generation from final evaluation, thereby demonstrating that the observed physics gains generalize beyond the internal pipeline. revision: yes
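
One concrete way to operationalize the promised separation between data generation and final evaluation is an overlap audit over raw source identifiers. The sketch below is illustrative, not the authors' protocol; the field names ("source_id", "start", "end") are hypothetical.

```python
import hashlib

def fingerprint(sample: dict) -> str:
    """Hash the raw-source identifiers of a sample (e.g. video id plus timestamp span)."""
    key = f"{sample['source_id']}|{sample.get('start', '')}|{sample.get('end', '')}"
    return hashlib.sha256(key.encode()).hexdigest()

def check_decoupling(train_pool, eval_benchmark):
    """Flag evaluation items whose underlying source also fed the data engine."""
    train_fps = {fingerprint(s) for s in train_pool}
    leaked = [s for s in eval_benchmark if fingerprint(s) in train_fps]
    print(f"{len(leaked)} / {len(eval_benchmark)} evaluation items overlap the training pool")
    return leaked

# Toy usage: the shared source id is flagged; the unseen one is not.
train = [{"source_id": "vid_001", "start": 0, "end": 5}]
bench = [{"source_id": "vid_001", "start": 0, "end": 5},
         {"source_id": "vid_999", "start": 3, "end": 8}]
check_decoupling(train, bench)
```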

Circularity Check

1 step flagged

Physics-oriented evaluation gains reduce to the internal FysicsAny data engine by construction

specific steps
  1. fitted input called prediction [Abstract]
    "FysicsAny acts as an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification. [...] Experiments demonstrate that this evolutive optimization paradigm not only achieves competitive performance on standard multimodal benchmarks but also significantly advances physics-oriented evaluations."

    The physics-oriented evaluation metric is advanced by the identical hierarchical-retrieval + law-constrained verification process that FysicsAny uses to create training supervision. No separate external physics benchmark or error-controlled ablation is cited; the 'advance' is therefore the output of the paper's own data engine, reducing the claimed result to a renaming of its input construction.

full rationale

The paper's headline result—that the OmniFysics paradigm 'significantly advances physics-oriented evaluations'—is produced by the same FysicsAny mechanism that generates the physics-grounded supervision used for training. Because the abstract presents no external benchmark, error bounds, or independent verification for the retrieval/verification pipeline, the reported physics gains are statistically forced by the paper's own data-construction step rather than emerging from an independent derivation or external test.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of two newly introduced data-generation modules whose internal parameters and verification rules are not detailed, plus standard assumptions from multimodal learning literature.

free parameters (1)
  • hyperparameters for staged multimodal alignment and evolutive instruction tuning
    Multiple training-stage parameters are implied but not enumerated or justified in the provided abstract.
axioms (1)
  • domain assumption: Physical attributes of objects can be reliably retrieved and verified through hierarchical search plus physics-law constraints
    Invoked to justify the FysicsAny supervision mechanism.
invented entities (3)
  • FysicsAny · no independent evidence
    purpose: Adaptive mechanism that maps objects to verified physical attributes
    New component introduced to generate physics-grounded supervision.
  • FysicsOmniCap · no independent evidence
    purpose: Distills web videos into high-fidelity audio-visual pairs emphasizing dynamic physical cues
    New component introduced for data generation.
  • OmniFysics network · no independent evidence
    purpose: Compact omni-modal backbone unifying image, audio, video, and text processing
    Core new model architecture proposed in the paper.

pith-pipeline@v0.9.0 · 5513 in / 1428 out tokens · 30332 ms · 2026-05-16T07:06:39.727821+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 26 internal anchors

  1. [1]

    Beyond language-specific neurons: The challenge of identifying speech-specific neurons in multimodal llms,

N. Park, C. H. Lee, J. Yeom et al., “Beyond language-specific neurons: The challenge of identifying speech-specific neurons in multimodal llms,” IEEE Journal of Selected Topics in Signal Processing, 2026

  2. [2]

    Rehazing for dehazing: A physics-guided parametric augmentation net,

C.-L. Chang, F.-J. Tsai, Z. Huang et al., “Rehazing for dehazing: A physics-guided parametric augmentation net,” IEEE Journal of Selected Topics in Signal Processing, 2025

  3. [3]

    Sgnet: Sequence grouping network via vision-language model for text-guided video summarization,

J. Yao, J. Zhang, and L. Zhuo, “Sgnet: Sequence grouping network via vision-language model for text-guided video summarization,” IEEE Journal of Selected Topics in Signal Processing, 2025

  4. [4]

    Cross-model adjudication for bias mitigation in large language models,

X. Li, C. Li, W. Liu et al., “Cross-model adjudication for bias mitigation in large language models,” IEEE Journal of Selected Topics in Signal Processing, 2026

  5. [5]

    GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  6. [6]

    Gemini: A Family of Highly Capable Multimodal Models

R. Anil, S. Borgeaud, J.-B. Alayrac et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  7. [7]

    Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu et al., “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025

  8. [8]

    Seed1.5-VL Technical Report

D. Guo, F. Wu, F. Zhu et al., “Seed1.5-vl technical report,” arXiv preprint arXiv:2505.07062, 2025

  9. [9]

    Gemma 3 Technical Report

G. Team, A. Kamath, J. Ferret et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025

  10. [10]

    Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  11. [11]

    Step-Audio 2 Technical Report

B. Wu, C. Yan, C. Hu et al., “Step-audio 2 technical report,” arXiv preprint arXiv:2507.16632, 2025

  12. [12]

    Kling-Omni Technical Report

K. Team, J. Chen, Y. Ci et al., “Kling-omni technical report,” arXiv preprint arXiv:2512.16776, 2025

  13. [13]

    Longcat-flash-omni technical report,

M. L. Team, B. Wang, B. Xiao et al., “Longcat-flash-omni technical report,” arXiv preprint arXiv:2511.00279, 2025

  14. [14]

    Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation,

B. Ma, C. Zou, C. Yan et al., “Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation,” arXiv preprint arXiv:2510.24821, 2025

  15. [15]

    Phybench: Holistic evaluation of physical perception and reasoning in large language models,

S. Qiu, S. Guo, Z.-Y. Song et al., “Phybench: Holistic evaluation of physical perception and reasoning in large language models,” arXiv preprint arXiv:2504.16074, 2025

  16. [16]

    Is sora a world simulator? a comprehensive survey on general world models and beyond,

Z. Zhu, X. Wang, W. Zhao et al., “Is sora a world simulator? a comprehensive survey on general world models and beyond,” arXiv preprint arXiv:2405.03520, 2024

  17. [17]

    Towards a physics foundation model,

F. Wiesner, M. Wessling, and S. Baek, “Towards a physics foundation model,” arXiv preprint arXiv:2509.13805, 2025

  18. [18]

    Do generative video models understand physical principles?

S. Motamed, L. Culp, K. Swersky et al., “Do generative video models understand physical principles?” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 948–958

  19. [19]

    Baichuan-omni-1.5 technical report,

Y. Li, J. Liu, T. Zhang et al., “Baichuan-omni-1.5 technical report,” arXiv preprint arXiv:2501.15368, 2025

  20. [20]

    Physbench: Benchmarking and enhancing vision-language models for physical world understanding,

W. Chow, J. Mao, B. Li et al., “Physbench: Benchmarking and enhancing vision-language models for physical world understanding,” in ICLR, 2025

  21. [21]

    Video generation models as world simulators,

OpenAI, “Video generation models as world simulators,” https://openai.com/research/video-generation-models-as-world-simulators, 2024, accessed: 2024-02-15

  22. [22]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan et al., “V-jepa 2: Self-supervised video models enable understanding, prediction and planning,” arXiv preprint arXiv:2506.09985, 2025

  23. [23]

    Intphys: A framework and benchmark for visual intuitive physics reasoning,

R. Riochet, M. Y. Castro, M. Bernard et al., “Intphys: A framework and benchmark for visual intuitive physics reasoning,” arXiv preprint arXiv:1803.07616, 2018

  24. [24]

    Clevrer: Collision events for video representation and reasoning,

K. Yi, C. Gan, Y. Li et al., “Clevrer: Collision events for video representation and reasoning,” in ICLR, 2020

  25. [25]

    Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models,

L. Puyin, T. Xiang, E. Mao et al., “Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models,” arXiv preprint arXiv:2512.19526, 2025

  26. [26]

    Abench-physics: Benchmarking physical reasoning in llms via high-difficulty and dynamic physics problems,

Y. Zhang, Y. Ma, Y. Gu et al., “Abench-physics: Benchmarking physical reasoning in llms via high-difficulty and dynamic physics problems,” arXiv preprint arXiv:2507.04766, 2025

  27. [27]

    Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models,

L. Wang, E. Su, J. Liu et al., “Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models,” arXiv preprint arXiv:2506.17667, 2025

  28. [28]

    Physreason: A comprehensive benchmark towards physics-based reasoning,

X. Zhang, Y. Dong, Y. Wu et al., “Physreason: A comprehensive benchmark towards physics-based reasoning,” in ACL, 2025, pp. 16 593–16 615

  29. [29]

    Seephys: Does seeing help thinking?–benchmarking vision-based physics reasoning,

K. Xiang, H. Li, T. J. Zhang et al., “Seephys: Does seeing help thinking?–benchmarking vision-based physics reasoning,” arXiv preprint arXiv:2505.19099, 2025

  30. [30]

    Phystoolbench: Benchmarking physical tool understanding for mllms,

Z. Zhang, K. Chen, X. Lin et al., “Phystoolbench: Benchmarking physical tool understanding for mllms,” arXiv preprint arXiv:2510.09507, 2025

  31. [31]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,

A. Kuznetsova, H. Rom, N. Alldrin et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, vol. 128, no. 7, pp. 1956–1981, 2020

  32. [32]

    Introducing GPT-5,

    OpenAI, “Introducing GPT-5,” Aug. 2025, accessed: 2025-11-03. [Online]. Available: https://openai.com/index/introducing-gpt-5/

  33. [33]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Y. Zhang, M. Li, D. Long et al., “Qwen3 embedding: Advancing text embedding and reranking through foundation models,” arXiv preprint arXiv:2506.05176, 2025

  34. [34]

    Qwen3-VL Technical Report

S. Bai, Y. Cai, R. Chen et al., “Qwen3-vl technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2511.21631

  35. [35]

    Thinking in space: How multimodal large language models see, remember, and recall spaces,

J. Yang, S. Yang, A. W. Gupta et al., “Thinking in space: How multimodal large language models see, remember, and recall spaces,” in CVPR, 2025, pp. 10 632–10 643

  36. [36]

Vggsound: A large-scale audio-visual dataset,

H. Chen, W. Xie, A. Vedaldi et al., “Vggsound: A large-scale audio-visual dataset,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 721–725

  37. [37]

    Imagebind: One embedding space to bind them all,

R. Girdhar, A. El-Nouby, Z. Liu et al., “Imagebind: One embedding space to bind them all,” in CVPR, 2023, pp. 15 180–15 190

  38. [38]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  39. [39]

    Qwen2-Audio Technical Report

Y. Chu, J. Xu, Q. Yang et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  40. [40]

    Pearson correlation coefficient,

J. Benesty, J. Chen, Y. Huang et al., “Pearson correlation coefficient,” in Noise reduction in speech processing. Springer, 2009

  41. [41]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

M. Tschannen, A. Gritsenko, X. Wang et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025

  42. [42]

    Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu et al., “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28 492–28 518

  43. [43]

    Qwen2.5 Technical Report

Qwen, A. Yang et al., “Qwen2.5 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2412.15115

  44. [44]

    Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,

S. Ji, Z. Jiang, W. Wang et al., “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,” in ICLR, 2025

  45. [45]

    Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in ICCV, 2023, pp. 4195–4205

  46. [46]

    Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  47. [47]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z. Du, Y. Wang, Q. Chen et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

  48. [48]

    Flow Matching for Generative Modeling

Y. Lipman, R. T. Chen, H. Ben-Hamu et al., “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022

  49. [49]

    Laion-sg: An enhanced large-scale dataset for training complex image-text models with structural annotations,

Z. Li, C. Meng, Y. Li et al., “Laion-sg: An enhanced large-scale dataset for training complex image-text models with structural annotations,”

  50. [50]

    Available: https://arxiv.org/abs/2412.08580

    [Online]. Available: https://arxiv.org/abs/2412.08580

  51. [51]

    Introducing claude haiku 4.5,

Anthropic, “Introducing claude haiku 4.5,” https://www.anthropic.com/news/claude-haiku-4-5, Oct. 2025, accessed: 2026-01-29

  52. [52]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

W. Wang, Z. Gao, L. Gu et al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025

  53. [53]

    Ovis2.5 Technical Report

S. Lu, Y. Li, Y. Xia et al., “Ovis2.5 technical report,” arXiv preprint arXiv:2508.11737, 2025

  54. [54]

    Sail-vl2 technical report,

W. Yin, Y. Ye, F. Shu et al., “Sail-vl2 technical report,” arXiv preprint arXiv:2509.14033, 2025

  55. [55]

    Qwen3-Omni Technical Report

J. Xu, Z. Guo, H. Hu et al., “Qwen3-omni technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2509.17765

  56. [56]

Omnivinci: Enhancing architecture and data for omni-modal understanding llm,

H. Ye, C.-H. H. Yang, A. Goel et al., “Omnivinci: Enhancing architecture and data for omni-modal understanding llm,” arXiv preprint arXiv:2510.15870, 2025

  57. [57]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action,

J. Lu, C. Clark, S. Lee et al., “Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action,” in CVPR, 2024, pp. 26 439–26 455

  58. [58]

    Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference,

B. Warner, A. Chaffin, B. Clavié et al., “Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference,” in ACL, 2025, pp. 2526–2547

  59. [59]

    Pai-bench: A comprehensive benchmark for physical ai,

F. Zhou, J. Huang, J. Li et al., “Pai-bench: A comprehensive benchmark for physical ai,” arXiv preprint arXiv:2512.01989, 2025

  60. [60]

    Kimi-Audio Technical Report

D. Ding, Z. Ju, Y. Leng et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025

  61. [61]

Audio-reasoner: Improving reasoning capability in large audio language models,

Z. Xie, M. Lin, Z. Liu et al., “Audio-reasoner: Improving reasoning capability in large audio language models,” arXiv preprint arXiv:2503.02318, 2025

  62. [62]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

S. Ghosh, Z. Kong, S. Kumar et al., “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” arXiv preprint arXiv:2503.03983, 2025

  63. [63]

    Omnibench: Towards the future of universal omni-language models,

Y. Li, Y. Ma, G. Zhang et al., “Omnibench: Towards the future of universal omni-language models,” in NeurIPS, 2025

  64. [64]

    Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,

J. Hong, S. Yan, J. Cai et al., “Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,” in ICLR, 2026

  65. [65]

    Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,

Z. Zhou, R. Wang, and Z. Wu, “Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,” arXiv preprint arXiv:2505.17862, 2025

  66. [66]

    Fysicsworld: A unified full-modality benchmark for any-to-any understanding, generation, and reasoning,

Y. Jiang, D. Yang, M. Han et al., “Fysicsworld: A unified full-modality benchmark for any-to-any understanding, generation, and reasoning,” arXiv preprint arXiv:2512.12756, 2025

  67. [67]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,

C. Fu, Y. Dai, Y. Luo et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” in CVPR, 2025, pp. 24 108–24 118

  68. [68]

    Mmbench: Is your multi-modal model an all-around player?

Y. Liu, H. Duan, Y. Zhang et al., “Mmbench: Is your multi-modal model an all-around player?” in ECCV. Springer, 2024, pp. 216–233

  69. [69]

    Are we on the right way for evaluating large vision-language models?

L. Chen, J. Li, X. Dong et al., “Are we on the right way for evaluating large vision-language models?” NeurIPS, vol. 37, pp. 27 056–27 087, 2024

  70. [70]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

X. Yue, Y. Ni, K. Zhang et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in CVPR, 2024, pp. 9556–9567

  71. [71]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,

P. Lu, H. Bansal, T. Xia et al., “Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,” in ICLR, 2024

  72. [72]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,

T. Guan, F. Liu, X. Wu et al., “Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,” in CVPR, 2024, pp. 14 375–14 385

  73. [73]

    A diagram is worth a dozen images,

A. Kembhavi, M. Salvato, E. Kolve et al., “A diagram is worth a dozen images,” in ECCV. Springer, 2016, pp. 235–251

  74. [74]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S. Sakshi, U. Tyagi, S. Kumar et al., “Mmau: A massive multi-task audio understanding and reasoning benchmark,” arXiv preprint arXiv:2410.19168, 2024

  75. [75]

    Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,

Z. Ma, Y. Ma, Y. Zhu et al., “Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” arXiv preprint arXiv:2505.13032, 2025

  76. [76]

    High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz et al., “High-resolution image synthesis with latent diffusion models,” in CVPR, June 2022, pp. 10 684–10 695

  77. [77]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,

Z. Li, J. Zhang, Q. Lin et al., “Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,” arXiv preprint arXiv:2405.08748, 2024

  78. [78]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

C. Wu, X. Chen, Z. Wu et al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” arXiv preprint arXiv:2410.13848, 2024

  79. [79]

    Ovis-u1 technical report,

G.-H. Wang, S. Zhao, X. Zhang et al., “Ovis-u1 technical report,” arXiv preprint arXiv:2506.23044, 2025

  80. [80]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

C. Wu, P. Zheng, R. Yan et al., “Omnigen2: Exploration to advanced multimodal generation,” arXiv preprint arXiv:2506.18871, 2025

Showing first 80 references.