OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
Pith reviewed 2026-05-16 07:06 UTC · model grok-4.3
The pith
OmniFysics unifies omni-modal signals with physics laws to evolve AI physical intelligence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniFysics is a compact omni-modal network that unifies signal processing across images, audio, video, and text. It uses a dynamic physical data engine with the FysicsAny and FysicsOmniCap mechanisms to inject explicit physical knowledge, and through staged optimization and evolutive tuning it achieves competitive performance on standard multimodal benchmarks while advancing physics-oriented evaluations.
What carries the argument
FysicsAny: the adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification.
If this is right
- The system achieves competitive performance on standard multimodal benchmarks.
- It significantly advances results on physics-oriented evaluations.
- Latent-space flow matching integrates into the optimization for improved generation.
- An adaptive intent router enables more efficient execution of the network.
- The overall paradigm supports autonomous optimization of networked AI systems.
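The latent-space flow matching mentioned above is, in its generic form, a regression of a predicted velocity field onto the straight-line displacement between a noise sample and a data sample. The sketch below is that generic objective only, not the paper's implementation; the `oracle` model and the specific vectors are invented for illustration.

```python
def cfm_loss(model, x0, x1, t):
    """Conditional flow matching: evaluate the model's predicted velocity at
    the interpolated point x_t and regress it toward the constant straight-line
    target x1 - x0 (mean squared error over dimensions)."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # linear interpolation path
    target = [b - a for a, b in zip(x0, x1)]            # constant target velocity
    pred = model(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x0)

# A toy "oracle" that already outputs the correct velocity field, so its loss is zero
def oracle(xt, t):
    return [1.0, 2.0]  # equals x1 - x0 for the pair below

loss = cfm_loss(oracle, [0.0, 0.0], [1.0, 2.0], 0.5)
print(loss)  # 0.0
```

A model that instead predicts the zero field on the same pair incurs a loss of (1² + 2²)/2 = 2.5, which is what training would push down.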
Where Pith is reading between the lines
- The same retrieval-plus-law-verification pattern could be adapted to other domains such as chemistry or biology where domain rules are well known.
- Replacing the external retrieval step with an internal learned module might increase scalability while preserving the physics constraints.
- Stronger physical intelligence could directly benefit downstream tasks like robotic planning or physics simulation where current models fail on basic dynamics.
Load-bearing premise
The FysicsAny mechanism can reliably map salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification without introducing systematic errors or biases from the retrieval process.
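The premise can be made concrete with a minimal sketch of what "physics-law-constrained verification" of retrieved attributes might look like. The paper does not specify its laws or schema; the density–mass–volume consistency check, the field names, and the tolerance below are all invented stand-ins.

```python
def verify_candidates(candidates, tol=0.05):
    """Keep only retrieved attribute sets that are internally consistent with
    a physical law -- here, density = mass / volume -- within a relative
    tolerance. A stand-in for the paper's unspecified verification step."""
    verified = []
    for c in candidates:
        implied_density = c["mass_kg"] / c["volume_m3"]
        rel_err = abs(implied_density - c["density_kg_m3"]) / c["density_kg_m3"]
        if rel_err <= tol:
            verified.append(c)
    return verified

# Two retrieval hits for the same object; the second carries an inconsistent density
retrieved = [
    {"object": "steel ball", "mass_kg": 7.8, "volume_m3": 0.001, "density_kg_m3": 7800},
    {"object": "steel ball", "mass_kg": 7.8, "volume_m3": 0.001, "density_kg_m3": 2700},
]
print([c["density_kg_m3"] for c in verify_candidates(retrieved)])  # [7800]
```

The premise is precisely that such checks catch retrieval errors without introducing their own systematic bias; the referee's point is that this is asserted, not measured.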
What would settle it
A controlled ablation that disables the physics-law-constrained verification step inside FysicsAny and then re-runs the physics-oriented evaluation benchmarks to check whether the reported performance gains disappear.
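The proposed ablation amounts to a single toggle: score the supervision pipeline with the verification filter on and off and compare. A minimal harness on synthetic data (the `law_residual` field and the labels are invented for illustration; they are not from the paper):

```python
def supervision_accuracy(pairs, verify):
    """Fraction of supervision pairs whose label matches ground truth,
    optionally after a law-based verification filter (the ablation toggle)."""
    if verify:
        # Stand-in check: keep pairs whose recorded physics-law residual is small
        pairs = [p for p in pairs if abs(p["law_residual"]) < 0.1]
    if not pairs:
        return 0.0
    correct = sum(p["label"] == p["truth"] for p in pairs)
    return correct / len(pairs)

# Synthetic pool in which the one wrong label also violates the law constraint
pool = [
    {"label": "falls",  "truth": "falls",  "law_residual": 0.01},
    {"label": "floats", "truth": "floats", "law_residual": 0.02},
    {"label": "floats", "truth": "sinks",  "law_residual": 0.90},  # retrieval error
    {"label": "falls",  "truth": "falls",  "law_residual": 0.03},
]
print(supervision_accuracy(pool, verify=True))   # 1.0
print(supervision_accuracy(pool, verify=False))  # 0.75
```

If the paper's physics gains survive with `verify=False`, the verification step is not load-bearing; if they collapse, the claimed mechanism is doing real work.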
Original abstract
The autonomous evolution of networked AI systems relies heavily on robust environmental perception. However, physical understanding remains brittle in current models because key physical signals are visually ambiguous and sparsely represented in web-scale data. To bridge the gap between data-centric learning and knowledge-based physical rules, we present OmniFysics, a compact omni-modal network that unifies signal processing and understanding across images, audio, video, and text. To enable autonomous optimization and inject explicit physical knowledge, we construct a dynamic physical data engine. Within this engine, FysicsAny acts as an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification. Concurrently, FysicsOmniCap distills web videos utilizing advanced audio-visual cross-modal signal processing, generating high-fidelity data pairs that emphasize dynamic physical cues. We optimize the OmniFysics network through staged multimodal alignment and evolutive instruction tuning, integrating latent-space flow matching for generation and an adaptive intent router for efficient execution. Experiments demonstrate that this evolutive optimization paradigm not only achieves competitive performance on standard multimodal benchmarks but also significantly advances physics-oriented evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents OmniFysics, a compact omni-modal network unifying signal processing and understanding across images, audio, video, and text. It introduces a dynamic physical data engine containing FysicsAny, which produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification, and FysicsOmniCap, which distills web videos into high-fidelity data pairs emphasizing dynamic physical cues. The network is optimized through staged multimodal alignment, evolutive instruction tuning, latent-space flow matching for generation, and an adaptive intent router. The central claim is that this evolutive optimization paradigm achieves competitive performance on standard multimodal benchmarks while significantly advancing physics-oriented evaluations.
Significance. If the empirical claims are substantiated with quantitative evidence, the work could advance integration of explicit physical rules into multimodal AI systems, addressing brittleness in physical understanding from web-scale data. The combination of custom data engines for physics supervision and evolutive tuning offers a novel direction for autonomous optimization, though its broader impact hinges on demonstrating that the physics gains are not artifacts of the internal pipeline.
Major comments (3)
- [Abstract / Experiments] Abstract and Experiments section: The claim that experiments demonstrate 'competitive performance on standard multimodal benchmarks' and 'significantly advances physics-oriented evaluations' supplies no quantitative metrics, baselines, error bars, ablation studies, or specific benchmark scores. This absence leaves the central empirical claim without visible support and prevents assessment of the magnitude of any physics gains.
- [FysicsAny mechanism] FysicsAny mechanism (system description): The hierarchical retrieval plus physics-law-constrained verification is presented as producing reliable physics-grounded supervision, yet no error rates, bias measurements, retrieval accuracy ablations, or failure-case analysis are reported. Without these controls it is impossible to determine whether reported physics advances reflect genuine improvements or systematic artifacts from the retrieval process.
- [Evaluation] Evaluation setup: Performance on physics-oriented evaluations is measured using the custom FysicsAny and FysicsOmniCap engines that are defined, trained, and evaluated within the same manuscript. This creates a circularity problem; independent external physics benchmarks or cross-validation against established datasets are not shown, weakening the claim that the advances are generalizable.
Minor comments (1)
- [Introduction / System Overview] The acronyms 'FysicsAny' and 'FysicsOmniCap' are introduced without explicit expansion or relation to prior terminology, which may hinder readability for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important gaps in the empirical support and evaluation rigor of our work. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.
Point-by-point responses
-
Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that experiments demonstrate 'competitive performance on standard multimodal benchmarks' and 'significantly advances physics-oriented evaluations' supplies no quantitative metrics, baselines, error bars, ablation studies, or specific benchmark scores. This absence leaves the central empirical claim without visible support and prevents assessment of the magnitude of any physics gains.
Authors: We agree that the current manuscript version presents the experimental claims at a high level without the supporting numerical evidence. In the revised version we will expand the Experiments section to report concrete scores on standard multimodal benchmarks (including VQAv2, AudioCaps, and video understanding tasks), direct comparisons to relevant baselines, standard error bars computed over multiple random seeds, and full ablation tables isolating the dynamic physical data engine and evolutive tuning components. These additions will make the magnitude of both the competitive multimodal results and the physics-oriented gains quantitatively verifiable. revision: yes
-
Referee: [FysicsAny mechanism] FysicsAny mechanism (system description): The hierarchical retrieval plus physics-law-constrained verification is presented as producing reliable physics-grounded supervision, yet no error rates, bias measurements, retrieval accuracy ablations, or failure-case analysis are reported. Without these controls it is impossible to determine whether reported physics advances reflect genuine improvements or systematic artifacts from the retrieval process.
Authors: The referee is correct that validation metrics for FysicsAny are currently missing. The revised manuscript will include a dedicated analysis subsection reporting retrieval error rates, bias measurements stratified by object category and physical attribute, ablation results comparing accuracy with versus without the physics-law constraints, and a failure-case study with representative examples and quantitative breakdown of error types. These controls will allow readers to assess whether the physics supervision is reliable. revision: yes
-
Referee: [Evaluation] Evaluation setup: Performance on physics-oriented evaluations is measured using the custom FysicsAny and FysicsOmniCap engines that are defined, trained, and evaluated within the same manuscript. This creates a circularity problem; independent external physics benchmarks or cross-validation against established datasets are not shown, weakening the claim that the advances are generalizable.
Authors: We acknowledge the circularity concern. The revision will add evaluations on independent external physics benchmarks (e.g., Physion and established physical-reasoning datasets) that were not generated by our engines. We will also report results on held-out test splits and cross-validation protocols that separate data generation from final evaluation, thereby demonstrating that the observed physics gains generalize beyond the internal pipeline. revision: yes
Circularity Check
Physics-oriented evaluation gains reduce to internal FysicsAny data engine by construction
Specific steps
- Pattern: fitted input presented as a prediction [Abstract]
"FysicsAny acts as an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification. [...] Experiments demonstrate that this evolutive optimization paradigm not only achieves competitive performance on standard multimodal benchmarks but also significantly advances physics-oriented evaluations."
The physics-oriented evaluation metric is advanced by the identical hierarchical-retrieval + law-constrained verification process that FysicsAny uses to create training supervision. No separate external physics benchmark or error-controlled ablation is cited; the 'advance' is therefore the output of the paper's own data engine, reducing the claimed result to a renaming of its input construction.
Full rationale
The paper's headline result—that the OmniFysics paradigm 'significantly advances physics-oriented evaluations'—is produced by the same FysicsAny mechanism that generates the physics-grounded supervision used for training. Because the abstract presents no external benchmark, error bounds, or independent verification for the retrieval/verification pipeline, the reported physics gains are statistically forced by the paper's own data-construction step rather than emerging from an independent derivation or external test.
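The circularity can be illustrated with a toy simulation (entirely invented; none of these tables or attributes come from the paper): a model that memorizes the output of the data engine scores perfectly on a benchmark built by that same engine, while an independent benchmark exposes the engine's systematic error.

```python
def accuracy(model, benchmark):
    """Fraction of benchmark items the model answers correctly."""
    return sum(model(x) == y for x, y in benchmark.items()) / len(benchmark)

# The data engine's object -> attribute table, containing one systematic error
engine = {"ball": "elastic", "glass": "brittle", "honey": "elastic"}  # honey is actually viscous
model = engine.get  # a model that memorized the engine's supervision

internal_eval = engine  # benchmark generated by the same engine
external_eval = {"ball": "elastic", "glass": "brittle", "honey": "viscous"}

print(accuracy(model, internal_eval))  # 1.0 -- the engine's error is invisible
print(accuracy(model, external_eval))  # 2/3 -- an independent check exposes it
```

This is why the review asks for external physics benchmarks: only evaluation data constructed independently of FysicsAny can distinguish genuine physical understanding from agreement with the engine's own biases.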
Axiom & Free-Parameter Ledger
Free parameters (1)
- hyperparameters for staged multimodal alignment and evolutive instruction tuning
Axioms (1)
- Domain assumption: Physical attributes of objects can be reliably retrieved and verified through hierarchical search plus physics-law constraints.
Invented entities (3)
- FysicsAny: no independent evidence
- FysicsOmniCap: no independent evidence
- OmniFysics network: no independent evidence