pith. machine review for the scientific record.

arxiv: 2602.07064 · v2 · submitted 2026-02-05 · 💻 cs.CV

Recognition: no theorem link

OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 07:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords OmniFysics · omni-modal network · physical intelligence · signal processing · FysicsAny · FysicsOmniCap · multimodal benchmarks · physics data engine

The pith

OmniFysics unifies omni-modal signals with physics laws to evolve AI physical intelligence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current AI models struggle with physical understanding because key signals are ambiguous and sparse in web-scale data. The paper presents OmniFysics, a compact network that processes images, audio, video, and text in one system. It adds a dynamic physical data engine containing FysicsAny, which maps salient objects to verified physical attributes through hierarchical retrieval and physics-law constraints, plus FysicsOmniCap, which distills web videos into high-fidelity pairs focused on dynamic physical cues. The network trains via staged multimodal alignment and evolutive instruction tuning that includes latent-space flow matching. A sympathetic reader would care because the approach aims to move AI from brittle data-driven perception toward reliable physical reasoning that could matter for real-world deployment.

Core claim

OmniFysics is a compact omni-modal network that unifies signal processing across images, audio, video, and text. Through a dynamic physical data engine built on the FysicsAny and FysicsOmniCap mechanisms, together with staged optimization and evolutive tuning, it injects explicit physical knowledge and achieves competitive performance on standard multimodal benchmarks while advancing physics-oriented evaluations.

What carries the argument

FysicsAny, the adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification.
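
To make this load-bearing mechanism concrete, here is a minimal, hypothetical sketch of what hierarchical retrieval plus physics-law-constrained verification could look like. The taxonomy paths, attribute ranges, and the density-consistency check are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical attribute table: taxonomy path -> plausible physical ranges (SI units).
# All categories and numbers are illustrative placeholders, not taken from the paper.
ATTRIBUTE_TABLE = {
    "object/container/bottle": {"mass_kg": (0.02, 2.0), "density_kg_m3": (800, 1400)},
    "object/sports/basketball": {"mass_kg": (0.5, 0.65), "density_kg_m3": (80, 120)},
}

@dataclass
class Candidate:
    category_path: str   # e.g. "object/sports/basketball", from salient-object perception
    mass_kg: float
    volume_m3: float

def hierarchical_retrieve(category_path: str) -> Optional[dict]:
    """Walk the taxonomy from most to least specific until a table entry is found."""
    parts = category_path.split("/")
    for depth in range(len(parts), 0, -1):
        entry = ATTRIBUTE_TABLE.get("/".join(parts[:depth]))
        if entry is not None:
            return entry
    return None

def law_constrained_verify(cand: Candidate, entry: dict, tol: float = 0.2) -> bool:
    """Accept a candidate only if the retrieved ranges and a simple physics law agree:
    density = mass / volume must land inside the retrieved density range (within tol)."""
    density = cand.mass_kg / max(cand.volume_m3, 1e-9)
    lo, hi = entry["density_kg_m3"]
    in_mass_range = entry["mass_kg"][0] <= cand.mass_kg <= entry["mass_kg"][1]
    return in_mass_range and lo * (1 - tol) <= density <= hi * (1 + tol)

def build_supervision(cand: Candidate) -> Optional[dict]:
    """Emit a physics-grounded supervision record, or None if verification fails."""
    entry = hierarchical_retrieve(cand.category_path)
    if entry is None or not law_constrained_verify(cand, entry):
        return None
    return {"category": cand.category_path, "mass_kg": cand.mass_kg,
            "density_kg_m3": cand.mass_kg / cand.volume_m3}

# A basketball-like candidate passes; an implausibly heavy one is rejected (None).
print(build_supervision(Candidate("object/sports/basketball", 0.62, 0.0071)))
print(build_supervision(Candidate("object/sports/basketball", 5.00, 0.0071)))
```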

If this is right

  • The system achieves competitive performance on standard multimodal benchmarks.
  • It significantly advances results on physics-oriented evaluations.
  • Latent-space flow matching integrates into the optimization for improved generation.
  • An adaptive intent router enables more efficient execution of the network (a hedged routing sketch follows this list).
  • The overall paradigm supports autonomous optimization of networked AI systems.
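
On the intent-router point above, the following is a minimal sketch of what modality- and task-based routing could look like; the intent names, required modalities, and pipeline labels are illustrative assumptions, not the paper's SA-IAR.

```python
# Hypothetical intent router: dispatch each request to the cheapest sub-pipeline that
# can serve it, falling back to the full omni-modal path. Purely illustrative.
INTENTS = {
    "speech_chat":   {"needs": {"audio"},         "pipeline": "audio_llm"},
    "image_qa":      {"needs": {"image", "text"}, "pipeline": "vision_llm"},
    "video_physics": {"needs": {"video"},         "pipeline": "full_omni"},
    "text_to_image": {"needs": {"text"},          "pipeline": "flow_generator"},
}

def route(request: dict) -> str:
    """Pick the declared intent whose required modalities are present in the request."""
    present = {m for m in ("text", "image", "audio", "video") if request.get(m)}
    for name, spec in INTENTS.items():
        if request.get("task") == name and spec["needs"] <= present:
            return spec["pipeline"]
    return "full_omni"  # default: run the full network

print(route({"task": "image_qa", "image": "frame.png", "text": "what material is this?"}))
```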

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same retrieval-plus-law-verification pattern could be adapted to other domains such as chemistry or biology where domain rules are well known.
  • Replacing the external retrieval step with an internal learned module might increase scalability while preserving the physics constraints.
  • Stronger physical intelligence could directly benefit downstream tasks like robotic planning or physics simulation where current models fail on basic dynamics.

Load-bearing premise

The FysicsAny mechanism can reliably map salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification without introducing systematic errors or biases from the retrieval process.

What would settle it

A controlled ablation that disables the physics-law-constrained verification step inside FysicsAny and then re-runs the physics-oriented evaluation benchmarks to check whether the reported performance gains disappear.
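
A self-contained toy illustration of that ablation logic follows: the same training and scoring code runs once with and once without a law-verification filter on the supervision pool. The linear "model", the corruption process, and the 10% accuracy threshold are stand-ins for exposition, not the paper's pipeline.

```python
import random

random.seed(0)

def make_supervision(n, law_check):
    """Simulate physics supervision where y = 2x is the true relation; without the
    law check, roughly half of the records keep corrupted attribute values."""
    records = []
    for _ in range(n):
        x = random.random()
        y = 2.0 * x
        if not law_check and random.random() < 0.5:
            y = random.uniform(1.0, 5.0)   # corrupted retrieval that was never filtered
        records.append((x, y))
    return records

def train_model(pairs):
    """'Model' = least-squares slope through the origin fitted to the supervision."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)

def evaluate(slope, benchmark, tol=0.10):
    """Score = fraction of benchmark items predicted within 10% relative error."""
    return sum(abs(slope * x - y) / y < tol for x, y in benchmark) / len(benchmark)

benchmark = [(x, 2.0 * x) for x in (random.uniform(0.1, 1.0) for _ in range(200))]
for law_check in (True, False):
    score = evaluate(train_model(make_supervision(1000, law_check)), benchmark)
    print(f"law verification={law_check}: physics score={score:.2f}")
```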

Figures

Figures reproduced from arXiv: 2602.07064 by Dingkang Yang, Lihua Zhang, Minghao Han, Yizhou Liu, Yue Jiang.

Figure 1
Figure 1: FysicsAny pipeline. Overview of the pipeline for constructing physics-aware supervision from web-scale data. It integrates object-centric perception, hierarchical knowledge retrieval, and physical-law-constrained verification to generate diverse instruction-image pairs for physical attribute supervision. view at source ↗
Figure 2
Figure 2: Quantitative comparison of Mean Relative Accuracy (MRA) in real-world physical attribute estimation. The chart evaluates the predictive performance of GPT-5 and the proposed FysicsAny pipeline across 11 diverse physical properties against instrument-measured ground truths. Higher MRA percentages indicate closer alignment with actual real-world physical values. FysicsAny consistently and significantly outperforms… (a hedged sketch of one common MRA formulation follows the figure list) view at source ↗
Figure 3
Figure 3: Overview of OmniFysics and training data distribution. (a) Model architecture. The model employs Temporal Multimodal Rotary Position Embedding to process interleaved sequences of images, audio, and text. For understanding, the Vision and Audio Encoders extract features to feed the LLM backbone. For the generation task, the Codec and VAE Encoder are utilized to assist the SpokenVoxer and Flow Head in synthesizing… view at source ↗
Figure 4
Figure 4: Training pipeline of OmniFysics. We employ a four-stage training strategy for OmniFysics to progressively enhance its omni-modal perception and physical understanding, including speech and text-to-image generation. view at source ↗
Figure 5
Figure 5: OmniFysics on Vision Understanding vs. Leading… view at source ↗
Figure 6
Figure 6: Physics-aware Generation. Mapping Physical Parameters to Faithful Materials and Simulating Accurate Scientific Phenomena. (Adjacent in the source: Table V, performance of OmniFysics on image generation benchmarks GenEval, DPG-Bench, and Science-T2I-S, compared to leading expert and omni models such as SDv1.5 [75], Hunyuan-DiT [76], Janus [77], and Ovis-U1 [78].) view at source ↗
Figure 7
Figure 7: Routing accuracy of the intent-aware router (SA-IAR) and its ablated variants across four distinct interaction scenarios, demonstrating the specific… view at source ↗
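
Figure 2 reports Mean Relative Accuracy (MRA). The paper's exact formula is not visible here, but one common formulation for physical-attribute estimation averages a relative-error indicator over a sweep of thresholds; the sketch below assumes that formulation and is not necessarily the paper's definition.

```python
def mean_relative_accuracy(preds, gts, thresholds=None):
    """MRA under one common (assumed) definition: at each threshold theta, a prediction
    counts as correct if its relative error is below 1 - theta; average over thresholds."""
    if thresholds is None:
        thresholds = [0.50 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95
    scores = []
    for theta in thresholds:
        hits = sum(abs(p - g) / abs(g) < (1 - theta) for p, g in zip(preds, gts))
        scores.append(hits / len(gts))
    return sum(scores) / len(scores)

# Example: three mass estimates (kg) against instrument-measured ground truth.
print(mean_relative_accuracy([0.60, 1.10, 7.5], [0.62, 1.00, 5.0]))   # -> 0.6
```
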
read the original abstract

The autonomous evolution of networked AI systems relies heavily on robust environmental perception. However, physical understanding remains brittle in current models because key physical signals are visually ambiguous and sparsely represented in web-scale data. To bridge the gap between data-centric learning and knowledge-based physical rules, we present OmniFysics, a compact omni-modal network that unifies signal processing and understanding across images, audio, video, and text. To enable autonomous optimization and inject explicit physical knowledge, we construct a dynamic physical data engine. Within this engine, FysicsAny acts as an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification. Concurrently, FysicsOmniCap distills web videos utilizing advanced audio-visual cross-modal signal processing, generating high-fidelity data pairs that emphasize dynamic physical cues. We optimize the OmniFysics network through staged multimodal alignment and evolutive instruction tuning, integrating latent-space flow matching for generation and an adaptive intent router for efficient execution. Experiments demonstrate that this evolutive optimization paradigm not only achieves competitive performance on standard multimodal benchmarks but also significantly advances physics-oriented evaluations.
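
The abstract's "latent-space flow matching for generation" points at a standard objective from the flow-matching literature: regress a velocity field along interpolation paths between noise and data latents. A minimal PyTorch sketch of that standard objective follows; the network shape, latent dimensions, and conditioning are placeholder assumptions, not the paper's Flow Head.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Placeholder velocity field v_theta(x_t, t, cond); not the paper's architecture."""
    def __init__(self, dim=64, cond_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t[:, None]], dim=-1))

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching with linear paths: x_t = (1 - t) * x0 + t * x1,
    regression target x1 - x0. Here x1 stands in for data latents (e.g. VAE codes)."""
    x0 = torch.randn_like(x1)                       # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)   # random time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1   # point on the straight path
    v_target = x1 - x0                              # constant velocity along that path
    return ((model(x_t, t, cond) - v_target) ** 2).mean()

# Toy usage with random latents and conditioning vectors.
model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(8, 64), torch.randn(8, 32))
loss.backward()
```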

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents OmniFysics, a compact omni-modal network unifying signal processing and understanding across images, audio, video, and text. It introduces a dynamic physical data engine containing FysicsAny, which produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification, and FysicsOmniCap, which distills web videos into high-fidelity data pairs emphasizing dynamic physical cues. The network is optimized through staged multimodal alignment, evolutive instruction tuning, latent-space flow matching for generation, and an adaptive intent router. The central claim is that this evolutive optimization paradigm achieves competitive performance on standard multimodal benchmarks while significantly advancing physics-oriented evaluations.

Significance. If the empirical claims are substantiated with quantitative evidence, the work could advance integration of explicit physical rules into multimodal AI systems, addressing brittleness in physical understanding from web-scale data. The combination of custom data engines for physics supervision and evolutive tuning offers a novel direction for autonomous optimization, though its broader impact hinges on demonstrating that the physics gains are not artifacts of the internal pipeline.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: The claim that experiments demonstrate 'competitive performance on standard multimodal benchmarks' and 'significantly advances physics-oriented evaluations' supplies no quantitative metrics, baselines, error bars, ablation studies, or specific benchmark scores. This absence leaves the central empirical claim without visible support and prevents assessment of the magnitude of any physics gains.
  2. [FysicsAny mechanism] FysicsAny mechanism (system description): The hierarchical retrieval plus physics-law-constrained verification is presented as producing reliable physics-grounded supervision, yet no error rates, bias measurements, retrieval accuracy ablations, or failure-case analysis are reported. Without these controls it is impossible to determine whether reported physics advances reflect genuine improvements or systematic artifacts from the retrieval process.
  3. [Evaluation] Evaluation setup: Performance on physics-oriented evaluations is measured using the custom FysicsAny and FysicsOmniCap engines that are defined, trained, and evaluated within the same manuscript. This creates a circularity problem; independent external physics benchmarks or cross-validation against established datasets are not shown, weakening the claim that the advances are generalizable.
minor comments (1)
  1. [Introduction / System Overview] The acronyms 'FysicsAny' and 'FysicsOmniCap' are introduced without explicit expansion or relation to prior terminology, which may hinder readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important gaps in the empirical support and evaluation rigor of our work. We address each major comment below and commit to revisions that will strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that experiments demonstrate 'competitive performance on standard multimodal benchmarks' and 'significantly advances physics-oriented evaluations' supplies no quantitative metrics, baselines, error bars, ablation studies, or specific benchmark scores. This absence leaves the central empirical claim without visible support and prevents assessment of the magnitude of any physics gains.

    Authors: We agree that the current manuscript version presents the experimental claims at a high level without the supporting numerical evidence. In the revised version we will expand the Experiments section to report concrete scores on standard multimodal benchmarks (including VQAv2, AudioCaps, and video understanding tasks), direct comparisons to relevant baselines, standard error bars computed over multiple random seeds, and full ablation tables isolating the dynamic physical data engine and evolutive tuning components. These additions will make the magnitude of both the competitive multimodal results and the physics-oriented gains quantitatively verifiable. revision: yes

  2. Referee: [FysicsAny mechanism] FysicsAny mechanism (system description): The hierarchical retrieval plus physics-law-constrained verification is presented as producing reliable physics-grounded supervision, yet no error rates, bias measurements, retrieval accuracy ablations, or failure-case analysis are reported. Without these controls it is impossible to determine whether reported physics advances reflect genuine improvements or systematic artifacts from the retrieval process.

    Authors: The referee is correct that validation metrics for FysicsAny are currently missing. The revised manuscript will include a dedicated analysis subsection reporting retrieval error rates, bias measurements stratified by object category and physical attribute, ablation results comparing accuracy with versus without the physics-law constraints, and a failure-case study with representative examples and quantitative breakdown of error types. These controls will allow readers to assess whether the physics supervision is reliable. revision: yes

  3. Referee: [Evaluation] Evaluation setup: Performance on physics-oriented evaluations is measured using the custom FysicsAny and FysicsOmniCap engines that are defined, trained, and evaluated within the same manuscript. This creates a circularity problem; independent external physics benchmarks or cross-validation against established datasets are not shown, weakening the claim that the advances are generalizable.

    Authors: We acknowledge the circularity concern. The revision will add evaluations on independent external physics benchmarks (e.g., Physion and established physical-reasoning datasets) that were not generated by our engines. We will also report results on held-out test splits and cross-validation protocols that separate data generation from final evaluation, thereby demonstrating that the observed physics gains generalize beyond the internal pipeline. revision: yes
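
One concrete way to operationalize the promised separation between data generation and final evaluation is an overlap audit over raw source identifiers. The sketch below is illustrative, not the authors' protocol; the field names ("source_id", "start", "end") are hypothetical.

```python
import hashlib

def fingerprint(sample: dict) -> str:
    """Hash the raw-source identifiers of a sample (e.g. video id plus timestamp span)."""
    key = f"{sample['source_id']}|{sample.get('start', '')}|{sample.get('end', '')}"
    return hashlib.sha256(key.encode()).hexdigest()

def check_decoupling(train_pool, eval_benchmark):
    """Flag evaluation items whose underlying source also fed the data engine."""
    train_fps = {fingerprint(s) for s in train_pool}
    leaked = [s for s in eval_benchmark if fingerprint(s) in train_fps]
    print(f"{len(leaked)} / {len(eval_benchmark)} evaluation items overlap the training pool")
    return leaked

# Toy usage: the shared source id is flagged; the unseen one is not.
train = [{"source_id": "vid_001", "start": 0, "end": 5}]
bench = [{"source_id": "vid_001", "start": 0, "end": 5},
         {"source_id": "vid_999", "start": 3, "end": 8}]
check_decoupling(train, bench)
```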

Circularity Check

1 step flagged

Physics-oriented evaluation gains reduce to the internal FysicsAny data engine by construction

specific steps
  1. fitted input called prediction [Abstract]
    "FysicsAny acts as an adaptive mechanism that produces physics-grounded supervision by mapping salient objects to verified physical attributes via hierarchical retrieval and physics-law-constrained signal verification. [...] Experiments demonstrate that this evolutive optimization paradigm not only achieves competitive performance on standard multimodal benchmarks but also significantly advances physics-oriented evaluations."

    The physics-oriented evaluation metric is advanced by the identical hierarchical-retrieval + law-constrained verification process that FysicsAny uses to create training supervision. No separate external physics benchmark or error-controlled ablation is cited; the 'advance' is therefore the output of the paper's own data engine, reducing the claimed result to a renaming of its input construction.

full rationale

The paper's headline result—that the OmniFysics paradigm 'significantly advances physics-oriented evaluations'—is produced by the same FysicsAny mechanism that generates the physics-grounded supervision used for training. Because the abstract presents no external benchmark, error bounds, or independent verification for the retrieval/verification pipeline, the reported physics gains are statistically forced by the paper's own data-construction step rather than emerging from an independent derivation or external test.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of two newly introduced data-generation modules whose internal parameters and verification rules are not detailed, plus standard assumptions from multimodal learning literature.

free parameters (1)
  • hyperparameters for staged multimodal alignment and evolutive instruction tuning
    Multiple training-stage parameters are implied but not enumerated or justified in the provided abstract.
axioms (1)
  • domain assumption: Physical attributes of objects can be reliably retrieved and verified through hierarchical search plus physics-law constraints
    Invoked to justify the FysicsAny supervision mechanism.
invented entities (3)
  • FysicsAny · no independent evidence
    purpose: Adaptive mechanism that maps objects to verified physical attributes
    New component introduced to generate physics-grounded supervision.
  • FysicsOmniCap · no independent evidence
    purpose: Distills web videos into high-fidelity audio-visual pairs emphasizing dynamic physical cues
    New component introduced for data generation.
  • OmniFysics network · no independent evidence
    purpose: Compact omni-modal backbone unifying image, audio, video, and text processing
    Core new model architecture proposed in the paper.

pith-pipeline@v0.9.0 · 5513 in / 1428 out tokens · 30332 ms · 2026-05-16T07:06:39.727821+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 26 internal anchors

  1. [1]

    Beyond language-specific neurons: The challenge of identifying speech-specific neurons in multimodal llms,

N. Park, C. H. Lee, J. Yeom et al., “Beyond language-specific neurons: The challenge of identifying speech-specific neurons in multimodal llms,” IEEE Journal of Selected Topics in Signal Processing, 2026

  2. [2]

    Rehazing for dehazing: A physics-guided parametric augmentation net,

C.-L. Chang, F.-J. Tsai, Z. Huang et al., “Rehazing for dehazing: A physics-guided parametric augmentation net,” IEEE Journal of Selected Topics in Signal Processing, 2025

  3. [3]

    Sgnet: Sequence grouping network via vision-language model for text-guided video summarization,

J. Yao, J. Zhang, and L. Zhuo, “Sgnet: Sequence grouping network via vision-language model for text-guided video summarization,” IEEE Journal of Selected Topics in Signal Processing, 2025

  4. [4]

    Cross-model adjudication for bias mitigation in large language models,

X. Li, C. Li, W. Liu et al., “Cross-model adjudication for bias mitigation in large language models,” IEEE Journal of Selected Topics in Signal Processing, 2026

  5. [5]

    GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  6. [6]

    Gemini: A Family of Highly Capable Multimodal Models

R. Anil, S. Borgeaud, J.-B. Alayrac et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

  7. [7]

    Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu et al., “Qwen2.5-vl technical report,” arXiv preprint arXiv:2502.13923, 2025

  8. [8]

    Seed1.5-VL Technical Report

D. Guo, F. Wu, F. Zhu et al., “Seed1.5-vl technical report,” arXiv preprint arXiv:2505.07062, 2025

  9. [9]

    Gemma 3 Technical Report

G. Team, A. Kamath, J. Ferret et al., “Gemma 3 technical report,” arXiv preprint arXiv:2503.19786, 2025

  10. [10]

    Qwen2.5-Omni Technical Report

J. Xu, Z. Guo, J. He et al., “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  11. [11]

    Step-Audio 2 Technical Report

B. Wu, C. Yan, C. Hu et al., “Step-audio 2 technical report,” arXiv preprint arXiv:2507.16632, 2025

  12. [12]

    Kling-Omni Technical Report

K. Team, J. Chen, Y. Ci et al., “Kling-omni technical report,” arXiv preprint arXiv:2512.16776, 2025

  13. [13]

    Longcat-flash-omni technical report,

M. L. Team, B. Wang, B. Xiao et al., “Longcat-flash-omni technical report,” arXiv preprint arXiv:2511.00279, 2025

  14. [14]

    Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation,

B. Ma, C. Zou, C. Yan et al., “Ming-flash-omni: A sparse, unified architecture for multimodal perception and generation,” arXiv preprint arXiv:2510.24821, 2025

  15. [15]

    Phybench: Holistic evaluation of physical perception and reasoning in large language models,

S. Qiu, S. Guo, Z.-Y. Song et al., “Phybench: Holistic evaluation of physical perception and reasoning in large language models,” arXiv preprint arXiv:2504.16074, 2025

  16. [16]

    Is sora a world simulator? a comprehensive survey on general world models and beyond,

Z. Zhu, X. Wang, W. Zhao et al., “Is sora a world simulator? a comprehensive survey on general world models and beyond,” arXiv preprint arXiv:2405.03520, 2024

  17. [17]

    Towards a physics foundation model,

F. Wiesner, M. Wessling, and S. Baek, “Towards a physics foundation model,” arXiv preprint arXiv:2509.13805, 2025

  18. [18]

    Do generative video models understand physical principles?

S. Motamed, L. Culp, K. Swersky et al., “Do generative video models understand physical principles?” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 948–958

  19. [19]

    Baichuan-omni-1.5 technical report,

Y. Li, J. Liu, T. Zhang et al., “Baichuan-omni-1.5 technical report,” arXiv preprint arXiv:2501.15368, 2025

  20. [20]

    Physbench: Benchmarking and enhancing vision-language models for physical world understanding,

W. Chow, J. Mao, B. Li et al., “Physbench: Benchmarking and enhancing vision-language models for physical world understanding,” in ICLR, 2025

  21. [21]

    Video generation models as world simulators,

OpenAI, “Video generation models as world simulators,” https://openai.com/research/video-generation-models-as-world-simulators, 2024, accessed: 2024-02-15

  22. [22]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan et al., “V-jepa 2: Self-supervised video models enable understanding, prediction and planning,” arXiv preprint arXiv:2506.09985, 2025

  23. [23]

    Intphys: A framework and benchmark for visual intuitive physics reasoning,

R. Riochet, M. Y. Castro, M. Bernard et al., “Intphys: A framework and benchmark for visual intuitive physics reasoning,” arXiv preprint arXiv:1803.07616, 2018

  24. [24]

    Clevrer: Collision events for video representation and reasoning,

K. Yi, C. Gan, Y. Li et al., “Clevrer: Collision events for video representation and reasoning,” in ICLR, 2020

  25. [25]

    Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models,

L. Puyin, T. Xiang, E. Mao et al., “Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models,” arXiv preprint arXiv:2512.19526, 2025

  26. [26]

    Abench-physics: Benchmarking physical reasoning in llms via high-difficulty and dynamic physics problems,

Y. Zhang, Y. Ma, Y. Gu et al., “Abench-physics: Benchmarking physical reasoning in llms via high-difficulty and dynamic physics problems,” arXiv preprint arXiv:2507.04766, 2025

  27. [27]

    Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models,

L. Wang, E. Su, J. Liu et al., “Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models,” arXiv preprint arXiv:2506.17667, 2025

  28. [28]

    Physreason: A comprehensive benchmark towards physics-based reasoning,

X. Zhang, Y. Dong, Y. Wu et al., “Physreason: A comprehensive benchmark towards physics-based reasoning,” in ACL, 2025, pp. 16 593–16 615

  29. [29]

    Seephys: Does seeing help thinking?–benchmarking vision-based physics reasoning,

K. Xiang, H. Li, T. J. Zhang et al., “Seephys: Does seeing help thinking?–benchmarking vision-based physics reasoning,” arXiv preprint arXiv:2505.19099, 2025

  30. [30]

    Phystoolbench: Benchmarking physical tool understanding for mllms,

Z. Zhang, K. Chen, X. Lin et al., “Phystoolbench: Benchmarking physical tool understanding for mllms,” arXiv preprint arXiv:2510.09507, 2025

  31. [31]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,

A. Kuznetsova, H. Rom, N. Alldrin et al., “The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,” IJCV, vol. 128, no. 7, pp. 1956–1981, 2020

  32. [32]

    Introducing GPT-5,

    OpenAI, “Introducing GPT-5,” Aug. 2025, accessed: 2025-11-03. [Online]. Available: https://openai.com/index/introducing-gpt-5/

  33. [33]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Y. Zhang, M. Li, D. Long et al., “Qwen3 embedding: Advancing text embedding and reranking through foundation models,” arXiv preprint arXiv:2506.05176, 2025

  34. [34]

    Qwen3-VL Technical Report

S. Bai, Y. Cai, R. Chen et al., “Qwen3-vl technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2511.21631

  35. [35]

    Thinking in space: How multimodal large language models see, remember, and recall spaces,

J. Yang, S. Yang, A. W. Gupta et al., “Thinking in space: How multimodal large language models see, remember, and recall spaces,” in CVPR, 2025, pp. 10 632–10 643

  36. [36]

Vggsound: A large-scale audio-visual dataset,

H. Chen, W. Xie, A. Vedaldi et al., “Vggsound: A large-scale audio-visual dataset,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 721–725

  37. [37]

    Imagebind: One embedding space to bind them all,

R. Girdhar, A. El-Nouby, Z. Liu et al., “Imagebind: One embedding space to bind them all,” in CVPR, 2023, pp. 15 180–15 190

  38. [38]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  39. [39]

    Qwen2-Audio Technical Report

Y. Chu, J. Xu, Q. Yang et al., “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024

  40. [40]

    Pearson correlation coefficient,

J. Benesty, J. Chen, Y. Huang et al., “Pearson correlation coefficient,” in Noise reduction in speech processing. Springer, 2009

  41. [41]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

M. Tschannen, A. Gritsenko, X. Wang et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025

  42. [42]

    Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu et al., “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28 492–28 518

  43. [43]

    Qwen2.5 Technical Report

Qwen, A. Yang et al., “Qwen2.5 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2412.15115

  44. [44]

    Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,

S. Ji, Z. Jiang, W. Wang et al., “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,” in ICLR, 2025

  45. [45]

    Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in ICCV, 2023, pp. 4195–4205

  46. [46]

    Wan: Open and Advanced Large-Scale Video Generative Models

T. Wan, A. Wang, B. Ai et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  47. [47]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z. Du, Y. Wang, Q. Chen et al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024

  48. [48]

    Flow Matching for Generative Modeling

Y. Lipman, R. T. Chen, H. Ben-Hamu et al., “Flow matching for generative modeling,” arXiv preprint arXiv:2210.02747, 2022

  49. [49]

    Laion-sg: An enhanced large-scale dataset for training complex image-text models with structural annotations,

Z. Li, C. Meng, Y. Li et al., “Laion-sg: An enhanced large-scale dataset for training complex image-text models with structural annotations,”

  50. [50]

    Available: https://arxiv.org/abs/2412.08580

    [Online]. Available: https://arxiv.org/abs/2412.08580

  51. [51]

    Introducing claude haiku 4.5,

Anthropic, “Introducing claude haiku 4.5,” https://www.anthropic.com/news/claude-haiku-4-5, Oct. 2025, accessed: 2026-01-29

  52. [52]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

W. Wang, Z. Gao, L. Gu et al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025

  53. [53]

    Ovis2.5 Technical Report

S. Lu, Y. Li, Y. Xia et al., “Ovis2.5 technical report,” arXiv preprint arXiv:2508.11737, 2025

  54. [54]

    Sail-vl2 technical report,

W. Yin, Y. Ye, F. Shu et al., “Sail-vl2 technical report,” arXiv preprint arXiv:2509.14033, 2025

  55. [55]

    Qwen3-Omni Technical Report

J. Xu, Z. Guo, H. Hu et al., “Qwen3-omni technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2509.17765

  56. [56]

Omnivinci: Enhancing architecture and data for omni-modal understanding llm,

H. Ye, C.-H. H. Yang, A. Goel et al., “Omnivinci: Enhancing architecture and data for omni-modal understanding llm,” arXiv preprint arXiv:2510.15870, 2025

  57. [57]

    Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action,

J. Lu, C. Clark, S. Lee et al., “Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action,” in CVPR, 2024, pp. 26 439–26 455

  58. [58]

    Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference,

B. Warner, A. Chaffin, B. Clavié et al., “Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference,” in ACL, 2025, pp. 2526–2547

  59. [59]

    Pai-bench: A comprehensive benchmark for physical ai,

F. Zhou, J. Huang, J. Li et al., “Pai-bench: A comprehensive benchmark for physical ai,” arXiv preprint arXiv:2512.01989, 2025

  60. [60]

    Kimi-Audio Technical Report

D. Ding, Z. Ju, Y. Leng et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025

  61. [61]

Audio-reasoner: Improving reasoning capability in large audio language models,

Z. Xie, M. Lin, Z. Liu et al., “Audio-reasoner: Improving reasoning capability in large audio language models,” arXiv preprint arXiv:2503.02318, 2025

  62. [62]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

S. Ghosh, Z. Kong, S. Kumar et al., “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” arXiv preprint arXiv:2503.03983, 2025

  63. [63]

    Omnibench: Towards the future of universal omni-language models,

Y. Li, Y. Ma, G. Zhang et al., “Omnibench: Towards the future of universal omni-language models,” in NeurIPS, 2025

  64. [64]

    Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,

J. Hong, S. Yan, J. Cai et al., “Worldsense: Evaluating real-world omnimodal understanding for multimodal llms,” in ICLR, 2026

  65. [65]

    Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,

Z. Zhou, R. Wang, and Z. Wu, “Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities,” arXiv preprint arXiv:2505.17862, 2025

  66. [66]

    Fysicsworld: A unified full-modality benchmark for any-to-any understanding, generation, and reasoning,

Y. Jiang, D. Yang, M. Han et al., “Fysicsworld: A unified full-modality benchmark for any-to-any understanding, generation, and reasoning,” arXiv preprint arXiv:2512.12756, 2025

  67. [67]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,

C. Fu, Y. Dai, Y. Luo et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” in CVPR, 2025, pp. 24 108–24 118

  68. [68]

    Mmbench: Is your multi-modal model an all-around player?

Y. Liu, H. Duan, Y. Zhang et al., “Mmbench: Is your multi-modal model an all-around player?” in ECCV. Springer, 2024, pp. 216–233

  69. [69]

    Are we on the right way for evaluating large vision-language models?

L. Chen, J. Li, X. Dong et al., “Are we on the right way for evaluating large vision-language models?” NeurIPS, vol. 37, pp. 27 056–27 087, 2024

  70. [70]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

X. Yue, Y. Ni, K. Zhang et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in CVPR, 2024, pp. 9556–9567

  71. [71]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,

P. Lu, H. Bansal, T. Xia et al., “Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,” in ICLR, 2024

  72. [72]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,

T. Guan, F. Liu, X. Wu et al., “Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models,” in CVPR, 2024, pp. 14 375–14 385

  73. [73]

    A diagram is worth a dozen images,

A. Kembhavi, M. Salvato, E. Kolve et al., “A diagram is worth a dozen images,” in ECCV. Springer, 2016, pp. 235–251

  74. [74]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S. Sakshi, U. Tyagi, S. Kumar et al., “Mmau: A massive multi-task audio understanding and reasoning benchmark,” arXiv preprint arXiv:2410.19168, 2024

  75. [75]

    Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,

Z. Ma, Y. Ma, Y. Zhu et al., “Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” arXiv preprint arXiv:2505.13032, 2025

  76. [76]

    High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz et al., “High-resolution image synthesis with latent diffusion models,” in CVPR, June 2022, pp. 10 684–10 695

  77. [77]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,

Z. Li, J. Zhang, Q. Lin et al., “Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,” arXiv preprint arXiv:2405.08748, 2024

  78. [78]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

C. Wu, X. Chen, Z. Wu et al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” arXiv preprint arXiv:2410.13848, 2024

  79. [79]

    Ovis-u1 technical report,

G.-H. Wang, S. Zhao, X. Zhang et al., “Ovis-u1 technical report,” arXiv preprint arXiv:2506.23044, 2025

  80. [80]

    OmniGen2: Towards Instruction-Aligned Multimodal Generation

C. Wu, P. Zheng, R. Yan et al., “Omnigen2: Exploration to advanced multimodal generation,” arXiv preprint arXiv:2506.18871, 2025

Showing first 80 references.