arxiv: 2606.25634 · v1 · pith:7NXJU7R2new · submitted 2026-06-24 · 💻 cs.CV

SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity

Pith reviewed 2026-06-25 21:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords cross-view understanding · multimodal large language models · benchmark · human-object interaction · single-view sufficiency · multi-view necessity · diagnostic evaluation · view perturbation

The pith

Modern MLLMs average semantics from single views and prefer certain angles rather than synthesizing cross-view geometric evidence for human-object scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SSMNBench to test whether multimodal large language models can genuinely integrate information across multiple camera views in human-centric scenes. It separates tasks into Single-View Sufficiency, where one image contains all needed information, and Multi-View Necessity, where evidence must be fused across views. Experiments on 17 models show performance drops when extra views are added to sufficient tasks and failure to combine fragmented details in necessary tasks. This pattern indicates the models rely on averaging single-image semantics or favoring particular views instead of performing true cross-view synthesis. The benchmark uses view-perturbation protocols to isolate these behaviors from mere distraction.

Core claim

SSMNBench comprises 3,300 curated QA pairs on cross-view human and human-object understanding, organized into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN) tasks. Systematically perturbing view availability across 17 state-of-the-art MLLMs reveals severe distraction degradation on SVS tasks and an inability to integrate fragmented geometric evidence on MVN tasks. The paper takes this as evidence that the models average the semantics of individual images and favor particular views rather than performing genuine cross-view synthesis.

What carries the argument

The SVS/MVN task categorization, combined with a systematic view-perturbation protocol that controls which camera frames are available while holding the question content fixed.
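
A minimal sketch of what such a view-availability perturbation could look like for one QA item. The paper's actual protocol is not reproduced on this page, so the function name `perturb_views`, the condition names, and the choice to drop the first required view are all assumptions made for illustration.

```python
import random
from typing import Dict, List


def perturb_views(all_views: List[str], required: List[str],
                  n_extra: int, seed: int = 0) -> Dict[str, List[str]]:
    """Build hypothetical view-availability conditions for one QA item.

    The question text is never touched; only the set of frames changes.
    """
    rng = random.Random(seed)
    # Views that are not needed to answer the question act as distractors.
    distractors = [v for v in all_views if v not in required]
    extra = rng.sample(distractors, min(n_extra, len(distractors)))

    with_extra = list(required) + extra
    rng.shuffle(with_extra)  # avoid always placing required views first

    conditions = {
        "required_only": list(required),
        "required_plus_extra": with_extra,
    }
    if len(required) > 1:
        # Assumption: removing one required view should make a
        # multi-view-necessary question unanswerable from the remainder.
        conditions["required_minus_one"] = list(required[1:])
    return conditions


if __name__ == "__main__":
    cams = ["cam1", "cam2", "cam3", "cam4"]
    print(perturb_views(cams, required=["cam2"], n_extra=2))
    print(perturb_views(cams, required=["cam1", "cam3"], n_extra=1))
```

Comparing accuracy between `required_only` and `required_plus_extra` would probe distraction, while `required_minus_one` would probe whether every required view is actually used.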

If this is right

  • Existing multi-view benchmarks that supply a fixed bag of frames cannot distinguish robustness to distraction from actual cross-view fusion.
  • Architectures for future MLLMs must incorporate explicit mechanisms for geometric evidence integration rather than relying on semantic averaging.
  • Evaluation protocols for cross-view understanding should always include both sufficiency and necessity conditions to avoid conflating separate abilities.
  • Progress on human-centric scene reasoning requires diagnostic tests that perturb view availability rather than simply increasing the number of input frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diagnostic split could be applied to non-human scenes or video sequences to test whether the averaging behavior is specific to static human-object views.
  • If models continue to exhibit view preference, training objectives that penalize reliance on single dominant frames might be needed.
  • Robotics or AR applications that depend on multi-camera fusion would face reliability limits until cross-view synthesis improves.

Load-bearing premise

The curated QA pairs and view-perturbation protocol isolate genuine cross-view synthesis ability from single-view semantic leakage or visual distraction effects.

What would settle it

A model that shows no performance drop on SVS tasks when redundant views are added and simultaneously improves on MVN tasks by correctly combining geometric details across views would falsify the claim.
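
The criterion above can be written as a simple check, sketched here under stated assumptions. The function name `would_falsify`, the tolerance `tol`, and the choice of the single-view MVN accuracy as the baseline for "improves" are not taken from the paper; the accuracies in the usage lines are placeholders, not reported results.

```python
def would_falsify(svs_single: float, svs_multi: float,
                  mvn_single: float, mvn_multi: float,
                  tol: float = 0.01) -> bool:
    """Return True if a model's accuracies contradict the paper's claim.

    svs_single / svs_multi: SVS accuracy with one sufficient view vs. with
        redundant views added.
    mvn_single / mvn_multi: MVN accuracy with one view vs. with all
        required views (assumed baseline for "improves").
    tol: hypothetical tolerance for treating a difference as noise.
    """
    no_distraction_drop = (svs_single - svs_multi) <= tol
    gains_from_fusion = (mvn_multi - mvn_single) > tol
    return no_distraction_drop and gains_from_fusion


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    print(would_falsify(0.80, 0.80, 0.40, 0.60))  # robust and fusing
    print(would_falsify(0.80, 0.70, 0.40, 0.60))  # drops on SVS
```

Both conditions must hold at once, which mirrors the text: robustness to redundant views alone, or multi-view gains alone, would not settle the question.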

Figures

Figures reproduced from arXiv: 2606.25634 by Chen Liu, Ling Chen, Tianchen Guo, Xin Yu.

Figure 1: Illustration of the SSMNBench curation pipeline. The construction process begins by collecting dense, occlusion-heavy multi-view scenes and defining 11 distinct SVS and MVN tasks. Next, experts annotate QA pairs alongside their necessary ground-truth views. Finally, we generate structured distractors to mitigate linguistic priors, enforce strict quality control via blind verification, and randomize input …
Figure 2: Visual examples of the 11 tasks in SSMNBench, categorized by their reliance on view sufficiency. SVS tasks can be resolved with a single clear view, while MVN tasks require synthesizing information from multiple viewpoints to overcome occlusion and ambiguity. – Human-Object Contact – Identifying precise physical interactions between individuals and objects, distinguishing actual contact from near-misses, …
Figure 3: Illustration of view variation in SSMNBench’s SVS and MVN settings. (e.g., identifying which entity is closer to a target). This requires cross-view triangulation to resolve monocular depth ambiguity. – Global Orientation Identification – Identifying the global directional facing of a human’s torso or head by linking visual cues across multiple viewpoints (e.g., determining which object or direction a pers…
Figure 4: Qualitative examples of typical model failure cases. (a) In the SVS task, the model hallucinates fine-grained interaction details, incorrectly concluding that the subject is maintaining a grip with both hands. (b) In the MVN task, the model fails to synthesize cross-view geometric evidence, incorrectly determining the subject’s global orientation by over-relying on the deceptive perspective from a single …
Figure 5: Annotation Interface (Part 1)
Figure 6: Annotation Interface (Part 2)
Figure 7: An illustrative example of the JSON annotation structure utilized in the benchmark. Each entry encapsulates the scene metadata, the natural language query, four multiple-choice options, the ground-truth answer, and the specific camera views required for inference.
Figure 8: SSMNBench Examples (Part 1)
Figure 9: SSMNBench Examples (Part 2)
Figure 10: SSMNBench Examples (Part 3)
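
The Figure 7 caption lists what each annotation entry contains: scene metadata, a query, four options, the ground-truth answer, and the required camera views. A rough sketch of how such an entry might be loaded and checked follows. The figure itself is not reproduced here, so every field name, the assumption that the answer is stored as the option text, and the sample values are hypothetical rather than the benchmark's real schema or data.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Annotation:
    """One benchmark entry; every field name here is hypothetical."""
    scene: Dict[str, str]       # scene metadata
    question: str               # natural language query
    options: List[str]          # four multiple-choice options
    answer: str                 # ground truth (assumed to be an option text)
    required_views: List[str]   # camera views needed to answer

    def validate(self) -> None:
        if len(self.options) != 4:
            raise ValueError("expected exactly four options")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the options")
        if not self.required_views:
            raise ValueError("at least one required view is needed")


def load_annotation(text: str) -> Annotation:
    """Parse a JSON string into an Annotation and validate it."""
    entry = Annotation(**json.loads(text))
    entry.validate()
    return entry


# Made-up entry for illustration; not taken from the dataset.
EXAMPLE = """
{
  "scene": {"id": "scene_0001", "task": "human_object_contact"},
  "question": "Which object is the person touching?",
  "options": ["cup", "chair", "laptop", "none"],
  "answer": "cup",
  "required_views": ["cam_2"]
}
"""

if __name__ == "__main__":
    entry = load_annotation(EXAMPLE)
    print(entry.question, entry.required_views)
```

Storing the required views alongside each question is what makes a view-perturbation evaluation possible, since the evaluator can tell which frames are essential and which are redundant.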
Original abstract

Multimodal Large Language Models (MLLMs) have shown remarkable progress in single-image perception, yet their ability to reason about complex cross-view human-centric scenes remains largely unverified. Current multi-view benchmarks evaluate models using a fixed "bag of frames" and thus conflate a model's robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence. To address this issue, we introduce SSMNBench, a diagnostic benchmark comprising 3,300 curated QA pairs for cross-view human and human-object understanding. SSMNBench uniquely categorizes tasks into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). By systematically perturbing view availability across 17 state-of-the-art MLLMs, critical limitations are revealed: models suffer from severe "distraction degradation" when presented with redundant views (SVS), and fail to integrate fragmented geometric evidence across cameras (MVN). Our evaluations demonstrate that modern MLLMs rely on multiple single-image semantic averaging and view preference rather than genuine cross-view synthesis. By exposing these fundamental vulnerabilities, SSMNBench provides a rigorous diagnostic framework to drive the advancement of future cross-view-aware multimodal architectures. The code is available at: https://github.com/gtc-gh/SSMNBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SSMNBench, a diagnostic benchmark of 3,300 curated QA pairs for cross-view human and human-object understanding in MLLMs. Tasks are partitioned into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). Systematic view-perturbation experiments across 17 state-of-the-art MLLMs reveal distraction degradation on SVS tasks and failure to integrate fragmented geometric evidence on MVN tasks, leading to the conclusion that models rely on single-image semantic averaging and view preference rather than genuine cross-view synthesis. Code is released at the provided GitHub link.

Significance. If the benchmark's partitioning and perturbation protocol correctly isolate cross-view synthesis from single-view leakage and distraction, the work supplies a useful diagnostic tool for exposing limitations in current MLLMs and motivating cross-view-aware architectures. The scale of the evaluation and public code release are concrete strengths that support reproducibility and follow-on research.

minor comments (3)
  1. [Abstract] The phrase 'systematically perturbing view availability' is used without an early high-level description of the perturbation protocol; moving a one-sentence summary of the SVS/MVN view-availability controls into the abstract would improve immediate clarity.
  2. [Introduction] The manuscript states that prior multi-view benchmarks conflate 'a model's robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence', but does not cite the specific benchmarks being critiqued; adding 2-3 representative citations in the introduction would strengthen the motivation.
  3. [Experiments] The claim that models 'rely on multiple single-image semantic averaging and view preference' is presented as the primary takeaway; a short additional ablation or control experiment quantifying the contribution of view preference (e.g., via explicit view-order randomization) would make this interpretation more robust.
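
The view-order randomization suggested in comment 3 could be sketched roughly as follows. This is not an experiment from the paper: `order_consistency` and the `answer_fn(views, question)` callable are hypothetical names, and the score measures only sensitivity to input order, which is one possible proxy for view preference.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def order_consistency(answer_fn: Callable[[List[str], str], str],
                      views: Sequence[str], question: str,
                      n_trials: int = 20, seed: int = 0) -> float:
    """Fraction of shuffled-order trials that agree with the modal answer.

    answer_fn is a hypothetical wrapper around a model call. A score of
    1.0 means the answer never changed with view order; lower scores
    suggest the model is sensitive to where a view appears in the input.
    """
    rng = random.Random(seed)
    answers = []
    for _ in range(n_trials):
        order = list(views)
        rng.shuffle(order)  # same views, different presentation order
        answers.append(answer_fn(order, question))
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n_trials


if __name__ == "__main__":
    views = ["cam1", "cam2", "cam3"]

    # Stub "models" for illustration only.
    def order_invariant(vs: List[str], q: str) -> str:
        return "A"

    def first_view_biased(vs: List[str], q: str) -> str:
        return vs[0]  # answer depends entirely on which view comes first

    print(order_consistency(order_invariant, views, "q"))
    print(order_consistency(first_view_biased, views, "q"))
```

Averaging this score over SVS and MVN items separately would indicate whether order sensitivity differs between the two task types.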

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of SSMNBench and for recommending minor revision. No specific major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity; pure empirical benchmark

full rationale

The paper presents SSMNBench as an empirical diagnostic benchmark that partitions QA pairs into SVS (single-view sufficient) and MVN (multi-view necessary) categories, then measures model performance under controlled view perturbations on 17 MLLMs. No derivations, equations, fitted parameters, or first-principles predictions are claimed or present. Central claims rest on direct measurements of distraction degradation and integration failure rather than any self-referential construction, self-citation chain, or renamed known result. The evaluation protocol is externally falsifiable via the released code and dataset, satisfying the criterion for a self-contained benchmark with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that its QA pairs and perturbation protocol measure the intended cross-view capabilities; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The 3,300 QA pairs and their SVS/MVN labels accurately capture single-view sufficiency versus multi-view necessity for human-object understanding.
    All conclusions about model limitations depend on this labeling being valid.

pith-pipeline@v0.9.1-grok · 5781 in / 1087 out tokens · 26459 ms · 2026-06-25T21:00:57.709506+00:00 · methodology
