SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity

Chen Liu; Ling Chen; Tianchen Guo; Xin Yu

arxiv: 2606.25634 · v1 · pith:7NXJU7R2new · submitted 2026-06-24 · 💻 cs.CV

Tianchen Guo, Chen Liu, Ling Chen, Xin Yu

Pith reviewed 2026-06-25 21:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords cross-view understanding · multimodal large language models · benchmark · human-object interaction · single-view sufficiency · multi-view necessity · diagnostic evaluation · view perturbation

The pith

Modern MLLMs average semantics from single views and prefer certain angles rather than synthesizing cross-view geometric evidence for human-object scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SSMNBench to test whether multimodal large language models can genuinely integrate information across multiple camera views in human-centric scenes. It separates tasks into Single-View Sufficiency, where one image contains all needed information, and Multi-View Necessity, where evidence must be fused across views. Experiments on 17 models show performance drops when extra views are added to sufficient tasks and failure to combine fragmented details in necessary tasks. This pattern indicates the models rely on averaging single-image semantics or favoring particular views instead of performing true cross-view synthesis. The benchmark uses view-perturbation protocols to isolate these behaviors from mere distraction.

Core claim

SSMNBench comprises 3,300 curated QA pairs that divide cross-view human and human-object understanding into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN) tasks. Systematically perturbing view availability across state-of-the-art MLLMs reveals severe distraction degradation on SVS tasks and an inability to integrate fragmented geometric evidence on MVN tasks. The paper takes this as evidence that models average the semantics of individual images and favor particular views rather than performing genuine cross-view synthesis.

What carries the argument

The SVS/MVN task categorization, combined with a systematic view-perturbation protocol that controls which camera frames are available while holding the question content fixed.
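
A minimal sketch of how such a perturbation could be set up, assuming each question comes with a list of camera identifiers and a list of views annotated as required. The function names `add_distractor_views` and `drop_required_views`, the camera IDs, and the seeding scheme are illustrative assumptions, not the benchmark's released code.

```python
import random
from typing import List, Sequence


def add_distractor_views(
    all_views: Sequence[str],
    required: Sequence[str],
    n_extra: int,
    seed: int = 0,
) -> List[str]:
    """SVS-style perturbation (hypothetical): keep the required views and
    add n_extra redundant views, then shuffle the presentation order."""
    rng = random.Random(seed)
    distractors = [v for v in all_views if v not in required]
    if n_extra < 0 or n_extra > len(distractors):
        raise ValueError("n_extra must be between 0 and the number of spare views")
    chosen = list(required) + rng.sample(distractors, n_extra)
    rng.shuffle(chosen)
    return chosen


def drop_required_views(required: Sequence[str], n_keep: int, seed: int = 0) -> List[str]:
    """MVN-style perturbation (hypothetical): keep only n_keep of the views
    annotated as necessary, so the evidence is deliberately incomplete."""
    if n_keep < 1 or n_keep > len(required):
        raise ValueError("n_keep must be between 1 and len(required)")
    rng = random.Random(seed)
    return rng.sample(list(required), n_keep)
```

The question text and answer options stay untouched in both functions; only the set of views handed to the model changes, which is the property the protocol relies on.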

If this is right

Existing multi-view benchmarks that supply a fixed bag of frames cannot distinguish robustness to distraction from actual cross-view fusion.
Architectures for future MLLMs must incorporate explicit mechanisms for geometric evidence integration rather than relying on semantic averaging.
Evaluation protocols for cross-view understanding should always include both sufficiency and necessity conditions to avoid conflating separate abilities.
Progress on human-centric scene reasoning requires diagnostic tests that perturb view availability rather than simply increasing the number of input frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same diagnostic split could be applied to non-human scenes or video sequences to test whether the averaging behavior is specific to static human-object views.
If models continue to exhibit view preference, training objectives that penalize reliance on single dominant frames might be needed.
Robotics or AR applications that depend on multi-camera fusion would face reliability limits until cross-view synthesis improves.

Load-bearing premise

The curated QA pairs and view-perturbation protocol isolate genuine cross-view synthesis ability from single-view semantic leakage or visual distraction effects.

What would settle it

A model that shows no performance drop on SVS tasks when redundant views are added and simultaneously improves on MVN tasks by correctly combining geometric details across views would falsify the claim.
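
The falsification condition above can be written as a simple check over per-condition accuracies. This is a sketch under stated assumptions: the function names and the tolerance `tol` are hypothetical, and the paper may define its metrics differently.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth answers."""
    if len(preds) != len(answers) or not answers:
        raise ValueError("preds and answers must be non-empty and equally long")
    return sum(p == a for p, a in zip(preds, answers)) / len(answers)


def distraction_drop(acc_single_view: float, acc_with_extra_views: float) -> float:
    """SVS: accuracy lost when redundant views are added (positive = worse)."""
    return acc_single_view - acc_with_extra_views


def fusion_gain(acc_partial_views: float, acc_all_required_views: float) -> float:
    """MVN: accuracy gained when all required views are supplied."""
    return acc_all_required_views - acc_partial_views


def counters_claim(svs_drop: float, mvn_gain: float, tol: float = 0.01) -> bool:
    """True if a model shows no meaningful SVS drop AND a real MVN gain.
    The tolerance is an assumed noise margin, not a value from the paper."""
    return svs_drop <= tol and mvn_gain > tol
```

A model would need to satisfy both conditions at once; passing only one of them is consistent with the behavior the paper reports.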

Figures

Figures reproduced from arXiv: 2606.25634 by Chen Liu, Ling Chen, Tianchen Guo, Xin Yu.

**Figure 1.** Illustration of the SSMNBench curation pipeline. The construction process begins by collecting dense, occlusion-heavy multi-view scenes and defining 11 distinct SVS and MVN tasks. Next, experts annotate QA pairs alongside their necessary groundtruth views. Finally, we generate structured distractors to mitigate linguistic priors, enforce strict quality control via blind verification, and randomize input …

**Figure 2.** Visual examples of the 11 tasks in SSMNBench, categorized by their reliance on view sufficiency. SVS tasks can be resolved with a single clear view, while MVN tasks require synthesizing information from multiple viewpoints to overcome occlusion and ambiguity. Human-Object Contact: Identifying precise physical interactions between individuals and objects, distinguishing actual contact from near-misses, …

**Figure 3.** Illustration of view variation in SSMNBench’s SVS and MVN settings. (e.g., identifying which entity is closer to a target). This requires cross-view triangulation to resolve monocular depth ambiguity. Global Orientation Identification: Identifying the global directional facing of a human’s torso or head by linking visual cues across multiple viewpoints (e.g., determining which object or direction a pers…

**Figure 4.** Qualitative examples of typical model failure cases. (a) In the SVS task, the model hallucinates fine-grained interaction details, incorrectly concluding that the subject is maintaining a grip with both hands. (b) In the MVN task, the model fails to synthesize cross-view geometric evidence, incorrectly determining the subject’s global orientation by over-relying on the deceptive perspective from a single …

**Figure 5.** Annotation Interface (Part 1)

**Figure 6.** Annotation Interface (Part 2)

**Figure 7.** An illustrative example of the JSON annotation structure utilized in the benchmark. Each entry encapsulates the scene metadata, the natural language query, four multiple-choice options, the ground-truth answer, and the specific camera views required for inference.
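
As a rough illustration of the annotation structure the Figure 7 caption describes, the sketch below models one entry and validates it. All field names (`scene_id`, `category`, `question`, `options`, `answer`, `required_views`) and the convention that the answer is an option letter are assumptions chosen for illustration; the actual keys in the released JSON may differ.

```python
import json
from dataclasses import dataclass
from typing import List

OPTION_LETTERS = ("A", "B", "C", "D")


@dataclass(frozen=True)
class QAEntry:
    """One benchmark entry; every field name here is hypothetical."""
    scene_id: str              # scene metadata
    category: str              # "SVS" or "MVN"
    question: str              # natural language query
    options: List[str]         # four multiple-choice options
    answer: str                # ground-truth option letter
    required_views: List[str]  # camera views needed for inference


def parse_entry(raw: str) -> QAEntry:
    """Parse a JSON string into a QAEntry and check basic consistency."""
    entry = QAEntry(**json.loads(raw))
    if entry.category not in ("SVS", "MVN"):
        raise ValueError("category must be 'SVS' or 'MVN'")
    if len(entry.options) != 4:
        raise ValueError("expected exactly four options")
    if entry.answer not in OPTION_LETTERS:
        raise ValueError("answer must be one of A, B, C, D")
    if not entry.required_views:
        raise ValueError("at least one required view is needed")
    return entry
```

Keeping the required views inside each entry is what makes view perturbation possible at evaluation time, since the harness can tell necessary views from redundant ones.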

**Figure 8.** SSMNBench Examples (Part 1)

**Figure 9.** SSMNBench Examples (Part 2)

**Figure 10.** SSMNBench Examples (Part 3)

Original abstract

Multimodal Large Language Models (MLLMs) have shown remarkable progress in single-image perception, yet their ability to reason about complex cross-view human-centric scenes remains largely unverified. Current multi-view benchmarks evaluate models using a fixed "bag of frames" and thus conflate a model's robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence. To address this issue, we introduce SSMNBench, a diagnostic benchmark comprising 3,300 curated QA pairs for cross-view human and human-object understanding. SSMNBench uniquely categorizes tasks into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). By systematically perturbing view availability across 17 state-of-the-art MLLMs, critical limitations are revealed: models suffer from severe "distraction degradation" when presented with redundant views (SVS), and fail to integrate fragmented geometric evidence across cameras (MVN). Our evaluations demonstrate that modern MLLMs rely on multiple single-image semantic averaging and view preference rather than genuine cross-view synthesis. By exposing these fundamental vulnerabilities, SSMNBench provides a rigorous diagnostic framework to drive the advancement of future cross-view-aware multimodal architectures. The code is available at: https://github.com/gtc-gh/SSMNBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The SVS/MVN split and distraction finding are the real contribution, but the benchmark's ability to isolate cross-view synthesis still needs verification from the full construction details.

The paper's useful move is splitting tasks into Single-View Sufficiency and Multi-View Necessity, then showing that extra views hurt performance on SVS cases through distraction and that models do not integrate geometric evidence on MVN cases. This goes past the usual fixed-bag multi-view tests and points to a concrete failure mode in current MLLMs.

The evaluations cover 17 models on 3,300 QA pairs, and the code release helps anyone who wants to check or extend the benchmark. That setup gives a practical diagnostic for human-centric multi-camera scenes.

The main soft spot is whether the view perturbations and QA curation actually block single-view semantic leakage or other confounds. The abstract states the protocol does this, but the strength of the central claim rests on those details holding up under inspection. If the controls are loose, the distraction degradation and integration failure could partly reflect dataset artifacts rather than model behavior.

The work is aimed at people testing or improving MLLMs for robotics and surveillance applications. A reader who needs a benchmark to measure cross-view limits will find it directly usable.

It has a clear empirical structure and a reproducible component, so it deserves peer review rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper introduces SSMNBench, a diagnostic benchmark of 3,300 curated QA pairs for cross-view human and human-object understanding in MLLMs. Tasks are partitioned into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). Systematic view-perturbation experiments across 17 state-of-the-art MLLMs reveal distraction degradation on SVS tasks and failure to integrate fragmented geometric evidence on MVN tasks, leading to the conclusion that models rely on single-image semantic averaging and view preference rather than genuine cross-view synthesis. Code is released at the provided GitHub link.

Significance. If the benchmark's partitioning and perturbation protocol correctly isolate cross-view synthesis from single-view leakage and distraction, the work supplies a useful diagnostic tool for exposing limitations in current MLLMs and motivating cross-view-aware architectures. The scale of the evaluation and public code release are concrete strengths that support reproducibility and follow-on research.

minor comments (3)

[Abstract] The phrase 'systematically perturbing view availability' is used without an early high-level description of the perturbation protocol; moving a one-sentence summary of the SVS/MVN view-availability controls into the abstract would improve immediate clarity.
[Introduction] The manuscript states that current multi-view benchmarks conflate 'a model's robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence', but does not cite the specific benchmarks being critiqued; adding 2-3 representative citations in the introduction would strengthen the motivation.
[Experiments] The claim that models 'rely on multiple single-image semantic averaging and view preference' is presented as the primary takeaway; a short additional ablation or control experiment quantifying the contribution of view preference (e.g., via explicit view-order randomization) would make this interpretation more robust.
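
One way the suggested view-order randomization control could look, sketched under assumptions: `answer_fn` stands in for any model wrapper that takes an ordered list of views and a question and returns an answer string. The function name `order_consistency` and the consistency metric are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def order_consistency(
    answer_fn: Callable[[List[str], str], str],
    views: Sequence[str],
    question: str,
    n_orders: int = 10,
    seed: int = 0,
) -> float:
    """Share of random view orderings that yield the most common answer.

    1.0 means the answer never changes with view order; lower values
    suggest the model is sensitive to where a view appears in the input.
    """
    if n_orders < 1:
        raise ValueError("n_orders must be at least 1")
    rng = random.Random(seed)
    answers = []
    for _ in range(n_orders):
        order = list(views)
        rng.shuffle(order)  # same views and question, different order
        answers.append(answer_fn(order, question))
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / n_orders
```

Averaging this score over many questions would give a rough estimate of how much of a model's behavior tracks view position rather than view content.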

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of SSMNBench and for recommending minor revision. No specific major comments appear in the provided report.

Circularity Check

0 steps flagged

No significant circularity; pure empirical benchmark

The paper presents SSMNBench as an empirical diagnostic benchmark that partitions QA pairs into SVS (single-view sufficient) and MVN (multi-view necessary) categories, then measures model performance under controlled view perturbations on 17 state-of-the-art MLLMs. No derivations, equations, fitted parameters, or first-principles predictions are claimed or present. Central claims rest on direct measurements of distraction degradation and integration failure rather than any self-referential construction, self-citation chain, or renamed known result. The evaluation protocol is externally falsifiable via the released code, satisfying the criterion for a self-contained benchmark with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that its QA pairs and perturbation protocol measure the intended cross-view capabilities; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The 3,300 QA pairs and their SVS/MVN labels accurately capture single-view sufficiency versus multi-view necessity for human-object understanding. All conclusions about model limitations depend on this labeling being valid.


2025
[73]

arXiv preprint arXiv:2505.23764 (2025)

Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., et al.: Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764 (2025)

Pith/arXiv arXiv 2025
[74]

arXiv preprint arXiv:2504.15280 (2025)

Yeh, C.H., Wang, C., Tong, S., Cheng, T.Y., Wang, R., Chu, T., Zhai, Y., Chen, Y., Gao, S., Ma, Y.: Seeing from another perspective: Evaluating multi-view un- derstanding in mllms. arXiv preprint arXiv:2504.15280 (2025)

arXiv 2025
[75]

National Science Review11(12), nwae403 (2024)

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12), nwae403 (2024)

2024
[76]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Zhang, J., Zhang, J., Song, Z., Shi, Z., Zhao, C., Shi, Y., Yu, J., Xu, L., Wang, J.: Hoi-mˆ 3: Capture multiple humans and objects interaction within contextual environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 516–526 (2024)

2024
[77]

In: Findings of the Association for Computational Linguistics: EMNLP 2025

Zhang, K., Niu, L., Cao, Z., Meng, F., Zhou, J.: Tiu-bench: A benchmark for evaluating large multimodal models on text-rich image understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 24286–24295 (2025)

2025
[78]

arXiv preprint arXiv:2509.02359 (2025)

Zhang, W., Huang, Y., Xu, Y., Huang, J., Zhi, H., Ren, S., Xu, W., Zhang, J.: Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture. arXiv preprint arXiv:2509.02359 (2025)

arXiv 2025
[79]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhang, W., Qiu, F., Liu, C., Li, L., Du, H., Guo, T., Yu, X.: An effective en- semble learning framework for affective behaviour analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4761–4772 (2024)

2024
[80]

arXiv preprint arXiv:2403.10825 (2024)

Zhang, W., Qiu, F., Liu, C., Li, L., Du, H., Guo, T., Yu, X.: Affective behaviour analysis via integrating multi-modal knowledge. arXiv preprint arXiv:2403.10825 (2024)

arXiv 2024

