pith. machine review for the scientific record.

arxiv: 2604.21814 · v1 · submitted 2026-04-23 · 💻 cs.CV · cs.AI

Recognition: unknown

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 22:29 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: capsule endoscopy · video summarization · diagnosis-driven task · context organization · evidence aggregation · VideoCAP dataset · clinician workflow

The pith

A divide-then-diagnose framework turns ultra-long capsule endoscopy videos into concise and accurate medical diagnoses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Capsule endoscopy produces videos with tens of thousands of frames where relevant findings are rare and easily lost among normal ones. The paper defines the task of diagnosis-driven video summarization, which requires both selecting key evidence frames and generating diagnoses from them. It supplies VideoCAP, a dataset of 240 full-length videos annotated directly from clinical reports to support this task. DiCE solves it by first screening for candidate frames, then organizing those frames into separate coherent contexts that mirror how clinicians read, and finally aggregating evidence within each context for a judgment. The result is a short summary that covers distinct lesions without being overwhelmed by redundancy or ambiguity.
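
To make the shape of that pipeline concrete, here is a minimal Python sketch of a divide-then-diagnose loop. Everything in it is assumed for illustration: the function names, the score threshold, and the temporal-gap heuristic for grouping candidates into lesion events are stand-ins, not the paper's actual components.

```python
from dataclasses import dataclass

@dataclass
class LesionContext:
    frame_ids: list          # candidate frames belonging to one lesion event
    diagnosis: str = ""      # filled in by the evidence-aggregation stage

def screen_candidates(frames, frame_scorer, threshold=0.5):
    # Stage 1: cheap per-frame screening over the raw video; only frames
    # scoring above the (placeholder) threshold survive.
    return [i for i, f in enumerate(frames) if frame_scorer(f) >= threshold]

def weave_contexts(candidate_ids, max_gap=30):
    # Stage 2: a stand-in for the Context Weaver. Here, candidates separated
    # by at most `max_gap` frames are grouped into one lesion event; the real
    # module may use visual similarity, not just temporal proximity.
    if not candidate_ids:
        return []
    contexts, current = [], [candidate_ids[0]]
    for i in candidate_ids[1:]:
        if i - current[-1] <= max_gap:
            current.append(i)
        else:
            contexts.append(LesionContext(current))
            current = [i]
    contexts.append(LesionContext(current))
    return contexts

def converge_evidence(frames, contexts, clip_diagnoser):
    # Stage 3: a stand-in for the Evidence Converger. All frames in a context
    # are judged jointly, so a single blurry frame cannot decide the diagnosis.
    for ctx in contexts:
        ctx.diagnosis = clip_diagnoser([frames[i] for i in ctx.frame_ids])
    return contexts

def divide_then_diagnose(frames, frame_scorer, clip_diagnoser):
    candidates = screen_candidates(frames, frame_scorer)
    contexts = weave_contexts(candidates)
    return converge_evidence(frames, contexts, clip_diagnoser)
```

The point of the structure, on this reading, is the irreversibility it creates: whatever screen_candidates drops never reaches the later stages, which is exactly the failure mode the referee report below presses on.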

Core claim

DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments.

What carries the argument

The divide-then-diagnose pipeline: an initial pass screens candidate frames, a Context Weaver groups them by distinct lesion events, and an Evidence Converger combines the multi-frame observations within each group.

Load-bearing premise

That diagnostically relevant events are extremely sparse and can be identified reliably in an initial screening pass without missing critical findings.

What would settle it

A collection of capsule endoscopy videos with densely clustered or easily overlooked abnormalities on which the screening step omits key frames, leading to wrong downstream diagnoses.

original abstract

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom-and-free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the new task of diagnosis-driven CE video summarization for ultra-long capsule endoscopy videos, releases the VideoCAP dataset (240 full-length videos with annotations derived from real clinical reports), and proposes the DiCE framework. DiCE performs an initial efficient candidate screening pass over the raw video, followed by a Context Weaver that organizes screened candidates into coherent diagnostic contexts preserving distinct lesion events, and an Evidence Converger that aggregates multi-frame evidence within each context for robust clip-level diagnoses. The central claim is that this clinician-inspired divide-then-diagnose pipeline consistently outperforms state-of-the-art methods while producing concise and clinically reliable diagnostic summaries.

Significance. If the empirical claims hold, the work would meaningfully advance video-level analysis in capsule endoscopy, a domain where manual review of tens of thousands of frames remains a clinical bottleneck. The new VideoCAP dataset provides realistic supervision for both evidence extraction and diagnosis, and the emphasis on preserving distinct lesion contexts addresses a practical gap left by frame-level classification methods. The clinician-mirroring workflow is a constructive direction for sparse-event video understanding.

major comments (2)
  1. [Experiments / Method (candidate screening)] The screening module is load-bearing: any false negative is irrecoverable by the downstream Context Weaver and Evidence Converger. The experiments section reports results on VideoCAP but does not include per-lesion recall, false-negative rates, or worst-case analysis for videos containing rare/ambiguous findings under motion blur and artifacts; without these metrics the claim of clinical reliability cannot be evaluated (a sketch of the requested metrics follows this list).
  2. [Abstract and Experiments] The abstract and method assert that DiCE 'consistently outperforms state-of-the-art methods' and yields 'clinically reliable' summaries, yet the provided description supplies no quantitative metrics, baseline comparisons, ablation studies isolating the Context Weaver or Evidence Converger, or statistical significance tests. These details are required to substantiate the central performance claim.
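
The metrics requested in comment 1 are straightforward to state. A hedged sketch, assuming lesion annotations are available as sets of frame indices; the names and data layout are illustrative, not VideoCAP's actual format:

```python
def per_lesion_recall(gt_lesions, kept_frames):
    # gt_lesions: list of sets of frame indices, one set per annotated lesion.
    # kept_frames: set of frame indices that survived candidate screening.
    # A lesion counts as recalled if at least one of its frames was kept.
    if not gt_lesions:
        return 1.0
    hits = sum(1 for lesion in gt_lesions if lesion & kept_frames)
    return hits / len(gt_lesions)

def lesion_false_negative_rate(gt_lesions, kept_frames):
    # Fraction of lesions with no surviving evidence frame; these are the
    # irrecoverable cases, since nothing downstream can restore them.
    return 1.0 - per_lesion_recall(gt_lesions, kept_frames)
```

For example, per_lesion_recall([{10, 11}, {5000}], {11, 4999}) returns 0.5: the second lesion has no surviving frame and is lost to every later stage.
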
minor comments (2)
  1. [Method] The terms 'Context Weaver' and 'Evidence Converger' are introduced without a clear mathematical formulation or pseudocode; adding explicit definitions or algorithmic outlines would improve reproducibility.
  2. [Abstract / Dataset] The abstract states that diagnostically relevant events are 'extremely sparse' but does not quantify the average number of positive frames per video or the sparsity ratio on VideoCAP; including these statistics would strengthen the motivation (a sketch of such statistics follows this list).
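
The sparsity statistics requested in comment 2 are equally simple to define. A minimal sketch, assuming per-video annotations expose total and positive frame counts (hypothetical layout, non-empty input):

```python
def sparsity_stats(videos):
    # videos: list of (total_frames, positive_frames) pairs, one per video.
    # Returns the mean number of positive frames per video and the mean
    # positive-to-total ratio, the two numbers the referee asks for.
    avg_positive = sum(pos for _, pos in videos) / len(videos)
    avg_ratio = sum(pos / total for total, pos in videos) / len(videos)
    return avg_positive, avg_ratio
```

On a toy corpus like [(50_000, 40), (80_000, 12)] this yields 26 positive frames per video and a ratio near 5e-4, the kind of figure that would make 'extremely sparse' concrete.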

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of the diagnosis-driven CE video summarization task, the VideoCAP dataset, and the DiCE framework. We address each major comment below and will revise the manuscript to strengthen the experimental validation and presentation of results.

point-by-point responses
  1. Referee: [Experiments / Method (candidate screening)] The screening module is load-bearing: any false negative is irrecoverable by the downstream Context Weaver and Evidence Converger. The experiments section reports results on VideoCAP but does not include per-lesion recall, false-negative rates, or worst-case analysis for videos containing rare/ambiguous findings under motion blur and artifacts; without these metrics the claim of clinical reliability cannot be evaluated.

    Authors: We agree that the screening module is critical, as any false negative cannot be recovered downstream. The current end-to-end results on VideoCAP demonstrate overall performance, but we acknowledge the need for more granular analysis. In the revised manuscript, we will add per-lesion recall, false-negative rates, and a dedicated worst-case analysis focusing on videos with rare or ambiguous findings under motion blur and artifacts. This will provide a more complete evaluation of clinical reliability. revision: yes

  2. Referee: [Abstract and Experiments] The abstract and method assert that DiCE 'consistently outperforms state-of-the-art methods' and yields 'clinically reliable' summaries, yet the provided description supplies no quantitative metrics, baseline comparisons, ablation studies isolating the Context Weaver or Evidence Converger, or statistical significance tests. These details are required to substantiate the central performance claim.

    Authors: We agree that the abstract and method sections would benefit from more explicit quantitative support to substantiate the claims. In the revision, we will incorporate key performance metrics, direct baseline comparisons, ablation studies that isolate the contributions of the Context Weaver and Evidence Converger, and statistical significance tests. These additions will be drawn from the existing experimental results and presented clearly to strengthen the central performance claims. revision: yes

Circularity Check

0 steps flagged

No circularity: descriptive framework with no equations or self-referential reductions.

full rationale

The paper defines a new task (diagnosis-driven CE video summarization), introduces the VideoCAP dataset, and describes the DiCE framework as a clinician-inspired pipeline consisting of candidate screening, Context Weaver, and Evidence Converger. No mathematical derivations, equations, or parameter-fitting steps are referenced in the abstract or method outline that could reduce to fitted inputs or self-definitions. The workflow is presented as an architectural choice mirroring clinical practice rather than a derived result. No load-bearing self-citations or uniqueness theorems appear. Experiments on the introduced dataset provide external validation, keeping the central claims independent of any tautological construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard machine-learning assumptions about neural networks learning from sparse annotations and on the untested premise that a divide-then-diagnose pipeline faithfully captures clinical reading practice; no free parameters are introduced, and the two invented entities are architectural modules rather than physical posits.

axioms (2)
  • domain assumption Neural networks can learn to identify diagnostically relevant frames from video data annotated via clinical reports.
    Implicit in proposing a learning-based screening and aggregation pipeline.
  • domain assumption Organizing candidate frames into distinct lesion-event contexts improves diagnostic accuracy over frame-independent processing.
    Core motivation for the Context Weaver component.
invented entities (2)
  • Context Weaver (no independent evidence)
    purpose: Organizes screened candidates into coherent diagnostic contexts that preserve distinct lesion events.
    New module introduced to mirror clinician workflow.
  • Evidence Converger (no independent evidence)
    purpose: Aggregates multi-frame evidence within each context into robust clip-level judgments.
    New module for robust diagnosis from grouped frames.

pith-pipeline@v0.9.0 · 5586 in / 1489 out tokens · 27024 ms · 2026-05-09T22:29:50.258967+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedHorizon: Towards Long-context Medical Video Understanding in the Wild

    cs.CV · 2026-05 · unverdicted · novelty 8.0

    MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
