pith. machine review for the scientific record.

arxiv: 2605.06537 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical video understanding · long-context MLLMs · sparse evidence retrieval · clinical procedures · benchmark · multi-hop reasoning · multimodal models

The pith

Current AI models top out at 41.1 percent accuracy on full-length medical procedure videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedHorizon, a benchmark for long-context medical video understanding that includes 759 hours of complete clinical procedures and 1,253 questions requiring retrieval of sparse evidence followed by multi-hop clinical reasoning. Evaluation of general, medical, and long-video multimodal models shows the top performer reaches just 41.1 percent accuracy, far below robust understanding. The benchmark demonstrates that evidence is extremely sparse at 0.166 percent of frames on average, testing the retrieval-before-reasoning capability missing in prior short-clip or pre-segmented evaluations. Analysis identifies that performance does not scale with more frames and that bottlenecks lie in evidence retrieval and interpretation due to weak procedural reasoning and attention issues under redundancy.

Core claim

We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding that preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions, where the best model reaches only 41.1 percent accuracy.

What carries the argument

MedHorizon benchmark with full untrimmed videos and evidence annotations that force models to locate temporally sparse decisive frames (0.166 percent of frames on average) before aggregating findings across the procedure.
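To make the evidence-annotation mechanism concrete, below is a minimal sketch of what a single evidence-grounded item could look like. The schema is illustrative only: field names such as video_id and evidence_frames, and the 1 fps frame basis, are assumptions for exposition rather than MedHorizon's released format.

    # Illustrative sketch only: field names and values are assumptions for
    # exposition, not MedHorizon's released schema.

    example_item = {
        "video_id": "procedure_0001",        # hypothetical identifier
        "duration_s": 3600,                  # full-length, untrimmed procedure
        "fps": 1,                            # frames considered at 1 fps (assumption)
        "question": "Which finding was identified before the instrument change?",
        "options": ["A ...", "B ...", "C ...", "D ..."],
        "answer": "B",
        # A handful of decisive frames out of thousands (hypothetical indices).
        "evidence_frames": [812, 813, 2204, 2205, 2980, 2981],
    }

    def evidence_sparsity(item):
        """Fraction of the video's frames annotated as evidence for one question."""
        total_frames = item["duration_s"] * item["fps"]
        return len(item["evidence_frames"]) / total_frames

    print(f"evidence sparsity: {evidence_sparsity(example_item):.3%}")
    # Six evidence frames in an hour at 1 fps gives 0.167%, in line with the
    # paper's reported 0.166% average.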

If this is right

  • Performance does not scale reliably with sampling more frames from the video (see the arithmetic sketch after this list).
  • Evidence retrieval and clinical interpretation are the main bottlenecks for current models.
  • Weak procedural reasoning and attention drift under redundancy cause these bottlenecks.
  • Generic sampling methods only partially balance local detail with global coverage.
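
A back-of-the-envelope check of the first finding, assuming the reported 0.166 percent evidence density and content-blind uniform sampling; both are simplifications for illustration, not the paper's analysis.

    # Why larger frame budgets buy little when evidence is this sparse.
    EVIDENCE_DENSITY = 0.00166  # 0.166% of frames, the reported average

    for budget in (32, 64, 128, 256, 512):
        expected_hits = budget * EVIDENCE_DENSITY
        # Probability of catching at least one evidence frame, treating each
        # sampled frame as an independent draw.
        p_at_least_one = 1 - (1 - EVIDENCE_DENSITY) ** budget
        print(f"{budget:>4} frames: expected evidence frames {expected_hits:.2f}, "
              f"P(at least one) {p_at_least_one:.1%}")

Even a 512-frame budget catches at least one evidence frame barely more than half the time under these assumptions, which is consistent with accuracy failing to scale with frame count.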

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models could benefit from explicit sparse-evidence retrieval mechanisms before reasoning (a minimal sketch follows this list).
  • This benchmark may be extended to other medical imaging modalities or non-clinical long videos with similar sparsity.
  • Clinical AI systems for procedure review would need to address attention drift in redundant streams to be practical.
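
A minimal sketch of what an explicit sparse-evidence retrieval stage could look like, with stand-in encoders so the snippet runs as written. embed_question and embed_frames return random unit vectors here; a real system would substitute an image-text encoder. This is an editorial illustration, not a mechanism described in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    EMB_DIM = 512

    def embed_question(question: str) -> np.ndarray:
        """Stand-in for a text encoder; returns one unit-norm vector."""
        v = rng.standard_normal(EMB_DIM)
        return v / np.linalg.norm(v)

    def embed_frames(num_frames: int) -> np.ndarray:
        """Stand-in for a frame encoder; one unit-norm vector per frame."""
        m = rng.standard_normal((num_frames, EMB_DIM))
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    def retrieve_top_k(question: str, num_frames: int, k: int = 32) -> np.ndarray:
        """Score every frame against the question and keep the k best candidates."""
        q = embed_question(question)
        frames = embed_frames(num_frames)
        scores = frames @ q                    # cosine similarity on unit vectors
        return np.argsort(scores)[::-1][:k]    # indices of candidate evidence frames

    candidates = retrieve_top_k("Which finding preceded the instrument change?",
                                num_frames=10_000)
    print(candidates[:10])
    # Only the retained candidates would be passed to the MLLM for reasoning.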

Load-bearing premise

The 1,253 questions and their evidence annotations faithfully capture the sparse-evidence and multi-hop reasoning demands of real clinical video review without selection bias or annotation artifacts.

What would settle it

A new model achieving high accuracy on the MedHorizon benchmark, say above 70 percent, using only standard frame sampling and existing MLLM architectures would challenge the conclusion that current systems remain far from robust full-procedure understanding.
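
For concreteness, a minimal sketch of the kind of harness such a test implies: uniform frame sampling, no learned retrieval, accuracy over the question set. ask_model is a stub standing in for any existing MLLM's inference call, and the item fields mirror the hypothetical schema sketched earlier.

    from typing import Sequence

    def uniform_sample(num_frames: int, budget: int) -> list:
        """Pick `budget` frame indices spread evenly across the video."""
        step = max(num_frames // budget, 1)
        return list(range(0, num_frames, step))[:budget]

    def ask_model(frame_indices: Sequence[int], question: str,
                  options: Sequence[str]) -> str:
        """Placeholder: a real harness would decode frames and call an MLLM here."""
        return options[0]  # stub answer so the sketch runs end to end

    def accuracy(dataset) -> float:
        correct = 0
        for item in dataset:
            frames = uniform_sample(item["num_frames"], budget=256)
            pred = ask_model(frames, item["question"], item["options"])
            correct += pred == item["answer"]
        return correct / len(dataset)

    toy = [{"num_frames": 90_000, "question": "...",
            "options": ["A", "B", "C", "D"], "answer": "B"}]
    print(f"accuracy: {accuracy(toy):.1%}")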

read the original abstract

Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces MedHorizon, an in-the-wild benchmark for long-context medical video understanding consisting of 759 hours of full-length clinical procedure videos and 1,253 evidence-grounded multiple-choice questions with extremely sparse evidence annotations (0.166% frames on average). It evaluates general, medical, and long-video MLLMs, reporting a best accuracy of 41.1%, and identifies four findings regarding bottlenecks in evidence retrieval, clinical reasoning, and frame sampling under redundancy.

Significance. Should the benchmark's questions be confirmed to necessitate long-context retrieval and multi-hop reasoning without annotation artifacts or shortcuts, this work would be significant for advancing medical multimodal AI by providing a challenging, realistic testbed that exposes limitations in current MLLMs' ability to handle full-procedure videos, potentially informing improvements in attention mechanisms and procedural reasoning.

major comments (1)
  1. [Abstract] The interpretation of the 41.1% accuracy as indicating that current systems are far from robust full-procedure understanding depends on the questions truly requiring sparse evidence search and aggregation; however, the manuscript provides no details on question validation, inter-annotator agreement for evidence frame selection, or ablations showing that performance drops when non-evidence frames are removed or that questions are unanswerable from single short segments.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional details on question validation, inter-annotator agreement, and targeted ablations are necessary to fully substantiate the claim that the benchmark requires sparse evidence retrieval and long-context reasoning. We address the comment below and will incorporate the requested elements in the revision.

read point-by-point responses
  1. Referee: [Abstract] The interpretation of the 41.1% accuracy as indicating that current systems are far from robust full-procedure understanding depends on the questions truly requiring sparse evidence search and aggregation; however, the manuscript provides no details on question validation, inter-annotator agreement for evidence frame selection, or ablations showing that performance drops when non-evidence frames are removed or that questions are unanswerable from single short segments.

    Authors: We acknowledge the validity of this observation. The current version of the manuscript describes the evidence-grounded annotation process and reports the average evidence sparsity of 0.166% but does not include explicit validation statistics or the suggested ablations. In the revised manuscript we will add a new subsection (and supporting appendix) that details the multi-stage question validation workflow, reports inter-annotator agreement for evidence-frame localization, and presents ablation results in which models receive either (i) videos with all evidence frames removed or (ii) only short random clips. These experiments will quantify the performance drop and thereby confirm that the questions cannot be solved without long-context retrieval and multi-hop reasoning. We view this as a necessary and straightforward strengthening of the paper. revision: yes
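
A minimal sketch of ablation (i) promised above: score each question twice, once with the sampled frames intact and once with every annotated evidence frame removed, then report the accuracy gap. The sample_fn and evaluate hooks, like the field names, are placeholders for illustration rather than the authors' protocol.

    def drop_evidence_frames(frame_indices, evidence_frames):
        """Return the sampled frame indices with all annotated evidence removed."""
        evidence = set(evidence_frames)
        return [f for f in frame_indices if f not in evidence]

    def ablation_gap(dataset, sample_fn, evaluate):
        """Accuracy with intact sampling minus accuracy after evidence removal."""
        intact = ablated = 0
        for item in dataset:
            frames = sample_fn(item)
            intact += evaluate(item, frames)
            ablated += evaluate(item, drop_evidence_frames(frames, item["evidence_frames"]))
        n = len(dataset)
        return intact / n - ablated / n

    # Toy usage with stub hooks, just to show the shape of the comparison.
    toy = [{"evidence_frames": [10, 11], "answer": "B"}]
    sample_fn = lambda item: list(range(0, 100, 5))
    evaluate = lambda item, frames: int(any(f in item["evidence_frames"] for f in frames))
    print(f"accuracy drop when evidence is removed: {ablation_gap(toy, sample_fn, evaluate):.2f}")

A near-chance score in the ablated condition, and hence a large gap, would support the claim that the questions cannot be answered without the sparse evidence.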

Circularity Check

0 steps flagged

Empirical benchmark with no derivations or self-referential predictions

full rationale

The paper introduces MedHorizon as an in-the-wild benchmark for long-context medical video understanding, providing 1,253 evidence-grounded questions on 759 hours of procedures. It evaluates existing MLLMs without any claimed derivations, fitted parameters, or predictions derived from its own inputs. The results (e.g., best model at 41.1%) are direct empirical measurements against external models, independent of any self-referential construction. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own definitions or citations. The work is self-contained as a dataset and evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, free parameters, or new axioms; it relies on standard evaluation practices in computer vision and multimodal learning.

pith-pipeline@v0.9.0 · 5600 in / 1231 out tokens · 80965 ms · 2026-05-08T12:25:12.182623+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 20 canonical work pages · 12 internal anchors
