pith. machine review for the scientific record.

arxiv: 2605.06537 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical video understanding · long-context MLLMs · sparse evidence retrieval · clinical procedures · benchmark · multi-hop reasoning · multimodal models

The pith

Current AI models top out at 41.1 percent accuracy on full-length medical procedure videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedHorizon, a benchmark for long-context medical video understanding that includes 759 hours of complete clinical procedures and 1,253 questions requiring retrieval of sparse evidence followed by multi-hop clinical reasoning. Evaluation of general, medical, and long-video multimodal models shows the top performer reaches just 41.1 percent accuracy, far below robust understanding. The benchmark demonstrates that evidence is extremely sparse at 0.166 percent of frames on average, testing the retrieval-before-reasoning capability missing in prior short-clip or pre-segmented evaluations. Analysis identifies that performance does not scale with more frames and that bottlenecks lie in evidence retrieval and interpretation due to weak procedural reasoning and attention issues under redundancy.

Core claim

We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding that preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions, where the best model reaches only 41.1 percent accuracy.

What carries the argument

MedHorizon benchmark with full untrimmed videos and evidence annotations that force models to locate temporally sparse decisive frames (0.166 percent of frames on average) before aggregating findings across the procedure.
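To make the evidence-annotation mechanism concrete, below is a minimal sketch of what a single evidence-grounded item could look like. The schema is illustrative only: field names such as video_id and evidence_frames, and the 1 fps frame basis, are assumptions for exposition rather than MedHorizon's released format.

    # Illustrative sketch only: field names and values are assumptions for
    # exposition, not MedHorizon's released schema.

    example_item = {
        "video_id": "procedure_0001",        # hypothetical identifier
        "duration_s": 3600,                  # full-length, untrimmed procedure
        "fps": 1,                            # frames considered at 1 fps (assumption)
        "question": "Which finding was identified before the instrument change?",
        "options": ["A ...", "B ...", "C ...", "D ..."],
        "answer": "B",
        # A handful of decisive frames out of thousands (hypothetical indices).
        "evidence_frames": [812, 813, 2204, 2205, 2980, 2981],
    }

    def evidence_sparsity(item):
        """Fraction of the video's frames annotated as evidence for one question."""
        total_frames = item["duration_s"] * item["fps"]
        return len(item["evidence_frames"]) / total_frames

    print(f"evidence sparsity: {evidence_sparsity(example_item):.3%}")
    # Six evidence frames in an hour at 1 fps gives 0.167%, in line with the
    # paper's reported 0.166% average.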

If this is right

  • Performance does not scale reliably with sampling more frames from the video (see the arithmetic sketch after this list).
  • Evidence retrieval and clinical interpretation are the main bottlenecks for current models.
  • Weak procedural reasoning and attention drift under redundancy cause these bottlenecks.
  • Generic sampling methods only partially balance local detail with global coverage.
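
A back-of-the-envelope check of the first finding, assuming the reported 0.166 percent evidence density and content-blind uniform sampling; both are simplifications for illustration, not the paper's analysis.

    # Why larger frame budgets buy little when evidence is this sparse.
    EVIDENCE_DENSITY = 0.00166  # 0.166% of frames, the reported average

    for budget in (32, 64, 128, 256, 512):
        expected_hits = budget * EVIDENCE_DENSITY
        # Probability of catching at least one evidence frame, treating each
        # sampled frame as an independent draw.
        p_at_least_one = 1 - (1 - EVIDENCE_DENSITY) ** budget
        print(f"{budget:>4} frames: expected evidence frames {expected_hits:.2f}, "
              f"P(at least one) {p_at_least_one:.1%}")

Even a 512-frame budget catches at least one evidence frame barely more than half the time under these assumptions, which is consistent with accuracy failing to scale with frame count.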

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models could benefit from explicit sparse-evidence retrieval mechanisms before reasoning (a minimal sketch follows this list).
  • This benchmark may be extended to other medical imaging modalities or non-clinical long videos with similar sparsity.
  • Clinical AI systems for procedure review would need to address attention drift in redundant streams to be practical.
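
A minimal sketch of what an explicit sparse-evidence retrieval stage could look like, with stand-in encoders so the snippet runs as written. embed_question and embed_frames return random unit vectors here; a real system would substitute an image-text encoder. This is an editorial illustration, not a mechanism described in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    EMB_DIM = 512

    def embed_question(question: str) -> np.ndarray:
        """Stand-in for a text encoder; returns one unit-norm vector."""
        v = rng.standard_normal(EMB_DIM)
        return v / np.linalg.norm(v)

    def embed_frames(num_frames: int) -> np.ndarray:
        """Stand-in for a frame encoder; one unit-norm vector per frame."""
        m = rng.standard_normal((num_frames, EMB_DIM))
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    def retrieve_top_k(question: str, num_frames: int, k: int = 32) -> np.ndarray:
        """Score every frame against the question and keep the k best candidates."""
        q = embed_question(question)
        frames = embed_frames(num_frames)
        scores = frames @ q                    # cosine similarity on unit vectors
        return np.argsort(scores)[::-1][:k]    # indices of candidate evidence frames

    candidates = retrieve_top_k("Which finding preceded the instrument change?",
                                num_frames=10_000)
    print(candidates[:10])
    # Only the retained candidates would be passed to the MLLM for reasoning.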

Load-bearing premise

The 1,253 questions and their evidence annotations faithfully capture the sparse-evidence and multi-hop reasoning demands of real clinical video review without selection bias or annotation artifacts.

What would settle it

A new model achieving high accuracy on the MedHorizon benchmark, say above 70 percent, using only standard frame sampling and existing MLLM architectures would challenge the conclusion that current systems remain far from robust full-procedure understanding.
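
For concreteness, a minimal sketch of the kind of harness such a test implies: uniform frame sampling, no learned retrieval, accuracy over the question set. ask_model is a stub standing in for any existing MLLM's inference call, and the item fields mirror the hypothetical schema sketched earlier.

    from typing import Sequence

    def uniform_sample(num_frames: int, budget: int) -> list:
        """Pick `budget` frame indices spread evenly across the video."""
        step = max(num_frames // budget, 1)
        return list(range(0, num_frames, step))[:budget]

    def ask_model(frame_indices: Sequence[int], question: str,
                  options: Sequence[str]) -> str:
        """Placeholder: a real harness would decode frames and call an MLLM here."""
        return options[0]  # stub answer so the sketch runs end to end

    def accuracy(dataset) -> float:
        correct = 0
        for item in dataset:
            frames = uniform_sample(item["num_frames"], budget=256)
            pred = ask_model(frames, item["question"], item["options"])
            correct += pred == item["answer"]
        return correct / len(dataset)

    toy = [{"num_frames": 90_000, "question": "...",
            "options": ["A", "B", "C", "D"], "answer": "B"}]
    print(f"accuracy: {accuracy(toy):.1%}")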

read the original abstract

Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces MedHorizon, an in-the-wild benchmark for long-context medical video understanding consisting of 759 hours of full-length clinical procedure videos and 1,253 evidence-grounded multiple-choice questions with extremely sparse evidence annotations (0.166% frames on average). It evaluates general, medical, and long-video MLLMs, reporting a best accuracy of 41.1%, and identifies four findings regarding bottlenecks in evidence retrieval, clinical reasoning, and frame sampling under redundancy.

Significance. Should the benchmark's questions be confirmed to necessitate long-context retrieval and multi-hop reasoning without annotation artifacts or shortcuts, this work would be significant for advancing medical multimodal AI by providing a challenging, realistic testbed that exposes limitations in current MLLMs' ability to handle full-procedure videos, potentially informing improvements in attention mechanisms and procedural reasoning.

major comments (1)
  1. [Abstract] The interpretation of the 41.1% accuracy as indicating that current systems are far from robust full-procedure understanding depends on the questions truly requiring sparse evidence search and aggregation; however, the manuscript provides no details on question validation, inter-annotator agreement for evidence frame selection, or ablations showing that performance drops when non-evidence frames are removed or that questions are unanswerable from single short segments.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional details on question validation, inter-annotator agreement, and targeted ablations are necessary to fully substantiate the claim that the benchmark requires sparse evidence retrieval and long-context reasoning. We address the comment below and will incorporate the requested elements in the revision.

read point-by-point responses
  1. Referee: [Abstract] The interpretation of the 41.1% accuracy as indicating that current systems are far from robust full-procedure understanding depends on the questions truly requiring sparse evidence search and aggregation; however, the manuscript provides no details on question validation, inter-annotator agreement for evidence frame selection, or ablations showing that performance drops when non-evidence frames are removed or that questions are unanswerable from single short segments.

    Authors: We acknowledge the validity of this observation. The current version of the manuscript describes the evidence-grounded annotation process and reports the average evidence sparsity of 0.166% but does not include explicit validation statistics or the suggested ablations. In the revised manuscript we will add a new subsection (and supporting appendix) that details the multi-stage question validation workflow, reports inter-annotator agreement for evidence-frame localization, and presents ablation results in which models receive either (i) videos with all evidence frames removed or (ii) only short random clips. These experiments will quantify the performance drop and thereby confirm that the questions cannot be solved without long-context retrieval and multi-hop reasoning. We view this as a necessary and straightforward strengthening of the paper. revision: yes
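
A minimal sketch of ablation (i) promised above: score each question twice, once with the sampled frames intact and once with every annotated evidence frame removed, then report the accuracy gap. The sample_fn and evaluate hooks, like the field names, are placeholders for illustration rather than the authors' protocol.

    def drop_evidence_frames(frame_indices, evidence_frames):
        """Return the sampled frame indices with all annotated evidence removed."""
        evidence = set(evidence_frames)
        return [f for f in frame_indices if f not in evidence]

    def ablation_gap(dataset, sample_fn, evaluate):
        """Accuracy with intact sampling minus accuracy after evidence removal."""
        intact = ablated = 0
        for item in dataset:
            frames = sample_fn(item)
            intact += evaluate(item, frames)
            ablated += evaluate(item, drop_evidence_frames(frames, item["evidence_frames"]))
        n = len(dataset)
        return intact / n - ablated / n

    # Toy usage with stub hooks, just to show the shape of the comparison.
    toy = [{"evidence_frames": [10, 11], "answer": "B"}]
    sample_fn = lambda item: list(range(0, 100, 5))
    evaluate = lambda item, frames: int(any(f in item["evidence_frames"] for f in frames))
    print(f"accuracy drop when evidence is removed: {ablation_gap(toy, sample_fn, evaluate):.2f}")

A near-chance score in the ablated condition, and hence a large gap, would support the claim that the questions cannot be answered without the sparse evidence.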

Circularity Check

0 steps flagged

Empirical benchmark with no derivations or self-referential predictions

full rationale

The paper introduces MedHorizon as an in-the-wild benchmark for long-context medical video understanding, providing 1,253 evidence-grounded questions on 759 hours of procedures. It evaluates existing MLLMs without any claimed derivations, fitted parameters, or predictions derived from its own inputs. The results (e.g., best model at 41.1%) are direct empirical measurements against external models, independent of any self-referential construction. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own definitions or citations. The work is self-contained as a dataset and evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, free parameters, or new axioms; it relies on standard evaluation practices in computer vision and multimodal learning.

pith-pipeline@v0.9.0 · 5600 in / 1231 out tokens · 80965 ms · 2026-05-08T12:25:12.182623+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

89 extracted references · 20 canonical work pages · 12 internal anchors
