PEEK: Picking Essential frames via Efficient Knowledge distillation

Anas Filali Razzouki; Khalil Guetari; Killian Steunou; Mounîm A. El-Yacoubi; Yannis Tevissen

arxiv: 2605.31029 · v1 · pith:I4ESWN7Ynew · submitted 2026-05-29 · 💻 cs.CV

Pith reviewed 2026-06-28 22:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords frame selection · video captioning · knowledge distillation · adaptive sampling · ActivityNet Captions · MSR-VTT · vision language models

The pith

PEEK transfers caption-conditioned frame rankings from a teacher model into a lightweight visual-only student for efficient video captioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PEEK as a way to select the most useful frames from a video when captioning models can process only a small number of them. It does so by first using a heavier teacher that sees both frames and captions to produce relevance rankings, then distilling those rankings into a fast student that runs on frames alone. The student is reported to outperform prior adaptive selection methods in most configurations on ActivityNet Captions and MSR-VTT, with the largest gains appearing when only one or two frames are kept. The method also adds far less runtime overhead than competing adaptive approaches. A reader would care because uniform sampling wastes the limited frame budget on uninformative content while existing adaptive methods are too slow for practical use.

Core claim

PEEK distills caption-conditioned frame relevance rankings produced by a stronger teacher model into a lightweight temporal model that operates solely on visual content, enabling dynamic frame selection that outperforms state-of-the-art adaptive baselines on ActivityNet Captions and MSR-VTT, especially at one- and two-frame budgets, while adding only 5.2 percent to captioning runtime.
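
The Figure 1 caption describes the teacher as a frozen dual encoder that scores frames against ground-truth captions. A dual encoder usually rates an image-text pair by the similarity of their embeddings, so the sketch below assumes cosine similarity over toy 3-dimensional vectors that stand in for real embeddings. The function names `cosine` and `teacher_ranking` and all values are hypothetical; this illustrates the idea and is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def teacher_ranking(frame_embeddings, caption_embedding):
    """Frame indices ordered from most to least caption-relevant."""
    scores = [cosine(f, caption_embedding) for f in frame_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy embeddings (assumed values, not real encoder outputs).
frames = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
caption = [0.0, 1.0, 0.0]
print(teacher_ranking(frames, caption))
```

The ranking depends on the caption, which is why the teacher cannot be used at inference time, when no caption exists yet.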

What carries the argument

Distillation of caption-conditioned frame relevance rankings into a lightweight temporal student model that uses only visual input at inference.
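
One way to picture this distillation step is as a listwise ranking loss: the student's per-frame scores are pushed to reproduce the teacher's ordering. This summary does not state which loss the paper uses, so the ListMLE-style (Plackett-Luce) objective below, the name `listmle_loss`, and the toy scores are illustrative assumptions.

```python
import math

def listmle_loss(student_scores, teacher_scores):
    """Negative Plackett-Luce log-likelihood of the teacher's ranking
    under the student's scores (ListMLE-style; assumed, not from the paper)."""
    # Frame indices sorted from most to least relevant according to the teacher.
    order = sorted(range(len(teacher_scores)),
                   key=lambda i: teacher_scores[i], reverse=True)
    ranked = [student_scores[i] for i in order]
    loss = 0.0
    for pos in range(len(ranked)):
        tail = ranked[pos:]
        # Numerically stable log-sum-exp over the frames not yet placed.
        m = max(tail)
        lse = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += lse - ranked[pos]
    return loss

teacher = [0.9, 0.1, 0.5, 0.3]        # hypothetical caption-conditioned relevance
agreeing = [2.0, -1.0, 1.0, 0.0]      # student scores with the same ordering
opposite = [-1.0, 2.0, 0.0, 1.0]      # student scores with the reverse ordering
print(listmle_loss(agreeing, teacher) < listmle_loss(opposite, teacher))
```

Only the teacher's ordering enters the loss, not its raw score values, which matches the description of distilling rankings rather than scores.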

If this is right

PEEK wins 14 of 16 configurations on ActivityNet Captions across downstream vision-language models.
It obtains the best CIDEr scores for most frame budgets when only one or two frames are selected.
Zero-shot transfer to MSR-VTT is strongest at low frame budgets.
Runtime overhead is 5.2 percent versus 65.4 percent for CSTA and 211.9 percent for MaxInfo.
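
The overhead figures can be read as frame-selection time relative to captioning time. The sketch below works through that ratio. The timings are made-up values chosen only to reproduce the reported percentages, and using captioning-only time as the denominator is an assumption, since this summary does not specify the measurement protocol.

```python
def overhead_pct(selection_seconds, captioning_seconds):
    """Extra time spent on frame selection, as a percentage of captioning time."""
    return 100.0 * selection_seconds / captioning_seconds

# Hypothetical timings: captioning alone takes 10 s per video.
captioning = 10.0
for name, selection in [("PEEK", 0.52), ("CSTA", 6.54), ("MaxInfo", 21.19)]:
    total = captioning + selection
    print(f"{name}: +{overhead_pct(selection, captioning):.1f}% -> {total:.2f} s total")
```

Under this reading, an overhead above 100 percent means the selector takes longer than the captioning step it is meant to support.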

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation pattern could be tested on other video tasks that must select frames under tight compute limits, such as action recognition or video question answering.
If the student can approximate caption-aware decisions from visuals alone, similar teacher-student setups might reduce the need for text supervision during inference in other multimodal settings.
The efficiency gain suggests PEEK could be deployed in streaming or mobile video pipelines where adding even modest overhead is costly.

Load-bearing premise

The caption-conditioned relevance rankings produced by the teacher can be transferred effectively to a student that never sees captions.

What would settle it

On a held-out video set, if the frames chosen by the distilled student produce lower CIDEr scores than uniform sampling at the same low frame budget, the transfer claim would be falsified.
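
This test reduces to comparing mean CIDEr under two selection policies at one fixed frame budget. A minimal sketch follows, assuming per-video CIDEr scores have already been computed by an external scorer. The function names and scores are hypothetical, and `uniform_indices` uses window midpoints, which is one common convention for uniform sampling and may differ from the paper's.

```python
from statistics import mean

def uniform_indices(num_frames, k):
    """Midpoint frame index of each of k equal temporal windows."""
    return [min(num_frames - 1, int((i + 0.5) * num_frames / k)) for i in range(k)]

def transfer_claim_falsified(student_cider, uniform_cider):
    """True when student-selected frames score lower on average than uniform sampling."""
    return mean(student_cider) < mean(uniform_cider)

print(uniform_indices(num_frames=100, k=2))
student = [0.42, 0.31, 0.55]   # hypothetical per-video CIDEr, student-selected frames
uniform = [0.38, 0.30, 0.49]   # hypothetical per-video CIDEr, uniform frames, same budget
print(transfer_claim_falsified(student, uniform))
```

A real evaluation would also need enough videos, and ideally a significance test, before a small difference in means is treated as decisive.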

Figures

Figures reproduced from arXiv: 2605.31029 by Anas Filali Razzouki, Khalil Guetari, Killian Steunou, Mounîm A. El-Yacoubi, Yannis Tevissen.

**Figure 1.** Overview of PEEK. (a) A frozen SigLIP 2 dual encoder acts as an Oracle teacher, producing per-frame relevance targets from ground-truth captions. (b) A small Transformer distills the teacher’s ranking into a query-free selector operating on MobileCLIP2 visual embeddings alone. (c) At inference, the segment is split into k equal temporal windows and the highest-scoring frame within each (blue dot) is kept. …
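
The inference rule in panel (c) can be sketched in a few lines: split the per-frame scores into k equal windows and take the best frame in each. The function name `select_frames`, the integer window boundaries, and the first-maximum tie-break are assumptions made for this sketch. The caption does not say how uneven window sizes or ties are handled.

```python
def select_frames(scores, k):
    """Split the segment into k equal temporal windows and keep the
    highest-scoring frame index in each window."""
    n = len(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    picked = []
    for w in range(k):
        start = w * n // k
        end = (w + 1) * n // k   # exclusive; windows tile [0, n) without gaps
        best = max(range(start, end), key=lambda i: scores[i])
        picked.append(best)
    return picked

# Hypothetical student relevance scores for an 8-frame segment.
scores = [0.1, 0.7, 0.3, 0.2, 0.9, 0.4, 0.8, 0.6]
print(select_frames(scores, k=2))
```

Compared with a global top-k, picking one frame per window guarantees temporal coverage: two high-scoring neighbours in the same window cannot both be selected.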

**Figure 2.** Top frames selected on an ActivityNet Captions test segment in which a man plays …

**Figure 3.** Per-frame relevance scores on three ActivityNet Captions test segments. Curves …

Original abstract

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PEEK shows a distillation trick that improves low-frame video captioning over prior adaptive samplers with minimal overhead and released code.

The main point is that this paper distills caption-conditioned frame rankings from a teacher into a cheap visual-only student model, then uses the student for dynamic sampling. That setup produces better CIDEr scores than the cited baselines on ActivityNet Captions and MSR-VTT, especially at one or two frames, while adding only 5.2% to captioning time.

What stands out is the empirical package: wins in 14 of 16 ActivityNet configurations, zero-shot transfer results, and direct efficiency comparisons against CSTA and MaxInfo. Releasing code and the checkpoint makes the numbers checkable rather than just claimed.

The soft spots are modest. Results turn mixed at four and eight frames on MSR-VTT, which is unsurprising once temporal coverage improves. The method still depends on a strong teacher for the initial rankings, though the paper measures the transfer directly through downstream metrics. No hidden data issues or circular claims appear in the reported setup.

This is useful for anyone working on resource-limited video-language pipelines who needs better than uniform sampling without heavy compute. The work is incremental but grounded in concrete numbers and reproducible artifacts, so it belongs in the literature.

I would send it to peer review. The empirical claims are falsifiable and the implementation is public, which is enough to justify referee time even if revisions are needed on the higher-frame results.

Referee Report

0 major / 2 minor

Summary. The paper introduces PEEK, a dynamic frame sampling method for video captioning that distills caption-conditioned relevance rankings from a teacher model into a lightweight visual-only student model. It claims to outperform prior adaptive samplers on CIDEr across multiple VLMs and frame budgets on ActivityNet Captions (winning 14/16 configurations) and MSR-VTT, with particular strength at 1-2 frame budgets, while adding only 5.2% overhead compared to 65.4% and 211.9% for baselines CSTA and MaxInfo. Code and checkpoint are released.

Significance. If the empirical results hold under the released implementation, the work demonstrates a practical efficiency gain for video-language pipelines by replacing expensive adaptive sampling with a distilled lightweight selector that preserves or improves caption quality at low frame counts. The public release of code and checkpoint is a clear strength, directly supporting verification of the reported CIDEr gains and overhead numbers.

minor comments (2)

The abstract states that PEEK obtains the best CIDEr 'for most frame budgets' and wins 14/16 configurations on ActivityNet Captions; the results section should include an explicit table (or set of tables) breaking down per-VLM, per-budget scores so readers can verify the exact count without ambiguity.
The efficiency overhead figures (5.2%, 65.4%, 211.9%) are given as percentages added to captioning time; the methods or experimental setup section should state the precise baseline (uniform sampling? captioning model alone?) and measurement protocol used to obtain these numbers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the practical efficiency gains at low frame budgets, and the recommendation for minor revision. The public release of code and checkpoints is intended to support verification of the reported results.

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an empirical distillation pipeline for frame selection in video captioning, with performance claims based on direct evaluation against external baselines (CIDEr scores on ActivityNet Captions and MSR-VTT across multiple VLMs and frame budgets). No mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the method is tested via released code and checkpoints, rendering results falsifiable without reduction to internal definitions or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are described. The central claim rests on the unstated domain assumption that teacher rankings transfer effectively.

axioms (1)

Domain assumption: caption-conditioned frame relevance rankings from a teacher model can be distilled into a visual-only student without major performance loss.
This transfer is the core mechanism asserted to enable the efficiency and accuracy gains.

pith-pipeline@v0.9.1-grok · 5827 in / 1319 out tokens · 22187 ms · 2026-06-28T22:44:55.930750+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding
cs.CV 2026-06 unverdicted novelty 6.0

Fits a model where logit-accuracy scales linearly in log frame budget B with distance-dependent exponent α(D) that decays log-linearly with temporal distance D, based on 155k binary predictions across ten models.

[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report, F...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

METEOR: An Automatic Metric for MT Evalua- tion with Improved Correlation with Human Judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An Automatic Metric for MT Evalua- tion with Improved Correlation with Human Judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors,Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michiga...

2005

[4] [4]

El Yacoubi

Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, and Mounim A. El Yacoubi. Frame Sampling Strategies Matter: A Benchmark for small vision lan- guage models, September 2025. URLhttp://arxiv.org/abs/2509.14769. arXiv:2509.14769 [cs]

work page arXiv 2025

[5] [5]

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understand- ing

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understand- ing. pages 961–970, 2015. URLhttps://www.cv-foundation.org/ openaccess/content_cvpr_2015/html/Heilbron_ActivityNet_A_ Large-Scale_2015_CVPR_paper.html

2015

[6] [6]

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

Lianying Chao, Linfeng Yin, Peiyu Ren, Yifan Jiang, Qiaoyu Ren, Dingcheng Shan, Jing-cheng Pang, Sijie Wu, Xubin Li, and Kai Zhang. LFS: Learnable Frame Selec- tor for Event-Aware and Temporally Diverse Video Captioning, January 2026. URL http://arxiv.org/abs/2601.14594. 16STEUNOUET AL.: PEEK

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Less Is More: Picking Informative Frames for Video Captioning

Yangyu Chen, Shuhui Wang, Weigang Zhang, and Qingming Huang. Less Is More: Picking Informative Frames for Video Captioning, March 2018. URLhttp:// arxiv.org/abs/1803.01457. arXiv:1803.01457 [cs.CV]

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

MobileCLIP2: Improving Multi-Modal Re- inforced Training, August 2025

Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, and Hadi Pouransari. MobileCLIP2: Improving Multi-Modal Re- inforced Training, August 2025. URLhttp://arxiv.org/abs/2508.20691. arXiv:2508.20691 [cs.CV]

work page arXiv 2025

[9] [9]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zi- han Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-MME: The First-Ever Comprehensive Evalua- tion Benchmark of Multi-modal LLMs in Video Analysis,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

M-LLM Based Video Frame Selection for Efficient Video Understanding, March 2025

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, and Trishul Chilimbi. M-LLM Based Video Frame Selection for Efficient Video Understanding, March 2025. URLhttp: //arxiv.org/abs/2502.19680

work page arXiv 2025

[11] [11]

Adaptive Greedy Frame Selection for Long Video Understanding

Yuning Huang and Fengqing Zhu. Adaptive Greedy Frame Selection for Long Video Understanding, March 2026. URLhttp://arxiv.org/abs/2603.20180

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Dense- Captioning Events in Videos, May 2017

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense- Captioning Events in Videos, May 2017. URLhttp://arxiv.org/abs/1705. 00754

2017

[13] [13]

MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum V olume for Enhanced Video Understanding, December 2025

Pengyi Li, Irina Abdullaeva, Alexander Gambashidze, Andrey Kuznetsov, and Ivan Oseledets. MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum V olume for Enhanced Video Understanding, December 2025. URLhttp://arxiv. org/abs/2502.03183

work page arXiv 2025

[14] [14]

KeyVideoLLM: Towards Large-scale Video Keyframe Selection, August 2024

Hao Liang, Jiapeng Li, Tianyi Bai, Xijie Huang, Linzhuang Sun, Zhengren Wang, Conghui He, Bin Cui, Chong Chen, and Wentao Zhang. KeyVideoLLM: Towards Large-scale Video Keyframe Selection, August 2024. URLhttp://arxiv.org/ abs/2407.03104

work page arXiv 2024

[15] [15]

ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. As- sociation for Computational Linguistics. URLhttps://aclanthology.org/ W04-1013/

2004

[16] [16]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInterna- tional conference on learning representations, 2019. URLhttps://openreview. net/forum?id=Bkg6RiCqY7

2019

[17] Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models, April 2025. URL https://arxiv.org/abs...

[18] Hyunjong Ok and Jaeho Lee. TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark, March 2026. URL http://arxiv.org/abs/2509.01167

[19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Com... doi: 10.3115/1073083.1073135

[20] R. L. Plackett. The Analysis of Permutations. Applied Statistics, 24(2):193, 1975. ISSN 00359254. doi: 10.2307/2346567. URL https://www.jstor.org/stable/2346567?origin=crossref

[21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, February 2021. URL http://arxiv.org/abs/2103.00020

[22] Jaewon Son, Jaehun Park, and Kwangsu Kim. CSTA: CNN-based Spatiotemporal Attention for Video Summarization, May 2024. URL http://arxiv.org/abs/2405.11905. arXiv:2405.11905 [cs.CV]

[23] Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, and Garin Kessler. From Frames to Clips: Training-free Adaptive Key Clip Selection for Long-Form Video Understanding, December 2025. URL http://arxiv.org/abs/2510.02262. arXiv:2510.02262 [cs]

[24] Wenhui Tan, Ruihua Song, Jiaze Li, Jianzhong Ju, and Zhenbo Luo. Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding, January 2026. URL http://arxiv.org/abs/2601.11359

[25] Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive Keyframe Sampling for Long Video Understanding, February 2025. URL http://arxiv.org/abs/2502.21271

[26] Yolo Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video Understanding with Large Language Models: A Survey, December 2023. URL https://arxiv.org/abs/...

[27] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[28] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features...

[29] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based Image Description Evaluation, November 2014. URL https://arxiv.org/abs/1411.5726v2

[30] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding, July 2024. URL https://arxiv.org/abs/2407.15754v1

[31] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: Theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning - ICML ’08, pages 1192–1199, Helsinki, Finland, 2008. ACM Press. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390306. URL http://portal.acm.org/citation.cfm?doid=1390156.1390306

[33] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5288–5296, Las Vegas, NV, USA, June 2016. IEEE. ISBN 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.571. URL http://ieeexplore.ieee.org/document/7780940/

[35] Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, Hao Zhang, and Qianru Sun. Frame-Voyager: Learning to Query Frames for Video Large Language Models, March. URL http://arxiv.org/abs/2410.03226

[37] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training, September 2023. URL http://arxiv.org/abs/2303.15343. arXiv:2303.15343 [cs]

[38] Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, and Jian Luan. Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs, July 2025. URL http://arxiv.org/abs/2506.22139

[39] Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia. Apollo: An Exploration of Video Understanding in Large Multimodal Models, December 2024. URL http://arxiv.org/abs/2412.10360. arXiv:2412.10360 [cs]

[40] Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, and Weining Shen. VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding, February 2026. URL http://arxiv.org/abs/2602.04094