
arxiv: 2605.10762 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI


GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Ali Habibullah, Lama Ayash, Mohamed Eltahir, Naeemullah Khan, Tanveer Hussain

Pith reviewed 2026-05-12 03:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords long-video understanding · vision-language models · frame selection · posterior probing · adaptive compute · training-free inference · test-time adaptation

The pith

GridProbe selects long-video frames by probing a frozen VLM's own answer posteriors on a grid, cutting compute up to 3x with little accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-video VLMs face quadratic attention costs when processing thousands of frames in one forward pass. GridProbe replaces similarity-based frame selectors, which fail on reasoning queries, with a training-free posterior probe that scores evidence directly in answer space using the frozen model. Frames are laid out on a K×K grid; short row and column probes yield peak posteriors whose outer product forms an importance map. Shape-Adaptive Selection then uses the map's skewness and kurtosis to set a per-question frame budget M_eff instead of a fixed count. The result is sub-quadratic cost with accuracy that matches or exceeds the full-frame baseline on Video-MME-v2 and LongVideoBench, plus gains from pairing small selectors with larger QA models.
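To see why the cost is sub-quadratic, a back-of-envelope sketch (our arithmetic, not the paper's TFLOPs accounting): a monolithic pass over K² frames pays attention cost on the order of K⁴, while 2K probes of K frames each pay about 2K³, plus one focused pass over M_eff frames. The m_eff value below is illustrative.

```python
# Back-of-envelope attention cost, counting a pass over n frames as n^2
# (our toy accounting, not the paper's measured TFLOPs).
def monolithic_cost(k: int) -> int:
    n = k * k                   # all K^2 candidate frames in one forward pass
    return n * n                # quadratic attention: K^4

def gridprobe_cost(k: int, m_eff: int) -> int:
    probes = 2 * k * k**2       # 2K row/column probes, each over K frames
    focused = m_eff**2          # one focused pass on the selected frames
    return probes + focused

k = 12                          # grid size used in the paper's Figure 4
print(monolithic_cost(k))       # 20736
print(gridprobe_cost(k, 48))    # 5760 -> ~3.6x cheaper (m_eff=48 illustrative)
```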

Core claim

GridProbe arranges frames on a K×K grid and runs lightweight row and column probes on the frozen VLM, each reporting its peak posterior as query-conditioned confidence. The outer product of these posteriors produces an interpretable importance map. Shape-Adaptive Selection applies a closed-form rule based on the map's skewness and kurtosis to replace the fixed frame budget M with a per-question M_eff that tracks intrinsic difficulty without seeing the answer, delivering adaptive test-time compute.
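As a reading aid, a minimal sketch of the map construction, assuming the probe confidences have already been extracted from the frozen VLM; the probe calls themselves are not shown, and `importance_map` and the toy numbers are ours:

```python
import numpy as np

# Minimal sketch of the outer-product importance map. Assumes row_conf[i] and
# col_conf[j] are the peak answer posteriors already extracted from the frozen
# VLM's row-i and column-j probes.
def importance_map(row_conf: np.ndarray, col_conf: np.ndarray) -> np.ndarray:
    m = np.outer(row_conf, col_conf)   # m[i, j] scores the frame at cell (i, j)
    return m / m.sum()                 # normalize to a distribution over cells

row_conf = np.array([0.30, 0.85, 0.25, 0.20])  # illustrative row-probe peaks
col_conf = np.array([0.20, 0.25, 0.90, 0.30])  # illustrative column-probe peaks
m = importance_map(row_conf, col_conf)
print(np.unravel_index(m.argmax(), m.shape))   # (1, 2): both probes confident
```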

What carries the argument

The importance map formed by the outer product of row and column peak posteriors, which drives Shape-Adaptive Selection to determine question-specific M_eff from skewness and kurtosis.
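The paper's closed-form rule is not reproduced in this review, so the sketch below only encodes the stated trend (strongly peaked maps get a small budget, near-uniform maps keep close to the full K² budget); the `m_eff` function, its interpolation, and the `m_min` floor are our inventions.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical stand-in for Shape-Adaptive Selection; the paper's actual
# closed-form rule is not given here. Peaked maps (large |skewness| and
# excess kurtosis) get a small budget, flat maps keep most of the K^2 cells.
def m_eff(importance: np.ndarray, m_min: int = 4) -> int:
    flat = importance.ravel()
    peakedness = abs(skew(flat)) + max(kurtosis(flat), 0.0)  # ad hoc score
    frac = 1.0 / (1.0 + peakedness)   # invented interpolation, not the paper's
    return int(np.clip(round(frac * flat.size), m_min, flat.size))

peaked = np.outer([0.05, 0.90, 0.05], [0.05, 0.90, 0.05])
flattish = np.outer([0.34, 0.33, 0.33], [0.34, 0.33, 0.33])
print(m_eff(peaked), m_eff(flattish))  # peaked map clipped to the floor,
                                       # flatter map keeps a larger budget
```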

If this is right

  • On Video-MME-v2 it matches the full-frame baseline within 1.6 pp average accuracy at 3.36× lower TFLOPs.
  • On LongVideoBench it improves accuracy by 0.9 pp while using 0.35× the compute.
  • A 2B selector paired with a 4B or 8B QA model outperforms the 2B monolithic baseline by up to 4.0 pp at 0.52× compute.
  • The resulting importance maps support interpretability for diagnostics, grounding, and distillation without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method shows that a VLM's internal posterior distribution can locate evidence even when its contrastive pretraining features cannot.
  • Decoupling the selector from the QA model points to modular long-video systems that scale without joint retraining.
  • The same grid-probe idea could be applied to streaming video or other modalities where evidence location needs to be inferred at test time.

Load-bearing premise

The peak posteriors from the lightweight row and column probes reliably mark which frames hold evidence for the question, including on reasoning-heavy queries where contrastive signals are weak.
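Operationally, "peak posterior" can be read as the largest probability the model assigns over the answer options after seeing a probe. A minimal sketch, assuming the probe is a multiple-choice prompt and next-token logits are accessible; the option token ids and numbers below are made up:

```python
import torch

# Hedged sketch: `logits` stands for the frozen VLM's next-token logits after
# a multiple-choice probe over one grid row; `option_ids` are the token ids of
# the answer letters. Neither the prompt nor the model wiring is shown.
def peak_posterior(logits: torch.Tensor, option_ids: list[int]) -> float:
    option_logits = logits[option_ids]             # restrict to answer tokens
    posterior = torch.softmax(option_logits, dim=-1)
    return posterior.max().item()                  # confidence, not the answer

logits = torch.zeros(32_000)                       # toy vocabulary
option_ids = [319, 350, 315, 360]                  # made-up ids for A, B, C, D
logits[350] = 4.0                                  # the model leans hard to "B"
print(round(peak_posterior(logits, option_ids), 3))  # 0.948 -> confident row
```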

What would settle it

A collection of reasoning queries on which the importance map concentrates on the wrong frames, producing accuracy below the monolithic baseline despite the claimed compute savings.

Figures

Figures reproduced from arXiv: 2605.10762 by Ali Habibullah, Lama Ayash, Mohamed Eltahir, Naeemullah Khan, Tanveer Hussain.

Figure 1
Figure 1: (a) Video-MME-v2 Pareto across QA model sizes. GridProbe variants in the green region Pareto-dominate the 2B baseline. (b) Compute reduction across K at fixed 2B QA. view at source ↗
Figure 2
Figure 2: GridProbe pipeline. Stage 1: 2K row/column probes on K² candidate frames yield an importance map. Stage 2: one focused pass on the top-M_eff cells, sized adaptively from the map's distribution shape. view at source ↗
Figure 3
Figure 3: Encoder-space (top) vs. answer-space (bottom) selection signals. view at source ↗
Figure 4
Figure 4: GridProbe's adaptive M_eff (blue) and the 2B baseline accuracy (red), smoothed across signed skew(M) on V2 (K=12, n=3,200). The two curves mirror each other: both signed extremes route to small M_eff on intrinsically easier questions, while the near-uniform middle gets near-K² coverage on intrinsically harder ones, an empirical realization of the redundancy principle (§3.3). view at source ↗
Figure 5
Figure 5: Three Video-MME-v2 queries exercise three distribution-shape regimes. view at source ↗
read the original abstract

Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; for training-free selectors this is commonly done via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GridProbe, a training-free posterior-probing inference paradigm for long-video VLMs. Frames are arranged on a K×K grid; lightweight row R and column C probes extract peak posteriors as query-conditioned confidence scores from the frozen VLM; their outer product yields an interpretable importance map; skewness and kurtosis of this map drive a closed-form Shape-Adaptive Selection rule that replaces a fixed frame budget M with a per-question M_eff, achieving sub-quadratic attention cost. Empirical results show near-parity or better accuracy than monolithic baselines on Video-MME-v2 and LongVideoBench, with additional gains from decoupling small selector models from larger QA models.

Significance. If the localization assumption holds, the work delivers practical efficiency gains (3.36× TFLOPs reduction at 1.6 pp accuracy drop on Video-MME-v2; +0.9 pp at 0.35× compute on LongVideoBench) plus model-decoupling benefits without retraining and interpretable importance maps. These are load-bearing strengths for test-time adaptive compute in VLMs.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (method): the headline gains (matching baseline within 1.6 pp at 3.36× TFLOPs on Video-MME-v2; Pareto dominance +0.9 pp at 0.35× on LongVideoBench) depend on the claim that peak posteriors from linear row/column probes reliably localize evidence for reasoning-heavy queries (negation, cross-frame counting, holistic summarization). Because each probe sees only a linear subset of frames, the outer-product map and subsequent Shape-Adaptive Selection can inherit partial or spurious signals; this assumption is central and requires explicit per-query-type validation or failure-case analysis to support the reported improvements.
  2. [§4] §4 (experiments): the Pareto claims and decoupling results (2B selector + 4B/8B QA) are presented without visible error bars, data-split details, or controls for post-hoc tuning/selection bias. This makes it impossible to verify that M_eff genuinely tracks intrinsic question difficulty rather than benchmark artifacts, directly affecting the soundness of the adaptive-compute conclusion.
minor comments (2)
  1. [Abstract] Abstract: the definition of M_eff and the role of skewness/kurtosis in the closed-form rule could be stated more explicitly to aid readers before the method section.
  2. [§3] Notation: the grid size K and probe dimensions are introduced without a short illustrative example; a small diagram or equation for the outer-product map would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the efficiency and interpretability contributions of GridProbe. We address each major comment below and indicate the revisions planned for the next manuscript version.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method): the headline gains (matching baseline within 1.6 pp at 3.36× TFLOPs on Video-MME-v2; Pareto dominance +0.9 pp at 0.35× on LongVideoBench) depend on the claim that peak posteriors from linear row/column probes reliably localize evidence for reasoning-heavy queries (negation, cross-frame counting, holistic summarization). Because each probe sees only a linear subset of frames, the outer-product map and subsequent Shape-Adaptive Selection can inherit partial or spurious signals; this assumption is central and requires explicit per-query-type validation or failure-case analysis to support the reported improvements.

    Authors: We agree that the localization assumption for reasoning-heavy queries is central and that a more granular validation would strengthen the paper. While the reported results on Video-MME-v2 and LongVideoBench already encompass such queries and demonstrate near-parity performance, the current manuscript does not include an explicit per-query-type breakdown or dedicated failure-case analysis. We will revise §4 to add this: accuracy stratified by query category (negation, cross-frame counting, summarization), plus selected importance-map visualizations for both successful and partial-localization cases, to directly address potential spurious signals from the linear probes. revision: yes

  2. Referee: [§4] §4 (experiments): the Pareto claims and decoupling results (2B selector + 4B/8B QA) are presented without visible error bars, data-split details, or controls for post-hoc tuning/selection bias. This makes it impossible to verify that M_eff genuinely tracks intrinsic question difficulty rather than benchmark artifacts, directly affecting the soundness of the adaptive-compute conclusion.

    Authors: We acknowledge that the experimental section would benefit from greater statistical transparency. The manuscript reports mean accuracies but omits error bars, explicit split details, and bias controls. In the revision we will add: (i) standard-error bars over multiple runs, (ii) precise documentation of the Video-MME-v2 and LongVideoBench splits, and (iii) additional ablations comparing M_eff against fixed-budget and random-selection baselines on the same splits. We will also clarify that Shape-Adaptive Selection hyperparameters were tuned on a small held-out validation set disjoint from the reported test sets, thereby addressing post-hoc selection concerns. revision: yes

Circularity Check

0 steps flagged

No circularity: mechanism defined directly from VLM forward passes

full rationale

The derivation consists of arranging frames into a K×K grid, extracting peak posteriors from row/column probes on the frozen VLM, forming an importance map via outer product, and applying a closed-form Shape-Adaptive Selection rule driven by skewness and kurtosis. These operations are explicit algorithmic definitions that consume the model's own outputs; they do not reduce any claimed performance gain or M_eff to a quantity fitted from the target data or to a self-citation chain. Empirical results on Video-MME-v2 and LongVideoBench are presented as external validation rather than forced equalities. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that a frozen VLM's answer posteriors on synthetic row/column probes correlate with true evidence location for arbitrary queries. No free parameters are explicitly named in the abstract, though K (grid size) and any thresholds inside the closed-form selection rule function as tunable choices. No new entities are postulated.

axioms (1)
  • domain assumption A frozen VLM's peak posteriors on lightweight row and column probes serve as reliable proxies for frame importance on the original query.
    Invoked to justify replacing similarity-based selection with posterior probing; central to the claim that the method works for reasoning-heavy queries where contrastive signals fail.

pith-pipeline@v0.9.0 · 5685 in / 1512 out tokens · 35340 ms · 2026-05-12T03:41:09.369138+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

  1. [1]

    Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Mdp3: A training-free approach for list-wise frame selection in video-llms

Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Ming Li. Mdp3: A training-free approach for list-wise frame selection in video-llms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24090–24101, 2025

  3. [3]

Frame-voyager: Learning to query frames for video large language models

Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, et al. Frame-voyager: Learning to query frames for video large language models. arXiv preprint arXiv:2410.03226, 2024

  4. [4]

    status":

Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, and Yang You. Focus: Efficient keyframe selection for long video understanding. arXiv preprint arXiv:2510.27280, 2025

  5. [5]

    Hfs: Holistic query-aware frame selection for efficient video reasoning

    Yiqing Yang and Kin-Man Lam. Hfs: Holistic query-aware frame selection for efficient video reasoning. arXiv preprint arXiv:2512.11534, 2025

  6. [6]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  7. [7]

    Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding

Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, and Si Liu. Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8592–8603, 2025

  8. [8]

    Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  9. [9]

    Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023

  10. [10]

Frag: Frame selection augmented generation for long video and long document understanding

De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, and Jan Kautz. Frag: Frame selection augmented generation for long video and long document understanding. arXiv preprint arXiv:2504.17447, 2025

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  12. [12]

Videoatlas: Navigating long-form video in logarithmic compute

Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, and Naeemullah Khan. Videoatlas: Navigating long-form video in logarithmic compute. arXiv preprint arXiv:2603.17948, 2026

  13. [13]

Videoagent: Long-form video understanding with large language model as agent

Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76. Springer, 2024

  14. [14]

Adaptive video understanding agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning

Sullam Jeoung, Goeric Huybrechts, Bhavana Ganesh, Aram Galstyan, and Sravan Bodapati. Adaptive video understanding agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning. arXiv preprint arXiv:2410.20252, 2024

  15. [15]

    Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-MME-v2: Towards the next stage in benchmarks for comprehensive video understanding. arXiv preprint arXiv:2604.05015, 2026

  16. [16]

Longvideobench: A benchmark for long-context interleaved video-language understanding

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024