pith. machine review for the scientific record.

arxiv: 2605.09860 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

Anudeepsekhar Bolimera, Chen Li, Fangyi Chen, Han Zhang, Marios Savvides, Zhantao Yang

Pith reviewed 2026-05-12 04:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords commitment depth · temporal abstraction · vision-language policy · long-horizon reasoning · sliding puzzle · Sokoban · adaptive replanning · open-loop execution

The pith

Vision-language policies that learn a state-dependent commitment depth solve more long-horizon puzzles with fewer actions than any fixed depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that long-horizon reasoning requires not only choosing actions but also deciding how many primitive steps to execute open-loop before the next observation. By turning commitment depth into a learnable output predicted from the visual state inside the same vision-language policy, the approach automatically balances replanning cost against compounding execution error. On Sliding Puzzle and Sokoban the resulting adaptive policy reaches higher success rates while using fewer total actions than every constant-depth baseline. A supporting theoretical result shows that a state-conditioned commitment strategy strictly outperforms every fixed depth once the locally optimal depth varies across states.
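As a concrete reading of that mechanism, the loop below is a minimal Python sketch of replan-and-commit control: one forward pass per decision point returns an action sequence together with a commitment depth h, the first h actions are executed open-loop, and only then does the policy observe again. The policy and environment interfaces are hypothetical stand-ins for illustration, not the authors' code; a fixed-depth baseline is the same loop with depth held constant.

# Minimal sketch of the replan/commit loop described above.
# `policy.predict` and the `env` API are hypothetical stand-ins, not the paper's code.
def run_episode(policy, env, decision_budget):
    """Roll out one episode with a state-conditioned commitment depth."""
    obs = env.reset()
    total_actions = 0
    for _ in range(decision_budget):          # one forward pass per decision point
        actions, depth = policy.predict(obs)  # e.g. depth drawn from {1, 2, 4, 8}
        for a in actions[:depth]:             # execute `depth` primitive actions open-loop
            obs, done, _ = env.step(a)        # assumed (observation, done, info) signature
            total_actions += 1
            if done:
                return True, total_actions
        # only now does the policy see the new observation and replan
    return False, total_actions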

Core claim

The central claim is that a model-native vision-language policy that jointly predicts actions and a state-conditioned commitment depth Pareto-dominates every non-degenerate fixed-depth baseline on Sliding Puzzle and Sokoban, delivering up to 12.5 percentage points higher solve rate with roughly 25 percent fewer primitive actions per episode. A 7B model already exceeds GPT-5.5 and Claude Sonnet, while every tested open-weight vision-language model scores zero zero-shot success. The accompanying theory further shows that state-conditioned commitment strictly dominates any fixed depth whenever locally optimal depths vary by state.

What carries the argument

Commitment depth, defined as the number of primitive actions executed open-loop between replans and treated as a learnable state-conditioned variable output by the policy together with the action sequence.

Load-bearing premise

The vision-language policy can reliably predict a useful commitment depth from the visual state alone, and the standard surrogate for execution cost and error accumulation used in the analysis matches real performance.
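A hedged sketch of that surrogate, reconstructed from the footnotes quoted in reference entries 60 and 61 rather than from the paper's own equations (the symbols J_s, r, c, α below are assumptions about its form, not the authors' notation): the per-state cost trades an amortized replanning cost against an open-loop failure probability that grows with the commitment depth,

    J_s(h) \;=\; \frac{r}{h} \;+\; q_s(h),
    \qquad q_s(h) \;=\; \min\bigl(1,\, c(s)\, h^{\alpha(s)}\bigr),
    \qquad h^\star(s) \;=\; \arg\min_{h \in H} J_s(h).

Under the conditions quoted in entry 60 (q_s(h)/h \to 0 as h \to 0^{+} and q_s(h) \to 1 as h grows) an interior optimum h^\star(s) exists, and if h^\star(s) is a non-constant function of the state, then

    \mathbb{E}_s\bigl[J_s(h^\star(s))\bigr] \;<\; \min_{h \in H} \mathbb{E}_s\bigl[J_s(h)\bigr],

which is the strict-dominance statement this premise feeds.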

What would settle it

Running the identical Sliding Puzzle and Sokoban evaluations with the learned policy forced to use a single fixed depth equal to the average of its own predicted depths, and finding no drop in solve rate and no increase in actions.
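A minimal sketch of that deciding experiment, reusing the run_episode loop sketched earlier. The wrapper below, the policy interface, and the choice to estimate the mean depth from initial observations are assumptions for illustration, not the paper's protocol.

import statistics

def forced_depth_policy(policy, fixed_depth):
    # Wrap a trained policy so every decision commits to exactly `fixed_depth` steps.
    # Assumes the policy can emit at least `fixed_depth` actions per forward pass.
    class Forced:
        def predict(self, obs):
            actions, _ = policy.predict(obs)
            return actions, fixed_depth
    return Forced()

def settle(policy, envs, budget):
    # Same policy evaluated twice: adaptive, then clamped to its own mean predicted depth.
    adaptive = [run_episode(policy, env, budget) for env in envs]
    mean_h = round(statistics.mean(policy.predict(env.reset())[1] for env in envs))
    forced = [run_episode(forced_depth_policy(policy, mean_h), env, budget) for env in envs]
    solve_rate = lambda runs: sum(ok for ok, _ in runs) / len(runs)
    actions = lambda runs: statistics.mean(n for _, n in runs)
    return {"adaptive": (solve_rate(adaptive), actions(adaptive)),
            "forced_mean_depth": (solve_rate(forced), actions(forced))}

If the forced variant matched the adaptive one on both numbers, the reported gains would be attributable to training rather than to adaptivity.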

Figures

Figures reproduced from arXiv: 2605.09860 by Anudeepsekhar Bolimera, Chen Li, Fangyi Chen, Han Zhang, Marios Savvides, Zhantao Yang.

Figure 1. Adaptive commitment depth strictly Pareto-dominates the strongest fixed-depth baseline on both tasks, and the oracle depth distribution is markedly non-degenerate. (a, b) Solve rate vs. primitive actions per episode under the loose decision budget; each point is a fixed-depth policy at the indicated h, and the marker labeled “A” is our adaptive policy. Adaptive sits to the upper-left of every h ∈ H = {1, 2…
Figure 2. Unified VLM policy with state-conditioned commitment depth. Backbone → z_tk; π_h emits h_k ∈ {1, 2, 4, 8}; π_a generates a length-h_k sequence, executed open-loop. Heads share the backbone and one GRPO objective; no solver or planner at evaluation. The adjacent text defines the decision budget: K ∈ ℕ caps the number of decision points per episode—equivalently, the number of forward passes the policy is…
Figure 3. Reports the headline result at the loose budget. Adaptive achieves strictly higher solve rate and strictly fewer primitive actions per episode than the strongest fixed-depth baseline on both tasks: 56.3% | 37 vs. 43.8% | 49 on Sliding (h = 4), and 35.9% | 30 vs. 32.8% | 40 on Sokoban (h = 8). Fixed-depth baselines are trained on multi-h macro-action data (§4); the strongest-fixed claim is verified in App. F.2 and Pareto-d…
Figure 4. State-dependent commitment depth tracks progress-signal reliability. (a) Top: commitment-depth distribution on solved test trajectories (loose budget), bucketed by remaining distance (near/mid/far); mean h̄ (red diamonds, right axis) decreases monotonically with distance on both tasks. (b) Bottom: per-decision failure rate P(Δd ≤ 0) (bars, left axis) and mean improvement Δ̄d (diamonds, ±1σ, right axis)—bo…
Figure 5. Task-conditioned commitment depth emerges during RL training. Fixed-h solve rate at three checkpoints. Sliding: h⋆ = 4 throughout. Sokoban: h⋆ shifts from 4 early to 6 at middle/late checkpoints. The unified policy adapts to each task.
Figure 6. Adaptive trajectories on solved episodes are straighter, not just shorter. Per-decision metrics on solved episodes (loose budget). Left: wasted actions (Δd = 0). Middle: backward actions (Δd < 0). Right: mean progress per action Δ̄d. Adaptive matches or exceeds both fixed-depth baselines on every metric on both tasks.
Figure 7. Training-time entropy of both heads. Per-step entropy averaged over the rollout state distribution, with 5-step moving average. Left: Sliding; right: Sokoban. Depth head stays at ∼0.91 log 4 (dashed: uniform upper bound); action-head minima 0.300 (Sliding), 0.681 (Sokoban). Neither head collapses.
Figure 8. Detailed pipeline for data generation and collection. For each episode, a reproducible and configurable random environment and a selected collector agent are initialized. The agent predicts a commitment for open-loop execution and renders the next state for prediction through a separate renderer. All commitments and states are saved and compared against the heuristic solver for a ground-truth-based signal. …
Figure 9. Task illustrations. For each task we show one representative instance, with the initial state on the left and the goal state on the right of each pair. Left pair: Sliding Puzzle (3×3). Tiles must be rearranged into canonical row-major order (goal) by sliding tiles one at a time into the empty cell; every primitive action is reversible. Right pair: Sokoban. The agent (Player) must push every box (B) onto a…
Figure 10. Single-commitment training pipeline: solve rate at every (h_train, h_eval) combination on each task and training pipeline (loose evaluation budget, test set). Top: per-cell solve rate; cells along the diagonal (h_train = h_eval) are the only configurations in which training and evaluation align; the missing cell (Sokoban SFT, h_train = 2, h_eval = 8) is marked “–”. Bottom: best diagonal-cell solve rate compared…
Figure 11. Adaptive commitment-depth allocation across K_train. Top: Sliding (K_eval = K_loose = 15; K_train ∈ {4, 5, 10}). Bottom: Sokoban (K_eval = 6; K_train ∈ {3, 6, 10}). Defaults are 5 and 6. Left: adaptive (blue) and best fixed-depth (orange) solve rate. Middle: best fixed depth h⋆. Right: adaptive’s π_h distribution, stacked.
Figure 12. Complete efficiency breakdown across solved/unsolved episodes and tight/loose budgets. Per-decision metrics for adaptive vs. RL best fixed vs. SFT best fixed. Top two rows: per-decision wasted actions, backward actions, and progress per action across all (outcome, budget) combinations. Bottom row, left two: progress per action on solved episodes at tight and loose budgets. Bottom row, right: adaptive’s…
Figure 13. Example system prompt for the zero-shot vision-language-model baseline on Sokoban. The puzzle name and its state and action format descriptions are task-dependent and differ for Sliding Puzzle. Notes on individual baselines: all seven open-weight VLMs evaluated score 0% solve rate on both tasks; this includes the open-weight version of our own backbone, Qwen2.5-VL-7B, evaluated ze…
Original abstract

Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision-language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision-language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes learning state-conditioned commitment depths as part of a vision-language policy for long-horizon tasks such as Sliding Puzzle and Sokoban. The adaptive policy is claimed to Pareto-dominate fixed-depth baselines, achieving up to 12.5 percentage points higher solve rates and approximately 25% fewer primitive actions per episode, while also outperforming GPT-5.5 and Claude Sonnet despite using a 7B backbone. A theoretical analysis shows that state-conditioned commitment strictly dominates any fixed depth under the standard surrogate when locally optimal depths vary across states.

Significance. The approach of making temporal abstraction learnable within the policy could have significant implications for efficient long-horizon vision-language reasoning if the results hold. The empirical Pareto dominance on two standard domains and the theoretical dominance argument are notable strengths. However, the absence of error bars, training details, and specific ablations in the provided claims reduces the immediate impact.

major comments (3)
  1. [Empirical Evaluation] The central empirical claim of Pareto dominance and performance gains (12.5 pp solve rate, 25% fewer actions) lacks an ablation study that fixes the commitment depth prediction while retaining the joint policy training. Without this, it is unclear whether the improvements stem from adaptive depths or from the joint optimization of the prediction head itself.
  2. [Theoretical Analysis] The strict dominance result is derived from the standard commitment-depth surrogate cost model. However, no empirical validation is provided showing that this surrogate accurately captures the real execution costs and error accumulation in the Sliding Puzzle and Sokoban environments.
  3. [Results] No information is given on the distribution of predicted commitment depths across states or per-state comparisons to locally optimal depths. This is necessary to confirm that the policy learns meaningful state-varying commitments rather than a near-constant depth.
minor comments (2)
  1. [Abstract] The abstract states that every tested open-weight vision-language model achieves 0% zero-shot success, but does not specify which models were evaluated or the exact zero-shot setup.
  2. [Abstract] Training details, hyperparameters, and error bars for the reported solve rates and action counts are not mentioned, which would aid in assessing the reliability of the 12.5 pp and 25% figures.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and outline revisions that will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Empirical Evaluation] The central empirical claim of Pareto dominance and performance gains (12.5 pp solve rate, 25% fewer actions) lacks an ablation study that fixes the commitment depth prediction while retaining the joint policy training. Without this, it is unclear whether the improvements stem from adaptive depths or from the joint optimization of the prediction head itself.

    Authors: We agree that isolating the contribution of state-conditioned depths from joint training is important. In the revised manuscript we will add an ablation that trains the full model jointly but then freezes the commitment-depth head to output a constant value (the mean depth observed during training) at evaluation time. This will be compared directly against the adaptive policy on the same backbone and training regime to quantify the benefit attributable to adaptivity. revision: yes

  2. Referee: [Theoretical Analysis] The strict dominance result is derived from the standard commitment-depth surrogate cost model. However, no empirical validation is provided showing that this surrogate accurately captures the real execution costs and error accumulation in the Sliding Puzzle and Sokoban environments.

    Authors: The theoretical claim is stated under the standard surrogate used throughout the temporal-abstraction literature. To address the request for validation, the revision will include a new subsection that correlates predicted commitment depths with measured per-step error rates and replanning frequency observed in our rollouts, thereby providing empirical grounding for the surrogate within the evaluated domains. revision: yes

  3. Referee: [Results] No information is given on the distribution of predicted commitment depths across states or per-state comparisons to locally optimal depths. This is necessary to confirm that the policy learns meaningful state-varying commitments rather than a near-constant depth.

    Authors: We concur that such diagnostics are needed to demonstrate that the policy exploits state variation. The revised manuscript will add a figure showing the histogram of predicted depths over held-out states for both environments, together with per-state comparisons to locally optimal depths (computed via breadth-first search on solvable small instances; one such diagnostic is sketched below) to confirm that the learned policy deviates meaningfully from any fixed depth. revision: yes
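A minimal Python sketch of one way to operationalize that diagnostic: breadth-first search yields the exact distance-to-goal on small instances, and the locally optimal depth is taken here to be the candidate h whose open-loop chunk achieves the best oracle-measured progress per primitive action. The simulator interface, the candidate set {1, 2, 4, 8}, and the progress-per-action criterion are assumptions for illustration, not the authors' definition.

from collections import deque

def distance_to_goal(state, successors, is_goal):
    # Exact distance via BFS; assumes a small, enumerable, hashable state space.
    frontier, seen = deque([(state, 0)]), {state}
    while frontier:
        s, d = frontier.popleft()
        if is_goal(s):
            return d
        for nxt in successors(s):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # unsolvable from this state

def locally_optimal_depth(policy, sim, state, depths=(1, 2, 4, 8)):
    d0 = distance_to_goal(state, sim.successors, sim.is_goal)
    if d0 is None:
        return None
    best_h, best_rate = None, float("-inf")
    for h in depths:
        s = state
        actions, _ = policy.predict(sim.render(s))   # hypothetical policy/render interface
        for a in actions[:h]:                        # execute the h-step chunk open-loop
            s = sim.transition(s, a)
        d1 = distance_to_goal(s, sim.successors, sim.is_goal)
        if d1 is None:
            continue                                 # chunk reached a dead end (possible in Sokoban)
        rate = (d0 - d1) / h                         # oracle progress per primitive action
        if rate > best_rate:
            best_h, best_rate = h, rate
    return best_h

Comparing the histogram of the policy's predicted depths against the histogram of these oracle depths would show directly whether the learned commitments track state variation.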

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper's central theoretical result states that state-conditioned commitment strictly dominates fixed depth under the standard commitment-depth surrogate whenever locally optimal depth varies across states. This is presented as a general mathematical property of the surrogate cost model rather than a quantity fitted or defined inside the paper. Empirical claims compare the learned policy against external fixed-depth baselines and other VLMs without reducing reported gains to any self-defined parameter or self-citation chain. No self-definitional loops, fitted inputs renamed as predictions, or ansatzes imported via author-overlapping citations are exhibited in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a standard surrogate cost model for replanning versus execution error is sufficient to prove dominance, plus the empirical claim that the 7B model can be trained to predict useful depths. No new physical entities are postulated.

axioms (1)
  • domain assumption The commitment-depth surrogate (replanning cost plus compounding execution error) is a faithful proxy for true task performance.
    Invoked in the theoretical analysis section of the abstract to establish strict dominance.
invented entities (1)
  • state-conditioned commitment depth: no independent evidence
    purpose: Learnable variable inside the policy that decides how many primitive actions to execute open-loop before the next observation.
    Introduced as the core modeling choice; no independent falsifiable prediction (e.g., a specific depth distribution) is supplied outside the training loop.

pith-pipeline@v0.9.0 · 5537 in / 1491 out tokens · 49801 ms · 2026-05-12T04:53:13.187383+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 16 internal anchors

  1. [1]

    Robust physics-based manipulation by interleaving open and closed-loop execution.arXiv preprint arXiv:2105.08325, 2021

    Wisdom C Agboh and Mehmet R Dogar. Robust physics-based manipulation by interleaving open and closed-loop execution.arXiv preprint arXiv:2105.08325, 2021

  2. [2]

    Hindsight experience replay.Advances in neural information processing systems, 30, 2017

    Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay.Advances in neural information processing systems, 30, 2017

  3. [3]

    PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation,

    Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michae...

  4. [4]

    The claude 3 model family: Opus, sonnet, haiku, 2025

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2025. URL https://api. semanticscholar.org/CorpusID:268232499

  5. [5]

    The option-critic architecture

    Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. InProceedings of the AAAI conference on artificial intelligence, volume 31, 2017

  6. [6]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  7. [7]

    Qwen2.5-vl technical report,

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,

  8. [8]

    URL https://arxiv.org/abs/2502.13923

  9. [9]

    Adaptive rollout length for model- based rl using model-free deep rl.arXiv preprint arXiv:2206.02380, 2022

    Abhinav Bhatia, Philip S Thomas, and Shlomo Zilberstein. Adaptive rollout length for model- based rl using model-free deep rl.arXiv preprint arXiv:2206.02380, 2022

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  11. [11]

    On the Measure of Intelligence

    François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

  12. [12]

    Arc prize 2024: Technical report

    Francois Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2024: Technical report.arXiv preprint arXiv:2412.04604, 2024

  13. [13]

    Arc prize 2025: Technical report, 2026

    François Chollet, Mike Knoop, Gregory Kamradt, and Bryan Landers. Arc prize 2025: Technical report, 2026. URLhttps://arxiv.org/abs/2601.10904

  14. [14]

    Hierarchical reinforcement learning with the maxq value function decomposition.Journal of artificial intelligence research, 13:227–303, 2000

    Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition.Journal of artificial intelligence research, 13:227–303, 2000. 10

  15. [15]

    Learning and executing generalized robot plans.Artificial intelligence, 3:251–288, 1972

    Richard E Fikes, Peter E Hart, and Nils J Nilsson. Learning and executing generalized robot plans.Artificial intelligence, 3:251–288, 1972

  16. [16]

    ARC-AGI-3: A New Challenge for Frontier Agentic Intelligence

    ARC Foundation. Arc-agi-3: A new challenge for frontier agentic intelligence.arXiv preprint arXiv:2603.24621, 2026

  17. [17]

    Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, and Yu Cheng. Thinkmorph: Emergent properties in multimodal interleaved chain-of-thought reasoning, 2026. URLhttps://arxiv.org/abs/2510.27492

  18. [18]

    Human-like planning for reaching in cluttered environments

    Mohamed Hasan, Matthew Warburton, Wisdom C Agboh, Mehmet R Dogar, Matteo Leonetti, He Wang, Faisal Mushtaq, Mark Mon-Williams, and Anthony G Cohn. Human-like planning for reaching in cluttered environments. In2020 IEEE International Conference on Robotics and Automation (ICRA), pages 7784–7790. IEEE, 2020

  19. [19]

    Hierarchical solution of markov decision processes using macro-actions.arXiv preprint arXiv:1301.7381, 2013

    Milos Hauskrecht, Nicolas Meuleau, Leslie Pack Kaelbling, Thomas L Dean, and Craig Boutilier. Hierarchical solution of markov decision processes using macro-actions.arXiv preprint arXiv:1301.7381, 2013

  20. [20]

    When to replan? an adaptive replanning strategy for autonomous navigation using deep reinforcement learning

    Kohei Honda, Ryo Yonetani, Mai Nishimura, and Tadashi Kozuno. When to replan? an adaptive replanning strategy for autonomous navigation using deep reinforcement learning. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6650–6656. IEEE, 2024

  21. [21]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  22. [22]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403

  23. [23]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  24. [24]

    Mixture of horizons in action chunking.arXiv preprint arXiv:2511.19433, 2025

    Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, and Mingyu Ding. Mixture of horizons in action chunking, 2025. URL https://arxiv.org/abs/2511.19433

  25. [25]

    Randomized preprocessing of configuration for fast path planning

    Lydia Kavraki and J-C Latombe. Randomized preprocessing of configuration for fast path planning. In Proceedings of the 1994 IEEE International Conference on Robotics and Automation, pages 2138–2145. IEEE, 1994

  26. [26]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  27. [27]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

  28. [28]

    Open-loop pomdp simplification and safe skipping of replanning with formal performance guarantees.arXiv preprint arXiv:2604.01352, 2026

    Da Kong and Vadim Indelman. Open-loop pomdp simplification and safe skipping of replanning with formal performance guarantees.arXiv preprint arXiv:2604.01352, 2026

  29. [29]

    Macro-operators: A weak method for learning.Artificial intelligence, 26(1): 35–77, 1985

    Richard E Korf. Macro-operators: A weak method for learning.Artificial intelligence, 26(1): 35–77, 1985. 11

  30. [30]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  31. [31]

    Rapidly-exploring random trees: Progress and prospects

    Steven M LaValle and James J Kuffner. Rapidly-exploring random trees: Progress and prospects. Algorithmic and Computational Robotics, pages 303–307, 2001

  32. [32]

    Open loop execution of tree-search algorithms, extended version.arXiv preprint arXiv:1805.01367, 2018

    Erwan Lecarpentier, Guillaume Infantes, Charles Lesire, and Emmanuel Rachelson. Open loop execution of tree-search algorithms, extended version.arXiv preprint arXiv:1805.01367, 2018

  33. [33]

    Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    Yuanchang Liang, Xiaobo Wang, Kai Wang, Shuo Wang, Xiaojiang Peng, Haoyu Chen, David Kim Huat Chua, and Prahlad Vadakkepat. Adaptive action chunking at inference-time for vision-language-action models.arXiv preprint arXiv:2604.04161, 2026

  34. [34]

    Visual agentic reinforcement fine-tuning

    Ziyu Liu, Yuhang Zang, Yushan Zou, Zijian Liang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual agentic reinforcement fine-tuning, 2025. URL https://arxiv.org/abs/2505.14246

  35. [35]

    Constrained model predictive control: Stability and optimality.Automatica, 36(6):789–814, 2000

    David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. Constrained model predictive control: Stability and optimality.Automatica, 36(6):789–814, 2000

  36. [36]

    Thinking with images, 2025

    OpenAI. Thinking with images, 2025. URL https://openai.com/index/ thinking-with-images/

  37. [37]

    Efficient reductions for imitation learning

    Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. InProceedings of the thirteenth international conference on artificial intelligence and statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010

  38. [38]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  39. [39]

    gym-sokoban

    Max-Philipp B. Schrader. gym-sokoban. https://github.com/mpSchrader/gym-sokoban, 2018

  40. [40]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  41. [41]

    Learning to repeat: Fine grained action repetition for deep reinforcement learning.arXiv preprint arXiv:1702.06054, 2017

    Sahil Sharma, Aravind Srinivas, and Balaraman Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning.arXiv preprint arXiv:1702.06054, 2017

  42. [42]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

  43. [43]

    Dynamic frame skip deep q network

    Aravind Srinivas, Sahil Sharma, and Balaraman Ravindran. Dynamic frame skip deep q network. arXiv preprint arXiv:1605.05365, 2016

  44. [44]

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, Linjie Li, Yu Cheng, Heng Ji, Junxian He, and Yi R. Fung. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers,

  45. [45]

    URL https://arxiv.org/abs/2506.23918

  46. [46]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2): 181–211, 1999

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2): 181–211, 1999

  47. [47]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Ba...

  48. [48]

    Feudal networks for hierarchical reinforcement learning

    Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. InInternational conference on machine learning, pages 3540–3549. PMLR, 2017

  49. [49]

    TRL: Transformers Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

  50. [50]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning, 2025. URL https://arxiv.org/abs/2505.15966

  51. [51]

    Learning to combat compounding-error in model-based reinforcement learning.arXiv preprint arXiv:1912.11206, 2019

    Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, and Martin Müller. Learning to combat compounding-error in model-based reinforcement learning.arXiv preprint arXiv:1912.11206, 2019

  52. [52]

    Visual planning: Let’s think only with images, 2026

    Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić. Visual planning: Let’s think only with images, 2026. URL https://arxiv.org/abs/2505.11409

  53. [53]

    Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  54. [54]

    Think fast and slow: Step-level cognitive depth adaptation for llm agents.arXiv preprint arXiv:2602.12662, 2026

    Ruihan Yang, Fanghua Ye, Xiang We, Ruoqing Zhao, Kang Luo, Xinbo Xu, Bo Zhao, Ruotian Ma, Shanyi Wang, Zhaopeng Tu, et al. Think fast and slow: Step-level cognitive depth adaptation for llm agents.arXiv preprint arXiv:2602.12662, 2026

  55. [55]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  56. [56]

    Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs

    Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, and Furu Wei. Latent sketchpad: Sketching visual thoughts to elicit multimodal reasoning in mllms, 2025. URL https://arxiv.org/abs/2510.24514

  57. [57]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  58. [58]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2505.14362

  59. [59]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...

  60. [60]

    For Lemma 1, the existence of an interior optimum requires only that q(h)/h → 0 as h → 0+ and q(h) → 1 as h grows; the precise functional form determines the location of h⋆ but not its existence

  61. [61]

    planning-dominant

    For Prop. 1, strict dominance requires only that the per-state optimum h⋆(s) be a non-constant function of state; this is a structural property of the state-dependent error landscape, not of any specific functional form. Power-law is the cleanest stylization in which to make these arguments precise. Why we do not estimate (c, α) empirically. The power-law...
