arxiv: 2605.19976 · v1 · pith:TOKXLR7Enew · submitted 2026-05-19 · 💻 cs.CV

RECIPE: Procedural Planning via Grounding in Instructional Video

Pith reviewed 2026-05-20 06:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords procedural planning · instructional video · reinforcement learning · grounding · visual planning · GRPO · zero-shot

The pith

RECIPE trains procedural planners by using how well their steps ground in video transcripts as a reward signal instead of clean labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the asymmetry between difficult label extraction and easy verification in noisy instructional videos can be exploited to improve visual planning models. By measuring grounding quality through text embeddings against ASR narrations and using it as a reward for GRPO, RECIPE converts large-scale but messy video data into a scalable verifier for generated step sequences. This yields gains over base models from 0.5B to 7B parameters on seven benchmarks, with 7-8 point in-domain and up to 16 point zero-shot improvements, while beating supervised fine-tuning on both annotated and pseudo-labeled data. The method works for both Socratic text-history and direct video inputs and maintains plan diversity where supervised approaches do not.

Core claim

RECIPE uses grounding quality as the reward for GRPO, turning noisy instructional-video corpora into verifiers rather than label sources. Relative to base checkpoints at every scale, this improves macro-accuracy by 7 to 8 points in-domain and by up to 16 points zero-shot. It also outperforms supervised fine-tuning and remains robust without human annotations.

What carries the argument

Grounding quality reward for GRPO, where quality is computed from precomputed text embeddings matching generated plans to ASR transcripts, acting as a scalable verifier for procedural step sequences.
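
To make the idea of a grounding-based verifier concrete, here is a minimal sketch. It is not the paper's scoring function: the paper describes a two-stage text alignment whose details are not given on this page. The sketch assumes a simpler stand-in, an order-preserving matching between plan-step embeddings and transcript-segment embeddings scored by cosine similarity. The names `cosine` and `grounding_score` and the toy one-hot vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def grounding_score(step_embs, segment_embs):
    """Mean similarity of the best order-preserving step-to-segment matching.

    Hypothetical stand-in for a grounding reward: every step is matched to a
    distinct transcript segment, and matches must respect temporal order.
    """
    n, m = len(step_embs), len(segment_embs)
    if n == 0 or m < n:
        return 0.0
    neg_inf = float("-inf")
    # best[i][j]: best total similarity when the first i steps are matched
    # somewhere within the first j segments.
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        best[0][j] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            skip = best[i][j - 1]  # leave segment j unmatched
            match = best[i - 1][j - 1] + cosine(step_embs[i - 1], segment_embs[j - 1])
            best[i][j] = max(skip, match)
    return best[n][m] / n

# Toy one-hot "embeddings" for three transcript segments, in narration order.
segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
in_order = grounding_score([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], segments)
reversed_plan = grounding_score([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]], segments)
print(in_order, reversed_plan)
```

With these toy vectors the in-order plan scores 1.0 and the reversed plan scores 0.0, which illustrates why an order-aware score can act as a verifier even when no clean step labels exist.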

If this is right

  • RECIPE-RL outperforms supervised fine-tuning on both annotated and pseudo-labeled plans.
  • Gains hold at every model scale from 0.5B to 7B and across all seven evaluated benchmarks.
  • Used as the proposal stage of a propose-assess-search planner, RECIPE-RL improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance.
  • Plan diversity is preserved on COIN, whereas supervised fine-tuning collapses it.
  • The framework applies uniformly to both textual history and direct video input configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Verification via embedding similarity may substitute for annotation in other sequential generation domains with abundant but noisy data.
  • Combining this reward with stronger vision-language models could further reduce reliance on any form of supervision.
  • The approach raises the question of whether similar verification-generation asymmetries exist in tasks like code synthesis or multi-step reasoning.

Load-bearing premise

Grounding quality measured via precomputed text embeddings against ASR transcripts accurately proxies the correctness and usefulness of a generated procedural plan.

What would settle it

Human evaluation of plan quality on a new video set where RECIPE-RL plans receive lower ratings than base-model plans despite higher embedding-based grounding scores.
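
One way to quantify the disagreement this test looks for is a rank correlation between embedding-based grounding scores and human ratings of the same plans. The sketch below implements Spearman's rank correlation with the standard library only. The function names and the toy inputs are hypothetical, and no such analysis is reported on this page.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / math.sqrt(var_x * var_y)

# Toy values only: grounding scores vs. human ratings for three plans.
agree = spearman([0.1, 0.4, 0.9], [1, 3, 5])
disagree = spearman([0.1, 0.4, 0.9], [5, 3, 1])
print(agree, disagree)
```

A correlation near zero or below on real data would be the kind of evidence that the grounding proxy and human judgment have come apart.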

Figures

Figures reproduced from arXiv: 2605.19976 by Antonino Furnari, Lorenzo Torresani, Luigi Seminara.

Figure 1: Overview of RECIPE. (A) The planner predicts a sequence of natural-language steps given a partial video and a goal. (B) Two input configurations of the same planner: a Socratic pipeline that first turns the video into a textual history, and a Video pipeline that reads video tokens directly. (C) Generated plans are scored against a noisy instructional-video corpus (HowTo100M) by a two-stage text alignment, …

Figure 2: Main results: macro accuracy (%). RECIPE-RL improves over the base checkpoint at every scale and on both splits. All Qwen models use the Socratic configuration and annotated supervision. Per-dataset breakdown in …

Figure 3: Same corpus, two roles. Using HT100M as a pseudo-label source for SFT degrades the base checkpoint at every scale; using the same corpus as a verifier for RL improves it substantially. Both RECIPE variants (RL only and SFT→RL) outperform both SFT baselines.
Body text captured with this caption: … penalty rules cap scores when the prediction is verbose or empty (full rubric, prompt, and penalty rules in Appendix B). We report macro accuracy (%): …

Figure 4: Robustness to weak supervision (Qwen2.5-3B). RECIPE-RL is nearly unaffected by the supervision mix; SFT collapses when annotations are scarce, with zero-shot accuracy falling below the base at ≥ 75% weak supervision.
Table captured with this caption (In = in-domain, ZS = zero-shot):

    Model        Base            RL, 0%          RL, 100%
                 In      ZS      In      ZS      In      ZS
    Qwen-0.5B    1.5     0.7     9.1     7.3     11.9    9.1
    Qwen-3B      31.4    23.1    39.2    39.4    38.6    36.0
    Qwen-7B      39.2    32.5    46.6    46.1    46.3    46.3

Figure 5: RECIPE-RL is both more diverse and more accurate than SFT. Procedural diversity (x-axis) vs. Score@1 macro accuracy (y-axis) on COIN; RECIPE-RL is the only configuration in the upper-right region.
Body text captured with this caption: … CrossTask action via embedding-based nearest neighbor over the 105-action taxonomy, following the same remapping protocol used by the LLM-based VidAssist baselines. We integrate RECIPE-RL into VidAssist’s propos…

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.
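
The headline gains can be cross-checked against the per-scale macro-accuracy values listed with Figure 4 on this page. The short calculation below subtracts the base scores from the RL scores. It assumes that the "RL, 0%" column denotes RECIPE-RL trained with 0% weak supervision, i.e. fully annotated plans; that reading is inferred from the Figure 4 caption, not stated explicitly.

```python
# Macro accuracy (%) as (in-domain, zero-shot), copied from the values
# listed with Figure 4. "RL, 0%" is assumed to mean 0% weak supervision.
base = {"Qwen-0.5B": (1.5, 0.7), "Qwen-3B": (31.4, 23.1), "Qwen-7B": (39.2, 32.5)}
rl_0 = {"Qwen-0.5B": (9.1, 7.3), "Qwen-3B": (39.2, 39.4), "Qwen-7B": (46.6, 46.1)}

# Gain in percentage points, rounded to one decimal to hide float noise.
gains = {
    model: (
        round(rl_0[model][0] - base[model][0], 1),
        round(rl_0[model][1] - base[model][1], 1),
    )
    for model in base
}

for model, (in_gain, zs_gain) in gains.items():
    print(f"{model}: in-domain +{in_gain}, zero-shot +{zs_gain}")
```

Under that assumption the in-domain gains are 7.6, 7.8, and 7.4 points, and the largest zero-shot gain is 16.3 points (Qwen-3B), consistent with the stated "+7 to +8" and "up to +16".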

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RECIPE, a reinforcement learning approach for visual procedural planning that exploits an asymmetry between hard label extraction and cheap verification: it uses precomputed text-embedding cosine similarity between generated plan steps and ASR transcripts from large instructional video corpora as a reward signal for GRPO. The method is applied uniformly to Socratic (textual history) and Video (direct token) input configurations in both annotated and weakly supervised regimes. It reports consistent macro-accuracy gains of +7–8 points in-domain and up to +16 points zero-shot over base checkpoints at 0.5B/3B/7B scales, and it outperforms supervised fine-tuning on both annotated and pseudo-labeled data. Used as the proposal stage of a propose-assess-search planner, it improves over the strongest zero-shot baseline on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.

Significance. If the results hold, the work offers a scalable route to procedural planning that turns noisy video corpora into verifiers rather than label sources, addressing the annotation bottleneck that has limited prior datasets. Explicit strengths include the uniform framework across input types, robustness without human annotations, and the downstream improvement on Visual Planning for Assistance and COIN benchmarks. The approach is falsifiable via the reported LLM-as-judge protocol on six procedural criteria and could generalize if the grounding proxy proves reliable.

major comments (3)
  1. [§5.2, Table 2] §5.2 and Table 2: the central claims rest on macro-accuracy gains of +7–16 points measured by an LLM-as-judge protocol, yet no error bars, statistical significance tests, judge calibration details, or inter-annotator agreement are reported; this transparency gap is load-bearing because the gains could be sensitive to judge variability.
  2. [§3.2] §3.2: the grounding reward is defined as cosine similarity between generated steps and precomputed ASR embeddings; the manuscript provides no human validation, ablation on embedding model choice, or correlation analysis with the six downstream procedural criteria, leaving open the possibility that optimization improves lexical overlap rather than plan correctness or usefulness.
  3. [§4.1] §4.1: while the paper shows RECIPE-RL outperforms SFT on pseudo-labeled plans, there is no ablation isolating the contribution of the GRPO reward versus the base checkpoint or the proposal stage, making it difficult to attribute the zero-shot gains specifically to the grounding signal.
minor comments (2)
  1. [Figure 1] Figure 1: the diagram of the Socratic vs. Video input pipelines would benefit from explicit labeling of the frozen VLM component and the embedding computation step for clarity.
  2. [§2.3] §2.3: the definition of the six procedural criteria used by the LLM judge could be expanded with one-sentence operationalizations to aid reproducibility.
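
Major comment 1 asks for significance tests on the macro-accuracy differences. One conventional option is a paired t-test over matched units, for example the same benchmark or the same judge seed scored under two systems. The sketch below computes the paired t statistic with the standard library. The function name `paired_t` and the toy numbers are hypothetical placeholders, not results from the paper.

```python
import math
import statistics

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples of length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    standard_error = sd / math.sqrt(len(diffs))
    return mean_diff / standard_error, len(diffs) - 1

# Toy placeholder scores for four matched units (not from the paper).
system_scores = [3.0, 5.0, 4.0, 6.0]
baseline_scores = [1.0, 2.0, 3.0, 3.0]
t_stat, dof = paired_t(system_scores, baseline_scores)
print(round(t_stat, 2), dof)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom. With only a handful of benchmarks or seeds, a bootstrap over evaluation examples may be a more informative complement.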

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

  1. Referee: [§5.2, Table 2] §5.2 and Table 2: the central claims rest on macro-accuracy gains of +7–16 points measured by an LLM-as-judge protocol, yet no error bars, statistical significance tests, judge calibration details, or inter-annotator agreement are reported; this transparency gap is load-bearing because the gains could be sensitive to judge variability.

    Authors: We agree that greater transparency around the LLM-as-judge protocol is warranted. In the revised manuscript we will report error bars obtained from multiple independent judge runs with varied seeds, include statistical significance tests (paired t-tests) on the reported macro-accuracy differences, expand the description of judge calibration and prompt details, and provide inter-annotator agreement figures on a held-out subset of plans evaluated by both the LLM judge and human annotators. These additions will directly mitigate concerns about judge variability. revision: yes

  2. Referee: [§3.2] §3.2: the grounding reward is defined as cosine similarity between generated steps and precomputed ASR embeddings; the manuscript provides no human validation, ablation on embedding model choice, or correlation analysis with the six downstream procedural criteria, leaving open the possibility that optimization improves lexical overlap rather than plan correctness or usefulness.

    Authors: We acknowledge that explicit validation of the reward proxy would strengthen the claims. A full-scale human validation study or exhaustive embedding-model ablation was not included in the original submission owing to annotation and compute costs. In the revision we will add a correlation analysis between per-step grounding reward values and the six procedural-criterion scores to show alignment beyond lexical overlap. We will also note the specific embedding model employed and the robustness of results across model scales as indirect support for its suitability. revision: partial

  3. Referee: [§4.1] §4.1: while the paper shows RECIPE-RL outperforms SFT on pseudo-labeled plans, there is no ablation isolating the contribution of the GRPO reward versus the base checkpoint or the proposal stage, making it difficult to attribute the zero-shot gains specifically to the grounding signal.

    Authors: The existing base-model and SFT comparisons already separate the effect of RL training from supervised fine-tuning. To isolate the grounding reward more cleanly we will add, in the revised manuscript, an ablation that applies GRPO with a non-grounding (constant) reward while keeping the proposal stage fixed, together with training curves of the grounding reward itself. This will allow readers to attribute zero-shot gains more directly to the grounding signal. revision: yes
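
A note on the proposed constant-reward ablation in response 3: in the usual GRPO formulation, each sampled output's advantage is its reward standardized within its sampling group. The sketch below shows that computation under stated assumptions (population standard deviation and a small `eps` in the denominator; implementations differ on both, and the paper's exact variant is not described on this page). The function name is hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled plans (GRPO-style).

    Assumes population standard deviation and an additive eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards that differ across the group yield a learning signal.
varied = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
# A constant reward gives every sample the same score, so advantages vanish.
constant = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
print(varied)
print(constant)
```

Because a constant reward produces zero advantages for every sample, such a run would be driven only by whatever regularization term the objective carries. It controls for the training procedure itself, while a shuffled or mismatched-transcript reward would be a natural complementary control for the content of the grounding signal.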

Circularity Check

0 steps flagged

No significant circularity: reward derived from independent external embeddings and transcripts

The paper's central mechanism defines a reward for GRPO based on precomputed text-embedding similarity between generated plans and ASR transcripts from instructional videos. This reward signal is constructed from external, pre-existing data sources (ASR narrations and frozen embedding models) that do not depend on the policy parameters being optimized or on any values fitted to the target benchmarks. No equation or step in the provided description reduces the reported gains (+7 to +16 points) to a tautology, whether by construction, by self-definition, or through a load-bearing self-citation. The derivation remains self-contained against external benchmarks, and the improvement is presented as an empirical outcome of RL optimization rather than a renaming or re-derivation of the input reward itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that ASR grounding quality is a faithful proxy for plan quality; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)
  • domain assumption: Grounding quality measured against ASR transcripts via text embeddings is a reliable and scalable proxy for the validity of generated procedural plans.
    This asymmetry is presented as the key enabler that turns the noisy corpus into a verifier.

pith-pipeline@v0.9.0 · 5866 in / 1387 out tokens · 50321 ms · 2026-05-20T06:21:20.186886+00:00 · methodology
