RECIPE: Procedural Planning via Grounding in Instructional Video

Antonino Furnari; Lorenzo Torresani; Luigi Seminara

arxiv: 2605.19976 · v1 · pith:TOKXLR7Enew · submitted 2026-05-19 · 💻 cs.CV

Luigi Seminara, Antonino Furnari, Lorenzo Torresani

Pith reviewed 2026-05-20 06:21 UTC · model grok-4.3

keywords: procedural planning · instructional video · reinforcement learning · grounding · visual planning · GRPO · zero-shot

The pith

RECIPE trains procedural planners by using how well their steps ground in video transcripts as a reward signal instead of clean labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the asymmetry between difficult label extraction and easy verification in noisy instructional videos can be exploited to improve visual planning models. By measuring grounding quality through text embeddings against ASR narrations and using it as a reward for GRPO, RECIPE converts large-scale but messy video data into a scalable verifier for generated step sequences. This yields gains over base models from 0.5B to 7B parameters on seven benchmarks, with 7-8 point in-domain and up to 16 point zero-shot improvements, while beating supervised fine-tuning on both annotated and pseudo-labeled data. The method works for both Socratic text-history and direct video inputs and maintains plan diversity where supervised approaches do not.

Core claim

RECIPE uses grounding quality as the reward for GRPO, turning noisy instructional video corpora into verifiers rather than label sources. Relative to base checkpoints at all scales, this improves macro-accuracy by 7 to 8 points in-domain and by up to 16 points zero-shot. It also outperforms supervised fine-tuning and remains robust without human annotations.

What carries the argument

Grounding quality reward for GRPO, where quality is computed from precomputed text embeddings matching generated plans to ASR transcripts, acting as a scalable verifier for procedural step sequences.
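
The idea of scoring a plan by how well its steps ground in a transcript can be illustrated with a small sketch. This is not the paper's reward: the page only says that plans are scored by a two-stage text alignment over precomputed embeddings, and the details are not given here. The sketch below assumes a simpler stand-in, an order-preserving alignment over cosine similarities, and the names `cosine` and `grounding_score` are hypothetical. Toy one-hot vectors stand in for real text embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)

def grounding_score(step_embs, segment_embs):
    """Average similarity of the best order-preserving step-to-segment alignment.

    Hypothetical stand-in for the paper's alignment: every step is matched to
    one transcript segment, and matched segment indices must be non-decreasing.
    """
    n, m = len(step_embs), len(segment_embs)
    if n == 0 or m == 0:
        return 0.0
    neg_inf = float("-inf")
    # best[i][j]: best total similarity for the first i steps using the first j segments.
    best = [[0.0] * (m + 1)] + [[neg_inf] * (m + 1) for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either step i is matched to segment j, or it was matched earlier.
            match = best[i - 1][j] + cosine(step_embs[i - 1], segment_embs[j - 1])
            best[i][j] = max(match, best[i][j - 1])
    return best[n][m] / n

# Toy one-hot vectors standing in for precomputed text embeddings.
segments = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
in_order = grounding_score(segments, segments)
reversed_plan = grounding_score(segments[::-1], segments)
```

In this toy case the in-order plan scores 1.0 and the reversed plan scores 1/3, because the ordering constraint lets only one of the three reversed steps match its segment. A purely bag-of-steps similarity would score both plans identically, which is why some temporal constraint is needed for a reward that claims to measure temporal grounding.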

If this is right

RECIPE-RL outperforms supervised fine-tuning on both annotated and pseudo-labeled plans.
Gains hold at every model scale from 0.5B to 7B and across all seven evaluated benchmarks.
Used as the proposal stage of a propose-assess-search planner, RECIPE-RL improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance.
Plan diversity is preserved on COIN, whereas supervised fine-tuning collapses it.
The framework applies uniformly to both textual history and direct video input configurations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Verification via embedding similarity may substitute for annotation in other sequential generation domains with abundant but noisy data.
Combining this reward with stronger vision-language models could further reduce reliance on any form of supervision.
The approach raises the question of whether similar verification-generation asymmetries exist in tasks like code synthesis or multi-step reasoning.

Load-bearing premise

Grounding quality measured via precomputed text embeddings against ASR transcripts accurately proxies the correctness and usefulness of a generated procedural plan.

What would settle it

Human evaluation of plan quality on a new video set where RECIPE-RL plans receive lower ratings than base-model plans despite higher embedding-based grounding scores.
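
One way to quantify agreement or disagreement between the embedding-based proxy and human judgment is a rank correlation between grounding scores and human ratings on the same plans. The sketch below is illustrative only: the helper names are hypothetical, the numbers are invented toy values rather than data from the paper, and Spearman's coefficient is one reasonable choice among several.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented toy values (NOT from the paper): the proxy and the raters disagree completely.
grounding_scores = [0.2, 0.4, 0.6, 0.8]
human_ratings = [4, 3, 2, 1]
rho = spearman(grounding_scores, human_ratings)
```

A strongly negative coefficient on real data, as in this toy case where `rho` is -1, would be the kind of evidence described above; a clearly positive one would support the proxy.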

Figures

Figures reproduced from arXiv: 2605.19976 by Antonino Furnari, Lorenzo Torresani, Luigi Seminara.

**Figure 1.** Overview of RECIPE. (A) The planner predicts a sequence of natural-language steps given a partial video and a goal. (B) Two input configurations of the same planner: a Socratic pipeline that first turns the video into a textual history, and a Video pipeline that reads video tokens directly. (C) Generated plans are scored against a noisy instructional-video corpus (HowTo100M) by a two-stage text alignment, …

**Figure 2.** Main results: macro accuracy (%). RECIPE-RL improves over the base checkpoint at every scale and on both splits. All Qwen models use the Socratic configuration and annotated supervision. Per-dataset breakdown in …

**Figure 3.** Same corpus, two roles. Using HT100M as a pseudo-label source for SFT degrades the base checkpoint at every scale; using the same corpus as a verifier for RL improves it substantially. Both RECIPE variants (RL only and SFT→RL) outperform both SFT baselines.

… penalty rules cap scores when the prediction is verbose or empty (full rubric, prompt, and penalty rules in Appendix B). We report macro accuracy (%): …

**Figure 4.** Robustness to weak supervision (Qwen2.5-3B). RECIPE-RL is nearly unaffected by the supervision mix; SFT collapses when annotations are scarce, with zero-shot accuracy falling below the base at ≥ 75% weak supervision.

| Model | Base (In) | Base (ZS) | RL, 0% (In) | RL, 0% (ZS) | RL, 100% (In) | RL, 100% (ZS) |
|---|---|---|---|---|---|---|
| Qwen-0.5B | 1.5 | 0.7 | 9.1 | 7.3 | 11.9 | 9.1 |
| Qwen-3B | 31.4 | 23.1 | 39.2 | 39.4 | 38.6 | 36.0 |
| Qwen-7B | 39.2 | 32.5 | 46.6 | 46.1 | 46.3 | 46.3 |

**Figure 5.** RECIPE-RL is both more diverse and more accurate than SFT. Procedural diversity (x-axis) vs. Score@1 macro accuracy (y-axis) on COIN; RECIPE-RL is the only configuration in the upper-right region.

… CrossTask action via embedding-based nearest neighbor over the 105-action taxonomy, following the same remapping protocol used by the LLM-based VidAssist baselines. We integrate RECIPE-RL into VidAssist’s propos…

Original abstract

Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RECIPE gets consistent planning gains by treating ASR embedding similarity as an RL reward on instructional video instead of pseudo-labeling, though the proxy's reliability is the main open question.

The core takeaway is that RECIPE sidesteps the usual annotation bottleneck in procedural planning: it uses precomputed text-embedding similarity between generated steps and ASR transcripts as the reward for GRPO. This lets the method scale to large, noisy video corpora without clean labels. The abstract reports consistent improvements over the base model at the 0.5B, 3B, and 7B scales on every benchmark, and better results than supervised fine-tuning on either annotated or pseudo-labeled data.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RECIPE, a reinforcement learning approach for visual procedural planning that exploits an asymmetry between hard label extraction and cheap verification: it uses precomputed text-embedding cosine similarity between generated plan steps and ASR transcripts from large instructional video corpora as a reward signal for GRPO. The method is applied uniformly to Socratic (textual history) and Video (direct token) input configurations in both annotated and weakly supervised regimes. It reports consistent macro-accuracy gains of +7–8 points in-domain and up to +16 points zero-shot over base checkpoints at 0.5B/3B/7B scales, outperforming supervised fine-tuning on both clean and pseudo-labeled data. Used as the proposal stage of a propose-assess-search planner, it improves over the strongest zero-shot baseline on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.

Significance. If the results hold, the work offers a scalable route to procedural planning that turns noisy video corpora into verifiers rather than label sources, addressing the annotation bottleneck that has limited prior datasets. Explicit strengths include the uniform framework across input types, robustness without human annotations, and the downstream improvement on Visual Planning for Assistance and COIN benchmarks. The approach is falsifiable via the reported LLM-as-judge protocol on six procedural criteria and could generalize if the grounding proxy proves reliable.

major comments (3)

[§5.2, Table 2] The central claims rest on macro-accuracy gains of +7–16 points measured by an LLM-as-judge protocol, yet no error bars, statistical significance tests, judge calibration details, or inter-annotator agreement are reported; this transparency gap is load-bearing because the gains could be sensitive to judge variability.
[§3.2] The grounding reward is defined as cosine similarity between generated steps and precomputed ASR embeddings; the manuscript provides no human validation, ablation on embedding model choice, or correlation analysis with the six downstream procedural criteria, leaving open the possibility that optimization improves lexical overlap rather than plan correctness or usefulness.
[§4.1] While the paper shows RECIPE-RL outperforms SFT on pseudo-labeled plans, there is no ablation isolating the contribution of the GRPO reward versus the base checkpoint or the proposal stage, making it difficult to attribute the zero-shot gains specifically to the grounding signal.
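
The significance testing requested in the first comment could take several forms. As a minimal sketch, the block below runs an exact paired sign-flip permutation test, which makes no normality assumption; it is an alternative to a paired t-test, not the authors' procedure. The function name is hypothetical, the pairing unit (one pair per benchmark) is an assumption, and the scores are invented toy values, not results from the paper.

```python
from itertools import product

def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely to be +d or -d,
    # so every sign assignment is enumerated and compared with the observed total.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            hits += 1
    return hits / total

# Invented per-benchmark scores (NOT from the paper), one pair per benchmark.
base_scores = [30, 25, 40, 35, 20, 45, 50]
rl_scores = [35, 32, 48, 41, 29, 49, 60]
p_value = paired_sign_flip_pvalue(rl_scores, base_scores)
```

With seven paired benchmarks, the smallest two-sided p-value this test can produce is 2/2^7 ≈ 0.016, which is what the toy data yields because every difference has the same sign. Finer resolution would require pairing at the level of individual examples or repeated judge runs.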

minor comments (2)

[Figure 1] The diagram of the Socratic vs. Video input pipelines would benefit from explicit labeling of the frozen VLM component and the embedding computation step for clarity.
[§2.3] The definition of the six procedural criteria used by the LLM judge could be expanded with one-sentence operationalizations to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: [§5.2, Table 2] The central claims rest on macro-accuracy gains of +7–16 points measured by an LLM-as-judge protocol, yet no error bars, statistical significance tests, judge calibration details, or inter-annotator agreement are reported; this transparency gap is load-bearing because the gains could be sensitive to judge variability.

Authors: We agree that greater transparency around the LLM-as-judge protocol is warranted. In the revised manuscript we will report error bars obtained from multiple independent judge runs with varied seeds, include statistical significance tests (paired t-tests) on the reported macro-accuracy differences, expand the description of judge calibration and prompt details, and provide inter-annotator agreement figures on a held-out subset of plans evaluated by both the LLM judge and human annotators. These additions will directly mitigate concerns about judge variability. Revision: yes.

Referee: [§3.2] The grounding reward is defined as cosine similarity between generated steps and precomputed ASR embeddings; the manuscript provides no human validation, ablation on embedding model choice, or correlation analysis with the six downstream procedural criteria, leaving open the possibility that optimization improves lexical overlap rather than plan correctness or usefulness.

Authors: We acknowledge that explicit validation of the reward proxy would strengthen the claims. A full-scale human validation study or exhaustive embedding-model ablation was not included in the original submission owing to annotation and compute costs. In the revision we will add a correlation analysis between per-step grounding reward values and the six procedural-criterion scores to show alignment beyond lexical overlap. We will also note the specific embedding model employed and the robustness of results across model scales as indirect support for its suitability. Revision: partial.

Referee: [§4.1] While the paper shows RECIPE-RL outperforms SFT on pseudo-labeled plans, there is no ablation isolating the contribution of the GRPO reward versus the base checkpoint or the proposal stage, making it difficult to attribute the zero-shot gains specifically to the grounding signal.

Authors: The existing base-model and SFT comparisons already separate the effect of RL training from supervised fine-tuning. To isolate the grounding reward more cleanly we will add, in the revised manuscript, an ablation that applies GRPO with a non-grounding (constant) reward while keeping the proposal stage fixed, together with training curves of the grounding reward itself. This will allow readers to attribute zero-shot gains more directly to the grounding signal. Revision: yes.
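
The constant-reward ablation in the last response has a property worth checking before it is run. GRPO, as commonly described, scores each sampled output relative to the other samples for the same prompt, so a reward that is identical across a group leaves no relative signal. The sketch below assumes the usual mean and standard-deviation normalization with a small `eps` in the denominator; the paper's exact variant is not stated on this page, and the function name and reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumes population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Invented rewards for one group of sampled plans (NOT values from the paper).
varied = group_relative_advantages([0.1, 0.4, 0.7])

# A constant reward: every sample in the group receives the same value.
constant = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

Under this normalization the constant-reward group has all-zero advantages, so the reward term would contribute no policy gradient and the policy would change only through other terms such as a KL penalty. A shuffled or otherwise uninformative but non-constant reward may therefore be a more revealing control.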

Circularity Check

0 steps flagged

No significant circularity: reward derived from independent external embeddings and transcripts

The paper's central mechanism defines a reward for GRPO based on precomputed text-embedding similarity between generated plans and ASR transcripts from instructional videos. This reward signal is constructed from external, pre-existing data sources (ASR narrations and frozen embedding models) that do not depend on the policy parameters being optimized or on any fitted values derived from the target benchmarks. No equation or step in the provided description reduces the reported gains (+7 to +16 points) to a tautology by construction, by self-definition, or through a load-bearing self-citation. The derivation remains self-contained against external benchmarks, with the improvement presented as an empirical outcome of RL optimization rather than a renaming or re-derivation of the input reward itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that ASR grounding quality is a faithful proxy for plan quality; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)

Domain assumption: Grounding quality measured against ASR transcripts via text embeddings is a reliable and scalable proxy for the validity of generated procedural plans.
This asymmetry is presented as the key enabler that turns the noisy corpus into a verifier.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation: unclear
Paper passage: "We exploit this asymmetry in RECIPE, which uses grounding quality as a reward signal for Group Relative Policy Optimization (GRPO), turning the noisy corpus into a verification signal rather than a labeling source."

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation: unclear
Paper passage: "The reward credits the policy only for the alignment improvement the continuation adds beyond the history alone."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 6 internal anchors

[1]

Unsupervised learning from narrated instruction videos

Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, and Simon Lacoste- Julien. Unsupervised learning from narrated instruction videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4575–4583, 2016

work page 2016
[2]

My view is the best view: Procedure learning from egocentric videos

Siddhant Bansal, Chetan Arora, and CV Jawahar. My view is the best view: Procedure learning from egocentric videos. InEuropean Conference on Computer Vision, pages 657–675. Springer, 2022

work page 2022
[3]

Procedure planning in instructional videos

Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, and Juan Carlos Niebles. Procedure planning in instructional videos. InEuropean Conference on Computer Vision, pages 334–350. Springer, 2020

work page 2020
[4]

Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. Planning with reasoning using vision language world model.arXiv preprint arXiv:2509.02722, 2025

work page arXiv 2025
[5]

Egoplan-bench: Benchmarking multimodal large language models for human-level planning.International Journal of Computer Vision, 134(3):118, 2026

Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. Egoplan-bench: Benchmarking multimodal large language models for human-level planning.International Journal of Computer Vision, 134(3):118, 2026

work page 2026
[6]

Derpanis, Animesh Garg, and Allan D

Nikita Dvornik, Isma Hadji, Konstantinos G. Derpanis, Animesh Garg, and Allan D. Jepson. Drop-DTW: Aligning common signal between sequences while dropping outliers. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[7]

Derpanis, Animesh Garg, Richard P

Nikita Dvornik, Isma Hadji, Ran Zhang, Konstantinos G. Derpanis, Animesh Garg, Richard P. Wildes, and Allan D. Jepson. StepFormer: Self-supervised step discovery and localization in instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[8]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Ego4D: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18995–19012, 2022

work page 2022
[10]

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[11]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.Nature, 645:633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.Nature, 645:633–638, 2025

work page 2025
[12]

Planning without search: Refining frontier LLMs with offline goal-conditioned RL

Joey Hong, Anca Dragan, and Sergey Levine. Planning without search: Refining frontier LLMs with offline goal-conditioned RL. InAdvances in Neural Information Processing Systems, 2025

work page 2025
[13]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. InInternational conference on machine learning, pages 9118–9147. PMLR, 2022

work page 2022
[14]

Propose, assess, search: Harnessing LLMs for goal-oriented planning in instructional videos

Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, and Xitong Yang. Propose, assess, search: Harnessing LLMs for goal-oriented planning in instructional videos. InEuropean Conference on Computer Vision (ECCV), pages 436–452. Springer, 2024

work page 2024
[15]

OpenVLA: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. InConference on Robot Learning (C...

work page 2024
[16]

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefen- stette, and Roberta Raileanu. Understanding the effects of rlhf on llm generalisation and diversity.arXiv preprint arXiv:2310.06452, 2023. 14

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V . Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirz...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Error detection in egocentric procedural task videos

Shih-Po Lee, Zijia Lu, and Kristen Grauman. Error detection in egocentric procedural task videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[19]

Encouraging good processes without the need for good answers: Reinforcement learning for llm agent planning

Zhiwei Li, Yong Hu, and Wenqing Wang. Encouraging good processes without the need for good answers: Reinforcement learning for llm agent planning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1654–1666, 2025

work page 2025
[20]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

work page 2019
[21]

Learning to ground instructional articles in videos through narrations

Effrosyni Mavroudi, Triantafyllos Afouras, and Lorenzo Torresani. Learning to ground instructional articles in videos through narrations. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15201–15213, 2023

work page 2023
[22]

End-to-end learning of visual representations from uncurated instructional videos

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9879–9889, 2020

work page 2020
[23]

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2630–2640, 2019

work page 2019
[24]

Why not use your textbook? knowledge-enhanced procedure planning of instructional videos

Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, and Muhammad Haris Khan. Why not use your textbook? knowledge-enhanced procedure planning of instructional videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18816–18826, 2024

work page 2024
[25]

Needleman and Christian D

Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins.Journal of Molecular Biology, 48(3):443–453, 1970

work page 1970
[26]

Ng, Daishi Harada, and Stuart J

Andrew Y . Ng, Daishi Harada, and Stuart J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InProceedings of the Sixteenth International Conference on Machine Learning (ICML), pages 278–287. Morgan Kaufmann, 1999

work page 1999
[27]

SCHEMA: State CHanges MAtter for procedure planning in instructional videos

Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, and Shih-Fu Chang. SCHEMA: State CHanges MAtter for procedure planning in instructional videos. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[28]

Pretrained language models as visual planners for human assistance

Dhruvesh Patel, Hamid Eghbalzadeh, Nitin Kamra, Michael Louis Iuzzolino, Unnat Jain, and Ruta Desai. Pretrained language models as visual planners for human assistance. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15302–15314, 2023

work page 2023
[29]

Captaincook4d: A dataset for understanding errors in procedural activities

Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Bhavya Gouripeddi, Qifan Zhang, Jikai Wang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, Yu Xiang, and Vibhav Gogate. Captaincook4d: A dataset for understanding errors in procedural activities. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and...

work page 2024
[30]

EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, and Xihui Liu. EgoPlan-Bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447, 2024

work page arXiv 2024
[31]

Sentence-BERT: Sentence embeddings using siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992. Association for Computational Linguistics, 2019

work page 2019
[32]

Dynamic programming algorithm optimization for spoken word recognition

Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978. 15

work page 1978
[33]

Viterbiplannet: Injecting procedural knowledge via differentiable viterbi for planning in instructional videos

Luigi Seminara, Daniele Moltisanti, and Antonino Furnari. Viterbiplannet: Injecting procedural knowledge via differentiable viterbi for planning in instructional videos. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

work page 2026
[34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Howtocaption: Prompting llms to transform video annotations at scale

Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, and Hilde Kuehne. Howtocaption: Prompting llms to transform video annotations at scale. InEuropean Conference on Computer Vision, pages 1–18. Springer, 2024

work page 2024
[36]

Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[37]

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Kiran Somasundaram, Jing Dong, Huixuan Tang, Julian Straub, Mingfei Yan, Michael Goesele, Jakob J. Engel, Renzo De Nardi, and Richard Newcombe. Project aria: A new tool for egocentric multi-modal AI research.arXiv preprint arXiv:2308.13561, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[39]

Jina embeddings v3: Multilingual text encoder with low-rank adaptations

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, and Han Xiao. Jina embeddings v3: Multilingual text encoder with low-rank adaptations. In Claudia Hauff, Craig Macdonald, Dietmar Jannach, Gabriella Kazai, Franco Maria Nardini, Fabio Pinelli, Fabrizio Sil...

work page 2025
[40]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, second edition, 2018

work page 2018
[41]

COIN: A large-scale dataset for comprehensive instructional video analysis

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

[42]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[43]

Hanlin Wang, Yilu Wu, Sheng Guo, and Limin Wang. PDPP: Projected diffusion for procedure planning in instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14836–14845, 2023

[44]

Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. HoloAssist: An egocentric human interaction dataset for interactive AI assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), ...

[45]

Yilu Wu, Hanlin Wang, Jing Wang, and Limin Wang. Open-event procedure planning in instructional videos. arXiv preprint arXiv:2407.05119, 2024

[46]

Dejie Yang, Zijing Zhao, and Yang Liu. PlanLLM: Video procedure planning with refinable large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9166–9174, 2025

[47]

Ali Zare, Yulei Niu, Hammad Ayyubi, and Shih-Fu Chang. RAP: Retrieval-augmented planner for adaptive procedure planning in instructional videos. In European Conference on Computer Vision (ECCV), 2024

[48]

Ce Zhang, Yale Song, Ruta Desai, Michael Louis Iuzzolino, Joseph Tighe, Gedas Bertasius, and Satwik Kottur. Enhancing visual planning with auxiliary tasks and multi-token prediction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4190–4200, March 2026

[49]

He Zhao, Isma Hadji, Nikita Dvornik, Konstantinos G. Derpanis, Richard P. Wildes, and Allan D. Jepson. P3IV: Probabilistic procedure planning from instructional videos with weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2938–2948, 2022.

[50]

Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li. Learning procedure-aware video representation from instructional videos and their narrations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14825–14835, 2023

[51]

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic. Cross-task weakly supervised learning from instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

A Implementation details

A.1 HowToCaption corpus

The reward signal is computed against ...


