PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Chengxuan Qian; Dingcheng Wang; Han Liu; Haoran Lu; Haosen Sun; Jianshu Zhang; Letian Xue

arxiv: 2601.15224 · v2 · pith:YFEIEFXDnew · submitted 2026-01-21 · 💻 cs.CV · cs.CL

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

Pith reviewed 2026-05-25 06:48 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords progress reasoning · vision-language models · task progress estimation · Progress-Bench · two-stage reasoning · ProgressLM-45K · ProgressLM-3B

The pith

A small vision-language model trained on progress reasoning data improves at estimating task progress on entirely new tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can describe visible content but generally cannot infer how far a task has advanced from partial observations over time. It creates Progress-Bench to test this capability and examines a two-stage reasoning method that first describes the current state then estimates progress. Prompting alone gives only limited, model-specific gains, but training ProgressLM-3B on the ProgressLM-45K dataset produces consistent improvements that hold on the benchmark even though the training and evaluation tasks share no overlap. This matters because real applications such as assembly, cooking, or navigation require tracking ongoing advancement rather than static recognition alone. The work also maps out common failure patterns, including sensitivity to viewpoint and demonstration format.

Core claim

Progress reasoning requires inferring long-horizon dynamics from partial views. Most of the 14 tested VLMs perform poorly on Progress-Bench, showing sensitivity to modality and viewpoint plus weak handling of unanswerable cases. A human-inspired two-stage paradigm yields only modest prompting gains, yet the same paradigm applied through supervised training on ProgressLM-45K produces ProgressLM-3B, which delivers consistent accuracy lifts on the benchmark despite complete task disjointness between training and evaluation sets.
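The paper's Figure 4 reports the handling of unanswerable cases as Unanswerable Detection Accuracy (UDA). Its exact definition is not reproduced on this page; the Python sketch below assumes the simplest reading, namely the share of unanswerable samples for which the model answers "n/a". The function name and the list-of-strings input are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def unanswerable_detection_accuracy(score_fields: Sequence[str]) -> float:
    """Share of unanswerable samples whose score field is "n/a".

    ASSUMPTION: every entry comes from a sample whose ground truth is
    unanswerable. This is a guess at the metric, not the paper's definition.
    """
    if not score_fields:
        raise ValueError("need at least one unanswerable sample")
    # Tolerate surrounding whitespace, quotes, and letter case.
    hits = sum(
        1 for field in score_fields
        if field.strip().strip('"').lower() == "n/a"
    )
    return hits / len(score_fields)
```

Under this reading, a model that always emits a numeric score gets a UDA of 0 no matter how accurate its scores are on answerable samples.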

What carries the argument

Two-stage progress reasoning paradigm (state description followed by progress estimation), instantiated via training on the ProgressLM-45K dataset to produce ProgressLM-3B.
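The prompts quoted from the paper express these two stages as four tagged fields, `<ref_think>`, `<ref>`, `<score_think>`, and `<score>`, with "n/a" marking cases where no reference can be matched. A minimal Python sketch of how such a response could be parsed is given here; `ProgressAnswer`, `parse_progress_response`, and the regular expressions are hypothetical names and choices made for illustration, not the authors' code.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressAnswer:
    ref_think: str
    ref: Optional[int]        # index of the retrieved demonstration step
    score_think: str
    score: Optional[float]    # progress in percent, clipped to 0..100

    @property
    def unanswerable(self) -> bool:
        # Treated as unanswerable when neither field holds a number.
        return self.ref is None and self.score is None


def _field(text: str, tag: str) -> str:
    """Return the stripped content of <tag>...</tag>, or "" if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_progress_response(text: str) -> ProgressAnswer:
    # "n/a" contains no digits, so both searches return None for it.
    ref_match = re.search(r"\d+", _field(text, "ref"))
    score_match = re.search(r"\d+(?:\.\d+)?", _field(text, "score"))
    score = float(score_match.group()) if score_match else None
    if score is not None:
        score = min(100.0, max(0.0, score))
    return ProgressAnswer(
        ref_think=_field(text, "ref_think"),
        ref=int(ref_match.group()) if ref_match else None,
        score_think=_field(text, "score_think"),
        score=score,
    )
```

Keeping the reference index and the score as separate fields makes it possible to inspect the first stage (which demonstration step was retrieved) independently of the second (the numeric estimate).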

If this is right

Explicit training on progress signals can instill temporal reasoning that prompting alone does not reliably produce.
Small-scale models can acquire this capability when given suitable data.
Progress estimation generalizes across task domains when the training set targets the underlying reasoning structure.
Benchmark results expose specific weaknesses such as viewpoint sensitivity that future models must address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage approach could be applied to embodied agents that must monitor their own ongoing actions.
Progress reasoning may serve as a building block for longer-horizon planning and goal monitoring in autonomous systems.
The observed transfer from disjoint tasks suggests the model learns abstract progress concepts rather than surface task patterns.

Load-bearing premise

The signals captured in the ProgressLM-45K dataset reflect generalizable progress estimation that transfers to the activities in Progress-Bench.

What would settle it

If ProgressLM-3B shows no accuracy gain or degrades relative to its base model when evaluated on Progress-Bench, the transfer claim would be falsified.

Figures

Figures reproduced from arXiv: 2601.15224 by Chengxuan Qian, Dingcheng Wang, Han Liu, Haoran Lu, Haosen Sun, Jianshu Zhang, Letian Xue.

**Figure 1.** Given a task demonstration and a single observation, the goal is to estimate how much of the task has already been completed. Direct prediction can often judge whether the task is unfinished, but struggles to assign a well-calibrated progress score. Progress reasoning instead follows a coarse-to-fine process: it first performs episodic retrieval to coarsely locate the observation along the demonstrated tas…

**Figure 2.** Overview of PROGRESS-BENCH construction. (a) Demonstration setup: tasks are presented as either vision-based demonstrations with key frames or text-based ones with step-wise actions, each annotated with progress scores. (b) Observation sampling: observations are sampled from intermediate or boundary positions between demonstration steps, with progress labels assigned by interpolation; vision-based settin…

**Figure 3.** Data statistics of PROGRESS-BENCH and PROGRESSLM-45K (25K for SFT while 20K for RL). Traj and Samp denote the numbers of task trajectories and sampled observations to be estimated, respectively. The upper-right panel shows the four distinct robotic embodiments included, while the lower-right panel visualizes the diversity of objects involved in task interactions.

**Figure 4.** Unanswerable Detection Accuracy (UDA) across models under two settings.

**Figure 5.** Distribution of predicted progress scores. Some models exhibit collapsed or clustered…

**Figure 6.** Raincloud plots of per-sample normalized score prediction error across models and…

**Figure 7.** Coupling between the two stages of progress reasoning in PROGRESSLM. A diagonal concentration indicates that the anchor selected during episodic retrieval consistently constrains second-stage mental simulation. Are the two reasoning stages coupled rather than independent? Yes: the anchor retrieved in the first stage systematically constrains subsequent progress estimation. To examine this coupling, we anal…

**Figure 8.** Illustration of implicit state accumula…

**Figure 9.** Data distribution statistics across Benchmark, SFT, and RL splits. This figure shows the distribution of samples produced by our data construction pipeline across Benchmark, Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL) stages. Samples are organized by demonstration–observation setting (Vision Same-View, Vision Cross-View, Text, Vision Unanswerable, Text Unanswerable), with stacked bars de…

**Figure 10.** Data construction pipeline. This Sankey diagram illustrates how raw manipulation trajectories from four heterogeneous robotic platforms (Franka, AgileX, Humanoid, and UR5e) are transformed through our data construction process. Demonstrations are first organized into vision and text modalities, then further expanded into multiple demonstration–observation settings, including vision same-view, vision cros…

**Figure 11.** Case Visualization of vision-based unanswerable samples construction via Image Editing. To test whether models can recognize ill-defined progress, we construct visual unanswerable samples by breaking the semantic consistency between demonstrations and observations while preserving realism. Given an image at a specific manipulation step, we edit the key object using three strategies: (a) Color Change, alt…

**Figure 12.** Diagnostic analysis of coupled progress reasoning. Heatmaps show the relationship between the episodic retrieval anchor index (x-axis) and the score-aligned demonstration index (y-axis) under Vision Same-View, Vision Cross-View, and Text settings. A strong diagonal indicates tight coupling between episodic retrieval and progress estimation. While coupling is strongest in the same-view setting and graduall…

**Figure 13.** Unanswerable Detection Accuracy (UDA) across models with and without training-free thinking. This figure compares unanswerable detection accuracy under Vision-based and Text-based demonstrations, contrasting standard inference (NoThink) with training-free explicit reasoning (Think). Across both modalities, enabling training-free thinking consistently improves UDA for most models, with particularly pronoun…

**Figure 14.** Vision-Based Case Visualization (Same-View). This example illustrates how the model performs progress estimation by coupling episodic retrieval with mental simulation. Given a current observation (right), the model retrieves the most semantically aligned demonstration step (No. 7) from the visual demo sequence (left), where the plates are nearly stacked. Based on this retrieved anchor, the model estimates…

**Figure 15.** Vision-Based Case Visualization (Cross-View). This example demonstrates cross-view progress estimation under viewpoint mismatch between the demonstration sequence shown at the top and the current observation shown at the bottom left. Given the current state image captured from a different camera perspective, the model retrieves the most semantically aligned demonstration step No. 5 at 80 percent progress,…

**Figure 16.** Text-Based Demonstration Case Visualization. This example illustrates progress estimation using text-based demonstrations. Given the current visual observation, the model retrieves the most semantically aligned textual instruction Step 3 by grounding language-described action semantics to the observed object state, where the plate is lifted and held above the rack but not yet placed. To bridge the gap bet…

**Figure 17.** Vision-based Demonstration Unanswerable Case Visualization. This example illustrates a visual unanswerable scenario where the current observation is semantically inconsistent with the given demonstration. While the demonstration depicts a task of stacking a blue block on a pink block, the observed state shows the robot holding an unrelated white block that does not appear in any demonstration step. As no…

**Figure 18.** Text-based Demonstration Unanswerable Case Visualization. This example illustrates a text unanswerable scenario where the current visual observation is semantically incompatible with the textual demonstration. While the task goal and instructions describe stacking bowls on a bowl holder, the observed state contains a stack of cups on the floor, involving different object categories and spatial configurati…

**Figure 19.** In-the-wild Generalization on Human Activities. This example demonstrates the model’s ability to generalize coupled progress reasoning beyond robotic manipulation to human-performed activities. Given a sequence of demonstration frames depicting the step-by-step process of opening a jar and pouring its contents, the model retrieves the most semantically aligned demonstration step (No. 3) for the current ob…

**Figure 20.** Gradio-based Human Filtering Platform for Visual Unanswerable Data Generation. We employ a Gradio-based annotation interface to manually verify the quality of edited images used for visual unanswerable construction. Annotators are presented with the original and edited images alongside the task goal, step-level demonstrations, editing strategy, and prompt. Each edited sample is retained only if it simulta…
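The Figure 2 caption states that observation labels are assigned by interpolation between demonstration steps. A minimal Python sketch of that idea follows, under two assumptions that the text on this page does not confirm: demonstration steps carry evenly spaced scores from 0% to 100%, and the interpolation is linear. The function names are hypothetical.

```python
from typing import List


def demo_progress_scores(num_steps: int) -> List[float]:
    """Evenly spaced progress scores, 0 to 100, one per demonstration step.

    ASSUMPTION: even spacing, chosen for illustration only.
    """
    if num_steps < 2:
        raise ValueError("a demonstration needs at least two steps")
    return [100.0 * i / (num_steps - 1) for i in range(num_steps)]


def interpolate_progress(scores: List[float], left_index: int, fraction: float) -> float:
    """Label for an observation lying between two demonstration steps.

    left_index: 0-based index of the step just before the observation.
    fraction:   position between that step and the next, in [0, 1].
    ASSUMPTION: linear interpolation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if fraction == 0.0:
        # Boundary sample: the observation coincides with a demonstration step.
        return scores[left_index]
    low, high = scores[left_index], scores[left_index + 1]
    return low + fraction * (high - low)
```

For an eight-frame demonstration, even spacing rounds to 0, 14, 29, 43, 57, 71, 86, and 100 percent, and an observation halfway between the third and fourth of five steps would be labeled 62.5%.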

Original abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and training-based approach based on curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails. Website: https://progresslm.github.io/ProgressLM/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Progress-Bench, a benchmark for evaluating task progress reasoning in VLMs from partial observations, along with ProgressLM-45K, a curated dataset supporting a two-stage human-inspired reasoning paradigm. It evaluates 14 VLMs, finding most struggle with modality/viewpoint sensitivity and unanswerable cases; training-free prompting yields limited gains, while a trained ProgressLM-3B model reports consistent improvements on the benchmark despite training on a fully disjoint task set from the evaluation tasks. Further analyses examine error patterns.

Significance. If the generalization result holds, the work addresses a clear gap in VLM capabilities for long-horizon dynamics beyond static recognition, with potential impact on robotics and agentic systems. Strengths include the new benchmark, the explicit training on a disjoint task set (if verified), and the empirical demonstration that small-scale training can yield gains where prompting does not. The dataset and benchmark are concrete contributions that could enable follow-on work.

major comments (2)

[Abstract, §3] Abstract and §3 (dataset/task description): The headline claim that ProgressLM-3B 'achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks' is load-bearing for the argument that the model acquires abstract progress reasoning rather than exploiting shared task dynamics. The manuscript asserts full disjointness but provides no enumerated breakdown of training vs. evaluation task categories, object classes, action sequences, or environments (normally expected in §3.2 or Table 1), leaving open the possibility that both sets contain overlapping long-horizon manipulation or navigation primitives.
[§4] §4 (experiments) and associated tables/figures: The soundness assessment is limited because the manuscript reports no error bars, no ablations of the two-stage paradigm's components, and no dataset statistics (e.g., task distribution, sequence lengths). Without these, it is not possible to verify whether the reported gains are robust or sensitive to post-hoc choices in data curation or evaluation.
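The enumerated breakdown requested in the first major comment could be audited mechanically. The Python sketch below intersects training and evaluation labels per attribute; the attribute names, the normalization rule, and the function names are assumptions made for illustration, and the example labels are invented placeholders rather than the paper's actual task lists.

```python
from typing import Dict, Set


def _normalize(labels: Set[str]) -> Set[str]:
    # Compare labels case-insensitively and ignore stray whitespace.
    return {label.strip().lower() for label in labels}


def overlap_report(
    train: Dict[str, Set[str]], evaluation: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Per-attribute intersection of training and evaluation labels."""
    report = {}
    for attribute in sorted(set(train) | set(evaluation)):
        shared = _normalize(train.get(attribute, set())) & _normalize(
            evaluation.get(attribute, set())
        )
        report[attribute] = shared
    return report


def fully_disjoint(report: Dict[str, Set[str]]) -> bool:
    """True only if no attribute has any shared label."""
    return all(not shared for shared in report.values())


# Placeholder labels, invented for illustration only.
train_split = {"task": {"Stack plates"}, "object": {"plate", "bowl"}}
eval_split = {"task": {"insert battery"}, "object": {"Bowl", "battery"}}
report = overlap_report(train_split, eval_split)
```

In this toy case the task names are disjoint but the object class "bowl" is shared, which is the kind of partial overlap the comment asks the authors to rule out or report.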

minor comments (2)

[Abstract] The abstract references 'Further analyses reveal characteristic error patterns' without indicating the specific section or figure where these are presented, which would improve readability.
A website link is provided, but the manuscript does not say whether code, model weights, or the full ProgressLM-45K dataset splits are released to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments highlight areas where additional details can strengthen the presentation of our claims regarding task disjointness and experimental robustness. We address each point below and commit to revisions that clarify these aspects without altering the core findings.

Referee: [Abstract, §3] Abstract and §3 (dataset/task description): The headline claim that ProgressLM-3B 'achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks' is load-bearing for the argument that the model acquires abstract progress reasoning rather than exploiting shared task dynamics. The manuscript asserts full disjointness but provides no enumerated breakdown of training vs. evaluation task categories, object classes, action sequences, or environments (normally expected in §3.2 or Table 1), leaving open the possibility that both sets contain overlapping long-horizon manipulation or navigation primitives.

Authors: We agree that an explicit breakdown would make the disjointness claim more verifiable and directly address potential concerns about shared primitives. The training tasks in ProgressLM-45K were sourced from distinct video collections and task definitions separate from the Progress-Bench evaluation set to avoid overlap in specific sequences and environments. In the revised manuscript, we will expand §3.2 with a new table enumerating training vs. evaluation task categories, object classes, action sequences, and environments, along with a brief justification of the curation process to confirm full disjointness. revision: yes
Referee: [§4] §4 (experiments) and associated tables/figures: The soundness assessment is limited by the absence of reported metrics such as error bars, ablation studies on the two-stage paradigm components, or dataset statistics (e.g., task distribution, sequence lengths) that would allow verification of whether the reported gains are robust or sensitive to post-hoc choices in data curation or evaluation.

Authors: We concur that these elements would enhance the assessment of robustness. The current results reflect single-run evaluations on the benchmark, but we have access to the underlying data for multiple seeds. In the revision, we will add error bars computed over 3 runs to the main tables in §4, include a dedicated ablation subsection on the two-stage paradigm components (e.g., removing the first or second stage), and report dataset statistics such as task distribution and average sequence lengths in §3 or the appendix. revision: yes
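The error bars promised in the second response could be computed as a mean and sample standard deviation over the repeated runs. A minimal Python sketch using only the standard library follows; the function names and the "mean ± std" formatting are assumptions, and the rebuttal does not say which dispersion statistic the authors would report.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(run_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (n - 1 denominator) over runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(run_scores), statistics.stdev(run_scores)


def format_with_error_bar(run_scores: Sequence[float], digits: int = 1) -> str:
    """Render one table cell as 'mean ± std'."""
    mean, std = mean_and_std(run_scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

With only three runs the standard deviation is itself a noisy estimate, so such error bars indicate rough stability rather than a tight confidence interval.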

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation on explicitly disjoint task sets

full rationale

The paper introduces Progress-Bench and trains ProgressLM-3B on ProgressLM-45K, asserting the task sets are fully disjoint, then reports empirical gains. No equations, fitted parameters, or self-citations are used to derive the central claim; results rest on experimental measurement rather than any reduction of outputs to inputs by construction. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the curated dataset and the assumption that progress reasoning can be learned from static image observations; these are domain assumptions without independent verification in the abstract.

axioms (1)

domain assumption: The ProgressLM-45K dataset supplies transferable signals for progress reasoning across disjoint task sets.
The training-based improvement is presented as evidence of generalization from this dataset.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
cs.RO · 2026-03 · unverdicted · novelty 6.0

Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
cs.CV · 2026-05 · unverdicted · novelty 5.0

IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localizatio...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 2 Pith papers


[1] [1]

Analyze the input state-to-estimate image and identify objects that could serve as plausible replacements

work page

[2] [2]

Replace the target object in both the task goal and all step-by-step instructions

work page

[3] [3]

Preserve the original sentence structure, action verbs, and spatial markers (e.g., [left], [right],[towards]) The model outputs the modified task goal and instructions in a structured XML format, as shown in Table 9. This approach ensures that the edited instructions remain grammatically coherent 18 Vision Same-View Vision Cross-View Text Vision Unanswera...

work page 2000

[4] [4]

YES” (keep) or “NO

Object Replacement: Replacing the target object with a semantically different object (e.g., replacing an egg with an orange) 3.Occlusion/Removal: Hiding or removing key objects from the scene The model first reasons about which strategy would most effectively violate the instruction while maintaining visual realism, then outputs a concise editing prompt (...

work page

[5] [5]

<image> 0% <image> 25% <image> 50% <image> 75% <image> 100%

insert the battery...”), the model receives the current state image along with the ground-truth reference step (which text step most closely matches the current state) and progress score. The model generates reasoning in two phases: <ref_think> explains why the given reference step is most relevant to the current image, and <score_think> justifies how com...

work page 2025

[6] [6]

24 0% 14% 29% 43% 57% 71% 86% 100% State to Estimate <ref_think>The current state image shows a stack of plates being placed on the table, which is similar to the statein the No

serving as the generation engine with60%GPU memory utilization. 24 0% 14% 29% 43% 57% 71% 86% 100% State to Estimate <ref_think>The current state image shows a stack of plates being placed on the table, which is similar to the statein the No. 7 demo image where the plates are also being stacked. The robot arm is still in the process of placingthe plates, ...

work page

[7] [7]

move towards the plate

work page

[8] [8]

move the plate to the plate rack

work page

[9] [9]

position the plate vertically in the rack

work page

[10] [10]

lift the plate

release the plate Task Goal: placing a plate in a plate rack <ref_think> The current state image shows a plate being held above aplate rack, which aligns with the action described in Step 3 of thedemonstration: "lift the plate." The plate is not yet in the rack,indicating that the action of lifting is in progress or just completed. This step is the most r...

work page

[11] [11]

[right]pick up the bowl on the far right

work page

[12] [12]

[right]place the rightmost bowl on the table

work page

[13] [14]

[left]place the leftmost bowl on the first bowl

work page

[14] [15]

[left]pick up the bowl on the far left

work page

[15] [16]

stage_to_estimate

[left]place the leftmost bowl on the second bowl Task Goal: place the bowls on the bowl holder onto the table, and stacked together <ref_think> The current image shows a stack ofcups on the floor, which is entirely different fromthe task goal of stacking plates on a plate holder.The objects in the image (cups) do not align with the objects mentioned in th...


[16] [17]

Check the current state image carefully


[17] [18]

Analyze the overall task goal and visual demonstration to understand how the task progresses from start to completion


[18] [19]

Identify the reference states from the visual demonstration that are most related to the current state image


[19] [20]

Compare the current state image with the chosen reference state, determining whether the image is behind or after the reference state


[20] [21]

Estimate the progress numerically as a floating-point value between 0% and 100%


[21] [22]

n/a"</ref> <score_think> Reason for comparing the current state image with the reference state or

If you really cannot match the current state image to any of the states from demonstration, you need to explain the reason within <ref_think></ref_think> and output "n/a" within <ref></ref>, <score_think></score_think>, and <score></score>. Your response must strictly follow this format: <ref_think> Reason for choosing the most related state fro...
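The tagged response format described above lends itself to a small parser. The following is a minimal sketch, not the authors' evaluation code: the names `ProgressResponse` and `parse_progress_response` are hypothetical, and it assumes that each of the four tags appears exactly once and that the score is written as a number with an optional percent sign.

```python
import re
from typing import NamedTuple, Optional


class ProgressResponse(NamedTuple):
    """Hypothetical container for one parsed model response."""
    ref_think: str
    ref: Optional[str]        # None when the model answered "n/a"
    score_think: str
    score: Optional[float]    # percentage in [0, 100]; None when "n/a"
    answerable: bool


TAGS = ("ref_think", "ref", "score_think", "score")


def _extract(tag: str, text: str) -> Optional[str]:
    # Non-greedy match so that <ref> never swallows <ref_think> content.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def _is_na(value: str) -> bool:
    # Accept n/a with or without straight or curly quotes around it.
    return value.strip().strip('"\u201c\u201d').lower() == "n/a"


def parse_progress_response(text: str) -> ProgressResponse:
    fields = {tag: _extract(tag, text) for tag in TAGS}
    missing = [tag for tag, value in fields.items() if value is None]
    if missing:
        raise ValueError(f"missing tags: {missing}")

    if _is_na(fields["score"]):
        # Unanswerable case: no reference and no numeric score.
        return ProgressResponse(fields["ref_think"], None,
                                fields["score_think"], None, False)

    score = float(fields["score"].rstrip("%").strip())
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    return ProgressResponse(fields["ref_think"], fields["ref"],
                            fields["score_think"], score, True)
```

Raising on a missing tag rather than guessing keeps format violations separate from genuine "n/a" answers, which matters when unanswerable cases are scored on their own.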


[22] [23]

Read the task goal to understand the task objective and the entity being operated on


[23] [24]

Analyze the textual demonstration to understand how the task progresses from start to completion


[24] [25]

Examine the current state image carefully. If the target is incorrect (different from the object mentioned in task goal) or you really cannot match the current image to any step in the demonstration, you must explain the reason within <ref_think></ref_think> and output “n/a” within <ref></ref>, <score_think></score_think>, and <score></score>


[25] [26]

stage_to_estimate

If a match is possible, examine all steps in the textual demonstration, where each step represents an independent action. Identify the single step whose action is most closely related to the current state image. Then compare the current image with that reference step to determine whether it corresponds to an earlier or later stage, and finally estimate th...


[26] [27]

Analyze the demonstration images to understand how the task visually progresses from start to completion


[27] [28]

Identify the frame (or frames) from the demonstration that are visually most similar to the current state image


[28] [29]

Compare the current state to that reference frame and determine whether it shows more or less progress


[29] [30]

stage_to_estimate

Finally, provide a numeric progress estimation between 0% and 100%, or set both <ref> and <score> to “n/a” when encountering an abnormal situation. Your response must strictly follow this format: <ref_think> Your reasoning for choosing the closest demonstration frame as the reference, OR explanation of why the situation is abnormal and no reference can be identi...


[30] [31]

Analyze the text_demo to understand how the task visually and conceptually progresses from start to completion


[31] [32]

Identify the step from the text_demo that is most visually and semantically similar to the current state image


[32] [33]

Compare the current state image with the chosen reference step to determine whether it represents an earlier or later stage


[33] [34]

Estimate the progress numerically as a floating-point value between 0% and 100%, or set both <ref> and <score> to “n/a” when encountering an abnormal situation. Your response must strictly follow this format: <ref_think> Your reasoning for choosing the most similar text_demo step as the reference, OR explanation of why the situation is abnormal and no reference can...


[34] [35]

Color Change: Alter the color of critical objects (e.g., change a red apple to green)


[35] [36]

Object Replacement: Replace the target object with a different object (e.g., replace an egg with an orange)


[36] [37]

Occlusion/Removal: Hide or remove key objects from the scene

Requirements:


[37] [38]

The edited image should clearly violate the corresponding instruction


[38] [39]

Maintain visual realism and coherence—the edited image must look natural and believable


[39] [40]

Ensure the edit would cause the overall task goal to fail


[40] [41]

Object Replacement

The modification should be semantically meaningful (not just noise or blur). Output Format: <strategy_think> Analyze the current instruction and image content. Think step by step about which editing strategy would most effectively violate this instruction while maintaining realism. Consider the key objects involved and how modifying them would break the i...


[41] [42]

Keep the original sentence format and structure - ONLY replace the object name


[42] [43]

put your edited task goal here

For each step in Step-by-step Instructions, preserve ALL markers like [right], [left], [towards], etc. in their EXACT original positions. Output Format: <edited_goal>"put your edited task goal here"</edited_goal> <edited_demo> "text_demo": ["your edited step 1", "your edited step 2", "your edited step 3", ..., "your edited step n"] </edited_demo> Table 9:...
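The marker-preservation requirement in this output format can be checked mechanically. The sketch below is an illustration under stated assumptions rather than the paper's pipeline: `parse_edited_demo` and `markers_preserved` are hypothetical names, the `<edited_demo>` body is assumed to be valid JSON once wrapped in braces, and "exact original positions" is approximated by requiring the same ordered sequence of bracketed markers in each step.

```python
import json
import re
from typing import List, Tuple

# Bracketed markers such as [right], [left], [towards].
MARKER = re.compile(r"\[[A-Za-z_]+\]")


def parse_edited_demo(response: str) -> Tuple[str, List[str]]:
    """Extract the edited goal and the edited step list from a response."""
    goal = re.search(r"<edited_goal>(.*?)</edited_goal>", response, re.DOTALL)
    demo = re.search(r"<edited_demo>(.*?)</edited_demo>", response, re.DOTALL)
    if goal is None or demo is None:
        raise ValueError("response is missing <edited_goal> or <edited_demo>")
    # The body looks like: "text_demo": ["step 1", ...]; wrap it to get JSON.
    steps = json.loads("{" + demo.group(1).strip() + "}")["text_demo"]
    return goal.group(1).strip().strip('"'), steps


def markers_preserved(original: List[str], edited: List[str]) -> bool:
    """True if every step keeps the same markers in the same order."""
    if len(original) != len(edited):
        return False
    return all(MARKER.findall(o) == MARKER.findall(e)
               for o, e in zip(original, edited))
```

A stricter check could also compare character offsets of each marker, but offsets shift whenever the replaced object name has a different length, so the ordered-sequence comparison is the more forgiving choice here.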
