pith. machine review for the scientific record.

arxiv: 2605.08747 · v3 · submitted 2026-05-09 · 💻 cs.AI

Recognition: no theorem link

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

Jie Chen, Lei Yi, Lihuang Fang, Mingxu Wang, Rui Jiang, Ying Chen, Zhifeng Gu

Authors on Pith no claims yet

Pith reviewed 2026-05-13 00:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords embodied agents · terminal commitment · world completion · self-termination · evaluation framework · benchmark decoupling · VIGIL protocol

The pith

Embodied agents often reach task goals without correctly reporting that they have finished; VIGIL separates world-state completion (W) from benchmark success (B) so that gap becomes measurable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard evaluations of embodied agents collapse distinct failures—never finishing, finishing but not stopping, and claiming success without evidence—into one score. The paper introduces VIGIL to score world-state completion (W) independently from benchmark success (B), which additionally requires a correct terminal report verified against hidden world state. This matters because models with similar completion rates still differ by up to 19.7 percentage points in benchmark success, showing that some convert achieved states into accurate reports while others drift past goals without closing. The protocol uses only egocentric RGB input and no action-success signals, making four outcome categories visible instead of one. By isolating terminal commitment, the work lets developers target whether agents truly recognize when tasks are done.

Core claim

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. VIGIL makes terminal commitment independently measurable by requiring agents to end each episode with a semantic report checked deterministically against hidden world state. This yields separate world-state completion W and benchmark success B scores, where B additionally requires a correct terminal report. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B. An action-feedback intervention improves W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state.

What carries the argument

VIGIL evaluation protocol, which forces agents to issue a semantic terminal report at episode end that is checked deterministically against hidden world state, separating world-state completion W from benchmark success B.

If this is right

  • Action feedback improves world completion across models but leaves terminal commitment failures untouched unless reports are already grounded in achieved states.
  • Models with near-identical execution can still differ sharply in converting achieved states into correct terminal reports.
  • Four distinct failure modes—missed execution, post-attainment drift, unsupported commitment, and verified success—become separately scorable.
  • Benchmarks can now isolate and target self-termination ability rather than treating all non-success as equivalent.
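The W/B decoupling behind these points reduces to a partition over two booleans per episode: whether the hidden world goal was reached (W) and whether the agent's terminal report was verified as correct. A minimal sketch under that simplification (the paper's exact decision rule and edge-case handling are not specified here; `classify` and its argument names are illustrative):

```python
from enum import Enum

class Outcome(Enum):
    MISSED_EXECUTION = "missed execution"              # W=0, no success claimed
    UNSUPPORTED_COMMITMENT = "unsupported commitment"  # W=0, success claimed anyway
    POST_ATTAINMENT_DRIFT = "post-attainment drift"    # W=1, never correctly reports done
    VERIFIED_SUCCESS = "verified success"              # W=1, correct terminal report

def classify(world_complete: bool, report_correct: bool) -> Outcome:
    # Map one episode's (W, terminal-report) pair to the four-way partition.
    if world_complete:
        return Outcome.VERIFIED_SUCCESS if report_correct else Outcome.POST_ATTAINMENT_DRIFT
    return Outcome.UNSUPPORTED_COMMITMENT if report_correct else Outcome.MISSED_EXECUTION
```

Under this reading, per-model W is the fraction of episodes with `world_complete`, B is the fraction classified `VERIFIED_SUCCESS`, so B ≤ W by construction and the W−B gap counts exactly the post-attainment-drift episodes.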

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that reward only execution may need explicit supervision on termination decisions to close the W-B gap.
  • In real deployments, weak terminal commitment could cause agents to keep acting after goals are met or to stop without confirmation.
  • The same decoupling might appear in long-horizon planning or multi-step reasoning tasks where knowing when to halt matters.
  • Internal representations of task boundaries could be probed by checking whether achieved states reliably trigger correct reports.

Load-bearing premise

A deterministic semantic check of the agent's terminal report against the true hidden world state is sufficient to measure genuine terminal commitment without extra grounding or multi-step verification of the report.
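One way to read this premise concretely: if the terminal report can be parsed into predicate-value claims, correctness is a deterministic set comparison against the hidden state, with no learned judge in the loop. A hedged sketch, assuming such a predicate representation (the paper's actual matching rules are not given here; `check_report` and the predicate names are hypothetical):

```python
def check_report(report: dict, hidden_state: dict, goal_keys: set) -> bool:
    """Deterministically verify a parsed terminal report against hidden world state.

    The report is assumed pre-parsed into predicate -> value claims. Under this
    sketch, a report is correct iff it covers every goal-relevant predicate and
    each asserted value matches the hidden ground truth.
    """
    if not goal_keys <= report.keys():
        return False  # report omits a goal predicate: unsupported commitment
    return all(report[k] == hidden_state.get(k) for k in goal_keys)
```

For example, with hidden state `{"microwave_open": False, "cup_in_microwave": True}` and both keys goal-relevant, a report asserting exactly those values passes, while one claiming `"microwave_open": True` fails deterministically.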

What would settle it

Observe whether models that achieve high W but low B still show persistent B gaps after an action-feedback intervention that supplies execution signals, or whether all high-W models converge to high B once reports must match achieved states.

Figures

Figures reproduced from arXiv: 2605.08747 by Jie Chen, Lei Yi, Lihuang Fang, Mingxu Wang, Rui Jiang, Ying Chen, Zhifeng Gu.

Figure 1
Figure 1: Controlled evaluation protocol. VIGIL contains eight task families: a diagnostic tier (PG, DA, SV, VS) that isolates a single bottleneck, and a compositional tier (AI, SI, SM, CR) that combines them in multi-step interaction. All episodes use strict first-person observation and mandatory report termination.
Figure 2
Figure 2: Evaluation pipeline. The top row illustrates an example trajectory. The bottom row summarizes the per-step interface: the agent acts from only the current egocentric RGB frame, the task instruction, and bounded dialogue history, without action-success or goal-completion feedback. A terminal report is evaluated against the hidden world-state condition by deterministic rules.
Figure 3
Figure 3: Episode outcome partition for 10 anchor models.
Figure 4
Figure 4: Terminal-commitment profiles (sorted by B descending). (a) Mean belief lag: steps between first world-goal satisfaction and the correct terminal report. When agents do report correctly, they do so within 0.9–1.9 steps of that event (panel (a) rounds each model to one decimal place). (b) Among W = 1 episodes, percentages are the fraction of each primitive action type among all steps after the world goal is first satisfied.
Figure 5
Figure 5: State-verification trajectory example. The same initial observation can lead to correct closure (Gemini-3.1-Pro [32] and GPT-5.4 [34]) or a false report (Doubao-Seed-1.8 [33]). Claude-Sonnet-4 [35] moves away from the initially visible microwave before reporting, illustrating that even atomic verification probes can fail through unnecessary action followed by an incorrect terminal judgment.
Figure 6
Figure 6: Approach-and-interact trajectory example.
read the original abstract

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard embodied evaluations conflate task execution with terminal commitment and introduces the VIGIL framework, which decouples world-state completion (W) from benchmark success (B) by requiring a deterministically checked semantic terminal report against hidden ground truth. On 1,000 frozen episodes across 20 models, it reports up to 19.7 pp gaps in B for comparable W, distinguishes four outcome categories, and shows via action-feedback intervention that execution signals improve W but commitment failures persist in models that do not already ground reports in achieved states.

Significance. If the VIGIL measurement protocol is shown to be unbiased, the work would be significant for embodied AI by making self-termination failures independently visible and quantifiable, enabling targeted improvements beyond standard success rates. The use of frozen episodes and broad model coverage provides a reproducible empirical foundation for distinguishing missed execution, post-attainment drift, unsupported commitment, and verified success.

major comments (2)
  1. [Abstract] Abstract and experimental protocol: the central claim that VIGIL independently measures terminal commitment (via the B-W gap) rests on the untested assumption that a deterministic semantic match of the final report to hidden state is sufficient and unbiased; this could be confounded by language-model variance in verbalizing egocentric observations rather than genuine commitment, and the action-feedback intervention does not isolate this.
  2. [Abstract] Abstract: the reported 19.7 pp difference in B for models with comparable W is presented without error bars, episode-selection criteria, statistical significance tests, or per-model breakdowns, which is load-bearing for the claim that the framework makes the distinction visible across systems.
minor comments (1)
  1. [Abstract] The abstract lists four outcome categories but does not name them explicitly, reducing immediate clarity for readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. The feedback highlights important considerations for strengthening the presentation of the VIGIL framework's measurement protocol and empirical claims. We address each major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental protocol: the central claim that VIGIL independently measures terminal commitment (via the B-W gap) rests on the untested assumption that a deterministic semantic match of the final report to hidden state is sufficient and unbiased; this could be confounded by language-model variance in verbalizing egocentric observations rather than genuine commitment, and the action-feedback intervention does not isolate this.

    Authors: We appreciate the referee's concern regarding potential confounds in the semantic matching protocol. The deterministic check against hidden ground truth is designed to require agents to produce a report that accurately reflects the achieved world state, making verbalization accuracy an integral component of terminal commitment rather than a separate confound; models that execute correctly but cannot verbalize the outcome fail B by definition. The action-feedback intervention demonstrates that execution signals broadly improve W scores while commitment failures (B-W gaps) persist specifically in models lacking grounded reporting, providing evidence that the distinction is not solely attributable to verbalization variance. We acknowledge that additional controls could further isolate these factors and will add a dedicated discussion section on potential language-model confounds, including qualitative examples of report mismatches, in the revised manuscript. revision: partial

  2. Referee: [Abstract] Abstract: the reported 19.7 pp difference in B for models with comparable W is presented without error bars, episode-selection criteria, statistical significance tests, or per-model breakdowns, which is load-bearing for the claim that the framework makes the distinction visible across systems.

    Authors: We agree that the abstract and main claims would benefit from explicit statistical support. The 19.7 pp gap is computed from aggregated results over the full set of 1,000 frozen episodes spanning 20 models, with episodes selected to ensure coverage of diverse task scenarios and initial conditions. In the revised version, we will update the abstract to reference these details and incorporate error bars (e.g., via bootstrapping), statistical significance tests (e.g., paired comparisons with confidence intervals), and per-model breakdowns into the results section and supplementary materials to substantiate the cross-system visibility of the B-W distinction. revision: yes
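The bootstrapping the authors propose for the 19.7 pp gap can be sketched as a percentile bootstrap over per-episode (W, B) flags. This is illustrative only; the planned analysis may instead use paired model comparisons, and the function name and defaults below are assumptions:

```python
import random

def bootstrap_gap_ci(w_flags, b_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the W-B gap: the share of episodes
    completed in the world (W=1) minus the share scored as benchmark
    success (B=1), resampling episodes with replacement."""
    rng = random.Random(seed)
    n = len(w_flags)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        w = sum(w_flags[i] for i in idx) / n
        b = sum(b_flags[i] for i in idx) / n
        gaps.append(w - b)
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Since B requires W plus a correct terminal report, each resampled gap is non-negative, and a CI that excludes zero would directly support the claim that the W-B distinction is visible for that model.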

Circularity Check

0 steps flagged

No circularity: VIGIL metrics defined by external protocol without self-referential reduction

full rationale

The paper defines VIGIL as an evaluation protocol that separates world-state completion (W) from benchmark success (B) via deterministic semantic matching of terminal reports against hidden ground truth. No equations, derivations, or first-principles results are presented that reduce the observed B-W differences or the commitment measure to fitted parameters, self-citations, or inputs by construction. The four outcome categories and empirical gaps across models follow directly from the protocol's independent checks, with no load-bearing self-citation chains or ansatzes invoked to justify the separation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review limits visibility into any hidden parameters or assumptions; no explicit free parameters, axioms, or invented entities are stated beyond standard embodied RL setup.

pith-pipeline@v0.9.0 · 5553 in / 1176 out tokens · 30388 ms · 2026-05-13T00:59:15.244806+00:00 · methodology

discussion (0)
