pith. machine review for the scientific record.

arxiv: 2604.07480 · v1 · submitted 2026-04-08 · 💻 cs.RO · cs.AI · cs.FL

Recognition: unknown

Active Reward Machine Inference From Raw State Trajectories

Antoine Aspeel, Mohamad Louai Shehab, Necmiye Ozay

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:22 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.FL
keywords reward machines · active learning · reinforcement learning · robot task inference · automaton learning · state trajectories · multi-stage tasks

The pith

Reward machines for multi-stage robot tasks can be learned from raw state trajectories alone, without rewards or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an underlying reward machine, an automaton structure encoding the memory needed to complete a multi-stage task, can be recovered uniquely from sequences of raw states alone, taken from trajectories generated under different policies. No direct observations of rewards, state labels, or internal machine nodes are needed; the method identifies the machine structure and the labeling function from the pattern of state visits alone. It further demonstrates an active-learning extension that queries extensions of existing trajectories, reducing the total data required for reliable inference. A reader would care because this removes the manual engineering of high-level features or reward functions that currently limits deployment of such machines in robotics.

Core claim

Given state trajectories generated by executing a collection of policies in an underlying Markov decision process, the structure of the reward machine together with the labeling function that maps states to automaton symbols can be uniquely recovered from the state sequences without any reward or label observations.
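
One way to pin down what "uniquely recovered" means, in notation of our own choosing rather than the paper's, is the following hedged formalization; the consistency notion is left informal here, and the paper presumably makes it precise.

    % Our notation, not the paper's: MDP state space S, unknown labeling L : S -> Sigma,
    % unknown reward machine R = (U, u_0, delta) over the alphabet Sigma.
    \[
    \text{Data: } \mathcal{D} = \{\tau_i\}_{i=1}^{k}, \qquad
    \tau_i = s^i_0 s^i_1 \cdots s^i_{T_i} \text{ generated by executing policy } \pi_i .
    \]
    \[
    \text{Claim: } (\mathcal{R}', L') \text{ consistent with } \mathcal{D}
    \;\Longrightarrow\; \mathcal{R}' \cong \mathcal{R} \ \text{and}\ L' = \sigma \circ L
    \text{ for some symbol renaming } \sigma,
    \]
    % i.e. the machine structure and the labeling function are identified up to equivalence.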

What carries the argument

The reward machine, an automaton-like structure that captures the memory required to accomplish multi-stage tasks, is reconstructed by identifying consistent transition patterns and acceptance conditions directly from raw state data.
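
To make the object concrete, the sketch below shows a minimal reward machine and labeling function. It is our own illustration, loosely in the spirit of the pick-and-drop task of Figure 1; none of the symbols, coordinates, or rewards are taken from the paper.

    from dataclasses import dataclass, field

    @dataclass
    class RewardMachine:
        """Minimal reward machine: finite nodes, transitions over high-level symbols,
        and rewards attached to transitions. Illustrative only."""
        nodes: set
        initial: str
        delta: dict = field(default_factory=dict)  # (node, symbol) -> (next node, reward)

        def run(self, symbols):
            """Trace a symbol sequence; return the visited nodes and accumulated reward."""
            node, total, visited = self.initial, 0.0, [self.initial]
            for sym in symbols:
                node, r = self.delta.get((node, sym), (node, 0.0))  # unmatched symbols self-loop
                total += r
                visited.append(node)
            return visited, total

    def labeling(state):
        """Hypothetical labeling function: maps a raw grid state (x, y, carrying) to a symbol.
        This mapping is exactly the unobserved object the paper claims can be inferred."""
        x, y, carrying = state
        if (x, y) == (0, 0) and not carrying:
            return "at_pickup"
        if (x, y) == (4, 4) and carrying:
            return "at_dropoff"
        return "other"

    # A two-stage pick-and-drop machine: reach the pickup cell, then the dropoff cell.
    rm = RewardMachine(
        nodes={"u0", "u1", "u2"},
        initial="u0",
        delta={("u0", "at_pickup"): ("u1", 0.0), ("u1", "at_dropoff"): ("u2", 1.0)},
    )

    trajectory = [(0, 0, False), (1, 0, True), (4, 4, True)]  # raw states only, no rewards or labels
    nodes, reward = rm.run(labeling(s) for s in trajectory)
    print(nodes, reward)  # ['u0', 'u1', 'u1', 'u2'] 1.0

The learner observes only trajectories like `trajectory`; both `rm` and `labeling` are hidden, and the paper's question is which collections of such trajectories pin them down.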

Load-bearing premise

An underlying reward machine exists and the observed state trajectories are generated in a way that permits unique identification of both its structure and its labeling function from state sequences alone.

What would settle it

Finding two different reward machines that generate exactly the same collection of state trajectories under the same set of policies would show that the data are insufficient for unique identification.
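
Operationally, that refutation amounts to exhibiting two inequivalent hypotheses that both survive the same data. A minimal sketch of the filtering step follows; the consistency predicate is deliberately left abstract, since the paper's precise criterion for how a state trajectory constrains a candidate machine and labeling is not reproduced here, and the function name is ours.

    def surviving_hypotheses(candidates, trajectories, consistent):
        """Keep the candidate (machine, labeling) hypotheses that are consistent with
        every observed state trajectory. `consistent(hypothesis, trajectory)` stands in
        for the paper's criterion and is an assumption of this sketch."""
        return [h for h in candidates
                if all(consistent(h, traj) for traj in trajectories)]

Unique identification on a data set means that everything this filter returns collapses to a single equivalence class; two genuinely different surviving machines would be exactly the counterexample described above, and Figure 3(a) reports how the size of such a solution set shrinks with depth.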

Figures

Figures reproduced from arXiv: 2604.07480 by Antoine Aspeel, Mohamad Louai Shehab, Necmiye Ozay.

Figure 1: (a) The warehouse grid world. (b) The pick and drop reward machine. (c) …
Figure 2: (a) The room grid world. (b) The patrol reward machine. (c) The Tetris …
Figure 3: (a) Reduction in solution set size vs. depth. (b) Growth of the number …
Original abstract

Reward machines are automaton-like structures that capture the memory required to accomplish a multi-stage task. When combined with reinforcement learning or optimal control methods, they can be used to synthesize robot policies to achieve such tasks. However, specifying a reward machine by hand, including a labeling function capturing high-level features that the decisions are based on, can be a daunting task. This paper deals with the problem of learning reward machines directly from raw state and policy information. As opposed to existing works, we assume no access to observations of rewards, labels, or machine nodes, and show what trajectory data is sufficient for learning the reward machine in this information-scarce regime. We then extend the result to an active learning setting where we incrementally query trajectory extensions to improve data (and indirectly computational) efficiency. Results are demonstrated with several grid world examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that specific collections of raw state trajectories (with no observed rewards, labels, or automaton nodes) are sufficient to recover both the structure and the unknown labeling function of an underlying reward machine, and that this result extends to an active-learning procedure that incrementally queries trajectory extensions to improve data efficiency. The claims are supported by theoretical sufficiency arguments and demonstrations on several grid-world examples.

Significance. If the identifiability result holds under the stated conditions, the work would meaningfully reduce the manual specification burden for reward machines in long-horizon RL and control tasks. The active-learning extension is a natural and potentially practical contribution. However, the absence of quantitative metrics, explicit proof sketches, or error analysis in the provided abstract leaves the strength of the central claim difficult to assess without the full manuscript.

major comments (3)
  1. [Abstract / Theoretical results section] The central sufficiency claim (that raw state trajectories alone suffice for unique recovery of both RM structure and labeling function) is load-bearing yet lacks explicit conditions on state distinguishability or policy coverage. The skeptic note correctly flags that multiple labelings or transition structures can produce identical state sequences in raw or high-dimensional spaces; without a concrete injectivity argument or counter-example analysis, the result risks being vacuous outside specially constructed grid worlds.
  2. [Experimental results] The manuscript reports demonstrations on grid worlds but provides no quantitative metrics (success rates, sample complexity, reconstruction error, or comparison to baselines). This makes it impossible to evaluate whether the recovered machines are accurate or merely consistent with the observed trajectories.
  3. [Active learning section] The active-learning extension is described as querying trajectory extensions, but the paper does not specify the query selection criterion, the stopping condition, or how the incremental data is guaranteed to preserve the sufficiency property established for the passive case.
minor comments (2)
  1. [Notation] Notation for the labeling function and automaton states should be introduced earlier and used consistently; the abstract refers to “machine nodes” while the body presumably uses different terminology.
  2. [Experiments] The grid-world examples would benefit from a table summarizing the number of trajectories, state dimensionality, and recovery accuracy for each environment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which will help improve the clarity and rigor of our manuscript. We address each major comment below and commit to making the necessary revisions.

Point-by-point responses
  1. Referee: [Abstract / Theoretical results section] The central sufficiency claim (that raw state trajectories alone suffice for unique recovery of both RM structure and labeling function) is load-bearing yet lacks explicit conditions on state distinguishability or policy coverage. The skeptic note correctly flags that multiple labelings or transition structures can produce identical state sequences in raw or high-dimensional spaces; without a concrete injectivity argument or counter-example analysis, the result risks being vacuous outside specially constructed grid worlds.

    Authors: We agree that more explicit conditions are needed to support the sufficiency claim. The original manuscript assumes that the state space allows for distinction based on the labeling function and that trajectories provide sufficient coverage, but these were not stated clearly. In the revised version, we will add a formal statement of assumptions including state distinguishability (distinct labels produce observably different state features) and policy coverage (trajectories visit all necessary state-action pairs for the task). We will include a proof sketch for the uniqueness of the recovered RM and labeling function, and discuss limitations in high-dimensional spaces with a counterexample analysis showing when the result holds. revision: yes

  2. Referee: [Experimental results] The manuscript reports demonstrations on grid worlds but provides no quantitative metrics (success rates, sample complexity, reconstruction error, or comparison to baselines). This makes it impossible to evaluate whether the recovered machines are accurate or merely consistent with the observed trajectories.

    Authors: The experiments were intended as illustrative demonstrations rather than comprehensive benchmarks. However, we acknowledge that quantitative metrics are essential for assessing the method's performance. We will revise the experimental section to include metrics such as the percentage of correctly recovered automaton structures across multiple trials, average reconstruction error for the labeling function, sample complexity curves, and comparisons to existing reward machine learning baselines. Error bars and statistical analysis will also be added. revision: yes

  3. Referee: [Active learning section] The active-learning extension is described as querying trajectory extensions, but the paper does not specify the query selection criterion, the stopping condition, or how the incremental data is guaranteed to preserve the sufficiency property established for the passive case.

    Authors: We will provide a detailed description of the active learning procedure. The query selection criterion chooses trajectory extensions that maximize information gain by targeting transitions where the currently inferred machine is ambiguous in its labeling or structure. The stopping condition is reached when additional queries no longer change the inferred reward machine and the data satisfy the coverage requirements. We will prove that the active queries preserve the sufficiency property by showing that the accumulated dataset meets the conditions of the passive identifiability theorem. revision: yes
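
For concreteness, here is a schematic version of the loop described in this response, written under our own assumptions about the interfaces; none of these function names or signatures come from the paper.

    def active_inference(initial_trajectories, candidates, consistent,
                         propose_extension, run_query, max_queries=50):
        """Hypothesis-elimination sketch of the active loop: discard hypotheses that are
        inconsistent with the data, query a trajectory extension on which the survivors
        disagree (a stand-in for the information-gain criterion), and stop once no query
        can change the inferred machine."""
        data = list(initial_trajectories)
        alive = list(candidates)
        for _ in range(max_queries):
            alive = [h for h in alive if all(consistent(h, t) for t in data)]
            if len(alive) <= 1:
                break  # machine and labeling pinned down (up to equivalence)
            query = propose_extension(alive, data)  # extension that best splits `alive`
            if query is None:
                break  # remaining hypotheses cannot be separated by any extension
            data.append(run_query(query))  # oracle executes the extension, returns raw states
        return alive, data

The stopping conditions mirror the ones stated in the response: either a single hypothesis class remains, or no further extension can distinguish the survivors.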

Circularity Check

0 steps flagged

No circularity: sufficiency result is derived from trajectory distinguishability assumptions, not by construction or self-citation.

full rationale

The paper's core claim is that certain sets of raw state trajectories (without rewards, labels, or nodes) are sufficient to recover the reward machine structure and labeling function, with an active-learning variant that queries trajectory extensions. This is framed as a data-sufficiency result under the assumption that an underlying RM exists and the trajectories allow unique identification. No equations or steps in the abstract reduce a 'prediction' to a fitted parameter, rename a known result, or rely on self-citation chains for the uniqueness theorem. The derivation appears self-contained when set against the external literature on automaton learning from traces, with the uniqueness condition stated as a prerequisite rather than smuggled in via the authors' prior work. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that trajectories are generated by some reward machine whose structure can be recovered from state sequences alone.

axioms (1)
  • domain assumption An underlying reward machine exists that generates the observed state trajectories via some policy.
    Invoked to justify that the machine is identifiable from raw states without labels or rewards.

pith-pipeline@v0.9.0 · 5442 in / 1147 out tokens · 64011 ms · 2026-05-10T17:22:47.538085+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
  2. [2] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 1–2, pp. 41–77, 2003.
  3. [3] H. Kress-Gazit, G. Fainekos, and G. J. Pappas, “Synthesis for robots: Guarantees and feedback for robot behavior,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 211–236, 2018.
  4. [4] J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” in Proceedings of the 34th International Conference on Machine Learning, 2017.
  5. [5] R. Toro Icarte, T. Klassen, R. Valenzano, and S. A. McIlraith, “Reward machines: Exploiting reward function structure in reinforcement learning,” in Advances in Neural Information Processing Systems, 2018.
  6. [6] R. Toro Icarte and S. A. McIlraith, “Learning reward machines for partially observable reinforcement learning,” in Advances in Neural Information Processing Systems, 2019.
  7. [7] A. Camacho, J. Varley, A. Zeng, D. Jain, A. Iscen, and D. Kalashnikov, “Reward machines for vision-based robotic manipulation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 14284–14290.
  8. [8] D. DeFazio, Y. Hayamizu, and S. Zhang, “Learning quadruped locomotion policies using logical rules,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 34, 2024, pp. 142–150.
  9. [9] M. Baert, E. Malomgré, S. Leroux, and P. Simoens, “Reward machine inference for robotic manipulation,” in IBRL @ RLC 2025, 2025.
  10. [10] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” arXiv preprint arXiv:1606.06565, 2016.
  11. [11] Z. Xu, A. Gawel, K. Muller, M. Yan, D. Juan, and Y. Hu, “Deep reward machine learning,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  12. [12] M. L. Shehab, A. Aspeel, and N. Ozay, “Learning reward machines from partially observed policies,” Transactions on Machine Learning Research, 2025.
  13. [13] B. Araki, K. Vodrahalli, T. Leech, C.-I. Vasile, M. D. Donahue, and D. L. Rus, “Learning to plan with logical automata,” Robotics: Science and Systems Foundation, 2019.
  14. [14] Z. Xu, I. Gavran, Y. Ahmad, R. Majumdar, D. Neider, U. Topcu, and B. Wu, “Joint inference of reward machines and policies for reinforcement learning,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, 2020, pp. 590–598.
  15. [15] R. T. Icarte, T. Q. Klassen, R. Valenzano, M. P. Castro, E. Waldie, and S. A. McIlraith, “Learning reward machines: A study in partially observable reinforcement learning,” Artificial Intelligence, vol. 323, p. 103989, 2023.
  16. [16] J. Hu, Y. Paliwal, H. Kim, Y. Wang, and Z. Xu, “Reinforcement learning with predefined and inferred reward machines in stochastic games,” Neurocomputing, vol. 599, p. 128170, 2024.
  17. [17] A. Abate, Y. Almulla, J. Fox, D. Hyland, and M. Wooldridge, “Learning task automata for reinforcement learning using hidden Markov models,” in ECAI 2023. IOS Press, 2023, pp. 3–10.
  18. [18] D. Angluin, “Learning regular sets from queries and counterexamples,” Information and Computation, vol. 75, no. 2, pp. 87–106, 1987.
  19. [19] Z. Xu, B. Wu, A. Ojha, D. Neider, and U. Topcu, “Active finite reward automaton inference and reinforcement learning using queries and counterexamples,” in Machine Learning and Knowledge Extraction, A. Holzinger, P. Kieseberg, A. M. Tjoa, and E. Weippl, Eds. Cham: Springer International Publishing, 2021.
  20. [20] F. Memarian, Z. Xu, B. Wu, M. Wen, and U. Topcu, “Active task-inference-guided deep inverse reinforcement learning,” in 2020 59th IEEE Conference on Decision and Control (CDC). IEEE, 2020, pp. 1932–1938.
  21. [21] H. Hasanbeig, N. Y. Jeppu, A. Abate, T. Melham, and D. Kroening, “Symbolic task inference in deep reinforcement learning,” Journal of Artificial Intelligence Research, vol. 80, pp. 1099–1137, 2024.
  22. [22] M. Hasanbeig, N. Y. Jeppu, A. Abate, T. Melham, and D. Kroening, “DeepSynth: Automata synthesis for automatic task segmentation in deep reinforcement learning,” in AAAI, vol. 35, no. 9, 2021, pp. 7647–7656.
  23. [23] D. Furelos-Blanco, M. Law, A. Russo, K. Broda, and A. Jonsson, “Induction of subgoal automata for reinforcement learning,” in AAAI, vol. 34, no. 04, 2020.
  24. [24] M. Baert, S. Leroux, and P. Simoens, “Reward machine inference for robotic manipulation,” arXiv preprint arXiv:2412.10096, 2024.
  25. [25] A. Li, Z. Chen, T. Klassen, P. Vaezipoor, R. Toro Icarte, and S. McIlraith, “Reward machines for deep RL in noisy and uncertain environments,” Advances in Neural Information Processing Systems, vol. 37, pp. 110341–110368, 2024.
  26. [26] R. Parac, L. Nodari, L. Ardon, D. Furelos-Blanco, F. Cerutti, and A. Russo, “Learning robust reward machines from noisy labels,” arXiv preprint arXiv:2408.14871, 2024.
  27. [27] C. K. Verginis, C. Koprulu, S. Chinchali, and U. Topcu, “Joint learning of reward machines and policies in environments with partially known semantics,” Artificial Intelligence, vol. 333, p. 104146, 2024.
  28. [28] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI, vol. 8. Chicago, IL, USA, 2008.
  29. [29] M. L. Shehab, A. Aspeel, N. Arechiga, A. Best, and N. Ozay, “Learning true objectives: Linear algebraic characterizations of identifiability in inverse reinforcement learning,” in L4DC. PMLR, 2024, pp. 1266–1277.
  30. [30] S. A. Cook, “The complexity of theorem-proving procedures,” Proceedings of the Third Annual ACM Symposium on Theory of Computing, pp. 151–158, 1971.
  31. [31] A. Biere, M. J. Heule, H. van Maaren, and T. Walsh, Handbook of Satisfiability. IOS Press, 2009, vol. 185.
  32. [32] H. Cao, S. Cohen, and L. Szpruch, “Identifiability in inverse reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  33. [33] D. Sadigh, A. D. Dragan, S. S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in RSS, Jul. 2017.
  34. [34] E. Biyik and D. Sadigh, “Batch active preference-based learning of reward functions,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 519–528. [Online]. Available: https://proceedings.mlr.press/v87/biyik18a.html
  35. [35] E. Biyik, M. Palan, N. C. Landolfi, and D. Sadigh, “Asking easy questions: A user-friendly approach to active reward learning,” in Proceedings of the 3rd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., vol. 100. PMLR, 30 Oct–01 Nov 2020, pp. 1177–1194. [Online]. Available: http...
  36. [36] S. Dasgupta, “Analysis of a greedy active learning strategy,” in Advances in Neural Information Processing Systems 17 (NeurIPS 2004), 2004, pp. 337–344.
  37. [37] R. Motwani and P. Raghavan, “Randomized algorithms,” ACM Computing Surveys (CSUR), vol. 28, no. 1, pp. 33–37, 1996.
  38. [38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  39. [39] B. Sakcak, K. G. Timperi, V. Weinstein, and S. M. LaValle, “A mathematical characterization of minimally sufficient robot brains,” The International Journal of Robotics Research, vol. 43, no. 9, pp. 1342–1362, 2024.
  40. [40] S. Shaw, T. Manderson, C. Kessens, and N. Roy, “Toward learning POMDPs beyond full-rank actions and state observability,” arXiv preprint arXiv:2601.18930, 2026.