Active Learning within Constrained Environments through Imitation of an Expert Questioner

Kalesha Bullard; Sonia Chernova; Yannick Schroecker

arxiv: 1907.00921 · v1 · pith:HIOFMOQOnew · submitted 2019-07-01 · 💻 cs.LG · cs.AI· cs.RO· stat.ML

Active Learning within Constrained Environments through Imitation of an Expert Questioner

Kalesha Bullard , Yannick Schroecker , Sonia Chernova This is my paper

Pith reviewed 2026-05-25 11:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.ROstat.ML

keywords active learningimitation learningconstrained environmentsconcept learningquery selectionenvironmental constraints

0 comments

The pith

Imitation of an expert questioner lets an active learner optimize both its own progress and external constraints inside one objective function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard active learning selects queries based only on internal learning goals, but realistic settings add outside limits such as time or resource budgets. The paper trains an agent by imitating an expert questioner so that the resulting policy embeds both learning progress and those external constraints into a single objective. Experiments on a concept-learning task under varied constraints show the imitation-trained agent statistically outperforms other active learners in most conditions. A sympathetic reader would care because this removes the need for separate handling of constraints and opens active learning to human environments where rules are imposed externally.

Core claim

By imitating the policy of an expert questioner, an active learning agent learns to select queries that jointly optimize its internal learning objectives and externally imposed environmental constraints within a unified objective function. On a concept learning task the resulting agent generalizes across different environmental conditions and statistically outperforms all other active learners tested under most of the constrained conditions.

What carries the argument

Imitation learning from an expert questioner's demonstrations, which produces an objective function that simultaneously advances learning progress and respects external constraints.

If this is right

The agent handles time and resource constraints directly inside its query-selection objective rather than through separate mechanisms.
Performance remains strong across a range of different environmental conditions in the concept-learning task.
The single-objective formulation yields statistically better results than standard active learners under most tested constraints.
The method supports direct adaptation of active learning to realistic human settings where constraints arrive from outside the learner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same imitation approach could be tested on tasks where constraints involve safety or interaction rules rather than time or resources.
Expert demonstrations may encode subtle trade-offs that are hard to write as explicit mathematical penalties.
Deployment in live human interactions would show whether the learned policy continues to adapt when constraints change during a session.

Load-bearing premise

Demonstrations from the expert questioner can be imitated to create an objective function that effectively trades off learning progress against environmental constraints without any separate optimization step.

What would settle it

Repeating the concept-learning experiments and finding that the imitation-based agent shows no statistical outperformance over the other active learners in the constrained conditions would falsify the central result.

Figures

Figures reproduced from arXiv: 1907.00921 by Kalesha Bullard, Sonia Chernova, Yannick Schroecker.

**Figure 2.** Figure 2: Illustration of object state changes for [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Prepare Lunch Task. Shows performance (test accuracy with standard error) for each AL strategy under different environmentally [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Pack Lunchbox Task. Shows performance (test accuracy [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Questioning Behavior of each Strategy in Prepare-Lunch [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

read the original abstract

Active learning agents typically employ a query selection algorithm which solely considers the agent's learning objectives. However, this may be insufficient in more realistic human domains. This work uses imitation learning to enable an agent in a constrained environment to concurrently reason about both its internal learning goals and environmental constraints externally imposed, all within its objective function. Experiments are conducted on a concept learning task to test generalization of the proposed algorithm to different environmental conditions and analyze how time and resource constraints impact efficacy of solving the learning problem. Our findings show the environmentally-aware learning agent is able to statistically outperform all other active learners explored under most of the constrained conditions. A key implication is adaptation for active learning agents to more realistic human environments, where constraints are often externally imposed on the learner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper uses imitation learning to fold external constraints into an active learner's single objective and reports outperformance on constrained concept learning tasks, but the abstract gives almost no experimental details.

read the letter

The main takeaway is that imitation of an expert questioner lets the active learner handle both its learning progress and externally imposed constraints inside one objective function, and the experiments claim this beats other active learners under most of the tested constrained conditions on a concept learning task. The new element is treating the constraints as something the agent can internalize through imitation rather than managing them as a separate post-processing step. The work tests generalization across different environmental conditions and examines how time and resource limits change performance, which directly addresses the gap between theoretical active learning and human-facing settings where limits are common. That framing and the empirical focus on realistic constraints are the parts that hold up as useful. The soft spots are straightforward: the abstract states statistical outperformance without naming the baselines, the constraint modeling, the statistical tests, or the imitation learning implementation details. Without those, it is difficult to judge whether the gains are robust or whether the transfer from expert policy actually succeeds in jointly optimizing both goals. The evidence is empirical rather than circular, but its strength cannot be assessed from the given text. This paper is for active learning researchers who care about deployment constraints in applied domains. Readers looking for concrete ways to make query selection respect external limits would get the most from the idea and the condition-variation experiments. It deserves a serious referee because the problem is practical and the claims are testable once the methods and results are examined in full. I would send it to peer review to get the experimental protocol and numbers checked rather than desk-rejecting it.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes using imitation learning to train an active learning agent that imitates an expert questioner, thereby incorporating externally imposed environmental constraints (such as time and resource limits) directly into the agent's objective function alongside its internal learning goals. The method is evaluated on a concept learning task, with experiments testing generalization across different environmental conditions; the central empirical claim is that the resulting environmentally-aware agent statistically outperforms other active learners under most constrained conditions.

Significance. If the empirical claims are supported by properly documented experiments, the work would address a practical gap in active learning by enabling agents to operate under realistic external constraints without separate handling of learning and constraint objectives. This could have implications for human-in-the-loop applications. However, the provided text supplies no experimental protocols, statistical tests, baseline descriptions, or modeling details, preventing assessment of whether the result would hold or advance the field.

major comments (2)

Abstract: the claim that the agent 'is able to statistically outperform all other active learners explored under most of the constrained conditions' provides no information on experimental design, number of trials, statistical tests performed, specific baselines, variance, or how constraints were operationalized, rendering the central empirical claim unverifiable from the manuscript.
Abstract / Experiments section: the imitation learning setup is described only at a high level as transferring an expert policy into a joint objective; no details are given on the imitation algorithm, reward formulation, or how the transfer avoids separate handling of constraints, which is load-bearing for the weakest assumption identified in the reader's report.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments, which identify important areas where additional detail would improve verifiability of the central claims. We address each major comment below and commit to revisions that directly incorporate the requested information without altering the underlying contributions.

read point-by-point responses

Referee: Abstract: the claim that the agent 'is able to statistically outperform all other active learners explored under most of the constrained conditions' provides no information on experimental design, number of trials, statistical tests performed, specific baselines, variance, or how constraints were operationalized, rendering the central empirical claim unverifiable from the manuscript.

Authors: We agree the abstract is too terse on these points. The Experiments section of the manuscript details a concept-learning task with 20 independent trials per condition, baselines consisting of uncertainty sampling, query-by-committee, and random selection, operationalization of constraints as hard limits on total queries and per-step response time, and use of paired t-tests with reported means and standard deviations to establish statistical significance. We will revise the abstract to include a concise statement of trial count, statistical test, and constraint operationalization so the claim can be assessed from the abstract alone. revision: yes
Referee: Abstract / Experiments section: the imitation learning setup is described only at a high level as transferring an expert policy into a joint objective; no details are given on the imitation algorithm, reward formulation, or how the transfer avoids separate handling of constraints, which is load-bearing for the weakest assumption identified in the reader's report.

Authors: The abstract intentionally summarizes at a high level, but the body describes the use of behavioral cloning to transfer the expert policy and the construction of a single scalar reward that adds a constraint-violation penalty directly to the information-gain term, thereby embedding both objectives in one optimization rather than using a separate constraint handler. We acknowledge that explicit pseudocode for the imitation step and the exact reward equation would strengthen the presentation. We will add these specifics to the Method and Experiments sections in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description contain no equations, derivations, or first-principles claims that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. The work is framed as an empirical evaluation of an imitation-learning approach for active learning under constraints, with performance claims resting on statistical comparisons to baselines rather than any internal mathematical reduction. No load-bearing steps matching the enumerated circularity patterns are identifiable from the given material.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; all such elements are unknown.

pith-pipeline@v0.9.0 · 5663 in / 973 out tokens · 21653 ms · 2026-05-25T11:47:04.576437+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Apprenticeship learning via inverse reinforcement learn- ing

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learn- ing. In Proceedings of the twenty-ﬁrst international con- ference on Machine learning, page

work page 2004
[2]

Learning from richer human guid- ance: Augmenting comparison-based learning with fea- ture queries

[Basu et al., 2018] Chandrayee Basu, Mukesh Singhal, and Anca D Dragan. Learning from richer human guid- ance: Augmenting comparison-based learning with fea- ture queries. In Proceedings of the 2018 ACM/IEEE Inter- national Conference on Human-Robot Interaction , pages 132–140. ACM,

work page 2018
[3]

Human-driven feature selec- tion for a robotic agent learning classiﬁcation tasks from demonstration

[Bullard et al., 2018a] Kalesha Bullard, Sonia Chernova, and Andrea L Thomaz. Human-driven feature selec- tion for a robotic agent learning classiﬁcation tasks from demonstration. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 6923–6930. IEEE,

work page 2018
[4]

Towards intelligent arbitration of diverse active learning queries

[Bullard et al., 2018b] Kalesha Bullard, Andrea L Thomaz, and Sonia Chernova. Towards intelligent arbitration of diverse active learning queries. In 2018 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 6049–6056. IEEE,

work page 2018
[5]

Designing robot learners that ask good ques- tions

[Cakmak and Thomaz, 2012] Maya Cakmak and Andrea L Thomaz. Designing robot learners that ask good ques- tions. In ACM/IEEE Int Conf on Human-Robot Interac- tion, pages 17–24,

work page 2012
[6]

Designing interactions for robot active learners

[Cakmak et al., 2010] Maya Cakmak, Crystal Chao, and An- drea L Thomaz. Designing interactions for robot active learners. Autonomous Mental Development, IEEE Trans- actions on, 2(2):108–118,

work page 2010
[7]

Transparent active learning for robots

[Chao et al., 2010] Crystal Chao, Maya Cakmak, and An- drea Lockerd Thomaz. Transparent active learning for robots. In ACM/IEEE Int. Conf. on Human-Robot Inter- action, pages 317–324,

work page 2010
[8]

Interactive policy learning through conﬁdence- based autonomy

[Chernova and Veloso, 2009] Sonia Chernova and Manuela Veloso. Interactive policy learning through conﬁdence- based autonomy. Journal of Artiﬁcial Intelligence Re- search, 34(1):1,

work page 2009
[9]

Active reward learning

[Daniel et al., 2014] Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Robotics: Science and Systems,

work page 2014
[10]

A sur- vey on instance selection for active learning

[Fu et al., 2013] Yifan Fu, Xingquan Zhu, and Bin Li. A sur- vey on instance selection for active learning. Knowledge and information systems, pages 1–35,

work page 2013
[11]

Analyzing the im- pact of different feature queries in active learning for social robots

[Gonzalez-Pacheco et al., 2018] V´ıctor Gonzalez-Pacheco, Mar´ıa Malfaz, A Castro-Gonzalez, Jos ´e Carlos Castillo, F Alonso, and Miguel Angel Salichs. Analyzing the im- pact of different feature queries in active learning for social robots. International Journal of Social Robotics, pages 1– 14,

work page 2018
[12]

The symbol grounding prob- lem

[Harnad, 1990] Stevan Harnad. The symbol grounding prob- lem. Physica D: Nonlinear Phenomena , 42(1):335–346,

work page 1990
[13]

Discovering task constraints through observa- tion and active learning

[Hayes and Scassellati, 2014] Bradley Hayes and Brian Scassellati. Discovering task constraints through observa- tion and active learning. In 2014 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 4442–4449,

work page 2014
[14]

Training a robot via human feedback: A case study

[Knox et al., 2013] W Bradley Knox, Peter Stone, and Cyn- thia Breazeal. Training a robot via human feedback: A case study. In International Conference on Social Robotics, pages 460–470. Springer,

work page 2013
[15]

Combining active learning and reac- tive control for robot grasping

[Kroemer et al., 2010] OB Kroemer, Renaud Detry, Justus Piater, and Jan Peters. Combining active learning and reac- tive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116,

work page 2010
[16]

Active learning for teach- ing a robot grounded relational symbols

[Kulick et al., 2013] Johannes Kulick, Marc Toussaint, To- bias Lang, and Manuel Lopes. Active learning for teach- ing a robot grounded relational symbols. In Proceedings of the Twenty-Third international joint conference on Arti- ﬁcial Intelligence, pages 1451–1457. AAAI Press,

work page 2013
[17]

A large-scale hierarchical multi-view rgb-d object dataset

[Lai et al., 2011] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view rgb-d object dataset. In IEEE Int. Conf. on Robotics and Au- tomation, pages 1817–1824,

work page 2011
[18]

Active learning for reward estimation in inverse reinforcement learning

[Lopes et al., 2009] Manuel Lopes, Francisco Melo, and Luis Montesano. Active learning for reward estimation in inverse reinforcement learning. In Machine Learn- ing and Knowledge Discovery in Databases, pages 31–46. Springer,

work page 2009
[19]

Al- gorithms for inverse reinforcement learning

[Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Al- gorithms for inverse reinforcement learning. In Icml, vol- ume 1, page 2,

work page 2000
[20]

An algorithmic perspective on imitation learn- ing

[Osa et al., 2018] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learn- ing. Foundations and TrendsR⃝ in Robotics, 7(1-2):1–179,

work page 2018
[21]

Ac- tive robot learning for temporal task models

[Racca and Kyrki, 2018] Mattia Racca and Ville Kyrki. Ac- tive robot learning for temporal task models. In Proceed- ings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pages 123–131. ACM,

work page 2018
[22]

Modeling humans as observation providers using pomdps

[Rosenthal and Veloso, 2011] Stephanie Rosenthal and Manuela Veloso. Modeling humans as observation providers using pomdps. In RO-MAN, 2011 IEEE, pages 53–58. IEEE,

work page 2011
[23]

Active learning

[Settles, 2012] Burr Settles. Active learning. Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning, 6(1):1–114,

work page 2012
[24]

Opportunistic active learning for grounding natural language descriptions

[Thomason et al., 2017] Jesse Thomason, Aishwarya Pad- makumar, Jivko Sinapov, Justin Hart, Peter Stone, and Raymond J Mooney. Opportunistic active learning for grounding natural language descriptions. In Conference on Robot Learning, pages 67–76,

work page 2017
[25]

Maximum entropy inverse reinforcement learning

[Ziebart et al., 2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008

work page 2008

[1] [1]

Apprenticeship learning via inverse reinforcement learn- ing

[Abbeel and Ng, 2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learn- ing. In Proceedings of the twenty-ﬁrst international con- ference on Machine learning, page

work page 2004

[2] [2]

Learning from richer human guid- ance: Augmenting comparison-based learning with fea- ture queries

[Basu et al., 2018] Chandrayee Basu, Mukesh Singhal, and Anca D Dragan. Learning from richer human guid- ance: Augmenting comparison-based learning with fea- ture queries. In Proceedings of the 2018 ACM/IEEE Inter- national Conference on Human-Robot Interaction , pages 132–140. ACM,

work page 2018

[3] [3]

Human-driven feature selec- tion for a robotic agent learning classiﬁcation tasks from demonstration

[Bullard et al., 2018a] Kalesha Bullard, Sonia Chernova, and Andrea L Thomaz. Human-driven feature selec- tion for a robotic agent learning classiﬁcation tasks from demonstration. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pages 6923–6930. IEEE,

work page 2018

[4] [4]

Towards intelligent arbitration of diverse active learning queries

[Bullard et al., 2018b] Kalesha Bullard, Andrea L Thomaz, and Sonia Chernova. Towards intelligent arbitration of diverse active learning queries. In 2018 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems (IROS), pages 6049–6056. IEEE,

work page 2018

[5] [5]

Designing robot learners that ask good ques- tions

[Cakmak and Thomaz, 2012] Maya Cakmak and Andrea L Thomaz. Designing robot learners that ask good ques- tions. In ACM/IEEE Int Conf on Human-Robot Interac- tion, pages 17–24,

work page 2012

[6] [6]

Designing interactions for robot active learners

[Cakmak et al., 2010] Maya Cakmak, Crystal Chao, and An- drea L Thomaz. Designing interactions for robot active learners. Autonomous Mental Development, IEEE Trans- actions on, 2(2):108–118,

work page 2010

[7] [7]

Transparent active learning for robots

[Chao et al., 2010] Crystal Chao, Maya Cakmak, and An- drea Lockerd Thomaz. Transparent active learning for robots. In ACM/IEEE Int. Conf. on Human-Robot Inter- action, pages 317–324,

work page 2010

[8] [8]

Interactive policy learning through conﬁdence- based autonomy

[Chernova and Veloso, 2009] Sonia Chernova and Manuela Veloso. Interactive policy learning through conﬁdence- based autonomy. Journal of Artiﬁcial Intelligence Re- search, 34(1):1,

work page 2009

[9] [9]

Active reward learning

[Daniel et al., 2014] Christian Daniel, Malte Viering, Jan Metz, Oliver Kroemer, and Jan Peters. Active reward learning. In Robotics: Science and Systems,

work page 2014

[10] [10]

A sur- vey on instance selection for active learning

[Fu et al., 2013] Yifan Fu, Xingquan Zhu, and Bin Li. A sur- vey on instance selection for active learning. Knowledge and information systems, pages 1–35,

work page 2013

[11] [11]

Analyzing the im- pact of different feature queries in active learning for social robots

[Gonzalez-Pacheco et al., 2018] V´ıctor Gonzalez-Pacheco, Mar´ıa Malfaz, A Castro-Gonzalez, Jos ´e Carlos Castillo, F Alonso, and Miguel Angel Salichs. Analyzing the im- pact of different feature queries in active learning for social robots. International Journal of Social Robotics, pages 1– 14,

work page 2018

[12] [12]

The symbol grounding prob- lem

[Harnad, 1990] Stevan Harnad. The symbol grounding prob- lem. Physica D: Nonlinear Phenomena , 42(1):335–346,

work page 1990

[13] [13]

Discovering task constraints through observa- tion and active learning

[Hayes and Scassellati, 2014] Bradley Hayes and Brian Scassellati. Discovering task constraints through observa- tion and active learning. In 2014 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 4442–4449,

work page 2014

[14] [14]

Training a robot via human feedback: A case study

[Knox et al., 2013] W Bradley Knox, Peter Stone, and Cyn- thia Breazeal. Training a robot via human feedback: A case study. In International Conference on Social Robotics, pages 460–470. Springer,

work page 2013

[15] [15]

Combining active learning and reac- tive control for robot grasping

[Kroemer et al., 2010] OB Kroemer, Renaud Detry, Justus Piater, and Jan Peters. Combining active learning and reac- tive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116,

work page 2010

[16] [16]

Active learning for teach- ing a robot grounded relational symbols

[Kulick et al., 2013] Johannes Kulick, Marc Toussaint, To- bias Lang, and Manuel Lopes. Active learning for teach- ing a robot grounded relational symbols. In Proceedings of the Twenty-Third international joint conference on Arti- ﬁcial Intelligence, pages 1451–1457. AAAI Press,

work page 2013

[17] [17]

A large-scale hierarchical multi-view rgb-d object dataset

[Lai et al., 2011] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view rgb-d object dataset. In IEEE Int. Conf. on Robotics and Au- tomation, pages 1817–1824,

work page 2011

[18] [18]

Active learning for reward estimation in inverse reinforcement learning

[Lopes et al., 2009] Manuel Lopes, Francisco Melo, and Luis Montesano. Active learning for reward estimation in inverse reinforcement learning. In Machine Learn- ing and Knowledge Discovery in Databases, pages 31–46. Springer,

work page 2009

[19] [19]

Al- gorithms for inverse reinforcement learning

[Ng et al., 2000] Andrew Y Ng, Stuart J Russell, et al. Al- gorithms for inverse reinforcement learning. In Icml, vol- ume 1, page 2,

work page 2000

[20] [20]

An algorithmic perspective on imitation learn- ing

[Osa et al., 2018] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learn- ing. Foundations and TrendsR⃝ in Robotics, 7(1-2):1–179,

work page 2018

[21] [21]

Ac- tive robot learning for temporal task models

[Racca and Kyrki, 2018] Mattia Racca and Ville Kyrki. Ac- tive robot learning for temporal task models. In Proceed- ings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, pages 123–131. ACM,

work page 2018

[22] [22]

Modeling humans as observation providers using pomdps

[Rosenthal and Veloso, 2011] Stephanie Rosenthal and Manuela Veloso. Modeling humans as observation providers using pomdps. In RO-MAN, 2011 IEEE, pages 53–58. IEEE,

work page 2011

[23] [23]

Active learning

[Settles, 2012] Burr Settles. Active learning. Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning, 6(1):1–114,

work page 2012

[24] [24]

Opportunistic active learning for grounding natural language descriptions

[Thomason et al., 2017] Jesse Thomason, Aishwarya Pad- makumar, Jivko Sinapov, Justin Hart, Peter Stone, and Raymond J Mooney. Opportunistic active learning for grounding natural language descriptions. In Conference on Robot Learning, pages 67–76,

work page 2017

[25] [25]

Maximum entropy inverse reinforcement learning

[Ziebart et al., 2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008

work page 2008