Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Agata Lapedriza; Asma Ghandeharioun; Craig Ferguson; Judy Hanwen Shen; Natasha Jaques; Noah Jones; Rosalind Picard; Shixiang Gu

arxiv: 1907.00456 · v2 · pith:MCQMVIWLnew · submitted 2019-06-30 · 💻 cs.LG · cs.AI · stat.ML

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard

Pith reviewed 2026-05-25 12:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: off-policy reinforcement learning · batch RL · dialog generation · human preferences · KL-control · offline learning · deep RL · uncertainty estimation

The pith

A new batch RL algorithm learns effective dialog policies from fixed offline human interaction data alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a class of off-policy batch reinforcement learning algorithms that train directly on a fixed collection of human dialog exchanges without any further online interaction. It relies on pre-trained language models to supply a prior distribution and applies KL-control to keep the learned policy from straying too far during optimization. Dropout is used to estimate uncertainty and produce conservative lower bounds on target values, replacing the need for Double Q-Learning. The resulting Way Off-Policy method supports extracting several distinct reward functions after data collection and training successful policies from each one. When the trained systems are placed in live conversation with humans, they produce measurable gains over earlier off-policy batch approaches.

Core claim

The central claim is that a novel family of off-policy batch deep RL algorithms can learn useful policies for open-domain dialog from a static batch of human interaction data. Pre-trained models serve as a strong prior, KL-control penalizes divergence from that prior, and dropout-based uncertainty estimates supply lower bounds on target Q-values. The Way Off-Policy instantiation of the approach permits multiple reward functions to be derived post-hoc from the same collected data, with policies trained successfully from all of them. Live deployment to human users confirms that the resulting systems outperform prior off-policy batch RL methods on the same task.
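
To make the KL-control idea concrete, here is a minimal sketch under stated assumptions, not the paper's implementation. It assumes a small categorical action space and one commonly used form of a KL-penalized reward, `reward / c + log prior(a) - log policy(a)`. The weight `c`, the function names, and the toy numbers are illustrative and do not come from the source.

```python
import math


def kl_divergence(policy, prior):
    """KL(policy || prior) for two categorical distributions over the same actions."""
    return sum(p * math.log(p / q) for p, q in zip(policy, prior) if p > 0.0)


def kl_penalized_reward(reward, action, policy, prior, c=1.0):
    """Per-step reward with a KL-control penalty (illustrative form).

    Returns reward / c + log prior(a) - log policy(a). Averaged over actions
    drawn from the policy, this equals E[reward] / c - KL(policy || prior).
    """
    return reward / c + math.log(prior[action]) - math.log(policy[action])


# Toy 3-token vocabulary; all numbers are made up for illustration.
prior = [0.7, 0.2, 0.1]    # stands in for the pre-trained model
policy = [0.5, 0.3, 0.2]   # stands in for the policy being trained
rewards = [0.0, 1.0, 0.5]  # stands in for a learned or hand-written reward

expected_plain = sum(policy[a] * rewards[a] for a in range(len(policy)))
expected_penalized = sum(
    policy[a] * kl_penalized_reward(rewards[a], a, policy, prior)
    for a in range(len(policy))
)
```

In this toy setting, `expected_penalized` is the plain expected reward minus the KL divergence from the prior, so a policy that drifts far from the pre-trained model pays for it even when the raw reward looks attractive.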

What carries the argument

The Way Off-Policy algorithm, which uses a pre-trained model as prior together with KL-control and dropout uncertainty estimates to enable offline optimization from a fixed batch of human dialog data.

If this is right

Multiple distinct reward functions can be recovered from one fixed batch of human dialog data after collection.
Policies trained entirely offline can be deployed directly into open-ended human conversations.
The method operates in action spaces of 20,000 dimensions without requiring online exploration.
KL-control combined with a pre-trained prior keeps optimization stable on offline data.
Dropout uncertainty estimates provide a practical substitute for Double Q-Learning in batch settings.
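
The dropout-based lower bound mentioned in the list above can be sketched as follows. This is a hedged illustration only: it assumes a toy linear Q-function, takes the minimum over several stochastic dropout passes as the pessimistic estimate, and then maximizes over actions. The helper names, the linear model, and the choice of `min` as the aggregation are assumptions made for the example, not details confirmed by the text above.

```python
import random


def dropout_q_samples(features, weights, n_passes, drop_prob, rng):
    """Run n_passes stochastic forward passes of a toy linear Q-function."""
    keep = 1.0 - drop_prob
    samples = []
    for _ in range(n_passes):
        # Inverted dropout: zero each input with probability drop_prob, rescale the rest.
        masked = [f / keep if rng.random() < keep else 0.0 for f in features]
        samples.append([sum(w * x for w, x in zip(row, masked)) for row in weights])
    return samples


def lower_bound_target(reward, samples, gamma=0.99):
    """Target built from the per-action minimum over dropout passes."""
    n_actions = len(samples[0])
    pessimistic_q = [min(s[a] for s in samples) for a in range(n_actions)]
    return reward + gamma * max(pessimistic_q)


def mean_target(reward, samples, gamma=0.99):
    """Target built from the per-action mean, shown only for comparison."""
    n_actions = len(samples[0])
    mean_q = [sum(s[a] for s in samples) / len(samples) for a in range(n_actions)]
    return reward + gamma * max(mean_q)


rng = random.Random(0)
weights = [[1.0, 0.5, -0.2], [0.3, 0.8, 0.1]]  # two actions, three features (toy values)
next_features = [1.0, 2.0, 0.5]
samples = dropout_q_samples(next_features, weights, n_passes=20, drop_prob=0.2, rng=rng)
pessimistic = lower_bound_target(1.0, samples)
average = mean_target(1.0, samples)
```

Because the minimum over passes never exceeds their mean, `pessimistic` is at most `average`. A single network with dropout supplies all the samples, which is the sense in which this approach avoids maintaining the second estimator that Double Q-Learning relies on.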

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same offline dataset could support rapid testing of alternative preference models without new data collection.
The approach may extend to other interactive domains where fresh human feedback is costly to obtain.
Post-hoc reward extraction suggests a route for auditing or refining implicit human preferences captured in past logs.
Live results indicate that offline RL could lower the barrier to safe iterative improvement of deployed dialog systems.
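
As a rough illustration of how one fixed log could be relabeled with several reward functions after collection, consider the sketch below. The two reward functions (reply length and a laughter marker), the dictionary fields, and the sample turns are hypothetical stand-ins chosen for the example; the source does not specify the rewards at this level of detail.

```python
def reply_length_reward(turn):
    """Hypothetical reward: number of words in the human's reply."""
    return float(len(turn["user_reply"].split()))


def laughter_reward(turn):
    """Hypothetical reward: 1.0 if the human's reply contains 'haha', else 0.0."""
    return 1.0 if "haha" in turn["user_reply"].lower() else 0.0


def relabel(batch, reward_fn):
    """Return copies of the logged turns with a reward computed from the logged text.

    The original batch is left untouched, so it can be relabeled again later.
    """
    return [dict(turn, reward=reward_fn(turn)) for turn in batch]


# A tiny made-up log; a real batch would hold many logged conversations.
batch = [
    {"bot_utterance": "how was your day?", "user_reply": "pretty good thanks"},
    {"bot_utterance": "i tripped over my own cable", "user_reply": "Haha oh no"},
]

by_length = relabel(batch, reply_length_reward)
by_laughter = relabel(batch, laughter_reward)
```

Each relabeled copy could then be handed to the same offline training routine, which is what makes testing an alternative preference model cheap: no new conversations are needed, only a new `reward_fn`.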

Load-bearing premise

Pre-trained models supply a prior strong enough that KL-control during training prevents harmful divergence while still permitting effective learning from purely offline human data.

What would settle it

A live deployment experiment in which the learned dialog agents receive no higher human preference ratings or conversation-quality scores than agents trained with earlier off-policy batch methods on the same data.

Figures

Figures reproduced from arXiv: 1907.00456 by Agata Lapedriza, Asma Ghandeharioun, Craig Ferguson, Judy Hanwen Shen, Natasha Jaques, Noah Jones, Rosalind Picard, Shixiang Gu.

**Figure 1.** Simplified diagram of the variational hierarchical dialog model. In this work, we employ hierarchical seq2seq dialog models [20, 43, 52, 53], which use three recurrent networks to generate the next utterance in a conversation (see …

**Figure 2.** KL-divergence of the policy from the prior is lower with KL-control throughout training. Bands show standard deviation. Without KL-regularization, the baseline RL models diverge quickly and continuously from the prior, losing information about realistic sequences – as shown in …

**Figure 3.** Z-scored reward. Red metrics were used in training rewards, green are post-hoc. Traditional …

**Figure 4.** Comparison of top 10 conversation trajectories observed across deployed models, 90% CI …

**Figure 5.** (a) 64 most frequent emojis as predicted by […

**Figure 6.** Normalized reward scores obtained by models trained with respect to different rewards. …

**Figure 7.** Interactive evaluation ratings page available at …

**Figure 8.** Interactive evaluation chat interface. The chatbots were kept in a separate project from the Django project and maintained separately from the server code. Each chatbot extended an abstract class that defined key methods for the Django program to use, and was registered to a globally accessible dictionary via a decorator. The Django project was provided the path to the Chatbots project in its PYTHONPATH, s…
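
The registration pattern described in the Figure 8 caption (an abstract base class plus a decorator that fills a global dictionary) might look roughly like the sketch below. Every name here (`CHATBOT_REGISTRY`, `register_chatbot`, `AbstractChatbot`, `respond`, `EchoBot`) is hypothetical; the caption does not give the actual identifiers or method signatures.

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for the "globally accessible dictionary" in the caption.
CHATBOT_REGISTRY = {}


class AbstractChatbot(ABC):
    """Hypothetical base class defining what the web server expects of a bot."""

    @abstractmethod
    def respond(self, history):
        """Return the bot's next utterance given a list of prior utterances."""


def register_chatbot(name):
    """Class decorator that records a chatbot class under the given name."""
    def decorator(cls):
        if not issubclass(cls, AbstractChatbot):
            raise TypeError(f"{cls.__name__} must extend AbstractChatbot")
        CHATBOT_REGISTRY[name] = cls
        return cls
    return decorator


@register_chatbot("echo")
class EchoBot(AbstractChatbot):
    """Trivial bot used only to exercise the registry."""

    def respond(self, history):
        return history[-1] if history else "hello"


# The server side can look a bot up by name without importing its class directly.
bot = CHATBOT_REGISTRY["echo"]()
reply = bot.respond(["hi there"])
```

Keeping the lookup in a dictionary means the server code only needs a bot's name, which fits the caption's point that the chatbots lived in a separate project from the Django code.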

Original abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets live human deployment working for batch RL in dialog via pre-trained priors and KL control, but the abstract leaves the stability claims in a 20k-dim action space hard to verify without more details.

The key takeaway is that they managed to train dialog policies from a fixed batch of human interactions using offline RL, anchored by pre-trained models and KL penalties, and then deployed them to talk to real people with claimed improvements.

What the paper does well is the live testing part. Running the models in an open-domain setting with humans provides direct feedback on whether the learned behaviors hold up, which is valuable for this domain. Extracting multiple reward functions post-hoc from the same data is also a practical angle that could be useful beyond this specific setup.

The soft spots are around the core technical claims. The description of using KL-control to prevent divergence from the prior during training is central, but without seeing the actual equations or how they handle the massive action space, it's difficult to tell if this really enables 'way off-policy' learning or if the pre-trained models are doing most of the heavy lifting. The dropout-based Q bounding is presented as more efficient than double Q, but again the abstract gives no numbers or comparisons to judge that. The concern about the prior not tightly covering the batch data in high dimensions seems like it could be an issue if not carefully validated in the experiments.

The work is aimed at the intersection of reinforcement learning and conversational AI, particularly for those interested in offline methods for costly data collection scenarios. It deserves a serious referee because the human deployment experiments add real weight even if the algorithmic contributions need more unpacking to stand on their own. Recommendation: send it out for review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a class of 'Way Off-Policy' batch deep RL algorithms for learning from fixed offline batches of human dialog interaction data without online exploration. It uses pre-trained models as strong priors combined with KL-control to penalize policy divergence during training, plus dropout-based uncertainty estimates as an alternative to Double Q-Learning for lower-bounding target Q-values. The methods are applied to open-domain dialog generation (20,000-dimensional action space), enabling post-hoc extraction of multiple implicit reward functions from the same batch; live human deployment is used to demonstrate significant improvements over prior off-policy batch RL methods.

Significance. If the empirical claims hold, the work would be significant for applying RL to real-world settings with expensive data collection (e.g., human preference learning), by showing that offline batch methods can extract and optimize multiple rewards while generalizing in live open-domain dialog. The live deployment and multi-reward extraction are concrete strengths that go beyond typical offline RL benchmarks.

major comments (2)

[Abstract/Method] Abstract and method description: the central claim that pre-trained priors plus KL-control suffice to prevent harmful divergence and enable effective optimization from purely offline data in a 20,000-dimensional action space is load-bearing, yet no analysis, support-coverage bounds, or mismatch diagnostics are supplied to address the risk that the prior fails to tightly cover the batch support (allowing reward-model exploitation).
[Results/Evaluation] Results and evaluation sections: the reported 'significant improvements' and live-deployment generalization rest on quantitative comparisons, but the provided description supplies no tables, effect sizes, ablation details on the KL term, or verification that the uncertainty lower-bound actually substitutes for Double Q-Learning without introducing bias.

minor comments (2)

[Method] Notation for the KL-control term and the precise form of the dropout-based Q-target could be stated explicitly with an equation reference for reproducibility.
[Abstract] The abstract's description of 'extract multiple different reward functions post-hoc' would benefit from a short clarifying sentence on how the rewards are recovered from the fixed batch without additional labeling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the potential significance of live deployment results and multi-reward extraction. We respond to each major comment below.

Referee: [Abstract/Method] Abstract and method description: the central claim that pre-trained priors plus KL-control suffice to prevent harmful divergence and enable effective optimization from purely offline data in a 20,000-dimensional action space is load-bearing, yet no analysis, support-coverage bounds, or mismatch diagnostics are supplied to address the risk that the prior fails to tightly cover the batch support (allowing reward-model exploitation).

Authors: We agree that formal support-coverage bounds or explicit mismatch diagnostics would strengthen the central claim. Such bounds are difficult to derive tightly in a 20k-dimensional discrete action space, which is why the work relies on the empirical evidence from live human deployment. In revision we will add a dedicated discussion subsection on prior coverage (including statistics on observed KL divergence during training and qualitative examples of out-of-support actions) to make the assumptions more transparent, while noting that full theoretical coverage guarantees remain an open direction.
Revision: partial
Referee: [Results/Evaluation] Results and evaluation sections: the reported 'significant improvements' and live-deployment generalization rest on quantitative comparisons, but the provided description supplies no tables, effect sizes, ablation details on the KL term, or verification that the uncertainty lower-bound actually substitutes for Double Q-Learning without introducing bias.

Authors: The full manuscript contains quantitative tables and live-deployment metrics, but we accept that additional detail is warranted. We will expand the results section with (i) explicit effect-size reporting, (ii) an ablation table isolating the KL-control coefficient, and (iii) a direct comparison of dropout uncertainty versus Double Q-Learning on the same batch to quantify any bias introduced by the lower-bound. These additions will be placed in the main paper rather than only the appendix.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained priors and live empirical tests

The paper introduces a batch off-policy RL method that combines pre-trained language models (external to the new algorithm) with KL penalization and uncertainty-based Q-value bounding. No equations or derivations are presented that reduce the claimed performance gains to quantities defined by the method itself; the central results rest on post-hoc reward extraction from fixed human data batches followed by live human deployment comparisons against baselines. Self-citations are not load-bearing for the uniqueness or correctness of the core claims, and the pre-trained prior is invoked as an independent starting point rather than derived from the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-Armed Sampling Problem and the End of Exploration
cs.LG 2025-07 conditional novelty 8.0

Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning
cs.LG 2020-04 accept novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Aligning Flow Map Policies with Optimal Q-Guidance
cs.LG 2026-05 unverdicted novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI 2024-06 conditional novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Red Teaming Language Models with Language Models
cs.CL 2022-02 conditional novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.

Learning to summarize from human feedback
cs.CL 2020-09 conditional novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

Fine-Tuning Language Models from Human Preferences
cs.CL 2019-09 unverdicted novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State
cs.AI 2026-05 unverdicted novelty 6.0

In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic p...

Threshold-Guided Optimization for Visual Generative Models
cs.LG 2026-05 unverdicted novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
cs.LG 2020-06 unverdicted novelty 6.0

AWAC combines offline data with online RL via advantage-weighted actor-critic updates to enable faster acquisition of robotic skills such as dexterous manipulation.

Behavior Regularized Offline Reinforcement Learning
cs.LG 2019-11 unverdicted novelty 6.0

Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

Benchmarking Batch Deep Reinforcement Learning Algorithms
cs.LG 2019-10 unverdicted novelty 6.0

Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.

Secrets of RLHF in Large Language Models Part I: PPO
cs.CL 2023-07 unverdicted novelty 5.0

Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 15 Pith papers · 10 internal anchors

[1]

Maximum a Posteriori Policy Optimisation

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Hindsight experience replay

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017

work page 2017
[3]

Efﬁcient exploration through bayesian deep q-networks

Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efﬁcient exploration through bayesian deep q-networks. In 2018 Information Theory and Applications Workshop (ITA) , pages 1–9. IEEE, 2018. 9

work page 2018
[4]

Crossnorm: Normalization for off-policy td reinforcement learning

Aditya Bhatt, Max Argus, Artemij Amiranashvili, and Thomas Brox. Crossnorm: Normalization for off-policy td reinforcement learning. arXiv preprint arXiv:1902.05605, 2019

work page arXiv 1902
[5]

Cyr, Michelle Pence, Michael Rold, and James Honeycutt

Graham D Bodie, Kellie St. Cyr, Michelle Pence, Michael Rold, and James Honeycutt. Listening competence in initial interactions i: Distinguishing between what listening is and what listeners do. International Journal of Listening, 26(1):1–28, 2012

work page 2012
[6]

active listen- ing

Graham D Bodie, Andrea J Vickery, Kaitlin Cannava, and Susanne M Jones. The role of “active listen- ing” in informal helping conversations: Impact on perceptions of listener helpfulness, sensitivity, and supportiveness and discloser emotional improvement. Western Journal of Communication, 79(2):151–173, 2015

work page 2015
[7]

Deep reinforce- ment learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforce- ment learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017

work page 2017
[8]

Supervised learning of universal sentence representations from natural language inference data

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, 2017

work page 2017
[9]

Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs

Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics , pages 76–87. Association for Computational Linguistics, 2011

work page 2011
[10]

Off-policy actor-critic

Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 179–186. Omnipress, 2012

work page 2012
[11]

More robust doubly robust off-policy evaluation

Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, pages 1446–1455, 2018

work page 2018
[12]

Policy networks with two-stage training for dialogue systems

Mehdi Fatemi, Layla El Asri, Hannes Schulz, Jing He, and Kaheer Suleman. Policy networks with two-stage training for dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 101–110, 2016

work page 2016
[13]

Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In 2017 Conference on Empirical Methods in Natural Language ProcessingConference on Empirical Methods in Natural Language Processing. Association for Computational ...

work page 2017
[14]

Taming the noise in reinforcement learning via soft updates

Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artiﬁcial Intelligence, pages 202–211. AUAI Press, 2016

work page 2016
[15]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1582–1591, 2018

work page 2018
[16]

Off-policy deep reinforcement learning without explo- ration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without explo- ration. arXiv preprint arXiv:1812.02900, 2018

work page arXiv 2018
[17]

Dropout as a bayesian approximation: Representing model uncertainty in deep learning

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016

work page 2016
[18]

On-line policy optimisation of spoken dialogue systems via live interaction with human subjects

Milica Gaši´c, Filip Jurˇcíˇcek, Blaise Thomson, Kai Yu, and Steve Young. On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 312–317. IEEE, 2011

work page 2011
[19]

Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift

Carles Gelada and Marc G Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. arXiv preprint arXiv:1901.09455, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[20]

Approximating interactive human evaluation with self-play for open-domain dialog systems

Asma Ghandeharioun, Judy Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind Picard. Approximating interactive human evaluation with self-play for open-domain dialog systems. arXiv preprint arXiv:1906.09308, 2019

work page arXiv 1906
[21]

Reinforcement learning with deep energy-based policies

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1352–1361. JMLR. org, 2017. 10

work page 2017
[22]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865, 2018

work page 2018
[23]

Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, and Jason Weston. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901
[24]

Functions of humor in the conversations of men and women

Jennifer Hay. Functions of humor in the conversations of men and women. Journal of pragmatics , 32(6):709–742, 2000

work page 2000
[25]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[26]

Microsoft deletes ’teen girl’ ai after it became a hitler-loving sex robot within 24 hours

Helena Horton. Microsoft deletes ’teen girl’ ai after it became a hitler-loving sex robot within 24 hours. In Telegraph UK, 2016

work page 2016
[27]

Language style matching predicts relationship initiation and stability

Molly E Ireland, Richard B Slatcher, Paul W Eastwick, Lauren E Scissors, Eli J Finkel, and James W Pennebaker. Language style matching predicts relationship initiation and stability. Psychological science, 22(1):39–44, 2011

work page 2011
[28]

Sequence tutor: Conservative ﬁne-tuning of sequence generation models with kl-control

Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E Turner, and Douglas Eck. Sequence tutor: Conservative ﬁne-tuning of sequence generation models with kl-control. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1645–1654. JMLR. org, 2017

work page 2017
[29]

Doubly robust off-policy value evaluation for reinforcement learning

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661, 2016

work page 2016
[30]

Learning to achieve goals

Leslie Pack Kaelbling. Learning to achieve goals. In IJCAI, pages 1094–1099. Citeseer, 1993

work page 1993
[31]

Uncertainty-Aware Reinforcement Learning for Collision Avoidance

Gregory Kahn, Adam Villaﬂor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

A natural policy gradient

Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems (NIPS), volume 14, pages 1531–1538, 2002

work page 2002
[33]

Optimal control as a graphical model inference problem

Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012

work page 2012
[34]

Dialogue Learning With Human-In-The-Loop

Jiwei Li, Alexander H Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. Dialogue learning with human-in-the-loop. arXiv preprint arXiv:1611.09823, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Deep reinforcement learning for dialogue generation

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, 2016

work page 2016
[36]

Adversarial learning for neural dialogue generation

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, 2017

work page 2017
[37]

Dialogue Generation: From Imitation Learning to Inverse Reinforcement Learning

Ziming Li, Julia Kiseleva, and Maarten de Rijke. Dialogue generation: From imitation learning to inverse reinforcement learning. arXiv preprint arXiv:1812.03509, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[38]

Iterative policy learning in end-to-end trainable task-oriented neural dialog models

Bing Liu and Ian Lane. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE, 2017

work page 2017
[39]

Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems

Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, and Larry Heck. Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 20...

[40]

Off-Policy Policy Gradient with State Distribution Correction

Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with state distribution correction. arXiv preprint arXiv:1904.08473, 2019

[41]

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[42]

Deep exploration via bootstrapped dqn

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034, 2016

[43]

A hierarchical latent structure for variational conversation modeling

Yookoon Park, Jaemin Cho, and Gunhee Kim. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1792–1801, 2018

[44]

Relative entropy policy search

Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta, 2010

[45]

Eligibility traces for off-policy policy evaluation

Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000

[46]

Laughter

Robert R Provine. Laughter. American scientist, 84(1):38–48, 1996

[47]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019

[48]

On stochastic optimal control and reinforcement learning by approximate inference

Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: science and systems, 2012

[49]

Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method

Martin Riedmiller. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005

[50]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015

[51]

A Deep Reinforcement Learning Chatbot

Iulian V Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349, 2017

[52]

Building end-to-end dialogue systems using generative hierarchical neural network models

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016

[53]

A hierarchical latent variable encoder-decoder model for generating dialogues

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017

[54]

Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning

Pararth Shah, Dilek Hakkani-Tur, Bing Liu, and Gokhan Tur. Bootstrapping a neural conversational agent with dialogue self-play, crowdsourcing and on-line reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), page...

[55]

Sentiment adaptive end-to-end dialog systems

Weiyan Shi and Zhou Yu. Sentiment adaptive end-to-end dialog systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1509–1519, 2018

[56]

Happybot: Generating empathetic dialogue responses by improving user experience look-ahead

Jamin Shin, Peng Xu, Andrea Madotto, and Pascale Fung. Happybot: Generating empathetic dialogue responses by improving user experience look-ahead. arXiv preprint arXiv:1906.08487, 2019

[57]

Where to look: a study of human-robot engagement

Candace L Sidner, Cory D Kidd, Christopher Lee, and Neal Lesh. Where to look: a study of human-robot engagement. In Proceedings of the 9th international conference on Intelligent user interfaces, pages 78–84. ACM, 2004

[58]

Stochastic optimal control

Robert F Stengel. Stochastic optimal control. John Wiley and Sons New York, New York, 1986

[59]

Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management

Pei-Hao Su, Paweł Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 147–157, 2017

[60]

Data-efficient off-policy policy evaluation for reinforcement learning

Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016

[61]

Linearly-solvable markov decision problems

Emanuel Todorov. Linearly-solvable markov decision problems. In Advances in neural information processing systems (NIPS), pages 1369–1376, 2007.

[62]

Deep reinforcement learning with double q-learning

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016

[63]

Active listening in peer interviews: The influence of message paraphrasing on perceptions of listening skill

Harry Weger Jr, Gina R Castle, and Melissa C Emmett. Active listening in peer interviews: The influence of message paraphrasing on perceptions of listening skill. The Intl. Journal of Listening, 24(1):34–49, 2010

[64]

The design and implementation of xiaoice, an empathetic social chatbot

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. The design and implementation of xiaoice, an empathetic social chatbot. arXiv preprint arXiv:1812.08989, 2018

[65]

Maximum entropy inverse reinforcement learning

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

8 Appendix

8.1 Details about implicit metrics

8.1.1 Sentiment-based

To compute sentiment on short texts like conversation utterances, we leverage a state-of-the-art sentiment- detect...
