Prompt reinforcing for long-term planning of large language models

Benjamin Matthias Ruppik; Carel van Niekerk; Chia-Hao Shen; Hsien-Chin Lin; Michael Heck; Milica Gašić; Nurul Lubis; Renato Vukovic; Shutong Feng

arxiv: 2510.05921 · v3 · pith:SSMPFZ4Enew · submitted 2025-10-07 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-21 20:39 UTC · model grok-4.3


keywords: prompt optimization, large language models, reinforcement learning, multi-turn tasks, text-to-SQL, task-oriented dialogue, experience replay, long-term planning


The pith

A reinforcement learning-inspired method lets LLMs optimize their own prompts for better multi-turn performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models tend to falter in extended interactions because they commit to wrong assumptions early and fail to keep user goals in mind. This paper develops a prompt optimization approach modeled on reinforcement learning to fix that by revising the main task instructions as the conversation progresses. The revision relies on feedback generated after each turn and on replaying earlier experiences to refine the prompt. If this works, models could handle ongoing tasks like database queries from text or managing conversations more reliably while keeping the underlying model unchanged. Readers would care about this because most practical uses of language models involve back-and-forth exchanges rather than one-shot questions.

Core claim

By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, the proposed prompt optimisation framework enables large language models to conduct long-term planning in multi-turn tasks such as text-to-SQL and task-oriented dialogue, with the improvements generalizing across different LLM-based agents and allowing the use of diverse LLMs as meta-prompting agents.

What carries the argument

The prompt optimisation framework that uses turn-by-turn feedback and experience replay to rewrite the task instruction prompt of the LLM agent.
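The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_episode`, `give_feedback`, and `rewrite` are hypothetical callables standing in for the LLM agent, the feedback model, and the rewriter, and the buffer size, sample size, and uniform sampling are illustrative choices that the summary does not specify.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple

Turn = Tuple[str, str]  # (user utterance, system response)


@dataclass
class Experience:
    prompt: str          # task instruction used during the episode
    feedback: List[str]  # one textual feedback item per turn


def optimise_prompt(
    prompt: str,
    run_episode: Callable[[str], List[Turn]],       # hypothetical: agent rollout
    give_feedback: Callable[[Turn], str],           # hypothetical: per-turn feedback
    rewrite: Callable[[str, List[Experience]], str],  # hypothetical: prompt rewriter
    epochs: int = 3,
    buffer_size: int = 16,   # assumed value
    replay_k: int = 2,       # assumed value
    seed: int = 0,
) -> str:
    """Rewrite the task instruction from turn-level feedback and replayed experience."""
    rng = random.Random(seed)
    buffer: Deque[Experience] = deque(maxlen=buffer_size)
    for _ in range(epochs):
        turns = run_episode(prompt)
        feedback = [give_feedback(turn) for turn in turns]
        buffer.append(Experience(prompt, feedback))
        # Experience replay: sample stored episodes (uniform sampling is an assumption).
        replayed = rng.sample(list(buffer), k=min(replay_k, len(buffer)))
        prompt = rewrite(prompt, replayed)
    return prompt
```

Only the prompt string changes between epochs; the underlying model weights are never touched, which is the sense in which the approach leaves the model unchanged.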

Load-bearing premise

Feedback generated turn-by-turn by the LLM or a similar model is accurate and informative enough to drive prompt improvements without compounding errors.

What would settle it

Testing the method on extended interactions where feedback quality is reduced or errors are introduced early on to see if performance gains disappear or reverse.
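One way such a stress test could be set up is to corrupt the turn-level feedback labels before they reach the rewriter. The sketch below is hypothetical: the label set is an assumption (the Figure 9 prompt shows "In Progress" and "Fail" in full, while the success label is truncated in the source, so `"Success"` is a placeholder), and `corrupt_feedback`, `flip_rate`, and `early_turns` are names introduced here for illustration.

```python
import random
from typing import List, Sequence

# Assumed label set; "Success" is a placeholder for the truncated third label.
LABELS = ("In Progress", "Fail", "Success")


def corrupt_feedback(
    labels: Sequence[str],
    flip_rate: float,
    early_turns: int = 0,
    seed: int = 0,
) -> List[str]:
    """Return a copy of `labels` where some entries are replaced by a different label.

    Turns with index below `early_turns` are always corrupted (simulating early
    errors); every other turn is corrupted with probability `flip_rate`.
    """
    if not 0.0 <= flip_rate <= 1.0:
        raise ValueError("flip_rate must lie in [0, 1]")
    rng = random.Random(seed)
    noisy: List[str] = []
    for index, label in enumerate(labels):
        if index < early_turns or rng.random() < flip_rate:
            alternatives = [other for other in LABELS if other != label]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(label)
    return noisy
```

Sweeping `flip_rate` from 0 to 1, or raising `early_turns`, and re-running the optimisation would show whether the reported gains shrink or reverse as feedback quality drops.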

Figures

Figures reproduced from arXiv: 2510.05921 by Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Hsien-Chin Lin, Michael Heck, Milica Gašić, Nurul Lubis, Renato Vukovic, Shutong Feng.

**Figure 2.** Workflow of feedback generation by an LLM. The Monte Carlo–style feedback (left) is…

**Figure 3.** The summary of our experiment tasks. […] test whether prompt optimisation generalises across different LLMs. The agent is optimised in the multi-sharded environment and evaluated by functional accuracy, requiring generated SQL queries to exactly match the reference outputs across all databases. Task-oriented Dialogue: To evaluate on a more realistic scenario, we conduct experiments on MultiWOZ 2.1 (Budzianowski…

**Figure 4.** The training curves of different optimisation methods. Each setting is trained on 4 seeds…

**Figure 5.** Overall preference between our method and a standard system (Standard), GPO, and…

**Figure 6.** The training curve of different optimisation methods. Each setting is trained over 4 seeds,…

**Figure 7.** The prompt of the basic rewriter. […] C.1 AN EXAMPLE OF THE SYSTEM PROMPT BEFORE AND AFTER OPTIMISATION BY RPO

**Figure 8.** The prompt of the experience replay rewriter.

**Figure 9.** The prompt of the MC-style feedbacker. Here is the user goal: [USER GOALS] For each turn, please evaluate the system's behaviour. Your response should include your reasons, what the user emotion would be when the user sees the system's response, and whether the system is efficiently progressing towards solving the task (In Progress), or if the conversation failed (Fail) or if the conversation is successfull…

**Figure 10.** The prompt of the TD-style feedbacker. The input, including user utterance, system…

**Figure 11.** The system prompt of FnCTOD before prompt optimisation.

**Figure 12.** The system prompt of FnCTOD after it is optimised by RPO
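The Figure 3 caption text describes functional accuracy as requiring generated SQL to match the reference outputs across all databases. A minimal sketch of such a check, using Python's standard `sqlite3` module, might look like this. The helper names are hypothetical, and comparing rows order-insensitively is an assumption made here; the source does not say whether row order counts.

```python
import sqlite3
from typing import Iterable, List, Optional, Sequence, Tuple


def run_query(conn: sqlite3.Connection, sql: str) -> Optional[List[tuple]]:
    """Execute `sql` and return its rows, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: ignore row order by sorting on the textual form of each row.
    return sorted(rows, key=repr)


def functionally_correct(
    pred_sql: str, ref_sql: str, conns: Sequence[sqlite3.Connection]
) -> bool:
    """True only if the prediction matches the reference output on every database."""
    for conn in conns:
        reference = run_query(conn, ref_sql)
        predicted = run_query(conn, pred_sql)
        if reference is None or predicted is None or predicted != reference:
            return False
    return True


def functional_accuracy(
    pairs: Iterable[Tuple[str, str]], conns: Sequence[sqlite3.Connection]
) -> float:
    """Fraction of (predicted, reference) query pairs that are functionally correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(functionally_correct(pred, ref, conns) for pred, ref in pairs)
    return correct / len(pairs)
```

Checking against several databases rather than one makes it harder for a wrong query to pass by coincidence on a single data distribution.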

Original abstract

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.
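For orientation, the figure captions distinguish Monte Carlo–style from TD-style feedback. The textbook reinforcement learning quantities behind those names are given below. These are the standard definitions only; this page does not state how the paper maps them onto textual feedback, so the symbols (reward $r$, discount factor $\gamma$, value estimate $V$, final step $T$) are the usual RL notation rather than the paper's.

```latex
% Monte Carlo return: evaluate step t using the whole remaining episode.
\[
G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1}
\]
% Temporal-difference (TD) error: evaluate step t using only the next step
% and a bootstrapped value estimate.
\[
\delta_t = r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t)
\]
```

Read by analogy, a Monte Carlo–style judgement needs the finished conversation, whereas a TD-style judgement can be made after each turn.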

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies experience replay to iteratively rewrite LLM task prompts using turn-by-turn feedback for better multi-turn coherence, but the abstract gives no numbers or controls to check if it works.


The main contribution is a parameter-free way to update the system prompt over multiple turns by storing feedback and replaying it to rewrite instructions. This targets drift in long-horizon tasks like text-to-SQL and task-oriented dialogue, and the setup claims to work across different base models while using other LLMs for the rewriting step. That combination of replay and prompt-level adaptation is presented as new relative to the cited prompt-optimization and RL-for-LLM papers, and it stays simple enough for practical agent deployments without extra training. The generalization claim across agents is a plus if the full experiments back it up.

The soft spots are more noticeable. The abstract states significant improvements yet supplies no quantitative results, baselines, statistical tests, or ablation details, so the gains cannot be verified from what is shown. The central assumption that LLM-generated turn-by-turn feedback is reliable enough to drive the updates looks shaky; noisy or systematically wrong feedback on goal tracking or task details could compound errors through replay and erase any reported benefits. The stress-test point about feedback quality holds up on the available description, and without oracle comparisons or quality checks the results risk being fragile.

This is aimed at people building interactive LLM agents who need simple coherence fixes rather than core theory advances. A reader working on applied dialogue systems or agent prompting might extract useful implementation ideas from the full paper if the experiments are properly controlled. I would send it for peer review to get the numbers and setup examined rather than desk-reject it.

Referee Report

2 major / 1 minor

Summary. The paper proposes a reinforcement learning-inspired prompt optimization framework for LLMs to address suboptimal performance in multi-turn interactions. By generating turn-by-turn feedback and applying experience replay to rewrite the task instruction prompt, the method aims to enable long-term planning, correct early assumptions, and better track user goals. It claims significant empirical improvements on multi-turn tasks such as text-to-SQL and task-oriented dialogue, along with generalization across different LLM agents and meta-prompting models.

Significance. If the empirical results are robust and the feedback mechanism is validated, this could represent a useful parameter-free approach to enhancing LLM agents in interactive settings without fine-tuning, extending ideas from dialogue systems and RL to practical multi-turn applications.

major comments (2)

Abstract: the central claim of 'significant improvement' in multi-turn tasks is presented without any quantitative metrics, baselines, statistical tests, or ablation details, preventing verification of the reported gains on text-to-SQL and dialogue.
Method (feedback and replay components): the approach depends on LLM-generated turn-by-turn feedback being accurate enough to guide prompt rewriting; no evaluation of feedback quality or ablation against oracle/human feedback is described, leaving open the possibility that experience replay reinforces systematic errors such as misidentified goal drift.

minor comments (1)

Abstract: the phrase 'parameter-free optimisation methods' should be clarified, as the framework still relies on the parameters of the underlying LLMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline the changes we will make to strengthen the manuscript.


Referee: Abstract: the central claim of 'significant improvement' in multi-turn tasks is presented without any quantitative metrics, baselines, statistical tests, or ablation details, preventing verification of the reported gains on text-to-SQL and dialogue.

Authors: We agree that the abstract would be more informative with concrete details. In the revised manuscript we will update the abstract to include specific quantitative gains (e.g., absolute and relative improvements on the text-to-SQL and task-oriented dialogue benchmarks), reference the main baselines, and note that statistical significance was assessed via paired t-tests across multiple runs. revision: yes
Referee: Method (feedback and replay components): the approach depends on LLM-generated turn-by-turn feedback being accurate enough to guide prompt rewriting; no evaluation of feedback quality or ablation against oracle/human feedback is described, leaving open the possibility that experience replay reinforces systematic errors such as misidentified goal drift.

Authors: This is a fair and important point. While end-to-end task improvements across LLMs provide indirect evidence that the feedback is useful on average, we did not quantify feedback accuracy or compare against oracle feedback. In the revision we will add a new analysis subsection that (i) reports agreement between LLM-generated feedback and human annotations on a held-out set of turns and (ii) includes an ablation that replaces the learned feedback with oracle feedback to measure the performance gap and discuss error propagation. revision: yes
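The agreement analysis promised in the second response could be quantified with a chance-corrected statistic such as Cohen's kappa. The rebuttal does not name a metric, so the choice of kappa is an assumption here, and the function below is an illustrative pure-Python sketch rather than anything taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label sequences (e.g. LLM vs. human turn labels)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of turns where both raters chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Raw percentage agreement can look high simply because one label dominates (most turns being "In Progress", for instance), which is why a chance-corrected measure is a reasonable candidate for this kind of check.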

Circularity Check

0 steps flagged

No significant circularity; empirical method with direct claims


The paper proposes a prompt optimization framework for LLMs that uses turn-by-turn feedback and experience replay to enable long-term planning in multi-turn tasks. No equations, derivations, or first-principles results are present in the abstract or described approach that reduce outputs to inputs by construction. The central claims rest on reported empirical improvements in text-to-SQL and task-oriented dialogue, with the method defined directly rather than through self-referential fits, predictions, or load-bearing self-citations. This is a standard empirical contribution without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that LLM-generated feedback is reliable enough to guide prompt improvement and on the modeling choice that only the initial instruction needs to be rewritten.

axioms (1)

domain assumption: LLM-generated turn-by-turn feedback is sufficiently accurate to serve as a learning signal for prompt rewriting.
Invoked implicitly when the method uses the feedback to rewrite prompts without external verification.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We propose Reinforced Prompt Optimisation (RPO)... By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Inspired by... temporal difference (TD) error... experience replay

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 1 internal anchor

[1]

Weize Kong, Spurthi Hombaiah, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky

URL https://aclanthology.org/2023.emnlp-main.319. Weize Kong, Spurthi Hombaiah, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. PRewrite: Prompt rewriting with reinforcement learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)...

[2]

LLMs Get Lost In Multi-Turn Conversation

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-short.54. URL https://aclanthology.org/2024.acl-short.54. Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs Get Lost In Multi-Turn Conversation, 2025. URL https://arxiv.org/abs/2505.06120. Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for paramet...

[3]

URL https://aclanthology.org/2021

Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243. Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Comput...

[4]

Louis, G

URL https://proceedings.neurips.cc/paper_files/paper/2023/file/1b44b878bb782e6954cd888628510e90-Paper-Conference.pdf. Xinyu Tang, Xiaolei Wang, Wayne Xin Zhao, Siyuan Lu, Yaliang Li, and Ji-Rong Wen. Unleashing the potential of large language models as prompt optimizers: analogical analysis with gradient-based model optimizers. In Proceedings of the Thir...

[5]

URL https://aclanthology.org/2025

doi: 10.18653/v1/2025.findings-naacl.211. URL https://aclanthology.org/2025.findings-naacl.211/. Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric Xing, and Zhiting Hu. PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization. In The Twelfth International Conference on Learning...

[6]

Wenjing Yue Wei Zhu and Xiaoling Wang

URL https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf. Wenjing Yue Wei Zhu and Xiaoling Wang. ShenNong-TCM: A Traditional Chinese Medicine Large Language Model. https://github.com/michael-wzhu/ShenNong-TCM-LLM, 2023. Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and...

[7]

Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu

URL https://proceedings.neurips.cc/paper_files/paper/2023/file/f6b22ac37beb5da61efd4882082c9ecd-Paper-Conference.pdf. Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu. Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization. In ICLR 2024 Workshop on Large Language M...
[8]

Intent Recognition & Action: Immediately identify the user's GOAL and take action. Avoid greetings and redundant repetition of the user request. Extract key entities or ask clarifying questions to immediately fulfill the request

[9]

Dynamic Slot Updating & Goal Tracking: After each turn, completely update all relevant slots (day, time, people, location, price range, constraints, etc.) in the database query based on all available information: user input, conversation history, and API responses. Prioritize explicit user input. Track user goals throughout the conversation and make sure ...

[10]

Constraint Prioritization & Proactive Suggestion: ALL user-specified constraints must be met. If a direct match isn't found, proactively offer alternatives that best align with user requirements (nearby locations, different dates/times, related options, fuzzy matching). Before concluding unavailability, suggest relaxing constraints (one at a time) and pro...

[11]

Context & Conversational Flow: Maintain context across turns using conversation history. Avoid repetitive questions by remembering previous answers. Update search parameters based on new information. Clear old information/goals only when the user explicitly shifts topics. Repeat unfulfilled goals only when presenting subtask results if the goals are perti...

[12]

Accurate & Efficient API Calls: Validate API call parameters against current, complete, and accurate user preferences exactly. Avoid hardcoded or default values. Do not continue API calls if the answer has already been found and presented or if the API provides the requested information. Validate input data type compliance and reasonable limits (dates, ti...

[13]

Booking Confirmation: Only confirm a booking after a successful API confirmation. Do not hallucinate bookings

[14]

Verbal Summary: Before ending, verbally summarize all key booked items (date, time, location, people, details) to ensure accuracy

[15]

Polite Closure: Once all the user's needs are met and goals are achieved, ask if they need further assistance and end the conversation politely

[16]

Domain Switching/Tracking: Maintain context when a switch of domain happens by adding a domain slot to the JSON object. Figure 12: The system prompt of FnCTOD after it is optimised by RPO TD+replay for 8 epochs. RPO TD+replay is built with Gemini-2.0-Flash. The format is generated by the rewriter in markdown format. For illustration, the instructions of go...
