ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Austin Zhang; Karthik Konaraddi; Kartik Natarajan; Mahesh Sankaranarayanan; Niraj Nawanit; Rakshit Parashar; Rishita Mote; Wei Niu; Yanjun Lin; Zimo Xiao

arxiv: 2605.19077 · v1 · pith:W3VVE3RM · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 10:30 UTC · model grok-4.3

keywords: dialogue state tracking · zero-shot learning · neuro-symbolic architecture · ReAct loop · symbolic validation · MultiWOZ · task-oriented dialogue · self-correction

The pith

ReacTOD pairs a bounded ReAct loop with symbolic validation to set a new zero-shot state of the art in dialogue state tracking on MultiWOZ 2.1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReacTOD, a neuro-symbolic architecture that reformulates natural language understanding for dialogue state tracking as discrete tool calls inside a self-correcting ReAct loop. Deterministic symbolic validation enforces action compliance, schema conformance, and coreference consistency on every state update, and the loop corrects 93.1 percent of the errors the validator intercepts, including those arising from hallucination or format mistakes. The design produces structured execution traces and allows iterative self-correction, which improves accuracy by up to 9.3 percentage points over single-pass inference. On MultiWOZ 2.1 the method reaches new zero-shot records while remaining practical for moderately sized models, and it generalizes to the Schema-Guided Dialogue benchmark under end-to-end conditions.
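The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `bounded_react`, `propose`, `validate`, and the `max_steps` budget are hypothetical, and the LLM is replaced by a plain callable so the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical types: a tool call is a dict, a violation is a message string.
ToolCall = dict
Propose = Callable[[str, list[str]], ToolCall]  # (utterance, feedback) -> tool call
Validate = Callable[[ToolCall], list[str]]      # tool call -> violations (empty = valid)


def bounded_react(utterance: str, propose: Propose, validate: Validate,
                  max_steps: int = 3) -> tuple[Optional[ToolCall], list[dict]]:
    """Run at most `max_steps` propose/validate rounds and return (call, trace)."""
    feedback: list[str] = []
    trace: list[dict] = []
    for step in range(max_steps):
        call = propose(utterance, feedback)
        violations = validate(call)
        trace.append({"step": step, "call": call, "violations": violations})
        if not violations:
            return call, trace   # accepted by the validator
        feedback = violations    # fed back to the model for the next attempt
    return None, trace           # budget exhausted: reject rather than guess


# Stub "model": wrong slot name on the first try, corrected once it sees feedback.
def fake_llm(utterance: str, feedback: list[str]) -> ToolCall:
    slot = "hotel-area" if feedback else "hotel-location"
    return {"tool": "update_state", "slot": slot, "value": "north"}


# Stub validator: only two slot names are known to the (made-up) schema.
def fake_validator(call: ToolCall) -> list[str]:
    known = {"hotel-area", "hotel-stars"}
    return [] if call["slot"] in known else [f"unknown slot: {call['slot']}"]


accepted, trace = bounded_react("a hotel in the north", fake_llm, fake_validator)
print(accepted["slot"], len(trace))  # prints: hotel-area 2
```

The hard iteration cap is what makes the loop "bounded": a call that never passes validation is rejected after `max_steps` attempts, and the trace records every attempt together with the violations that triggered the retry.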

Core claim

ReacTOD reformulates NLU as discrete tool calls within a self-correcting, bounded ReAct loop governed by deterministic validation. The validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update and achieves a 93.1 percent self-correction rate on intercepted errors. On MultiWOZ 2.1 this yields new zero-shot joint goal accuracy records of 52.71 percent with a 20B model and 47.34 percent with an 8B model.

What carries the argument

Bounded ReAct loop with deterministic symbolic validator that performs action compliance, schema conformance, and coreference consistency checks on every dialogue state update.
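To make the three checks concrete, here is a hedged sketch of what a deterministic validator of this kind might look like. The abstract does not give the paper's actual rule set, so everything here is assumed for illustration: the `SCHEMA` table, the `ALLOWED_TOOLS` set, the `ref` field used for coreference, and the `validate` function itself.

```python
# Hypothetical schema: slot -> allowed values (None means free-form text).
SCHEMA = {
    "hotel-area": {"north", "south", "east", "west", "centre"},
    "hotel-name": None,
    "taxi-destination": None,
}
ALLOWED_TOOLS = {"update_state", "get_history"}


def validate(call: dict, state: dict) -> list[str]:
    """Return a list of violations; an empty list means the call may execute."""
    errors = []
    # 1. Action compliance: only whitelisted tools may be called.
    if call.get("tool") not in ALLOWED_TOOLS:
        errors.append(f"action: unknown tool {call.get('tool')!r}")
        return errors
    if call["tool"] != "update_state":
        return errors
    slot, value = call.get("slot"), call.get("value")
    # 2. Schema conformance: slot must exist, categorical values must be allowed.
    if slot not in SCHEMA:
        errors.append(f"schema: unknown slot {slot!r}")
    elif SCHEMA[slot] is not None and value not in SCHEMA[slot]:
        errors.append(f"schema: {value!r} not allowed for {slot}")
    # 3. Coreference consistency: a reference must point at an already filled slot.
    ref = call.get("ref")
    if ref is not None and ref not in state:
        errors.append(f"coref: referenced slot {ref} is empty")
    return errors


state = {"hotel-name": "acorn guest house"}
ok = validate({"tool": "update_state", "slot": "taxi-destination",
               "value": "acorn guest house", "ref": "hotel-name"}, state)
bad = validate({"tool": "update_state", "slot": "hotel-area",
                "value": "downtown", "ref": "restaurant-name"}, state)
print(ok)        # prints: []
print(len(bad))  # prints: 2
```

Because every check is a plain lookup, the same call always yields the same verdict, which is the property the word "deterministic" is pointing at. Note that a call can pass all three checks and still be factually wrong, for example a valid area that the user never asked for.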

If this is right

Iterative self-correction in the bounded loop improves joint goal accuracy by up to 9.3 percentage points over single-pass inference.
The symbolic validator achieves a 93.1 percent self-correction rate on intercepted errors while producing structured execution traces.
Incremental state prediction and on-demand history retrieval keep prompts compact and improve instruction adherence in parameter-constrained models.
The architecture generalizes across benchmarks, reaching 80.68 percent joint goal accuracy (JGA) on Schema-Guided Dialogue with Claude-Opus-4.6, using predicted domains and no task-specific training.
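Incremental state prediction and on-demand history retrieval can be illustrated with a short sketch. The helper names `apply_delta` and `get_history`, and the convention that a `None` value deletes a slot, are assumptions made for this example rather than details taken from the paper.

```python
def apply_delta(state: dict, delta: dict) -> dict:
    """Merge a per-turn delta into the running state; None deletes a slot."""
    new_state = dict(state)  # copy so earlier states stay intact for tracing
    for slot, value in delta.items():
        if value is None:
            new_state.pop(slot, None)
        else:
            new_state[slot] = value
    return new_state


def get_history(turns: list[str], last_n: int) -> list[str]:
    """On-demand retrieval: return only the most recent `last_n` turns."""
    return turns[-last_n:] if last_n > 0 else []


state = {}
for delta in [{"hotel-area": "north"}, {"hotel-stars": "4"}, {"hotel-area": None}]:
    state = apply_delta(state, delta)
print(state)  # prints: {'hotel-stars': '4'}
```

The point of the design is prompt size: the model only has to emit the slots that changed in the current turn, and it sees older turns only when it explicitly asks for them.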

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The combination of neural reasoning steps with hard symbolic checks could reduce reliance on very large models for production task-oriented dialogue systems.
Similar bounded loops with deterministic validators might improve reliability in other agentic settings that require structured outputs, such as API calling or database query formulation.
The approach invites tests on longer conversations or additional domains to check whether the self-correction rate and latency overhead remain favorable.

Load-bearing premise

The symbolic validator intercepts and correctly classifies the majority of errors without introducing new inconsistencies or excessive latency.
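The reported 93.1 percent figure is conditional on interception, so it helps to see how it relates to quantities the abstract does not report. The sketch below uses purely hypothetical counts, chosen only so that the first ratio reproduces 93.1 percent; the number of missed errors, false alarms, and clean passes are invented for illustration and say nothing about the paper's actual data.

```python
def validator_metrics(fixed: int, unfixed: int, missed: int,
                      false_alarms: int, clean_passed: int) -> dict:
    """Summarize validator behavior from five disjoint counts.

    fixed / unfixed: errors the validator intercepted, split by loop outcome
    missed:          errors that passed validation unnoticed (false negatives)
    false_alarms:    correct updates that were wrongly flagged
    clean_passed:    correct updates that passed validation
    """
    intercepted = fixed + unfixed
    total_errors = intercepted + missed
    return {
        "self_correction_rate": fixed / intercepted,
        "interception_recall": intercepted / total_errors,
        "errors_fixed_overall": fixed / total_errors,
        "false_positive_rate": false_alarms / (false_alarms + clean_passed),
    }


# Hypothetical counts for illustration only.
m = validator_metrics(fixed=931, unfixed=69, missed=500,
                      false_alarms=20, clean_passed=980)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the self-correction rate is 0.931, yet only about 62 percent of all errors end up fixed because a third of them are never intercepted. That gap is exactly what the premise above depends on.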

What would settle it

Running the same models on MultiWOZ 2.1 with the symbolic validator disabled and observing whether joint goal accuracy falls back near the single-pass baseline levels would test whether the bounded loop and validation drive the reported gains.

Figures

Figures reproduced from arXiv: 2605.19077 by Austin Zhang, Karthik Konaraddi, Kartik Natarajan, Mahesh Sankaranarayanan, Niraj Nawanit, Rakshit Parashar, Rishita Mote, Wei Niu, Yanjun Lin, Zimo Xiao.

**Figure 2.** Tool definitions and data flow in ReacTOD. The validator checks all tool calls before execution; …

Original abstract

Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.
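Joint goal accuracy, the headline metric in the abstract, counts a turn as correct only when the entire predicted dialogue state matches the gold state. A minimal sketch follows; the function name and the toy data are illustrative, and real benchmark scripts typically also normalize slot values before comparing, which this sketch does not do.

```python
def joint_goal_accuracy(predicted: list[dict], gold: list[dict]) -> float:
    """Fraction of turns whose full predicted state equals the gold state."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of turns")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy data: one state per turn (hypothetical slots and values).
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4"},                          # missing slot
    {"hotel-area": "north", "hotel-stars": "5", "hotel-parking": "yes"},  # wrong value
]
print(joint_goal_accuracy(pred, gold))  # prints: 0.5
```

Because a single wrong or missing slot fails the whole turn, the metric is strict, which is one reason zero-shot scores on MultiWOZ sit far below slot-level accuracy.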

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReacTOD pairs a bounded ReAct loop with a deterministic symbolic validator to lift zero-shot DST accuracy, but the reported gains rest heavily on unexamined validator behavior.

ReacTOD's main contribution is wrapping an LLM in a bounded ReAct loop that calls tools for state updates and then runs every change through a fixed symbolic validator for schema, action, and coreference rules. This produces the headline numbers: 52.71% joint goal accuracy on MultiWOZ 2.1 zero-shot with a 20B model, 14 points above the prior best, plus solid transfer to SGD and usable results from an 8B Qwen3 variant. The incremental state tracking and on-demand history retrieval are sensible engineering choices that keep prompts short and seem to help smaller models follow instructions better. The 9.3-point lift over single-pass inference and the 93.1% self-correction rate on caught errors are the concrete empirical wins here. The paper does a reasonable job framing the problem around hallucination and format drift in practical dialogue systems and showing that the neuro-symbolic mix can mitigate them without task-specific training.

The soft spot is exactly where the stress test flags it. The correction rate only counts errors the validator actually intercepts, so we have no visibility into false negatives on MultiWOZ edge cases such as implicit cross-domain references or partial slot updates. If the rule set silently accepts a schema-compliant but wrong state, or if it forces an incorrect fix, the net accuracy improvement could shrink. The abstract gives no false-positive rate, no latency breakdown for the loop, and no post-correction error analysis, which leaves the central claim harder to trust at face value.

This paper is aimed at dialogue researchers and practitioners who need better zero-shot reliability from mid-sized models. It has enough benchmark results and a clear enough architecture to merit a serious referee, even though additional ablations on the validator would make the evidence stronger. I would send it out for review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes ReacTOD, a bounded neuro-symbolic architecture for zero-shot dialogue state tracking. It reformulates NLU as discrete tool calls inside a self-correcting ReAct loop whose iterations are governed by a deterministic symbolic validator that enforces action compliance, schema conformance, and coreference consistency. Incremental state prediction and on-demand history retrieval are used to keep prompts compact. The manuscript reports up to 9.3 pp gains over single-pass inference, a new zero-shot SOTA of 52.71% joint goal accuracy on MultiWOZ 2.1 with gpt-oss-20B, 47.34% with Qwen3-8B, and competitive end-to-end results on SGD.

Significance. If the empirical gains are robust, the work illustrates a practical way to combine moderately sized LLMs with symbolic constraints inside a bounded loop to reduce format and hallucination errors in structured prediction tasks. The reported cross-benchmark generalization without task-specific training and the performance achieved by 8B-scale models are useful for latency-sensitive deployments. The production of structured execution traces is a secondary strength that could aid debugging.

major comments (3)

[Abstract] The 93.1% self-correction rate is stated only for intercepted errors. Without a reported false-positive rate for the validator or a post-correction error breakdown (particularly on MultiWOZ phenomena such as implicit cross-domain coreference or partial slot updates), it remains unclear whether the validator silently accepts schema-compliant but factually incorrect states, which would inflate the net accuracy improvement.
[Section 4 (Experiments)] The +14 pp improvement over the previous zero-shot best and the 9.3 pp gain over single-pass are presented as headline results, yet the text provides no description of the exact baseline systems, prompt templates, decoding parameters, or number of evaluation runs. This information is load-bearing for assessing whether the central claim of a new SOTA is reproducible and free of post-hoc setup choices.
[Section 3 (Architecture)] The bounded ReAct loop termination criteria and the precise interface between the LLM tool calls and the deterministic validator are described at a high level. A more formal specification (e.g., pseudocode or state-transition rules) is needed to evaluate potential introduction of new inconsistencies or excessive latency on edge cases.

minor comments (2)

[Abstract] The first mention of JGA should be spelled out as 'joint goal accuracy' for readers outside the immediate subfield.
[Throughout] Figure captions and table headers should explicitly state whether results are zero-shot and whether domains are predicted or oracle-provided.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen clarity, reproducibility, and formalization.

Referee: [Abstract] The 93.1% self-correction rate is stated only for intercepted errors. Without a reported false-positive rate for the validator or a post-correction error breakdown (particularly on MultiWOZ phenomena such as implicit cross-domain coreference or partial slot updates), it remains unclear whether the validator silently accepts schema-compliant but factually incorrect states, which would inflate the net accuracy improvement.

Authors: We agree that additional metrics on validator behavior would improve the presentation. The reported 93.1% specifically quantifies successful corrections among errors that the validator intercepted. In the revised manuscript we will add the validator's false-positive rate (incorrectly flagged valid states) together with a post-correction error breakdown that explicitly discusses implicit cross-domain coreference and partial slot updates on MultiWOZ. This analysis will clarify the extent to which schema-compliant yet factually incorrect states remain after the loop.

Revision: yes

Referee: [Section 4 (Experiments)] The +14 pp improvement over the previous zero-shot best and the 9.3 pp gain over single-pass are presented as headline results, yet the text provides no description of the exact baseline systems, prompt templates, decoding parameters, or number of evaluation runs. This information is load-bearing for assessing whether the central claim of a new SOTA is reproducible and free of post-hoc setup choices.

Authors: We acknowledge that the current description is insufficient for full reproducibility. In the revised Section 4 we will supply (i) precise specifications of all baseline systems and their prompting setups, (ii) the complete prompt templates used for single-pass and ReacTOD inference, (iii) decoding hyperparameters (temperature, top-p, max tokens, etc.), and (iv) the number of evaluation runs together with mean and standard deviation across three independent runs with different seeds. These additions will allow independent verification of the reported gains.

Revision: yes

Referee: [Section 3 (Architecture)] The bounded ReAct loop termination criteria and the precise interface between the LLM tool calls and the deterministic validator are described at a high level. A more formal specification (e.g., pseudocode or state-transition rules) is needed to evaluate potential introduction of new inconsistencies or excessive latency on edge cases.

Authors: We agree that a higher-level description leaves room for ambiguity. We will extend Section 3 with pseudocode that formally defines the bounded ReAct loop, the termination conditions (maximum iterations, validator-driven convergence, and early-exit rules), and the exact hand-off protocol between LLM tool-call output and the deterministic validator. The added specification will also include state-transition rules to facilitate analysis of potential inconsistencies or latency on edge cases.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results from proposed architecture

The paper describes a neuro-symbolic ReacTOD system using a bounded ReAct loop and deterministic symbolic validator for zero-shot dialogue state tracking. All reported gains, including the 9.3 pp improvement over single-pass inference, the 93.1% self-correction rate, and SOTA numbers such as 52.71% JGA on MultiWOZ 2.1, are presented as outcomes of experimental evaluation on standard benchmarks rather than quantities derived from equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the derivation chain; the architecture is motivated by practical requirements for predictability and then measured directly against baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review limited to abstract; the central claim rests on the unverified effectiveness of the symbolic validator and the bounded loop's ability to converge without excessive steps.

axioms (1)

Domain assumption: A deterministic symbolic validator can reliably enforce action compliance, schema conformance, and coreference consistency on every dialogue state update.
Invoked when the abstract states the validator achieves 93.1% self-correction on intercepted errors.

invented entities (1)

Bounded ReAct loop governed by deterministic validation (no independent evidence)
Purpose: To enable iterative self-correction of NLU errors in zero-shot DST.
Core architectural contribution introduced in the abstract.

