pith. machine review for the scientific record.

arxiv: 2605.04449 · v1 · submitted 2026-05-06 · 💻 cs.CL · cs.AI

Recognition: unknown

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking

Adithya Suresh, Iman Abbasnejad, Tomal Deb, Ziqi Zhu

Pith reviewed 2026-05-08 17:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords dialogue state tracking · mixture of experts · graph neural networks · ReAct agents · MultiWOZ · joint goal accuracy · dialogue systems · language models

The pith

GEM routes each dialogue turn between a graph expert and a language-model expert, calling ReAct agents for hard values, and reaches 65.19 percent joint goal accuracy on dialogue state tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dialogue state tracking requires pulling precise structured information such as user goals and slot values from multi-turn, multi-domain conversations, yet large language models alone often produce inconsistent or incomplete results. The paper introduces GEM, a system that maintains a graph neural network expert to represent dialogue structure and turn dependencies, pairs it with a finetuned T5-Small expert for sequence generation, and adds ReAct agents for step-by-step reasoning on difficult value outputs. These components are coordinated by a dynamic router that activates only the needed expert. The resulting model records 65.19 percent joint goal accuracy on MultiWOZ 2.2, well above the strongest end-to-end language-model baseline at 38.43 percent and above earlier systems such as TOATOD. A reader would care because the result demonstrates a concrete way to combine structured representations with selective model activation and agentic reasoning for tasks that demand factual precision rather than fluent generation.
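
Joint goal accuracy is a strict turn-level metric: a turn counts as correct only when every predicted slot-value pair matches the gold dialogue state exactly. The paper does not spell out its matching or normalization rules, so the sketch below is only the standard computation with an assumed lower-case, whitespace-stripped comparison.

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose predicted state matches the gold state exactly.

    Each element is a dict mapping "domain-slot" names to string values, e.g.
    {"hotel-area": "centre", "hotel-stars": "4"}. Normalization is an assumption.
    """
    assert len(predictions) == len(references)
    correct = sum(
        1 for pred, gold in zip(predictions, references)
        if {k: v.strip().lower() for k, v in pred.items()}
        == {k: v.strip().lower() for k, v in gold.items()}
    )
    return correct / len(references) if references else 0.0

# Under this metric, 65.19% JGA means roughly 6519 of every 10000 turns match exactly.
```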

Core claim

The authors show that a graph-enhanced mixture-of-experts architecture dynamically routes each dialogue turn to either a Graph Neural Network expert that encodes structure and dependencies or a finetuned T5-Small encoder-decoder for sequence modeling, while invoking ReAct agents for complex value generation; this combination produces 65.19 percent Joint Goal Accuracy on MultiWOZ 2.2, surpassing end-to-end LLM approaches at 38.43 percent and prior state-of-the-art methods including TOATOD at 63.79 percent, D3ST at 58.70 percent, and Diable at 56.48 percent.

What carries the argument

The GEM router that selects between the Graph Neural Network expert for dialogue structure and the T5-Small expert for sequence modeling, with ReAct agents performing structured reasoning when value generation is complex.
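
The paper does not publish the router's implementation, so the following is only an illustration of the pattern it describes: a learned gate scores each dialogue turn, a single expert (graph or sequence) is activated, and a ReAct-style agent is invoked when the chosen expert's output looks uncertain. All class names, the expert interface, and the confidence threshold are hypothetical.

```python
import torch
import torch.nn as nn

class TurnRouter(nn.Module):
    """Hypothetical two-way gate: graph expert vs. sequence (T5-Small) expert."""

    def __init__(self, hidden_dim: int, num_experts: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)

    def forward(self, turn_embedding: torch.Tensor) -> torch.Tensor:
        # Softmax scores over experts; only the argmax expert is run (top-1 routing).
        return torch.softmax(self.gate(turn_embedding), dim=-1)

def track_turn(turn_embedding, router, graph_expert, seq_expert, react_agent,
               confidence_threshold: float = 0.5):
    """Assumed interface: each expert returns (slot_values, confidence)."""
    scores = router(turn_embedding)
    expert = graph_expert if scores.argmax().item() == 0 else seq_expert
    values, confidence = expert(turn_embedding)
    # Escalate difficult value generation to a ReAct-style reasoning loop.
    if confidence < confidence_threshold:
        values = react_agent(turn_embedding, draft=values)
    return values
```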

Load-bearing premise

The reported accuracy gains come from the graph-enhanced mixture-of-experts routing and ReAct integration rather than from differences in data preprocessing, hyperparameter tuning, or baseline re-implementations.

What would settle it

An ablation that removes the graph neural network component or the ReAct agents, keeps every other training and evaluation detail identical, and checks whether joint goal accuracy falls below 63.79 percent.
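
A schematic of that ablation, assuming access to the full system with switchable components. Every flag and function name here is hypothetical; the 63.79 percent threshold is TOATOD's joint goal accuracy as reported in the paper.

```python
# Hypothetical ablation grid: identical data, training, and evaluation,
# with one component removed at a time.
VARIANTS = {
    "full":          {"use_gnn": True,  "use_react": True},
    "no_gnn":        {"use_gnn": False, "use_react": True},
    "no_react":      {"use_gnn": True,  "use_react": False},
    "t5_small_only": {"use_gnn": False, "use_react": False},
}

TOATOD_JGA = 0.6379  # prior state of the art cited in the paper

def run_ablation(train_and_evaluate):
    """train_and_evaluate(**flags) -> JGA on MultiWOZ 2.2 (assumed interface)."""
    results = {name: train_and_evaluate(**flags) for name, flags in VARIANTS.items()}
    for name, jga in results.items():
        verdict = "above" if jga > TOATOD_JGA else "at or below"
        print(f"{name:>14}: JGA {jga:.2%} ({verdict} the TOATOD baseline)")
    return results
```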

Figures

Figures reproduced from arXiv: 2605.04449 by Adithya Suresh, Iman Abbasnejad, Tomal Deb, Ziqi Zhu.

Figure 1: Our proposed architecture for dialogue state tracking.
original abstract

Dialogue State Tracking (DST) requires precise extraction of structured information from multi-domain conversations, a task where Large Language Models (LLMs) struggle despite their impressive general capabilities. We present GEM (Graph-Enhanced Mixture-of-Experts), a novel framework that combines language models and graph-structured dialogue understanding with ReAct agent-based reasoning for superior DST performance. Our approach dynamically routes between specialized experts: a Graph Neural Network that captures dialogue structure and turn-level dependencies, and a finetuned T5-Small encoder-decoder for sequence modeling, coordinated by an intelligent router. For complex value generation tasks, we integrate ReAct agents that perform structured reasoning over dialogue context. On MultiWOZ 2.2, GEM achieves 65.19% Joint Goal Accuracy, substantially outperforming end-to-end LLM approaches (best: 38.43%) and surpassing state-of-the-art (SOTA) methods including TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%). Our graph-enhanced mixture-of-experts architecture with ReAct integration demonstrates that combining structured dialogue representation with dynamic expert routing and agent-based reasoning provides a powerful paradigm for dialogue state tracking, achieving superior accuracy while maintaining computational efficiency through selective expert activation.
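
The abstract describes ReAct agents performing structured reasoning over dialogue context for complex value generation but gives no implementation details. The loop below is a generic thought-act-observe sketch of that pattern; the prompt format, tool set, and stopping rule are invented for illustration and are not the paper's interface.

```python
def react_value_generation(llm, tools, dialogue_context, slot, max_steps=4):
    """Generic ReAct-style loop for filling one difficult slot value.

    `llm` is any callable mapping a prompt string to a text completion;
    `tools` maps action names (e.g. "lookup_turn") to callables. Both are
    assumptions made for this sketch.
    """
    trace = f"Dialogue:\n{dialogue_context}\nSlot to fill: {slot}\n"
    for _ in range(max_steps):
        thought = llm(trace + "Thought:")        # reason about what is still missing
        trace += f"Thought: {thought}\n"
        if "Final Answer:" in thought:
            return thought.split("Final Answer:")[-1].strip()
        action = llm(trace + "Action:")          # pick a tool and its argument
        name, _, arg = action.partition(" ")
        observation = tools.get(name, lambda a: "unknown action")(arg)
        trace += f"Action: {action}\nObservation: {observation}\n"
    return None  # fall back to the routed expert's draft value
```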

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces GEM, a Graph-Enhanced Mixture-of-Experts framework for dialogue state tracking that routes between a GNN expert capturing dialogue structure and a fine-tuned T5-Small model, augmented by ReAct agents for complex value generation. It reports 65.19% Joint Goal Accuracy on MultiWOZ 2.2, outperforming end-to-end LLM baselines (best 38.43%) and prior SOTA methods such as TOATOD (63.79%), D3ST (58.70%), and Diable (56.48%), attributing gains to the combination of graph-structured representations, dynamic expert routing, and agent-based reasoning.

Significance. If the reported gains are shown to arise specifically from the proposed components rather than implementation differences, the work would demonstrate a viable hybrid paradigm for DST that leverages structured graph modeling alongside selective LLM activation, offering both accuracy and efficiency advantages over pure end-to-end approaches.

major comments (2)
  1. [Experimental evaluation] The experimental evaluation provides no protocol details, data splits, preprocessing steps, hyperparameter settings, or reproduction information for the cited baselines (TOATOD, D3ST, Diable, and LLM approaches). Without these, the 65.19% JGA figure and the claimed outperformance cannot be verified or attributed to the graph MoE + ReAct design.
  2. [Ablation and analysis] No ablation studies or component analyses are presented to isolate the contributions of the GNN expert, the router, the T5-Small expert, or the ReAct agents. The central claim that performance improvements stem from the graph-enhanced MoE routing and ReAct integration therefore lacks supporting evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough and constructive review of our manuscript. We appreciate the feedback highlighting areas where additional details and analyses would strengthen the work. We address each major comment below and will incorporate the necessary revisions to improve reproducibility and substantiate our claims.

point-by-point responses
  1. Referee: [Experimental evaluation] The experimental evaluation provides no protocol details, data splits, preprocessing steps, hyperparameter settings, or reproduction information for the cited baselines (TOATOD, D3ST, Diable, and LLM approaches). Without these, the 65.19% JGA figure and the claimed outperformance cannot be verified or attributed to the graph MoE + ReAct design.

    Authors: We agree that the current manuscript lacks sufficient experimental protocol details, which is essential for reproducibility and proper attribution of results. In the revised version, we will add a dedicated Experimental Setup section that explicitly details the MultiWOZ 2.2 data splits, preprocessing steps, hyperparameter configurations for GEM and all baselines (including TOATOD, D3ST, Diable, and the LLM approaches), training procedures, evaluation metrics, and any other reproduction information. We will also release code and checkpoints to allow independent verification of the 65.19% JGA. revision: yes

  2. Referee: [Ablation and analysis] No ablation studies or component analyses are presented to isolate the contributions of the GNN expert, the router, the T5-Small expert, or the ReAct agents. The central claim that performance improvements stem from the graph-enhanced MoE routing and ReAct integration therefore lacks supporting evidence.

    Authors: We acknowledge that the absence of ablation studies leaves the contributions of individual components unsubstantiated. We will conduct new ablation experiments and include a dedicated Ablation and Analysis section in the revision. These will evaluate variants such as removing the GNN expert, disabling dynamic routing, using only the T5-Small expert, and ablating the ReAct agents, reporting their impact on Joint Goal Accuracy to directly support the claims regarding the graph-enhanced MoE and ReAct components. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with benchmark results only

full rationale

The paper introduces GEM as an architectural combination of GNN experts, MoE routing, T5-Small, and ReAct agents for DST, then reports empirical Joint Goal Accuracy of 65.19% on MultiWOZ 2.2 against listed baselines. No equations, derivations, first-principles predictions, or ansatzes appear in the provided text. Performance claims rest on reported numbers rather than any reduction to fitted inputs or self-citation chains. The central result is therefore self-contained experimental evidence with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical machine-learning paper. It introduces no mathematical axioms, free parameters beyond standard neural-network training, or invented physical entities; it relies on existing models (T5, GNNs, ReAct) and the MultiWOZ benchmark.

pith-pipeline@v0.9.0 · 5534 in / 1268 out tokens · 42040 ms · 2026-05-08T17:31:01.150576+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 32 canonical work pages · 5 internal anchors

  1. [1] Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
  2. [2] A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. arXiv preprint arXiv:2302.04023.
  3. [3] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.
  4. [4] Towards Reasoning in Large Language Models: A Survey. arXiv preprint arXiv:2212.10403.
  5. [5] Improving Zero-Shot Chain-of-Thought Reasoning Across Languages with Rectification and Self-Optimization Prompting. The Journal of Supercomputing, 2025.
  6. [6] Decomposed Prompting: A Modular Approach for Solving Complex Tasks. arXiv preprint arXiv:2210.02406.
  7. [7] Faithful Question Answering with Monte-Carlo Planning. arXiv preprint arXiv:2305.02556.
  8. [8] Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for Knowledge-Intensive Question Answering. arXiv preprint arXiv:2308.13259.
  9. [9] Are Genomic Language Models All You Need? Exploring Genomic Language Models on Protein Downstream Tasks. Bioinformatics, 2024.
  10. [10] Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Transactions on Knowledge and Data Engineering, 2024.
  11. [11] Query Graph Generation for Answering Multi-Hop Complex Questions from Knowledge Bases. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  12. [12] RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering. arXiv preprint arXiv:2109.08678.
  13. [13] UniKGQA: Unified Retrieval and Reasoning for Solving Multi-Hop Question Answering over Knowledge Graph. arXiv preprint arXiv:2212.00959.
  14. [14] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems.
  15. [15] GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning. arXiv preprint arXiv:2405.20139.
  16. [16] EWEK-QA: Enhanced Web and Efficient Knowledge Graph Retrieval for Citation-Based Question Answering Systems. arXiv preprint arXiv:2406.10393.
  17. [17] OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models. arXiv preprint arXiv:2402.01739.
  18. [18] Mixtral of Experts. arXiv preprint arXiv:2401.04088.
  19. [19] Graph Attention Networks. arXiv preprint arXiv:1710.10903.
  20. [20] A New Model for Learning in Graph Domains. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, 2005.
  21. [21] How Attentive Are Graph Attention Networks? arXiv preprint arXiv:2105.14491.
  22. [22] Adaptive Mixtures of Local Experts. Neural Computation, 1991.
  23. [23] Hierarchical Mixture of Classification Experts Uncovers Interactions Between Brain Regions. Advances in Neural Information Processing Systems.
  24. [24] A Survey on Inference Optimization Techniques for Mixture of Experts Models. arXiv preprint arXiv:2412.14219.
  25. [25] The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs. Preprints.
  26. [26] Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems. arXiv preprint arXiv:1905.08743.
  27. [27] TripPy: A Triple Copy Strategy for Value Independent Neural Dialog State Tracking. arXiv preprint arXiv:2005.02877.
  28. [28] Description-Driven Task-Oriented Dialog Modeling. arXiv preprint arXiv:2201.08904.
  29. [29] Task-Optimized Adapters for an End-to-End Task-Oriented Dialogue System. arXiv preprint arXiv:2305.02468.
  30. [30] Find or Classify? Dual Strategy for Slot-Value Predictions on Multi-Domain Dialog State Tracking. Proceedings of the 9th International Workshop on Spoken Dialogue Systems Technology.
  31. [31] A Sequence-to-Sequence Approach to Dialogue State Tracking. 2021.
  32. [32] Wang, Yifan; Zhao, Jing; Bao, Junwei; Duan, Chaoqun; Wu, Youzheng; He, Xiaodong. LUNA: Learning Slot-Turn Alignment for Dialogue State Tracking. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022. doi:10.18653/v1/2022.naacl-main.242
  33. [33] SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation. 2022.
  34. [34] Bebensee, Björn. Span-Selective Linear Attention Transformers for Effective and Robust Schema-Guided Dialogue State Tracking. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.6
  35. [35] Lesci, Pietro; Fujinuma, Yoshinari; Hardalov, Momchil; Shang, Chao; Marquez, Lluis. Diable: Efficient Dialogue State Tracking as Operations on Tables. Findings of the Association for Computational Linguistics: ACL 2023. doi:10.18653/v1/2023.findings-acl.615
  36. [36] Hudeček, Vojtěch; Dusek, Ondrej. Are Large Language Models All You Need for Task-Oriented Dialogue? Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2023. doi:10.18653/v1/2023.sigdial-1.21
  37. [37] ChatGPT for Zero-Shot Dialogue State Tracking: A Solution or an Opportunity? arXiv preprint arXiv:2306.01386.
  38. [38] In-Context Learning for Few-Shot Dialogue State Tracking. arXiv preprint arXiv:2203.08568.
  39. [39] Inference is All You Need: Self Example Retriever for Cross-Domain Dialogue State Tracking with ChatGPT. arXiv preprint arXiv:2409.06243.
  40. [40] InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems. arXiv preprint arXiv:2310.08885.
  41. [41] Large Language Models as Zero-Shot Dialogue State Tracker through Function Calling. arXiv preprint arXiv:2402.10466.
  42. [42] GNN-RAG: Graph Neural Retrieval for Efficient Large Language Model Reasoning on Knowledge Graphs. Findings of the Association for Computational Linguistics: ACL 2025.
  43. [43] GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation. arXiv preprint arXiv:2502.01113.
  44. [44] Schema-Guided Multi-Domain Dialogue State Tracking with Graph Attention Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence.
  45. [45] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv preprint arXiv:1701.06538.
  46. [46] ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR).
  47. [47] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research.
  48. [48] KnowGPT: Knowledge Graph Based Prompting for Large Language Models. Advances in Neural Information Processing Systems.
  49. [49] MultiWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. arXiv preprint arXiv:1810.00278.
  50. [50] Embedding Models on Amazon Bedrock. 2023.
  51. [51] Sonnet 3.7 on Amazon Bedrock. 2023.
  52. [52] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  53. [53] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  54. [54] C-Pack: Packed Resources for General Chinese Embeddings. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  55. [55] Reimers, Nils; Gurevych, Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.