arxiv: 2305.19118 · v4 · submitted 2023-05-30 · 💻 cs.CL

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Rui Wang, Shuming Shi, Tian Liang, Wenxiang Jiao, Xing Wang, Yan Wang, Yujiu Yang, Zhaopeng Tu, Zhiwei He

Pith reviewed 2026-05-13 23:56 UTC · model grok-4.3

keywords large language models · multi-agent debate · divergent thinking · degeneration of thought · self-reflection · reasoning tasks · commonsense translation · arithmetic reasoning

The pith

Large language models overcome stuck reasoning by having multiple agents argue tit-for-tat under a judge instead of reflecting alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-reflection causes LLMs to lock into an initial answer once they gain confidence, even when that answer is wrong, because later reflection fails to produce genuinely new ideas. The paper introduces a Multi-Agent Debate framework in which separate LLM agents present opposing arguments in a back-and-forth exchange while a judge LLM oversees the process and selects a final solution. Experiments on commonsense machine translation and counter-intuitive arithmetic reasoning show that this debate setup produces better results than reflection-based methods. The authors also find that debate performance depends on stopping at an adaptive point and keeping the level of disagreement moderate rather than extreme.

Core claim

The Multi-Agent Debate framework encourages divergent thinking in LLMs by placing multiple agents in a tit-for-tat argumentative state, with a judge managing the exchange to reach a final solution, thereby addressing the Degeneration-of-Thought problem that limits self-reflection on tasks requiring deep contemplation.

What carries the argument

The Multi-Agent Debate process in which LLM agents generate opposing arguments in a tit-for-tat dynamic and a separate judge LLM synthesizes them into a final answer.
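The control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `run_debate`, `debater_a`, `debater_b`, `judge`, and `final_call` are hypothetical, each agent is modeled as a plain callable rather than a real LLM client, and the two-debater setup and `max_rounds` cap are assumptions. The judge returning `None` stands in for "keep debating", which is one way to model the adaptive break.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not from the paper):
# a debater maps (question, transcript so far) to its next argument;
# a judge maps (question, transcript) to a final answer, or None to continue.
Debater = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], Optional[str]]


@dataclass
class DebateResult:
    answer: str
    rounds_used: int
    transcript: List[str] = field(default_factory=list)


def run_debate(
    question: str,
    debater_a: Debater,
    debater_b: Debater,
    judge: Judge,
    final_call: Debater,
    max_rounds: int = 3,
) -> DebateResult:
    """Alternate two debaters; let the judge stop early or decide at the cap."""
    transcript: List[str] = []
    for round_no in range(1, max_rounds + 1):
        # Each side sees the full transcript, so it can rebut the other side.
        transcript.append("A: " + debater_a(question, transcript))
        transcript.append("B: " + debater_b(question, transcript))
        verdict = judge(question, transcript)
        if verdict is not None:
            # Adaptive break: the judge is satisfied before the round cap.
            return DebateResult(verdict, round_no, transcript)
    # Round cap reached: force a decision from the full transcript.
    return DebateResult(final_call(question, transcript), max_rounds, transcript)
```

In a real setup each callable would wrap a prompted model call; keeping them as plain functions makes the loop easy to test with stubs and makes it simple to swap the judge model independently of the debaters.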

If this is right

MAD improves performance over self-reflection on commonsense machine translation and counter-intuitive arithmetic reasoning.
Effective MAD requires an adaptive stopping point for the debate and only a modest level of tit-for-tat intensity.
An LLM judge may not be fair when the debating agents are backed by different LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same debate structure could be tested on other tasks that reward considering multiple perspectives, such as planning or creative writing.
If the judge bias problem is confirmed, replacing the judge with a human or a rule-based aggregator becomes a direct next step.
Scaling the number of agents beyond the small groups tested here might increase the chance of surfacing overlooked alternatives.

Load-bearing premise

The judge LLM can evaluate and combine the agents' arguments fairly without itself becoming stuck in an initial view.

What would settle it

Apply the same MAD setup to the reported datasets and obtain accuracy no higher than self-reflection baselines, or observe the judge consistently favoring one agent's first position regardless of counter-arguments.
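The second check could be quantified roughly as follows. This is a sketch under stated assumptions: `DebateRecord` and `opening_position_win_rates` are hypothetical names, the record layout is invented for illustration, and answers are compared by exact string match, which would suit short arithmetic answers but not free-form translations.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class DebateRecord(NamedTuple):
    # Hypothetical log format: agent name -> that agent's opening answer.
    opening_answers: Dict[str, str]
    final_answer: str  # what the judge returned


def opening_position_win_rates(records: Iterable[DebateRecord]) -> Dict[str, float]:
    """Fraction of debates whose final answer equals each agent's opening answer."""
    records = list(records)
    if not records:
        return {}
    wins: Counter = Counter()
    agents = set()
    for record in records:
        for agent, opening in record.opening_answers.items():
            agents.add(agent)
            if opening == record.final_answer:
                wins[agent] += 1
    return {agent: wins[agent] / len(records) for agent in sorted(agents)}
```

A rate near 1.0 for one agent, especially on items where that agent's opening answer was wrong, would be the kind of pattern the paragraph above describes; the rate alone does not separate a biased judge from an agent that is simply right more often.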

Original abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper identifies the Degeneration-of-Thought (DoT) problem in self-reflection methods for LLMs on complex reasoning tasks. It proposes a Multi-Agent Debate (MAD) framework in which multiple agents engage in tit-for-tat arguments managed by a judge LLM to produce a final solution. Experiments on commonsense machine translation and counter-intuitive arithmetic reasoning datasets are reported to demonstrate effectiveness, with additional analyses on adaptive debate length and tit-for-tat intensity.

Significance. If the results hold under tighter controls, the MAD framework provides a concrete procedural approach to mitigating DoT and encouraging divergent thinking in LLMs. The open-sourced code and empirical evaluation on two challenging tasks constitute a useful contribution to the study of LLM reasoning strategies.

major comments (3)

[Abstract and Experiments] The central claim requires that the judge LLM synthesizes the debate without inheriting DoT bias. The manuscript notes unfairness when different LLMs are used for agents but does not report controls that hold the judge model fixed while varying agent diversity or initial stance strength (Abstract; Experiments section).
[Experiments] Statistical significance, exact baseline implementations, prompt sensitivity, and judge-bias controls are not detailed, leaving the reported gains on the two datasets difficult to interpret or reproduce (Experiments section).
[Analyses] The claim that an adaptive break and modest tit-for-tat level are required for good performance lacks quantitative thresholds or effect-size tables showing how performance degrades outside those regimes (Analyses section).

minor comments (1)

All prompts and exact debate templates should be included in an appendix to support reproducibility.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We believe the suggested revisions will significantly strengthen the paper by providing more rigorous controls and quantitative analyses. Below we respond point-by-point to the major comments.

Referee: [Abstract and Experiments] The central claim requires that the judge LLM synthesizes the debate without inheriting DoT bias. The manuscript notes unfairness when different LLMs are used for agents but does not report controls that hold the judge model fixed while varying agent diversity or initial stance strength (Abstract; Experiments section).

Authors: We thank the referee for highlighting this important aspect. While we observed that using different LLMs for agents can lead to unfair judgments, we agree that explicit controls holding the judge fixed are necessary to isolate the effect of agent diversity. In the revised manuscript, we will include additional experiments where the judge model is fixed (e.g., using GPT-4 as judge) and systematically vary the agent models and the strength of initial stances. This will test more directly whether the judge synthesizes without inheriting DoT bias. revision: yes
Referee: [Experiments] Statistical significance, exact baseline implementations, prompt sensitivity, and judge-bias controls are not detailed, leaving the reported gains on the two datasets difficult to interpret or reproduce (Experiments section).

Authors: We acknowledge the need for more rigorous reporting. In the revision, we will provide: (1) statistical significance tests (e.g., p-values from paired t-tests or bootstrap) for the performance gains; (2) exact prompt templates and baseline implementations with links to code; (3) analysis of prompt sensitivity by varying key prompt elements; and (4) additional judge-bias controls as mentioned above. These details will be added to the Experiments section to enhance reproducibility. revision: yes
Referee: [Analyses] The claim that an adaptive break and modest tit-for-tat level are required for good performance lacks quantitative thresholds or effect-size tables showing how performance degrades outside those regimes (Analyses section).

Authors: We agree that quantitative support would strengthen this claim. We will add effect-size tables and plots in the Analyses section showing performance as a function of debate length (number of rounds) and tit-for-tat intensity levels. This will include thresholds where performance degrades, such as when debate continues too long or tit-for-tat is too aggressive, leading to degeneration. revision: yes
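The bootstrap test mentioned in the second response could look roughly like the sketch below. It assumes per-item 0/1 correctness scores for a baseline and for MAD on the same test items; the function name `paired_bootstrap_not_better_rate`, the resample count, and the seed are illustrative choices, not anything specified by the paper or the rebuttal.

```python
import random
from typing import Sequence


def paired_bootstrap_not_better_rate(
    baseline: Sequence[float],
    system: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Share of paired resamples in which `system` does not beat `baseline`.

    Both inputs hold per-item scores (e.g. 1 for correct, 0 for wrong) on the
    same items in the same order. A small return value suggests the observed
    gain is unlikely to be an artifact of which items were sampled.
    """
    if len(baseline) != len(system) or not baseline:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(baseline)
    # Pairing: work on per-item differences so item difficulty cancels out.
    diffs = [s - b for s, b in zip(system, baseline)]
    not_better = 0
    for _ in range(n_resamples):
        resampled_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_total <= 0:
            not_better += 1
    return not_better / n_resamples
```

Resampling items with replacement while keeping each item's two scores together is what makes the test paired; for metrics that are not per-item averages (corpus-level translation scores, for instance), the resampled statistic would need to be recomputed per resample instead of summed.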

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines the MAD framework as a procedural multi-agent interaction with a judge, without any equations, fitted parameters, or mathematical derivations. Central claims rest on empirical results from two external datasets (commonsense machine translation and counter-intuitive arithmetic reasoning) compared to baselines, with no reduction of outputs to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner for the core argument; the DoT observation and MAD proposal are presented as independent contributions evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that structured disagreement among LLM instances produces net gains in reasoning quality; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Multiple LLM agents in tit-for-tat debate can generate novel thoughts that a single agent cannot produce via self-reflection.
Invoked to justify why the framework overcomes DoT; appears in the motivation and method description.
domain assumption: An LLM judge can reliably select the best solution from the debate transcript.
Required for the final output step; noted as potentially problematic when different LLMs are used.

pith-pipeline@v0.9.0 · 5585 in / 1265 out tokens · 43183 ms · 2026-05-13T23:56:00.948856+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
cs.CV 2026-05 unverdicted novelty 7.0

EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
Learning to Interrupt in Language-based Multi-agent Communication
cs.CL 2026-04 unverdicted novelty 7.0

HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
cs.MA 2026-05 unverdicted novelty 6.0

Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...
Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
cs.MA 2026-04 unverdicted novelty 6.0

Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.
Preregistered Belief Revision Contracts
cs.AI 2026-04 unverdicted novelty 6.0

PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
cs.AI 2026-04 unverdicted novelty 6.0

PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
Large Language Models Cannot Self-Correct Reasoning Yet
cs.CL 2023-10 unverdicted novelty 6.0

LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
Robust Multi-Agent LLMs under Byzantine Faults
cs.MA 2026-05 unverdicted novelty 5.0

SAC is a decentralized iterative filter-and-refine protocol that achieves (F+1)-robustness in LLM multi-agent systems, suppressing Byzantine influence and improving performance on reasoning benchmarks where prior meth...
When Independent Sampling Outperforms Agentic Reasoning
cs.LG 2026-05 unverdicted novelty 5.0

On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.
12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
cs.AI 2026-05 unverdicted novelty 5.0

Twelve LLM agents in a 12 Angry Men jury setup almost always end in hung juries due to anchoring, with Llama-4-Scout showing more vote changes than GPT-4o, suggesting RLHF alignment intensity limits deliberative flexibility.
Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
cs.MA 2026-04 unverdicted novelty 5.0

MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.
Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
cs.HC 2026-04 unverdicted novelty 5.0

AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL 2026-04 unverdicted novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
cs.LG 2026-03 unverdicted novelty 5.0

Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, genera...
Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures
cs.AI 2026-04 unverdicted novelty 4.0

A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
cs.LG 2026-04 unverdicted novelty 4.0

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance
cs.IR 2026-05 unverdicted novelty 3.0

A case-driven multi-agent system automates the full pipeline of bad-case detection, annotation, and resolution for e-commerce search relevance using Annotator, Optimizer, and User agents plus supporting components.

Reference graph

Works this paper leans on

286 extracted references · 286 canonical work pages · cited by 21 Pith papers · 7 internal anchors

[1]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Answering Questions by Meta-Reasoning over Multiple Chains of Thought , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2023
[3]

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity , author=. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[7]

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=

Solving General Arithmetic Word Problems , author=. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2015
[12]

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

work page
[13]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

work page
[15]

Advances in Neural Information Processing Systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[18]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

work page
[19]

Advances in Neural Information Processing Systems , volume=

Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

work page
[20]

Advances in Neural Information Processing Systems , volume=

Self-refine: Iterative refinement with self-feedback , author=. Advances in Neural Information Processing Systems , volume=

work page
[21]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Transactions of the Association for Computational Linguistics , volume=

Exploring human-like translation strategy with large language models , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=

work page 2024
[25]

International Conference on Machine Learning , pages=

The unreasonable effectiveness of few-shot learning for machine translation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[26]

Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction , author=. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[27]

CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Critic: Large language models can self-correct with tool-interactive critiquing , author=. arXiv preprint arXiv:2305.11738 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Advances in Neural Information Processing Systems , volume=

Eliciting thinking hierarchy without a prior , author=. Advances in Neural Information Processing Systems , volume=

work page
[30]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Question Answering as Programming for Solving Time-Sensitive Questions , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2023
[31]

The 61st Annual Meeting Of The Association For Computational Linguistics , year=

Solving Math Word Problems via Cooperative Reasoning induced Language Models , author=. The 61st Annual Meeting Of The Association For Computational Linguistics , year=

work page
[32]

Philosophical Explorations , volume=

Does reflection lead to wise choices? , author=. Philosophical Explorations , volume=. 2011 , publisher=

work page 2011
[33]

, author=

Metacognition and Reflection by Interdisciplinary Experts: Insights from Cognitive Science and Philosophy. , author=. Issues in Interdisciplinary Studies , volume=. 2017 , publisher=

work page 2017
[34]

On the reliability of watermarks for large language models,

On the reliability of watermarks for large language models , author=. arXiv preprint arXiv:2306.04634 , year=

work page arXiv
[35]

arXiv preprint arXiv:2308.10848 , year=

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents , author=. arXiv preprint arXiv:2308.10848 , year=

work page arXiv
[36]

Advances in Neural Information Processing Systems , volume=

Camel: Communicative agents for" mind" exploration of large language model society , author=. Advances in Neural Information Processing Systems , volume=

work page
[37]

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chateval: Towards better llm-based evaluators through multi-agent debate , author=. arXiv preprint arXiv:2308.07201 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

ChatDev: Communicative Agents for Software Development

Communicative agents for software development , author=. arXiv preprint arXiv:2307.07924 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Thinking, fast and slow , author=

work page
[40]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

Towards Making the Most of ChatGPT for Machine Translation , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

work page 2023
[41]

Transactions of the Association for Computational Linguistics , volume=

Lost in the middle: How language models use long contexts , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=

work page 2024
[42]

Proceedings of the Eighth Conference on Machine Translation , pages=

Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies , author=. Proceedings of the Eighth Conference on Machine Translation , pages=

work page 2023
[43]

Proceedings of the Sixth Conference on Machine Translation , pages=

Findings of the WMT shared task on machine translation using terminologies , author=. Proceedings of the Sixth Conference on Machine Translation , pages=

work page
[44]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Meta-cotgan: A meta cooperative training paradigm for improving adversarial text generation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[45]

Transactions on Machine Learning Research , year=

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. Transactions on Machine Learning Research , year=

work page
[46]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

work page 2023
[47]

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

Learning to solve arithmetic word problems with verb categorization , author=. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

work page 2014
[48]

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Confer...

work page 2023
[49]

Lisa Bortolotti. 2011. Does reflection lead to wise choices? Philosophical Explorations, 14(3):297--313

work page 2011
[50]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[51]

Kahneman Daniel. 2017. Thinking, fast and slow. Farrar, Straus and Giroux

work page 2017
[52]

Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246

work page arXiv 2023
[53]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142

work page arXiv 2023
[55]

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720

work page arXiv 2022
[56]

Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. In International Conference on Machine Learning, pages 10867--10878. PMLR

work page 2023
[57]

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing

work page 2023
[58]

Jie He, Tao Wang, Deyi Xiong, and Qun Liu. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.327 The box is in the pen: Evaluating commonsense reasoning in neural machine translation . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662--3672, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.findings-emnlp.327 2020
[59]

Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2024. Exploring human-like translation strategy with large language models. Transactions of the Association for Computational Linguistics, 12:229--246

work page 2024
[60]

Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210

work page arXiv 2023
[61]

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523--533

work page 2014
[62]

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745

work page arXiv 2023
[63]

Machiel Keestra. 2017. Metacognition and reflection by interdisciplinary experts: Insights from cognitive science and philosophy. Issues in Interdisciplinary Studies, 35:121--169

work page 2017
[64]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199--22213

work page 2022
[65]

Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. 2022. Eliciting thinking hierarchy without a prior. Advances in Neural Information Processing Systems, 35:13329--13341

work page 2022
[66]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157--173

work page 2024
[67]

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36

work page 2024
[68]

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1--22

work page 2023
[69]

Jonathan Pilault, Xavier Garcia, Arthur Bra z inskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computati...

work page 2023
[70]

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743--1752

work page 2015
[71]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36

[72]

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research

[73]

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003--13051

[74]

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926

[75]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

[76]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

[77]

Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. Chatgpt or grammarly? evaluating chatgpt on grammatical error correction benchmark. arXiv preprint arXiv:2303.13648

[78]

Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv preprint arXiv:2305.11595

[79]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36

[80]

Haiyan Yin, Dingcheng Li, Xu Li, and Ping Li. 2020. Meta-cotgan: A meta cooperative training paradigm for improving adversarial text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9466--9473

[81]

Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5942--5966

[82]

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493

[83]

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797

[84]

Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Jiaxing Zhang, Yujiu Yang, et al. 2023a. Solving math word problems via cooperative reasoning induced language models. In The 61st Annual Meeting of the Association for Computational Linguistics

[85]

Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023b. Question answering as programming for solving time-sensitive questions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12775--12790

[86]

Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. 2023c. Ghost in the minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144

[87]

Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024). 2024

[88]

Mathias Creutz. 2024. Correcting Challenging Finnish Learner Texts With Claude, GPT-3.5 and GPT-4 Large Language Models

[89]

Shuguang Chen, Leonardo Neves, and Thamar Solorio. 2024. Context-aware Adversarial Attack on Named Entity Recognition

[90]

Maja Popovic, Ekaterina Lapshinova-Koltunski, and Maarit Koponen. 2024. Effects of different types of noise in user-generated reviews on human and machine translations including ChatGPT

[91]

Anton Lavrouk, Ian Ligon, Jonathan Zheng, Tarek Naous, Wei Xu, and Alan Ritter. 2024. Stanceosaurus 2.0 - Classifying Stance Towards Russian and Spanish Misinformation

[92]

Kazi Elahi, Tasnuva Rahman, Shakil Shahriar, Samir Sarker, Md. Shawon, and G. M. Shibli. 2024. A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts

[93]

Baber Khalid, Shuyang Dai, Tara Taghavi, and Sungjin Lee. 2024. Label Supervised Contrastive Learning for Imbalanced Text Classification in Euclidean and Hyperbolic Embedding Spaces

[94]

Tyler Bikaun, Melinda Hodkiewicz, and Wei Liu. 2024. MaintNorm: A corpus and benchmark model for lexical normalisation and masking of industrial maintenance short text

