Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Pith reviewed 2026-05-13 23:56 UTC · model grok-4.3
The pith
Large language models overcome stuck reasoning by having multiple agents argue tit-for-tat under a judge instead of reflecting alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Multi-Agent Debate framework encourages divergent thinking in LLMs by placing multiple agents in a tit-for-tat argumentative state, with a judge managing the exchange to reach a final solution, thereby addressing the Degeneration-of-Thought problem that limits self-reflection on tasks requiring deep contemplation.
What carries the argument
The Multi-Agent Debate process in which LLM agents generate opposing arguments in a tit-for-tat dynamic and a separate judge LLM synthesizes them into a final answer.
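A minimal sketch of this loop, assuming each agent and the judge can be treated as plain callables. The names `run_debate`, `Agent`, `Judge`, and the `must_decide` flag are hypothetical and are not taken from the paper's released code; the toy agents return fixed strings in place of LLM calls. A judge that returns `None` stands for "keep debating", which is one possible way to model the adaptive break.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
# an agent maps (question, transcript so far) to its next argument;
# a judge maps (question, transcript, must_decide) to an answer or None.
Agent = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str], bool], Optional[str]]


def run_debate(question: str,
               agents: List[Tuple[str, Agent]],
               judge: Judge,
               max_rounds: int = 3) -> Tuple[str, int, List[str]]:
    """Return (final answer, rounds used, transcript)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    transcript: List[str] = []
    for round_no in range(1, max_rounds + 1):
        # Each agent sees everything said so far and replies in turn.
        for name, agent in agents:
            transcript.append(f"{name}: {agent(question, transcript)}")
        # The judge may stop early; on the last round it has to decide.
        must_decide = round_no == max_rounds
        verdict = judge(question, transcript, must_decide)
        if verdict is not None:
            return verdict, round_no, transcript
    raise RuntimeError("judge returned no verdict on the final round")


# Toy stand-ins so the sketch runs without any model API.
def affirmative(question: str, transcript: List[str]) -> str:
    return "The answer is 10."


def negative(question: str, transcript: List[str]) -> str:
    return "I disagree: the answer is 5."


def toy_judge(question: str, transcript: List[str],
              must_decide: bool) -> Optional[str]:
    # Let two rounds play out, then side with the most recent speaker.
    if len(transcript) < 4 and not must_decide:
        return None
    return transcript[-1].split(": ", 1)[1]


answer, rounds, log = run_debate(
    "toy question",
    [("affirmative", affirmative), ("negative", negative)],
    toy_judge,
)
```

In a real setup the two toy agents and `toy_judge` would be replaced by model calls; keeping the judge as a separate callable makes it straightforward to swap in a different model, a human, or a rule-based aggregator.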
If this is right
- MAD improves performance over self-reflection on commonsense machine translation and counter-intuitive arithmetic reasoning.
- Effective MAD requires an adaptive stopping point for the debate and only a modest level of tit-for-tat intensity.
- Using different LLMs for agents versus judge can produce biased synthesis.
Where Pith is reading between the lines
- The same debate structure could be tested on other tasks that reward considering multiple perspectives, such as planning or creative writing.
- If the judge bias problem is confirmed, replacing the judge with a human or a rule-based aggregator becomes a direct next step.
- Scaling the number of agents beyond the small groups tested here might increase the chance of surfacing overlooked alternatives.
Load-bearing premise
The judge LLM can evaluate and combine the agents' arguments fairly without itself becoming stuck in an initial view.
What would settle it
Apply the same MAD setup to the reported datasets and obtain accuracy no higher than self-reflection baselines, or observe the judge consistently favoring one agent's first position regardless of counter-arguments.
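The two failure signals named here can be tallied per example. The sketch below is an illustration under assumptions: the `Trial` record, its field names, and the `summarize` helper are hypothetical, and the four trials are made-up toy values, not results from the paper.

```python
from typing import Dict, List, NamedTuple


class Trial(NamedTuple):
    """One evaluated example (hypothetical record layout)."""
    mad_correct: bool                # MAD's final answer was correct
    reflect_correct: bool            # self-reflection baseline was correct
    judge_kept_first_position: bool  # verdict matched the first agent's opening answer


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Compute both accuracies and the judge's first-position rate."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    return {
        "mad_accuracy": sum(t.mad_correct for t in trials) / n,
        "reflect_accuracy": sum(t.reflect_correct for t in trials) / n,
        "first_position_rate": sum(t.judge_kept_first_position for t in trials) / n,
    }


# Made-up toy data, for illustration only.
trials = [
    Trial(True, True, True),
    Trial(True, False, False),
    Trial(True, False, True),
    Trial(False, False, False),
]
summary = summarize(trials)
```

A MAD accuracy no higher than the self-reflection accuracy, or a first-position rate close to 1.0 even when counter-arguments were correct, would correspond to the two outcomes described above.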
Original abstract
Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies the Degeneration-of-Thought (DoT) problem in self-reflection methods for LLMs on complex reasoning tasks. It proposes a Multi-Agent Debate (MAD) framework in which multiple agents engage in tit-for-tat arguments managed by a judge LLM to produce a final solution. Experiments on commonsense machine translation and counter-intuitive arithmetic reasoning datasets are reported to demonstrate effectiveness, with additional analyses on adaptive debate length and tit-for-tat intensity.
Significance. If the results hold under tighter controls, the MAD framework provides a concrete procedural approach to mitigating DoT and encouraging divergent thinking in LLMs. The open-sourced code and empirical evaluation on two challenging tasks constitute a useful contribution to the study of LLM reasoning strategies.
major comments (3)
- [Abstract and Experiments] The central claim requires that the judge LLM synthesizes the debate without inheriting DoT bias. The manuscript notes unfairness when different LLMs are used for agents but does not report controls that hold the judge model fixed while varying agent diversity or initial stance strength (Abstract; Experiments section).
- [Experiments] Statistical significance, exact baseline implementations, prompt sensitivity, and judge-bias controls are not detailed, leaving the reported gains on the two datasets difficult to interpret or reproduce (Experiments section).
- [Analyses] The claim that an adaptive break and modest tit-for-tat level are required for good performance lacks quantitative thresholds or effect-size tables showing how performance degrades outside those regimes (Analyses section).
minor comments (1)
- All prompts and exact debate templates should be included in an appendix to support reproducibility.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We believe the suggested revisions will significantly strengthen the paper by providing more rigorous controls and quantitative analyses. Below we respond point-by-point to the major comments.
- Referee: [Abstract and Experiments] The central claim requires that the judge LLM synthesizes the debate without inheriting DoT bias. The manuscript notes unfairness when different LLMs are used for agents but does not report controls that hold the judge model fixed while varying agent diversity or initial stance strength (Abstract; Experiments section).
  Authors: We thank the referee for highlighting this important aspect. While we observed that using different LLMs for agents can lead to unfair judgments, we agree that explicit controls holding the judge fixed are necessary to isolate the effect of agent diversity. In the revised manuscript, we will include additional experiments where the judge model is fixed (e.g., using GPT-4 as judge) and systematically vary the agent models and the strength of initial stances. This will provide clearer evidence that the judge synthesizes without inheriting DoT bias.
  revision: yes
- Referee: [Experiments] Statistical significance, exact baseline implementations, prompt sensitivity, and judge-bias controls are not detailed, leaving the reported gains on the two datasets difficult to interpret or reproduce (Experiments section).
  Authors: We acknowledge the need for more rigorous reporting. In the revision, we will provide: (1) statistical significance tests (e.g., p-values from paired t-tests or bootstrap) for the performance gains; (2) exact prompt templates and baseline implementations with links to code; (3) analysis of prompt sensitivity by varying key prompt elements; and (4) additional judge-bias controls as mentioned above. These details will be added to the Experiments section to enhance reproducibility.
  revision: yes
- Referee: [Analyses] The claim that an adaptive break and modest tit-for-tat level are required for good performance lacks quantitative thresholds or effect-size tables showing how performance degrades outside those regimes (Analyses section).
  Authors: We agree that quantitative support would strengthen this claim. We will add effect-size tables and plots in the Analyses section showing performance as a function of debate length (number of rounds) and tit-for-tat intensity levels. This will include thresholds where performance degrades, such as when debate continues too long or tit-for-tat is too aggressive, leading to degeneration.
  revision: yes
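The bootstrap test mentioned in the second response could look roughly like the following. This is a sketch of one common paired-bootstrap variant, not the authors' procedure: the function name `paired_bootstrap`, the resample count, and the seed are assumptions, and the inputs are per-example correctness flags (1 or 0) for MAD and for a baseline on the same examples.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(mad: Sequence[int],
                     base: Sequence[int],
                     n_resamples: int = 2000,
                     seed: int = 0) -> Tuple[float, float]:
    """Return (observed accuracy gain, share of resamples with gain <= 0)."""
    if len(mad) != len(base) or len(mad) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    n = len(mad)
    # Per-example difference keeps the pairing between the two systems.
    diffs = [int(m) - int(b) for m, b in zip(mad, base)]
    observed = sum(diffs) / n
    non_positive = 0
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute the gain.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            non_positive += 1
    return observed, non_positive / n_resamples


# Toy flags, for illustration only: MAD right on every example, baseline on none.
gain, share_non_positive = paired_bootstrap([1] * 10, [0] * 10)
```

A small share of non-positive resamples suggests the gain is unlikely to be an artifact of which examples were sampled; the same function could be run once per debate length or tit-for-tat level to fill the effect-size tables the third response promises.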
Circularity Check
No significant circularity in derivation chain
The paper defines the MAD framework as a procedural multi-agent interaction with a judge, without any equations, fitted parameters, or mathematical derivations. Central claims rest on empirical results from two external datasets (commonsense machine translation and counter-intuitive arithmetic reasoning) compared to baselines, with no reduction of outputs to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner for the core argument; the DoT observation and MAD proposal are presented as independent contributions evaluated externally.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Multiple LLM agents in tit-for-tat debate can generate novel thoughts that a single agent cannot produce via self-reflection.
- Domain assumption: An LLM judge can reliably select the best solution from the debate transcript.
Forward citations
Cited by 21 Pith papers
- EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
  EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
- When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
  Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.
- Learning to Interrupt in Language-based Multi-agent Communication
  HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
- Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
  Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
- The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
  Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...
- Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation
  Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.
- Preregistered Belief Revision Contracts
  PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
- PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage
  PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
- Large Language Models Cannot Self-Correct Reasoning Yet
  LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
- Robust Multi-Agent LLMs under Byzantine Faults
  SAC is a decentralized iterative filter-and-refine protocol that achieves (F+1)-robustness in LLM multi-agent systems, suppressing Byzantine influence and improving performance on reasoning benchmarks where prior meth...
- When Independent Sampling Outperforms Agentic Reasoning
  On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.
- 12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
  Twelve LLM agents in a 12 Angry Men jury setup almost always end in hung juries due to anchoring, with Llama-4-Scout showing more vote changes than GPT-4o, suggesting RLHF alignment intensity limits deliberative flexibility.
- Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
  MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.
- Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
  AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
- Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
  APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
- Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
  FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
- Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models
  Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, genera...
- Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures
  A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.
- Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
  HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
- A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance
  A case-driven multi-agent system automates the full pipeline of bad-case detection, annotation, and resolution for e-commerce search relevance using Annotator, Optimizer, and User agents plus supporting components.