Recognition: 2 Lean theorem links
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Pith reviewed 2026-05-13 12:58 UTC · model grok-4.3
The pith
A multi-agent team of LLMs debates to evaluate generated text with human-like reliability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChatEval constructs a multi-agent referee team in which a group of LLMs autonomously discusses and evaluates the quality of responses generated by different models on open-ended questions and traditional natural language generation tasks. The claim is that this transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
What carries the argument
The multi-agent referee team that lets distinct LLMs exchange views and reach consensus on response quality.
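The referee-team mechanism can be sketched as a round-robin debate loop over a shared transcript. This is a minimal illustration, not the paper's exact protocol: the `Referee` stub, the turn order, and the average-score aggregation are all assumptions here, standing in for real model calls and the paper's own communication strategies.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical stand-in for an LLM referee; a real agent would call a model API.
@dataclass
class Referee:
    role: str
    bias: float  # stub: a fixed scoring tendency instead of a model call

    def critique(self, question, answer, history):
        return f"{self.role}: remark on '{answer[:20]}'"

    def score(self, question, answer, history):
        return 5.0 + self.bias  # stub score on a 1-10 scale

def debate_round(agents, question, answer, history):
    """One round: each referee reads the shared transcript, then speaks in turn."""
    for agent in agents:
        history.append((agent.role, agent.critique(question, answer, history)))
    return history

def chateval_sketch(agents, question, answer, n_rounds=2):
    """Run several debate rounds, then aggregate independent final scores."""
    history = []
    for _ in range(n_rounds):
        debate_round(agents, question, answer, history)
    # One simple aggregation after debate: the mean of the referees' scores.
    return mean(a.score(question, answer, history) for a in agents), history
```

The design point is that every referee conditions on the accumulated transcript, so later turns can react to earlier ones, which is exactly what a single-prompt evaluator cannot do.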
If this is right
- Evaluations on open-ended questions become more reliable without added human annotators.
- Standard NLG tasks receive assessments that capture subtleties single models often miss.
- The framework scales to intricate tasks by combining multiple models' strengths.
- Labor and time costs for large-scale text evaluation drop while consistency rises.
Where Pith is reading between the lines
- The debate logs could serve as richer training signals for improving the evaluated models themselves.
- Similar multi-agent structures might transfer to other LLM workflows such as planning or verification.
- Optimal team composition and discussion length remain open parameters that future runs could tune.
Load-bearing premise
Performance gains come from genuine collaboration among the agents rather than from simply making more calls to one model or using better single-prompt instructions.
What would settle it
A controlled test in which one LLM receives the full transcript of the multi-agent discussion and produces evaluation scores that match or exceed the team's accuracy against human judgments.
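Operationally, that settling test reduces to comparing two score vectors against human judgments. A minimal sketch, assuming the team's scores and the lone transcript-reader's scores have already been collected (the data shapes and the `settles_it` helper are illustrative, not from the paper):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def settles_it(human, team_scores, reader_scores):
    """True if a lone LLM reading the full debate transcript matches or
    beats the multi-agent team's correlation with human judgments."""
    return pearson(human, reader_scores) >= pearson(human, team_scores)
```

If `settles_it` comes back true, the transcript itself carries the value and the debate is a generation device rather than an evaluation device; if false, the interactive process contributes beyond its textual trace.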
Original abstract
Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ChatEval, a multi-agent debate framework in which multiple LLMs assume distinct referee roles, discuss, and collectively evaluate the quality of responses generated by other models on open-ended questions and standard NLG tasks. It claims that this collaborative process yields more reliable, human-mimicking assessments than conventional single-agent LLM prompting, with supporting experiments and publicly released code.
Significance. If the reported gains can be shown to arise specifically from agent interaction and role differentiation rather than from increased total inference budget, the work would offer a practical, scalable method for automated evaluation that reduces dependence on human annotators while preserving reliability. The public code release is a clear strength for reproducibility and follow-up research.
major comments (2)
- [Section 4] Experimental setup: No control condition equates the total LLM calls or token budget between ChatEval and the single-agent baselines. A single-agent variant that issues the same number of sequential or repeated calls (with concatenated history) is required to isolate whether improvements stem from the multi-agent debate structure rather than simply from aggregating more model outputs.
- [Section 4] Results and analysis: The abstract asserts that ChatEval provides superior human-mimicking evaluation, yet the description supplies no concrete quantitative metrics (e.g., Pearson/Spearman correlation with human judgments, win rates, or statistical significance tests) or explicit single-agent baselines with matched compute. This leaves the central superiority claim only moderately supported.
minor comments (2)
- [Section 3] The roles and interaction protocol of the referee team are described at a high level; a concise diagram or pseudocode of one debate round would improve the clarity of the multi-agent mechanism.
- [Section 2] Related-work discussion could more explicitly contrast ChatEval with prior multi-agent LLM frameworks (e.g., those using debate for reasoning) to highlight the novelty of the evaluation-specific application.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our experimental analysis. We address each major comment below.
Point-by-point responses
Referee: [Section 4] Experimental setup: No control condition equates the total LLM calls or token budget between ChatEval and the single-agent baselines. A single-agent variant that issues the same number of sequential or repeated calls (with concatenated history) is required to isolate whether improvements stem from the multi-agent debate structure rather than simply from aggregating more model outputs.
Authors: We acknowledge the importance of controlling for the total computational budget to ensure that the observed improvements are due to the multi-agent debate mechanism rather than increased inference calls. In the revised manuscript, we introduce a new single-agent baseline that performs an equivalent number of sequential LLM calls with concatenated history. Our updated experiments demonstrate that ChatEval maintains superior performance even under this matched-budget condition, thereby strengthening the evidence for the benefits of multi-agent interaction. revision: yes
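The matched-budget control the authors describe can be sketched as follows. `call_llm` is a hypothetical single-model interface; the point of the sketch is only that the call count and the concatenated-history accumulation mirror the multi-agent run, so any remaining gap is attributable to the debate structure.

```python
def single_agent_matched_budget(call_llm, question, answer, n_calls):
    """Single-agent control: issue the same number of calls as the
    multi-agent run, each call seeing all earlier turns concatenated."""
    history = []
    for i in range(n_calls):
        prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            + "Previous turns:\n" + "\n".join(history)
            + f"\nTurn {i + 1}: critique the answer, then give a 1-10 score."
        )
        history.append(call_llm(prompt))
    return history  # parse the final turn's score for evaluation
```

Token budgets grow with the concatenated history in both conditions, so matching call counts also roughly matches token spend; an exact token match would require truncating histories to the same length.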
Referee: [Section 4] Results and analysis: The abstract asserts that ChatEval provides superior human-mimicking evaluation, yet the description supplies no concrete quantitative metrics (e.g., Pearson/Spearman correlation with human judgments, win rates, or statistical significance tests) or explicit single-agent baselines with matched compute. This leaves the central superiority claim only moderately supported.
Authors: We appreciate this feedback on the presentation of results. While the full manuscript in Section 4 does include quantitative comparisons and human correlation metrics, we have revised the abstract to explicitly state key quantitative findings, including Pearson and Spearman correlations with human judgments, win rates against baselines, and statistical significance. Additionally, as noted in response to the first comment, we now include matched-compute single-agent baselines. These changes provide stronger support for the superiority claim. revision: yes
Circularity Check
No circularity in empirical multi-agent evaluation framework
Full rationale
The paper presents ChatEval as an empirical construction: a multi-agent debate setup for LLM-based evaluation on open-ended and NLG tasks. No equations, derivations, or fitted parameters appear in the provided text. Claims rest on experimental comparisons and public code rather than any self-referential reduction where a 'prediction' equals an input by construction. Self-citations are absent from the abstract and setup; the method does not invoke uniqueness theorems or smuggle ansatzes. This is a standard empirical proposal whose validity can be checked externally via replication, yielding a score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can effectively debate and reach consensus on text quality
invented entities (1)
- ChatEval referee team (no independent evidence)
Lean theorems connected to this paper
- Cost.FunctionalEquation.washburn_uniqueness_aczel — unclear
  unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
- Why Do Multi-Agent LLM Systems Fail?
  The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
- Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
  Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...
- EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
  EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
  AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
  AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
- AI-Gram: When Visual Agents Interact in a Social Network
  Autonomous visual AI agents spontaneously form image reply chains, maintain stable individual styles, and produce richer style-diverse conversations than single agents can achieve alone.
- Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
  OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
  Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
- LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents
  LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
- When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews
  Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
  LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
- Pact: A Choreographic Language for Agentic Ecosystems
  Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
- MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
  MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...
- TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems
  TeamFusion uses per-member proxy agents and iterative structured discussions to generate more representative and consensual team deliverables than direct aggregation in open-ended tasks.
- PARM: Pipeline-Adapted Reward Model
  PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
- SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
  SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
- Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
  HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
- Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
  LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
- A Survey on Large Language Model based Autonomous Agents
  A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
- Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
  Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...
- TRUST: A Framework for Decentralized AI Service v.0.1
  TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...
- Emergent Social Intelligence Risks in Generative Multi-Agent Systems
  Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level safeguards.
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
  Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
- Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
  Agentic AI needs social theory as structural priors in the MASS framework to model emergent dynamics from multi-agent interactions.
- Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
  Agentic AI requires social theory as a structural prior in the proposed MASS framework to model emergent outcomes from agent interactions and influence.
- BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection
  BLUEmed combines hybrid RAG with structured multi-agent debate and a safety filter to detect terminology substitution errors in clinical notes, reaching 69.13% accuracy under few-shot prompting and outperforming singl...
- Beyond Retrieval: Modeling Confidence Decay and Deterministic Agentic Platforms in Generative Engine Optimization
  Deterministic multi-agent intent routing can reduce hallucinations in generative engines to near zero by limiting LLMs to intent routers and handing off tasks to specialized agents.
- EMS: Multi-Agent Voting via Efficient Majority-then-Stopping
  EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.
- The Rise and Potential of Large Language Model Based Agents: A Survey
  The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.