Self-Preference Bias in LLM-as-a-Judge
27 Pith papers cite this work. Polarity classification is still indexing.
abstract
Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.
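For reference, the perplexity the abstract refers to is conventionally the exponential of the mean negative log-likelihood per token, so lower values mean the model finds the text more predictable. The sketch below is only an illustration of that standard definition; the function name `perplexity` and the example log-probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(-mean log-probability), with natural-log inputs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probabilities for two candidate responses
# (illustrative values only, not data from the paper).
familiar = [math.log(0.5)] * 4    # every token assigned probability 0.5
unfamiliar = [math.log(0.1)] * 4  # every token assigned probability 0.1

print(perplexity(familiar))
print(perplexity(unfamiliar))
```

With these made-up inputs the first text has perplexity 2 and the second 10 (up to floating-point rounding), so the first is the more "familiar" text in the sense the abstract describes.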
citing papers explorer
- TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving a 60% success rate across replicates in cantilever and phone-stand tasks.
- Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Defines GEA validity criterion and reports first measurement of r=0.698 recovery with positive bias in LLM two-stage adaptive assessment, stronger for verifiable skills.
- MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
MCJudgeBench evaluates LLM judges at the constraint level with gold labels and inconsistency metrics, showing that overall performance does not ensure reliable detection of partial or no cases or stability under perturbations.
- Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
- Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Personalized LLM judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges trained on mixed histories.
- LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.
- How Independent are Large Language Models? A Statistical Framework for Auditing Behavioral Entanglement and Reweighting Verifier Ensembles
A new auditing framework reveals widespread behavioral entanglement among LLMs and shows that reweighting ensembles based on measured independence improves verification accuracy by up to 4.5%.
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.
- CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation
CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.
- CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval
CaseFacts benchmark of 6,294 claims shows LLMs struggle to verify colloquial legal statements against Supreme Court precedents, with unrestricted web search degrading performance due to noisy precedents.
- When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
- Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables
Stage-Audit raises source-frontier precision from 0.356 to 0.505 and F1 from 0.334 to 0.451 on a 51-instance cross-domain set by enforcing disjoint write rights and row-level source gates.
- Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
- When Routine Chats Turn Toxic: Unintended Long-Term State Poisoning in Personalized Agents
Routine user chats can unintentionally poison the long-term state of personalized LLM agents, causing authorization drift, tool escalation, and unchecked autonomy, as measured by a new benchmark and reduced by the StateGuard defense.
- Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls
Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
- StratMem-Bench: Evaluating Strategic Memory Use in Virtual Character Conversation Beyond Factual Recall
StratMem-Bench reveals that state-of-the-art LLMs distinguish required from irrelevant memories effectively but struggle to integrate supportive memories in character conversations.
- Learning to Control Summaries with Score Ranking
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
- Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
- Sentipolis: Emotion-Aware Agents for Social Simulations
Sentipolis equips LLM agents with continuous PAD emotional states, dual-speed dynamics, and memory coupling to improve emotional continuity and grounded behavior in social simulations.
- Extreme Self-Preference in Language Models
Eight LLMs exhibited massive self-preference that followed assigned identities rather than true ones, appearing in both simple word tasks and consequential evaluations of job candidates and AI technologies.
- LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
- Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Three LLMs exhibit distinct consistency profiles in repeated exercise prescription generation, with GPT-4.1 producing unique but semantically stable outputs while Gemini 2.5 Flash achieves high similarity through text duplication.
- Towards Self-Improving Error Diagnosis in Multi-Agent Systems
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.
- Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
Repeated generations of exercise prescriptions by an LLM showed high semantic consistency but notable variability in quantitative details such as exercise intensity.
- Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning
Fine-tuning LLMs on multi-source synthetic data mitigates distribution collapse and self-preference bias while increasing output quality relative to single-source or human-only fine-tuning.
- Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
- Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks