pith. machine review for the scientific record.

arxiv: 2306.05685 · v4 · submitted 2023-06-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM evaluation · human preferences · MT-Bench · Chatbot Arena · GPT-4 judge · chatbot evaluation · bias mitigation · pairwise comparison

The pith

Strong LLM judges like GPT-4 agree with human preferences over 80 percent of the time on chat model evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores the use of large language models as judges to assess chat assistants on open-ended, multi-turn questions where fixed benchmarks fall short. It documents biases in LLM judgments, such as favoring the first-listed answer or the longer one, and tests methods to reduce them. It introduces two new evaluation resources: MT-Bench, a set of multi-turn questions, and Chatbot Arena, a platform that collects crowdsourced pairwise votes. The central result is that capable models such as GPT-4 reach agreement rates above 80 percent with both controlled and crowdsourced human votes, matching the consistency observed between humans themselves. This positions LLM judges as a practical, scalable substitute for repeated rounds of human labeling.

Core claim

Strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement observed between humans. LLM-as-a-judge is therefore a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.

What carries the argument

LLM-as-a-judge, an approach in which a capable model such as GPT-4 scores or compares chat responses on multi-turn or head-to-head questions, paired with bias-mitigation steps like position swapping.
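
To make the mechanism concrete, here is a minimal sketch of position-swapped pairwise judging. The judge callable, the prompt wording, and the verdict labels are illustrative assumptions, not the paper's exact templates (those live in the authors' repository).

```python
# Minimal sketch of LLM-as-a-judge pairwise comparison with position
# swapping. The judge callable, prompt wording, and verdict labels are
# illustrative assumptions, not the paper's exact templates.
from typing import Callable

JUDGE_PROMPT = (
    "Please act as an impartial judge and decide which assistant answers "
    "the user's question better.\n"
    "[Question]\n{question}\n"
    "[Assistant A]\n{answer_a}\n"
    "[Assistant B]\n{answer_b}\n"
    "Reply with exactly one of: A, B, tie."
)

def judge_pair(judge: Callable[[str], str],
               question: str, answer_1: str, answer_2: str) -> str:
    """Query the judge twice with the answers in both orders.

    Returns '1', '2', or 'tie'. A verdict counts only if it survives the
    position swap; disagreement between the two orders becomes a tie.
    """
    first = judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_1, answer_b=answer_2))
    second = judge(JUDGE_PROMPT.format(
        question=question, answer_a=answer_2, answer_b=answer_1))
    swap = {"A": "B", "B": "A", "tie": "tie"}
    if first == swap.get(second, ""):
        # Consistent across both orders: map back to answer identities.
        return {"A": "1", "B": "2", "tie": "tie"}[first]
    return "tie"  # order-dependent verdicts are treated as a tie
```

Treating an order-dependent verdict as a tie is one conservative reading of the swap mitigation; re-querying or averaging scores are alternatives.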

If this is right

  • Strong LLMs can serve as scalable proxies for human evaluation of chat models without requiring new rounds of human votes for each model variant.
  • MT-Bench and Chatbot Arena supply open-ended conversation data that complements narrower traditional benchmarks.
  • Simple mitigation techniques such as swapping answer order and prompting for step-by-step reasoning improve the reliability of LLM judgments.
  • The resulting preference data and benchmarks can be reused to evaluate future model versions efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the agreement holds on previously unseen model families, the cost of iterative chat-model development could drop sharply by replacing most human labeling.
  • The method might extend to other open-ended tasks such as code review or creative writing, though new bias checks would be needed.
  • Continuous deployment of LLM judges could enable real-time quality monitoring of live chat systems.

Load-bearing premise

The human votes collected through Chatbot Arena and MT-Bench represent reliable, unbiased ground truth that generalizes to other models, questions, and evaluation conditions.

What would settle it

Collecting fresh human judgments on the same model pairs would directly test the central claim: if GPT-4 agreement fell substantially below the reported 80 percent level, the claim would not hold.
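
A hedged sketch of that computation, reusing the '1'/'2'/'tie' verdict encoding from the earlier sketch; the majority-vote aggregation and tie handling here are plausible choices, not necessarily the paper's exact rule.

```python
from collections import Counter

def agreement_rate(human_votes: dict[str, list[str]],
                   judge_verdicts: dict[str, str]) -> float:
    """Fraction of comparisons where the judge matches the human majority.

    human_votes maps a comparison id to the fresh human votes collected
    for it; judge_verdicts maps the same ids to the LLM judge's verdict.
    All verdicts use the '1' / '2' / 'tie' encoding. Counter.most_common
    breaks exact vote ties arbitrarily, one of several defensible rules.
    """
    matches = total = 0
    for pair_id, votes in human_votes.items():
        if pair_id not in judge_verdicts:
            continue  # no judge verdict for this comparison
        majority, _ = Counter(votes).most_common(1)[0]
        total += 1
        matches += judge_verdicts[pair_id] == majority
    return matches / total if total else 0.0
```

If a rerun on fresh votes returned a rate well below roughly 0.8, the central claim would be directly undermined.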

read the original abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that strong LLMs such as GPT-4 can function effectively as judges for open-ended LLM chat assistant evaluation. It introduces MT-Bench (multi-turn questions with expert human votes) and Chatbot Arena (crowdsourced pairwise battles), reports that GPT-4 achieves over 80% agreement with human preferences on both—matching the human-human agreement rate on the same data—while cataloging position, verbosity, and self-enhancement biases and proposing mitigations. The work also shows complementarity with traditional benchmarks via LLaMA/Vicuna variant evaluations and publicly releases the MT-Bench questions, 3K expert votes, and 30K conversations.

Significance. If the empirical agreement results hold, the work supplies a scalable, explainable proxy for human preference data that is otherwise expensive to collect at scale. The public release of the full vote and conversation datasets is a clear strength, directly enabling verification and extension by others. The approach usefully complements existing automatic benchmarks and supplies concrete observations on LLM-judge failure modes.

minor comments (3)
  1. [Abstract] The phrase 'over 80% agreement, the same level of agreement between humans' should be replaced by the exact reported percentages (with the human-human baseline value) so readers can immediately assess the numerical equivalence.
  2. [Section 3] In the MT-Bench and Chatbot Arena definitions, the precise formula or aggregation rule used to compute pairwise agreement (including tie handling and multi-turn scoring) is not stated explicitly; a short equation or pseudocode would remove the ambiguity without altering the central claim.
  3. [Section 4] The results tables omit confidence intervals or standard errors on the agreement rates; including them would strengthen the comparison to the human-human baseline without changing the reported point estimates. A bootstrap sketch follows this list.
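
On the third point, a percentile bootstrap over individual comparisons is one simple way to attach uncertainty to the agreement rates. A minimal sketch, assuming per-comparison match indicators and ignoring any clustering by question or model pair in the authors' data:

```python
import random

def bootstrap_ci(judge_matches: list[bool], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for an agreement rate.

    judge_matches[i] is True when the LLM judge agreed with the human
    verdict on comparison i. Comparisons are resampled independently,
    which ignores clustering -- a simplification, not the paper's method.
    """
    rng = random.Random(seed)
    n = len(judge_matches)
    rates = sorted(sum(rng.choices(judge_matches, k=n)) / n
                   for _ in range(n_boot))
    lo = rates[int(n_boot * alpha / 2)]
    hi = rates[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Running this on the released expert votes would show whether the GPT-4 and human-human agreement rates are statistically distinguishable.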

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of our work, including the recognition of MT-Bench and Chatbot Arena as scalable proxies for human preferences, the public dataset release, and the complementarity with traditional benchmarks. We appreciate the recommendation for minor revision and will address any editorial or minor points in the updated manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical study: it collects independent human preference data (3K expert votes on MT-Bench and 30K crowdsourced conversations on Chatbot Arena), measures agreement rates of LLM judges against those data, and reports that GPT-4 reaches >80% agreement, matching human-human agreement on the same data. No equations, fitted parameters, or first-principles derivations appear; the agreement metric is a direct count of matches between LLM outputs and the released human votes. Bias analysis is observational, mitigations are proposed separately, and no self-citation chain or uniqueness theorem is invoked to justify the central result. The benchmarks and data releases make the claim externally verifiable without reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations. The primary unverified cost is the assumption that collected human votes serve as reliable ground truth for model quality.

axioms (1)
  • domain assumption: Crowdsourced human votes on chatbot comparisons accurately reflect true user preferences for response quality
    This serves as the ground truth against which LLM judge agreement is measured.

pith-pipeline@v0.9.0 · 5573 in / 1263 out tokens · 73136 ms · 2026-05-10T18:48:24.408029+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  2. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    cs.CL 2023-08 unverdicted novelty 8.0

    LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

  3. Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

    cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

    LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

  4. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  5. ProactBench: Beyond What The User Asked For

    cs.LG 2026-05 unverdicted novelty 7.0

    ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

  6. Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

    cs.HC 2026-05 accept novelty 7.0

    LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

  7. GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection

    cs.LG 2026-05 unverdicted novelty 7.0

    GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.

  8. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  9. EditPropBench: Measuring Factual Edit Propagation in Scientific Manuscripts

    cs.CL 2026-05 unverdicted novelty 7.0

    EditPropBench evaluates LLM editors on propagating factual edits to dependent claims in synthetic scientific manuscripts, showing that even the strongest systems miss roughly 30% of required updates on hard cases.

  10. The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

    cs.LG 2026-05 unverdicted novelty 7.0

    An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.

  11. Subliminal Steering: Stronger Encoding of Hidden Signals

    cs.CL 2026-04 unverdicted novelty 7.0

    Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

  12. Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing

    cs.CR 2026-04 unverdicted novelty 7.0

    Agentic Witnessing enables privacy-preserving auditing of semantic properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.

  13. AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code

    cs.CR 2026-04 unverdicted novelty 7.0

    AsmRAG detects malware at 96% F1 and attributes families at 95% F1 by retrieving functionally similar assembly code via LLM embeddings and density-weighted anchor selection, remaining robust to metamorphic obfuscation.

  14. HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

    cs.AI 2026-04 unverdicted novelty 7.0

    HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.

  15. More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

    cs.AI 2026-04 unverdicted novelty 7.0

    Position bias scales positively with reasoning trajectory length in CoT models, shown by partial correlations and truncation interventions across multiple benchmarks and model scales.

  16. Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.

  17. Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

    cs.AI 2026-04 conditional novelty 7.0

    AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...

  18. The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

    cs.AI 2026-04 unverdicted novelty 7.0

    A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.

  19. CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

    cs.CL 2026-04 unverdicted novelty 7.0

    CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...

  20. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  21. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling

    cs.AI 2026-04 unverdicted novelty 7.0

    IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.

  22. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  23. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  24. LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment

    cs.AI 2026-04 unverdicted novelty 7.0

    LatentAudit monitors RAG faithfulness in real time via Mahalanobis distance on residual-stream activations, reaching 0.942 AUROC on PubMedQA with 0.77 ms overhead and supporting Groth16 verification.

  25. From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

    cs.IR 2026-03 unverdicted novelty 7.0

    Docling with hierarchical splitting reaches 94.1% RAG accuracy on domain documents, beating naive PDF loading but trailing manual Markdown curation at 97.1%.

  26. Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

    cs.CL 2026-03 unverdicted novelty 7.0

    A two-stage synthetic data generation method creates the CommonSyn dataset, allowing LLMs fine-tuned on it to produce more diverse and higher-quality commonsense responses than vanilla or human-data-trained models.

  27. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  28. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    cs.CL 2023-10 conditional novelty 7.0

    Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.

  29. WizardLM: Empowering large pre-trained language models to follow complex instructions

    cs.CL 2023-04 conditional novelty 7.0

    WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.

  30. Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

    cs.AI 2026-05 unverdicted novelty 6.0

    MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.

  31. Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.

  32. SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

    cs.SE 2026-05 unverdicted novelty 6.0

    SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

  33. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 6.0

    Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

  34. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  35. BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.

  36. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM's Hybrid-DPO fuses DeBERTa NLI and LLM verifier scores to deliver up to 6x higher NLI entailment than standard SFT while preserving answer coverage across academic domains.

  37. RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

    cs.CL 2026-05 unverdicted novelty 6.0

    RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.

  38. DocSync: Agentic Documentation Maintenance via Critic-Guided Reflexion

    cs.SE 2026-05 unverdicted novelty 6.0

    DocSync fuses AST-aware retrieval with an iterative critic loop to update documentation, outperforming CodeT5-base on semantic alignment and automated judge scores in a proxy code-to-text task.

  39. Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

    cs.AI 2026-05 unverdicted novelty 6.0

    The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...

  40. VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

    cs.CR 2026-05 conditional novelty 6.0

    Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

  41. Iterative Finetuning is Mostly Idempotent

    cs.AI 2026-05 unverdicted novelty 6.0

    Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.

  42. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  43. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  44. Simple Self-Conditioning Adaptation for Masked Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...

  45. Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation

    cs.AI 2026-04 unverdicted novelty 6.0

    SiPeR improves recommendation accuracy and response quality in situated conversations by estimating scene transitions and performing Bayesian inverse inference with multimodal LLMs.

  46. ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

    cs.AI 2026-04 unverdicted novelty 6.0

    ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...

  47. CreativeGame: Toward Mechanic-Aware Creative Game Generation

    cs.AI 2026-04 unverdicted novelty 6.0

    CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.

  48. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  49. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  50. Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

    cs.LG 2026-04 unverdicted novelty 6.0

    A calibrated three-model LLM jury scores medical diagnoses and clinical reasoning on real hospital cases with higher agreement to primary expert panels and fewer severe errors than human re-scoring panels.

  51. Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval

    cs.AI 2026-04 unverdicted novelty 6.0

    A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.

  52. TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection

    cs.AI 2026-04 unverdicted novelty 6.0

    TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts wh...

  53. In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

    cs.AI 2026-04 unverdicted novelty 6.0

    A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.

  54. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

    cs.AI 2026-04 unverdicted novelty 6.0

    AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...

  55. Can Large Language Models Reinvent Foundational Algorithms?

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM reinvention success reaches 50% with no hints and 90% with moderate hints on 10 algorithms, but complex cases like Strassen's algorithm require test-time RL and still fail without sufficient guidance.

  56. Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming

    cs.RO 2026-04 unverdicted novelty 6.0

    DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.

  57. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

    cs.AI 2026-04 unverdicted novelty 6.0

    Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.

  58. Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.

  59. DQA: Diagnostic Question Answering for IT Support

    cs.CL 2026-04 unverdicted novelty 6.0

    DQA maintains persistent diagnostic state and aggregates retrievals at the root-cause level to reach 78.7% success on 150 enterprise IT scenarios versus 41.3% for standard multi-turn RAG while cutting average turns fr...

  60. Simulating the Evolution of Alignment and Values in Machine Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 100 Pith papers · 17 internal anchors

  1. [1]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  3. [3]

    Position bias in multiple-choice questions

    Niels J Blunch. Position bias in multiple-choice questions. Journal of Marketing Research, 21(2):216–220, 1984

  4. [4]

    Evaluations of self and others: Self-enhancement biases in social judgments

    Jonathon D Brown. Evaluations of self and others: Self-enhancement biases in social judgments. Social cognition, 4(4):353–376, 1986

  5. [5]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  6. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    Can Large Language Models Be an Alternative to Human Evaluations?

    Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023

  8. [8]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  9. [9]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  10. [10]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  11. [11]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  12. [12]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

  13. [13]

    Lmflow: An extensible toolkit for finetuning and inference of large foundation models

    Shizhe Diao, Rui Pan, Hanze Dong, Ka Shun Shum, Jipeng Zhang, Wei Xiong, and Tong Zhang. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. arXiv preprint arXiv:2306.12420, 2023

  14. [14]

    Alpacafarm: A simulation framework for methods that learn from human feedback

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023

  15. [15]

    Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation

    Jiazhan Feng, Qingfeng Sun, Can Xu, Pu Zhao, Yaming Yang, Chongyang Tao, Dongyan Zhao, and Qingwei Lin. Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. arXiv preprint arXiv:2211.05719, 2022

  16. [16]

    Koala: A dialogue model for academic research

    Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. Koala: A dialogue model for academic research. Blog post, April 2023

  17. [17]

    Chatgpt outperforms crowd-workers for text-annotation tasks

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056, 2023

  18. [18]

    The False Promise of Imitating Proprietary LLMs

    Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023

  19. [19]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  20. [20]

    Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech

    Fan Huang, Haewoon Kwak, and Jisun An. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. arXiv preprint arXiv:2302.07736, 2023

  21. [21]

    Dynabench: Rethinking benchmarking in nlp

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in nlp. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110...

  22. [22]

    Look at the first sentence: Position bias in question answering

    Miyoung Ko, Jinhyuk Lee, Hyunjae Kim, Gangwoo Kim, and Jaewoo Kang. Look at the first sentence: Position bias in question answering. arXiv preprint arXiv:2004.14602, 2020

  23. [23]

    OpenAssistant Conversations – Democratizing Large Language Model Alignment

    Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, et al. OpenAssistant conversations – democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023

  24. [24]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  25. [25]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004

  26. [26]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  27. [27]

    The flan collection: Designing data and methods for effective instruction tuning

    Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023

  28. [28]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022

  29. [29]

    Evals is a framework for evaluating llms and llm systems, and an open-source registry of benchmarks

    OpenAI. Evals is a framework for evaluating llms and llm systems, and an open-source registry of benchmarks. https://github.com/openai/evals

  30. [30]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report, 2023

  31. [31]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  32. [32]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  33. [33]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

  34. [34]

    Center-of-inattention: Position biases in decision-making

    Priya Raghubir and Ana Valenzuela. Center-of-inattention: Position biases in decision-making. Organizational Behavior and Human Decision Processes, 99(1):66–80, 2006

  35. [35]

    Coqa: A conversational question answering challenge

    Siva Reddy, Danqi Chen, and Christopher D Manning. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019

  36. [36]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

  37. [37]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

  38. [38]

    Stanford Alpaca: An Instruction-Following LLaMA Model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  39. [39]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  40. [40]

    Large language models are not fair evaluators

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023

  41. [41]

    Position bias estimation for unbiased learning to rank in personal search

    Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 610–618, 2018

  42. [42]

    Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization, 2023

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization, 2023

  43. [43]

    How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

    Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A Smith, Iz Beltagy, et al. How far can camels go? Exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023

  44. [44]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2022

  45. [45]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022

  46. [46]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  47. [47]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  48. [48]

    WizardLM: Empowering Large Language Models to Follow Complex Instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

  49. [49]

    SkyPilot: An intercloud broker for sky computing

    Zongheng Yang, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, Frank Sifei Luan, Gautam Mittal, Scott Shenker, and Ion Stoica. SkyPilot: An intercloud broker for sky computing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 437–455, Boston, MA, April 2023. USENIX Association

  50. [50]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  51. [51]

    Agieval: A human-centric benchmark for evaluating foundation models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023

  52. [52]

    LIMA: Less Is More for Alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023