A Survey on LLM-as-a-Judge
Pith reviewed 2026-05-23 17:30 UTC · model grok-4.3
The pith
LLMs can provide scalable evaluations for complex tasks when strategies address consistency and bias issues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper states that reliable LLM-as-a-Judge systems are achievable by combining strategies for consistency improvement, bias mitigation, and scenario adaptation, together with new evaluation methodologies and a novel benchmark that measures judge reliability.
What carries the argument
The LLM-as-a-Judge approach, carried by targeted reliability strategies and a novel benchmark that quantifies consistency and bias.
If this is right
- LLMs become practical substitutes for expert human evaluators in high-volume or subjective domains.
- Standardized reliability checks can be applied before deploying any LLM judge.
- Applications in real decision systems become viable once bias levels fall below acceptable thresholds.
- Research can shift from basic feasibility to refining the identified strategies for specific tasks.
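The reliability checks mentioned in the list above could take many forms. The following is a minimal Python sketch of two of them, not the survey's benchmark or method: repeated-run self-consistency and a position-swap check for pairwise judging. The names `Judge`, `self_consistency`, `position_flip_rate`, and the two toy judges are hypothetical and stand in for a real LLM call.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumption: a judge takes (question, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str], str]
Item = tuple[str, str, str]


def self_consistency(judge: Judge, item: Item, runs: int = 5) -> float:
    """Fraction of repeated runs that agree with the most common verdict."""
    verdicts = [judge(*item) for _ in range(runs)]
    return Counter(verdicts).most_common(1)[0][1] / runs


def position_flip_rate(judge: Judge, items: Sequence[Item]) -> float:
    """Fraction of items whose preferred answer changes when the order is swapped."""
    inconsistent = 0
    for question, a, b in items:
        first = judge(question, a, b)
        swapped = judge(question, b, a)
        # A position-independent judge keeps preferring the same underlying
        # answer, so its label must flip between the two orderings.
        if (first, swapped) not in {("A", "B"), ("B", "A")}:
            inconsistent += 1
    return inconsistent / len(items)


def always_first(question: str, a: str, b: str) -> str:
    """Toy judge with maximal position bias: always prefers the first slot."""
    return "A"


def longer_wins(question: str, a: str, b: str) -> str:
    """Toy judge that prefers the longer answer regardless of position."""
    return "A" if len(a) >= len(b) else "B"


items = [("q1", "short", "a longer answer"), ("q2", "detailed reply", "ok")]
print(position_flip_rate(always_first, items))  # 1.0
print(position_flip_rate(longer_wins, items))   # 0.0
print(self_consistency(longer_wins, items[0]))  # 1.0
```

In practice the toy judges would be replaced by a wrapper around the model under test, and any acceptance threshold on these rates is a deployment decision that the source does not specify.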
Where Pith is reading between the lines
- Adoption of the benchmark could create a common test set that all future LLM-judge papers must report against.
- The survey's emphasis on bias mitigation suggests similar techniques might transfer to other LLM uses such as content moderation.
- If the benchmark covers only certain task types, extensions to multi-modal or long-context judging would be natural next steps.
- Real-world teams could run the benchmark on their chosen LLM before integrating it into production evaluation pipelines.
Load-bearing premise
The surveyed papers represent the full range of work on the topic and the new benchmark measures true reliability without its own selection biases.
What would settle it
An independent test showing that LLM judges still produce inconsistent or biased results on the proposed benchmark even after applying all the surveyed consistency and bias-mitigation strategies.
original abstract
Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey examines LLM-as-a-Judge systems for evaluating complex tasks. It claims that reliable systems can be built via strategies for improving consistency, mitigating biases, and adapting to diverse scenarios; proposes evaluation methodologies supported by a novel benchmark; and discusses applications, challenges, and future directions. The central question addressed is how to construct reliable LLM-as-a-Judge systems.
Significance. If the surveyed works form a representative sample and the novel benchmark supplies a generalizable, unbiased measure of judge reliability, the synthesis of strategies plus the benchmark could provide a useful reference for standardizing LLM-based evaluations. The work explicitly compiles external literature without derivations or fitted parameters.
major comments (2)
- [Abstract] Abstract and introduction: the headline claim that the survey is 'comprehensive' and that the 'novel benchmark' is 'designed for this purpose' is load-bearing for the central thesis, yet no search protocol, inclusion/exclusion criteria, or coverage statistics are supplied; without these the representativeness of the synthesized strategies cannot be assessed.
- Benchmark section (wherever the novel benchmark is introduced): the abstract states the benchmark supports 'methodologies for evaluating the reliability of LLM-as-a-Judge systems,' but provides no validation details, error analysis, task-coverage justification, or comparison against existing benchmarks; this directly affects whether the proposed methodologies can be treated as generalizable.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
point-by-point responses
-
Referee: [Abstract] Abstract and introduction: the headline claim that the survey is 'comprehensive' and that the 'novel benchmark' is 'designed for this purpose' is load-bearing for the central thesis, yet no search protocol, inclusion/exclusion criteria, or coverage statistics are supplied; without these the representativeness of the synthesized strategies cannot be assessed.
Authors: We agree that a transparent literature search protocol is necessary to substantiate the claim of comprehensiveness. In the revised manuscript we will add a dedicated subsection (likely in Section 2 or the introduction) that specifies the search strategy, databases queried, keywords and time range, explicit inclusion/exclusion criteria, and basic coverage statistics (e.g., number of papers screened versus retained). This addition will allow readers to evaluate the representativeness of the synthesized reliability strategies.
revision: yes
-
Referee: Benchmark section (wherever the novel benchmark is introduced): the abstract states the benchmark supports 'methodologies for evaluating the reliability of LLM-as-a-Judge systems,' but provides no validation details, error analysis, task-coverage justification, or comparison against existing benchmarks; this directly affects whether the proposed methodologies can be treated as generalizable.
Authors: We acknowledge that the current presentation of the novel benchmark lacks the supporting analyses required to establish its generalizability. In the revision we will expand the benchmark section to include: (i) validation procedures and results, (ii) error analysis across tasks, (iii) explicit justification for task selection and coverage, and (iv) side-by-side comparisons with prior benchmarks. These additions will directly support the claim that the benchmark enables generalizable evaluation methodologies.
revision: yes
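One common way to report the kind of validation promised in item (i) is chance-corrected agreement between judge verdicts and human labels. The sketch below computes Cohen's kappa from scratch in Python. It is an illustration under assumptions, not the authors' validation procedure: the function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge)
    # Observed agreement: share of items where both raters give the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    judge_freq, human_freq = Counter(judge), Counter(human)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label, so kappa is undefined;
        # returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four items.
judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_labels, human_labels))  # 0.5
```

Here the raters agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore 0.5. Reporting kappa alongside raw agreement makes it harder for a skewed label distribution to inflate the apparent reliability of a judge.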
Circularity Check
No circularity: survey compiles external literature with independent benchmark proposal
full rationale
This is a survey paper whose core contribution is a synthesis of external works plus a proposal of evaluation methodologies and a novel benchmark. It contains no derivations, equations, fitted parameters, or predictions that could reduce to inputs by construction. The abstract and structure reference external literature and a new benchmark without self-definitional loops, fitted-input predictions, or load-bearing self-citations that would collapse the claims. As a literature review, the paper is self-contained against external benchmarks, warranting a score of 0.
Forward citations
Cited by 60 Pith papers
-
FollowTable: A Benchmark for Instruction-Following Table Retrieval
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...
-
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
-
GIANTS: Generative Insight Anticipation from Scientific Literature
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
-
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
-
GS-QA: A Benchmark for Geospatial Question Answering
GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.
-
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, a...
-
Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.
-
Recall Isn't Enough: Bounding Commitments in Personalized Language Systems
CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
-
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
-
Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering
Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.
-
BIM Information Extraction Through LLM-based Adaptive Exploration
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
-
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
-
PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
Depression patient simulators produce overly long, low-variability responses that resolve emotions too quickly along a uniform trajectory, with framework choice outweighing model scale.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
-
Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
MuDABench provides 332 analytical QA instances over large semi-structured document collections, showing standard RAG performs poorly while a multi-agent workflow with planning, extraction, and code generation improves...
-
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic pe...
-
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
-
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
-
LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
A controlled LLM pipeline generates synthetic French OSCE transcripts with varying skill levels and evaluates them, with mid-size models achieving ~90% accuracy matching GPT-4o on the synthetic data.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models
PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.
-
When Negation Is a Geometry Problem in Vision-Language Models
A direction associated with negation exists in CLIP embedding space and can be steered at test time via representation engineering to produce negation-aware outputs without fine-tuning.
-
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
-
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
-
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
-
VIDEOP2R: Video Understanding from Perception to Reasoning
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
-
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step inter...
-
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.
-
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models
Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
-
Bayesian Social Deduction with Graph-Informed Language Models
Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
-
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
-
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.
-
Towards Context-Invariant Safety Alignment for Large Language Models
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
-
Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
Evaluation of 6233 MedGPTs finds 25-30% with low factual accuracy, 33.6-54.3% violating operational thresholds, and 57% of action-enabled models lacking privacy disclosures.
-
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
-
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
A paraphrase-robust clustering pipeline plus XGBoost classifier identifies refactoring-worthy step subsequences in large BDD test corpora with out-of-fold F1 0.891, outperforming rule baselines and LLM judges.
-
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.
-
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
-
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
-
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
-
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
-
Shadow-Loom: Causal Reasoning over Graphical World Models of Narratives
Shadow-Loom builds graphical world models from stories to enable code-based causal reasoning and structural scoring of narrative effects such as mystery, irony, suspense, and surprise.
-
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
-
MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...
-
A Survey on LLM-based Conversational User Simulation
A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.
-
Exploring Audio Hallucination in Egocentric Video Understanding
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
-
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
-
Evian: Towards Explainable Visual Instruction-tuning Data Auditing
EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.
-
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.
-
Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs
GLOW integrates a pre-trained GNN for candidate prediction with an LLM for joint symbolic-semantic reasoning over incomplete KGs, reporting up to 53.3% gains on standard benchmarks and a new GLOW-BENCH dataset.
-
Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation
MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.
-
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
-
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts wh...
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
Reference graph
Works this paper leans on
- [1]
-
[2]
Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu
-
[3]
L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088 (2023)
-
[4]
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. 2024. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. ArXiv preprint abs/2410.09024 (2024). https://arxiv.org/abs/2410.09024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [5]
-
[6]
Golnoosh Babaei and Paolo Giudici. 2024. GPT classifications, with application to credit lending. Machine Learning with Applications 16 (2024), 100534
work page 2024
- [7]
-
[8]
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. 2024. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. arXiv preprint arXiv:2412.15204 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an-Examiner. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2...
work page 2023
- [10]
-
[11]
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Confere...
-
[12]
Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. 2024. Guiding LLMs the right way: fast, non-invasive constrained generation. InProceedings of the 41st International Conference on Machine Learning (ICML’24,Vol.235). JMLR.org, Vienna, Austria, 3658–3673
work page 2024
-
[13]
Nathan Brake and Thomas Schaaf. 2024. Comparing Two Model Designs for Clinical Note Generation: Is an LLM a Useful Evaluator of Consistency? Findings of the ACL (2024)
work page 2024
- [14]
-
[15]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
work page 2020
-
[16]
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chat- Eval: Towards Better LLM-based Evaluators through Multi-Agent Debate. InThe Twelfth International Conference on Learning Representations
work page 2023
-
[17]
David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. 2023. CLAIR: Evaluating Image Captions with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 13638–13646. doi:1...
-
[18]
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=dbFEFHAD79
work page 2024
-
[19]
Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. 2024. Data-juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data. 120–134
work page 2024
- [20]
-
[21]
Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. 2024. Automated evaluation of large vision-language models on self-driving corner cases. ArXiv preprint abs/2404.10595 (2024). https://arxiv.org/abs/2404.10595
-
[22]
Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. 2023. Evaluating hallucinations in chinese large language models. ArXiv preprint abs/2310.03368 (2023). https://arxiv.org/abs/2310.03368
-
[23]
Inyoung Cheong, King Xia, KJ Kevin Feng, Quan Ze Chen, and Amy X Zhang. 2024. (A) I Am Not a Lawyer, But...: Engaging Legal Experts towards Responsible LLM Policies for Legal Advice. InThe 2024 ACM Conference on Fairness, Accountability, and Transparency. 2454–2469
work page 2024
-
[24]
Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=3Pf3Wg6o-A4
work page 2023
- [25]
- [26]
-
[27]
Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural Retrievers are Biased Towards LLM-Generated Content. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Barcelona, Spain) (KDD ’24). Association for Computing Machinery, New York, NY, USA, 526–537. doi:1...
-
[28]
MRSB DATA. 2024. Multimodal artificial intelligence foundation models: Unleashing the power of remote sensing big data in earth observation. Innovation 2, 1 (2024), 100055
work page 2024
-
[29]
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv preprint arXiv:2304.06767 (2023). https://arxiv.org/abs/2304.06767
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. 2024. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. doi:10.48550/arXiv.2411.15100 arXiv:2411.15100 [cs]
- [31]
-
[32]
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Associ...
-
[33]
Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS...
-
[34]
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as You Desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Lingui...
-
[35]
Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics (2024), 1–79.
-
[36]
Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics 50, 3 (2024), 1097–1179
- [37]
-
[38]
Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In International Conference on Machine Learning. PMLR, 10835–10866
- [39]
-
[40]
Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. 2023. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapor...
-
[41]
Google. 2023. Gemini: a family of highly capable multimodal models. ArXiv preprint abs/2312.11805 (2023). https://arxiv.org/abs/2312.11805
-
[42]
Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John J. Nay, Jonathan H. Choi, K...
-
[43]
Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, and Shuo Shuo Liu. 2024. Bias in large language models: Origin, evaluation, and mitigation. arXiv preprint arXiv:2411.10915 (2024)
-
[44]
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 8154–8173. doi:1...
-
[45]
Hangfeng He, Hongming Zhang, and Dan Roth. 2024. SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 2736–2764. https://a...
-
[46]
Shijun He, Fan Yang, Jian-ping Zuo, and Ze-min Lin. 2023. ChatGPT for scientific paper writing—promises and perils. The Innovation 4, 6 (2023)
-
[47]
Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, and Saeed Hassanpour. A Benchmark for Long-Form Medical Question Answering. In Proceedings of EMNLP
-
[49]
Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, and Xiaojun Wan. 2024. Are LLM-based Evaluators Confusing NLG Quality Criteria?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9530–9570. https://aclanthology.org/2024.acl-long.516
-
[50]
Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Wenbo Su, Bo Zheng, and Jiaheng Liu. 2025. Think-j: Learning to think for generative llm-as-a-judge. arXiv preprint arXiv:2505.14268 (2025)
- [52]
-
[53]
Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 1049–1065. doi:10.18653/v1/2023.findings-acl.67
-
[54]
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. 2023. On the humanity of conversational ai: Evaluating the psychological portrayal of llms. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=H3UayAQWoE
-
[55]
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. ArXiv preprint abs/2309.00614 (2023). https://arxiv.org/abs/2309.00614
-
[56]
Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, and Jaewoo Kang. 2024. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40, Supplement_1 (2024), i119–i129
-
[57]
Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Weijie J Su, Camillo Jose Taylor, and Tanwi Mallick. 2024. Multi-modal and multi-agent systems meet rationality: A survey. In ICML 2024 Workshop on LLMs and Cognition
-
[58]
Theodore T. Jiang, Li Fang, and Kai Wang. 2023. Deciphering “the language of nature”: A transformer-based language model for deleterious mutations in proteins. The Innovation 4, 5 (2023), 100487. doi:10.1016/j.xinn.2023.100487
-
[59]
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun. 2024. A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Kevin Duh, Helena Gomez, and Ste...
- [60]
-
[61]
Immanuel Kant. 1781. Critique of Pure Reason (a/b ed.). Macmillan, London. Akademie-Ausgabe, Vol. 3, A132/B171
-
[62]
Immanuel Kant. 1790. Critique of Judgment. Hackett Publishing Company, Indianapolis. Akademie-Ausgabe, Vol. 5, 5:179
-
[63]
Akira Kawabata and Saku Sugawara. 2024. Rationale-Aware Answer Verification by Pairwise Self-Evaluation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 16178–16196
-
[64]
Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2024. CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pap...
-
[65]
Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19113–19122
-
[66]
Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. 2023. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ArXiv preprint abs/2310.08491 (2023). https://arxiv.org/abs/2310.08491
-
[67]
Pang Wei Koh, Jialin Zhang, Jane Lee, and Percy Liang. 2024. MedHELM: Holistic Evaluation of Language Models for Medical Applications. Technical Report. Stanford Human-Centered Artificial Intelligence
-
[68]
Mahi Kolla, Siddharth Salunkhe, Eshwar Chandrasekharan, and Koustuv Saha. 2024. LLM-Mod: Can Large Language Models Assist Content Moderation?. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8
- [69]
- [70]
- [71]
- [72]
- [73]
-
[74]
Preethi Lahoti, Nicholas Blumm, Xiao Ma, Raghavendra Kotikalapudi, Sahitya Potluri, Qijun Tan, Hansa Srinivasan, Ben Packer, Ahmad Beirami, Alex Beutel, and Jilin Chen. 2023. Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting. In Proceedings of the 2023 Conference on Empirical Methods in Nat...
-
[75]
Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, and Xian-Ling Mao. [n. d.]. CriticEval: Evaluating Large-scale Language Model as Critic. In The Thirty-eighth Annual Conference on Neural Information Processing Systems
-
[76]
Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, and Kyomin Jung. 2024. Are LLM-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on LLM-based evaluation. arXiv preprint arXiv:2410.20774 (2024)
-
[77]
Yebin Lee, Imseong Park, and Myungjoo Kang. 2024. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 3732–3746. https://aclanthology.org/2024.acl-long.205
-
[78]
Alice Li and Luanne Sinnamon. 2023. Examining query sentiment bias effects on search results in large language models. In The Symposium on Future Directions in Information Access (FDIA) co-located with the 2023 European Summer School on Information Retrieval (ESSIR)
- [79]
-
[80]
Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sunkwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024. DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer’s Disease Questions with Scientific Literature. ArXiv preprint abs/2405.04819 (2024). https://arxiv.org/abs/2405.04819