A Survey on LLM-as-a-Judge

Chengjin Xu; Hexiang Tan; Honghao Liu; Jian Guo; Jiawei Gu; Kun Zhang; Lionel Ni; Saizhuo Wang; Shengjie Ma; Wei Li

arxiv: 2411.15594 · v6 · submitted 2024-11-23 · 💻 cs.CL · cs.AI

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, Jian Guo

Pith reviewed 2026-05-23 17:30 UTC · model grok-4.3

keywords: LLM-as-a-Judge · evaluation reliability · bias mitigation · consistency strategies · automated assessment · benchmark for judges · LLM evaluation survey

The pith

LLMs can provide scalable evaluations for complex tasks when strategies address consistency and bias issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey examines how large language models can act as judges to evaluate subjective or large-scale tasks that traditionally require human experts. It establishes that LLMs offer cost-effective and consistent assessments across data types but only when specific techniques are applied to reduce variability and bias. The authors outline methods to improve reliability and introduce a new benchmark to test those methods. A reader would care because reliable automated evaluation could transform decision-making in fields where human judgment is expensive or inconsistent.

Core claim

The paper states that reliable LLM-as-a-Judge systems are achievable by combining strategies for consistency improvement, bias mitigation, and scenario adaptation, together with new evaluation methodologies and a novel benchmark that measures judge reliability.

What carries the argument

The LLM-as-a-Judge approach itself, supported by targeted reliability strategies and a novel benchmark that quantifies consistency and bias.
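
As a rough illustration of what quantifying bias can mean for a pairwise judge, the sketch below measures position consistency: each pair is judged twice with the candidate order swapped, and the judge counts as consistent when it prefers the same underlying answer both times. The `Judge` interface, the `toy_judge` stand-in, and the sample pairs are hypothetical; this is not the paper's benchmark, whose metrics are not described on this page.

```python
from typing import Callable, List, Tuple

# A judge takes (question, first_answer, second_answer) and returns "A" or "B".
# In practice this would wrap an LLM call; here it is a hypothetical interface.
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the verdict survives swapping the answer order."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    consistent = 0
    for question, ans_1, ans_2 in pairs:
        first = judge(question, ans_1, ans_2)   # ans_1 shown in slot A
        second = judge(question, ans_2, ans_1)  # ans_1 shown in slot B
        # Map both verdicts back to the underlying answer that won.
        winner_first = ans_1 if first == "A" else ans_2
        winner_second = ans_2 if second == "A" else ans_1
        consistent += winner_first == winner_second
    return consistent / len(pairs)


def toy_judge(question: str, a: str, b: str) -> str:
    """Toy stand-in: prefers the longer answer; ties go to slot A (a position bias)."""
    return "A" if len(a) >= len(b) else "B"


pairs = [
    ("q1", "short", "a much longer answer"),  # length decides, verdict is stable
    ("q2", "same1", "same2"),                 # tie, slot A wins in both orders
    ("q3", "tiny", "bigger one"),             # length decides, verdict is stable
    ("q4", "abc", "xyz"),                     # tie, slot A wins in both orders
]
score = position_consistency(toy_judge, pairs)
print(f"position consistency: {score:.2f}")
```

With this toy judge, the two tied pairs flip when the order is swapped, so only half of the pairs are judged consistently. A real LLM judge would replace `toy_judge` while the metric stays the same.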

If this is right

LLMs become practical substitutes for expert human evaluators in high-volume or subjective domains.
Standardized reliability checks can be applied before deploying any LLM judge.
Applications in real decision systems become viable once bias levels fall below acceptable thresholds.
Research can shift from basic feasibility to refining the identified strategies for specific tasks.
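
One way to make such a pre-deployment check concrete is to compare the judge's labels with human labels on a small audited sample and gate deployment on chance-corrected agreement. The sketch below computes Cohen's kappa by hand; the 0.6 threshold, the label data, and the function names are illustrative assumptions, not values taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def passes_gate(kappa: float, threshold: float = 0.6) -> bool:
    """Hypothetical deployment gate; the threshold is an assumption."""
    return kappa >= threshold


human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "bad", "good", "good", "bad", "good", "bad"]
kappa = cohens_kappa(human, judge)
print(f"kappa={kappa:.3f} deploy={passes_gate(kappa)}")
```

Here the judge disagrees with the human label on one of eight items, which gives a kappa of 0.75 after correcting for chance. What counts as an acceptable threshold depends on the task and is left to the deploying team.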

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adoption of the benchmark could create a common test set that all future LLM-judge papers must report against.
The survey's emphasis on bias mitigation suggests similar techniques might transfer to other LLM uses such as content moderation.
If the benchmark covers only certain task types, extensions to multi-modal or long-context judging would be natural next steps.
Real-world teams could run the benchmark on their chosen LLM before integrating it into production evaluation pipelines.

Load-bearing premise

The surveyed papers represent the full range of work on the topic and the new benchmark measures true reliability without its own selection biases.

What would settle it

An independent test showing that LLM judges still produce inconsistent or biased results on the proposed benchmark even after applying all the surveyed consistency and bias-mitigation strategies.
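
A minimal version of such a test could re-run the same judge several times per item and report how often the verdicts match the majority verdict. The sketch below assumes a hypothetical `judge(item, trial)` callable and toy items; a real test would call an LLM judge with the mitigation strategies applied and would use the benchmark's own items.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(
    judge: Callable[[str, int], str], items: Sequence[str], trials: int = 5
) -> float:
    """Mean share of trials that match each item's majority verdict."""
    if not items or trials < 1:
        raise ValueError("items must be non-empty and trials must be >= 1")
    total = 0.0
    for item in items:
        verdicts = [judge(item, t) for t in range(trials)]
        # Count of the most frequent verdict for this item.
        majority_count = Counter(verdicts).most_common(1)[0][1]
        total += majority_count / trials
    return total / len(items)


def toy_judge(item: str, trial: int) -> str:
    """Deterministic stand-in: stable on 'easy' items, alternating on the rest."""
    if item.startswith("easy"):
        return "pass"
    return "pass" if trial % 2 == 0 else "fail"


rate = self_consistency(toy_judge, ["easy-1", "easy-2", "hard-1", "hard-2"], trials=4)
print(f"self-consistency: {rate:.2f}")
```

A rate that stays well below 1.0 after mitigation would be evidence in the direction described above, although the cut-off for "inconsistent" is a judgment call this sketch does not settle.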

Figures

Figures reproduced from arXiv: 2411.15594 by Chengjin Xu, Hexiang Tan, Honghao Liu, Jian Guo, Jiawei Gu, Kun Zhang, Lionel Ni, Saizhuo Wang, Shengjie Ma, Wei Li, Wen Gao, Xuehao Zhai, Xuhui Jiang, Yinghan Shen, Yuanzhuo Wang, Zhichao Shi.

**Figure 2.** LLM-as-a-Judge evaluation pipelines.

**Figure 3.** The illustrations of method generating scores in ICL.

**Figure 4.** The illustrations of method Solving Yes/No questions and Conducting pairwise comparisons in ICL.

**Figure 5.** Four typical scenarios using LLM-as-a-Judge evaluation pipeline.

**Figure 6.** The illustrations of the scenario LLM-as-a-Judge for Models. The example of "win-tie-lose" is from Li et al. [79]. (Source footer: Vol. 1, No. 1. Publication date: October 2025.)

**Figure 7.** LLM-as-a-Judge appears in two common forms in the agent. The left diagram is Agent-as-a-Judge, …

**Figure 8.** Flowchart of Quick Practice.

**Figure 9.** Structure of how to improve and evaluate LLM-as-a-Judge.

**Figure 10.** Simplified Evaluation Pipeline of Two Decomposition Paradigms.

**Figure 11.** Two paradigms of the construction process of meta evaluation datasets for training.

**Figure 12.** Three Dimensions of Evaluation.

**Figure 13.** LLM-as-a-Judge Meta-evaluation Pipeline and Tools.

**Figure 14.** The development process and future prospects of LLM-as-a-Judge.

**Figure 15.** Illustration of using dual-LLM iterative feedback loop for alpha generation in finance. Figure adapted …

**Figure 16.** The relationship of LLM-as-a-Judge and Reasoning/Thinking.

**Figure 17.** The development process and future prospects of LLM-as-a-Judge.

Original abstract

Accurate and consistent evaluation is crucial for decision-making across numerous fields, yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large Language Models (LLMs) have achieved remarkable success across diverse domains, leading to the emergence of "LLM-as-a-Judge," where LLMs are employed as evaluators for complex tasks. With their ability to process diverse data types and provide scalable, cost-effective, and consistent assessments, LLMs present a compelling alternative to traditional expert-driven evaluations. However, ensuring the reliability of LLM-as-a-Judge systems remains a significant challenge that requires careful design and standardization. This paper provides a comprehensive survey of LLM-as-a-Judge, addressing the core question: How can reliable LLM-as-a-Judge systems be built? We explore strategies to enhance reliability, including improving consistency, mitigating biases, and adapting to diverse assessment scenarios. Additionally, we propose methodologies for evaluating the reliability of LLM-as-a-Judge systems, supported by a novel benchmark designed for this purpose. To advance the development and real-world deployment of LLM-as-a-Judge systems, we also discussed practical applications, challenges, and future directions. This survey serves as a foundational reference for researchers and practitioners in this rapidly evolving field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. This survey examines LLM-as-a-Judge systems for evaluating complex tasks. It claims that reliable systems can be built via strategies for improving consistency, mitigating biases, and adapting to diverse scenarios; proposes evaluation methodologies supported by a novel benchmark; and discusses applications, challenges, and future directions. The central question addressed is how to construct reliable LLM-as-a-Judge systems.

Significance. If the surveyed works form a representative sample and the novel benchmark supplies a generalizable, unbiased measure of judge reliability, the synthesis of strategies plus the benchmark could provide a useful reference for standardizing LLM-based evaluations. The work explicitly compiles external literature without derivations or fitted parameters.

major comments (2)

[Abstract] Abstract and introduction: the headline claims that the survey is 'comprehensive' and that the 'novel benchmark' is 'designed for this purpose' are load-bearing for the central thesis, yet no search protocol, inclusion/exclusion criteria, or coverage statistics are supplied; without these, the representativeness of the synthesized strategies cannot be assessed.
Benchmark section (wherever the novel benchmark is introduced): the abstract states the benchmark supports 'methodologies for evaluating the reliability of LLM-as-a-Judge systems,' but provides no validation details, error analysis, task-coverage justification, or comparison against existing benchmarks; this directly affects whether the proposed methodologies can be treated as generalizable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract] Abstract and introduction: the headline claims that the survey is 'comprehensive' and that the 'novel benchmark' is 'designed for this purpose' are load-bearing for the central thesis, yet no search protocol, inclusion/exclusion criteria, or coverage statistics are supplied; without these, the representativeness of the synthesized strategies cannot be assessed.

Authors: We agree that a transparent literature search protocol is necessary to substantiate the claim of comprehensiveness. In the revised manuscript we will add a dedicated subsection (likely in Section 2 or the introduction) that specifies the search strategy, databases queried, keywords and time range, explicit inclusion/exclusion criteria, and basic coverage statistics (e.g., number of papers screened versus retained). This addition will allow readers to evaluate the representativeness of the synthesized reliability strategies. revision: yes

Referee: [—] Benchmark section (wherever the novel benchmark is introduced): the abstract states the benchmark supports 'methodologies for evaluating the reliability of LLM-as-a-Judge systems,' but provides no validation details, error analysis, task-coverage justification, or comparison against existing benchmarks; this directly affects whether the proposed methodologies can be treated as generalizable.

Authors: We acknowledge that the current presentation of the novel benchmark lacks the supporting analyses required to establish its generalizability. In the revision we will expand the benchmark section to include: (i) validation procedures and results, (ii) error analysis across tasks, (iii) explicit justification for task selection and coverage, and (iv) side-by-side comparisons with prior benchmarks. These additions will directly support the claim that the benchmark enables generalizable evaluation methodologies. revision: yes

Circularity Check

0 steps flagged

No circularity: survey compiles external literature with independent benchmark proposal

This is a survey paper whose core contribution is synthesis of external works plus proposal of evaluation methodologies and a novel benchmark. No derivations, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. The abstract and structure reference external literature and a new benchmark without any self-definitional loops, fitted-input predictions, or load-bearing self-citations that collapse the claims. The paper is self-contained against external benchmarks as a literature review, warranting score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Survey paper with no mathematical derivations or empirical claims beyond the benchmark proposal mentioned in the abstract; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5797 in / 941 out tokens · 24724 ms · 2026-05-23T17:30:57.915501+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FollowTable: A Benchmark for Instruction-Following Table Retrieval
cs.IR 2026-05 unverdicted novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
cs.AI 2026-04 accept novelty 8.0

MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
GIANTS: Generative Insight Anticipation from Scientific Literature
cs.CL 2026-04 unverdicted novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
cs.CL 2025-07 accept novelty 8.0

MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
GS-QA: A Benchmark for Geospatial Question Answering
cs.DB 2026-05 unverdicted novelty 7.0

GS-QA is a new benchmark of 2,800 QA pairs on 28 templates using OSM and Wikipedia data to evaluate LLMs on spatial predicates, multi-source reasoning, and diverse answer types including distances and counts.
Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
cs.CR 2026-05 accept novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, a...
Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
cs.CL 2026-05 unverdicted novelty 7.0

ConsumerSimBench evaluates 13 LLMs on reconstructing crowd reactions from 1,553 Chinese social-media topics using 23,122 auditable yes-no criteria, finding maximum coverage of 47.8% by Gemini-3.1-Pro.
Recall Isn't Enough: Bounding Commitments in Personalized Language Systems
cs.AI 2026-05 unverdicted novelty 7.0

CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio versions via TTS and noise, showing model-dependent performance with small text-to-voice gaps of 1.8-4.8 points on Confetti and When2Call.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
cs.CY 2026-05 accept novelty 7.0

StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
cs.CY 2026-05 unverdicted novelty 7.0

StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
Task-Aware Calibration: Provably Optimal Decoding in LLMs
cs.LG 2026-05 unverdicted novelty 7.0

Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
Membership Inference Attacks for Retrieval Based In-Context Learning for Document Question Answering
cs.CR 2026-05 unverdicted novelty 7.0

Black-box membership inference attacks on retrieval-based in-context learning for document QA succeed via query prefixes, with a novel weighted-averaging method outperforming priors even under paraphrasing.
BIM Information Extraction Through LLM-based Adaptive Exploration
cs.CL 2026-05 unverdicted novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
cs.CL 2026-04 unverdicted novelty 7.0

Depression patient simulators produce overly long, low-variability responses that resolve emotions too quickly along a uniform trajectory, with framework choice outweighing model scale.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
cs.CL 2026-04 conditional novelty 7.0

MuDABench provides 332 analytical QA instances over large semi-structured document collections, showing standard RAG performs poorly while a multi-agent workflow with planning, extraction, and code generation improves...
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
cs.CL 2026-04 unverdicted novelty 7.0

LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic pe...
MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
cs.CL 2026-04 unverdicted novelty 7.0

MM-JudgeBias benchmark shows that many MLLM judges neglect modalities and produce unstable evaluations under small input changes, based on tests of 26 models with over 1,800 samples.
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
cs.AI 2026-04 conditional novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
cs.CL 2026-04 unverdicted novelty 7.0

A controlled LLM pipeline generates synthetic French OSCE transcripts with varying skill levels and evaluates them, with mid-size models achieving ~90% accuracy matching GPT-4o on the synthetic data.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models
cs.CL 2026-03 unverdicted novelty 7.0

PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.
When Negation Is a Geometry Problem in Vision-Language Models
cs.CV 2026-03 conditional novelty 7.0

A direction associated with negation exists in CLIP embedding space and can be steered at test time via representation engineering to produce negation-aware outputs without fine-tuning.
Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
cs.CL 2026-03 conditional novelty 7.0

Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
cs.LG 2026-03 unverdicted novelty 7.0

A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling
cs.SE 2026-01 unverdicted novelty 7.0

A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
VIDEOP2R: Video Understanding from Perception to Reasoning
cs.CV 2025-11 conditional novelty 7.0

VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models
cs.AI 2025-10 unverdicted novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step inter...
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
cs.CL 2025-10 unverdicted novelty 7.0

FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models
cs.CL 2025-07 conditional novelty 7.0

Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
Bayesian Social Deduction with Graph-Informed Language Models
cs.AI 2025-06 unverdicted novelty 7.0

Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
cs.LG 2025-04 accept novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
cs.CV 2025-04 conditional novelty 7.0

Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.
Towards Context-Invariant Safety Alignment for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Evaluation of 6233 MedGPTs finds 25-30% with low factual accuracy, 33.6-54.3% violating operational thresholds, and 57% of action-enabled models lacking privacy disclosures.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance...
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
cs.SE 2026-05 accept novelty 6.0

A paraphrase-robust clustering pipeline plus XGBoost classifier identifies refactoring-worthy step subsequences in large BDD test corpora with out-of-fold F1 0.891, outperforming rule baselines and LLM judges.
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
cs.AI 2026-05 unverdicted novelty 6.0

HEAR uses a stratified hypergraph ontology to orchestrate evidence-driven multi-hop reasoning over heterogeneous business systems, reaching 94.7% accuracy on supply-chain root-cause tasks with open-weight models.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
cs.SE 2026-05 unverdicted novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
cs.CV 2026-05 accept novelty 6.0

A 30-token prompt requesting a neutral comparison table cuts sponsored recommendations in LLMs from roughly 50% to near zero.
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
cs.AI 2026-05 unverdicted novelty 6.0

PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
cs.LG 2026-05 unverdicted novelty 6.0

LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel
cs.SE 2026-05 conditional novelty 6.0

False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
cs.CV 2026-05 unverdicted novelty 6.0

DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
Shadow-Loom: Causal Reasoning over Graphical World Models of Narratives
cs.AI 2026-05 unverdicted novelty 6.0

Shadow-Loom builds graphical world models from stories to enable code-based causal reasoning and structural scoring of narrative effects such as mystery, irony, suspense, and surprise.
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
cs.CL 2026-05 unverdicted novelty 6.0

VIDA provides 2,500 visually-dependent ambiguous MT instances and LLM-judge metrics; chain-of-thought SFT improves disambiguation accuracy over standard SFT, especially out-of-distribution.
MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
cs.HC 2026-04 unverdicted novelty 6.0

MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...
A Survey on LLM-based Conversational User Simulation
cs.CL 2026-04 unverdicted novelty 6.0

A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.
Exploring Audio Hallucination in Egocentric Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models
cs.MM 2026-04 unverdicted novelty 6.0

OceanPile is a new multimodal corpus with unified data collection, instruction tuning set, and benchmark to train foundation models for ocean science.
Evian: Towards Explainable Visual Instruction-tuning Data Auditing
cs.CV 2026-04 unverdicted novelty 6.0

EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.
Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
cs.AI 2026-04 unverdicted novelty 6.0

A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.
Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs
cs.CL 2026-04 unverdicted novelty 6.0

GLOW integrates a pre-trained GNN for candidate prediction with an LLM for joint symbolic-semantic reasoning over incomplete KGs, reporting up to 53.3% gains on standard benchmarks and a new GLOW-BENCH dataset.
Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation
cs.CL 2026-04 unverdicted novelty 6.0

MISE proves that hindsight self-evaluation rewards equal minimizing mutual information plus KL divergence to a proxy policy, and experiments show 7B LLMs reaching GPT-4o-level results on validation tasks.
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval
cs.AI 2026-04 unverdicted novelty 6.0

A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
cs.AI 2026-04 unverdicted novelty 6.0

TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts wh...
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

Reference graph

Works this paper leans on

231 extracted references · 231 canonical work pages · cited by 114 Pith papers · 24 internal anchors

[1]

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Tauman Kalai. 2023. Do Language Models Know When They’re Hallucinating References? arXiv preprint arXiv:2305.18248 (2023)

work page arXiv 2023
[2]

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu

work page
[3]

L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088 (2023)

work page arXiv 2023
[4]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. 2024. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. ArXiv preprint abs/2410.09024 (2024). https://arxiv.org/abs/2410.09024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. 2024. Critique-out-loud reward models. arXiv preprint arXiv:2408.11791 (2024)

work page arXiv 2024
[6]

Golnoosh Babaei and Paolo Giudici. 2024. GPT classifications, with application to credit lending. Machine Learning with Applications 16 (2024), 100534

work page 2024
[7]

Sher Badshah and Hassan Sajjad. 2024. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text. ArXiv preprint abs/2408.09235 (2024). https://arxiv.org/abs/2408.09235

work page arXiv 2024
[8]

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. 2024. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. arXiv preprint arXiv:2412.15204 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an-Examiner. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2...

work page 2023
[10]

Sergio Bermejo. 2024. Enhancing Annotated Bibliography Generation with LLM Ensembles. arXiv preprint arXiv:2412.20864 (2024)

work page arXiv 2024
[11]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Confere...

work page doi:10.1609/aaai 2024
[12]

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. 2024. Guiding LLMs the right way: fast, non-invasive constrained generation. In Proceedings of the 41st International Conference on Machine Learning (ICML ’24, Vol. 235). JMLR.org, Vienna, Austria, 3658–3673

work page 2024
[13]

Nathan Brake and Thomas Schaaf. 2024. Comparing Two Model Designs for Clinical Note Generation: Is an LLM a Useful Evaluator of Consistency? Findings of the ACL (2024)

work page 2024
[14]

Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, and Eitam Sheetrit. 2024. Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance–A Case Study in Finance. ArXiv preprint abs/2410.01109 (2024). https://arxiv.org/abs/2410.01109

work page arXiv 2024
[15]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page 2020
[16]

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. In The Twelfth International Conference on Learning Representations

work page 2023
[17]

David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. 2023. CLAIR: Evaluating Image Captions with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 13638–13646. doi:1...

work page doi:10.18653/v1/2023.emnlp-main.841 2023
[18]

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=dbFEFHAD79

work page 2024
[19]

Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. 2024. Data-juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data. 120–134

work page 2024
[20]

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a study on judgement biases. ArXiv preprint abs/2402.10669 (2024). https://arxiv.org/abs/2402.10669

work page arXiv 2024
[21]

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. 2024. Automated evaluation of large vision-language models on self-driving corner cases. ArXiv preprint abs/2404.10595 (2024). https://arxiv.org/abs/2404.10595

work page arXiv 2024
[22]

Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. 2023. Evaluating hallucinations in chinese large language models. ArXiv preprint abs/2310.03368 (2023). https://arxiv.org/abs/2310.03368

work page arXiv 2023
[23]

Inyoung Cheong, King Xia, KJ Kevin Feng, Quan Ze Chen, and Amy X Zhang. 2024. (A)I Am Not a Lawyer, But...: Engaging Legal Experts towards Responsible LLM Policies for Legal Advice. In The 2024 ACM Conference on Fairness, Accountability, and Transparency. 2454–2469

work page 2024
[24]

Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=3Pf3Wg6o-A4

work page 2023
[25]

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. 2023. Holistic analysis of hallucination in GPT-4V(ision): Bias and interference challenges. ArXiv preprint abs/2311.03287 (2023). https://arxiv.org/abs/2311.03287

work page arXiv 2023
[26]

Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. 2024. Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models. arXiv preprint arXiv:2404.11457 (2024)

work page arXiv 2024
[27]

Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural Retrievers are Biased Towards LLM-Generated Content. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Barcelona, Spain) (KDD ’24). Association for Computing Machinery, New York, NY, USA, 526–537. doi:1...

work page doi:10.1145/3637528.3671882 2024
[28]

MRSB DATA. 2024. Multimodal artificial intelligence foundation models: Unleashing the power of remote sensing big data in earth observation. Innovation 2, 1 (2024), 100055

work page 2024
[29]

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv preprint arXiv:2304.06767 (2023). https://arxiv.org/abs/2304.06767

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. 2024. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. doi:10.48550/arXiv.2411.15100 arXiv:2411.15100 [cs]

work page doi:10.48550/arxiv.2411.15100 2024
[31]

Yijiang River Dong, Tiancheng Hu, and Nigel Collier. 2024. Can LLM be a Personalized Judge? arXiv preprint arXiv:2406.11657 (2024)

work page arXiv 2024
[32]

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Associ...

work page doi:10.18653/v1/2022.acl-long.26 2022
[33]

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS...

work page 2023
[34]

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as You Desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Lingui...

work page 2024
[35]

Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics (2024), 1–79

work page 2024
[36]

Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics 50, 3 (2024), 1097–1179

work page 2024
[37]

Chang Gao, Haiyun Jiang, Deng Cai, Shuming Shi, and Wai Lam. 2023. Strategyllm: Large language models as strategy generators, executors, optimizers, and evaluators for problem solving. arXiv preprint arXiv:2311.08803 (2023)

work page arXiv 2023
[38]

Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In International Conference on Machine Learning. PMLR, 10835–10866

work page 2023
[39]

Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. 2023. Human-like summarization evaluation with chatgpt. ArXiv preprint abs/2304.02554 (2023). https://arxiv.org/abs/2304.02554

work page arXiv 2023
[40]

Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. 2023. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapor...

work page doi:10.18653/v1/2023.emnlp-main.127 2023
[41]

Google. 2023. Gemini: a family of highly capable multimodal models. ArXiv preprint abs/2312.11805 (2023). https://arxiv.org/abs/2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John J. Nay, Jonathan H. Choi, K...

work page 2023
[43]

Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, and Shuo Shuo Liu. 2024. Bias in large language models: Origin, evaluation, and mitigation. arXiv preprint arXiv:2411.10915 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 8154–8173. doi:1...

work page doi:10.18653/v1/2023.emnlp-main.507 2023
[45]

Hangfeng He, Hongming Zhang, and Dan Roth. 2024. SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 2736–2764. https://a...

work page 2024
[46]

Shijun He, Fan Yang, Jian-ping Zuo, and Ze-min Lin. 2023. ChatGPT for scientific paper writing—promises and perils. The Innovation 4, 6 (2023)

work page 2023
[47]

Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, and Saeed Hassanpour

work page
[48]

A Benchmark for Long-Form Medical Question Answering. In Proceedings of EMNLP

work page
[49]

Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, and Xiaojun Wan. 2024. Are LLM-based Evaluators Confusing NLG Quality Criteria?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9530–9570. https://aclanthology.org/2024.acl-long.516

work page 2024
[50]

Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Wenbo Su, Bo Zheng, and Jiaheng Liu

work page
[51]

Think-j: Learning to think for generative llm-as-a-judge. arXiv preprint arXiv:2505.14268 (2025)

work page arXiv 2025
[52]

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers. ArXiv preprint abs/2403.02839 (2024). https://arxiv.org/abs/2403.02839

work page arXiv 2024
[53]

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 1049–1065. doi:10.18653/v1/2023.findings-acl.67

work page doi:10.18653/v1/2023.findings-acl.67 2023
[54]

Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. 2023. On the humanity of conversational ai: Evaluating the psychological portrayal of llms. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=H3UayAQWoE

work page 2023
[55]

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. ArXiv preprint abs/2309.00614 (2023). https://arxiv.org/abs/2309.00614

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, and Jaewoo Kang. 2024. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40, Supplement_1 (2024), i119–i129

work page 2024
[57]

Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Weijie J Su, Camillo Jose Taylor, and Tanwi Mallick. 2024. Multi-modal and multi-agent systems meet rationality: A survey. In ICML 2024 Workshop on LLMs and Cognition

work page 2024
[58]

Theodore T. Jiang, Li Fang, and Kai Wang. 2023. Deciphering “the language of nature”: A transformer-based language model for deleterious mutations in proteins. The Innovation 4, 5 (2023), 100487. doi:10.1016/j.xinn.2023.100487

work page doi:10.1016/j.xinn.2023.100487 2023
[59]

Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun. 2024. A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Kevin Duh, Helena Gomez, and Ste...

work page 2024
[60]

Jaehun Jung, Faeze Brahman, and Yejin Choi. 2024. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv preprint arXiv:2407.18370 (2024)

work page arXiv 2024
[61]

Immanuel Kant. 1781. Critique of Pure Reason (a/b ed.). Macmillan, London. Akademie-Ausgabe, Vol. 3, A132/B171

work page
[62]

Immanuel Kant. 1790. Critique of Judgment. Hackett Publishing Company, Indianapolis. Akademie-Ausgabe, Vol. 5, 5:179

work page
[63]

Akira Kawabata and Saku Sugawara. 2024. Rationale-Aware Answer Verification by Pairwise Self-Evaluation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 16178–16196

work page 2024
[64]

Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2024. CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pap...

work page 2024
[65]

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19113–19122

work page 2023
[66]

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. 2023. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ArXiv preprint abs/2310.08491 (2023). https://arxiv.org/abs/2310.08491

work page arXiv 2023
[67]

Pang Wei Koh, Jialin Zhang, Jane Lee, and Percy Liang. 2024. MedHELM: Holistic Evaluation of Language Models for Medical Applications. Technical Report. Stanford Human-Centered Artificial Intelligence

work page 2024
[68]

Mahi Kolla, Siddharth Salunkhe, Eshwar Chandrasekharan, and Koustuv Saha. 2024. LLM-Mod: Can Large Language Models Assist Content Moderation?. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8

work page 2024
[69]

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2023. Benchmarking cognitive biases in large language models as evaluators. ArXiv preprint abs/2309.17012 (2023). https://arxiv.org/abs/2309.17012

work page arXiv 2023
[70]

Fajri Koto. 2024. Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia. ArXiv preprint abs/2409.08564 (2024). https://arxiv.org/abs/2409.08564

work page arXiv 2024
[71]

Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, and Bahador Saket. 2024. Towards Leveraging Large Language Models for Automated Medical Question–Answer Evaluation. arXiv preprint arXiv:2403.01892 (2024)

work page arXiv 2024
[72]

Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, and Partha Pratim Chakrabarti. 2024. LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization. ArXiv preprint abs/2409.00630 (2024). https://arxiv.org/abs/2409.00630

work page arXiv 2024
[73]

Abhishek Kumar, Sarfaroz Yunusov, and Ali Emami. 2024. Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models. ArXiv preprint abs/2405.14555 (2024). https://arxiv.org/abs/2405.14555

work page arXiv 2024
[74]

Preethi Lahoti, Nicholas Blumm, Xiao Ma, Raghavendra Kotikalapudi, Sahitya Potluri, Qijun Tan, Hansa Srinivasan, Ben Packer, Ahmad Beirami, Alex Beutel, and Jilin Chen. 2023. Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting. In Proceedings of the 2023 Conference on Empirical Methods in Nat...

work page doi:10.18653/v1/2023.emnlp-main.643 2023
[75]

Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, and Xian-Ling Mao. [n. d.]. CriticEval: Evaluating Large-scale Language Model as Critic. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page
[76]

Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, and Kyomin Jung. 2024. Are LLM-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on LLM-based evaluation. arXiv preprint arXiv:2410.20774 (2024)

work page arXiv 2024
[77]

Yebin Lee, Imseong Park, and Myungjoo Kang. 2024. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 3732–3746. https://aclanthology.org/2024.acl-long.205

work page 2024
[78]

Alice Li and Luanne Sinnamon. 2023. Examining query sentiment bias effects on search results in large language models. In The Symposium on Future Directions in Information Access (FDIA) co-located with the 2023 European Summer School on Information Retrieval (ESSIR)

work page 2023
[79]

Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, and Jiayi Shen. 2024. SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents. ArXiv preprint abs/2411.03284 (2024). https://arxiv.org/abs/2411.03284

work page arXiv 2024
[80]

Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sunkwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024. DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer’s Disease Questions with Scientific Literature. ArXiv preprint abs/2405.04819 (2024). https://arxiv.org/abs/2405.04819

work page arXiv 2024

Showing first 80 references.

[1] [1]

Ayush Agrawal, Mirac Suzgun, Lester Mackey, and Adam Tauman Kalai. 2023. Do Language Models Know When They’re Hallucinating References?arXiv preprint arXiv:2305.18248 (2023)

work page arXiv 2023

[2] [2]

Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu

work page

[3] [3]

2307.11088 , archivePrefix=

L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088 (2023)

work page arXiv 2023

[4] [4]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. 2024. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. ArXiv preprint abs/2410.09024 (2024). https://arxiv.org/abs/2410.09024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Zachary Ankner, Mansheej Paul, Brandon Cui, Jonathan D Chang, and Prithviraj Ammanabrolu. 2024. Critique-out- loud reward models. arXiv preprint arXiv:2408.11791 (2024)

work page arXiv 2024

[6] [6]

Golnoosh Babaei and Paolo Giudici. 2024. GPT classifications, with application to credit lending. Machine Learning with Applications 16 (2024), 100534

work page 2024

[7] [7]

Sher Badshah and Hassan Sajjad. 2024. Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text. ArXiv preprint abs/2408.09235 (2024). https://arxiv.org/abs/2408.09235

work page arXiv 2024

[8] [8]

Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. 2024. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. arXiv preprint arXiv:2412.15204 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. 2023. Benchmarking Foundation Models with Language-Model-as-an-Examiner. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2...

work page 2023

[10] [10]

Sergio Bermejo. 2024. Enhancing Annotated Bibliography Generation with LLM Ensembles. arXiv preprint arXiv:2412.20864 (2024)

work page arXiv 2024

[11] [11]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Confere...

work page doi:10.1609/aaai 2024

[12] [12]

Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. 2024. Guiding LLMs the right way: fast, non-invasive constrained generation. InProceedings of the 41st International Conference on Machine Learning (ICML’24,Vol.235). JMLR.org, Vienna, Austria, 3658–3673

work page 2024

[13] [13]

Nathan Brake and Thomas Schaaf. 2024. Comparing Two Model Designs for Clinical Note Generation: Is an LLM a Useful Evaluator of Consistency? Findings of the ACL (2024)

work page 2024

[14] [14]

Meni Brief, Oded Ovadia, Gil Shenderovitz, Noga Ben Yoash, Rachel Lemberg, and Eitam Sheetrit. 2024. Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance–A Case Study in Finance.ArXiv preprint abs/2410.01109 (2024). https://arxiv.org/abs/2410.01109

work page arXiv 2024

[15] [15]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page 2020

[16] [16]

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. Chat- Eval: Towards Better LLM-based Evaluators through Multi-Agent Debate. InThe Twelfth International Conference on Learning Representations

work page 2023

[17] [17]

David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, and John Canny. 2023. CLAIR: Evaluating Image Captions with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 13638–13646. doi:1...

work page doi:10.18653/v1/2023.emnlp-main.841 2023

[18] [18]

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. 2024. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=dbFEFHAD79

work page 2024

[19] [19]

Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. 2024. Data-juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data. 120–134

work page 2024

[20] [20]

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a study on judgement biases. ArXiv preprint abs/2402.10669 (2024). https://arxiv.org/abs/2402.10669

work page arXiv 2024

[21] [21]

Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. 2024. Automated evaluation of large vision-language models on self-driving corner cases. ArXiv preprint abs/2404.10595 (2024). https://arxiv.org/abs/2404.10595

work page arXiv 2024

[22] [22]

Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. 2023. Evaluating hallucinations in chinese large language models. ArXiv preprint abs/2310.03368 (2023). https://arxiv.org/abs/2310.03368

work page arXiv 2023

[23] [23]

Inyoung Cheong, King Xia, KJ Kevin Feng, Quan Ze Chen, and Amy X Zhang. 2024. (A) I Am Not a Lawyer, But...: Engaging Legal Experts towards Responsible LLM Policies for Legal Advice. InThe 2024 ACM Conference on Fairness, Accountability, and Transparency. 2454–2469

work page 2024

[24] [24]

Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=3Pf3Wg6o-A4

work page 2023

[25] [25]

Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. 2023. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. ArXiv preprint abs/2311.03287 (2023). https://arxiv.org/abs/2311.03287

work page arXiv 2023

[26] [26]

Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, and Jun Xu. 2024. Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models. arXiv preprint arXiv:2404.11457 (2024)

work page arXiv 2024

[27] [27]

Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural Retrievers are Biased Towards LLM-Generated Content. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Barcelona, Spain) (KDD ’24). Association for Computing Machinery, New York, NY, USA, 526–537. doi:1...

doi:10.1145/3637528.3671882

[28]

MRSB DATA. 2024. Multimodal artificial intelligence foundation models: Unleashing the power of remote sensing big data in earth observation. Innovation 2, 1 (2024), 100055

[29]

Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv preprint arXiv:2304.06767 (2023). https://arxiv.org/abs/2304.06767

[30]

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. 2024. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. doi:10.48550/arXiv.2411.15100 arXiv:2411.15100 [cs]

[31]

Yijiang River Dong, Tiancheng Hu, and Nigel Collier. 2024. Can LLM be a Personalized Judge? arXiv preprint arXiv:2406.11657 (2024)

[32]

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Associ...

doi:10.18653/v1/2022.acl-long.26

[33]

Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS...

[34]

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as You Desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Lingui...

[35]

Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics (2024), 1–79

[36]

Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. 2024. Bias and fairness in large language models: A survey. Computational Linguistics 50, 3 (2024), 1097–1179

[37]

Chang Gao, Haiyun Jiang, Deng Cai, Shuming Shi, and Wai Lam. 2023. Strategyllm: Large language models as strategy generators, executors, optimizers, and evaluators for problem solving. arXiv preprint arXiv:2311.08803 (2023)

[38]

Leo Gao, John Schulman, and Jacob Hilton. 2023. Scaling laws for reward model overoptimization. In International Conference on Machine Learning. PMLR, 10835–10866

[39]

Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, and Xiaojun Wan. 2023. Human-like summarization evaluation with chatgpt. ArXiv preprint abs/2304.02554 (2023). https://arxiv.org/abs/2304.02554

[40]

Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, and Idan Szpektor. 2023. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapor...

doi:10.18653/v1/2023.emnlp-main.127

[41]

Google. 2023. Gemini: a family of highly capable multimodal models. ArXiv preprint abs/2312.11805 (2023). https://arxiv.org/abs/2312.11805

[42]

Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya K, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John J. Nay, Jonathan H. Choi, K...

[43]

Yufei Guo, Muzhe Guo, Juntao Su, Zhou Yang, Mengqiu Zhu, Hongfei Li, Mengyang Qiu, and Shuo Shuo Liu. 2024. Bias in large language models: Origin, evaluation, and mitigation. arXiv preprint arXiv:2411.10915 (2024)

[44]

Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, and Zhiting Hu. 2023. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 8154–8173. doi:1...

doi:10.18653/v1/2023.emnlp-main.507

[45]

Hangfeng He, Hongming Zhang, and Dan Roth. 2024. SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation. In Findings of the Association for Computational Linguistics: NAACL 2024, Kevin Duh, Helena Gomez, and Steven Bethard (Eds.). Association for Computational Linguistics, Mexico City, Mexico, 2736–2764. https://a...

[46]

Shijun He, Fan Yang, Jian-ping Zuo, and Ze-min Lin. 2023. ChatGPT for scientific paper writing—promises and perils. The Innovation 4, 6 (2023)

[47]

Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, and Saeed Hassanpour

[48]

A Benchmark for Long-Form Medical Question Answering. In Proceedings of EMNLP

[49]

Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, and Xiaojun Wan. 2024. Are LLM-based Evaluators Confusing NLG Quality Criteria?. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 9530–9570. https://aclanthology.org/2024.acl-long.516

[50]

Hui Huang, Yancheng He, Hongli Zhou, Rui Zhang, Wei Liu, Weixun Wang, Wenbo Su, Bo Zheng, and Jiaheng Liu

[51]

Think-j: Learning to think for generative llm-as-a-judge. arXiv preprint arXiv:2505.14268 (2025)

[52]

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers. ArXiv preprint abs/2403.02839 (2024). https://arxiv.org/abs/2403.02839

[53]

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards Reasoning in Large Language Models: A Survey. In Findings of the Association for Computational Linguistics: ACL 2023, Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 1049–1065. doi:10.18653/v1/2023.findings-acl.67

[54]

Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, and Michael Lyu. 2023. On the humanity of conversational ai: Evaluating the psychological portrayal of llms. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=H3UayAQWoE

[55]

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. ArXiv preprint abs/2309.00614 (2023). https://arxiv.org/abs/2309.00614

[56]

Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, and Jaewoo Kang. 2024. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40, Supplement_1 (2024), i119–i129

[57]

Bowen Jiang, Yangxinyu Xie, Xiaomeng Wang, Weijie J Su, Camillo Jose Taylor, and Tanwi Mallick. 2024. Multi-modal and multi-agent systems meet rationality: A survey. In ICML 2024 Workshop on LLMs and Cognition

[58]

Theodore T. Jiang, Li Fang, and Kai Wang. 2023. Deciphering “the language of nature”: A transformer-based language model for deleterious mutations in proteins. The Innovation 4, 5 (2023), 100487. doi:10.1016/j.xinn.2023.100487

[59]

Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, and Huan Sun. 2024. A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Kevin Duh, Helena Gomez, and Ste...

[60]

Jaehun Jung, Faeze Brahman, and Yejin Choi. 2024. Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement. arXiv preprint arXiv:2407.18370 (2024)

[61]

Immanuel Kant. 1781. Critique of Pure Reason (a/b ed.). Macmillan, London. Akademie-Ausgabe, Vol. 3, A132/B171

[62]

Immanuel Kant. 1790. Critique of Judgment. Hackett Publishing Company, Indianapolis. Akademie-Ausgabe, Vol. 5, 5:179

[63]

Akira Kawabata and Saku Sugawara. 2024. Rationale-Aware Answer Verification by Pairwise Self-Evaluation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 16178–16196

[64]

Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2024. CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pap...

[65]

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19113–19122

[66]

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. 2023. Prometheus: Inducing Fine-grained Evaluation Capability in Language Models. ArXiv preprint abs/2310.08491 (2023). https://arxiv.org/abs/2310.08491

[67]

Pang Wei Koh, Jialin Zhang, Jane Lee, and Percy Liang. 2024. MedHELM: Holistic Evaluation of Language Models for Medical Applications. Technical Report. Stanford Human-Centered Artificial Intelligence

[68]

Mahi Kolla, Siddharth Salunkhe, Eshwar Chandrasekharan, and Koustuv Saha. 2024. LLM-Mod: Can Large Language Models Assist Content Moderation?. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–8

[69]

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2023. Benchmarking cognitive biases in large language models as evaluators. ArXiv preprint abs/2309.17012 (2023). https://arxiv.org/abs/2309.17012

[70]

Fajri Koto. 2024. Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia. ArXiv preprint abs/2409.08564 (2024). https://arxiv.org/abs/2409.08564

[71]

Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, and Bahador Saket. 2024. Towards Leveraging Large Language Models for Automated Medical Question–Answer Evaluation. arXiv preprint arXiv:2403.01892 (2024)

[72]

Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, and Partha Pratim Chakrabarti. 2024. LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization. ArXiv preprint abs/2409.00630 (2024). https://arxiv.org/abs/2409.00630

[73]

Abhishek Kumar, Sarfaroz Yunusov, and Ali Emami. 2024. Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models. ArXiv preprint abs/2405.14555 (2024). https://arxiv.org/abs/2405.14555

[74]

Preethi Lahoti, Nicholas Blumm, Xiao Ma, Raghavendra Kotikalapudi, Sahitya Potluri, Qijun Tan, Hansa Srinivasan, Ben Packer, Ahmad Beirami, Alex Beutel, and Jilin Chen. 2023. Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting. In Proceedings of the 2023 Conference on Empirical Methods in Nat...

doi:10.18653/v1/2023.emnlp-main.643

[75]

Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, and Xian-Ling Mao. [n. d.]. CriticEval: Evaluating Large-scale Language Model as Critic. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

[76]

Dongryeol Lee, Yerin Hwang, Yongil Kim, Joonsuk Park, and Kyomin Jung. 2024. Are LLM-judges robust to expressions of uncertainty? investigating the effect of epistemic markers on LLM-based evaluation. arXiv preprint arXiv:2410.20774 (2024)

[77]

Yebin Lee, Imseong Park, and Myungjoo Kang. 2024. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 3732–3746. https://aclanthology.org/2024.acl-long.205

[78]

Alice Li and Luanne Sinnamon. 2023. Examining query sentiment bias effects on search results in large language models. In The Symposium on Future Directions in Information Access (FDIA) co-located with the 2023 European Summer School on Information Retrieval (ESSIR)

[79]

Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, and Jiayi Shen. 2024. SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents. ArXiv preprint abs/2411.03284 (2024). https://arxiv.org/abs/2411.03284

[80]

Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sunkwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, et al. 2024. DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer’s Disease Questions with Scientific Literature. ArXiv preprint abs/2405.04819 (2024). https://arxiv.org/abs/2405.04819
