Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.
hub Mixed citations
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
Mixed citation behavior. Most common role is background (57%).
abstract
The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still have lower human correspondence than medium-size neural evaluators. In this work, we present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm, to assess the quality of NLG outputs. We experiment with two generation tasks, text summarization and dialogue generation. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin. We also propose preliminary analysis on the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators having a bias towards the LLM-generated texts. The code is at https://github.com/nlpyang/geval
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.
The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
LLM agents with personality anchoring from characters show dyadic agreeableness composition monotonically predicts shared goal achievement across 1010 simulated conversations, with homogeneous-agreeable pairs at 62% success versus 6% for homogeneous-disagreeable pairs.
NextMotionQA benchmark reveals VLMs have critical gaps in fine-grained human motion understanding and align with experts on coarse judgment (κ=0.70) but not fine-grained (κ=0.10).
CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.
LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.
A 527-item GDPR-aligned privacy preference item bank was developed by extracting 669 statements from 99 GDPR articles and validating them through multi-round expert consensus and semantic clustering.
CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.
Proposes the Metanym Game as a self-contained LLM benchmark using peer ratings and SVD to extract generator and judge competence, with skills dissociating and correlation to GPQA Diamond at r=0.92.
citing papers explorer
-
LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank
LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.
-
CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.
-
Before and After Temperature: A Distributional View of Creative LLM Generation
A per-token feature from temperature-induced changes in LLM token distributions predicts within-prompt creativity rank at Spearman rho 0.918 vs LLM judges and 0.870 vs humans, outperforming perplexity, entropy, top-1 margin, and compression baselines.
-
Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
-
Green Shielding: A User-Centric Approach Towards Trustworthy AI
Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
-
BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories
BiasedTales-ML provides a parallel multilingual corpus of LLM-generated children's stories that reveals substantial cross-lingual differences in narrative attributes not captured by English-centric analyses.
-
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform them and generalize to new domains.
-
PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.
-
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.
-
The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence
Proposes the Metanym Game as a self-contained LLM benchmark using peer ratings and SVD to extract generator and judge competence, with skills dissociating and correlation to GPQA Diamond at r=0.92.
-
M\"OVE: A Holistic LLM Benchmark for the German Public Sector
MÖVE presents a new German-language benchmark evaluating 39 LLMs on performance and governance criteria using ten public-administration datasets.
-
Substrate Asymmetry in User-Side Memory: A Diagnostic Framework
User memory in LLMs factors into three orthogonal axes where parametric adapters and retrieval show opposite strengths, with causal evidence from attention interventions and an alignment tax on RLHF models.
-
Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents
Empirical study of a production multi-turn ordering agent finds LLM-as-judge recall below 25% for human-confirmed defects, missing cross-turn state issues due to limited rubric and routing.
-
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
RefMem-Bench benchmarks reflective memory in dialogue with 26K instances across eight dimensions, and REMIND improves model accuracy via hierarchical evidence retrieval, grounding, and abstraction.
-
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Extending textual gradients to multi-objective LLM judge optimization shows a 59% drop in gradient task-focus and a 0.085 drop in Spearman rho, due to gradient dilution at optimization time and instruction interference at inference time.
-
Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and Matthews correlation all equal a single number on non-degenerate data, Cohen's κ supplies the extra signal on label-rate drift, and a reporting checklist is provided.
-
Towards Context-Invariant Safety Alignment for Large Language Models
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
-
EmbGen: Teaching with Reassembled Corpora
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
-
A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
-
DWTSumm: Discrete Wavelet Transform for Document Summarization
DWT decomposes sentence- or word-level embeddings into multi-resolution components that preserve semantics for direct or LLM-guided summarization, yielding up to 97% fidelity and gains in BERTScore and semantic metrics over GPT-4o baselines on clinical and legal benchmarks.
-
Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data
An LLM-based topic modeling method with a custom evaluation framework improves topic interpretability, specificity, and polarity consistency over prior approaches when linking corporate review text to external outcomes such as employee morale.
-
Learning to Control Summaries with Score Ranking
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
-
On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
-
Reasoning with Language Model is Planning with World Model
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
-
MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision
MemSlides introduces a three-part memory hierarchy (user profile, working, tool) with scoped local revision for multi-turn personalized slide generation.
-
Cross Paraphrastic Invariance Learning for Hallucination Detection
CPIL is a contrastive two-stage method that enforces paraphrase invariance on limited labeled data to outperform baselines in hallucination detection across 11 tasks.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering
MultiFinRAG is a multimodal RAG framework that improves accuracy on financial QA tasks involving text, tables, and images by 19 percentage points over ChatGPT-4o while running on commodity hardware.
-
Instruction-Following Evaluation for Large Language Models
IFEval is a new benchmark of 25 verifiable instruction types and ~500 prompts for objective, reproducible evaluation of LLMs' instruction-following abilities.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Refining and Reusing Annotation Guidelines for LLM Annotation
An iterative moderation framework refines and reuses annotation guidelines to improve LLM annotation accuracy on biomedical NER tasks across GPT, Gemini, and DeepSeek models.
-
A Community-Based Approach for Stance Distribution and Argument Organization
Unsupervised graph community detection organizes arguments to reveal stance distributions in debates.
-
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
-
Causal Connections: Leveraging Multilingual Fine-Tuning for Financial QA@FinCausal 2026
Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.