Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

2024 · cs.LG · arXiv 2404.04475

Mixed citation behavior. The most common role is background (33% of classified citations).

80 Pith papers citing it

abstract

LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
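The counterfactual step described in the abstract can be illustrated with a minimal sketch. It assumes a simplified GLM with only an intercept (standing in for model quality) and a single length-difference feature; the paper's own model also uses other relevant features. The function names, the tanh squashing of the length difference, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_synthetic(n=500, true_theta=0.0, true_phi=1.5, seed=0):
    """Hypothetical data: the model tends to be longer than the baseline,
    and the annotator's preference rises with the length difference."""
    rng = random.Random(seed)
    len_feats, prefs = [], []
    for _ in range(n):
        x = math.tanh(rng.gauss(0.8, 1.0))  # squashed length difference
        len_feats.append(x)
        prefs.append(sigmoid(true_theta + true_phi * x))  # soft preference
    return len_feats, prefs


def fit_glm(len_feats, prefs, lr=0.5, steps=2000):
    """Fit p = sigmoid(theta + phi * len_feat) by gradient descent on log loss."""
    theta, phi = 0.0, 0.0
    n = len(prefs)
    for _ in range(steps):
        g_theta = g_phi = 0.0
        for x, y in zip(len_feats, prefs):
            err = sigmoid(theta + phi * x) - y
            g_theta += err
            g_phi += err * x
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    return theta, phi


def length_controlled_win_rate(theta, phi, len_feats):
    """Predict each preference with the length feature set to zero."""
    return sum(sigmoid(theta + phi * 0.0) for _ in len_feats) / len(len_feats)


len_feats, prefs = make_synthetic()
theta, phi = fit_glm(len_feats, prefs)
raw_win_rate = sum(prefs) / len(prefs)
lc_win_rate = length_controlled_win_rate(theta, phi, len_feats)
print(f"raw win rate: {raw_win_rate:.3f}")
print(f"length-controlled win rate: {lc_win_rate:.3f}")
```

In this synthetic setup the two models are equally good by construction, so the raw win rate is inflated purely by verbosity, while the length-controlled estimate should fall back toward one half.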

citation-role summary

background 3 · dataset 3 · baseline 1 · method 1 · other 1

citation-polarity summary

background 3 · use dataset 3 · baseline 1 · unclear 1 · use method 1

representative citing papers

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents CV-Agent.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius and is tied to the singular values of the embedding matrix. It also motivates a new regularizer that improves the jailbreak robustness-utility tradeoff of real LLMs.

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

cs.CL · 2026-04-08 · conditional · novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

cs.CL · 2026-03-02 · unverdicted · novelty 7.0

CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.

Improving Sampling for Masked Diffusion Models via Information Gain

cs.CL · 2026-02-20 · unverdicted · novelty 7.0

Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.

Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

cs.CL · 2026-02-10 · unverdicted · novelty 7.0

Top-W applies Wasserstein-regularized truncation on token-embedding geometry to create a closed-form optimal crop for LLM sampling that outperforms prior methods by up to 33.7% on GSM8K, GPQA, AlpacaEval, and MT-Bench.

VIDEOP2R: Video Understanding from Perception to Reasoning

cs.CV · 2025-11-14 · conditional · novelty 7.0

VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

cs.LG · 2025-05-19 · conditional · novelty 7.0

A well-tuned kNN router matches or exceeds state-of-the-art learned routers on new standardized benchmarks spanning instruction, QA, reasoning, and the first multi-modal visual routing dataset, due to locality of model performance in embedding space.

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

cs.CL · 2024-06-12 · unverdicted · novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Cognitive World Models for Process-Level Social Influence Evaluation

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

CogWM is a new LLM user model for evaluating social influence by predicting and tracking cognitive state evolution in dialogues, trained on 150k samples and shown to differentiate AI agents effectively.

What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Proposes SCSuff metric for evaluating LLM explanation sufficiency via model-generated alternative inputs, showing explanations are typically insufficient and predictable from hidden states.

Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

Curating concise data for VLMs induces brevity, delivering 35x lower Cost-of-Pass at near-identical accuracy and higher matched-length accuracy than uncurated baselines.

FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

The FOXGLOVE dataset of 2,340 comments shows that LLMs and instructors align on feedback goals and positions but diverge on sentence selection. LLMs use more complex language and fewer questions, and higher quality ratings are driven by comment length.

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

SEAL revives saturated benchmarks via adaptive LLM meta-judging in elimination matches, matching full pairwise accuracy with roughly half the calls across code, math, QA, and agent tasks.

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Persona prompting trades expertise depth for reduced clarity in LLM answers and works best on advisory questions in medicine and psychology.

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

RUBRIC-ARROW is an alternating rubric generator and judge framework that uses probability-based scoring and pairwise preferences to improve pointwise reward modeling accuracy for LLM post-training in non-verifiable domains.

AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

AdaDPO uses self-adaptive stop-gradient coefficients to balance preferred and dispreferred gradients in DPO, achieving higher AlpacaEval 2 win rates than standard DPO on Llama-3-8B-Instruct.

citing papers explorer

Showing 30 of 80 citing papers.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · none · ref 211 · internal anchor

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

cs.CL · 2025-05-16 · conditional · none · ref 27 · internal anchor

XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.

The Differences Between Direct Alignment Algorithms are a Blur

cs.LG · 2025-02-03 · unverdicted · none · ref 14 · internal anchor

A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · none · ref 58 · internal anchor

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

cs.SE · 2024-06-17 · unverdicted · none · ref 6 · internal anchor

An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.

Mixture-of-Agents Enhances Large Language Model Capabilities

cs.CL · 2024-06-07 · unverdicted · none · ref 7 · internal anchor

A layered Mixture-of-Agents system combining multiple LLMs achieves state-of-the-art results on AlpacaEval 2.0 (65.1%), MT-Bench, and FLASK, outperforming GPT-4 Omni.

Can MLLMs Critique Like Humans? Evaluating Open-Ended Aesthetic Reasoning in Multimodal Large Language Models

cs.CL · 2026-06-29 · unverdicted · none · ref 4 · internal anchor

MLLMs generate verbose, comprehensive, and repetitive aesthetic critiques unlike selective human ones, and reference-based metrics fail to detect this because they capture model house style instead of image-specific content.

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

cs.LG · 2026-05-30 · unverdicted · none · ref 30 · internal anchor

CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

cs.MM · 2026-05-29 · unverdicted · none · ref 9 · internal anchor

Pilot evaluation of language-specific versus multilingual LoRA adapters on Qwen2.5-VL-3B for curator-guided BLV art descriptions in three languages.

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

cs.AI · 2026-05-28 · unverdicted · none · ref 4 · internal anchor

TRACE is a new metric for assessing LLM CoT reasoning structure via Toulmin and Flavell frameworks, showing r=0.74 correlation with accuracy on 26.3K samples and utility as an RL reward.

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

cs.AI · 2026-05-27 · unverdicted · none · ref 4 · internal anchor

Proposes a minimum measurement standard for LLM-as-a-judge in multi-hop RAG that fixes budgets and requires cluster-aware inference, showing it alters which baseline comparisons remain significant.

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

cs.CL · 2026-05-25 · unverdicted · none · ref 8 · internal anchor

CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

cs.CL · 2026-05-20 · unverdicted · none · ref 8 · internal anchor

LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.

Re-Triggering Safeguards within LLMs for Jailbreak Detection

cs.CR · 2026-05-11 · unverdicted · none · ref 4 · internal anchor

Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

cs.LG · 2026-05-06 · unverdicted · none · ref 13 · 2 links · internal anchor

DEPO constructs uncertainty bonuses from historical data for exploration in online RLHF and provides a data-dependent regret bound that adapts to task hardness.

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

cs.CR · 2026-05-04 · accept · none · ref 34 · internal anchor

The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

cs.SE · 2026-04-30 · unverdicted · none · ref 27 · internal anchor

LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

cs.CL · 2026-04-07 · unverdicted · none · ref 4 · internal anchor

MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choice affects scores.

What Is Preference Optimization Doing, and Why?

cs.LG · 2025-11-30 · unverdicted · none · ref 44 · internal anchor

Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

cs.LG · 2025-10-27 · unverdicted · none · ref 24 · internal anchor

GIFT matches the optimal policy of GRPO using an endogenous prompt-dependent KL coefficient derived via z-score standardization of implicit rewards.

Proximal Supervised Fine-Tuning

cs.LG · 2025-08-25 · unverdicted · none · ref 6 · internal anchor

PSFT modifies supervised fine-tuning by incorporating trust-region ideas from RL to constrain policy changes, yielding better out-of-domain generalization in math and human-value tasks without entropy collapse.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · none · ref 21 · internal anchor

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

cs.CL · 2025-05-20 · unverdicted · none · ref 4 · internal anchor

InfiGFusion introduces graph-on-logits distillation with an O(n log n) Gromov-Wasserstein approximation to fuse LLMs by modeling token co-activations, reporting gains over baselines on 11 benchmarks.

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong

cs.CL · 2025-01-16 · unverdicted · none · ref 5 · internal anchor

Reasoning before answering MCQs increases LLM confidence more for incorrect answers and degrades calibration on a 57-subject benchmark across seven models.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

cs.CL · 2024-08-28 · unverdicted · none · ref 12 · internal anchor

WildFeedback extracts preference pairs from in-situ user feedback in LLM conversations to fine-tune models for better alignment with real user preferences.

Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation

cs.SE · 2026-06-27 · unverdicted · none · ref 11 · internal anchor

Empirical study on five LLMs finds pretrained-to-aligned paths yield bigger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · none · ref 56 · internal anchor

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

cs.AI · 2026-04-25 · unreviewed · ref 3 · internal anchor

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

cs.AI · 2026-04-09 · unreviewed · ref 12 · internal anchor

Reinforcement Learning from Human Feedback

cs.LG · 2025-04-16 · unreviewed · ref 79 · internal anchor
