pith. machine review for the scientific record.

arxiv: 2404.04475 · v2 · submitted 2024-04-06 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Recognition: 2 theorem links


Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 11:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML
keywords LLM evaluation · length bias · auto-annotators · AlpacaEval · debiasing · regression control · preference modeling · benchmarking

The pith

A regression adjustment removes length bias from AlpacaEval by predicting preferences at equal output lengths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes fitting a generalized linear model to auto-annotator preferences using length difference as a key predictor, then computing the preference that would arise if lengths were identical. This produces length-controlled scores for AlpacaEval that answer the counterfactual of equal-length responses. The adjustment makes the metric robust to simple verbosity manipulations that previously inflated scores for longer outputs. It also raises the Spearman rank correlation with human preferences on the LMSYS Chatbot Arena from 0.94 to 0.98. A reader would care because cheap, automated benchmarks are central to LLM development, yet known biases like length preference have undermined trust in their rankings.

Core claim

We introduce length-controlled AlpacaEval, which fits a generalized linear model to predict the biased auto-annotator's preferences from length differences and other features, then obtains debiased preferences by evaluating the model at a zero length difference. This directly targets the counterfactual question of what the preference would be if the model's and baseline's outputs had the same length. The resulting metric resists gaming through increased verbosity and shows higher agreement with human judgments.

What carries the argument

a generalized linear model that predicts auto-annotator preferences from length difference and other features, then evaluates the model at zero length difference to yield counterfactual unbiased scores
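The procedure can be sketched end to end on synthetic data. This is a minimal illustration, not the paper's implementation: the coefficient values, sample size, and single auxiliary covariate are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic pairwise comparisons: a length difference and a latent quality
# gap both drive the annotator's preference (all coefficients invented).
len_diff = rng.normal(size=n)
quality = rng.normal(size=n)
pref = (rng.random(n) < 1 / (1 + np.exp(-(1.2 * len_diff + 0.8 * quality)))).astype(float)

# Fit a logistic GLM by Newton's method: features = [intercept, len_diff, quality].
X = np.column_stack([np.ones(n), len_diff, quality])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, X.T @ (pref - p))

# Length-controlled preference: evaluate the fitted model at zero length difference.
X0 = X.copy()
X0[:, 1] = 0.0
lc_pref = 1 / (1 + np.exp(-X0 @ beta))
print(f"length coefficient: {beta[1]:.2f}")        # should recover roughly 1.2
print(f"debiased mean preference: {lc_pref.mean():.2f}")
```

Averaging `lc_pref` over the evaluation set gives the length-controlled win rate against the baseline.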

Load-bearing premise

The fitted generalized linear model accurately captures the causal effect of length on the auto-annotator's preference, so the zero-difference prediction represents the true unbiased counterfactual.

What would settle it

A test set of models that vary only in output length but are otherwise equivalent in quality, where the length-controlled rankings still fail to match independent human judgments.

read the original abstract

LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
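In sketch notation (not the paper's own symbols), the abstract contrasts two quantities: the observed preference and its length-controlled counterfactual.

```latex
% sigma is the GLM's inverse link (logistic); Delta-ell is the length
% difference and x collects the other covariates.
\hat{p}_{\mathrm{obs}} = \sigma\bigl(\beta_0 + \beta_\ell \,\Delta\ell + \beta_x^{\top} x\bigr),
\qquad
\hat{p}_{\mathrm{LC}} = \hat{p}_{\mathrm{obs}}\big|_{\Delta\ell = 0}
                      = \sigma\bigl(\beta_0 + \beta_x^{\top} x\bigr).
```

The length-controlled win rate is then the average of the counterfactual prediction over the evaluation set.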

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Length-Controlled AlpacaEval, a regression-based debiasing method for the AlpacaEval benchmark. It fits a generalized linear model (GLM) to predict the auto-annotator's pairwise preferences from length difference and additional features, then obtains adjusted scores by evaluating the fitted GLM at zero length difference. The authors claim this controls for length bias, increases robustness to verbosity manipulations, and raises the Spearman correlation with LMSYS Chatbot Arena human preferences from 0.94 to 0.98.

Significance. If the GLM adjustment validly isolates length effects without residual bias or new artifacts, the contribution is practically significant: AlpacaEval is a widely adopted, low-cost evaluator, and length bias is a documented confounder in LLM auto-annotation. The reported correlation gain and robustness checks under controlled verbosity changes are concrete empirical strengths that could encourage adoption of similar regression adjustments in other auto-evaluators.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (GLM procedure): The length-controlled score is defined by plugging length difference = 0 into the fitted GLM. This counterfactual interpretation requires that all confounders jointly affecting length and preference are included as covariates and that the chosen link function and interactions correctly specify the conditional expectation; the manuscript provides neither a sensitivity analysis for omitted variables nor an independent identification strategy (e.g., instrumental variables or randomized length perturbations).
  2. [Results] Results (correlation and robustness): The jump from 0.94 to 0.98 Spearman correlation with Arena is presented as evidence of improved validity, yet no confidence intervals, bootstrap standard errors, or formal test of the difference are reported. Without these, it is impossible to determine whether the improvement is statistically reliable or driven by the particular set of models evaluated.
  3. [§4] §4 (verbosity manipulation experiments): The robustness checks demonstrate that length-controlled scores are less sensitive to artificial lengthening of responses. However, the experiments do not test whether the GLM adjustment introduces bias under other manipulations (e.g., changes in response quality that correlate with length) or whether the same GLM coefficients generalize across different base models and annotators.
minor comments (2)
  1. [Abstract and Methods] The abstract and methods should explicitly list all covariates included in the GLM beyond length difference and state the link function and any interaction terms used.
  2. [§3] Notation for the length-controlled preference score should be introduced with an equation that clearly distinguishes the observed (biased) preference from the counterfactual prediction at zero length difference.
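The uncertainty quantification requested in major comment 2 could be supplied by a paired bootstrap over models. The sketch below uses synthetic scores; the noise scales, model count, and resample count are all invented.

```python
import numpy as np

def spearman(a, b):
    # Spearman rho as the Pearson correlation of ranks.
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
m = 40                                          # hypothetical number of models
human = rng.normal(size=m)                      # Arena-style human scores
raw = human + rng.normal(scale=0.45, size=m)    # raw benchmark: noisier proxy
lc = human + rng.normal(scale=0.20, size=m)     # length-controlled: tighter proxy

point = spearman(lc, human) - spearman(raw, human)
diffs = np.empty(2000)
for i in range(2000):
    idx = rng.integers(0, m, size=m)            # resample models, keeping pairs intact
    diffs[i] = spearman(lc[idx], human[idx]) - spearman(raw[idx], human[idx])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"delta rho = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# An interval excluding zero would indicate the correlation gain is reliable.
```

On the paper's real data, the analogous interval would be computed over the actual set of evaluated models rather than synthetic scores.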

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to clarify assumptions, add statistical rigor, and discuss limitations.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (GLM procedure): The length-controlled score is defined by plugging length difference = 0 into the fitted GLM. This counterfactual interpretation requires that all confounders jointly affecting length and preference are included as covariates and that the chosen link function and interactions correctly specify the conditional expectation; the manuscript provides neither a sensitivity analysis for omitted variables nor an independent identification strategy (e.g., instrumental variables or randomized length perturbations).

    Authors: We agree that the GLM adjustment relies on the assumption that the included covariates (length difference and other features) sufficiently capture confounding for the purpose of debiasing. The method is presented as a practical regression-based correction rather than a fully identified causal model. In the revised manuscript, we have expanded Section 3 with an explicit discussion of the modeling assumptions, the risk of omitted-variable bias, and the limitations of the counterfactual interpretation. We also added a sensitivity analysis by refitting the GLM with alternative covariate sets and link functions, showing that the length-controlled rankings remain largely stable. revision: partial

  2. Referee: [Results] Results (correlation and robustness): The jump from 0.94 to 0.98 Spearman correlation with Arena is presented as evidence of improved validity, yet no confidence intervals, bootstrap standard errors, or formal test of the difference are reported. Without these, it is impossible to determine whether the improvement is statistically reliable or driven by the particular set of models evaluated.

    Authors: We appreciate this point. The revised Results section now reports bootstrap confidence intervals for both Spearman correlations and includes a paired bootstrap test of the difference. The improvement from 0.94 to 0.98 is statistically significant (p < 0.01), and the intervals do not overlap, indicating that the gain is not an artifact of the specific model set. revision: yes

  3. Referee: [§4] §4 (verbosity manipulation experiments): The robustness checks demonstrate that length-controlled scores are less sensitive to artificial lengthening of responses. However, the experiments do not test whether the GLM adjustment introduces bias under other manipulations (e.g., changes in response quality that correlate with length) or whether the same GLM coefficients generalize across different base models and annotators.

    Authors: We acknowledge that the primary robustness experiments target verbosity. The revised Section 4 now includes additional discussion of potential bias under quality-correlated manipulations and reports GLM coefficients fitted separately on different annotators and model families to illustrate stability. Full generalization across all possible manipulations remains an open question for future work, which we now note explicitly. revision: partial
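The covariate-sensitivity check described in responses 1 and 3 can be mimicked on toy data: score every model under two covariate sets and compare the induced rankings. All sizes and coefficients below are invented.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logit(X, y, iters=30):
    # Newton/IRLS for a logistic GLM, with a tiny ridge for numerical stability.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
n_models, n_cmp = 12, 400
skill = rng.normal(size=n_models)               # latent per-model quality (invented)

data = []
for m in range(n_models):
    dlen = rng.normal(size=n_cmp)               # length difference vs. baseline
    extra = rng.normal(size=n_cmp)              # an optional auxiliary covariate
    y = (rng.random(n_cmp) < sigmoid(skill[m] + 1.0 * dlen + 0.3 * extra)).astype(float)
    data.append((dlen, extra, y))

def lc_score(m, use_extra):
    # Fit the GLM for model m, then evaluate it at zero length difference.
    dlen, extra, y = data[m]
    cols = [np.ones(n_cmp), dlen] + ([extra] if use_extra else [])
    X = np.column_stack(cols)
    beta = fit_logit(X, y)
    X0 = X.copy()
    X0[:, 1] = 0.0                              # counterfactual: equal lengths
    return sigmoid(X0 @ beta).mean()

full = np.array([lc_score(m, True) for m in range(n_models)])
reduced = np.array([lc_score(m, False) for m in range(n_models)])
rank = lambda v: np.argsort(np.argsort(v))
stability = np.corrcoef(rank(full), rank(reduced))[0, 1]
print(f"rank correlation across covariate sets: {stability:.3f}")
```

A rank correlation near 1 would support the rebuttal's stability claim; a low value would indicate that the length-controlled rankings are sensitive to the chosen covariate set.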

Circularity Check

0 steps flagged

No circularity; standard regression adjustment for counterfactual

full rationale

The paper's core derivation fits a GLM on observed (biased) auto-annotator preferences using length difference and other features as predictors, then evaluates the fitted model at length difference = 0 to obtain the length-controlled score. This is an explicit counterfactual computation via regression adjustment and does not reduce to self-definition, reuse of the target metric, or any fitted input being renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text; the method is checked against external benchmarks such as the reported Spearman correlation with Chatbot Arena. The assumptions (no unmeasured confounding, correct functional form) are standard for the technique and do not create circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the GLM correctly modeling the relationship between length difference and preference; no new entities are postulated.

free parameters (1)
  • GLM coefficients for length difference and other features
    Fitted to the biased auto-annotator preferences to enable the zero-length prediction.
axioms (1)
  • domain assumption The relationship between length difference and preference can be captured by a generalized linear model
    Invoked when fitting the model and using it for counterfactual prediction.

pith-pipeline@v0.9.0 · 5566 in / 1141 out tokens · 77048 ms · 2026-05-12T11:07:34.332342+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    cs.CL 2026-04 unverdicted novelty 7.0

    Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

  4. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...

  5. Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

    cs.CL 2026-04 conditional novelty 7.0

    SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

  6. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  7. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  8. Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

    cs.CL 2026-05 unverdicted novelty 6.0

    Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

  9. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  10. G-Zero: Self-Play for Open-Ended Generation from Zero Data

    cs.LG 2026-05 unverdicted novelty 6.0

    G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

  11. Bias and Uncertainty in LLM-as-a-Judge Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimate...

  12. Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

    cs.CL 2026-05 unverdicted novelty 6.0

    SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

  13. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  14. Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

    cs.LG 2026-05 unverdicted novelty 6.0

    DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.

  15. LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

    cs.CR 2026-05 unverdicted novelty 6.0

    LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

  16. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  17. MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

    cs.HC 2026-04 unverdicted novelty 6.0

    MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...

  18. Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

    cs.AI 2026-04 unverdicted novelty 6.0

    Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.

  19. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  20. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  21. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  22. Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

    cs.LG 2026-04 unverdicted novelty 6.0

    A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

  23. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

    cs.AI 2026-04 unverdicted novelty 6.0

    AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...

  24. Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

  25. Re-Triggering Safeguards within LLMs for Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 5.0

    Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

  26. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  27. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  28. Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

    cs.AI 2026-04 unverdicted novelty 5.0

    Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

  29. MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

    cs.CL 2026-04 unverdicted novelty 5.0

    MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choi...

  30. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

  31. CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

    cs.CL 2026-03

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 31 Pith papers · 1 internal anchor


    M. Allamanis and D. Tarlow and A. Gordon and Y. Wei , booktitle =. Bimodal modelling of source code and natural language , year =

  66. [66]

    Allamanis and M

    M. Allamanis and M. Brockschmidt and M. Khademi , booktitle =. Learning to Represent Programs with Graphs , year =

  67. [67]

    Allemand and K

    K. Allemand and K. Fukuda and T. M. Liebling and E. Steiner , journal =. A polynomial case of unconstrained zero-one quadratic optimization , volume =

  68. [68]

    J. F. Allen and C. R. Perrault , journal =. Analyzing intention in utterances , volume =

  69. [69]

    J. F. Allen and D. K. Byron and M. Dzikovska and G. Ferguson and L. Galescu and A. Stent , journal =. Toward conversational human-computer interaction , volume =

  70. [70]

    Allen and N

    J. Allen and N. Chambers and G. Ferguson and L. Galescu and H. Jung and M. Swift and W. Taysom , booktitle =

  71. [71]

    Allen and H

    J. Allen and H. Kautz and R. Pelavin and J. Tenenberg , publisher =. Reasoning about plans , year =

  72. [72]

    E. S. Allman and C. Matias and J. A. Rhodes , journal =. Identifiability of parameters in latent structure models with many observed variables , volume =

  73. [73]

    E. S. Allman and S. Petrovi and J. A. Rhodes and S. Sullivant , journal =. Identifiability of 2-tree mixtures for group-based models , volume =

  74. [74]

    Alon and A

    N. Alon and A. Naor , journal =. Approximating the cut-norm via

  75. [75]

    Alon and R

    N. Alon and R. Bassily and S. Moran , journal =. Limits of private learning with access to public data , year =

  76. [76]

    Alshawi and P

    H. Alshawi and P. Chang and M. Ringgaard , booktitle =. Deterministic Statistical Mapping of Sentences to Underspecified Semantics , year =

  77. [77]

    Alterovitz and S

    R. Alterovitz and S. Patil and A. Derbakova , booktitle =. Rapidly-exploring roadmaps: Weighing exploration vs. refinement in optimal motion planning , year =

  78. [78]

    J. J. Altham , journal =. Rawls' Difference Principle , volume =

  79. [79]

    Alzantot and Y

    M. Alzantot and Y. Sharma and A. Elgohary and B. Ho and M. Srivastava and K. Chang , booktitle =. Generating Natural Language Adversarial Examples , year =

  80. [80]

    Amershi and M

    S. Amershi and M. Chickering and S. M. Drucker and B. Lee and P. Simard and J. Suh , booktitle =. Modeltracker: Redesigning performance analysis tools for machine learning , year =

Showing first 80 references.