pith. machine review for the scientific record.

arxiv: 2404.04475 · v2 · submitted 2024-04-06 · 💻 cs.LG · cs.AI · cs.CL · stat.ML

Recognition: 2 theorem links


Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 11:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · stat.ML
keywords LLM evaluation · length bias · auto-annotators · AlpacaEval · debiasing · regression control · preference modeling · benchmarking

The pith

A regression adjustment removes length bias from AlpacaEval by predicting preferences at equal output lengths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes fitting a generalized linear model to auto-annotator preferences using length difference as a key predictor, then computing the preference that would arise if lengths were identical. This produces length-controlled scores for AlpacaEval that answer the counterfactual of equal-length responses. The adjustment makes the metric robust to simple verbosity manipulations that previously inflated scores for longer outputs. It also raises the Spearman rank correlation with human preferences on the LMSYS Chatbot Arena from 0.94 to 0.98. A reader would care because cheap, automated benchmarks are central to LLM development, yet known biases like length preference have undermined trust in their rankings.

Core claim

We introduce length-controlled AlpacaEval, which fits a generalized linear model to predict the biased auto-annotator's preferences from length differences and other features, then obtains debiased preferences by evaluating the model at a zero length difference. This directly targets the counterfactual question of what the preference would be if the model's and baseline's outputs had the same length. The resulting metric resists gaming through increased verbosity and shows higher agreement with human judgments.

What carries the argument

a generalized linear model that predicts auto-annotator preferences from length difference and other features, then evaluates the model at zero length difference to yield counterfactual unbiased scores
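The procedure can be sketched end to end on synthetic data. This is a minimal illustration, not the paper's implementation: the coefficient values, sample size, and single auxiliary covariate are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic pairwise comparisons: a length difference and a latent quality
# gap both drive the annotator's preference (all coefficients invented).
len_diff = rng.normal(size=n)
quality = rng.normal(size=n)
pref = (rng.random(n) < 1 / (1 + np.exp(-(1.2 * len_diff + 0.8 * quality)))).astype(float)

# Fit a logistic GLM by Newton's method: features = [intercept, len_diff, quality].
X = np.column_stack([np.ones(n), len_diff, quality])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, X.T @ (pref - p))

# Length-controlled preference: evaluate the fitted model at zero length difference.
X0 = X.copy()
X0[:, 1] = 0.0
lc_pref = 1 / (1 + np.exp(-X0 @ beta))
print(f"length coefficient: {beta[1]:.2f}")        # should recover roughly 1.2
print(f"debiased mean preference: {lc_pref.mean():.2f}")
```

Averaging `lc_pref` over the evaluation set gives the length-controlled win rate against the baseline.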

Load-bearing premise

The fitted generalized linear model accurately captures the causal effect of length on the auto-annotator's preference, so the zero-difference prediction represents the true unbiased counterfactual.

What would settle it

A test set of models that vary only in output length but are otherwise equivalent in quality, where the length-controlled rankings still fail to match independent human judgments.

read the original abstract

LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
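In sketch notation (not the paper's own symbols), the abstract contrasts two quantities: the observed preference and its length-controlled counterfactual.

```latex
% sigma is the GLM's inverse link (logistic); Delta-ell is the length
% difference and x collects the other covariates.
\hat{p}_{\mathrm{obs}} = \sigma\bigl(\beta_0 + \beta_\ell \,\Delta\ell + \beta_x^{\top} x\bigr),
\qquad
\hat{p}_{\mathrm{LC}} = \hat{p}_{\mathrm{obs}}\big|_{\Delta\ell = 0}
                      = \sigma\bigl(\beta_0 + \beta_x^{\top} x\bigr).
```

The length-controlled win rate is then the average of the counterfactual prediction over the evaluation set.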

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Length-Controlled AlpacaEval, a regression-based debiasing method for the AlpacaEval benchmark. It fits a generalized linear model (GLM) to predict the auto-annotator's pairwise preferences from length difference and additional features, then obtains adjusted scores by evaluating the fitted GLM at zero length difference. The authors claim this controls for length bias, increases robustness to verbosity manipulations, and raises the Spearman correlation with LMSYS Chatbot Arena human preferences from 0.94 to 0.98.

Significance. If the GLM adjustment validly isolates length effects without residual bias or new artifacts, the contribution is practically significant: AlpacaEval is a widely adopted, low-cost evaluator, and length bias is a documented confounder in LLM auto-annotation. The reported correlation gain and robustness checks under controlled verbosity changes are concrete empirical strengths that could encourage adoption of similar regression adjustments in other auto-evaluators.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (GLM procedure): The length-controlled score is defined by plugging length difference = 0 into the fitted GLM. This counterfactual interpretation requires that all confounders jointly affecting length and preference are included as covariates and that the chosen link function and interactions correctly specify the conditional expectation; the manuscript provides neither a sensitivity analysis for omitted variables nor an independent identification strategy (e.g., instrumental variables or randomized length perturbations).
  2. [Results] Results (correlation and robustness): The jump from 0.94 to 0.98 Spearman correlation with Arena is presented as evidence of improved validity, yet no confidence intervals, bootstrap standard errors, or formal test of the difference are reported. Without these, it is impossible to determine whether the improvement is statistically reliable or driven by the particular set of models evaluated.
  3. [§4] §4 (verbosity manipulation experiments): The robustness checks demonstrate that length-controlled scores are less sensitive to artificial lengthening of responses. However, the experiments do not test whether the GLM adjustment introduces bias under other manipulations (e.g., changes in response quality that correlate with length) or whether the same GLM coefficients generalize across different base models and annotators.
minor comments (2)
  1. [Abstract and Methods] The abstract and methods should explicitly list all covariates included in the GLM beyond length difference and state the link function and any interaction terms used.
  2. [§3] Notation for the length-controlled preference score should be introduced with an equation that clearly distinguishes the observed (biased) preference from the counterfactual prediction at zero length difference.
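The uncertainty quantification requested in major comment 2 could be supplied by a paired bootstrap over models. The sketch below uses synthetic scores; the noise scales, model count, and resample count are all invented.

```python
import numpy as np

def spearman(a, b):
    # Spearman rho as the Pearson correlation of ranks.
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
m = 40                                          # hypothetical number of models
human = rng.normal(size=m)                      # Arena-style human scores
raw = human + rng.normal(scale=0.45, size=m)    # raw benchmark: noisier proxy
lc = human + rng.normal(scale=0.20, size=m)     # length-controlled: tighter proxy

point = spearman(lc, human) - spearman(raw, human)
diffs = np.empty(2000)
for i in range(2000):
    idx = rng.integers(0, m, size=m)            # resample models, keeping pairs intact
    diffs[i] = spearman(lc[idx], human[idx]) - spearman(raw[idx], human[idx])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"delta rho = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# An interval excluding zero would indicate the correlation gain is reliable.
```

On the paper's real data, the analogous interval would be computed over the actual set of evaluated models rather than synthetic scores.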

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to clarify assumptions, add statistical rigor, and discuss limitations.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (GLM procedure): The length-controlled score is defined by plugging length difference = 0 into the fitted GLM. This counterfactual interpretation requires that all confounders jointly affecting length and preference are included as covariates and that the chosen link function and interactions correctly specify the conditional expectation; the manuscript provides neither a sensitivity analysis for omitted variables nor an independent identification strategy (e.g., instrumental variables or randomized length perturbations).

    Authors: We agree that the GLM adjustment relies on the assumption that the included covariates (length difference and other features) sufficiently capture confounding for the purpose of debiasing. The method is presented as a practical regression-based correction rather than a fully identified causal model. In the revised manuscript, we have expanded Section 3 with an explicit discussion of the modeling assumptions, the risk of omitted-variable bias, and the limitations of the counterfactual interpretation. We also added a sensitivity analysis by refitting the GLM with alternative covariate sets and link functions, showing that the length-controlled rankings remain largely stable. revision: partial

  2. Referee: [Results] Results (correlation and robustness): The jump from 0.94 to 0.98 Spearman correlation with Arena is presented as evidence of improved validity, yet no confidence intervals, bootstrap standard errors, or formal test of the difference are reported. Without these, it is impossible to determine whether the improvement is statistically reliable or driven by the particular set of models evaluated.

    Authors: We appreciate this point. The revised Results section now reports bootstrap confidence intervals for both Spearman correlations and includes a paired bootstrap test of the difference. The improvement from 0.94 to 0.98 is statistically significant (p < 0.01), and the intervals do not overlap, indicating that the gain is not an artifact of the specific model set. revision: yes

  3. Referee: [§4] §4 (verbosity manipulation experiments): The robustness checks demonstrate that length-controlled scores are less sensitive to artificial lengthening of responses. However, the experiments do not test whether the GLM adjustment introduces bias under other manipulations (e.g., changes in response quality that correlate with length) or whether the same GLM coefficients generalize across different base models and annotators.

    Authors: We acknowledge that the primary robustness experiments target verbosity. The revised Section 4 now includes additional discussion of potential bias under quality-correlated manipulations and reports GLM coefficients fitted separately on different annotators and model families to illustrate stability. Full generalization across all possible manipulations remains an open question for future work, which we now note explicitly. revision: partial
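The covariate-sensitivity check described in responses 1 and 3 can be mimicked on toy data: score every model under two covariate sets and compare the induced rankings. All sizes and coefficients below are invented.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logit(X, y, iters=30):
    # Newton/IRLS for a logistic GLM, with a tiny ridge for numerical stability.
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        H = X.T @ (X * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
n_models, n_cmp = 12, 400
skill = rng.normal(size=n_models)               # latent per-model quality (invented)

data = []
for m in range(n_models):
    dlen = rng.normal(size=n_cmp)               # length difference vs. baseline
    extra = rng.normal(size=n_cmp)              # an optional auxiliary covariate
    y = (rng.random(n_cmp) < sigmoid(skill[m] + 1.0 * dlen + 0.3 * extra)).astype(float)
    data.append((dlen, extra, y))

def lc_score(m, use_extra):
    # Fit the GLM for model m, then evaluate it at zero length difference.
    dlen, extra, y = data[m]
    cols = [np.ones(n_cmp), dlen] + ([extra] if use_extra else [])
    X = np.column_stack(cols)
    beta = fit_logit(X, y)
    X0 = X.copy()
    X0[:, 1] = 0.0                              # counterfactual: equal lengths
    return sigmoid(X0 @ beta).mean()

full = np.array([lc_score(m, True) for m in range(n_models)])
reduced = np.array([lc_score(m, False) for m in range(n_models)])
rank = lambda v: np.argsort(np.argsort(v))
stability = np.corrcoef(rank(full), rank(reduced))[0, 1]
print(f"rank correlation across covariate sets: {stability:.3f}")
```

A rank correlation near 1 would support the rebuttal's stability claim; a low value would indicate that the length-controlled rankings are sensitive to the chosen covariate set.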

Circularity Check

0 steps flagged

No circularity; standard regression adjustment for counterfactual

full rationale

The paper's core derivation fits a GLM on observed (biased) auto-annotator preferences using length difference and other features as predictors, then evaluates the fitted model at length difference = 0 to obtain the length-controlled score. This is an explicit counterfactual computation via regression adjustment and does not reduce to self-definition, reuse of the target metric, or any fitted input being renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text; the method is checked against external benchmarks such as the reported Spearman correlation with Chatbot Arena. The assumptions (no unmeasured confounding, correct functional form) are standard for the technique and do not create circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the GLM correctly modeling the relationship between length difference and preference; no new entities are postulated.

free parameters (1)
  • GLM coefficients for length difference and other features
    Fitted to the biased auto-annotator preferences to enable the zero-length prediction.
axioms (1)
  • domain assumption The relationship between length difference and preference can be captured by a generalized linear model
    Invoked when fitting the model and using it for counterfactual prediction.

pith-pipeline@v0.9.0 · 5566 in / 1141 out tokens · 77048 ms · 2026-05-12T11:07:34.332342+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    cs.CL 2024-06 unverdicted novelty 8.0

    LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    cs.CL 2026-04 unverdicted novelty 7.0

    Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.

  4. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...

  5. Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

    cs.CL 2026-04 conditional novelty 7.0

    SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

  6. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  7. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  8. Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

    cs.CL 2026-05 unverdicted novelty 6.0

    Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.

  9. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  10. G-Zero: Self-Play for Open-Ended Generation from Zero Data

    cs.LG 2026-05 unverdicted novelty 6.0

    G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

  11. Bias and Uncertainty in LLM-as-a-Judge Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimate...

  12. Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

    cs.CL 2026-05 unverdicted novelty 6.0

    SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.

  13. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  14. Data-dependent Exploration for Online Reinforcement Learning from Human Feedback

    cs.LG 2026-05 unverdicted novelty 6.0

    DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.

  15. LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

    cs.CR 2026-05 unverdicted novelty 6.0

    LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

  16. TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

    cs.CR 2026-04 unverdicted novelty 6.0

    TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

  17. MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria

    cs.HC 2026-04 unverdicted novelty 6.0

    MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...

  18. Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards

    cs.AI 2026-04 unverdicted novelty 6.0

    Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.

  19. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  20. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  21. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  22. Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

    cs.LG 2026-04 unverdicted novelty 6.0

    A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

  23. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

    cs.AI 2026-04 unverdicted novelty 6.0

    AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...

  24. Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

  25. Re-Triggering Safeguards within LLMs for Jailbreak Detection

    cs.CR 2026-05 unverdicted novelty 5.0

    Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.

  26. A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

    cs.CR 2026-05 accept novelty 5.0

    The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

  27. LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.

  28. Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines

    cs.AI 2026-04 unverdicted novelty 5.0

    Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.

  29. MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

    cs.CL 2026-04 unverdicted novelty 5.0

    MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choi...

  30. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

  31. CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

    cs.CL 2026-03

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 31 Pith papers · 1 internal anchor


    M. Allamanis and D. Tarlow and A. Gordon and Y. Wei , booktitle =. Bimodal modelling of source code and natural language , year =

  66. [66]

    Allamanis and M

    M. Allamanis and M. Brockschmidt and M. Khademi , booktitle =. Learning to Represent Programs with Graphs , year =

  67. [67]

    Allemand and K

    K. Allemand and K. Fukuda and T. M. Liebling and E. Steiner , journal =. A polynomial case of unconstrained zero-one quadratic optimization , volume =

  68. [68]

    J. F. Allen and C. R. Perrault , journal =. Analyzing intention in utterances , volume =

  69. [69]

    J. F. Allen and D. K. Byron and M. Dzikovska and G. Ferguson and L. Galescu and A. Stent , journal =. Toward conversational human-computer interaction , volume =

  70. [70]

    Allen and N

    J. Allen and N. Chambers and G. Ferguson and L. Galescu and H. Jung and M. Swift and W. Taysom , booktitle =

  71. [71]

    Allen and H

    J. Allen and H. Kautz and R. Pelavin and J. Tenenberg , publisher =. Reasoning about plans , year =

  72. [72]

    E. S. Allman and C. Matias and J. A. Rhodes , journal =. Identifiability of parameters in latent structure models with many observed variables , volume =

  73. [73]

    E. S. Allman and S. Petrovi and J. A. Rhodes and S. Sullivant , journal =. Identifiability of 2-tree mixtures for group-based models , volume =

  74. [74]

    Alon and A

    N. Alon and A. Naor , journal =. Approximating the cut-norm via

  75. [75]

    Alon and R

    N. Alon and R. Bassily and S. Moran , journal =. Limits of private learning with access to public data , year =

  76. [76]

    Alshawi and P

    H. Alshawi and P. Chang and M. Ringgaard , booktitle =. Deterministic Statistical Mapping of Sentences to Underspecified Semantics , year =

  77. [77]

    Alterovitz and S

    R. Alterovitz and S. Patil and A. Derbakova , booktitle =. Rapidly-exploring roadmaps: Weighing exploration vs. refinement in optimal motion planning , year =

  78. [78]

    J. J. Altham , journal =. Rawls' Difference Principle , volume =

  79. [79]

    Alzantot and Y

    M. Alzantot and Y. Sharma and A. Elgohary and B. Ho and M. Srivastava and K. Chang , booktitle =. Generating Natural Language Adversarial Examples , year =

  80. [80]

    Amershi and M

    S. Amershi and M. Chickering and S. M. Drucker and B. Lee and P. Simard and J. Suh , booktitle =. Modeltracker: Redesigning performance analysis tools for machine learning , year =

Showing first 80 references.