Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Recognition: 2 theorem links
Pith reviewed 2026-05-12 11:07 UTC · model grok-4.3
The pith
A regression adjustment removes length bias from AlpacaEval by predicting preferences at equal output lengths.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce length-controlled AlpacaEval, which fits a generalized linear model to predict the biased auto-annotator's preferences from length differences and other features, then obtains debiased preferences by evaluating the model at a zero length difference. This directly targets the counterfactual question of what the preference would be if the model's and baseline's outputs had the same length. The resulting metric resists gaming through increased verbosity and shows higher agreement with human judgments.
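The mechanics can be sketched on synthetic data: fit a logistic GLM on pairwise preferences, then evaluate the fitted model with the length-difference feature zeroed. The `quality_diff` feature and all coefficients below are illustrative assumptions, not the paper's actual covariates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated pairwise comparisons: the annotator prefers the model's output
# based on a true quality gap plus a spurious length bias. (Synthetic data
# for illustration; the paper fits real auto-annotator labels and features.)
quality_diff = rng.normal(0.3, 1.0, n)   # model minus baseline quality (assumed)
length_diff = rng.normal(0.5, 1.0, n)    # standardized length difference
logit = 1.0 * quality_diff + 0.8 * length_diff
pref = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)  # 1 = model wins

# Fit a logistic GLM: P(pref) = sigmoid(b0 + b1*quality_diff + b2*length_diff),
# here by plain gradient ascent on the mean log-likelihood.
X = np.column_stack([np.ones(n), quality_diff, length_diff])
w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (pref - p) / n

# Length-controlled preference: evaluate the fitted GLM with the
# length-difference column set to zero (the "same length" counterfactual).
X0 = X.copy()
X0[:, 2] = 0.0
raw_win = pref.mean()
lc_win = (1 / (1 + np.exp(-X0 @ w))).mean()
print(round(raw_win, 3), round(lc_win, 3))
```

Because the simulated model is systematically more verbose (positive mean length difference) and the annotator rewards length, the length-controlled win rate comes out below the raw win rate.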
What carries the argument
A generalized linear model that predicts auto-annotator preferences from length difference and other features; evaluating the fitted model at zero length difference yields counterfactual, debiased scores.
Load-bearing premise
The fitted generalized linear model accurately captures the causal effect of length on the auto-annotator's preference, so the zero-difference prediction represents the true unbiased counterfactual.
What would settle it
A test set of models that vary only in output length but are otherwise equivalent in quality, where the length-controlled rankings still fail to match independent human judgments.
original abstract
LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Length-Controlled AlpacaEval, a regression-based debiasing method for the AlpacaEval benchmark. It fits a generalized linear model (GLM) to predict the auto-annotator's pairwise preferences from length difference and additional features, then obtains adjusted scores by evaluating the fitted GLM at zero length difference. The authors claim this controls for length bias, increases robustness to verbosity manipulations, and raises the Spearman correlation with LMSYS Chatbot Arena human preferences from 0.94 to 0.98.
Significance. If the GLM adjustment validly isolates length effects without residual bias or new artifacts, the contribution is practically significant: AlpacaEval is a widely adopted, low-cost evaluator, and length bias is a documented confounder in LLM auto-annotation. The reported correlation gain and robustness checks under controlled verbosity changes are concrete empirical strengths that could encourage adoption of similar regression adjustments in other auto-evaluators.
major comments (3)
- [Abstract and §3] Abstract and §3 (GLM procedure): The length-controlled score is defined by plugging length difference = 0 into the fitted GLM. This counterfactual interpretation requires that all confounders jointly affecting length and preference are included as covariates and that the chosen link function and interactions correctly specify the conditional expectation; the manuscript provides neither a sensitivity analysis for omitted variables nor an independent identification strategy (e.g., instrumental variables or randomized length perturbations).
- [Results] Results (correlation and robustness): The jump from 0.94 to 0.98 Spearman correlation with Arena is presented as evidence of improved validity, yet no confidence intervals, bootstrap standard errors, or formal test of the difference are reported. Without these, it is impossible to determine whether the improvement is statistically reliable or driven by the particular set of models evaluated.
- [§4] §4 (verbosity manipulation experiments): The robustness checks demonstrate that length-controlled scores are less sensitive to artificial lengthening of responses. However, the experiments do not test whether the GLM adjustment introduces bias under other manipulations (e.g., changes in response quality that correlate with length) or whether the same GLM coefficients generalize across different base models and annotators.
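The bootstrap test requested in the second major comment can be sketched as a paired bootstrap over the evaluated models; the leaderboard scores below are synthetic stand-ins for the real Arena and AlpacaEval data, and the noise scales are assumptions.

```python
import numpy as np

def spearman(a, b):
    # Spearman rho = Pearson correlation of the rank vectors.
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
m = 30                              # number of models on the leaderboard
arena = rng.normal(0, 1, m)         # stand-in for human Arena scores
raw_metric = arena + rng.normal(0, 0.45, m)  # raw AlpacaEval (noisier)
lc_metric = arena + rng.normal(0, 0.20, m)   # length-controlled (less noisy)

obs_diff = spearman(arena, lc_metric) - spearman(arena, raw_metric)

# Paired bootstrap over models: resample the same indices for both metrics
# so the two correlations are compared on identical model subsets.
diffs = []
for _ in range(2000):
    idx = rng.integers(0, m, m)
    diffs.append(spearman(arena[idx], lc_metric[idx])
                 - spearman(arena[idx], raw_metric[idx]))
diffs = np.array(diffs)
lo, hi = np.percentile(diffs, [2.5, 97.5])
p = (diffs <= 0).mean()             # one-sided: is the correlation gain > 0?
print(round(obs_diff, 3), round(lo, 3), round(hi, 3), round(p, 3))
```

With only ~30 leaderboard models, the bootstrap interval for a 0.94-to-0.98 gain can be wide, which is exactly why the referee asks for it to be reported.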
minor comments (2)
- [Abstract and Methods] The abstract and methods should explicitly list all covariates included in the GLM beyond length difference and state the link function and any interaction terms used.
- [§3] Notation for the length-controlled preference score should be introduced with an equation that clearly distinguishes the observed (biased) preference from the counterfactual prediction at zero length difference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to clarify assumptions, add statistical rigor, and discuss limitations.
point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (GLM procedure): The length-controlled score is defined by plugging length difference = 0 into the fitted GLM. This counterfactual interpretation requires that all confounders jointly affecting length and preference are included as covariates and that the chosen link function and interactions correctly specify the conditional expectation; the manuscript provides neither a sensitivity analysis for omitted variables nor an independent identification strategy (e.g., instrumental variables or randomized length perturbations).
Authors: We agree that the GLM adjustment relies on the assumption that the included covariates (length difference and other features) sufficiently capture confounding for the purpose of debiasing. The method is presented as a practical regression-based correction rather than a fully identified causal model. In the revised manuscript, we have expanded Section 3 with an explicit discussion of the modeling assumptions, the risk of omitted-variable bias, and the limitations of the counterfactual interpretation. We also added a sensitivity analysis by refitting the GLM with alternative covariate sets and link functions, showing that the length-controlled rankings remain largely stable. revision: partial
Referee: [Results] Results (correlation and robustness): The jump from 0.94 to 0.98 Spearman correlation with Arena is presented as evidence of improved validity, yet no confidence intervals, bootstrap standard errors, or formal test of the difference are reported. Without these, it is impossible to determine whether the improvement is statistically reliable or driven by the particular set of models evaluated.
Authors: We appreciate this point. The revised Results section now reports bootstrap confidence intervals for both Spearman correlations and includes a paired bootstrap test of the difference. The improvement from 0.94 to 0.98 is statistically significant (p < 0.01), and the intervals do not overlap, indicating that the gain is not an artifact of the specific model set. revision: yes
Referee: [§4] §4 (verbosity manipulation experiments): The robustness checks demonstrate that length-controlled scores are less sensitive to artificial lengthening of responses. However, the experiments do not test whether the GLM adjustment introduces bias under other manipulations (e.g., changes in response quality that correlate with length) or whether the same GLM coefficients generalize across different base models and annotators.
Authors: We acknowledge that the primary robustness experiments target verbosity. The revised Section 4 now includes additional discussion of potential bias under quality-correlated manipulations and reports GLM coefficients fitted separately on different annotators and model families to illustrate stability. Full generalization across all possible manipulations remains an open question for future work, which we now note explicitly. revision: partial
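The covariate-sensitivity check mentioned in the rebuttal can be sketched by refitting the GLM with and without an extra feature and comparing the resulting length-controlled rankings. All names, features, and coefficients here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_pairs = 10, 400

# Synthetic per-model quality, verbosity tendency, and a style feature.
quality = rng.normal(0, 1, n_models)
verbosity = rng.normal(0, 1, n_models)
style = rng.normal(0, 1, n_models)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, iters=400, lr=0.2):
    # Plain gradient ascent on the mean log-likelihood.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w += lr * X.T @ (y - sigmoid(X @ w)) / len(y)
    return w

# Pairwise comparisons of each model against a fixed baseline:
# columns are [intercept, length_diff, style_diff, one-hot model id].
rows, labels, model_id = [], [], []
for m in range(n_models):
    ld = verbosity[m] + rng.normal(0, 0.5, n_pairs)
    st = style[m] + rng.normal(0, 0.5, n_pairs)
    y = (rng.random(n_pairs) < sigmoid(quality[m] + 0.7 * ld + 0.3 * st))
    for i in range(n_pairs):
        rows.append([1.0, ld[i], st[i]]
                    + [1.0 if k == m else 0.0 for k in range(n_models)])
        labels.append(float(y[i]))
        model_id.append(m)
X, y, model_id = np.array(rows), np.array(labels), np.array(model_id)

def lc_scores(cols):
    # Fit on the selected covariates, then predict with length_diff zeroed.
    w = fit_logistic(X[:, cols], y)
    X0 = X[:, cols].copy()
    X0[:, 1] = 0.0
    p0 = sigmoid(X0 @ w)
    return np.array([p0[model_id == m].mean() for m in range(n_models)])

full = lc_scores(list(range(X.shape[1])))                 # with style feature
reduced = lc_scores([0, 1] + list(range(3, X.shape[1])))  # without style

# Rank stability across covariate sets (Spearman via rank vectors).
ra = np.argsort(np.argsort(full)).astype(float)
rb = np.argsort(np.argsort(reduced)).astype(float)
rho = np.corrcoef(ra, rb)[0, 1]
print(round(rho, 3))
```

A high rank correlation between the two covariate sets is the kind of stability evidence the revised Section 3 describes; a low one would flag sensitivity to the chosen features.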
Circularity Check
No circularity: the method is a standard regression adjustment for a counterfactual prediction.
full rationale
The paper's core derivation fits a GLM on observed (biased) auto-annotator preferences, using length difference and other features as predictors, then evaluates the fitted model at length difference = 0 to obtain the length-controlled score. This is an explicit counterfactual computation via regression adjustment: it does not reduce to self-definition, reuse of the target metric, or a fitted input renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the provided text, and the method is validated against external benchmarks such as the reported Spearman correlation with Chatbot Arena. The assumptions (no unmeasured confounding, correct functional form) are standard for the technique and do not create circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- GLM coefficients for length difference and other features
axioms (1)
- domain assumption: the relationship between length difference and preference can be captured by a generalized linear model
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi (tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous)
  Linked passage: "Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98."
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 31 Pith papers
- LiveBench: A Challenging, Contamination-Limited LLM Benchmark
  LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
- Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
  DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
- Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
  Abstract-CoT lets models reason with short discrete latent token sequences from a reserved vocabulary, using warm-up training and RL to match verbal CoT performance with up to 11.6x fewer tokens.
- Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory
  Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...
- Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
  SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
  Dimension-level evaluation reveals that 25-58% of LLM outputs with perfect holistic scores still show measurable intent deficits across languages and domains.
- Leveraging RAG for Training-Free Alignment of LLMs
  RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
- G-Zero: Self-Play for Open-Ended Generation from Zero Data
  G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.
- Bias and Uncertainty in LLM-as-a-Judge Estimation
  Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimate...
- Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
  SKOP uses key-orthogonal projections to steer LLM activations while preserving attention patterns on focus tokens, cutting utility degradation by 5-7x and retaining over 95% of standard steering efficacy.
- Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
  LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
- Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
  DEPO uses historical data to build a data-dependent uncertainty bonus for exploration in online RLHF, yielding an adaptive regret bound and stronger empirical performance than baselines.
- LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
  LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
  TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- MultEval: Supporting Collaborative Alignment for LLM-as-a-Judge Evaluation Criteria
  MultEval supports collaborative creation of LLM-as-a-judge criteria by surfacing disagreements via consensus-building methods, allowing iterative revisions with examples and history, and keeping transparent how human ...
- Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
  Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.
- Hybrid Policy Distillation for LLMs
  Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...
- S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
  S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
- Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
  BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
- Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner
  A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
  AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than laypeople in identical scenarios, with a measured decoupling gap of +0.38 and 13.1 pe...
- Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
  Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
- Re-Triggering Safeguards within LLMs for Jailbreak Detection
  Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
- A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
  The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.
- LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding
  LLM judges for human-AI coding co-creation show moderate performance (ROC-AUC 0.59) and low agreement, with co-creation success concentrating early in interactions.
- Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
  Style bias dominates LLM-as-a-Judge systems far more than position bias, with debiasing strategies providing model-dependent gains and public tools released for replication.
- MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
  MedConclusion is a 5.7M-instance benchmark dataset for generating biomedical conclusions from structured PubMed abstracts, with LLM evaluations showing conclusion writing differs from summarization and that judge choi...
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
- CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation