Thomas McCoy, Ellie Pavlick, and Tal Linzen
19 Pith papers cite this work, alongside 245 external citations.
citing papers explorer
- Charge as a Construct-Validity Factor in Chinese Legal Case Retrieval: A Cross-Benchmark Audit
  Matching primary charge explains 99.2% of the NDCG@10 gap between BM25 and best systems on LeCaRDv2 because benchmark relevance is defined by charge-encoding elements.
- LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
  LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
- On the Emergence of Syntax by Means of Local Interaction
  A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.
- The UNDO Flip-Flop: A Controlled Probe for Reversible Semantic State Management in State Space Model
  Mamba-2 models fail to learn reversible state retrieval in the UNDO Flip-Flop task, defaulting to a toggle heuristic and achieving only 41% accuracy under adversarial conditions.
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
  LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
- Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs
  Grain calibration decomposes theoretical constructs into clause-level components, tests each with extractive evidence, and combines results through explicit theory-derived rules to validate LLM coding beyond agreement with human annotators.
- The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking
  Introduces claim-conditioned re-scoring (SIFT) and warranted supports proportion (WSP) metric, reporting accuracy recovery up to 27.6 points and WSP calibration at AUC 0.92 on FEVER, SciFact and other benchmarks.
- Caliper: Probing Lexical Anchors versus Causal Structure in LLMs
  Lexical anonymization via Caliper causes consistent accuracy drops of 7-30 percentage points across LLMs on causal benchmarks, indicating reliance on lexical anchors rather than structural causal reasoning.
- What Are We Actually Decoding? Source Attribution for Non-Invasive Brain-to-Language Retrieval
  An auditing framework for brain-to-audio retrieval isolates structural, stimulus-locked, and contextual performance sources via controls and a new Group Context Bias intervention, showing reduced performance under strict settings and measurable contextual gains.
- Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity
  LLMs outperform fine-tuned RoBERTa on low-prevalence inferentially complex circumstances in NVDRS data, with a hybrid prompt-selection framework based on a new Complexity Score generalizing across multiple frontier models.
- LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
  LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
- Debiasing Reward Models via Causally Motivated Inference-Time Intervention
  Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
- Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models
  Language models employ a highly localized shared mechanism for filler-gap dependencies but no unified mechanism for NPI licensing, and activation patching generalizes better than supervised alignment search.
- Spatial Reasoning via Modality Switching Between Language and Symbolic Representation
  Introduces a modality-switching mechanism for LLMs on spatial reasoning tasks using a trustworthiness and complexity based metric, showing up to 42% performance improvement.
- Rigorous Interpretation Is a Form of Evaluation
  Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
- Do Activation Verbalization Methods Convey Privileged Information?
  Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
- Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
  Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
- Measuring AI Reasoning: A Guide for Researchers
  Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
- Lessons from the Trenches on Reproducible Evaluation of Language Models