NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
hub
Thomas McCoy and Shunyu Yao and Dan Friedman and Matthew Hardy and Thomas L
16 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
dataset 1polarities
use dataset 1representative citing papers
Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
Parameter-based knowledge editing in LLMs induces reasoning collapse via dimensional collapse and is consistently outperformed by a retrieval baseline across varied edit counts, knowledge complexity, and evaluation metrics.
AI models misalign with humans on concept boundaries when probed with implausible category members, such as classifying words as vehicles or vegetables as fruit.
The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.
LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.
SuperVoxelGPT creates shape-adaptive, deterministically ordered supervoxel tokens via saliency-guided CVT, cutting sequence length to 12.8% of uniform voxels while claiming SOTA quality and 10x speedup on Trellis-500K.
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.
citing papers explorer
-
Gradient-Based Program Synthesis with Neurally Interpreted Languages
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
-
Brain-LLM Alignment Tracks Training Data, Not Typology
Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
-
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
-
Is She Even Relevant? When BERT Ignores Explicit Gender Cues
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
-
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.
-
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
-
Consistency Training while Mitigating Obfuscation via Rate Matching
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
-
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence
Parameter-based knowledge editing in LLMs induces reasoning collapse via dimensional collapse and is consistently outperformed by a retrieval baseline across varied edit counts, knowledge complexity, and evaluation metrics.
-
Investigating Concept Alignment Using Implausible Category Members
AI models misalign with humans on concept boundaries when probed with implausible category members, such as classifying words as vehicles or vegetables as fruit.
-
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.
-
When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.
-
How Do Language Models Compose Functions?
LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.
-
SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation
SuperVoxelGPT creates shape-adaptive, deterministically ordered supervoxel tokens via saliency-guided CVT, cutting sequence length to 12.8% of uniform voxels while claiming SOTA quality and 10x speedup on Trellis-500K.
-
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
-
Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
-
Losing our Tail, Again: (Un)Natural Selection & Multilingual LLMs
Position paper warns that model collapse in self-consuming multilingual LLM training loops risks flattening linguistic diversity and cultural nuance.