Thomas McCoy and Shunyu Yao and Dan Friedman and Matthew Hardy and Thomas L

· 2023 · arXiv 2309.13638

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

representative citing papers

Gradient-Based Program Synthesis with Neurally Interpreted Languages

cs.LG · 2026-04-20 · unverdicted · novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.

Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

Is She Even Relevant? When BERT Ignores Explicit Gender Cues

cs.CL · 2026-05-08 · conditional · novelty 7.0

A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

cs.SE · 2024-05-06 · unverdicted · novelty 7.0

SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.

Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.

When Verification Fails: How Compositionally Infeasible Claims Escape Rejection

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.

How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.

citing papers explorer

Showing 7 of 7 citing papers.

Gradient-Based Program Synthesis with Neurally Interpreted Languages cs.LG · 2026-04-20 · unverdicted · none · ref 61
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prior methods on combinatorial generalization tasks.
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition cs.CL · 2026-05-12 · unverdicted · none · ref 75
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
Is She Even Relevant? When BERT Ignores Explicit Gender Cues cs.CL · 2026-05-08 · conditional · none · ref 56
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering cs.SE · 2024-05-06 · unverdicted · none · ref 32
SWE-agent introduces a custom agent-computer interface that lets LM agents solve software engineering tasks, reaching 12.5% pass@1 on SWE-bench and 87.7% on HumanEvalFix, exceeding prior non-interactive approaches.
Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance cs.CL · 2026-04-19 · unverdicted · none · ref 88
The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.
When Verification Fails: How Compositionally Infeasible Claims Escape Rejection cs.CL · 2026-04-13 · unverdicted · none · ref 21
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem cs.AI · 2026-05-07 · unverdicted · none · ref 45
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.

Thomas McCoy and Shunyu Yao and Dan Friedman and Matthew Hardy and Thomas L

fields

years

verdicts

representative citing papers

citing papers explorer