pith. machine review for the scientific record.

arxiv: 2603.27343 · v2 · submitted 2026-03-28 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM evaluation · working memory · cumulative state tracking · depth parameterization · arithmetic accumulation · model probing

The pith

LLMs exhibit performance degradation in cumulative state tracking as the depth of sequential operations increases within a single query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WMF-AM to test how language models handle maintaining and updating intermediate results across multiple steps without external aids. It parameterizes the depth K to measure increasing cumulative load on working memory. Experiments across 28 models from 12 families show that difficulty rises with larger K in arithmetic tasks. A parallel set of non-arithmetic tasks with permissions and schedules demonstrates that the effect is not limited to numbers. Ablations confirm that the degradation stems from cumulative load rather than specific skills like arithmetic or entity tracking.

Core claim

WMF-AM isolates within-pass cumulative load by parameterizing depth K in tasks that require maintaining running states across sequential operations. Performance on arithmetic accumulation and matched non-arithmetic variants declines as K grows, with construct-isolation ablations verifying that this reflects working memory limits on cumulative tracking rather than other factors.

What carries the argument

Depth-parameterized cumulative state tracking, which varies the number of sequential operations K to isolate the load of maintaining intermediate results in a single pass.
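The probe construction is easy to picture. The exact prompt templates are not reproduced in this summary, so the wording and the `make_probe` helper below are illustrative assumptions, but the structure follows the design described in the abstract: an initial state, K sequential updates, and an instruction to report only the final answer with no scratchpad.

```python
import random

def make_probe(k, seed=0, lo=1, hi=9):
    """Sketch of a depth-K arithmetic accumulation probe.

    Returns (prompt, answer): the prompt states an initial value and K
    sequential add/subtract updates; the model must report only the
    final total, tracking the running state internally.
    """
    rng = random.Random(seed)
    state = rng.randint(lo, hi)
    lines = [f"Start with {state}."]
    for _ in range(k):
        delta = rng.randint(lo, hi)
        if rng.random() < 0.5:
            lines.append(f"Add {delta}.")
            state += delta
        else:
            lines.append(f"Subtract {delta}.")
            state -= delta
    lines.append("Report only the final number.")
    return "\n".join(lines), state
```

Because K only changes the number of update lines, the same generator yields the whole K-sweep; the tradeoff (longer prompts at larger K) is exactly the length confound the referee report raises below.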

If this is right

  • Models across sizes and families show similar patterns of degradation with increasing K.
  • The probe generalizes beyond arithmetic to domains like permissions and inventories.
  • Three ablations isolate cumulative load as the primary driver of difficulty.
  • The method provides a recalibratable diagnostic as models advance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectural changes may be needed to improve internal state maintenance beyond further scaling.
  • The probe could compare training methods for better performance on sequential reasoning.
  • Similar depth-dependent effects may appear in other untested sequential domains.

Load-bearing premise

The arithmetic and non-arithmetic variants, together with the ablations, isolate cumulative state tracking without leaving domain-specific confounds that could drive the results.

What would settle it

Observing no performance degradation with increasing K in the main tasks or in the ablated versions would indicate that cumulative load is not the isolated driver.
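Given per-depth accuracies, that condition is checkable mechanically. A minimal sketch of extracting a collapse threshold like the Kcrit of Figure 3 (the `k_crit` helper and its 0.5 cutoff are illustrative assumptions, not the paper's definition):

```python
def k_crit(acc_by_k, threshold=0.5):
    """Smallest depth K at which accuracy first falls below threshold.

    acc_by_k maps depth K -> accuracy in [0, 1]. Returns None if
    accuracy never collapses within the measured range, i.e. the
    "no degradation" outcome that would undercut the cumulative-load
    interpretation.
    """
    for k in sorted(acc_by_k):
        if acc_by_k[k] < threshold:
            return k
    return None
```

A flat curve (`k_crit(...) is None`) in both the main tasks and the ablated versions would be the settling observation described above.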

Figures

Figures reproduced from arXiv: 2603.27343 by Deng Li, Dengzhe Hou, Fangzhou Lin, Kazunori D Yamada, Lingyu Jiang, Zirui Li.

Figure 1: WMF-AM framework. (a) Cognitive analogy: a model maintains and updates a running state across K sequential operations and reports only the final answer, without a scratchpad. (b) Probe design: the input prompt specifies an initial state and K cumulative updates; the LLM must track hidden internal state as K increases. (c) Radar profiles for 10 representative models at K=3/5/7 with agent battery and yoked con…

Figure 2: WMF-AM vs. Agent Battery Score (N=28, 12 families). WMF-AM predicts downstream agent performance (τ=0.595, p<0.001). Blue circles = standard models; red squares = LRM (reasoning) models; orange diamonds = LRM-distill models; black edge = API models. All 28 models are labeled. Note: DeepSeek-R1 (671B, "R1-full") achieves perfect WMF-AM (1.000) but low agent score (0.50), discussed in Section 5. Claude and o…

Figure 3: K-sweep analysis (N=28). (a) K-degradation curves: accuracy vs. depth K for all 28 models (gray), with five representative models highlighted (Claude-Sonnet-4, o3-mini, DeepSeek-V3, GPT-4o, DeepSeek-R1). Standard models show sigmoid-cliff collapse; DeepSeek-R1 shows non-monotonic recovery. (b) Kcrit vs. Agent Battery Score (τ=0.171, p=0.23, n.s.): collapse threshold does not predict agent performance. Fade…

Figure 4: Model evaluation profiles across WMF-AM dimensions (N=10 representative models). Left: construct controls (K=1, non-arithmetic, CoT, supported agent, K=50). Right: depth resilience (K=10 through K=50). Solid = API; dashed = open-weight. The 10 models span the full capability range from Qwen2.5:3B to Claude-Sonnet-4.

Figure 5
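Figure 2's headline statistic is a Kendall rank correlation (τ=0.595) between WMF-AM and agent scores. For readers who want to reproduce that kind of number, a self-contained tau-a implementation is below; the paper may well use the tie-corrected tau-b, so treat this simpler variant as an assumption.

```python
def kendall_tau(x, y):
    """Kendall rank correlation, tau-a variant: (concordant minus
    discordant pairs) over all pairs; tied pairs count as neither."""
    n = len(x)
    assert n == len(y) and n > 1
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With only 28 models the O(n²) pair loop is negligible; for tie-heavy data, a tau-b implementation with tie corrections in both variables would be the safer choice.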
Original abstract

Existing large language model (LLM) evaluations use fixed-difficulty benchmarks that cannot adapt as models improve, and rarely isolate specific cognitive processes. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a probe of cumulative state tracking: the ability to maintain and update intermediate results across K sequential operations within a single query, without a scratchpad. Unlike multi-step agent benchmarks that stress task orchestration, WMF-AM isolates within-pass cumulative load by parameterizing depth K. The core probe uses arithmetic accumulation on 28 models from 12 families (0.5B to frontier); a matched non-arithmetic extension (permissions, schedules, inventories) confirms the design generalizes beyond arithmetic. Three construct-isolation ablations confirm that cumulative load, not arithmetic skill or entity tracking, drives difficulty. We release WMF-AM as a lightweight, recalibratable diagnostic for characterizing where models degrade under cumulative load. Code and data can be accessed at https://github.com/dengzhe-hou/WMF-AM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces WMF-AM, a benchmark that parameterizes depth K to probe LLMs' cumulative state tracking ability in sequential operations within a single query without scratchpad. It evaluates 28 models from 12 families (0.5B to frontier) on arithmetic accumulation tasks, includes a matched non-arithmetic extension (permissions, schedules, inventories), and reports three construct-isolation ablations intended to show that performance degradation is driven by cumulative load rather than arithmetic skill or entity tracking. The benchmark is positioned as a lightweight, recalibratable diagnostic and is released with code and data.

Significance. If the isolation of cumulative state tracking holds after addressing confounds, WMF-AM would supply a scalable, depth-parameterized diagnostic that complements fixed-difficulty benchmarks and multi-step agent evaluations. Its coverage across model scales and families, plus the non-arithmetic generalization, would make it a practical tool for tracking where models degrade under increasing within-pass load, with potential to inform architecture and training choices for better state maintenance.

major comments (2)
  1. [Abstract (ablations description)] The claim that the three construct-isolation ablations confirm cumulative load (rather than arithmetic skill or entity tracking) as the driver of difficulty is load-bearing for the central contribution, yet the design inherently increases prompt length and token count with K. No explicit controls for general long-context degradation or sequence-position effects (e.g., length-matched non-accumulation baselines or fixed-position variants) are described, leaving open the possibility that observed degradation reflects context-length sensitivity instead of state-tracking load specifically.
  2. [Abstract] The abstract states that tests were run on 28 models with three ablations and a non-arithmetic extension, but supplies no quantitative results, error bars, per-model metrics, or exclusion criteria. Without these data (presumably in the results section or tables), it is impossible to assess effect sizes, statistical reliability, or whether the ablations actually succeeded in isolating the target construct.
minor comments (1)
  1. [Abstract] The abstract could more precisely define 'within-pass cumulative load' and explicitly contrast WMF-AM with chain-of-thought or scratchpad methods to avoid reader confusion about the no-scratchpad constraint.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript introducing WMF-AM. We address each major comment point by point below. We have revised the manuscript to incorporate additional controls and quantitative details where the comments identify areas for strengthening the presentation of results and isolation of the target construct.

Point-by-point responses
  1. Referee: [Abstract (ablations description)] The claim that the three construct-isolation ablations confirm cumulative load (rather than arithmetic skill or entity tracking) as the driver of difficulty is load-bearing for the central contribution, yet the design inherently increases prompt length and token count with K. No explicit controls for general long-context degradation or sequence-position effects (e.g., length-matched non-accumulation baselines or fixed-position variants) are described, leaving open the possibility that observed degradation reflects context-length sensitivity instead of state-tracking load specifically.

    Authors: We acknowledge that increasing K necessarily lengthens the prompt, which could introduce a potential confound with general long-context sensitivity. The original ablations were designed to hold arithmetic operations and entity counts constant while varying only the cumulative update requirement, and the matched non-arithmetic extension (permissions, schedules, inventories) provides a control for arithmetic skill. However, to more directly address length and position effects, we will add length-matched non-accumulation baselines (e.g., repeated non-cumulative operations of equivalent token length) and fixed-position variants in the revised manuscript. These additions will allow explicit comparison of degradation under matched lengths but differing state-tracking demands. revision: yes

  2. Referee: [Abstract] The abstract states that tests were run on 28 models with three ablations and a non-arithmetic extension, but supplies no quantitative results, error bars, per-model metrics, or exclusion criteria. Without these data (presumably in the results section or tables), it is impossible to assess effect sizes, statistical reliability, or whether the ablations actually succeeded in isolating the target construct.

    Authors: The full manuscript reports all requested details in the Experiments, Results, and Ablations sections: per-model accuracy tables across the 28 models, degradation curves with standard error bars from multiple runs, specific ablation outcomes (e.g., performance when arithmetic skill is controlled), and exclusion criteria for model sizes and task variants. To make these findings immediately visible, we will revise the abstract to include key quantitative highlights such as average accuracy drop per increment in K and the proportion of variance explained by the cumulative-load ablations. revision: yes
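The length-matched non-accumulation baseline promised in response 1 could take many forms. One illustrative sketch (the wording and the `make_control` helper are hypothetical, not from the paper) pairs the same number of update lines with a question that requires no running state, so prompt length grows with K while tracking load stays constant:

```python
import random

def make_control(k, seed=0, lo=1, hi=9):
    """Length-matched non-accumulation control (hypothetical design).

    Emits the same initial-state line and K add/subtract lines as a
    depth-K probe, but asks only for the number in the last update,
    so no intermediate state needs to be maintained.
    """
    rng = random.Random(seed)
    lines = [f"Start with {rng.randint(lo, hi)}."]
    for _ in range(k):
        op = "Add" if rng.random() < 0.5 else "Subtract"
        lines.append(f"{op} {rng.randint(lo, hi)}.")
    # The answer is simply the operand of the final update line.
    answer = int(lines[-1].split()[1].rstrip("."))
    lines.append("Report only the number in the last instruction.")
    return "\n".join(lines), answer
```

Comparing degradation on matched (probe, control) pairs at each K would separate state-tracking load from raw context-length sensitivity, which is the contrast the referee asks for.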

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction

full rationale

The paper introduces WMF-AM as an empirical diagnostic probe that parameterizes task depth K and measures LLM performance degradation on cumulative state tracking via direct model evaluations on arithmetic and non-arithmetic variants across 28 models. No equations, derivations, fitted parameters, or predictions appear in the work; results derive from raw output comparisons and three ablations rather than any self-referential reduction or self-citation chain. The central claims rest on observable performance patterns under controlled prompt variations, with no load-bearing step that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical evaluation method rather than new theoretical entities or fitted constants; it rests on standard assumptions about LLM internal state and task design.

axioms (1)
  • domain assumption Cumulative state tracking can be isolated from arithmetic skill and entity tracking through matched task variants and targeted ablations.
    This assumption underpins the claim that difficulty is driven by cumulative load.

pith-pipeline@v0.9.0 · 5490 in / 1341 out tokens · 57003 ms · 2026-05-14T21:57:35.584898+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 13 internal anchors
