Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance

Faezeh Ghaderi; Mahdi Naser-Moghadasi

arxiv: 2605.15436 · v1 · pith:VS7FSX64new · submitted 2026-05-14 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-19 14:54 UTC · model grok-4.3

keywords: neural activation patterns, large language models, attention entropy, sparsity patterns, cognitive tasks, encoder decoder architectures, mathematical reasoning

The pith

Mathematical reasoning produces the highest attention entropy across language model architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes neural activation patterns in six different large language model architectures performing twelve cognitive task categories. It uses measurements of final activation values, attention entropy, and sparsity patterns to compare how encoder and decoder models handle these tasks. Across 144 task-model combinations, mathematical reasoning stands out for producing the highest attention entropy in every architecture tested. Decoder models consistently show higher sparsity in their patterns than encoder models do. A sympathetic reader would care because these differences could help in choosing or designing models for specific kinds of reasoning work.

Core claim

Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors.

What carries the argument

The measurements of final activation values, attention entropy, and sparsity patterns, which reveal how different architectures distribute and focus their computations on cognitive tasks.
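To make two of the three measurements concrete, here is a minimal sketch of how attention entropy and sparsity are commonly computed. The paper's exact definitions are not given in this review, so the Shannon-entropy form, the function names, and the `threshold` value are all assumptions made for illustration.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores with a positive sum;
    they are renormalized here so they add up to 1.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sparsity(activations, threshold=1e-3):
    """Fraction of activations whose magnitude is below `threshold`.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    near_zero = sum(1 for a in activations if abs(a) < threshold)
    return near_zero / len(activations)

uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]    # attention focused on one token

print(attention_entropy(uniform))    # equals log(4), the maximum for 4 tokens
print(attention_entropy(peaked))     # much lower than the uniform case
print(sparsity([0.0, 0.5, 0.0002, -1.2]))
```

Under this reading, higher entropy means attention is spread over more tokens, and higher sparsity means more activations sit near zero.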

If this is right

Model selection for big data applications can be informed by matching architecture type to the entropy and sparsity needs of the target cognitive tasks.
Tasks like mathematical reasoning may benefit from architectures that support higher attention entropy.
Decoder models could be more suitable for applications where sparse activation patterns are advantageous for efficiency.
These patterns offer a new lens for understanding and optimizing the internal behaviors of language models beyond just output accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the patterns hold across more models, they could help predict computational costs for new tasks without running full evaluations.
Similar measurements might be applied to study how fine-tuning affects these activation characteristics.
The differences between encoders and decoders could inspire new hybrid model designs optimized for specific sparsity levels.

Load-bearing premise

The twelve cognitive task categories and the chosen measurement definitions (final activation values, attention entropy, sparsity) are assumed to capture meaningful and comparable computational differences without substantial confounding from task formulation or model-specific tokenization effects.

What would settle it

Finding that mathematical reasoning does not produce the highest attention entropy when the same measurements are applied to a different set of language models or task formulations.
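The falsification criterion above can be written as a simple check over a table of per-model, per-task entropy values. This is only a sketch: the function name, the table layout, and every number below are placeholders invented for illustration, not data from the paper.

```python
def task_tops_every_model(entropy_by_model, task):
    """Return True if `task` has strictly the highest entropy in every model."""
    return all(
        all(scores[task] > value for name, value in scores.items() if name != task)
        for scores in entropy_by_model.values()
    )

# Placeholder numbers for illustration only; they are NOT results from the paper.
toy = {
    "model_a": {"math": 2.1, "translation": 1.4, "dialogue": 1.7},
    "model_b": {"math": 1.9, "translation": 1.2, "dialogue": 1.5},
}
counterexample = {
    "model_a": {"math": 2.1, "translation": 1.4, "dialogue": 1.7},
    "model_c": {"math": 1.3, "translation": 1.2, "dialogue": 1.6},
}

print(task_tops_every_model(toy, "math"))             # True
print(task_tops_every_model(counterexample, "math"))  # False
```

A single model in which another task category ranks higher is enough to make the check fail, which is what makes the claim testable on new models or task formulations.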

Original abstract

This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper compares attention entropy and sparsity across six LLM architectures on twelve tasks, but the reported architecture differences look vulnerable to task-wording and tokenization confounds.

The main point is that this work runs activation, entropy, and sparsity measurements on 144 task-model pairs and reports that mathematical reasoning produces the highest attention entropy while decoder models show higher sparsity than encoders. That scale is larger than many prior single-model interpretability studies, and the measurements follow standard practices, so the paper supplies a broader comparative dataset than some earlier papers in the same vein.

Referee Report

2 major / 2 minor

Summary. The paper presents a comprehensive empirical analysis of neural activation patterns across six LLM architectures on twelve cognitive task categories. Through measurements of final activation values, attention entropy, and sparsity patterns across 144 task-model combinations, it claims that mathematical reasoning tasks consistently yield the highest attention entropy in all architectures while decoder models show significantly higher sparsity than encoder models, offering insights into architecture-specific computational behaviors with implications for model selection.

Significance. If the central findings prove robust after addressing potential confounds, the scale of the 144 combinations provides broad empirical coverage of activation patterns that could inform practical decisions in model architecture selection for cognitive and big-data tasks. The work is purely observational with no parameter-free derivations or machine-checked proofs, so its primary value lies in the descriptive dataset rather than theoretical advance.

major comments (2)

[Abstract] Abstract: The claim that 'mathematical reasoning consistently produces the highest attention entropy across all architectures' and that 'decoder models exhibit significantly higher sparsity patterns' is presented without reference to statistical tests, effect sizes, p-values, or corrections for multiple comparisons across the 144 combinations, leaving open whether the differences exceed what would be expected from unaccounted variance.
[Methods] Methods (task construction and measurement definitions): The twelve cognitive task categories and the definitions of final activation values, attention entropy, and sparsity are not shown to control for prompt length, tokenization differences across models, or task formulation biases; without such controls or baseline comparisons, the observed patterns could arise from systematic confounds rather than genuine architecture-specific processing differences, directly undermining the central claim.
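The first major comment asks for corrections for multiple comparisons. One standard option is the Bonferroni correction, sketched below. The function name and the example p-values are hypothetical; nothing here reflects tests actually run in the paper.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of tests.

    Returns the adjusted p-values (capped at 1.0) and, for each test,
    whether it remains significant at level `alpha`.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, rejected

# Hypothetical p-values, for illustration only.
adjusted, rejected = bonferroni([0.001, 0.02, 0.04])
print(adjusted)
print(rejected)  # [True, False, False]

# With 144 comparisons, each raw p-value must fall below alpha / 144.
print(0.05 / 144)
```

Bonferroni is conservative: two of the three example p-values are below 0.05 before correction and none of those two survive it, which is the kind of outcome the referee's comment is guarding against.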

minor comments (2)

[Abstract] The abstract and results sections would benefit from explicit statements of the exact six architectures studied and how the 144 combinations were formed (e.g., whether multiple runs or seeds were used).
[Results] Figure captions and axis labels for any entropy or sparsity plots should include units and the precise formula used for each metric to improve reproducibility.
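The first minor comment can be made concrete with the counts quoted in the abstract. Six architectures crossed with twelve task categories give 72 pairs, so the reported 144 combinations imply an extra factor of two whose source (for example two prompts per category or two runs per pair) is not stated in the material reviewed here. The variable names below are illustrative.

```python
n_architectures = 6      # from the abstract
n_task_categories = 12   # from the abstract
reported_combinations = 144  # from the abstract

pairs = n_architectures * n_task_categories
factor, remainder = divmod(reported_combinations, pairs)

print(pairs)              # 72
print(factor, remainder)  # 2 0
```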

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. Where appropriate, we indicate revisions that will be incorporated into the next version of the manuscript to address concerns about statistical reporting and potential methodological confounds.

Referee: [Abstract] Abstract: The claim that 'mathematical reasoning consistently produces the highest attention entropy across all architectures' and that 'decoder models exhibit significantly higher sparsity patterns' is presented without reference to statistical tests, effect sizes, p-values, or corrections for multiple comparisons across the 144 combinations, leaving open whether the differences exceed what would be expected from unaccounted variance.

Authors: We agree that the abstract would benefit from explicit statistical support for the reported patterns. Although the observed differences in attention entropy and sparsity were consistent across the full set of 144 task-model pairs, we will revise the abstract to reference the statistical tests (including repeated-measures ANOVA with Bonferroni corrections for multiple comparisons) and report effect sizes. These details will be added to the abstract and expanded in the Results section of the revised manuscript. revision: yes
Referee: [Methods] Methods (task construction and measurement definitions): The twelve cognitive task categories and the definitions of final activation values, attention entropy, and sparsity are not shown to control for prompt length, tokenization differences across models, or task formulation biases; without such controls or baseline comparisons, the observed patterns could arise from systematic confounds rather than genuine architecture-specific processing differences, directly undermining the central claim.

Authors: We acknowledge the value of explicitly demonstrating controls for prompt length, tokenization, and task formulation. Tasks were selected from established cognitive benchmarks with an aim toward comparable complexity, but we did not previously detail length normalization or baseline comparisons. In the revision we will add a new subsection to the Methods describing prompt standardization procedures (including token-length matching where feasible across models) and will include sensitivity analyses using length-controlled and neutral-prompt baselines to confirm that the architecture-specific patterns remain robust. revision: yes
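One simple form of length control is to divide attention entropy by its maximum possible value for the sequence length, so that prompts of different token counts become comparable. The sketch below assumes Shannon entropy and a hypothetical function name; it is not the standardization procedure the authors describe, which is not specified here.

```python
import math

def normalized_entropy(weights):
    """Attention entropy divided by log(n), giving a value in [0, 1].

    `weights` are non-negative attention scores over n tokens with a
    positive sum. For n < 2 the entropy is zero, so 0.0 is returned.
    """
    n = len(weights)
    if n < 2:
        return 0.0
    total = sum(weights)
    entropy = -sum((w / total) * math.log(w / total) for w in weights if w > 0.0)
    return entropy / math.log(n)

short_prompt = [1.0] * 4    # uniform attention over 4 tokens
long_prompt = [1.0] * 16    # uniform attention over 16 tokens

# Raw entropy would be log(4) vs log(16); after normalization both are about 1.0.
print(normalized_entropy(short_prompt))
print(normalized_entropy(long_prompt))
```

Without such normalization, a task category with longer prompts could show higher raw entropy purely because more tokens are available to attend to.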

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurements

The paper performs direct empirical measurements of final activation values, attention entropy, and sparsity across 144 task-model combinations on six LLM architectures. No derivation chain, equations, fitted parameters, or self-citations are invoked to produce the central claims; the reported patterns (e.g., highest entropy in mathematical reasoning, higher sparsity in decoders) are presented as observed quantities rather than reductions of inputs by construction. The work is self-contained against external benchmarks and contains no load-bearing steps that collapse to self-definition or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central observations rest on the assumption that the chosen metrics and task groupings reflect intrinsic architectural differences rather than artifacts of implementation or data selection.

axioms (2)

Domain assumption: The twelve cognitive task categories are well-defined and representative of distinct cognitive processes.
Invoked when grouping tasks and attributing differences to cognitive type rather than surface features.
Domain assumption: Attention entropy and sparsity are meaningful proxies for computational style across architectures.
Used to interpret the measured quantities as evidence of fundamental processing differences.

pith-pipeline@v0.9.0 · 5640 in / 1314 out tokens · 87705 ms · 2026-05-19T14:54:06.072337+00:00 · methodology

