From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

Derek F. Wong; Henghua Shen; Jiaxu Zuo; Kaixin Lan; Lidia S. Chao; Mu You; Tao Fang; Yujia Huo

arxiv: 2606.20152 · v1 · pith:LJAQKBO4new · submitted 2026-06-18 · 💻 cs.CL · cs.AI

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

Pith reviewed 2026-06-26 17:15 UTC · model grok-4.3

keywords: automated essay scoring, linear probing, LLM interpretability, hidden representations, essay quality neurons, cross-prompt transfer, layer-wise analysis

The pith

LLMs encode essay quality as linearly readable signals that build across layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that information about essay quality appears in the internal hidden states of large language models in a form that linear probes can extract directly. These signals develop gradually as processing moves through successive layers, hold up under different prompting styles, and transfer to some extent between essay topics even when the scoring rubrics differ. Nonlinear probes add little extra accuracy, which indicates that most of the quality signal is already present in linear form. The work also locates specific neurons whose activity tracks essay scores and shows that longer essays draw more on deeper layers for this information.

Core claim

Essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. Nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. Individual essay scoring neurons can be identified whose activations correlate with scores and respond to targeted intervention, and their layer-wise distribution shifts with essay length.
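
As a rough illustration of how neurons whose activations correlate with scores might be ranked, the sketch below computes a Spearman rank correlation per neuron and sorts by absolute value. The neuron names (`n0`, `n1`, `n2`), the activation values, and the choice to correlate raw activations with scores are all assumptions made for illustration; the selection criterion the paper actually uses is not spelled out in this summary.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy data (hypothetical): six essays, three neurons.
scores = [1, 2, 3, 4, 5, 6]
activations = {
    "n0": [0.1, 0.4, 0.5, 0.9, 1.3, 2.0],  # rises with score
    "n1": [0.5, 0.1, 0.9, 0.2, 0.8, 0.3],  # mostly unrelated
    "n2": [1.9, 1.2, 1.0, 0.7, 0.3, 0.2],  # falls with score
}
rho = {name: spearman(acts, scores) for name, acts in activations.items()}
# Rank by absolute correlation so negatively tuned neurons are kept too.
ranking = sorted(rho, key=lambda name: abs(rho[name]), reverse=True)
print(ranking, rho)
```

Ranking by absolute value is a design choice: a neuron that fires less for better essays carries as much score information as one that fires more.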

What carries the argument

Linear probes applied to hidden-state activations across LLM layers to decode essay quality scores.
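
A minimal sketch of layer-wise linear probing, under stated assumptions: the hidden states here are synthetic stand-ins (a score signal on one dimension plus Gaussian noise, with the signal strength increasing by layer by construction), and the probe is a closed-form ridge regression scored with R² on a held-out split. None of these choices (`synthetic_layer`, the ridge penalty, the 100/100 split) come from the paper, which reports QWK on real model activations.

```python
import random


def solve_ridge(X, y, lam=1e-2):
    """Closed-form ridge: solve (X^T X + lam*I) w = X^T y by Gauss-Jordan elimination."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(d):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                b[r] -= f * b[c]
    return [b[i] / A[i][i] for i in range(d)]


def r_squared(y_true, y_pred):
    """Coefficient of determination on a held-out set."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def synthetic_layer(scores, strength, rng, dim=4, noise=0.5):
    """Hypothetical pooled hidden states: score signal on dim 0, noise elsewhere, plus a bias column."""
    return [[strength * s * (1.0 if k == 0 else 0.0) + rng.gauss(0.0, noise)
             for k in range(dim)] + [1.0]
            for s in scores]


rng = random.Random(0)
scores = [float(rng.randint(1, 6)) for _ in range(200)]
train, test = slice(0, 100), slice(100, 200)

layer_r2 = []
for strength in (0.0, 0.33, 0.66, 1.0):  # signal grows with depth by construction
    H = synthetic_layer(scores, strength, rng)
    w = solve_ridge(H[train], scores[train])
    preds = [sum(wi * hi for wi, hi in zip(w, h)) for h in H[test]]
    layer_r2.append(r_squared(scores[test], preds))

print([round(v, 3) for v in layer_r2])
```

With real models, `synthetic_layer` would be replaced by activations extracted at each layer; the per-layer curve of probe quality is the quantity the paper's layer-wise analysis tracks.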

If this is right

Essay quality can be read from internal activations without requiring the model to generate a score in text.
Quality signals remain detectable even when the input prompt or rubric changes.
Targeted changes to identified scoring neurons can alter the model's effective scoring behavior.
Longer essays depend more on deeper-layer representations for quality information.
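
The intervention idea in the third point can be sketched on a toy network: fix one hidden unit to a chosen value and watch how the output score moves, then repeat for a control unit. Everything below is hypothetical (the two-layer network, the weights `W_IN` and `W_OUT`, the input features); it only illustrates the clamping mechanic, not the paper's experiments on Llama-3.1-8B-Instruct.

```python
def forward(x, w_in, w_out, clamp=None):
    """Tiny MLP: hidden = relu(w_in @ x), score = w_out . hidden.

    `clamp` is an optional (neuron_index, value) pair that overrides one
    hidden activation after the nonlinearity.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if clamp is not None:
        index, value = clamp
        hidden[index] = value
    return sum(w * h for w, h in zip(w_out, hidden))


# Hypothetical weights: neuron 0 has a large readout weight (a "scoring neuron"),
# neurons 1 and 2 barely influence the output.
W_IN = [[1.0, 0.5], [0.2, -0.3], [-0.4, 0.8]]
W_OUT = [2.0, 0.01, 0.01]
essay_features = [0.5, 1.0]


def sweep(index, values=(0.0, 1.0, 2.0)):
    """Predicted score as one hidden unit is fixed to each value in turn."""
    return [forward(essay_features, W_IN, W_OUT, clamp=(index, v)) for v in values]


baseline = forward(essay_features, W_IN, W_OUT)
scoring_sweep = sweep(0)   # clamp the high-weight neuron
control_sweep = sweep(1)   # clamp a low-weight control neuron
print(baseline, scoring_sweep, control_sweep)
```

Comparing the spread of `scoring_sweep` against `control_sweep` mirrors the comparison between a scoring neuron and random neurons from the same layer.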

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linear structure might allow similar decoding for other subjective judgments such as code quality or argument strength.
Intervening on the identified neurons could be used to steer automated scoring systems toward different criteria.
The progressive layer emergence suggests that training methods emphasizing deeper-layer alignment could improve AES consistency.

Load-bearing premise

The essay datasets and their scoring rubrics produce a quality signal that reflects general essay quality rather than artifacts specific to those rubrics or prompt distributions.

What would settle it

The claim would be undermined if linear probes trained on one essay dataset failed to predict scores above chance level when tested on a new dataset with entirely different rubrics and topics, or if nonlinear probes showed large, consistent gains over linear ones.
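
One possible way to make "above chance level" concrete is a permutation test: compare the probe's correlation with held-out scores against the correlations obtained after shuffling those scores. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the data, the function name `permutation_p_value`, and the use of Pearson correlation instead of QWK are illustrative.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def permutation_p_value(preds, targets, n_perm=2000, seed=0):
    """Return (observed r, one-sided p-value) under shuffled targets."""
    rng = random.Random(seed)
    observed = pearson(preds, targets)
    shuffled = list(targets)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(preds, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical held-out scores and probe predictions on a new dataset.
held_out_scores = [1, 2, 3, 4, 5, 6, 7, 8]
probe_predictions = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
r, p = permutation_p_value(probe_predictions, held_out_scores)
print(round(r, 4), p)
```

A large p-value on a genuinely new dataset would count against the transfer claim; a small one would only show the probe beats chance, not that it tracks quality and not surface features.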

Figures

Figures reproduced from arXiv: 2606.20152 by Derek F. Wong, Henghua Shen, Jiaxu Zuo, Kaixin Lan, Lidia S. Chao, Mu You, Tao Fang, Yujia Huo.

**Figure 2.** Average QWK scores of linear probes on ASAP++ under cross-prompt settings. Each subplot corresponds …

**Figure 3.** QWK scores of linear probes trained on overall essay scores in ASAP++ using Llama-3.1-8B-Instruct.

**Figure 4.** QWK scores across PCA dimensionality settings for each model. Dotted lines denote probes trained on …

**Figure 5.** Essay scoring neurons in each model. Spearman correlations between neuron-weight projections and true …

**Figure 6.** Distribution of the top 50 essay scoring neurons for the overall score of Prompt 8 in the ASAP++ dataset …

**Figure 7.** When the essay scoring neuron (L2.N392.Win) is fixed to specific values, the prediction results for five different essays from prompt 1 of the ASAP dataset are compared with the prediction results from three random neurons in the same layer (L2.[0-2]) of the Llama-3.1-8B-Instruct model. We also calculate the weighted sum of top 10 tokens when the essay scoring neuron is fixed to different specific values …

**Figure 8.** Average QWK scores of linear probes trained on CSEE across all essay prompts. Each subplot corresponds …

**Figure 9.** Average QWK scores of linear probes trained on ENEM across all essay prompts. Each subplot …

**Figure 10.** Spearman correlation between predictions of probes trained on activations projected onto the top …

**Figure 11.** Distribution of the top 50 key neurons in the ASAP++ dataset, shown for different traits, essay prompts …

**Figure 12.** Prompt template T used for neuron intervention experiments. Curly brackets {} denote placeholders to be completed …

**Figure 13.** QWK scores of linear probes trained on the overall score of the ASAP++ dataset on each essay prompt …
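
Most of the captions above report QWK (quadratic weighted kappa), the standard agreement metric in automated essay scoring. For readers unfamiliar with it, here is a small self-contained sketch of the usual definition: one minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights growing quadratically in the rating distance. The function name and the toy ratings are illustrative; the paper's own evaluation code is not shown on this page.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """QWK between two integer rating lists on the scale [min_rating, max_rating].

    Assumes at least two rating levels. Returns NaN when the chance-level
    denominator is zero (for example, every rating falls in one category).
    """
    n = max_rating - min_rating + 1
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / total  # chance agreement
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return float("nan")
    return 1.0 - numerator / denominator


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_order = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
print(perfect, reversed_order)
```

Because the penalty is quadratic, predicting a 1 for an essay scored 6 costs far more than predicting a 5, which is why QWK suits ordinal essay scores better than plain accuracy.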

Original abstract

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual "essay scoring neurons" whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Probing finds mostly linear essay quality signals emerging across LLM layers with length-dependent neuron shifts, but the signal could still track rubric or length artifacts rather than general quality.

The core finding is that linear probes recover essay scores from hidden states in eight LLMs, with the information building up layer by layer, staying stable under prompt changes, and nonlinear probes adding only small gains. The authors also flag specific neurons whose activations track scores and note that these neurons sit deeper in the model when essays are longer.

The paper does the straightforward extension work well: consistent probing across English and Portuguese datasets, cross-prompt checks, and the neuron-level intervention plus layer-shift observation are concrete enough to stand out from typical AES benchmark papers. Reporting that most of the signal is already linear is useful for interpretability folks.

The soft spot is exactly the one the stress-test flags. No ablation replaces human scores with length- or vocabulary-matched proxies while keeping the same representations, so it remains possible the linear decodability, the progressive emergence, and the neuron behavior are driven by surface correlates that happen to align with the rubrics in ASAP++, CSEE, and ENEM. Partial cross-prompt transfer is claimed, but without effect sizes or details on how different the rubrics actually are, the generalization claim stays provisional. The abstract summarizes consistent results, yet the lack of those controls limits how far the mechanistic story can be pushed.

This is for readers working on LLM interpretability in educational applications or on AES internals. It is not a foundational result but supplies usable observations on representation emergence.

I would send it to peer review. The empirical framing is clear and the new observations are specific enough that referees can usefully press on the missing ablations and statistical detail.

Referee Report

1 major / 1 minor

Summary. The manuscript analyzes hidden representations from eight LLMs on three essay datasets (ASAP++, CSEE, ENEM) via linear probing, cross-prompt generalization tests, dimensionality reduction, and neuron-level interventions. It claims that essay quality information is encoded in a linearly accessible form, emerges progressively across layers, remains robust across prompting strategies, shows partial transfer across prompts with differing rubrics, that nonlinear probes yield only marginal gains, and that specific 'essay scoring neurons' can be identified whose activations correlate with scores and shift in layer distribution with essay length.

Significance. If the central claims hold after addressing controls for surface features, the work would advance interpretability of LLM-based automated essay scoring by providing observational evidence for structured, linearly decodable quality representations that generalize partially across datasets. The multi-model, multi-dataset design and identification of progressive emergence and neuron interventions are strengths that could inform future mechanistic analyses in AES.

major comments (1)

[cross-prompt generalization and neuron intervention results] The claim that representations encode essay quality in a linearly accessible form (rather than rubric-specific or surface correlates such as length or vocabulary) is load-bearing for the abstract's conclusions on emergence, robustness, and transfer. The reported layer-wise shifts with essay length and partial cross-prompt transfer are consistent with possible confounds, yet the manuscript does not appear to include an ablation that replaces human scores with matched-length or matched-vocabulary proxies while retaining the same representations and probes. Without this control, the linear decodability and neuron findings could track surface statistics rather than the intended quality construct.

minor comments (1)

[Abstract] The abstract states consistent evidence across analyses but does not list the specific eight LLMs or their parameter scales; adding this detail would clarify the scope of generality.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the suggestion to strengthen controls against surface-feature confounds. We address the single major comment below and will incorporate the requested ablation.

Referee: The claim that representations encode essay quality in a linearly accessible form (rather than rubric-specific or surface correlates such as length or vocabulary) is load-bearing for the abstract's conclusions on emergence, robustness, and transfer. The reported layer-wise shifts with essay length and partial cross-prompt transfer are consistent with possible confounds, yet the manuscript does not appear to include an ablation that replaces human scores with matched-length or matched-vocabulary proxies while retaining the same representations and probes. Without this control, the linear decodability and neuron findings could track surface statistics rather than the intended quality construct.

Authors: We agree that an explicit ablation replacing human scores with length- or vocabulary-matched proxies is a valuable control that is currently missing. Our existing cross-prompt and cross-dataset results (including transfer to the Portuguese ENEM corpus) already indicate that performance is not fully explained by prompt-specific surface statistics, and the neuron-intervention results show causal effects on predicted scores. Nevertheless, these do not directly isolate quality from length or lexical richness. In the revision we will add the requested ablation: we will train the same linear probes to predict (i) essay length and (ii) a vocabulary-richness proxy from the identical hidden representations, then compare layer-wise emergence curves, cross-prompt generalization, and the set of high-correlation “scoring neurons” against the human-score results. We will also report the raw correlations between human scores and these surface variables in each dataset to quantify the potential confound. revision: yes
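
The last step the authors promise, reporting raw correlations between human scores and surface variables, is simple enough to sketch. The example below computes essay length in whitespace tokens and a type-token ratio, then correlates each with the human scores. The type-token ratio is an assumed stand-in for the unspecified "vocabulary-richness proxy", and the four toy essays and their scores are invented for illustration.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def surface_features(essay):
    """Return (token count, type-token ratio) using whitespace tokenization."""
    tokens = essay.lower().split()
    return len(tokens), len(set(tokens)) / len(tokens)


# Toy corpus (hypothetical): longer, more varied essays get higher scores.
essays = [
    "good good good",
    "the essay is the essay",
    "a longer essay with more words and more ideas",
    "an even longer essay with many distinct words and clear structure",
]
human_scores = [1, 2, 3, 4]

lengths, ttrs = zip(*(surface_features(e) for e in essays))
r_length = pearson(lengths, human_scores)
r_ttr = pearson(ttrs, human_scores)
print(lengths, [round(t, 3) for t in ttrs], round(r_length, 3), round(r_ttr, 3))
```

If these correlations turn out to be high on the real datasets, a probe that only tracks length or lexical variety could reach a strong score, which is exactly the confound the referee raises.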

Circularity Check

0 steps flagged

No circularity: purely empirical probing study with observational claims

The paper reports results from linear and nonlinear probing, cross-prompt transfer tests, dimensionality reduction, and neuron interventions on LLM hidden states for essay scoring. No equations, derivations, or first-principles claims appear; all findings are direct measurements on fixed datasets (ASAP++, CSEE, ENEM). No fitted parameters are relabeled as predictions, no self-definitional loops, and no load-bearing self-citations or uniqueness theorems are invoked to force conclusions. The central claim (linear decodability of quality) is an empirical observation, not a reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the domain assumption that linear decodability from hidden states indicates structured semantic encoding of essay quality; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Linear probes can extract meaningful semantic information from LLM hidden states.
This is the core premise enabling all probing experiments described.

Reference graph

Works this paper leans on

82 extracted references · 18 canonical work pages · 1 internal anchor

[1]

Designing and Interpreting Probes with Control Tasks

Hewitt, John and Liang, Percy. Designing and Interpreting Probes with Control Tasks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1275

work page doi:10.18653/v1/d19-1275 2019
[2]

2025 , eprint=

Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations , author=. 2025 , eprint=

2025
[3]

Behaviormetrika , volume=

A review of deep-neural automated essay scoring models , author=. Behaviormetrika , volume=. 2021 , publisher=

2021
[4]

, author=

Automated Essay Scoring: A Survey of the State of the Art. , author=. IJCAI , volume=
[5]

Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring

Do, Heejin and Kim, Yunsu and Lee, Gary Geunbae. Prompt- and Trait Relation-aware Cross-prompt Essay Trait Scoring. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.98

work page doi:10.18653/v1/2023.findings-acl.98 2023
[6]

Expert Systems with Applications , volume=

Pairwise dual-level alignment for cross-prompt automated essay scoring , author=. Expert Systems with Applications , volume=. 2025 , publisher=

2025
[7]

International Journal of Educational Technology in Higher Education , volume=

AI-generated feedback on writing: Insights into efficacy and ENL student preference , author=. International Journal of Educational Technology in Higher Education , volume=. 2023 , publisher=

2023
[8]

arXiv , author=

Exploring LLM prompting strategies for joint essay scoring and feedback generation. arXiv , author=
[9]

Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) , pages=

LLM-as-a-tutor in EFL writing education: Focusing on evaluation of student-LLM interaction , author=. Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) , pages=
[10]

Proceedings of the 15th international learning analytics and knowledge conference , pages=

Human-ai collaborative essay scoring: A dual-process framework with llms , author=. Proceedings of the 15th international learning analytics and knowledge conference , pages=
[11]

Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) , year=

ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores , author=. Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) , year=

2018
[12]

Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol

A new benchmark for automatic essay scoring in Portuguese , author=. Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol. 1 , pages=
[13]

Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring

Lee, Sanwoo and Cai, Yida and Meng, Desong and Wang, Ziyang and Wu, Yunfang. Unleashing Large Language Models' Proficiency in Zero-shot Essay Scoring. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.10

work page doi:10.18653/v1/2024.findings-emnlp.10 2024
[14]

A Neural Approach to Automated Essay Scoring

Taghipour, Kaveh and Ng, Hwee Tou. A Neural Approach to Automated Essay Scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1193

work page doi:10.18653/v1/d16-1193 2016
[15]

Automatic Text Scoring Using Neural Networks

Alikaniotis, Dimitrios and Yannakoudakis, Helen and Rei, Marek. Automatic Text Scoring Using Neural Networks. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1068

work page doi:10.18653/v1/p16-1068 2016
[16]

Automatic

Dong, Fei and Zhang, Yue. Automatic Features for Essay Scoring -- An Empirical Study. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016. doi:10.18653/v1/D16-1115

work page doi:10.18653/v1/d16-1115 2016
[17]

Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring

Dong, Fei and Zhang, Yue and Yang, Jie. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. Proceedings of the 21st Conference on Computational Natural Language Learning ( C o NLL 2017). 2017. doi:10.18653/v1/K17-1017

work page doi:10.18653/v1/k17-1017 2017
[18]

2019 , eprint=

Language models and Automated Essay Scoring , author=. 2019 , eprint=

2019
[19]

Automated Essay Scoring via Pairwise Contrastive Regression

Xie, Jiayi and Cai, Kaiwei and Kong, Li and Zhou, Junsheng and Qu, Weiguang. Automated Essay Scoring via Pairwise Contrastive Regression. Proceedings of the 29th International Conference on Computational Linguistics. 2022

2022
[20]

Automated Essay Scoring with Discourse-Aware Neural Models

Nadeem, Farah and Nguyen, Huy and Liu, Yang and Ostendorf, Mari. Automated Essay Scoring with Discourse-Aware Neural Models. Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. 2019. doi:10.18653/v1/W19-4450

work page doi:10.18653/v1/w19-4450 2019
[21]

Enhancing Automated Essay Scoring Performance via Fine-tuning Pre-trained Language Models with Combination of Regression and Ranking

Yang, Ruosong and Cao, Jiannong and Wen, Zhiyuan and Wu, Youzheng and He, Xiaodong. Enhancing Automated Essay Scoring Performance via Fine-tuning Pre-trained Language Models with Combination of Regression and Ranking. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.141

work page doi:10.18653/v1/2020.findings-emnlp.141 2020
[22]

Neural Automated Essay Scoring Incorporating Handcrafted Features

Uto, Masaki and Xie, Yikuan and Ueno, Maomi. Neural Automated Essay Scoring Incorporating Handcrafted Features. Proceedings of the 28th International Conference on Computational Linguistics. 2020. doi:10.18653/v1/2020.coling-main.535

work page doi:10.18653/v1/2020.coling-main.535 2020
[23]

arXiv , author=

Prompt agnostic essay scorer: a domain generalization approach to cross-prompt automated essay scoring. arXiv , author=. arXiv preprint arXiv:2008.01441 , year=

arXiv 2008
[24]

PMAES : Prompt-mapping Contrastive Learning for Cross-prompt Automated Essay Scoring

Chen, Yuan and Li, Xia. PMAES : Prompt-mapping Contrastive Learning for Cross-prompt Automated Essay Scoring. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.83

work page doi:10.18653/v1/2023.acl-long.83 2023
[25]

Expert Systems with Applications , pages=

Making meta-learning solve cross-prompt automatic essay scoring , author=. Expert Systems with Applications , pages=. 2025 , publisher=

2025
[26]

Advances in neural information processing systems , volume=

Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=
[27]

Research Methods in Applied Linguistics, 2 (2), 100050 , author=

Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2 (2), 100050 , author=
[28]

Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

Rating short L2 essays on the CEFR scale with GPT-4 , author=. Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023) , pages=

2023
[29]

arXiv preprint arXiv:2505.08498 , year=

LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models , author=. arXiv preprint arXiv:2505.08498 , year=

arXiv
[30]

2012 , howpublished =

Ben Hamner and Jaison Morgan and lynnvandev and Mark Shermis and Tom Vander Ark , title =. 2012 , howpublished =

2012
[31]

ETS Research Report Series , volume=

TOEFL11: A corpus of non-native English , author=. ETS Research Report Series , volume=. 2013 , publisher=

2013
[32]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv
[33]

Qwen2.5: A Party of Foundation Models , url =

Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =
[34]

Natural Language Engineering , volume=

Evaluation of text coherence for electronic essay scoring systems , author=. Natural Language Engineering , volume=. 2004 , publisher=

2004
[35]

The Journal of Technology, Learning and Assessment , volume=

Automated essay scoring using Bayes' theorem , author=. The Journal of Technology, Learning and Assessment , volume=
[36]

Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=

A new dataset and method for automatically grading ESOL texts , author=. Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies , pages=
[37]

Human-AI collaborative essay scoring: A dual-process framework with LLMs. arXiv. doi: 10.48550 , author=. arXiv preprint arXiv.2401.06431 , year=

arXiv
[38]

arXiv preprint arXiv:2504.05736 , year=

Rank-then-score: Enhancing large language models for automated essay scoring , author=. arXiv preprint arXiv:2504.05736 , year=

arXiv
[39]

Conundrums in Cross-Prompt Automated Essay Scoring: Making Sense of the State of the Art

Li, Shengjie and Ng, Vincent. Conundrums in Cross-Prompt Automated Essay Scoring: Making Sense of the State of the Art. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.414

work page doi:10.18653/v1/2024.acl-long.414 2024
[40]

arXiv preprint arXiv:2403.06149 , year=

Can large language models automatically score proficiency of written essays? , author=. arXiv preprint arXiv:2403.06149 , year=

arXiv
[41]

Analyzing Encoded Concepts in Transformer Language Models

Sajjad, Hassan and Durrani, Nadir and Dalvi, Fahim and Alam, Firoj and Khan, Abdul and Xu, Jia. Analyzing Encoded Concepts in Transformer Language Models. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.225

work page doi:10.18653/v1/2022.naacl-main.225 2022
[42]

Computational linguistics , volume=

Building a large annotated corpus of English: The Penn Treebank , author=. Computational linguistics , volume=
[43]

CoRR , volume =

Jacob Devlin and Ming. CoRR , volume =. 2018 , url =

2018
[44]

arXiv preprint arXiv:2408.13533 , year=

Pandora's Box or Aladdin's Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models , author=. arXiv preprint arXiv:2408.13533 , year=

arXiv
[45]

2016 , booktitle =

Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

work page doi:10.18653/v1/p16-1162 2016
[46]

1993 , issn =

The imminence of grading essays by computer—25 years later , journal =. 1993 , issn =. doi:https://doi.org/10.1016/S8755-4615(05)80058-1 , url =

work page doi:10.1016/s8755-4615(05)80058-1 1993
[47]

A Report on the First Native Language Identification Shared Task

Tetreault, Joel and Blanchard, Daniel and Cahill, Aoife. A Report on the First Native Language Identification Shared Task. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. 2013

2013
[48]

Educational and psychological measurement , volume=

A coefficient of agreement for nominal scales , author=. Educational and psychological measurement , volume=. 1960 , publisher=

1960
[49]

2024 , eprint=

GPT-4o System Card , author=. 2024 , eprint=

2024
[50]

The Claude 3 Model Family: Technical Report

Anthropic. The Claude 3 Model Family: Technical Report. 2024

2024
[51]

The Development of Writing Proficiency as a Function of Grade Level: A Linguistic Analysis , volume =

Weston, Jennifer and Sullivan, Susan and McNamara, Danielle , year =. The Development of Writing Proficiency as a Function of Grade Level: A Linguistic Analysis , volume =. Written Communication , doi =
[52]

Journal of Writing Research , volume=

Linguistic features in writing quality and development: An overview , author=. Journal of Writing Research , volume=
[53]

Scientific reports , volume=

A large-scale comparison of human-written versus ChatGPT-generated essays , author=. Scientific reports , volume=. 2023 , publisher=

2023
[54]

arXiv preprint arXiv:2409.11547 , year=

Small language models can outperform humans in short creative writing: A study comparing slms with humans and llms , author=. arXiv preprint arXiv:2409.11547 , year=

arXiv
[55]

2010 , edition=

The Cambridge Dictionary of Statistics , author=. 2010 , edition=

2010
[56]

A New Benchmark for Automatic Essay Scoring in P ortuguese

Silveira, Igor Cataneo and Barbosa, Andr \'e and Mau \'a , Denis Deratani. A New Benchmark for Automatic Essay Scoring in P ortuguese. Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 1. 2024

2024
[57]

arXiv preprint arXiv:2409.13120 , year=

Are large language models good essay graders? , author=. arXiv preprint arXiv:2409.13120 , year=

arXiv
[58]

Advances in neural information processing systems , volume=

Locating and editing factual associations in gpt , author=. Advances in neural information processing systems , volume=
[59]

arXiv preprint arXiv:2305.01610 , year=

Finding neurons in a haystack: Case studies with sparse probing , author=. arXiv preprint arXiv:2305.01610 , year=

arXiv
[60]

Discovering latent knowledge in language models without supervision, 2024 , author=

2024
[61]

arXiv preprint arXiv:2308.09124 , year=

Linearity of relation decoding in transformer language models , author=. arXiv preprint arXiv:2308.09124 , year=

arXiv
[62]

arXiv preprint arXiv:2310.02207 , year=

Language models represent space and time , author=. arXiv preprint arXiv:2310.02207 , year=

arXiv
[63]

2023 , eprint=

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks , author=. 2023 , eprint=

2023
[64]

2020 , eprint=

A Primer in BERTology: What we know about how BERT works , author=. 2020 , eprint=

2020
[65]

Computational Linguistics , volume=

Probing classifiers: Promises, shortcomings, and advances , author=. Computational Linguistics , volume=
[66]

2022 , eprint=

Toy Models of Superposition , author=. 2022 , eprint=

2022
[67]

Distill , volume=

Zoom in: An introduction to circuits , author=. Distill , volume=
[68]

URL https://arxiv

Understanding intermediate layers using linear classifier probes, 2018 , author=. URL https://arxiv. org/abs/1610.01644 , volume=

Pith/arXiv arXiv 2018
[69]

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=

Probing the probing paradigm: Does probing accuracy entail task relevance? , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=
[70]

2009 , publisher=

The elements of statistical learning: data mining, inference, and prediction , author=. 2009 , publisher=

2009
[71]

Linguistic regularities in continuous space word representations. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013.
[72]

The ‘reading’ brain: Meta-analytic insight into functional activation during reading in adults. Neuroscience & Biobehavioral Reviews. 2025.
[73]

BERT Rediscovers the Classical NLP Pipeline. 2019.
[74]

Linguistic complexity: Locality of syntactic dependencies. Cognition. 1998.
[75]

Qwen3 Technical Report. 2025.
[76]

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs. 2025.
[77]

Textbooks Are All You Need. 2023.
[78]

TinyLlama: An Open-Source Small Language Model. 2024.
[79]

Ettinger, Allyson and Elgohary, Ahmed and Resnik, Philip. Probing for semantic evidence of composition by means of simple classification tasks. Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP. 2016. doi:10.18653/v1/W16-2524
[80]

Belinkov, Yonatan and Glass, James. Analysis Methods in Neural Language Processing: A Survey. Transactions of the Association for Computational Linguistics. 2019. doi:10.1162/tacl_a_00254

Showing first 80 references.
