pith. machine review for the scientific record.

arxiv: 2605.14257 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: no theorem link

What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords vocabulary difficulty prediction · word difficulty · explainable model · LLM fine-tuning · KVL lists · shared task · language assessment

The pith

Spelling difficulty and test item construction often drive ratings in standard vocabulary difficulty lists beyond genuine word production demands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a high-accuracy black-box model by fine-tuning an LLM with a soft-target loss, reaching r greater than 0.91 and leading the open track of the BEA 2026 shared task. It pairs this with an explainable model that keeps r above 0.77 while surfacing which factors shape each item's difficulty score. Analysis of the British Council KVL data shows that spelling challenges and how test items are worded frequently influence the ratings in addition to real production difficulty. Readers should care because these findings indicate that current difficulty lists mix test artifacts with core language-learning demands.

Core claim

The difficulty of items in the British Council's Knowledge-based Vocabulary Lists is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words.

What carries the argument

Fine-tuned LLM with soft-target loss for black-box prediction, paired with an explainable model that decomposes per-item influences on difficulty ratings.
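The abstract does not spell out the loss. One common reading of a soft-target loss for a rating task (in the spirit of knowledge distillation, reference [10]) treats each scalar rating as a soft distribution over discrete difficulty bins and minimizes cross-entropy against it. A minimal pure-Python sketch; the bin scheme, temperature, and all names here are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_target(rating, num_bins=5, temperature=1.0):
    """Turn a scalar difficulty rating into a soft distribution over bins.

    Bins are centered at 1..num_bins; probability mass falls off with
    squared distance from the rating, scaled by `temperature`.
    """
    logits = [-((rating - (b + 1)) ** 2) / temperature for b in range(num_bins)]
    return softmax(logits)

def soft_cross_entropy(target_probs, predicted_logits):
    """Cross-entropy between a soft target distribution and model logits."""
    pred = softmax(predicted_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred) if t > 0)

# A rating of 2.3 spreads most of its mass over the adjacent bins 2 and 3,
# so near-miss predictions are penalized less than distant ones.
target = soft_target(2.3)
loss = soft_cross_entropy(target, [0.1, 2.0, 1.5, 0.2, -1.0])
```

The appeal over plain regression is that the loss carries ordinal structure: predicting bin 3 for a bin-2 item costs less than predicting bin 5.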

If this is right

  • Vocabulary tests and lists should separate spelling and format effects from production difficulty to give cleaner signals for learners and teachers.
  • Explainable models can flag which specific items in existing lists are likely inflated by non-production factors.
  • Training data for future difficulty predictors should include explicit spelling and item-format annotations to reduce post-hoc capture of confounders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Difficulty prediction systems may improve if they are trained to output separate scores for spelling load, item format, and core production effort rather than a single combined rating.
  • This approach could transfer to other language assessment domains where surface features of test items distort underlying skill measures.

Load-bearing premise

The shared-task dataset and KVL lists measure genuine word production difficulty, with no major confounding beyond the spelling and test-item design effects that the models later detect.

What would settle it

Re-rating a subset of KVL items after standardizing spelling presentation and removing item-construction cues, then checking whether the original difficulty scores remain unchanged.
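That experiment boils down to comparing two score vectors for the same items. A sketch of the comparison step (Pearson correlation in plain Python; the item scores below are made-up illustrations, not data from the paper):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical difficulty scores for the same KVL items before and after
# standardizing spelling presentation and removing construction cues.
original = [3.1, 2.4, 4.0, 1.8, 3.6]
rerated  = [2.7, 2.5, 3.2, 1.9, 3.0]

# An r near 1 would mean spelling/format effects were minor; a large
# drop would support the confound claim.
r = pearson_r(original, rerated)
```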

Figures

Figures reproduced from arXiv: 2605.14257 by Adam Nohejl, Hitomi Yanaka, Maria Angelica Riera Machin, Xuanxin Wu, Yi-Ning Chang, Yusuke Ide.

Figure 1
Figure 1: Global SHAP summaries by L1. …feature for Spanish. Perhaps counter-intuitively, it is much less important for L1 Chinese. We hypothesize this is caused by two factors. First, the production frequency uses learner-written texts, and therefore it partially discounts the frequency of words with frequent mistakes. As a result, the importance of the separate spelling difficulty feature is lower proportionally t… view at source ↗
Figure 2
Figure 2: Example of local SHAP explanations by L1. view at source ↗
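The SHAP explanations in the figures assign each feature an additive share of a prediction. As a generic illustration of that idea (not the paper's pipeline), exact Shapley values can be brute-forced for a toy additive difficulty model over three hypothetical factors:

```python
import itertools
import math

def shapley_values(n, value_fn):
    """Exact Shapley values over n features by enumerating all subsets.

    value_fn(subset) returns the model output when only the features in
    `subset` are present (the rest held at a baseline). Exponential cost,
    so only feasible for a handful of features; SHAP approximates this
    quantity efficiently for real models.
    """
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for subset in itertools.combinations(others, k):
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi += weight * (value_fn(set(subset) | {i}) - value_fn(set(subset)))
        phis.append(phi)
    return phis

# Toy additive 'difficulty' model over three hypothetical factors:
# production frequency, spelling difficulty, and item format.
weights = [0.5, 0.3, 0.2]
x = [2.0, 1.0, 3.0]

def difficulty(subset):
    return sum(weights[j] * x[j] for j in subset)

# For an additive model, each feature's Shapley value equals its own
# contribution, and the values sum to the full prediction.
phis = shapley_values(3, difficulty)
```

The per-item decompositions in Figure 2 are the same construction applied to a fitted model rather than a hand-written one.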
read the original abstract

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents two models for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction: a black-box LLM fine-tuned with a soft-target loss achieving r > 0.91 and topping the open track, and an explainable model reaching r > 0.77 that outperforms a fine-tuned encoder baseline. The explainable model is used to analyze KVL items, concluding that difficulty is affected by spelling difficulty and test-item construction in addition to genuine production difficulty. Code is released at https://github.com/adno/vocabulary-difficulty.

Significance. If the results and analysis hold, the work contributes a strong shared-task entry with reproducible code and an interpretable component that flags potential surface confounds in test-derived vocabulary difficulty labels. This has practical value for educational assessment design and for distinguishing production difficulty from orthographic or format effects in NLP applications.

major comments (1)
  1. [Abstract / Analysis] Abstract and analysis section: The central interpretive claim that KVL difficulty reflects spelling difficulty and test-item construction 'in addition to' genuine production difficulty lacks an independent anchor. Both models are trained directly on KVL-derived ratings, so the explainable model necessarily learns statistical associations present in those same labels; without a separate production-only gold standard (e.g., free-recall or cloze accuracy collected independently of the KVL test format), the analysis cannot distinguish additive effects from saturation of the labels by spelling and construction confounds.
minor comments (2)
  1. [Methods] Methods: Provide explicit details on train/dev/test splits, the precise encoder baseline architecture, hyperparameter choices for the soft-target loss, and any error analysis or ablation results to allow full verification of the reported correlations.
  2. [Model description] Explainable model: Clarify the feature set and how interpretability is achieved (e.g., which surface cues are explicitly modeled) so readers can assess whether the r > 0.77 performance genuinely isolates the claimed factors.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the interpretive framing in the abstract and analysis requires careful qualification given the shared-task data constraints, and we will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract / Analysis] Abstract and analysis section: The central interpretive claim that KVL difficulty reflects spelling difficulty and test-item construction 'in addition to' genuine production difficulty lacks an independent anchor. Both models are trained directly on KVL-derived ratings, so the explainable model necessarily learns statistical associations present in those same labels; without a separate production-only gold standard (e.g., free-recall or cloze accuracy collected independently of the KVL test format), the analysis cannot distinguish additive effects from saturation of the labels by spelling and construction confounds.

    Authors: We accept this criticism. The analysis is correlational and draws exclusively from KVL-derived labels; no independent production-only gold standard (such as free-recall or cloze data) is available within the shared-task setting. The explainable model surfaces features (spelling complexity, item format) that are theoretically distinct and measurable independently of the labels themselves, but we cannot rule out that these features simply saturate the observed ratings. We will revise the abstract and analysis section to present the findings as evidence of potential surface confounds in KVL difficulty labels rather than claiming additive effects beyond genuine production difficulty. A new limitations paragraph will explicitly note the absence of an independent anchor and recommend future validation with production-only measures. revision: yes

Circularity Check

0 steps flagged

No significant circularity in model training or analysis

full rationale

The paper trains supervised models (black-box LLM with soft-target loss and explainable model) directly on KVL-derived difficulty ratings and evaluates correlation against the shared-task held-out test set. No equations, derivations, or self-citations are presented that reduce any reported prediction or interpretive claim to its own inputs by construction. The post-hoc analysis of spelling and test-construction factors is an empirical observation from model features correlated with the provided labels, not a definitional or fitted-input circularity. Code release enables external reproduction, confirming the work is self-contained against the task benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the standard supervised learning assumption that the provided training data and evaluation metrics capture true difficulty; no new mathematical axioms or invented entities are introduced.

free parameters (1)
  • soft-target loss hyperparameters
    Tuned during LLM fine-tuning to optimize for the rating task; exact values not specified in abstract.
axioms (1)
  • domain assumption: Shared task data splits and labels are representative of vocabulary difficulty
    Invoked implicitly when claiming top performance and explanatory insights generalize.

pith-pipeline@v0.9.0 · 5466 in / 1157 out tokens · 40364 ms · 2026-05-15T02:56:02.140941+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 7 internal anchors

  1. [1]

    Dennis Aumiller and Michael Gertz. 2022. https://doi.org/10.18653/v1/2022.tsar-1.28 UniHD at TSAR-2022 shared task: Is compute all we need for lexical simplification? In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 251--258, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational...

  2. [2]

    BNC Consortium. 2007. https://llds.ling-phil.ox.ac.uk/llds/xmlui/handle/20.500.14106/2554 British National Corpus, XML edition.

  3. [3]

    Annette Capel. 2012. https://doi.org/10.1017/S2041536212000013 Completing the English Vocabulary Profile: C1 and C2 vocabulary. English Profile Journal, 3:e1

  4. [4]

    Tianqi Chen and Carlos Guestrin. 2016. https://doi.org/10.1145/2939672.2939785 XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785--794, New York, NY, USA. Association for Computing Machinery

  5. [5]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Comp...

  6. [6]

    DeepSeek-AI. 2025. http://arxiv.org/abs/2412.19437v2 DeepSeek-V3 Technical Report. ArXiv preprint, arXiv:2412.19437v2 [cs.CL]

  7. [7]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems, 36:10088--10115

  8. [8]

    Taisei Enomoto, Hwichan Kim, Tosho Hirasawa, Yoshinari Nagai, Ayako Sato, Kyotaro Nakajima, and Mamoru Komachi. 2024. https://aclanthology.org/2024.bea-1.52/ TMU-HIT at MLSP 2024: How well can GPT-4 tackle multilingual lexical simplification? In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), ...

  9. [9]

    Mariano Felice and Lucy Skidmore. 2026. Findings of the BEA 2026 shared task on vocabulary difficulty prediction for English learners. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), San Diego, California. Association for Computational Linguistics

  10. [10]

    Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. http://arxiv.org/abs/1503.02531 Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop

  11. [11]

    Yusuke Ide, Masato Mita, Adam Nohejl, Hiroki Ouchi, and Taro Watanabe. 2023. https://doi.org/10.18653/v1/2023.bea-1.40 Japanese lexical complexity for non-native readers: A new dataset. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 477--487, Toronto, Canada. Association for Computati...

  12. [12]

    John H.A.L. Jong, Mike Mayor, and Catherine Hayes. 2016. https://www.pearson.com/content/dam/one-dot-com/one-dot-com/english/TeacherResources/GSE/GSE-WhitePaper-Developing-LOs.pdf Developing global scale of English learning objectives aligned to the common European framework. Technical report

  13. [13]

    Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. https://aclanthology.org/L18-1275/ OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Associ...

  14. [14]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.153 G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511--2522, Singapore. Association for Computational Linguistics

  15. [15]

    Scott M. Lundberg and Su-In Lee. 2017. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pages 4768--4777, Red Hook, NY, USA. Curran Associates Inc

  16. [16]

    Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, and Benjamin Van Durme. 2025. http://arxiv.org/abs/2509.06888v1 mmBERT: A Modern Multilingual Encoder with Annealed Language Learning. ArXiv preprint, arXiv:2509.06888v1 [cs]

  17. [17]

    Mistral AI. 2026. http://arxiv.org/abs/2601.08584v1 Ministral 3. ArXiv preprint, arXiv:2601.08584v1 [cs.CL]

  18. [18]

    Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. https://aclanthology.org/I11-1017/ Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 147--155, Chiang Mai, Thailand. Asian Fe...

  19. [19]

    Adam Nohejl, Akio Hayakawa, Yusuke Ide, and Taro Watanabe. 2024. https://doi.org/10.18653/v1/2024.tsar-1.8 Difficult for whom? A study of Japanese lexical complexity. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 69--81, Miami, Florida, USA. Association for Computational Linguistics

  20. [20]

    Adam Nohejl, Akio Hayakawa, Yusuke Ide, and Taro Watanabe. 2025a. https://doi.org/10.5715/jnlp.32.1129 A Japanese Dataset and Efficient Multilingual LLM-Based Methods for Lexical Simplification and Lexical Complexity Prediction. Journal of Natural Language Processing, 32(4):1129--1188

  21. [21]

    Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, and Taro Watanabe. 2025b. https://aclanthology.org/2025.coling-main.641/ Beyond film subtitles: Is YouTube the best approximation of spoken vocabulary? In Proceedings of the 31st International Conference on Computational ...

  22. [22]

    OpenAI. 2024. http://arxiv.org/abs/2303.08774v6 GPT-4 Technical Report. ArXiv preprint, arXiv:2303.08774v6 [cs.CL]

  23. [23]

    OpenAI. 2025. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf Update to GPT-5 System Card: GPT-5.2. Technical report

  24. [24]

    Gustavo Paetzold and Lucia Specia. 2016. https://doi.org/10.18653/v1/S16-1085 SemEval 2016 task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560--569, San Diego, California. Association for Computational Linguistics

  25. [25]

    Qwen Team. 2025. http://arxiv.org/abs/2412.15115v2 Qwen2.5 Technical Report. ArXiv preprint, arXiv:2412.15115v2 [cs.CL]

  26. [26]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. http://arxiv.org/abs/1910.01108 DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019, volume arXiv:1910.01108

  27. [27]

    Norbert Schmitt, Karen Dunn, Barry O'Sullivan, Laurence Anthony, and Benjamin Kremmel. 2021. https://doi.org/10.1002/tesj.622 Introducing Knowledge-based Vocabulary Lists (KVL). TESOL Journal, 12(4):e622

  28. [28]

    Norbert Schmitt, Karen Dunn, Barry O'Sullivan, Laurence Anthony, and Benjamin Kremmel. 2024. https://doi.org/10.3138/9781800504141 Knowledge-Based Vocabulary Lists. British Council Monographs on Modern Language Testing. University of Toronto Press

  29. [29]

    Graham G. Scott, Anne Keitel, Marc Becirspahic, Bo Yao, and Sara C. Sereno. 2019. https://doi.org/10.3758/s13428-018-1099-3 The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3):1258--1270

  30. [30]

    Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Pérez Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, and 3 others...

  31. [31]

    Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. https://doi.org/10.18653/v1/2021.semeval-1.1 SemEval-2021 task 1: Lexical complexity prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1--16, Online. Association for Computational Linguistics

  32. [32]

    Lucy Skidmore, Mariano Felice, and Karen Dunn. 2025. https://doi.org/10.18653/v1/2025.bea-1.12 Transformer architectures for vocabulary test item difficulty prediction. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 160--174, Vienna, Austria. Association for Computational Linguistics

  33. [33]

    Răzvan-Alexandru Smădu, David-Gabriel Ion, Dumitru-Clementin Cercel, Florin Pop, and Mihaela-Claudia Cercel. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.933 Investigating large language models for complex word identification in multilingual and multidomain setups. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pr...

  34. [34]

    Team GLM. 2024. http://arxiv.org/abs/2406.12793v2 ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. ArXiv preprint, arXiv:2406.12793v2 [cs.CL]

  35. [35]

    Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. https://doi.org/10.18653/v1/W18-0507 A report on the complex word identification shared task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 66-...