pith. machine review for the scientific record.

arxiv: 2605.14257 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: no theorem link

What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords vocabulary difficulty prediction · word difficulty · explainable model · LLM fine-tuning · KVL lists · shared task · language assessment

The pith

Spelling difficulty and test item construction often drive ratings in standard vocabulary difficulty lists beyond genuine word production demands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a high-accuracy black-box model by fine-tuning an LLM with a soft-target loss, reaching r greater than 0.91 and leading the open track of the BEA 2026 shared task. It pairs this with an explainable model that keeps r above 0.77 while surfacing which factors shape each item's difficulty score. Analysis of the British Council KVL data shows that spelling challenges and how test items are worded frequently influence the ratings in addition to real production difficulty. Readers should care because these findings indicate that current difficulty lists mix test artifacts with core language-learning demands.

Core claim

The difficulty of items in the British Council's Knowledge-based Vocabulary Lists is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words.

What carries the argument

Fine-tuned LLM with soft-target loss for black-box prediction, paired with an explainable model that decomposes per-item influences on difficulty ratings.
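The abstract does not spell out the loss. One common reading of a soft-target loss for a rating task (in the spirit of knowledge distillation, reference [10]) treats each scalar rating as a soft distribution over discrete difficulty bins and minimizes cross-entropy against it. A minimal pure-Python sketch; the bin scheme, temperature, and all names here are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_target(rating, num_bins=5, temperature=1.0):
    """Turn a scalar difficulty rating into a soft distribution over bins.

    Bins are centered at 1..num_bins; probability mass falls off with
    squared distance from the rating, scaled by `temperature`.
    """
    logits = [-((rating - (b + 1)) ** 2) / temperature for b in range(num_bins)]
    return softmax(logits)

def soft_cross_entropy(target_probs, predicted_logits):
    """Cross-entropy between a soft target distribution and model logits."""
    pred = softmax(predicted_logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred) if t > 0)

# A rating of 2.3 spreads most of its mass over the adjacent bins 2 and 3,
# so near-miss predictions are penalized less than distant ones.
target = soft_target(2.3)
loss = soft_cross_entropy(target, [0.1, 2.0, 1.5, 0.2, -1.0])
```

The appeal over plain regression is that the loss carries ordinal structure: predicting bin 3 for a bin-2 item costs less than predicting bin 5.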

If this is right

  • Vocabulary tests and lists should separate spelling and format effects from production difficulty to give cleaner signals for learners and teachers.
  • Explainable models can flag which specific items in existing lists are likely inflated by non-production factors.
  • Training data for future difficulty predictors should include explicit spelling and item-format annotations to reduce post-hoc capture of confounders.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Difficulty prediction systems may improve if they are trained to output separate scores for spelling load, item format, and core production effort rather than a single combined rating.
  • This approach could transfer to other language assessment domains where surface features of test items distort underlying skill measures.

Load-bearing premise

The shared-task dataset and KVL lists measure genuine word production difficulty, with no major confounding beyond the spelling and test-item design effects that the models later detect.

What would settle it

Re-rating a subset of KVL items after standardizing spelling presentation and removing item-construction cues, then checking whether the original difficulty scores remain unchanged.
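That experiment boils down to comparing two score vectors for the same items. A sketch of the comparison step (Pearson correlation in plain Python; the item scores below are made-up illustrations, not data from the paper):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical difficulty scores for the same KVL items before and after
# standardizing spelling presentation and removing construction cues.
original = [3.1, 2.4, 4.0, 1.8, 3.6]
rerated  = [2.7, 2.5, 3.2, 1.9, 3.0]

# An r near 1 would mean spelling/format effects were minor; a large
# drop would support the confound claim.
r = pearson_r(original, rerated)
```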

Figures

Figures reproduced from arXiv: 2605.14257 by Adam Nohejl, Hitomi Yanaka, Maria Angelica Riera Machin, Xuanxin Wu, Yi-Ning Chang, Yusuke Ide.

Figure 1
Figure 1: Global SHAP summaries by L1. …feature for Spanish. Perhaps counter-intuitively, it is much less important for L1 Chinese. We hypothesize this is caused by two factors. First, the production frequency uses learner-written texts, and therefore it partially discounts the frequency of words with frequent mistakes. As a result, the importance of the separate spelling difficulty feature is lower proportionally t… view at source ↗
Figure 2
Figure 2: Example of local SHAP explanations by L1. view at source ↗
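The SHAP explanations in the figures assign each feature an additive share of a prediction. As a generic illustration of that idea (not the paper's pipeline), exact Shapley values can be brute-forced for a toy additive difficulty model over three hypothetical factors:

```python
import itertools
import math

def shapley_values(n, value_fn):
    """Exact Shapley values over n features by enumerating all subsets.

    value_fn(subset) returns the model output when only the features in
    `subset` are present (the rest held at a baseline). Exponential cost,
    so only feasible for a handful of features; SHAP approximates this
    quantity efficiently for real models.
    """
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for subset in itertools.combinations(others, k):
                weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi += weight * (value_fn(set(subset) | {i}) - value_fn(set(subset)))
        phis.append(phi)
    return phis

# Toy additive 'difficulty' model over three hypothetical factors:
# production frequency, spelling difficulty, and item format.
weights = [0.5, 0.3, 0.2]
x = [2.0, 1.0, 3.0]

def difficulty(subset):
    return sum(weights[j] * x[j] for j in subset)

# For an additive model, each feature's Shapley value equals its own
# contribution, and the values sum to the full prediction.
phis = shapley_values(3, difficulty)
```

The per-item decompositions in Figure 2 are the same construction applied to a fitted model rather than a hand-written one.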
read the original abstract

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/adno/vocabulary-difficulty .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents two models for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction: a black-box LLM fine-tuned with a soft-target loss achieving r > 0.91 and topping the open track, and an explainable model reaching r > 0.77 that outperforms a fine-tuned encoder baseline. The explainable model is used to analyze KVL items, concluding that difficulty is affected by spelling difficulty and test-item construction in addition to genuine production difficulty. Code is released at https://github.com/adno/vocabulary-difficulty.

Significance. If the results and analysis hold, the work contributes a strong shared-task entry with reproducible code and an interpretable component that flags potential surface confounds in test-derived vocabulary difficulty labels. This has practical value for educational assessment design and for distinguishing production difficulty from orthographic or format effects in NLP applications.

major comments (1)
  1. [Abstract / Analysis] Abstract and analysis section: The central interpretive claim that KVL difficulty reflects spelling difficulty and test-item construction 'in addition to' genuine production difficulty lacks an independent anchor. Both models are trained directly on KVL-derived ratings, so the explainable model necessarily learns statistical associations present in those same labels; without a separate production-only gold standard (e.g., free-recall or cloze accuracy collected independently of the KVL test format), the analysis cannot distinguish additive effects from saturation of the labels by spelling and construction confounds.
minor comments (2)
  1. [Methods] Methods: Provide explicit details on train/dev/test splits, the precise encoder baseline architecture, hyperparameter choices for the soft-target loss, and any error analysis or ablation results to allow full verification of the reported correlations.
  2. [Model description] Explainable model: Clarify the feature set and how interpretability is achieved (e.g., which surface cues are explicitly modeled) so readers can assess whether the r > 0.77 performance genuinely isolates the claimed factors.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the interpretive framing in the abstract and analysis requires careful qualification given the shared-task data constraints, and we will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract / Analysis] Abstract and analysis section: The central interpretive claim that KVL difficulty reflects spelling difficulty and test-item construction 'in addition to' genuine production difficulty lacks an independent anchor. Both models are trained directly on KVL-derived ratings, so the explainable model necessarily learns statistical associations present in those same labels; without a separate production-only gold standard (e.g., free-recall or cloze accuracy collected independently of the KVL test format), the analysis cannot distinguish additive effects from saturation of the labels by spelling and construction confounds.

    Authors: We accept this criticism. The analysis is correlational and draws exclusively from KVL-derived labels; no independent production-only gold standard (such as free-recall or cloze data) is available within the shared-task setting. The explainable model surfaces features (spelling complexity, item format) that are theoretically distinct and measurable independently of the labels themselves, but we cannot rule out that these features simply saturate the observed ratings. We will revise the abstract and analysis section to present the findings as evidence of potential surface confounds in KVL difficulty labels rather than claiming additive effects beyond genuine production difficulty. A new limitations paragraph will explicitly note the absence of an independent anchor and recommend future validation with production-only measures. revision: yes

Circularity Check

0 steps flagged

No significant circularity in model training or analysis

full rationale

The paper trains supervised models (black-box LLM with soft-target loss and explainable model) directly on KVL-derived difficulty ratings and evaluates correlation against the shared-task held-out test set. No equations, derivations, or self-citations are presented that reduce any reported prediction or interpretive claim to its own inputs by construction. The post-hoc analysis of spelling and test-construction factors is an empirical observation from model features correlated with the provided labels, not a definitional or fitted-input circularity. Code release enables external reproduction, confirming the work is self-contained against the task benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the standard supervised learning assumption that the provided training data and evaluation metrics capture true difficulty; no new mathematical axioms or invented entities are introduced.

free parameters (1)
  • soft-target loss hyperparameters
    Tuned during LLM fine-tuning to optimize for the rating task; exact values not specified in abstract.
axioms (1)
  • domain assumption: Shared task data splits and labels are representative of vocabulary difficulty
    Invoked implicitly when claiming top performance and explanatory insights generalize.

pith-pipeline@v0.9.0 · 5466 in / 1157 out tokens · 40364 ms · 2026-05-15T02:56:02.140941+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 7 internal anchors

  1. [1]

    Dennis Aumiller and Michael Gertz. 2022. https://doi.org/10.18653/v1/2022.tsar-1.28 UniHD at TSAR-2022 shared task: Is compute all we need for lexical simplification? In Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022), pages 251--258, Abu Dhabi, United Arab Emirates (Virtual). Association for Computational...

  2. [2]

    BNC Consortium. 2007. https://llds.ling-phil.ox.ac.uk/llds/xmlui/handle/20.500.14106/2554 British National Corpus, XML edition.

  3. [3]

    Annette Capel. 2012. https://doi.org/10.1017/S2041536212000013 Completing the English Vocabulary Profile: C1 and C2 vocabulary. English Profile Journal, 3:e1

  4. [4]

    Tianqi Chen and Carlos Guestrin. 2016. https://doi.org/10.1145/2939672.2939785 XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785--794, New York, NY, USA. Association for Computing Machinery

  5. [5]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Comp...

  6. [6]

    DeepSeek-AI. 2025. http://arxiv.org/abs/2412.19437v2 DeepSeek-V3 Technical Report. ArXiv preprint, arXiv:2412.19437v2 [cs.CL]

  7. [7]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html QLoRA: Efficient Finetuning of Quantized LLMs. Advances in Neural Information Processing Systems, 36:10088--10115

  8. [8]

    Taisei Enomoto, Hwichan Kim, Tosho Hirasawa, Yoshinari Nagai, Ayako Sato, Kyotaro Nakajima, and Mamoru Komachi. 2024. https://aclanthology.org/2024.bea-1.52/ TMU-HIT at MLSP 2024: How well can GPT-4 tackle multilingual lexical simplification? In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), ...

  9. [9]

    Mariano Felice and Lucy Skidmore. 2026. Findings of the BEA 2026 shared task on vocabulary difficulty prediction for English learners. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), San Diego, California. Association for Computational Linguistics

  10. [10]

    Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. http://arxiv.org/abs/1503.02531 Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop

  11. [11]

    Yusuke Ide, Masato Mita, Adam Nohejl, Hiroki Ouchi, and Taro Watanabe. 2023. https://doi.org/10.18653/v1/2023.bea-1.40 Japanese lexical complexity for non-native readers: A new dataset. In Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pages 477--487, Toronto, Canada. Association for Computati...

  12. [12]

    John H.A.L. Jong, Mike Mayor, and Catherine Hayes. 2016. https://www.pearson.com/content/dam/one-dot-com/one-dot-com/english/TeacherResources/GSE/GSE-WhitePaper-Developing-LOs.pdf Developing global scale of English learning objectives aligned to the common European framework. Technical report

  13. [13]

    Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. https://aclanthology.org/L18-1275/ OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Associ...

  14. [14]

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.153 G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511--2522, Singapore. Association for Computational Linguistics

  15. [15]

    Scott M. Lundberg and Su-In Lee. 2017. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, pages 4768--4777, Red Hook, NY, USA. Curran Associates Inc

  16. [16]

    Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, and Benjamin Van Durme. 2025. http://arxiv.org/abs/2509.06888v1 mmBERT: A Modern Multilingual Encoder with Annealed Language Learning. ArXiv preprint, arXiv:2509.06888v1 [cs]

  17. [17]

    Mistral AI. 2026. http://arxiv.org/abs/2601.08584v1 Ministral 3. ArXiv preprint, arXiv:2601.08584v1 [cs.CL]

  18. [18]

    Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. https://aclanthology.org/I11-1017/ Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 147--155, Chiang Mai, Thailand. Asian Fe...

  19. [19]

    Adam Nohejl, Akio Hayakawa, Yusuke Ide, and Taro Watanabe. 2024. https://doi.org/10.18653/v1/2024.tsar-1.8 Difficult for whom? A study of Japanese lexical complexity. In Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 69--81, Miami, Florida, USA. Association for Computational Linguistics

  20. [20]

    Adam Nohejl, Akio Hayakawa, Yusuke Ide, and Taro Watanabe. 2025a. https://doi.org/10.5715/jnlp.32.1129 A Japanese Dataset and Efficient Multilingual LLM-Based Methods for Lexical Simplification and Lexical Complexity Prediction. Journal of Natural Language Processing, 32(4):1129--1188

  21. [21]

    Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, and Taro Watanabe. 2025b. https://aclanthology.org/2025.coling-main.641/ Beyond film subtitles: Is YouTube the best approximation of spoken vocabulary? In Proceedings of the 31st International Conference on Computational ...

  22. [22]

    OpenAI. 2024. http://arxiv.org/abs/2303.08774v6 GPT-4 Technical Report. ArXiv preprint, arXiv:2303.08774v6 [cs.CL]

  23. [23]

    OpenAI. 2025. https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf Update to GPT-5 System Card: GPT-5.2. Technical report

  24. [24]

    Gustavo Paetzold and Lucia Specia. 2016. https://doi.org/10.18653/v1/S16-1085 SemEval 2016 task 11: Complex word identification. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 560--569, San Diego, California. Association for Computational Linguistics

  25. [25]

    Qwen Team. 2025. http://arxiv.org/abs/2412.15115v2 Qwen2.5 Technical Report. ArXiv preprint, arXiv:2412.15115v2 [cs.CL]

  26. [26]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. http://arxiv.org/abs/1910.01108 DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. In 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019, volume arXiv:1910.01108

  27. [27]

    Norbert Schmitt, Karen Dunn, Barry O'Sullivan, Laurence Anthony, and Benjamin Kremmel. 2021. https://doi.org/10.1002/tesj.622 Introducing Knowledge-based Vocabulary Lists (KVL). TESOL Journal, 12(4):e622

  28. [28]

    Norbert Schmitt, Karen Dunn, Barry O'Sullivan, Laurence Anthony, and Benjamin Kremmel. 2024. https://doi.org/10.3138/9781800504141 Knowledge-Based Vocabulary Lists. British Council Monographs on Modern Language Testing. University of Toronto Press

  29. [29]

    Graham G. Scott, Anne Keitel, Marc Becirspahic, Bo Yao, and Sara C. Sereno. 2019. https://doi.org/10.3758/s13428-018-1099-3 The Glasgow Norms: Ratings of 5,500 words on nine scales. Behavior Research Methods, 51(3):1258--1270

  30. [30]

    Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Pérez Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, and 3 others...

  31. [31]

    Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, and Marcos Zampieri. 2021. https://doi.org/10.18653/v1/2021.semeval-1.1 SemEval-2021 task 1: Lexical complexity prediction. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 1--16, Online. Association for Computational Linguistics

  32. [32]

    Lucy Skidmore, Mariano Felice, and Karen Dunn. 2025. https://doi.org/10.18653/v1/2025.bea-1.12 Transformer architectures for vocabulary test item difficulty prediction. In Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025), pages 160--174, Vienna, Austria. Association for Computational Linguistics

  33. [33]

    Răzvan-Alexandru Smădu, David-Gabriel Ion, Dumitru-Clementin Cercel, Florin Pop, and Mihaela-Claudia Cercel. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.933 Investigating large language models for complex word identification in multilingual and multidomain setups. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pr...

  34. [34]

    Team GLM. 2024. http://arxiv.org/abs/2406.12793v2 ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. ArXiv preprint, arXiv:2406.12793v2 [cs.CL]

  35. [35]

    Seid Muhie Yimam, Chris Biemann, Shervin Malmasi, Gustavo Paetzold, Lucia Specia, Sanja Štajner, Anaïs Tack, and Marcos Zampieri. 2018. https://doi.org/10.18653/v1/W18-0507 A report on the complex word identification shared task 2018. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 66-...