Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Jan Net\'ik; Patr\'icia Martinkov\'a

arxiv: 2605.16991 · v1 · pith:HIOEVHICnew · submitted 2026-05-16 · 💻 cs.CL · cs.AI

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Jan Net\'ik , Patr\'icia Martinkov\'a This is my paper

Pith reviewed 2026-05-19 20:18 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords item difficultyresponse-free modelingtransformer fine-tuningmultiple-choice itemsmulti-task learningeducational measurementreading comprehension

0 comments

The pith

Fine-tuned transformers predict multiple-choice item difficulty directly from wording, with multi-task learning aiding small-sample cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transformer encoders can be fine-tuned end-to-end on the text of reading-comprehension multiple-choice items to estimate their difficulty without collecting student responses. It tests a basic joint-encoding method against a component-wise variant that processes wording parts separately and a multi-task variant that adds an auxiliary question-answering task on the same encoder. Across Monte Carlo subsampling at three training sizes, joint encoding works as a viable replacement for hand-crafted feature pipelines, the component-wise version adds no benefit, and the multi-task version yields significant gains specifically when training data is scarcest. A sympathetic reader would care because response data for calibration is often limited or costly in real testing programs, so recovering difficulty information from wording alone could simplify item development and reduce reliance on large pilot samples.

Core claim

Joint encoding of item wording via fine-tuned transformers provides a workable end-to-end alternative to manual feature-engineering pipelines for response-free difficulty modeling; component-wise encoding confers no detectable advantage, while adding an auxiliary multiple-choice question-answering objective produces significant paired improvements in the smallest-sample regime and recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement.

What carries the argument

A shared transformer encoder fine-tuned jointly on item wording for difficulty regression, optionally augmented by an auxiliary multiple-choice question-answering task that regularizes the representation.

If this is right

Joint transformer encoding removes the need for manual feature extraction and preprocessing steps that can discard information.
Self-attention within a single encoder already captures cross-component signals, rendering separate component-wise encoding redundant.
Multi-task regularization with an auxiliary QA objective improves difficulty estimates particularly when response data for training is limited.
The approach supplies a flexible interface that can incorporate further psychometrically motivated auxiliary tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If wording alone carries most of the difficulty signal, test developers could screen or rank new items before any piloting occurs, cutting development time.
The same multi-task setup might extend to predicting other item properties such as discrimination or guessing parameters when those are also partly wording-driven.
Performance gains from auxiliary tasks suggest that difficulty modeling could benefit from borrowing representations learned on large general QA corpora even when local response data remains small.

Load-bearing premise

Item difficulty is determined enough by surface and inferential features in the wording itself, without needing specific details about the student population or testing context.

What would settle it

Observe that the model's difficulty predictions on a held-out set of items deviate substantially from empirical difficulties measured in a new student population whose background or prior knowledge differs markedly from the training data.

Figures

Figures reproduced from arXiv: 2605.16991 by Jan Net\'ik, Patr\'icia Martinkov\'a.

**Figure 2.** Figure 2: Task-conditioning vector zt in the multi-task model. The matrix Z ∈ R T ×d stores one learnable vector per task (T = 2). During a forward pass, the active task selects row zt, which is added only at the [CLS] position of the embedding-layer output before the first encoder layer. Both rows of Z are initialised to zero and trained jointly with the rest of the model. before the first encoder layer reads the r… view at source ↗

**Figure 3.** Figure 3: Paired differences in RMSE between each comparator method (component-wise, MTL, and the dummy regressor) and the joint-encoding approach, computed within the same training sub-sample. Panels are aligned so that the dummy regressor mean sits at the same vertical position across sizes. Dummy regressor Component-wise encoding MTL -0.27 -0.18 -0.09 0.00 0.09 R² difference from joint-encoding approach (positive… view at source ↗

**Figure 4.** Figure 4: Paired differences in R2 between each comparator method (component-wise, MTL, and the dummy regressor) and the joint-encoding approach, computed within the same training sub-sample. Panels are aligned so that the dummy regressor mean sits at the same vertical position across sizes. 4 Discussion This study compared three transformer-based models for response-free item-difficulty prediction within a nested t… view at source ↗

**Figure 5.** Figure 5: Paired differences in Spearman ρ between each comparator method (component-wise and MTL) and the joint-encoding approach, computed within the same training sub-sample. The dummy regressor is undefined for correlation since its variance is zero. Panels are aligned at the mean absolute performance of the joint-encoding approach across sizes. approximately 4% for the best shared-task team (0.299 versus 0.311 … view at source ↗

read the original abstract

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Multi-task fine-tuning on item text improves small-sample difficulty prediction within one dataset, but population transfer is untested.

read the letter

The main thing to know is that this paper shows a multi-task auxiliary QA objective on a shared transformer encoder can improve response-free difficulty estimates at the smallest training sizes they tested, while a component-wise encoding variant adds nothing detectable. The evaluation uses Monte Carlo subsampling across three training-set sizes on a held-out test set, which gives a clean look at how the methods behave when data is scarce, a realistic constraint in applied testing work. Joint end-to-end fine-tuning already works as a drop-in replacement for older feature-engineering pipelines, and the multi-task version pulls ahead in the low-N regime without extra manual features. That part of the design is straightforward and worth noting for anyone who needs to calibrate items with limited response data. The central limitation is that all results stay inside one item pool and, by extension, one student population. Difficulty on reading-comprehension items often shifts with prior knowledge, curriculum exposure, and demographics, so the wording signal recovered here could be partly entangled with those factors. The abstract claims significant paired improvements but does not spell out error bars or correction details, which leaves the strength of the multi-task benefit a bit hard to judge from the summary alone. This work is aimed at people who build or evaluate educational tests and want to cut down on response-based calibration. A reader who already works with transformers in measurement contexts will find the empirical comparisons useful and the small-N focus practical. It is coherent enough on its own terms to deserve a serious referee, mainly so reviewers can press on the generalization question and ask for the missing statistical details. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper presents methods for response-free modeling of item difficulty in multiple-choice reading comprehension items using fine-tuned transformer encoders on item wording. It introduces a component-wise representation where wording components are encoded separately and a multi-task learning approach that adds an auxiliary multiple-choice question answering task. These are compared to a joint encoding baseline using a Monte Carlo subsampling design with held-out test sets at three different training set sizes. The key finding is that the multi-task variant provides significant paired improvements in the smallest sample regime, while the component-wise variant does not, suggesting that self-attention already captures cross-component information. The authors conclude that this approach recovers a substantial share of the wording-derivable difficulty signal.

Significance. Should the results prove robust, the work is significant as it offers an end-to-end deep learning alternative to manual feature extraction for difficulty prediction, which can be valuable in educational assessment where collecting response data is expensive. The demonstration of multi-task benefits at small training sizes is particularly useful for practical applications. The use of Monte Carlo subsampling and held-out evaluation provides a reasonable basis for the performance claims.

major comments (2)

[§4.2] §4.2 (Results): The reported 'significant paired improvements' for the multi-task variant at the smallest training-set size are presented without error bars, standard deviations across Monte Carlo subsamples, or details on the paired statistical test and multiple-comparison correction; this directly affects the strength of the central claim regarding the multi-task benefit in the low-data regime.
[§6] §6 (Discussion): The interpretation that the model recovers a 'substantial share of the wording-derivable signal' treats within-pool held-out performance as direct evidence, but the evaluation does not test or discuss transfer across student populations or test contexts; this assumption is load-bearing for the broader promise of response-free modelling since difficulty is known to interact with population-specific factors.

minor comments (2)

[Abstract] The abstract could explicitly state the three training-set sizes used in the subsampling design to improve readability.
[§3.1] Notation for the shared encoder in the component-wise variant would benefit from an accompanying diagram or explicit equation for the aggregation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.

read point-by-point responses

Referee: [§4.2] §4.2 (Results): The reported 'significant paired improvements' for the multi-task variant at the smallest training-set size are presented without error bars, standard deviations across Monte Carlo subsamples, or details on the paired statistical test and multiple-comparison correction; this directly affects the strength of the central claim regarding the multi-task benefit in the low-data regime.

Authors: We agree that reporting variability measures and full details of the statistical analysis would strengthen the presentation of the central results. In the revised manuscript we will add standard deviations across the Monte Carlo subsamples to the relevant tables, include error bars on the performance plots, and expand the description of the paired statistical test (including the exact test used and any multiple-comparison correction) in Section 4.2. These additions will allow readers to evaluate the robustness of the reported improvements directly. revision: yes
Referee: [§6] §6 (Discussion): The interpretation that the model recovers a 'substantial share of the wording-derivable signal' treats within-pool held-out performance as direct evidence, but the evaluation does not test or discuss transfer across student populations or test contexts; this assumption is load-bearing for the broader promise of response-free modelling since difficulty is known to interact with population-specific factors.

Authors: We acknowledge that our evaluation is limited to within-pool held-out performance on a single dataset and does not examine transfer across student populations or testing contexts. In the revised Discussion we will explicitly state this scope limitation, note that item difficulty is known to interact with population-specific factors, and qualify the claim of recovering a 'substantial share of the wording-derivable signal' to the setting studied. We will also add a forward-looking statement identifying cross-context transfer as an important direction for future work. Because the current study does not include data from multiple populations, we cannot perform such transfer experiments in this revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are empirical performance on held-out data

full rationale

The paper evaluates transformer fine-tuning, component-wise encoding, and multi-task learning for response-free difficulty prediction using Monte Carlo subsampling on held-out test sets at varying training sizes. All reported improvements are measured as paired differences in prediction accuracy on items excluded from training, with no equations or claims that reduce a derived quantity to a fitted parameter by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the core results; the framework is presented as an empirical alternative to feature-engineering pipelines rather than a deductive derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard transformer pretraining plus the assumption that item difficulty is largely recoverable from text features; no new physical or mathematical entities are introduced.

free parameters (1)

training set size
Three discrete sizes used in Monte Carlo subsampling; chosen to represent applied measurement regimes.

axioms (1)

domain assumption Item difficulty depends on inferential demands across wording components
Stated in the abstract as the reason response-free modeling is intrinsically difficult.

pith-pipeline@v0.9.0 · 5765 in / 1219 out tokens · 28407 ms · 2026-05-19T20:18:13.999786+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

190 extracted references · 190 canonical work pages · 10 internal anchors

[1]

and Tamma, Valentina , date =

AlKhuzaey, Samah and Grasso, Floriana and Payne, Terry R. and Tamma, Valentina , date =. Text-based question difficulty prediction:. doi:10.1007/s40593-023-00362-1 , url =

work page doi:10.1007/s40593-023-00362-1
[2]

Belov, Dmitry and Lüdtke, Oliver and Ulitzsch, Esther , date =. A. OSF , doi =

work page
[3]

A quantitative study of

Benedetto, Luca , date =. A quantitative study of. arXiv , doi =

work page
[5]

arXiv , doi =

Language models are few-shot learners , author =. arXiv , doi =

work page
[6]

Multitask

Caruana, Rich , date =. Multitask. doi:10.1023/A:1007379606734 , url =

work page doi:10.1023/a:1007379606734
[7]

Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and Sachdeva, Noveen and Dhillon, Inderjit and Blistein, Marcel and Ram, Ori and Zhang, Dan and Rosen, Evan and Marris, Luke and Petulla, Sam and Gaffney, Colin and Aharoni, Asaf and Lintz, Nathan and Pais, Tiago Cardal and Jacobsson, Henrik and Szpektor, Idan and Jiang, Nan-Jiang...

work page
[8]

arXiv , doi =

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , date =. arXiv , doi =

work page
[9]

The prediction of

Freedle, Roy and Kostin, Irene , date =. The prediction of. doi:10.1177/026553229301000203 , url =

work page doi:10.1177/026553229301000203
[10]

Predicting

Gombert, Sebastian and Menzel, Lukas and Di Mitri, Daniele and Drachsler, Hendrik , editor =. Predicting. Proceedings of the 19th

work page
[11]

and Perelman, Adam and Ramesh, Aditya and Clark, Aidan and Ostrow, A

OpenAI and Hurst, Aaron and Lerer, Adam and Goucher, Adam P. and Perelman, Adam and Ramesh, Aditya and Clark, Aidan and Ostrow, A. J. and Welihinda, Akila and Hayes, Alan and Radford, Alec and Mądry, Aleksander and Baker-Whitcomb, Alex and Beutel, Alex and Borzunov, Alex and Carney, Alex and Chow, Alex and Kirillov, Alex and Nichol, Alex and Paino, Alex a...

work page
[12]

, date =

Gururangan, Suchin and Marasović, Ana and Swayamdipta, Swabha and Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A. , date =. Don't stop pretraining:. arXiv , doi =

work page
[13]

Predicting the difficulty of multiple choice questions in a high-stakes medical exam , booktitle =

Ha, Le An and Yaneva, Victoria and Baldwin, Peter and Mee, Janet , editor =. Predicting the difficulty of multiple choice questions in a high-stakes medical exam , booktitle =. doi:10.18653/v1/W19-4402 , url =

work page doi:10.18653/v1/w19-4402
[14]

arXiv , doi =

Distilling the knowledge in a neural network , author =. arXiv , doi =

work page
[15]

Jawahar, Ganesh and Sagot, Benoît and Seddah, Djamé , editor =. What. Proceedings of the 57th. doi:10.18653/v1/P19-1356 , url =

work page doi:10.18653/v1/p19-1356
[16]

arXiv , doi =

Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard , date =. arXiv , doi =

work page
[18]

Fine-tuning language models to predict item difficulty from wording , author =

work page
[19]

doi:10.48550/arXiv.2303.08774 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774
[20]

Probabilistic

Rasch, Georg , date =. Probabilistic

work page
[21]

Proceedings of the 19th

Rodrigo, Alvaro and Moreno-Álvarez, Sergio and Peñas, Anselmo , editor =. Proceedings of the 19th

work page
[23]

LaFlair, Geoffrey and Hagiwara, Masato , date =

Settles, Burr and T. LaFlair, Geoffrey and Hagiwara, Masato , date =. Machine. doi:10.1162/tacl_a_00310 , url =

work page doi:10.1162/tacl_a_00310
[24]

Proceedings of

Sharpnack, James and Hao, Kevin and Mulcaire, Phoebe and Bicknell, Klinton and LaFlair, Geoff and Yancey, Kevin and family=Davier, given=Alina A., prefix=von, useprefix=true , date =. Proceedings of

work page
[26]

Štěpánek, Lubomír and Dlouhá, Jana and Martinková, Patrícia , date =. Item. doi:10.3390/math11194104 , url =

work page doi:10.3390/math11194104
[28]

, date =

Taylor, Wilson L. , date =. "

work page
[30]

Natural language processing with transformers: building language applications with

Tunstall, Lewis and family=Werra, given=Leandro, prefix=von, useprefix=false and Wolf, Thomas , date =. Natural language processing with transformers: building language applications with

work page
[31]

International Conference on Learning Representations (ICLR) , eprint =

Decoupled Weight Decay Regularization , author =. International Conference on Learning Representations (ICLR) , eprint =

work page
[35]

doi:10.1038/s42256-020-00257-z , url =

Shortcut learning in deep neural networks , author =. doi:10.1038/s42256-020-00257-z , url =

work page doi:10.1038/s42256-020-00257-z
[36]

Applied Psychological Measurement , volume =

Component Latent Trait Models for Paragraph Comprehension Tests , author =. Applied Psychological Measurement , volume =. 1987 , doi =

work page 1987
[37]

Applied Psychological Measurement , volume =

Item Difficulty Modeling of Paragraph Comprehension Items , author =. Applied Psychological Measurement , volume =. 2006 , doi =

work page 2006
[38]

Behavior Research Methods , volume =

Where's the Difficulty in Standardized Reading Tests: The Passage or the Question? , author =. Behavior Research Methods , volume =. 2008 , doi =

work page 2008
[39]

Predicting the

Xue, Kang and Yaneva, Victoria and Runyon, Christopher and Baldwin, Peter , editor =. Predicting the. Proceedings of the. doi:10.18653/v1/2020.bea-1.20 , url =

work page doi:10.18653/v1/2020.bea-1.20 2020
[40]

and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , editor =

Yancey, Kevin P. and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , editor =. Proceedings of the 19th

work page
[41]

Findings from the

Yaneva, Victoria and North, Kai and Baldwin, Peter and Ha, Le An and Rezayi, Saed and Zhou, Yiyun and Ray Choudhury, Sagnik and Harik, Polina and Clauser, Brian , editor =. Findings from the. Proceedings of the 19th

work page
[42]

Multi-task

Zhou, Ya and Tao, Can , date =. Multi-task. 2020. doi:10.1109/CISCE50729.2020.00048 , url =

work page doi:10.1109/cisce50729.2020.00048 2020
[43]

arXiv , url =

Zou, Jiajie and Zhang, Yuran and Jin, Peiqing and Luo, Cheng and Pan, Xunyi and Ding, Nai , date =. arXiv , url =

work page
[44]

and Lüdtke, Oliver and Ulitzsch, Esther , date =

Belov, Dmitry I. and Lüdtke, Oliver and Ulitzsch, Esther , date =. A supervised learning approach to estimating. doi:10.1111/bmsp.12396 , url =

work page doi:10.1111/bmsp.12396
[45]

Ulitzsch, Esther and Belov, Dmitry and Lüdtke, Oliver and Robitzsch, Alexander , date =. Using. doi:10.1111/jedm.12426 , url =

work page doi:10.1111/jedm.12426
[46]

Proceedings of The Eleventh Asian Conference on Machine Learning , pages =

A New Multi-choice Reading Comprehension Dataset for Curriculum Learning , author =. Proceedings of The Eleventh Asian Conference on Machine Learning , pages =. 2019 , editor =

work page 2019
[47]

A Primer in

Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna , journal =. A Primer in. 2020 , doi =

work page 2020
[48]

arXiv , year =

An Overview of Multi-Task Learning in Deep Neural Networks , author =. arXiv , year =

work page
[49]

Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach , editor =

work page
[50]

Psychometrika , volume =

Generating Items during Testing: Psychometric Issues and Models , author =. Psychometrika , volume =. 1999 , doi =

work page 1999
[51]

Item Response Theory for Psychologists , author =

work page
[52]

Probabilistic Models for Some Intelligence and Attainment Tests , author =

work page
[53]

Acta Psychologica , volume =

The Linear Logistic Test Model as an Instrument in Educational Research , author =. Acta Psychologica , volume =. 1973 , doi =

work page 1973
[54]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =

Jump-Starting Item Parameters for Adaptive Language Tests , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =. 2021 , publisher =

work page 2021
[55]

and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , booktitle =

Yancey, Kevin P. and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , booktitle =. 2024 , publisher =

work page 2024
[56]

Item characteristics and test-taker disengagement in

work page
[57]

Triplet loss , booktitle =

work page
[58]

Alberts, René V. J. , date =. Equating. doi:10.1080/09695940120089143 , url =

work page doi:10.1080/09695940120089143
[59]

Charles and Wall, Dianne , date =

Alderson, J. Charles and Wall, Dianne , date =. Does. doi:10.1093/applin/14.2.115 , url =

work page doi:10.1093/applin/14.2.115
[60]

A similarity-based theory of controlling

Alsubait, Tahani and Parsia, Bijan and Sattler, Ulrike , date =. A similarity-based theory of controlling. 2013. doi:10.1109/ICeLeTE.2013.6644389 , url =

work page doi:10.1109/icelete.2013.6644389 2013
[61]

A taxonomy for learning, teaching, and assessing: a revision of

work page
[62]

doi:10.1037/0003-066X.57.12.1060 , abstract =

Ethical. doi:10.1037/0003-066X.57.12.1060 , abstract =

work page doi:10.1037/0003-066x.57.12.1060
[63]

doi:10.1002/ets2.12042 , url =

Estimating item difficulty with comparative judgments , author =. doi:10.1002/ets2.12042 , url =

work page doi:10.1002/ets2.12042
[64]

and Yancey, Kevin and Goodwin, Sarah and Park, Yena and family=Davier, given=Alina A., prefix=von, useprefix=true , date =

Attali, Yigal and Runge, Andrew and LaFlair, Geoffrey T. and Yancey, Kevin and Goodwin, Sarah and Park, Yena and family=Davier, given=Alina A., prefix=von, useprefix=true , date =. The interactive reading task:. doi:10.3389/frai.2022.903077 , url =

work page doi:10.3389/frai.2022.903077 2022
[65]

Bartels, Meike and Rietveld, Marjolein J. H. and Baal, G. Caroline M. Van and Boomsma, Dorret I. , date =. Heritability of. doi:10.1375/twin.5.6.544 , url =

work page doi:10.1375/twin.5.6.544
[66]

Robustness of equating high-stakes tests , author =

work page
[67]

Use of different sources of information in maintaining standards: examples from the

Beguin, Anton , date =. Use of different sources of information in maintaining standards: examples from the. Psychometrics in. doi:10.3990/3.9789036533744.ch3 , url =

work page doi:10.3990/3.9789036533744.ch3
[68]

New developments in categorial data analysis for the social and behavioral sciences , author =

The. New developments in categorial data analysis for the social and behavioral sciences , author =

work page
[69]

Belcak, Peter and Heinrich, Greg and Diao, Shizhe and Fu, Yonggan and Dong, Xin and Muralidharan, Saurav and Lin, Yingyan Celine and Molchanov, Pavlo , date =. Small. doi:10.48550/arXiv.2506.02153 , url =. 2506.02153 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.02153
[70]

Belov, Dmitry and Lüdtke, Oliver and Ulitzsch, Esther , date =. A. doi:10.31234/osf.io/w3cyq_v1 , url =

work page doi:10.31234/osf.io/w3cyq_v1
[71]

doi:10.1145/3375462.3375517 , url =

Benedetto, Luca and Cappelli, Andrea and Turrin, Roberto and Cremonesi, Paolo , date =. doi:10.1145/3375462.3375517 , url =

work page doi:10.1145/3375462.3375517
[72]

On the application of

Benedetto, Luca and Aradelli, Giovanni and Cremonesi, Paolo and Cappelli, Andrea and Giussani, Andrea and Turrin, Roberto , editor =. On the application of. Proceedings of the 16th

work page
[73]

A quantitative study of

Benedetto, Luca , date =. A quantitative study of. doi:10.48550/arXiv.2305.10236 , url =. 2305.10236 , eprinttype =

work page doi:10.48550/arxiv.2305.10236
[74]

Benedetto, Luca and Cremonesi, Paolo and Caines, Andrew and Buttery, Paula and Cappelli, Andrea and Giussani, Andrea and Turrin, Roberto , date =. A. doi:10.1145/3556538 , url =

work page doi:10.1145/3556538
[75]

The impact of standardized test feedback in math:

Beuchert, Louise and Eriksen, Tine Louise Mundbjerg and Krægpøth, Morten Visby , date =. The impact of standardized test feedback in math:. doi:10.1016/j.econedurev.2020.102017 , url =

work page doi:10.1016/j.econedurev.2020.102017 2020
[76]

Taxonomy of educational objectives:

Bloom, Benjamin S and Engelhart, Max D and Furst, Edward J and Hill, Walker H and Krathwohl, David R , date =. Taxonomy of educational objectives:

work page
[77]

doi:10.1007/BF02291411 , abstract =

Estimating item parameters and latent ability when responses are scored in two or more nominal categories , author =. doi:10.1007/BF02291411 , abstract =

work page doi:10.1007/bf02291411
[78]

Darrell and Aitkin, Murray , date =

Bock, R. Darrell and Aitkin, Murray , date =. Marginal maximum likelihood estimation of item parameters:. doi:10.1007/BF02293801 , abstract =

work page doi:10.1007/bf02293801
[79]

Brown, Gavin T. L. and O’Leary, Timothy M. and Hattie, John A. C. , date =. Effective reporting for formative assessment:. Score

work page
[80]

Brown, Gavin T. L. and O'Leary, Timothy M. and Hattie, John A. C. , date =. Effective reporting for formative assessment:. Score

work page
[81]

Interaktivní nástroj pro podporu vyhodnocování dat ze standardizovaných testů , booktitle =

Martinková, Patrícia and Potužníková, Eva and Netík, Jan , editor =. Interaktivní nástroj pro podporu vyhodnocování dat ze standardizovaných testů , booktitle =

work page
[82]

Psychometrik Hynek Cígler: Státní maturita nenaplňuje standardy pedagogického testování , shorttitle =

Cígler, Hynek , date =. Psychometrik Hynek Cígler: Státní maturita nenaplňuje standardy pedagogického testování , shorttitle =

work page
[83]

Hoe komt een examenopgave tot stand? , author =

work page
[84]

Organisatie & governance , url =

work page
[85]

Toelichting op de normering , author =

work page
[86]

Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and Sachdeva, Noveen and Dhillon, Inderjit and Blistein, Marcel and Ram, Ori and Zhang, Dan and Rosen, Evan and Marris, Luke and Petulla, Sam and Gaffney, Colin and Aharoni, Asaf and Lintz, Nathan and Pais, Tiago Cardal and Jacobsson, Henrik and Szpektor, Idan and Jiang, Nan-Jiang...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261
[87]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , date =. doi:10.48550/arXiv.1810.04805 , url =. 1810.04805 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805
[88]

Predikce obtížnosti položek pomocí modulu EduTest Text Analysis , author =

work page
[89]

Docentenparticipatie , author =

work page

Showing first 80 references.

[1] [1]

and Tamma, Valentina , date =

AlKhuzaey, Samah and Grasso, Floriana and Payne, Terry R. and Tamma, Valentina , date =. Text-based question difficulty prediction:. doi:10.1007/s40593-023-00362-1 , url =

work page doi:10.1007/s40593-023-00362-1

[2] [2]

Belov, Dmitry and Lüdtke, Oliver and Ulitzsch, Esther , date =. A. OSF , doi =

work page

[3] [3]

A quantitative study of

Benedetto, Luca , date =. A quantitative study of. arXiv , doi =

work page

[4] [5]

arXiv , doi =

Language models are few-shot learners , author =. arXiv , doi =

work page

[5] [6]

Multitask

Caruana, Rich , date =. Multitask. doi:10.1023/A:1007379606734 , url =

work page doi:10.1023/a:1007379606734

[6] [7]

Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and Sachdeva, Noveen and Dhillon, Inderjit and Blistein, Marcel and Ram, Ori and Zhang, Dan and Rosen, Evan and Marris, Luke and Petulla, Sam and Gaffney, Colin and Aharoni, Asaf and Lintz, Nathan and Pais, Tiago Cardal and Jacobsson, Henrik and Szpektor, Idan and Jiang, Nan-Jiang...

work page

[7] [8]

arXiv , doi =

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , date =. arXiv , doi =

work page

[8] [9]

The prediction of

Freedle, Roy and Kostin, Irene , date =. The prediction of. doi:10.1177/026553229301000203 , url =

work page doi:10.1177/026553229301000203

[9] [10]

Predicting

Gombert, Sebastian and Menzel, Lukas and Di Mitri, Daniele and Drachsler, Hendrik , editor =. Predicting. Proceedings of the 19th

work page

[10] [11]

and Perelman, Adam and Ramesh, Aditya and Clark, Aidan and Ostrow, A

OpenAI and Hurst, Aaron and Lerer, Adam and Goucher, Adam P. and Perelman, Adam and Ramesh, Aditya and Clark, Aidan and Ostrow, A. J. and Welihinda, Akila and Hayes, Alan and Radford, Alec and Mądry, Aleksander and Baker-Whitcomb, Alex and Beutel, Alex and Borzunov, Alex and Carney, Alex and Chow, Alex and Kirillov, Alex and Nichol, Alex and Paino, Alex a...

work page

[11] [12]

, date =

Gururangan, Suchin and Marasović, Ana and Swayamdipta, Swabha and Lo, Kyle and Beltagy, Iz and Downey, Doug and Smith, Noah A. , date =. Don't stop pretraining:. arXiv , doi =

work page

[12] [13]

Predicting the difficulty of multiple choice questions in a high-stakes medical exam , booktitle =

Ha, Le An and Yaneva, Victoria and Baldwin, Peter and Mee, Janet , editor =. Predicting the difficulty of multiple choice questions in a high-stakes medical exam , booktitle =. doi:10.18653/v1/W19-4402 , url =

work page doi:10.18653/v1/w19-4402

[13] [14]

arXiv , doi =

Distilling the knowledge in a neural network , author =. arXiv , doi =

work page

[14] [15]

Jawahar, Ganesh and Sagot, Benoît and Seddah, Djamé , editor =. What. Proceedings of the 57th. doi:10.18653/v1/P19-1356 , url =

work page doi:10.18653/v1/p19-1356

[15] [16]

arXiv , doi =

Lai, Guokun and Xie, Qizhe and Liu, Hanxiao and Yang, Yiming and Hovy, Eduard , date =. arXiv , doi =

work page

[16] [18]

Fine-tuning language models to predict item difficulty from wording , author =

work page

[17] [19]

doi:10.48550/arXiv.2303.08774 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774

[18] [20]

Probabilistic

Rasch, Georg , date =. Probabilistic

work page

[19] [21]

Proceedings of the 19th

Rodrigo, Alvaro and Moreno-Álvarez, Sergio and Peñas, Anselmo , editor =. Proceedings of the 19th

work page

[20] [23]

LaFlair, Geoffrey and Hagiwara, Masato , date =

Settles, Burr and T. LaFlair, Geoffrey and Hagiwara, Masato , date =. Machine. doi:10.1162/tacl_a_00310 , url =

work page doi:10.1162/tacl_a_00310

[21] [24]

Proceedings of

Sharpnack, James and Hao, Kevin and Mulcaire, Phoebe and Bicknell, Klinton and LaFlair, Geoff and Yancey, Kevin and family=Davier, given=Alina A., prefix=von, useprefix=true , date =. Proceedings of

work page

[22] [26]

Štěpánek, Lubomír and Dlouhá, Jana and Martinková, Patrícia , date =. Item. doi:10.3390/math11194104 , url =

work page doi:10.3390/math11194104

[23] [28]

, date =

Taylor, Wilson L. , date =. "

work page

[24] [30]

Natural language processing with transformers: building language applications with

Tunstall, Lewis and family=Werra, given=Leandro, prefix=von, useprefix=false and Wolf, Thomas , date =. Natural language processing with transformers: building language applications with

work page

[25] [31]

International Conference on Learning Representations (ICLR) , eprint =

Decoupled Weight Decay Regularization , author =. International Conference on Learning Representations (ICLR) , eprint =

work page

[26] [35]

doi:10.1038/s42256-020-00257-z , url =

Shortcut learning in deep neural networks , author =. doi:10.1038/s42256-020-00257-z , url =

work page doi:10.1038/s42256-020-00257-z

[27] [36]

Applied Psychological Measurement , volume =

Component Latent Trait Models for Paragraph Comprehension Tests , author =. Applied Psychological Measurement , volume =. 1987 , doi =

work page 1987

[28] [37]

Applied Psychological Measurement , volume =

Item Difficulty Modeling of Paragraph Comprehension Items , author =. Applied Psychological Measurement , volume =. 2006 , doi =

work page 2006

[29] [38]

Behavior Research Methods , volume =

Where's the Difficulty in Standardized Reading Tests: The Passage or the Question? , author =. Behavior Research Methods , volume =. 2008 , doi =

work page 2008

[30] [39]

Predicting the

Xue, Kang and Yaneva, Victoria and Runyon, Christopher and Baldwin, Peter , editor =. Predicting the. Proceedings of the. doi:10.18653/v1/2020.bea-1.20 , url =

work page doi:10.18653/v1/2020.bea-1.20 2020

[31] [40]

and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , editor =

Yancey, Kevin P. and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , editor =. Proceedings of the 19th

work page

[32] [41]

Findings from the

Yaneva, Victoria and North, Kai and Baldwin, Peter and Ha, Le An and Rezayi, Saed and Zhou, Yiyun and Ray Choudhury, Sagnik and Harik, Polina and Clauser, Brian , editor =. Findings from the. Proceedings of the 19th

work page

[33] [42]

Multi-task

Zhou, Ya and Tao, Can , date =. Multi-task. 2020. doi:10.1109/CISCE50729.2020.00048 , url =

work page doi:10.1109/cisce50729.2020.00048 2020

[34] [43]

arXiv , url =

Zou, Jiajie and Zhang, Yuran and Jin, Peiqing and Luo, Cheng and Pan, Xunyi and Ding, Nai , date =. arXiv , url =

work page

[35] [44]

and Lüdtke, Oliver and Ulitzsch, Esther , date =

Belov, Dmitry I. and Lüdtke, Oliver and Ulitzsch, Esther , date =. A supervised learning approach to estimating. doi:10.1111/bmsp.12396 , url =

work page doi:10.1111/bmsp.12396

[36] [45]

Ulitzsch, Esther and Belov, Dmitry and Lüdtke, Oliver and Robitzsch, Alexander , date =. Using. doi:10.1111/jedm.12426 , url =

work page doi:10.1111/jedm.12426

[37] [46]

Proceedings of The Eleventh Asian Conference on Machine Learning , pages =

A New Multi-choice Reading Comprehension Dataset for Curriculum Learning , author =. Proceedings of The Eleventh Asian Conference on Machine Learning , pages =. 2019 , editor =

work page 2019

[38] [47]

A Primer in

Rogers, Anna and Kovaleva, Olga and Rumshisky, Anna , journal =. A Primer in. 2020 , doi =

work page 2020

[39] [48]

arXiv , year =

An Overview of Multi-Task Learning in Deep Neural Networks , author =. arXiv , year =

work page

[40] [49]

Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach , editor =

work page

[41] [50]

Psychometrika , volume =

Generating Items during Testing: Psychometric Issues and Models , author =. Psychometrika , volume =. 1999 , doi =

work page 1999

[42] [51]

Item Response Theory for Psychologists , author =

work page

[43] [52]

Probabilistic Models for Some Intelligence and Attainment Tests , author =

work page

[44] [53]

Acta Psychologica , volume =

The Linear Logistic Test Model as an Instrument in Educational Research , author =. Acta Psychologica , volume =. 1973 , doi =

work page 1973

[45] [54]

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =

Jump-Starting Item Parameters for Adaptive Language Tests , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =. 2021 , publisher =

work page 2021

[46] [55]

and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , booktitle =

Yancey, Kevin P. and Runge, Andrew and LaFlair, Geoffrey and Mulcaire, Phoebe , booktitle =. 2024 , publisher =

work page 2024

[47] [56]

Item characteristics and test-taker disengagement in

work page

[48] [57]

Triplet loss , booktitle =

work page

[49] [58]

Alberts, René V. J. , date =. Equating. doi:10.1080/09695940120089143 , url =

work page doi:10.1080/09695940120089143

[50] [59]

Charles and Wall, Dianne , date =

Alderson, J. Charles and Wall, Dianne , date =. Does. doi:10.1093/applin/14.2.115 , url =

work page doi:10.1093/applin/14.2.115

[51] [60]

A similarity-based theory of controlling

Alsubait, Tahani and Parsia, Bijan and Sattler, Ulrike , date =. A similarity-based theory of controlling. 2013. doi:10.1109/ICeLeTE.2013.6644389 , url =

work page doi:10.1109/icelete.2013.6644389 2013

[52] [61]

A taxonomy for learning, teaching, and assessing: a revision of

work page

[53] [62]

doi:10.1037/0003-066X.57.12.1060 , abstract =

Ethical. doi:10.1037/0003-066X.57.12.1060 , abstract =

work page doi:10.1037/0003-066x.57.12.1060

[54] [63]

doi:10.1002/ets2.12042 , url =

Estimating item difficulty with comparative judgments , author =. doi:10.1002/ets2.12042 , url =

work page doi:10.1002/ets2.12042

[55] [64]

and Yancey, Kevin and Goodwin, Sarah and Park, Yena and family=Davier, given=Alina A., prefix=von, useprefix=true , date =

Attali, Yigal and Runge, Andrew and LaFlair, Geoffrey T. and Yancey, Kevin and Goodwin, Sarah and Park, Yena and family=Davier, given=Alina A., prefix=von, useprefix=true , date =. The interactive reading task:. doi:10.3389/frai.2022.903077 , url =

work page doi:10.3389/frai.2022.903077 2022

[56] [65]

Bartels, Meike and Rietveld, Marjolein J. H. and Baal, G. Caroline M. Van and Boomsma, Dorret I. , date =. Heritability of. doi:10.1375/twin.5.6.544 , url =

work page doi:10.1375/twin.5.6.544

[57] [66]

Robustness of equating high-stakes tests , author =

work page

[58] [67]

Use of different sources of information in maintaining standards: examples from the

Beguin, Anton , date =. Use of different sources of information in maintaining standards: examples from the. Psychometrics in. doi:10.3990/3.9789036533744.ch3 , url =

work page doi:10.3990/3.9789036533744.ch3

[59] [68]

New developments in categorial data analysis for the social and behavioral sciences , author =

The. New developments in categorial data analysis for the social and behavioral sciences , author =

work page

[60] [69]

Belcak, Peter and Heinrich, Greg and Diao, Shizhe and Fu, Yonggan and Dong, Xin and Muralidharan, Saurav and Lin, Yingyan Celine and Molchanov, Pavlo , date =. Small. doi:10.48550/arXiv.2506.02153 , url =. 2506.02153 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.02153

[61] [70]

Belov, Dmitry and Lüdtke, Oliver and Ulitzsch, Esther , date =. A. doi:10.31234/osf.io/w3cyq_v1 , url =

work page doi:10.31234/osf.io/w3cyq_v1

[62] [71]

doi:10.1145/3375462.3375517 , url =

Benedetto, Luca and Cappelli, Andrea and Turrin, Roberto and Cremonesi, Paolo , date =. doi:10.1145/3375462.3375517 , url =

work page doi:10.1145/3375462.3375517

[63] [72]

On the application of

Benedetto, Luca and Aradelli, Giovanni and Cremonesi, Paolo and Cappelli, Andrea and Giussani, Andrea and Turrin, Roberto , editor =. On the application of. Proceedings of the 16th

work page

[64] [73]

A quantitative study of

Benedetto, Luca , date =. A quantitative study of. doi:10.48550/arXiv.2305.10236 , url =. 2305.10236 , eprinttype =

work page doi:10.48550/arxiv.2305.10236

[65] [74]

Benedetto, Luca and Cremonesi, Paolo and Caines, Andrew and Buttery, Paula and Cappelli, Andrea and Giussani, Andrea and Turrin, Roberto , date =. A. doi:10.1145/3556538 , url =

work page doi:10.1145/3556538

[66] [75]

The impact of standardized test feedback in math:

Beuchert, Louise and Eriksen, Tine Louise Mundbjerg and Krægpøth, Morten Visby , date =. The impact of standardized test feedback in math:. doi:10.1016/j.econedurev.2020.102017 , url =

work page doi:10.1016/j.econedurev.2020.102017 2020

[67] [76]

Taxonomy of educational objectives:

Bloom, Benjamin S and Engelhart, Max D and Furst, Edward J and Hill, Walker H and Krathwohl, David R , date =. Taxonomy of educational objectives:

work page

[68] [77]

doi:10.1007/BF02291411 , abstract =

Estimating item parameters and latent ability when responses are scored in two or more nominal categories , author =. doi:10.1007/BF02291411 , abstract =

work page doi:10.1007/bf02291411

[69] [78]

Darrell and Aitkin, Murray , date =

Bock, R. Darrell and Aitkin, Murray , date =. Marginal maximum likelihood estimation of item parameters:. doi:10.1007/BF02293801 , abstract =

work page doi:10.1007/bf02293801

[70] [79]

Brown, Gavin T. L. and O’Leary, Timothy M. and Hattie, John A. C. , date =. Effective reporting for formative assessment:. Score

work page

[71] [80]

Brown, Gavin T. L. and O'Leary, Timothy M. and Hattie, John A. C. , date =. Effective reporting for formative assessment:. Score

work page

[72] [81]

Interaktivní nástroj pro podporu vyhodnocování dat ze standardizovaných testů , booktitle =

Martinková, Patrícia and Potužníková, Eva and Netík, Jan , editor =. Interaktivní nástroj pro podporu vyhodnocování dat ze standardizovaných testů , booktitle =

work page

[73] [82]

Psychometrik Hynek Cígler: Státní maturita nenaplňuje standardy pedagogického testování , shorttitle =

Cígler, Hynek , date =. Psychometrik Hynek Cígler: Státní maturita nenaplňuje standardy pedagogického testování , shorttitle =

work page

[74] [83]

Hoe komt een examenopgave tot stand? , author =

work page

[75] [84]

Organisatie & governance , url =

work page

[76] [85]

Toelichting op de normering , author =

work page

[77] [86]

Comanici, Gheorghe and Bieber, Eric and Schaekermann, Mike and Pasupat, Ice and Sachdeva, Noveen and Dhillon, Inderjit and Blistein, Marcel and Ram, Ori and Zhang, Dan and Rosen, Evan and Marris, Luke and Petulla, Sam and Gaffney, Colin and Aharoni, Asaf and Lintz, Nathan and Pais, Tiago Cardal and Jacobsson, Henrik and Szpektor, Idan and Jiang, Nan-Jiang...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261

[78] [87]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , date =. doi:10.48550/arXiv.1810.04805 , url =. 1810.04805 , eprinttype =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04805

[79] [88]

Predikce obtížnosti položek pomocí modulu EduTest Text Analysis , author =

work page

[80] [89]

Docentenparticipatie , author =

work page