pith. machine review for the scientific record.

arxiv: 2605.02402 · v1 · submitted 2026-05-04 · 💻 cs.CL · cs.AI


Automatic Reflection Level Classification in Hungarian Student Essays


Pith reviewed 2026-05-08 18:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Hungarian language · reflection level classification · machine learning · transformer models · class imbalance · student essays · automated assessment · educational NLP

The pith

Classical machine learning models classify reflection levels in Hungarian student essays at 71 percent average performance, slightly ahead of transformers at 68 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that machine learning can automate assessment of reflective writing in Hungarian, a language with little prior work on the task. The authors assembled nearly two thousand expert-labeled student essays on a four-level reflection scale and compared classical models using TF-IDF and semantic features with fine-tuned Hungarian transformers. Multiple strategies for addressing class imbalance were compared through ablation experiments. Shallow models delivered the higher overall score, while transformers handled the rarer reflection categories more reliably. This matters because reflective thinking is a valued educational skill, yet manual evaluation remains slow and subjective, limiting how widely it can be practiced at scale.

Core claim

In the first comprehensive study of automatic reflection-level classification for Hungarian, a dataset of 1,954 expert-annotated student essays on a four-level scale is used to evaluate classical machine learning pipelines and fine-tuned transformers. With appropriate feature engineering and imbalance handling, the shallow models reach an overall score of up to 71% (averaged over accuracy, F1-score, and ROC AUC), outperforming the transformer approach at 68% overall, while the transformers demonstrate better generalization on minority classes.

What carries the argument

The four-level expert-annotated reflection scale on Hungarian student essays, used to compare classical feature-based classifiers against transformer-based document classifiers under multiple class-imbalance correction methods.
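The classical side of this comparison can be sketched in a few lines. Everything below (the toy essays, the n-gram settings, the choice of logistic regression) is an illustrative assumption, not the paper's actual configuration:

```python
# Hypothetical sketch of a classical feature-based pipeline: TF-IDF features
# feeding a shallow classifier on the four-level reflection scale (0..3).
# Feature settings and model choice are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

essays = [
    "Ma tanultam valamit, de nem gondolkodtam rajta.",                    # toy level-0 essay
    "Átgondoltam, miért sikerült jól a feladat, és mit tennék másképp.",  # toy level-2 essay
]
labels = [0, 2]

pipeline = Pipeline([
    # Word unigrams and bigrams are a common default for such pipelines.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=20000)),
    # class_weight="balanced" is one of the imbalance remedies the paper examines.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(essays, labels)
print(pipeline.predict(essays))
```

The point of the sketch is the shape of the pipeline, not the numbers: feature extraction and classifier are swappable parts, which is what makes the paper's ablations over features and balancing strategies possible.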

If this is right

  • Classical models with targeted feature engineering stay competitive for text classification in morphologically rich low-resource languages.
  • Transformer models offer an advantage when accurate identification of minority reflection levels is the priority.
  • Class weighting, oversampling, augmentation, and adjusted loss functions each improve robustness on imbalanced educational text.
  • The released Hungarian dataset supplies a reproducible base for extending automated reflective analysis to related tasks.
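Two of the balancing strategies listed above can be sketched concretely. The class counts below are invented for illustration and are not the paper's actual label distribution:

```python
# Sketch of two common imbalance remedies on an assumed skewed four-class
# label distribution: inverse-frequency class weighting and naive random
# oversampling of minority classes.
import numpy as np

labels = np.array([0] * 120 + [1] * 40 + [2] * 25 + [3] * 15)  # invented counts

# Inverse-frequency weights: rare classes get proportionally larger weights.
classes, counts = np.unique(labels, return_counts=True)
weights = len(labels) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# Random oversampling: resample every class up to the majority-class size.
rng = np.random.default_rng(0)
target = counts.max()
resampled = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=target, replace=True)
    for c in classes
])
balanced = labels[resampled]
print(np.bincount(balanced))  # every class now has `target` examples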

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that in many non-English educational settings, simpler classical pipelines may deliver adequate performance with far less compute than transformer fine-tuning.
  • The same modeling approach could transfer to reflection assessment in other languages that share Hungarian's morphological complexity once comparable labeled collections exist.
  • Embedding the classifiers in writing platforms could enable immediate feedback that helps students improve reflective skills without added teacher workload.
  • Hybrid systems that route examples to either classical or transformer components based on class frequency might combine the observed strengths of both.

Load-bearing premise

The four-level reflection labels assigned by education experts are consistent and accurately represent students' reflective thinking.

What would settle it

Independent re-annotation of the same essays by a separate group of experts would settle it: substantially different level assignments, or sharply lower model performance on the new labels, would indicate that the original ground truth is unreliable.
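Such a re-annotation exercise would typically be scored with a chance-corrected agreement statistic. A minimal sketch using Cohen's kappa on invented label vectors:

```python
# Sketch of how a re-annotation would be quantified: Cohen's kappa between
# the original and new four-level labels. Both label vectors are invented.
from sklearn.metrics import cohen_kappa_score

original = [0, 1, 1, 2, 3, 2, 0, 1]
reannotated = [0, 1, 2, 2, 3, 1, 0, 1]

# For an ordinal scale like reflection levels, weights="quadratic" would
# penalise distant disagreements more than adjacent ones.
kappa = cohen_kappa_score(original, reannotated)
print(round(kappa, 2))
```

A kappa near zero on a full re-annotation would be the "substantially different assignments" outcome described above; values conventionally read as "substantial agreement" start around 0.6, though any threshold is a judgment call.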

Figures

Figures reproduced from arXiv: 2605.02402 by Kinga Gyöngy, Kristian Fenech, Mónika Sándor, Mónika Serfőző, Zsolt Csibi.

Figure 1. Distribution of reflection levels in the dataset. From 0 (no reflec…
Figure 2. Token count distribution generated by the HuBERT tokenizer.
Figure 3. Confusion matrix (a) and ROC curves (b) of the best shallow…
Figure 4. Confusion matrix (a) and ROC curves (b) of an example where…
Figure 5. (a) ROC curve of the HuBERT model with backtranslation and focal…
Original abstract

Reflective thinking is a key competency in education, but assessing reflective writing remains a time-consuming and subjective task for education experts. While automated reflective analysis has been explored in several languages, Hungarian language was not researched extensively. In this paper, we present the first comprehensive study on automatic reflection level classification in Hungarian student essays. We used a large, expert-annotated Hungarian dataset consisting of 1,954 reflective essays collected over multiple academic years and labeled on a four-level reflection scale. We investigate two approaches: (1) classical machine learning models using TF-IDF and semantic embedding features, and (2) Hungarian-specific transformer models fine-tuned for document-level reflection classification. To address the strong class imbalance in the dataset, we systematically examine class weighting, oversampling, data augmentation, and alternative loss functions. An extensive ablation study is conducted to analyze the contribution of each modeling and balancing strategy. Our results show that shallow machine learning models with appropriate feature engineering achieve strong overall performance, reaching up to 71% overall score averaged over accuracy, F1-score, and ROC AUC metrics, while transformer-based models achieve slightly lower overall score (68%) averaged over the same metrics, but demonstrate better generalization on minority reflection classes. These findings highlight the continued relevance of classical methods for low-resource settings and the robustness of transformer models for imbalanced classification. The proposed dataset and experimental insights provide a solid foundation for future research on automated reflective analysis in Hungarian and other morphologically rich languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents the first study on automatic classification of reflection levels in Hungarian student essays. Using a dataset of 1,954 expert-annotated essays labeled on a four-level scale, it compares classical ML models with TF-IDF and embedding features against fine-tuned Hungarian transformer models. Various strategies for handling class imbalance are explored through ablations, with results indicating classical models achieve an averaged score of 71% (across accuracy, F1-score, and ROC AUC) compared to 68% for transformers, though transformers perform better on minority classes.

Significance. This work is significant for introducing automated analysis to Hungarian reflective writing, a low-resource language setting. The systematic examination of balancing techniques and ablations provides valuable insights for imbalanced classification tasks. If the ground truth is reliable, it supports the relevance of classical methods in such scenarios and offers a foundation for future work in morphologically rich languages.

major comments (3)
  1. [§3 (Dataset and Annotation)] No inter-annotator agreement (IAA) metrics, such as Cohen's kappa or Krippendorff's alpha, are reported for the expert annotations on the four-level reflection scale. Since reflection-level assessment is inherently subjective, the absence of IAA leaves the reliability of the ground-truth labels unverified, which is load-bearing for all performance claims in the results sections.
  2. [§4 (Experimental Setup)] The evaluation protocol lacks details on the train/validation/test splits, whether stratified sampling was used given the imbalance, and any statistical significance tests (e.g., McNemar's test or paired t-tests) for the reported differences between model performances (71% vs 68%). This makes it difficult to assess the robustness of the headline comparison.
  3. [§5 (Results)] The 'overall score' is defined as the average of accuracy, F1-score, and ROC AUC, but it is unclear whether these are macro-averaged or weighted, and how ROC AUC is computed for multi-class (one-vs-rest?). This affects interpretation of the 71% and 68% figures.
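The significance test requested in the second major comment is straightforward to sketch. The discordant-pair counts below are invented, and the exact two-sided McNemar p-value is computed from the binomial distribution:

```python
# Sketch of an exact McNemar test on the examples where exactly one of the
# two models is correct. The counts b and c are invented for illustration.
from math import comb

b = 34  # classical correct, transformer wrong (assumed count)
c = 21  # transformer correct, classical wrong (assumed count)

# Exact two-sided McNemar p-value: a binomial test of min(b, c) or fewer
# successes in b + c trials with p = 0.5, doubled and capped at 1.
n, k = b + c, min(b, c)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
print(round(p_value, 3))
```

Only the discordant pairs matter; essays both models get right (or both get wrong) carry no information about which model is better, which is why McNemar's test is the standard choice for comparing two classifiers on the same test set.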
minor comments (3)
  1. [Abstract] The abstract mentions 'up to 71%' but the full results should clarify whether this is the best single model or an average across configurations.
  2. [Tables] Ensure all tables reporting metrics include standard deviations if multiple runs were performed, and specify the exact number of runs.
  3. [References] Consider adding references to prior work on reflection classification in other languages for better context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have addressed each major comment point by point below and revised the paper accordingly to improve clarity and transparency.

read point-by-point responses
  1. Referee: §3 (Dataset and Annotation): No inter-annotator agreement (IAA) metrics, such as Cohen's kappa or Krippendorff's alpha, are reported for the expert annotations on the four-level reflection scale. Since reflection level assessment is inherently subjective, the absence of IAA leaves the reliability of the ground truth labels unverified, which is load-bearing for all performance claims in the results sections.

    Authors: We agree that IAA reporting is essential for subjective annotation tasks. The full dataset was annotated by a single expert in educational psychology, as multiple annotators with the required domain expertise were not available within our resource constraints. We have revised §3 to describe the annotation protocol, guideline development via pilot studies, and to explicitly state this single-annotator limitation along with its implications for ground-truth reliability. revision: yes

  2. Referee: §4 (Experimental Setup): The evaluation protocol lacks details on the train/validation/test splits, whether stratified sampling was used given the imbalance, and any statistical significance tests (e.g., McNemar's test or paired t-tests) for the reported differences between model performances (71% vs 68%). This makes it difficult to assess the robustness of the headline comparison.

    Authors: We have expanded §4 in the revised manuscript to specify the 80/10/10 train/validation/test split ratios and confirm that stratified sampling was applied based on the four reflection levels to preserve class distributions. We have also added McNemar's test results comparing the best classical and transformer models to assess the statistical significance of the performance differences. revision: yes

  3. Referee: §5 (Results): The 'overall score' is defined as the average of accuracy, F1-score, and ROC AUC, but it is unclear if these are macro-averaged or weighted, and how ROC AUC is computed for multi-class (one-vs-rest?). This affects interpretation of the 71% and 68% figures.

    Authors: We have clarified the metric computation in the revised §5: accuracy is the standard multi-class accuracy; F1-score is macro-averaged; and ROC AUC uses the one-vs-rest approach with macro-averaging. The overall score is the unweighted arithmetic mean of these three values. A supplementary table with the individual metric breakdowns for all models has been added for full transparency. revision: yes
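Under the clarified definition, the overall score can be sketched directly. The labels and per-class probabilities below are toy values, not the paper's data:

```python
# Sketch of the 'overall score' as clarified in the rebuttal: the unweighted
# mean of multi-class accuracy, macro-averaged F1, and one-vs-rest
# macro-averaged ROC AUC. All inputs are invented toy values.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0, 1, 2, 3, 1, 0, 2, 3])
y_prob = np.array([  # per-class probabilities over the four reflection levels
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.4, 0.3, 0.2, 0.1],  # one deliberate mistake: predicts 0, true is 1
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
y_pred = y_prob.argmax(axis=1)

overall = np.mean([
    accuracy_score(y_true, y_pred),
    f1_score(y_true, y_pred, average="macro"),
    roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
])
print(round(overall, 3))
```

Note that the three components need not move together: macro-F1 punishes minority-class errors harder than accuracy does, which is exactly the axis on which the paper reports transformers and shallow models trading places.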

Circularity Check

0 steps flagged

No circularity: purely empirical ML evaluation on held-out data

full rationale

The paper describes an empirical pipeline: collection of 1,954 Hungarian essays, expert annotation on a four-level scale, extraction of TF-IDF and embedding features, training of classical ML and transformer models, handling of class imbalance via weighting/oversampling/augmentation, and reporting of accuracy/F1/ROC-AUC on (presumably held-out) test data. No equations, first-principles derivations, or predictions appear; results are measured against external labels rather than being forced by construction from fitted inputs or self-citations. The central claims (71% vs 68% averaged scores, transformers better on minorities) are therefore falsifiable against the fixed annotations and do not reduce to any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claims rest on the validity of the annotation process and the assumption that the collected essays represent typical student reflective writing.

free parameters (1)
  • model hyperparameters and balancing parameters
    Various parameters for class weighting, oversampling, and loss functions are tuned but not specified as fixed values in the abstract.
axioms (1)
  • domain assumption: Expert annotations on the reflection scale are accurate and consistent.
    The classification performance is measured against these labels, so their quality is foundational.

pith-pipeline@v0.9.0 · 5586 in / 1332 out tokens · 67357 ms · 2026-05-08T18:30:16.415811+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 42 canonical work pages · 5 internal anchors
