Recognition: unknown
Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations
Pith reviewed 2026-05-10 11:29 UTC · model grok-4.3
The pith
Fine-tuning FinBERT on Finnish medical text shifts the geometry of its embeddings; the authors attempt to measure whether these shifts correlate with downstream performance, as a potential early signal of domain-adaptation benefit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning.
Load-bearing premise
That observable changes in embedding geometry during domain fine-tuning on unlabeled data will correlate with, and therefore predict, performance gains on downstream labeled classification tasks.
read the original abstract
In NLP classification tasks where little labeled data exists, domain fine-tuning of transformer models on unlabeled data is an established approach. In this paper we have two aims. (1) We describe our observations from fine-tuning the Finnish BERT model on Finnish medical text data. (2) We report on our attempts to predict the benefit of domain-specific pre-training of Finnish BERT from observing the geometry of embedding changes due to domain fine-tuning. Our driving motivation is the common situation in healthcare AI where we might experience long delays in acquiring datasets, especially with respect to labels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports observations from domain fine-tuning of Finnish BERT (FinBERT) on unlabeled Finnish histopathological reports. It then attempts to use changes in the geometry of the embedding space (e.g., shifts in pairwise distances, variance, or alignment) induced by this fine-tuning to predict performance improvements on downstream labeled classification tasks. The motivation is to enable forecasting of domain adaptation benefits in settings where labeled data acquisition is delayed, such as healthcare AI.
Significance. If validated, the approach could provide a practical way to assess the value of domain-specific pre-training using only unlabeled data, which is particularly useful in low-resource or label-scarce domains like medical NLP. The work highlights potential train-time signals but currently lacks the quantitative evidence needed to establish predictive utility.
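As a concrete illustration of the train-time signals described above, the sketch below computes two of the named geometry metrics for the same documents embedded by the base and the domain-fine-tuned model: the mean shift in pairwise cosine distances and the ratio of total embedding variance. This is a minimal reading under our own assumptions; the paper's exact metric definitions are not reproduced here, and all names (geometry_shift, emb_base, emb_tuned) are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

def geometry_shift(emb_base: np.ndarray, emb_tuned: np.ndarray) -> dict:
    """Compare embedding geometry before and after domain fine-tuning.

    emb_base, emb_tuned: (n_docs, dim) arrays with rows aligned by document.
    """
    # Shift in pairwise structure: mean absolute change in cosine distance
    # over all document pairs.
    d_base = pdist(emb_base, metric="cosine")
    d_tuned = pdist(emb_tuned, metric="cosine")
    pairwise_shift = float(np.mean(np.abs(d_tuned - d_base)))

    # Change in spread: total variance of each embedding cloud.
    var_base = float(emb_base.var(axis=0).sum())
    var_tuned = float(emb_tuned.var(axis=0).sum())

    return {
        "pairwise_cosine_shift": pairwise_shift,
        "variance_ratio": var_tuned / var_base,
    }
```

Both quantities need only unlabeled text, which is what makes them candidates for the label-free forecasting use-case the paper motivates.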
major comments (2)
- [Abstract and Results] The abstract and results describe observations and correlation attempts, but no specific methods, statistical controls, regression models, or quantitative metrics (such as correlation coefficients, p-values, or cross-validation scores) are reported for the prediction of downstream gains from geometry changes. This makes it impossible to evaluate whether the claimed correlations are robust or due to post-hoc selection.
- [Prediction step] The central claim that embedding geometry metrics can predict downstream performance requires a demonstrated quantitative link. Without comparison to null models, baseline predictors, or validation on held-out tasks, the predictive power remains unestablished, which is load-bearing for the motivating use-case of forecasting without labels.
minor comments (2)
- [Notation] Clarify the exact definitions of the geometry metrics used (e.g., what is meant by 'alignment metrics' or 'variance' in the embedding space); one concrete candidate definition is sketched after this list.
- [Figures] Ensure all figures showing embedding changes include axis labels, legends, and statistical annotations if applicable.
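On the [Notation] point, one possible concrete definition of an 'alignment metric' (our assumption, not necessarily the authors' choice) is the residual left after the best orthogonal Procrustes map from base to fine-tuned embeddings:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_residual(emb_base: np.ndarray, emb_tuned: np.ndarray) -> float:
    """Relative error after the best rotation/reflection of the base embedding
    cloud onto the fine-tuned one; near 0 means the change is a pure rotation,
    larger values mean genuine geometric change."""
    a = emb_base - emb_base.mean(axis=0)   # center both clouds so the map is
    b = emb_tuned - emb_tuned.mean(axis=0) # not spent absorbing a mean shift
    r, _ = orthogonal_procrustes(a, b)     # orthogonal R minimizing ||aR - b||_F
    return float(np.linalg.norm(a @ r - b) / np.linalg.norm(b))
```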
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our exploratory study of domain fine-tuning signals in Finnish histopathological text. The comments correctly identify areas where the original submission was insufficiently quantitative, and we have revised the manuscript to incorporate additional statistical details and controls as described below.
read point-by-point responses
- Referee: [Abstract and Results] The abstract and results describe observations and correlation attempts, but no specific methods, statistical controls, regression models, or quantitative metrics (such as correlation coefficients, p-values, or cross-validation scores) are reported for the prediction of downstream gains from geometry changes. This makes it impossible to evaluate whether the claimed correlations are robust or due to post-hoc selection.
Authors: We agree that the original abstract and results sections lacked sufficient methodological transparency. The work was framed as an initial exploration of observable patterns rather than a validated prediction system. In the revised manuscript we have expanded both the abstract and the Results section to explicitly describe the geometry metrics computed (pairwise cosine distances, embedding variance, and Procrustes alignment), the downstream performance deltas, and the statistical procedures used. We now report Pearson and Spearman correlation coefficients, associated p-values, and a simple linear regression linking geometry shifts to F1 improvements, with the full set of tasks included to avoid post-hoc metric selection.
revision: yes
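A minimal sketch of the statistics this response describes, assuming one geometry-shift score and one downstream F1 delta per task; the variable names and the synthetic stand-in values are ours, not the authors'.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, linregress

def geometry_vs_gain(shift: np.ndarray, f1_delta: np.ndarray) -> dict:
    """Correlate geometry shifts with downstream F1 improvements across tasks."""
    r, p_r = pearsonr(shift, f1_delta)
    rho, p_rho = spearmanr(shift, f1_delta)
    fit = linregress(shift, f1_delta)  # simple linear model: f1_delta ~ a*shift + b
    return {
        "pearson_r": r, "pearson_p": p_r,
        "spearman_rho": rho, "spearman_p": p_rho,
        "slope": fit.slope, "intercept": fit.intercept,
    }

# Usage with clearly synthetic stand-in values for five downstream tasks.
rng = np.random.default_rng(0)
shift = rng.uniform(0.0, 1.0, size=5)
f1_delta = 0.1 * shift + rng.normal(0.0, 0.02, size=5)
print(geometry_vs_gain(shift, f1_delta))
```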
- Referee: [Prediction step] The central claim that embedding geometry metrics can predict downstream performance requires a demonstrated quantitative link. Without comparison to null models, baseline predictors, or validation on held-out tasks, the predictive power remains unestablished, which is load-bearing for the motivating use-case of forecasting without labels.
Authors: The manuscript presents observed correlations as a potential early signal rather than a proven forecasting method. To strengthen the quantitative link, the revision now includes (i) a null-model baseline obtained by randomly permuting the geometry metrics across tasks and recomputing correlations, (ii) comparison against a simple baseline predictor using only the quantity of unlabeled fine-tuning data, and (iii) leave-one-task-out validation across the downstream classification tasks. These additions are reported with the corresponding correlation coefficients and p-values so that readers can assess the incremental value of the geometry-based signals.
revision: yes
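The two controls named in this response could look as follows; again a sketch under the same assumptions as the previous snippet (one scalar geometry score and one F1 delta per task), with function names of our own choosing.

```python
import numpy as np
from scipy.stats import pearsonr, linregress

def permutation_null(shift, f1_delta, n_perm=10_000, seed=0):
    """Permutation p-value: fraction of shuffled task pairings whose |r|
    matches or exceeds the observed correlation."""
    rng = np.random.default_rng(seed)
    r_obs, _ = pearsonr(shift, f1_delta)
    hits = sum(
        abs(pearsonr(rng.permutation(shift), f1_delta)[0]) >= abs(r_obs)
        for _ in range(n_perm)
    )
    return r_obs, (hits + 1) / (n_perm + 1)  # add-one smoothing

def leave_one_task_out(shift, f1_delta):
    """Predict each task's F1 delta from a line fit on the remaining tasks."""
    shift, f1_delta = np.asarray(shift), np.asarray(f1_delta)
    preds = []
    for i in range(len(shift)):
        mask = np.arange(len(shift)) != i
        fit = linregress(shift[mask], f1_delta[mask])
        preds.append(fit.slope * shift[i] + fit.intercept)
    preds = np.array(preds)
    return preds, float(np.mean(np.abs(preds - f1_delta)))
```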
Circularity Check
No significant circularity; purely observational with empirical correlation attempts
full rationale
The paper reports observations from domain fine-tuning of Finnish BERT on unlabeled histopathological text and describes attempts to correlate observable embedding geometry changes (e.g., pairwise distances or variance shifts) with downstream labeled-task gains. No derivation, equation, or first-principles claim reduces to its own inputs by construction. The 'prediction' component consists of empirical attempts at correlation rather than a fitted parameter renamed as a prediction or a self-definitional loop. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes are invoked. The work is self-contained as an exploratory observational study and does not manufacture a closed predictive result from the same data used to define the signal.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018)
- [2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
- [3] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- [4] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
- [5] Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8, 842–866 (2021)
- [6] Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)
- [7] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., Amodei, D.: Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020)
- [8] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., et al.: Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)
- [9] Brandfonbrener, D., Anand, N., Vyas, N., Malach, E., Kakade, S.: Loss-to-loss prediction: Scaling laws for all datasets. arXiv preprint arXiv:2411.12925 (2024)
- [10] Myllylä, E., Siirtola, P., Isosalo, A., Reponen, J., Tamminen, S., Laatikainen, O.: Extracting information from unstructured medical reports written in minority languages: A case study of Finnish. Data 10(7), 104 (2025). https://doi.org/10.3390/data10070104
- [11] Bani Issa, W., Al Akour, I., Ibrahim, A., Almarzouqi, A., Abbas, S., Hisham, F., Griffiths, J.: Privacy, confidentiality, security and patient safety concerns about electronic health records. International Nursing Review 67(2), 218–230 (2020)
- [12] Keshta, I., Odeh, A.: Security and privacy of electronic health records: Concerns and challenges. Egyptian Informatics Journal 22(2), 177–183 (2021)
- [13] Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Ginter, F., Pyysalo, S.: Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076 (2019)
- [14] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
- [15] Alsentzer, E., Murphy, J., Boag, W., Weng, W.-H., Jindi, D., Naumann, T., McDermott, M.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78 (2019)
- [16] Türkmen, H., Dikenelli, O., Eraslan, C., Çallı, M.C., Özbek, S.S.: BioBERTurk: Exploring Turkish biomedical language model development strategies in low-resource setting. Journal of Healthcare Informatics Research 7(4), 433–446 (2023). https://doi.org/10.1007/s41666-023-00140-7
- [17] Nunes, M., Boné, J., Ferreira, J.C., Chaves, P., Elvas, L.B.: Medialbertina: An European Portuguese medical language model. Computers in Biology and Medicine 182, 109233 (2024). https://doi.org/10.1016/j.compbiomed.2024.109233
- [18] Bui, N., Nguyen, G., Nguyen, N., Vo, B., Vo, L., Huynh, T., Tang, A., Tran, V.N., Huynh, T., Nguyen, H.Q., Dinh, M.: Fine-tuning large language models for improved health communication in low-resource languages. Computer Methods and Programs in Biomedicine 263, 108655 (2025). https://doi.org/10.1016/j.cmpb.2025.108655
- [19] Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., et al.: Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025)
- [20] Prince, S.J.: Understanding Deep Learning. MIT Press (2023)
- [21] Luotolahti, J., Kanerva, J., Laippala, V., Pyysalo, S., Ginter, F.: Towards universal web parsebanks. In: Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pp. 211–220 (2015)
- [22] Eilertsen, G., Jönsson, D., Ropinski, T., Unger, J., Ynnerman, A.: Classifying the classifier: dissecting the weight space of neural networks. arXiv preprint arXiv:2002.05688 (2020)
- [23] Raghu, M., Gilmer, J., Yosinski, J., Sohl-Dickstein, J.: SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in Neural Information Processing Systems 30 (2017)
- [24] Morcos, A., Raghu, M., Bengio, S.: Insights on representational similarity in neural networks with canonical correlation. Advances in Neural Information Processing Systems 31 (2018)
- [25] Ding, F., Denain, J.-S., Steinhardt, J.: Grounding representation similarity through statistical testing. Advances in Neural Information Processing Systems 34, 1556–1568 (2021)
- [26] Wu, J., Saha, S., Bo, Y., Khosla, M.: Measuring the measures: Discriminative capacity of representational similarity metrics across model families. arXiv preprint arXiv:2509.04622 (2025)
- [27] Bo, Y., Soni, A., Srivastava, S., Khosla, M.: Evaluating representational similarity measures from the lens of functional correspondence. arXiv preprint arXiv:2411.14633 (2024)
- [28] Williams, A.H., Kunz, E., Kornblith, S., Linderman, S.: Generalized shape metrics on neural representations. Advances in Neural Information Processing Systems 34, 4738–4750 (2021)
- [29] Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network representations revisited. In: International Conference on Machine Learning, pp. 3519–3529 (2019). PMLR
- [30] Maheswaranathan, N., Williams, A., Golub, M., Ganguli, S., Sussillo, D.: Universality and individuality in neural dynamics across large populations of recurrent networks. Advances in Neural Information Processing Systems 32 (2019)
- [31] Phang, J., Liu, H., Bowman, S.: Fine-tuned transformers show clusters of similar representations across layers. In: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pp. 529–538 (2021)
- [32] Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859 (2017)
- [33] Haxby, J.V., Guntupalli, J.S., Connolly, A.C., Halchenko, Y.O., Conroy, B.R., Gobbini, M.I., Hanke, M., Ramadge, P.J.: A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron 72(2), 404–416 (2011)
- [34] Dwivedi, K., Roig, G.: Representation similarity analysis for efficient task taxonomy & transfer learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12387–12396 (2019)
- [35] Kriegeskorte, N., Mur, M., Bandettini, P.A.: Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2, 249 (2008)
- [36] Mu, J., Bhat, S., Viswanath, P.: All-but-the-top: Simple and effective postprocessing for word representations. arXiv preprint arXiv:1702.01417 (2017)
- [37] Roy, O., Vetterli, M.: The effective rank: A measure of effective dimensionality. In: 2007 15th European Signal Processing Conference, pp. 606–610 (2007). IEEE
- [38] Rudman, W., Gillman, N., Rayne, T., Eickhoff, C.: IsoScore: Measuring the uniformity of embedding space utilization. In: Findings of the Association for Computational Linguistics: ACL 2022, pp. 3325–3339 (2022)
- [39] Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3(Dec), 583–617 (2002)
- [40] Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987)
- [41] Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985)
- [42] Nagatsuka, K., Broni-Bediako, C., Atsumi, M.: Length-based curriculum learning for efficient pre-training of language models. New Generation Computing 41(1), 109–134 (2023)
- [43] MacKay, D.J.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge, UK (2003)
A Details on our datasets
1. Histopathological reports: This was a dataset provided by the Central Finland Biobank, containing medical descriptions of patients who had ended up getting histopathological studies. Labels were not readily ava...