arxiv: 2605.00718 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels

Tongxu Zhang

Pith reviewed 2026-05-09 19:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords knee osteoarthritis, Kellgren-Lawrence grading, hierarchical supervision, dual-head model, latent space organization, saliency analysis, 3D convolutional networks

The pith

A simple dual-head model trained jointly on coarse binary OA labels and fine KL grades produces more ordered latent representations and backbone-specific gains in severity grading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the clinical hierarchy of knee osteoarthritis labels—binary presence of disease together with a five-level Kellgren-Lawrence severity score—can serve as a useful supervisory signal for learning disease representations. Rather than designing a new architecture, the authors probe the question with an ordinary shared-encoder network that has one head for the coarse label and one for the fine label. Across several 3D backbones they compare this dual-head training against single-task baselines and find that, for responsive networks, the joint objective improves KL-grade metrics while also producing a latent space whose geometry is more clearly aligned with increasing severity and whose attention maps overlap better with cartilage. The result suggests that even noisy hierarchical labels can supply an inductive bias that reshapes internal representations without extra model complexity.

Core claim

Dual-head supervision with a shared encoder for both the coarse OA presence label and the fine KL severity label yields backbone-dependent improvements in KL classification metrics; these gains are accompanied by a more ordered coarse-to-fine arrangement of samples in the latent space and, in responsive backbones, greater spatial overlap between saliency maps and cartilage anatomy.
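One way to make "spatial overlap between saliency maps and cartilage anatomy" concrete is to ask what share of the most salient voxels fall inside a cartilage mask. The sketch below is only an illustration under stated assumptions: the function name `topk_overlap`, the top-fraction cutoff, and the flattened-volume representation are hypothetical, and the paper's own overlap measure may be defined differently.

```python
def topk_overlap(saliency, mask, top_fraction=0.1):
    """Fraction of the top-`top_fraction` most salient voxels that lie inside `mask`.

    saliency: flat list of floats, one per voxel (hypothetical representation).
    mask:     flat list of 0/1 ints of the same length (1 = cartilage).
    """
    if len(saliency) != len(mask):
        raise ValueError("saliency and mask must have the same length")
    k = max(1, int(round(top_fraction * len(saliency))))
    # Indices of the k voxels with the highest saliency values.
    top = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)[:k]
    return sum(mask[i] for i in top) / k


# Toy 8-voxel volume in which cartilage occupies voxels 2 and 3.
mask = [0, 0, 1, 1, 0, 0, 0, 0]
focused = [0.1, 0.0, 0.9, 0.8, 0.1, 0.0, 0.2, 0.1]  # saliency peaks on cartilage
diffuse = [0.9, 0.8, 0.1, 0.0, 0.1, 0.0, 0.2, 0.1]  # saliency peaks elsewhere

overlap_focused = topk_overlap(focused, mask, top_fraction=0.25)
overlap_diffuse = topk_overlap(diffuse, mask, top_fraction=0.25)
```

A top-fraction score avoids choosing an absolute saliency threshold, which matters when maps from different backbones are on different scales.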

What carries the argument

A dual-head architecture consisting of a shared 3D convolutional encoder plus two task-specific heads (one binary, one five-class) that is used as a minimal probe to inject the clinical label hierarchy directly into representation learning.
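The dual-head idea can be sketched as a joint loss over two heads that read the same features. This is a minimal illustration, not the author's implementation: the names `dual_head_loss` and `oa_weight`, the use of a two-logit softmax for the binary head, and the unweighted sum of the two cross-entropy terms are all assumptions, and `features` merely stands in for the output of the shared 3D encoder.

```python
import math


def linear(weights, bias, x):
    """Apply one dense layer; `weights` is a list of rows, `x` a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax_cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def dual_head_loss(features, oa_head, kl_head, oa_label, kl_label, oa_weight=1.0):
    """Joint loss: fine (KL) cross-entropy plus weighted coarse (OA) cross-entropy."""
    oa_logits = linear(oa_head["W"], oa_head["b"], features)  # 2 classes (assumed)
    kl_logits = linear(kl_head["W"], kl_head["b"], features)  # 5 classes
    loss_oa = softmax_cross_entropy(oa_logits, oa_label)
    loss_kl = softmax_cross_entropy(kl_logits, kl_label)
    return loss_kl + oa_weight * loss_oa, loss_oa, loss_kl


# Toy example: 3-dimensional shared features and zero-initialised heads.
features = [0.5, -1.0, 2.0]
oa_head = {"W": [[0.0] * 3 for _ in range(2)], "b": [0.0] * 2}
kl_head = {"W": [[0.0] * 3 for _ in range(5)], "b": [0.0] * 5}

total, loss_oa, loss_kl = dual_head_loss(
    features, oa_head, kl_head, oa_label=1, kl_label=3
)
```

Because both loss terms depend on the same `features`, gradients from the coarse and fine labels would both reach the shared encoder, which is the sense in which the hierarchy is injected into representation learning.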

If this is right

For some 3D backbones, joint training on both label levels outperforms training on the KL label alone.
The latent space exhibits clearer monotonic ordering along the clinical severity axis.
Saliency maps of responsive backbones overlap more strongly with anatomically relevant cartilage regions.
A deliberately simple hierarchical supervisory signal can reshape disease representations even when the supplied labels are noisy.
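A simple check for "monotonic ordering along the clinical severity axis" is to project latent features onto the direction that joins the lowest-grade and highest-grade centroids, then test whether the mean projection rises with grade. The sketch below assumes that definition of the axis; the helper names are hypothetical and the paper's severity-axis analysis may use a different construction.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def severity_projections(features, grades):
    """Mean projection per grade onto the lowest-to-highest centroid axis."""
    lo, hi = min(grades), max(grades)
    c_lo = mean_vector([f for f, g in zip(features, grades) if g == lo])
    c_hi = mean_vector([f for f, g in zip(features, grades) if g == hi])
    axis = [h - l for h, l in zip(c_hi, c_lo)]
    norm = sum(a * a for a in axis) ** 0.5
    axis = [a / norm for a in axis]  # unit-length severity direction
    per_grade = {}
    for f, g in zip(features, grades):
        proj = sum((x - c) * a for x, c, a in zip(f, c_lo, axis))
        per_grade.setdefault(g, []).append(proj)
    return {g: sum(v) / len(v) for g, v in sorted(per_grade.items())}


def is_monotonic(per_grade_means):
    """True if the mean projection strictly increases with grade."""
    values = [per_grade_means[g] for g in sorted(per_grade_means)]
    return all(a < b for a, b in zip(values, values[1:]))


# Toy 2-D latent features, two samples per grade.
grades = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
ordered = [[g + 0.1 * j, 0.5 * g] for g in range(5) for j in range(2)]
# Same points, but the features of grades 1 and 3 are swapped.
shuffled = ordered[:2] + ordered[6:8] + ordered[4:6] + ordered[2:4] + ordered[8:]

ordered_ok = is_monotonic(severity_projections(ordered, grades))
shuffled_ok = is_monotonic(severity_projections(shuffled, grades))
```

In practice a rank correlation between projection and grade would give a graded score rather than a yes/no answer; the boolean check is used here only to keep the example short.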

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-head pattern could be tried on other medical grading tasks that already possess both coarse and fine annotations, such as tumor staging or retinopathy severity.
Backbone dependence implies that certain network inductive biases interact productively with hierarchical supervision, which could guide future architecture choices.
If the ordering effect is truly hierarchy-driven, one could test whether the same latent structure emerges when the coarse head is trained on a non-hierarchical but still related auxiliary task.

Load-bearing premise

The observed metric gains, latent ordering, and saliency improvements are caused by the specific coarse-to-fine label relationship rather than by generic multi-task regularization or by the particular capacity of the chosen backbones.

What would settle it

Training the identical dual-head model but replacing the coarse OA head with an unrelated auxiliary task such as age prediction or image-quality regression, then checking whether the same KL-metric gains, latent-axis ordering, and cartilage-saliency alignment still appear.
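A cheaper relative of this control, swapped in here for illustration, keeps the coarse head but permutes its labels across subjects so that the class balance is preserved while the link to the KL grade is broken. The sketch rests on assumptions: the rule "OA present if KL >= 2" is a common radiographic convention but may not be the paper's definition, and all function names are hypothetical.

```python
import random


def coarse_from_kl(kl_grades, threshold=2):
    """Coarse OA label implied by a KL grade (assumed rule: KL >= threshold)."""
    return [1 if g >= threshold else 0 for g in kl_grades]


def permuted_control(labels, seed=0):
    """Shuffle labels across subjects: same class balance, no tie to the KL grade."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def hierarchy_violations(coarse, kl_grades, threshold=2):
    """Count subjects whose coarse label contradicts their KL grade."""
    return sum(c != (1 if g >= threshold else 0) for c, g in zip(coarse, kl_grades))


kl_grades = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
hierarchical = coarse_from_kl(kl_grades)        # consistent with the hierarchy
control = permuted_control(hierarchical, seed=0)  # hierarchy deliberately broken

violations_hierarchical = hierarchy_violations(hierarchical, kl_grades)
violations_control = hierarchy_violations(control, kl_grades)
```

If dual-head training on `control` labels gave the same KL-metric gains and latent ordering as training on `hierarchical` labels, the benefit would look more like generic multi-task regularization than an effect of the hierarchy.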

Figures

Figures reproduced from arXiv: 2605.00718 by Tongxu Zhang.

**Figure 2.** Paired statistical comparison between Dual models and single-task base

**Figure 3.** Visualization of saliency maps and their overlap with cartilage regions.

**Figure 4.** Additional neural manifold visualization of penultimate-layer features.

**Figure 5.** Additional confusion matrices for all backbones and supervision branches.

read the original abstract

Knee osteoarthritis (OA) assessment involves a natural but often underused label hierarchy: a coarse binary OA decision and a fine-grained Kellgren-Lawrence (KL) severity grade. Existing deep learning studies commonly treat these targets as separate classification problems, either reducing OA assessment to disease presence or directly optimizing noisy ordinal KL labels. In this work, we ask whether this clinical hierarchy can serve as a representation-level supervisory prior. Rather than introducing a complex architecture, we use a deliberately simple dual-head model with a shared encoder and two task-specific heads as a probe of hierarchical supervision. We compare single-OA, single-KL, and dual-head training across multiple 3D backbones under the same test protocol. Beyond standard classification metrics, we perform paired statistical comparisons, analyze latent severity-axis geometry, and examine saliency overlap with cartilage regions. The results show that dual-head supervision produces backbone-dependent gains, with clear improvements in KL-related metrics for selected backbones. More importantly, the gains are accompanied by a more ordered coarse-to-fine latent organization and, for responsive backbones, stronger anatomical alignment of saliency with cartilage. These findings suggest that even simple hierarchical dual-head supervision can reshape disease representations under noisy coarse/fine labels, providing a useful inductive bias for OA diagnosis and severity grading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows backbone-dependent gains from dual-head OA/KL supervision plus nicer latent geometry and saliency, but the design does not isolate the hierarchy from plain multi-task regularization.

The one thing to know is that a plain shared-encoder dual-head model trained on both binary OA presence and KL grades produces measurable lifts in KL metrics for some 3D backbones, together with a more ordered coarse-to-fine latent space and stronger cartilage overlap in saliency maps. The gains are real on the reported backbones but not universal.

The work is new mainly in the concrete combination: applying the dual-head probe to this particular noisy clinical hierarchy and then adding the latent-axis geometry and anatomical saliency checks on top of standard accuracy numbers. It does that cleanly. The authors keep the architecture deliberately simple, run the same test protocol across multiple backbones, and include paired statistical tests rather than just claiming improvements. That is better than many incremental medical-imaging papers that stop at metric tables. The latent and saliency analyses give some mechanistic insight into why the dual supervision might be helping the encoder organize severity information.

The soft spot is exactly the one flagged in the stress test. The experiments compare single-OA, single-KL, and dual-head training, but there is no control that adds a second unrelated supervisory signal. Without that, it remains possible that any extra head would produce similar regularization effects under noisy labels, especially since the benefit is backbone-dependent. The paper does not claim universality, which is honest, but the central attribution to the hierarchical structure would be stronger with an ablation that breaks the clinical relation between the two labels. Data-split and exclusion details are not visible in the abstract, so robustness is harder to judge from what is shown.

This paper is for researchers who work on knee OA grading or other hierarchical label problems in radiology. Someone looking for a lightweight way to inject clinical structure into an existing backbone will find the empirical pattern and the extra analyses useful. It is not a new framework, but the question is well-posed and the evidence is multi-faceted enough to merit referee time. I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that a simple dual-head model with a shared encoder and task-specific heads for coarse binary OA and fine KL grades can use the clinical label hierarchy as a representation-level supervisory prior under noisy labels. Comparisons of single-OA, single-KL, and dual-head training across multiple 3D backbones show backbone-dependent gains in KL-related metrics, accompanied by more ordered coarse-to-fine latent geometry and, for responsive backbones, stronger saliency alignment with cartilage regions.

Significance. If the gains can be attributed specifically to the hierarchical structure of the labels rather than general multi-task regularization, the work would demonstrate that even minimal dual-head supervision provides a useful inductive bias for learning disease representations in medical imaging tasks with hierarchical and noisy labels. The inclusion of latent-axis analysis and saliency checks adds depth beyond standard classification metrics.

major comments (2)

[Experimental comparisons (as described in abstract and results)] The experimental design compares only single-OA, single-KL, and dual-head (OA+KL) training regimes. There is no control using dual heads with non-hierarchical or unrelated supervisory signals to isolate the effect of the label hierarchy from general multi-task regularization. This is load-bearing for the central claim (abstract) that the observed KL-metric gains, latent organization, and saliency improvements are caused by the hierarchical supervisory prior.
[Results] The backbone-dependent nature of the results is acknowledged, yet without the non-hierarchical multi-task control the attribution of improved coarse-to-fine latent geometry specifically to the hierarchical prior remains tentative, particularly in the noisy label setting.

minor comments (1)

[Abstract] The abstract refers to 'multiple 3D backbones' without naming them; explicitly listing the architectures used would aid reproducibility and interpretation of the backbone dependence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The primary concern is the absence of a non-hierarchical multi-task control to isolate the effect of label hierarchy from general multi-task regularization. We address this point by point below and commit to revisions that strengthen the attribution of our findings.

Referee: [Experimental comparisons (as described in abstract and results)] The experimental design compares only single-OA, single-KL, and dual-head (OA+KL) training regimes. There is no control using dual heads with non-hierarchical or unrelated supervisory signals to isolate the effect of the label hierarchy from general multi-task regularization. This is load-bearing for the central claim (abstract) that the observed KL-metric gains, latent organization, and saliency improvements are caused by the hierarchical supervisory prior.

Authors: We agree that the current set of comparisons does not fully separate the contribution of the hierarchical label structure from the general benefits of multi-task supervision. In the revised manuscript we will add a control experiment training the dual-head architecture with the coarse OA label paired to a non-hierarchical fine-grained signal (e.g., randomly permuted KL grades or an unrelated auxiliary task). The same backbone-dependent evaluation, latent-axis analysis, and saliency overlap metrics will be reported for this control, allowing direct attribution of any gains to the hierarchical prior rather than multi-task regularization alone.

revision: yes

Referee: [Results] The backbone-dependent nature of the results is acknowledged, yet without the non-hierarchical multi-task control the attribution of improved coarse-to-fine latent geometry specifically to the hierarchical prior remains tentative, particularly in the noisy label setting.

Authors: We concur that the backbone-dependent pattern leaves the interpretation of the latent geometry tentative without the proposed control. The additional non-hierarchical dual-head runs will include the same paired statistical tests on latent severity axes and cartilage saliency overlap. This will clarify whether the observed ordering of the coarse-to-fine latent space is specifically induced by the hierarchical supervisory signal under noisy labels, or whether it arises from any dual-head configuration.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparisons are self-contained

The paper describes an experimental comparison of single-task (OA or KL) versus dual-head (OA+KL) training on shared 3D backbones, reporting classification metrics, latent geometry, and saliency maps. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on observed differences across training regimes and backbones rather than any derivation that reduces to its own inputs by construction. The absence of a non-hierarchical multi-task control is a methodological limitation but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard supervised deep learning assumptions that neural networks can extract hierarchical features when trained with multi-level labels; no new entities or ad-hoc parameters are introduced beyond ordinary training hyperparameters.

pith-pipeline@v0.9.0 · 5522 in / 1060 out tokens · 29898 ms · 2026-05-09T19:25:51.436595+00:00 · methodology
