JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

Dat Cao; Hien Chu; Hien Kha; Minh Le; Nguyen Quoc Khanh Le; Phan Nguyen; Trang Pham

arxiv: 2604.27343 · v2 · pith:TMX3AMVOnew · submitted 2026-04-30 · 💻 cs.CV

Pith reviewed 2026-07-01 08:57 UTC · model grok-4.3

keywords: multimodal learning, skin lesion classification, adaptive fusion, dermoscopic images, clinical photographs, patient metadata, deep learning, class imbalance

The pith

JI-ADF uses joint learning and adaptive per-sample fusion of dermoscopic images, clinical photos, and metadata to classify skin lesions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a trimodal framework that learns shared representations across three input types while applying separate supervision to each modality. An adaptive fusion step then adjusts the weight of each modality's output for every individual case, aided by a multimodal fusion attention module that captures cross-modal dependencies. The approach is assessed on the MILK10k benchmark, which mirrors real clinical capture conditions and severe class imbalance. Results indicate gains in sensitivity and Dice score alongside preserved specificity and calibration. This setup aims to make fuller use of routinely collected multimodal evidence rather than relying mainly on dermoscopic images alone.

Core claim

The JI-ADF architecture, by combining joint multimodal representation learning with modality-specific auxiliary supervision, an adaptive decision fusion mechanism, and the multimodal fusion attention module, delivers strong and well-balanced performance across lesion categories on the MILK10k benchmark, with gains in sensitivity and Dice score while retaining high specificity and good calibration.
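The pairing of a joint prediction with modality-specific auxiliary supervision can be sketched as a weighted sum of cross-entropy terms. This is a minimal NumPy illustration under stated assumptions, not the paper's loss: this page gives no equations, so the function names, the use of plain cross-entropy, and the `aux_weight` hyperparameter are all hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def joint_individual_loss(joint_logits, modality_logits, labels, aux_weight=0.5):
    """Joint loss plus a weighted sum of per-modality auxiliary losses.

    modality_logits maps a modality name to its (n_samples, n_classes) logits.
    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    joint = cross_entropy(joint_logits, labels)
    aux = sum(cross_entropy(logits, labels) for logits in modality_logits.values())
    return joint + aux_weight * aux

# Toy usage with random logits for 4 samples and 3 classes.
rng = np.random.default_rng(0)
labels = np.array([0, 2, 1, 1])
heads = {name: rng.normal(size=(4, 3))
         for name in ("dermoscopic", "clinical", "metadata")}
loss = joint_individual_loss(rng.normal(size=(4, 3)), heads, labels)
print(f"total loss: {loss:.4f}")
```

The auxiliary terms give each modality branch its own gradient signal, so a weak branch cannot hide behind a strong one during training. How the paper balances these terms is not stated on this page.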

What carries the argument

The adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis, supported by the multimodal fusion attention (MMFA) module inside the JI-ADF trimodal architecture.
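Per-sample decision fusion of this kind is often implemented as a softmax gate over the modality outputs. The sketch below assumes that form; the gating scores, function names, and array layout are illustrative choices, since this page does not specify how JI-ADF computes its weights. A fixed-average baseline is included for contrast.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis` with max-subtraction for numerical stability."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def adaptive_decision_fusion(modality_logits, gate_scores):
    """Fuse per-modality class probabilities with per-sample weights.

    modality_logits: array of shape (n_modalities, n_samples, n_classes).
    gate_scores:     array of shape (n_samples, n_modalities). In a trained
                     model these would come from a learned gating network
                     (hypothetical here); in this sketch they are inputs.
    Returns (fused_probs, weights).
    """
    probs = softmax(modality_logits, axis=-1)        # (M, N, C)
    weights = softmax(gate_scores, axis=-1)          # (N, M), rows sum to 1
    fused = np.einsum("nm,mnc->nc", weights, probs)  # convex mix per sample
    return fused, weights

def fixed_average_fusion(modality_logits):
    """Baseline: every modality gets the same weight for every sample."""
    return softmax(modality_logits, axis=-1).mean(axis=0)

# Toy usage: 3 modalities, 2 samples, 4 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 4))
scores = np.array([[2.0, 0.0, 0.0],   # sample 0 leans on modality 0
                   [0.0, 0.0, 2.0]])  # sample 1 leans on modality 2
fused, weights = adaptive_decision_fusion(logits, scores)
print(weights.round(3))
```

Because the weights form a convex combination, the fused output is still a valid probability distribution, and all-equal gate scores reduce exactly to the fixed-average baseline.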

If this is right

Higher sensitivity and Dice scores while keeping high specificity across imbalanced lesion categories.
More robust handling of real-world clinical acquisition variations and class imbalance.
Verified behavior through modality ablation studies, calibration metrics, and attention visualizations.
A practical base for using all three data types together in diagnostic support systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-sample weighting could reduce errors when one input type is noisy or incomplete in deployment.
The same joint-individual plus adaptive fusion pattern may transfer to other multimodal medical tasks that mix images with structured records.
External validation on datasets with different acquisition protocols would test whether the reported balance generalizes.
Clinical integration would need checks on how performance shifts when metadata completeness varies.

Load-bearing premise

The adaptive decision fusion and MMFA module will generate clinically meaningful gains by dynamically adjusting per-sample modality contributions under the MILK10k acquisition conditions.

What would settle it

A controlled test on MILK10k showing that replacing adaptive fusion with fixed averaging or removing the MMFA module produces no measurable change in sensitivity, Dice score, or calibration.
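Such a comparison comes down to computing the same per-class metrics for each variant and looking at the differences. The sketch below assumes single-label predictions and one-vs-rest counting, with Dice computed as 2TP / (2TP + FP + FN), which is identical to F1 in this setting. The function names and toy labels are hypothetical, and the paper's exact evaluation protocol is not given on this page.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """One-vs-rest (sensitivity, specificity, Dice) for each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rows = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        tn = np.sum((y_true != c) & (y_pred != c))
        sens = tp / (tp + fn) if (tp + fn) else 0.0
        spec = tn / (tn + fp) if (tn + fp) else 0.0
        dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        rows.append((sens, spec, dice))
    return np.array(rows, dtype=float)  # shape (n_classes, 3)

def macro_difference(y_true, pred_a, pred_b, n_classes):
    """Macro-averaged (sens, spec, dice) of variant A minus variant B."""
    a = per_class_metrics(y_true, pred_a, n_classes).mean(axis=0)
    b = per_class_metrics(y_true, pred_b, n_classes).mean(axis=0)
    return a - b

# Toy usage: compare two hypothetical prediction sets on six samples.
y_true = [0, 0, 1, 1, 2, 2]
pred_adaptive = [0, 0, 1, 1, 2, 0]
pred_fixed = [0, 1, 1, 1, 2, 0]
print(macro_difference(y_true, pred_adaptive, pred_fixed, n_classes=3))
```

A difference vector near zero across repeated runs or resamples would support the "no measurable change" outcome. A real test would also need confidence intervals, which this sketch omits.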

Figures

Figures reproduced from arXiv: 2604.27343 by Dat Cao, Hien Chu, Hien Kha, Minh Le, Nguyen Quoc Khanh Le, Phan Nguyen, Trang Pham.

**Figure 1.** Illustration of (a) the Joint Fusion Structure and (b) our proposed Joint–Individual architecture with Adaptive Decision Fusion.

**Figure 2.** Multimodal Fusion Attention Module (MMFA), where …

**Figure 3.** Comparison between the original input images and …

**Figure 4.** Calibration Curve. The calibration curve of the fused JI-ADF model lies close to the diagonal, indicating that predicted probabilities match observed frequencies well overall. The curve is slightly below the perfect-calibration line for mid-range probabilities, suggesting mild over-confidence in this region, but it aligns closely with the diagonal for high-confidence predictions (≥ 0.7), where clinical d…

**Figure 5.** Fusion Architecture Ablation – Multimetrics Compari…
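A calibration curve like the one described for Figure 4 is typically built by binning predictions by confidence and comparing each bin's mean confidence with its observed accuracy. The sketch below assumes equal-width bins and also computes expected calibration error (ECE) as a summary. The page does not say which calibration metric the paper reports, so ECE is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def reliability_curve(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    idx = np.digitize(confidences, edges[1:-1], right=True)
    bins = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            bins.append((float(confidences[mask].mean()),
                         float(correct[mask].mean()),
                         int(mask.sum())))
    return bins

def expected_calibration_error(confidences, correct, n_bins=10):
    """Count-weighted mean of |accuracy - confidence| over non-empty bins."""
    bins = reliability_curve(confidences, correct, n_bins)
    total = sum(count for _, _, count in bins)
    return sum(count / total * abs(acc - conf) for conf, acc, count in bins)

# Toy usage: top-class confidence and whether the prediction was right.
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 1, 1, 0, 1, 0]
for mean_conf, acc, count in reliability_curve(conf, hit):
    print(f"conf={mean_conf:.2f} acc={acc:.2f} n={count}")
print(f"ECE={expected_calibration_error(conf, hit):.3f}")
```

A bin whose accuracy falls below its mean confidence sits under the diagonal, which corresponds to the over-confidence the caption describes for mid-range probabilities.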

Original abstract

Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose JI-ADF, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JI-ADF applies standard multimodal blocks to trimodal skin lesion data on a new benchmark, but the abstract supplies no numbers, so the size of any gain from adaptive fusion stays unclear.

The paper's main move is to take joint representation learning, auxiliary per-modality supervision, an attention module called MMFA, and an adaptive decision fusion step and run them on dermoscopic images, clinical photos, and metadata together. The target is skin lesion classification under real acquisition conditions and class imbalance on the MILK10k set.

What it does reasonably is lay out an evaluation plan that includes modality ablations, calibration checks, and Grad-CAM. That is the right kind of work for a medical imaging paper that wants to claim clinical relevance. The adaptive fusion is presented as learning per-sample weights, which is a common way to handle uneven modality quality.

The soft spot is the complete absence of any quantitative results in the abstract. There are no accuracy figures, no baseline comparisons, no sensitivity or Dice values, and no indication of how much the adaptive component actually moves the needle versus a simpler fusion. Without those numbers it is impossible to tell whether the claimed balanced performance and good calibration are meaningful or just the usual outcome of adding more modalities. The full manuscript presumably contains the tables, but the abstract should at least sketch the effect sizes.

This is a narrow applied paper aimed at researchers already working on multimodal dermatology imaging. A reader who needs a concrete trimodal baseline on a new benchmark might extract something useful from the architecture and the ablation design. It does not introduce new theory or reorganize prior results.

I would bring it to a reading group only if the group is focused on medical multimodal methods. I would not cite it in the next year unless the numbers turn out to be unusually strong. It is worth sending to peer review because the task matters and the evaluation outline is sensible, but the authors will need to add concrete results and clearer comparisons before it can be accepted.

Referee Report

1 major / 0 minor

Summary. The paper proposes JI-ADF, a trimodal deep learning framework for skin lesion classification integrating dermoscopic images, clinical photographs, and structured patient metadata. It combines joint multimodal representation learning with modality-specific auxiliary supervision, an adaptive decision fusion mechanism that dynamically calibrates per-sample modality contributions, and a multimodal fusion attention (MMFA) module. Evaluation is on the MILK10k benchmark reflecting real-world acquisition and class imbalance; the abstract claims strong balanced performance across lesion categories, improved sensitivity and Dice score, maintained high specificity and good calibration, with supporting analyses via modality ablation, calibration evaluation, and Grad-CAM visualizations.

Significance. If the claimed performance gains and robustness hold with quantitative validation, JI-ADF could offer a practical advance for multimodal skin lesion classification by better utilizing routinely available clinical data sources and addressing class imbalance in real-world settings. The adaptive fusion and MMFA components target a clinically relevant gap in existing dermoscopy-focused systems.

major comments (1)

[Abstract] The central claims of 'strong and well-balanced performance', 'improving sensitivity and Dice score', 'good calibration', and robustness confirmed by ablation/calibration/Grad-CAM analyses are asserted without any quantitative results, baseline comparisons, error bars, statistical tests, or specific metric values. This absence makes it impossible to assess the magnitude or reliability of the reported improvements and directly undermines evaluation of the core contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and agree that revisions to the abstract are warranted to strengthen the presentation.

Referee: [Abstract] The central claims of 'strong and well-balanced performance', 'improving sensitivity and Dice score', 'good calibration', and robustness confirmed by ablation/calibration/Grad-CAM analyses are asserted without any quantitative results, baseline comparisons, error bars, statistical tests, or specific metric values. This absence makes it impossible to assess the magnitude or reliability of the reported improvements and directly undermines evaluation of the core contribution.

Authors: We acknowledge that the current abstract presents only qualitative claims. In the revised manuscript we will update the abstract to include specific quantitative results drawn from the experimental section (e.g., sensitivity, Dice score, specificity, and calibration metrics on MILK10k), along with brief mention of baseline comparisons. The full results, including error bars, statistical tests, ablation studies, calibration plots, and Grad-CAM visualizations, remain unchanged in the body of the paper; only the abstract summary will be augmented for clarity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with no derivation chain

The paper introduces an empirical multimodal DL framework (JI-ADF) consisting of joint representation learning, auxiliary supervision, adaptive decision fusion, and an MMFA module, then reports benchmark performance on MILK10k. No equations, first-principles derivations, or predictions appear in the text. Performance claims rest on standard train/test evaluation, modality ablations, calibration metrics, and Grad-CAM visualizations rather than any fitted parameter renamed as a prediction or any self-citation that bears the central result. The architecture is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The model implies standard deep-learning hyperparameters and the assumption that the three modalities are complementary under the described fusion.

pith-pipeline@v0.9.1-grok · 5762 in / 1126 out tokens · 16151 ms · 2026-07-01T08:57:51.475526+00:00 · methodology
