DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection
Pith reviewed 2026-05-10 17:04 UTC · model grok-4.3
The pith
A dual-branch multimodal framework detects out-of-distribution samples in endoscopic images by integrating text-image and vision branch scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After separate training of the text-image branch (yielding score St) and the vision branch (yielding score Sv), their integration produces a final OOD score S that is thresholded to decide whether an input is out-of-distribution.
What carries the argument
The dual-branch architecture in which one branch computes an image-text matching score and the other computes a vision-only score, followed by direct integration of the two scores to form the final decision threshold.
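The abstract does not state how the two scores are integrated. A minimal sketch, assuming a convex (weighted-sum) combination in which the weight `lam` and threshold `tau` are hypothetical parameters and higher scores indicate in-distribution data:

```python
import numpy as np

def ood_score(s_t: np.ndarray, s_v: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Fuse text-image scores s_t and vision scores s_v into one OOD score.

    A convex combination is one plausible instantiation of the paper's
    unspecified integration rule; lam is a hypothetical weight.
    """
    return lam * s_t + (1.0 - lam) * s_v

def is_ood(s: np.ndarray, tau: float) -> np.ndarray:
    """Flag a sample as OOD when its fused score falls below tau.

    The comparison direction assumes higher score = in-distribution.
    """
    return s < tau

# Toy usage: two in-distribution-looking samples and one outlier.
s_t = np.array([0.9, 0.8, 0.1])
s_v = np.array([0.85, 0.9, 0.2])
s = ood_score(s_t, s_v, lam=0.5)
flags = is_ood(s, tau=0.5)  # → [False, False, True]
```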
If this is right
- Deep-learning models for endoscopy become more reliable when they encounter previously unseen disease presentations.
- The performance gain holds when the underlying backbone networks are swapped for different architectures.
- State-of-the-art OOD detection on the tested public endoscopic datasets improves by up to 24.84%.
Where Pith is reading between the lines
- The same score-integration pattern could be tested on other medical imaging tasks where both visual appearance and descriptive text are available.
- Deployment in hospitals would still require validation on live data streams that contain natural, unscripted distribution shifts.
- Replacing the fixed score combination with a learned fusion layer might further improve results on more diverse data.
Load-bearing premise
The two branches supply genuinely complementary information whose simple combination yields a better detector than either branch alone, and the chosen public datasets reflect the distribution shifts that occur in actual clinical practice.
What would settle it
A controlled test on an endoscopic dataset collected from a different clinical site or patient cohort in which the dual-branch integrated score shows no gain, or a loss, in OOD detection metrics relative to the strongest single-branch baseline.
Figures
Original abstract
The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome the challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DBMF, a dual-branch multimodal framework for out-of-distribution (OOD) detection in endoscopic images. It consists of a text-image branch and a vision branch whose scores S_t and S_v are integrated into a final OOD score S; the framework is claimed to be robust across diverse backbones and to improve state-of-the-art OOD detection performance by up to 24.84% on publicly available endoscopic datasets.
Significance. If the reported gains prove robust and attributable to genuine complementarity between the text-image and vision branches rather than simple averaging of correlated signals, the work could meaningfully advance reliable multimodal OOD detection for clinical deep-learning systems. The emphasis on public datasets supports reproducibility, yet the absence of detailed metrics, ablations, and shift characterizations currently limits the ability to gauge its broader impact on handling real-world clinical distribution shifts.
Major comments (3)
- Abstract: The integration rule that produces the final score S from S_t and S_v is never specified, nor are the exact OOD metrics (AUROC, FPR@95, etc.), the precise baselines, or any statistical significance tests underlying the “up to 24.84%” improvement. Without these details the central performance claim cannot be evaluated or reproduced.
- Experiments (assumed section): No ablation results, per-branch performance numbers, or failure-case analysis are presented to show that the two branches supply genuinely complementary signals. It is therefore impossible to determine whether the reported gains exploit complementarity or merely average correlated scores.
- Dataset description: The paper asserts robustness on public endoscopic datasets but provides no characterization of the concrete distribution shifts (new pathologies, equipment changes, demographic variations) present in the OOD test splits, leaving open whether the 24.84% figure generalizes beyond the chosen splits to clinically relevant shifts.
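The metrics the first comment asks for can be computed from the two score populations alone. A minimal sketch with illustrative toy scores, assuming higher scores indicate in-distribution samples:

```python
import numpy as np

def auroc(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """AUROC via pairwise comparison: the probability that a random
    in-distribution sample scores higher than a random OOD sample
    (ties count half)."""
    diff = id_scores[:, None] - ood_scores[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def fpr_at_95_tpr(id_scores: np.ndarray, ood_scores: np.ndarray) -> float:
    """FPR@95: fraction of OOD samples scoring above the threshold that
    retains 95% of in-distribution samples (TPR = 0.95)."""
    tau = np.percentile(id_scores, 5)  # 95% of ID scores lie above tau
    return float((ood_scores >= tau).mean())

id_s = np.array([0.9, 0.8, 0.7, 0.6])   # toy in-distribution scores
ood_s = np.array([0.5, 0.4, 0.3, 0.2])  # toy OOD scores
print(auroc(id_s, ood_s))        # → 1.0 (perfect separation)
print(fpr_at_95_tpr(id_s, ood_s))  # → 0.0 (no OOD score crosses tau)
```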
Minor comments (2)
- Abstract: The abstract states that “comprehensive experiments” were performed yet supplies no quantitative metrics, confidence intervals, or backbone-specific results that would let readers assess the strength of the claims at a glance.
- Notation: The symbols S_t, S_v, and S are introduced without an accompanying equation or pseudocode defining their computation and combination, which would improve clarity.
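For concreteness, one illustrative form of the unspecified combination and decision rule, assuming a convex weight $\lambda$ and threshold $\tau$ (both hypothetical) and that lower fused scores indicate OOD:

```latex
S = \lambda S_t + (1 - \lambda)\, S_v, \qquad
\text{decide OOD} \iff S < \tau
```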
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving clarity, reproducibility, and depth of analysis. We address each major comment below and have revised the manuscript to incorporate the requested details and additional experiments.
Point-by-point responses
- Referee: Abstract: The integration rule that produces the final score S from S_t and S_v is never specified, nor are the exact OOD metrics (AUROC, FPR@95, etc.), the precise baselines, or any statistical significance tests underlying the “up to 24.84%” improvement. Without these details the central performance claim cannot be evaluated or reproduced.
  Authors: We agree that the abstract, in its current concise form, omits critical specifics needed for immediate evaluation. The integration rule (a weighted combination of the two branch scores) and the full set of metrics, baselines, and significance testing are described in the methods and experimental sections. In the revised manuscript we will expand the abstract to explicitly state the integration rule, list the precise metrics (AUROC, FPR@95, etc.), name the baselines, and report the statistical tests supporting the 24.84% figure. Revision: yes
- Referee: Experiments (assumed section): No ablation results, per-branch performance numbers, or failure-case analysis are presented to show that the two branches supply genuinely complementary signals. It is therefore impossible to determine whether the reported gains exploit complementarity or merely average correlated scores.
  Authors: We acknowledge that the current experimental section lacks the ablations necessary to isolate the contribution of each branch. We will add (i) per-branch AUROC/FPR@95 numbers, (ii) an ablation table showing performance when using only S_t, only S_v, and the combined S, and (iii) a qualitative failure-case analysis. These additions will directly demonstrate whether the observed gains stem from complementary signals. Revision: yes
- Referee: Dataset description: The paper asserts robustness on public endoscopic datasets but provides no characterization of the concrete distribution shifts (new pathologies, equipment changes, demographic variations) present in the OOD test splits, leaving open whether the 24.84% figure generalizes beyond the chosen splits to clinically relevant shifts.
  Authors: We agree that a more explicit characterization of the distribution shifts is required for clinical interpretability. In the revised dataset section we will describe the specific shifts present in each OOD split—new pathologies, changes in imaging equipment, and demographic variations—supported by quantitative statistics on the test sets. This will clarify the scope of the reported gains. Revision: yes
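The ablation the rebuttal promises (only S_t, only S_v, combined S) can be sketched as a toy AUROC comparison. The scores and the fusion weight below are illustrative assumptions, constructed so the two branches disagree and fusion can help:

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random in-distribution score exceeds a random
    OOD score (ties count half); higher score = in-distribution assumed."""
    diff = np.asarray(id_scores)[:, None] - np.asarray(ood_scores)[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy scores for in-distribution (id) and OOD samples from each branch.
id_t, ood_t = np.array([0.9, 0.4, 0.8]), np.array([0.5, 0.3])
id_v, ood_v = np.array([0.3, 0.9, 0.7]), np.array([0.4, 0.2])
lam = 0.5  # hypothetical fusion weight
id_s = lam * id_t + (1 - lam) * id_v
ood_s = lam * ood_t + (1 - lam) * ood_v

for name, (i, o) in {"S_t only": (id_t, ood_t),
                     "S_v only": (id_v, ood_v),
                     "fused S":  (id_s, ood_s)}.items():
    print(f"{name}: AUROC = {auroc(i, o):.2f}")
# Each branch alone misranks one pair; the fused score separates perfectly.
```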
Circularity Check
No significant circularity; framework and claims are self-contained.
Full rationale
The paper defines a dual-branch architecture with separate text-image (S_t) and vision (S_v) branches whose scores are computed after training and then integrated into a final OOD score S. No equations, derivations, or performance claims reduce the reported improvements (up to 24.84%) to a fitted parameter renamed as a prediction, a self-referential definition, or a load-bearing self-citation chain. The central contribution is an empirical architectural proposal evaluated on public datasets, with no steps that collapse by construction to the inputs.