Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA
Pith reviewed 2026-05-10 00:02 UTC · model grok-4.3
The pith
A dual causal method combines backdoor adjustment with instrumental variable learning to remove both visible and hidden biases from medical visual question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DCI is the first unified architecture that integrates Backdoor Adjustment and Instrumental Variable learning to jointly tackle both observable and unobserved confounders in MedVQA, extracting deconfounded representations that capture genuine causal relationships through a structural causal model and mutual information constraints.
What carries the argument
The structural causal model that applies backdoor adjustment to observable cross-modal biases while learning a valid instrumental variable from a shared latent space under mutual information constraints.
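For orientation, the backdoor adjustment that BDA instantiates has a standard form (Pearl [14]). The paper's exact parameterization is not given in this excerpt, so the variables below (V for the image, Q for the question, A for the answer, C for the observable cross-modal confounder) are illustrative notation only:

```latex
% Generic backdoor adjustment: intervene on the inputs and
% stratify over the observable confounder C instead of conditioning on it.
P\big(A \mid \mathrm{do}(V, Q)\big) = \sum_{c} P\big(A \mid V, Q, C = c\big)\, P(c)
```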
Load-bearing premise
The mutual information constraints are sufficient to produce an instrumental variable that is independent of unobserved confounders while still informative for the multimodal representations.
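Spelled out, the premise is that the MI penalties stand in for the three textbook IV conditions. With Z the learned instrument, X the fused multimodal representation, U the unobserved confounder, and A the answer (notation assumed here, not taken from the paper):

```latex
% (i) relevance, (ii) independence from confounders, (iii) exclusion restriction
I(Z; X) > 0, \qquad Z \perp\!\!\!\perp U, \qquad Z \perp\!\!\!\perp A \mid (X, U)
```

Note that penalizing the unconditional dependence between Z and A, as the abstract describes, is a stricter surrogate than the conditional exclusion restriction (iii); this gap is one reason a reviewer would ask for an identification argument rather than the penalties alone.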
What would settle it
A stress test on a dataset with deliberately injected unobserved confounders: if the mutual information constraints fail to isolate the injected confounders and performance falls to the level of non-causal baselines, the deconfounding claim is refuted.
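A minimal sketch of such an injection protocol, assuming list-of-dict datasets with numpy HWC images and string answers (all field names hypothetical): plant a visual shortcut correlated with one answer class during training, then remove it at test time and compare against non-causal baselines.

```python
import random

def inject_confounder(sample, corr):
    """With probability `corr`, stamp a bright patch on images whose answer
    is 'yes', creating a spurious visual-answer co-occurrence."""
    img, answer = sample["image"], sample["answer"]
    if answer == "yes" and random.random() < corr:
        img = img.copy()
        img[:16, :16, :] = 255  # 16x16 white patch acts as the hidden confounder
    return {**sample, "image": img}

# train_set / test_set: hypothetical lists of {"image", "question", "answer"} dicts
train_biased = [inject_confounder(s, corr=0.9) for s in train_set]  # shortcut present
test_clean   = [inject_confounder(s, corr=0.0) for s in test_set]   # shortcut absent
# A genuinely deconfounded model should lose little accuracy on test_clean;
# a shortcut learner should fall toward the non-causal baselines.
```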
Original abstract
Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
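To make the abstract's dual mechanism concrete, here is a minimal PyTorch-style sketch of how such an objective could be assembled: backdoor adjustment approximated by averaging predictions over a learned confounder dictionary (uniform prior for simplicity), and the instrument regularized by MI bounds. This is one reading of the abstract, not the authors' code; every module name (`fuse`, `instrument`, `head`, the MI estimators) and the `u_proxy` input are assumptions.

```python
import torch
import torch.nn.functional as F

def dci_loss(model, batch, conf_dict, lambdas=(1.0, 0.1, 0.1, 0.1)):
    """One hypothetical DCI-style training objective.
    conf_dict: (K, d) tensor of observable-confounder prototypes,
    e.g., clustered visual-textual co-occurrence features."""
    lam_task, lam_rel, lam_u, lam_a = lambdas
    fused = model.fuse(batch["image"], batch["question"])  # X: fused features (B, d)
    z = model.instrument(fused)                            # Z: learned instrument

    # Backdoor adjustment: average the answer prediction over confounder
    # strata, approximating sum_c P(A|X,c)P(c) with a uniform prior.
    logits = torch.stack(
        [model.head(fused, c.expand(fused.size(0), -1)) for c in conf_dict]
    ).mean(dim=0)
    task = F.cross_entropy(logits, batch["answer"])

    # MI constraints on the instrument (estimator modules are placeholders):
    relevance = -model.mi_lower_bound(z, fused)            # maximize I(Z; X)
    indep_u = model.mi_upper_bound(z, batch["u_proxy"])    # minimize I(Z; U)
    excl_a = model.mi_upper_bound(z, batch["answer_emb"])  # minimize I(Z; A)

    return lam_task * task + lam_rel * relevance + lam_u * indep_u + lam_a * excl_a
```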
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Dual Causal Inference (DCI), a unified framework for Medical Visual Question Answering (MedVQA) that integrates Backdoor Adjustment (BDA) to mitigate observable cross-modal biases and Instrumental Variable (IV) learning to handle unobserved confounders within a Structural Causal Model (SCM). An IV is learned from a shared latent space and regularized via mutual information constraints that aim to maximize dependence on fused multimodal representations while minimizing associations with unobserved confounders and target answers. Experiments on SLAKE, SLAKE-CP, VQA-RAD, and PathVQA report consistent gains over baselines, with emphasis on improved out-of-distribution generalization and interpretability.
Significance. If the deconfounding claims hold, the work would offer a practical route to more reliable MedVQA systems by explicitly separating causal effects from spurious correlations in multimodal medical data, which could improve robustness in clinical settings where unobserved factors such as imaging artifacts or prior knowledge are common.
major comments (2)
- [Abstract and Methods (IV construction)] The mutual information constraints are asserted to guarantee a valid IV satisfying relevance, independence from unobserved confounders, and the exclusion restriction. However, because MI estimation in neural networks is variational and approximate, the manuscript must supply either an identification proof under the SCM or empirical bounds showing that the learned latent satisfies the three IV conditions; without these, the deconfounding guarantee does not necessarily follow from the stated penalties (see the estimator sketch after this list).
- [Experiments section] While outperformance on four benchmarks including OOD splits is reported, the absence of ablations that isolate the incremental contribution of the BDA module versus the IV module, or of statistical tests comparing against strong multimodal baselines, leaves open the possibility that gains arise from other architectural choices rather than the dual causal mechanism.
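One concrete way to supply the empirical bounds requested above is the CLUB estimator the paper already cites [39]: fit a variational q(y|x) and log the resulting MI upper bounds for the offending terms throughout training. A minimal diagonal-Gaussian sketch; the `u_proxy` stand-in for the unobserved confounder and all shapes are assumptions:

```python
import torch
import torch.nn as nn

class CLUB(nn.Module):
    """Contrastive Log-ratio Upper Bound of MI (Cheng et al., ICML 2020).
    Fits q(y|x) as a diagonal Gaussian; est() upper-bounds I(X;Y) by
    E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)]."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim))

    def log_q(self, x, y):  # Gaussian log-likelihood up to an additive constant
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(dim=-1) / 2

    def learning_loss(self, x, y):  # fit q(y|x) by maximum likelihood
        return -self.log_q(x, y).mean()

    def est(self, x, y):  # sampled MI upper bound on a batch
        positive = self.log_q(x, y)                             # paired samples
        negative = self.log_q(x, y[torch.randperm(y.size(0))])  # shuffled pairs
        return (positive - negative).mean()

# IV-validity check: during training, report club_zu.est(z, u_proxy) and
# club_za.est(z, answer_emb); both should stay near zero while a lower
# bound on I(Z; X) stays large, or the instrument is not behaving as claimed.
```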
minor comments (2)
- [Abstract] The abstract states that DCI is 'the first unified architecture' integrating BDA and IV; a brief citation to the closest prior causal MedVQA works would strengthen this positioning.
- [Methods] Notation for the shared latent space and the precise form of the MI terms (e.g., which estimators are used) should be introduced earlier and used consistently in the methods description.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment point-by-point below, indicating the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract and Methods (IV construction): The mutual information constraints are asserted to guarantee a valid IV satisfying relevance, independence from unobserved confounders, and the exclusion restriction. However, because MI estimation in neural networks is variational and approximate, the manuscript must supply either an identification proof under the SCM or empirical bounds showing that the learned latent satisfies the three IV conditions; without these, the deconfounding guarantee does not necessarily follow from the stated penalties.
Authors: We agree that the variational approximation of mutual information means the constraints provide only an empirical proxy rather than a strict guarantee. In the revised manuscript we will add a new subsection (Methods 3.4) containing (i) a proof sketch under the SCM showing how the three MI penalties encourage relevance, independence from U, and exclusion, and (ii) empirical bounds obtained by reporting the estimated MI values together with sensitivity analyses on the learned latent across all four datasets. These additions will make the deconfounding claim more rigorous. revision: yes
- Referee: Experiments section: While outperformance on four benchmarks including OOD splits is reported, the absence of ablations that isolate the incremental contribution of the BDA module versus the IV module, or of statistical tests comparing against strong multimodal baselines, leaves open the possibility that gains arise from other architectural choices rather than the dual causal mechanism.
Authors: We concur that component-wise ablations and statistical significance tests are necessary to attribute gains specifically to the dual causal design. In the revised Experiments section we will add (i) three controlled variants—DCI without BDA, DCI without IV, and full DCI—evaluated on SLAKE, SLAKE-CP, VQA-RAD, and PathVQA (including OOD splits), and (ii) paired statistical tests (McNemar’s test and Wilcoxon signed-rank test with p-values) against the strongest multimodal baselines. The new tables will quantify the incremental benefit of each module. revision: yes
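For reference, the promised significance tests are a few lines with standard scipy/statsmodels calls once per-question correctness vectors are available. The numbers below are random or placeholder stand-ins, not results:

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
# Stand-in 0/1 correctness per test question for DCI and the strongest baseline.
dci = rng.integers(0, 2, size=500)
base = rng.integers(0, 2, size=500)

# McNemar's exact test on the 2x2 agreement table (discordant pairs drive it).
table = [[int(((dci == 1) & (base == 1)).sum()), int(((dci == 1) & (base == 0)).sum())],
         [int(((dci == 0) & (base == 1)).sum()), int(((dci == 0) & (base == 0)).sum())]]
print("McNemar p =", mcnemar(table, exact=True).pvalue)

# Wilcoxon signed-rank test over per-seed accuracies (e.g., 5 training runs).
dci_runs = np.array([0.712, 0.705, 0.718, 0.709, 0.715])   # placeholder accuracies
base_runs = np.array([0.691, 0.688, 0.694, 0.690, 0.693])  # placeholder accuracies
print("Wilcoxon p =", wilcoxon(dci_runs, base_runs).pvalue)
```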
Circularity Check
No significant circularity detected in the derivation chain
Full rationale
The paper formulates an SCM for MedVQA, applies BDA for observable confounders, and introduces an IV from a shared latent space whose validity is enforced by mutual information constraints that maximize dependence on fused representations while minimizing links to confounders and answers. Deconfounded representations are the optimization outcome of this design, but the provided text contains no derivation in which the final causal claims follow merely by construction from what was assumed or fitted. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The central architecture is presented as a novel integration with external benchmark validation, rendering the derivation self-contained rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] A Structural Causal Model accurately captures the observable and unobserved confounding structure in multimodal medical data.
invented entities (1)
- Instrumental variable learned from a shared latent space (no independent evidence)
Forward citations
Cited by 1 Pith paper
- LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering — distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
Reference graph
Works this paper leans on
- [1] S. Lu, Y. Ding, M. Liu, Z. Yin, L. Yin, and W. Zheng, "Multiscale feature extraction and fusion of image and text in VQA," International Journal of Computational Intelligence Systems, vol. 16, no. 1, p. 54, 2023.
- [2] J. Guo, J. Li, D. Li, A. M. Huat Tiong, B. Li, D. Tao, and S. Hoi, "From images to textual prompts: Zero-shot visual question answering with frozen large language models," in CVPR, 2023, pp. 10867–10877.
- [3] A. Vosoughi, S. Deng, S. Zhang, Y. Tian, C. Xu, and J. Luo, "Cross modality bias in visual question answering: A causal view with possible worlds VQA," IEEE TMM, vol. 26, pp. 8609–8624, 2024.
- [4] C. Wu, W. Lin, X. Zhang, Y. Zhang, W. Xie, and Y. Wang, "PMC-LLaMA: Toward building open-source language models for medicine," Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 1833–1843, 2024.
- [5] X. Zhang, C. Wu, Z. Zhao, W. Lin, Y. Zhang, Y. Wang, and W. Xie, "PMC-VQA: Visual instruction tuning for medical visual question answering," arXiv preprint arXiv:2305.10415, 2023.
- [6] S. Eslami, C. Meinel, and G. De Melo, "PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain?" in Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1151–1163.
- [7] P. Li, G. Liu, J. He, Z. Zhao, and S. Zhong, "Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering," in MICCAI. Springer, 2023, pp. 374–383.
- [8] Z. Chen, Y. Du, J. Hu, Y. Liu, G. Li, X. Wan, and T.-H. Chang, "Multi-modal masked autoencoders for medical vision-and-language pre-training," in MICCAI. Springer, 2022, pp. 679–689.
- [9] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, "LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day," NeurIPS, vol. 36, pp. 28541–28564, 2023.
- [10] W. Nie, C. Zhang, D. Song, Y. Bai, K. Xie, and A.-A. Liu, "Chest x-ray image classification: A causal perspective," in MICCAI. Springer, 2023, pp. 25–35.
- [11] Y. Liu, G. Li, and L. Lin, "Cross-modal causal relational reasoning for event-level visual question answering," IEEE TPAMI, vol. 45, pp. 11624–11641, 2022.
- [12] J. Lu, M. Jiang, J. Kong, D. Zhuang, and M. Lu, "Mitigating inherent bias of answer heuristic based frameworks in knowledge-based visual question answering," IEEE TMM, vol. 28, pp. 1744–1755, 2026.
- [13] Z. Wen, M. Tan, Y. Wang, Q. Wu, and Q. Wu, "Enhanced reasoning via multimodal LLMs and collaborative inference," IEEE TMM, vol. 27, pp. 7166–7178, 2025.
- [14] J. Pearl, "Causal inference in statistics: An overview," Statistics Surveys, vol. 3, pp. 96–146, 2009.
- [15] X. Yang, H. Zhang, G. Qi, and J. Cai, "Causal attention for vision-language tasks," in CVPR, 2021, pp. 9847–9857.
- [16] T. Van Sonsbeek, M. M. Derakhshani, I. Najdenkoska, C. G. Snoek, and M. Worring, "Open-ended medical visual question answering through prefix tuning of language models," in MICCAI. Springer, 2023, pp. 726–736.
- [17] W. Lin, Z. Zhao, X. Zhang, C. Wu, Y. Zhang, Y. Wang, and W. Xie, "PMC-CLIP: Contrastive language-image pre-training using biomedical documents," in MICCAI. Springer, 2023, pp. 525–536.
- [18] B. D. Nguyen, T.-T. Do, B. X. Nguyen, T. Do, E. Tjiputra, and Q. D. Tran, "Overcoming data limitation in medical visual question answering," in MICCAI. Springer, 2019, pp. 522–530.
- [19] B. Liu, L.-M. Zhan, and X.-M. Wu, "Contrastive pre-training and representation distillation for medical visual question answering based on radiology images," in MICCAI. Springer, 2021, pp. 210–220.
- [20] P. Li, G. Liu, L. Tan, J. Liao, and S. Zhong, "Self-supervised vision-language pretraining for medical visual question answering," in 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). IEEE, 2023, pp. 1–5.
- [21] H. Pan, S. He, K. Zhang, B. Qu, C. Chen, and K. Shi, "AMAM: An attention-based multimodal alignment model for medical visual question answering," Knowledge-Based Systems, vol. 255, p. 109763, 2022.
- [22] X. Huang and H. Gong, "A dual-attention learning network with word and sentence embedding for medical visual question answering," IEEE Transactions on Medical Imaging, vol. 43, no. 2, pp. 832–845, 2023.
- [23] S. Tascon-Morales, P. Márquez-Neila, and R. Sznitman, "Localized questions in medical visual question answering," in MICCAI. Springer, 2023, pp. 361–370.
- [24] M. A. Shaaban, T. J. Saleem, V. R. K. Papineni, and M. Yaqub, "MOTOR: Multimodal optimal transport via grounded retrieval in medical visual question answering," in MICCAI. Springer, 2025, pp. 459–469.
- [25] D. Zhang, H. Zhang, J. Tang, X.-S. Hua, and Q. Sun, "Causal intervention for weakly-supervised semantic segmentation," in NeurIPS, vol. 33. Curran Associates, Inc., 2020, pp. 655–666.
- [26] Q. Li, M. Liu, R. Chang, W. Nie, S. Bai, and A. Liu, "Multi-label chest x-ray image classification via category disentangled causal learning," IEEE Transactions on Artificial Intelligence, 2025.
- [27] W. Chen, Y. Liu, C. Wang, J. Zhu, G. Li, C.-L. Liu, and L. Lin, "Cross-modal causal representation learning for radiology report generation," IEEE TIP, vol. 34, pp. 2970–2985, 2025.
- [28] C. Zhan, P. Peng, H. Zhang, H. Sun, C. Shang, T. Chen, H. Wang, G. Wang, and H. Wang, "Debiasing medical visual question answering via counterfactual training," in MICCAI. Springer, 2023, pp. 382–393.
- [29] X. Wan, Q. Teng, J. Chen, Y. Lu, D. Yuan, and Z. Liu, "Eliminating language bias for medical visual question answering with counterfactual contrastive training," in MICCAI. Springer, 2025, pp. 194–204.
- [30] L. Cai, H. Fang, N. Xu, and B. Ren, "Counterfactual causal-effect intervention for interpretable medical visual question answering," IEEE Transactions on Medical Imaging, vol. 43, no. 12, pp. 4430–4441, 2024.
- [31] B. Liu, L. Liu, J. Ding, X. Yang, W. Peng, and L. Liu, "CIMB-MVQA: Causal intervention on modality-specific biases for medical visual question answering," Medical Image Analysis, p. 103850, 2025.
- [32] W. Nie, C. Zhang, D. Song, Y. Bai, K. Xie, and A. Liu, "Instrumental variable learning for chest x-ray classification," in 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2023, pp. 4506–4512.
- [33] J. Yuan, X. Ma, R. Xiong, M. Gong, X. Liu, F. Wu, L. Lin, and K. Kuang, "Instrumental variable-driven domain generalization with unobserved confounders," ACM Transactions on Knowledge Discovery from Data, vol. 17, no. 8, pp. 1–21, 2023.
- [34] A. Wu, K. Kuang, R. Xiong, and F. Wu, "Instrumental variables in causal inference and machine learning: A survey," ACM Computing Surveys, vol. 57, no. 11, pp. 1–36, 2025.
- [35] Y. Pan, J. Liu, L. Jin, and Z. Li, "Unbiased visual question answering by leveraging instrumental variable," IEEE TMM, vol. 26, pp. 6648–6662, 2024.
- [36] K. Tang, J. Huang, and H. Zhang, "Long-tailed classification by keeping the good and removing the bad momentum causal effect," in NeurIPS, vol. 33. Curran Associates, Inc., 2020, pp. 1513–1524.
- [37] J. Pearl, "An introduction to causal inference," The International Journal of Biostatistics, vol. 6, no. 2, p. 7, 2010.
- [38] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML. PMLR, 2015, pp. 2048–2057.
- [39] P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin, "CLUB: A contrastive log-ratio upper bound of mutual information," in ICML. PMLR, 2020, pp. 1779–1788.
- [40] J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman, "A dataset of clinically generated visual questions and answers about radiology images," Scientific Data, vol. 5, no. 1, pp. 1–10, 2018.
- [41] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, and X.-M. Wu, "SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering," in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE, 2021, pp. 1650–1654.
- [42] X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie, "PathVQA: 30000+ questions for medical visual question answering," arXiv preprint arXiv:2003.10286, 2020.
- [43] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," arXiv preprint arXiv:1711.05101, 2017.
- [44] J. Chen, D. Yang, Y. Jiang, Y. Lei, and L. Zhang, "MISS: A generative pre-training and fine-tuning approach for Med-VQA," in International Conference on Artificial Neural Networks. Springer, 2024, pp. 299–313.
- [45] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in ICCV, 2017, pp. 618–626.