pith. sign in

arxiv: 2603.11689 · v2 · pith:EW2TIK67new · submitted 2026-03-12 · 💻 cs.AI

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

Pith reviewed 2026-05-21 11:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords Multimodal Large Language ModelsZero-Shot TasksExplicit ReasoningModel ValidationConsistency RateVisual Question AnsweringReferring Expression ComprehensionTrustworthy AI
0
0 comments X

The pith

An explicit logic channel running parallel to multimodal models validates their zero-shot outputs and boosts accuracy by checking consistency against visual evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an Explicit Logic Channel that operates alongside black-box multimodal large language models to validate and improve performance on new tasks without ground-truth labels. This channel uses a large language model, visual foundation model, and probabilistic inference to carry out factual, counterfactual, and relational reasoning directly on explicit visual evidence. A Consistency Rate then measures agreement between the explicit channel and the model's implicit processing, enabling model selection and cross-channel integration. Experiments on visual question answering and referring expression comprehension tasks across multiple benchmarks and models show gains in accuracy along with added explainability from the visual grounding.

Core claim

The frontier MLLM can be treated as an Implicit Logic Channel, while the proposed Explicit Logic Channel mimics human logical reasoning by incorporating an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over explicit visual evidence. A Consistency Rate enables cross-channel validation and model selection even without ground-truth annotations, and cross-channel integration further improves performance in zero-shot tasks while grounding outputs in explicit visual evidence for greater trustworthiness.

What carries the argument

The Explicit Logic Channel, which performs factual, counterfactual, and relational reasoning over explicit visual evidence using an LLM, VFM, and probabilistic inference in parallel with the MLLM's implicit channel.

If this is right

  • Consistency Rate enables selection among MLLMs for zero-shot tasks without requiring ground-truth annotations.
  • Cross-channel integration raises accuracy on multiple-choice visual question answering and human-centric referring expression comprehension.
  • Explicit visual evidence from the logic channel increases explainability and trustworthiness of MLLM decisions.
  • The approach applies across eleven open-source models from four frontier families on three challenging benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency-checking idea could be tested on additional multimodal tasks such as image captioning or visual navigation to check broader applicability.
  • Divergences between channels could serve as automatic signals for targeted retraining or prompt adjustment of the underlying MLLM.
  • The method might support real-time deployment if the explicit channel can be distilled into a faster module while preserving the validation benefit.

Load-bearing premise

The Explicit Logic Channel can accurately carry out factual, counterfactual, and relational reasoning over explicit visual evidence to produce consistency checks that meaningfully align with the MLLM's implicit outputs.

What would settle it

Running the proposed channel on the same MC-VQA and HC-REC benchmarks and finding that consistency rate fails to predict actual accuracy gains or that integration yields no performance lift would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.11689 by Hui Li Tan, Liyuan Li, Mei Chee Leong, Nancy Chen, Ying Gu.

Figure 1
Figure 1. Figure 1: Illustration of Explicit Logic Channel (ELC) for MLLM model validation and selection (upper) and performance enhancement (lower) for novel VLC application in zero-shot setting without the need of ground-truth annotation. In ELC, we combine LLM, VFM and Logical reasoning to derive a prediction on explicit and concrete visual evidence and logical validation for the VLC task. to be able to identify reliable m… view at source ↗
Figure 2
Figure 2. Figure 2: ELC (Explicit Logic Channel) and logic consistency for MC-VQA task on factual and counter-factual reasoning, with an example from NegBench for illustration. the question with a set of candidate captions T = [T1, · · · , TK], where the model selects the most suitable caption (cˆ ∈ [1, K]). It can be formulated as \hat {c} = \arg \max _{c \in [1,K]} P_M(c|I,T,\mathcal {A}) \quad \textrm {or} \quad \hat {c} =… view at source ↗
Figure 3
Figure 3. Figure 3: ELC (Explicit Logic Channel) and logic consistency for HC-REC task on object association, with an example from HC-RefCOCOg for illustration. where the geometric mean is used for normalization as done in [46]. The ELC selects the choice with the maximum posterior PLR(T|I) as the correct answer. For validation, the two channels are run independently, and CR is then com￾puted. With a small set of logical cons… view at source ↗
Figure 4
Figure 4. Figure 4: ELC (Explicit Logic Channel) and logic consistency for task of HC-REC on long context text query, with an example from HC-RefLoCo for illustration. is not closely related and may incorrectly draw attention to every person in I. The third (S3) provides the core description that uniquely identifies the target H. The fourth (S4) provides secondary but relevant information. Following the recommendations in [12… view at source ↗
Figure 1
Figure 1. Figure 1: Correlation between CR scores and CRgt/CR ratios. the description ‘+ label.rstrip(‘.’) + ’ and this image? The output is just value number using the same format.” B Logic Consistency and True Correction In principle, when the ILC and ELC predictions are logically consistent, the prediction is highly likely to be correct. To verify this assumption, we compute CRgt on samples that are both consistent and cor… view at source ↗
Figure 2
Figure 2. Figure 2: Inconsistent examples from NegBench, where the last red text under each example indicates the cause of error on explicit visual evidence [PITH_FULL_IMAGE:figures/full_fig_p024_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Inconsistent examples from HC-RefCOCOg [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Inconsistent examples from HC-RefLoCo. E.1 Examples from NegBench Four inconsistent examples from NegBench are presented in [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗
read the original abstract

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solution to new tasks in a black-box manner. Validating and understanding the behavior of these models become important for application to new task. We propose an Explicit Logic Channel, in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates a LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments conducted for two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an Explicit Logic Channel (ELC) running in parallel with the Implicit Logic Channel of frontier MLLMs. ELC combines an LLM, a Visual Foundation Model, and probabilistic inference to perform factual, counterfactual, and relational reasoning over explicit visual evidence. A Consistency Rate (CR) is introduced for cross-channel validation and model selection in zero-shot settings without ground-truth labels. The authors report that cross-channel integration yields performance gains on MC-VQA and HC-REC tasks across three benchmarks and 11 open-source MLLMs from four families, while improving explainability and trustworthiness.

Significance. If ELC reasoning is shown to be accurate and sufficiently independent of the MLLM channel, the approach would supply a practical, annotation-free method for validating and improving zero-shot MLLM behavior together with explicit visual grounding. The scale of the experimental evaluation (11 models, two tasks, three benchmarks) is a clear strength and supports claims of broad applicability.

major comments (3)
  1. [§3] §3 (ELC construction): The claim that probabilistic inference produces reliable factual/counterfactual/relational outputs rests on an unverified assumption that the LLM+VFM component does not inherit the same zero-shot failure modes as the MLLM; no diagnostic is provided to measure or correct for such correlation.
  2. [§5.1] §5.1 (CR definition and use): CR is presented as a proxy for correctness without ground truth, yet the manuscript contains no held-out evaluation (even on a small annotated subset) demonstrating that high CR predicts actual accuracy rather than joint error; this directly affects the validity of both the validation and the model-selection claims.
  3. [§5.3] §5.3 (cross-channel integration results): Reported performance lifts from ELC+MLLM fusion are not accompanied by an ablation that isolates whether gains arise from complementary correct reasoning or from correlated mistakes; without this, the trustworthiness benefit cannot be substantiated.
minor comments (2)
  1. [§3.2] Notation for the probabilistic inference step inside ELC is introduced without an explicit equation or pseudocode; adding one would improve reproducibility.
  2. [Table 2] Table 2 (model-selection results): the CR threshold used for selection is stated but its sensitivity is not reported; a short sensitivity plot would strengthen the claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened. We address each major comment below and will incorporate the suggested analyses into a revised version of the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (ELC construction): The claim that probabilistic inference produces reliable factual/counterfactual/relational outputs rests on an unverified assumption that the LLM+VFM component does not inherit the same zero-shot failure modes as the MLLM; no diagnostic is provided to measure or correct for such correlation.

    Authors: We acknowledge that the original manuscript does not include an explicit diagnostic measuring the correlation between failure modes of the ELC (LLM+VFM+probabilistic inference) and the MLLM. The ELC is constructed to operate on explicit visual evidence and logical rules, providing a distinct reasoning pathway from the MLLM's implicit channel. To address this concern directly, we will add a new analysis in the revised manuscript that quantifies error overlap and independence across the 11 models and benchmarks, including joint error rates and a discussion of how explicit evidence reduces shared vulnerabilities. revision: yes

  2. Referee: [§5.1] §5.1 (CR definition and use): CR is presented as a proxy for correctness without ground truth, yet the manuscript contains no held-out evaluation (even on a small annotated subset) demonstrating that high CR predicts actual accuracy rather than joint error; this directly affects the validity of both the validation and the model-selection claims.

    Authors: The referee correctly identifies that validating CR as a reliable proxy requires evidence beyond its logical definition. While CR is designed as an annotation-free measure of cross-channel agreement, we agree that a direct check against ground truth is needed to confirm it predicts accuracy rather than joint errors. In the revision, we will add a held-out evaluation on a small annotated subset from one benchmark, reporting metrics such as the accuracy of high-CR selections versus random or low-CR baselines to substantiate the validation and model-selection claims. revision: yes

  3. Referee: [§5.3] §5.3 (cross-channel integration results): Reported performance lifts from ELC+MLLM fusion are not accompanied by an ablation that isolates whether gains arise from complementary correct reasoning or from correlated mistakes; without this, the trustworthiness benefit cannot be substantiated.

    Authors: We agree that the current results do not isolate the source of the observed performance improvements in the ELC+MLLM fusion. To substantiate the trustworthiness claims, the revised §5.3 will include an ablation that categorizes outcomes into agreement and disagreement cases between the two channels. We will specifically measure gains in disagreement cases (where one channel is correct) to demonstrate benefits from complementary reasoning, while also reporting joint-error scenarios to rule out correlated mistakes as the primary driver. revision: yes

Circularity Check

0 steps flagged

No significant circularity; ELC and CR are independently proposed components evaluated via experiments

full rationale

The paper introduces the Explicit Logic Channel (ELC) as a new parallel mechanism incorporating LLM, VFM, and probabilistic inference for factual/counterfactual/relational reasoning over explicit visual evidence, alongside the Consistency Rate (CR) as a cross-channel validation proxy without ground-truth. These are not derived by fitting parameters to a subset of data and then relabeling related outputs as predictions, nor do they reduce to self-definitional loops or load-bearing self-citations. The central claims rest on systematic experiments across MC-VQA and HC-REC tasks with 11 MLLMs on three benchmarks, providing external falsifiability through performance metrics and explainability gains rather than tautological equivalence to inputs. Potential concerns about error correlation between channels pertain to empirical validity, not derivation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim relies on the effectiveness of the proposed ELC and CR, which introduce new conceptual entities and assumptions about reasoning capabilities.

axioms (1)
  • domain assumption Probabilistic inference can effectively model logical reasoning for factual, counterfactual, and relational tasks over visual evidence.
    Invoked in the description of the Explicit Logic Channel.
invented entities (2)
  • Explicit Logic Channel no independent evidence
    purpose: To perform explicit logical reasoning in parallel with the MLLM for validation and enhancement.
    Newly proposed component without external validation mentioned.
  • Implicit Logic Channel no independent evidence
    purpose: To refer to the black-box MLLM encapsulating latent knowledge.
    Conceptual entity introduced for contrast.

pith-pipeline@v0.9.0 · 5784 in / 1432 out tokens · 43354 ms · 2026-05-21T11:05:02.254222+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 11 internal anchors

  1. [1]

    Akoglu, H.: User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine18(3), 91–93 (2018).https://doi.org/https://doi.org/10.1016/ j.tjem.2018.08.001,https://www.sciencedirect.com/science/article/pii/ S2452247318302164

  2. [2]

    In: CVPR-25 (2025)

    Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P., Kim, Y., Ghassemi, M.: Vision-Language Models Do Not Understand Negation. In: CVPR-25 (2025)

  3. [3]

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (2023),https://arxiv.org/abs/2308.12966

  4. [4]

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond (2023) 28 Leong, et al

  5. [5]

    Bai, Z., Wang, P., Xiao, T., He, T., Han, Z., Zhang, Z., Shou, M.Z.: Hallucination of multimodal large language models: A survey (2025),https://arxiv.org/abs/ 2404.18930

  6. [6]

    Nat Commun11(1753) (2020).https://doi.org/https://doi

    Balsdon, T., Wyart, V., Mamassian, P.: Confidence controls perceptual evidence accumulation. Nat Commun11(1753) (2020).https://doi.org/https://doi. org/10.1038/s41467-020-15561-w

  7. [7]

    In: Hans- son, S.O., Hendricks, V.F

    Bradley, R.: Decision Theory: A Formal Philosophical Introduction. In: Hans- son, S.O., Hendricks, V.F. (eds.) Introduction to Formal Philosophy, pp. 611–655. Springer International Publishing (2018)

  8. [8]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, C., Anjum, S., Gurari, D.: Grounding answers for visual questions asked by visually impaired people. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 19098–19107 (June 2022)

  9. [9]

    In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision

    Chen, C., Anjum, S., Gurari, D.: Vqa therapy: Exploring answer differences by visually grounding answers. In: Proceedings of the IEEE/CVF International Con- ference on Computer Vision. pp. 15315–15325 (2023)

  10. [10]

    Electronics13(11), 2061 (2024)

    Chen, Y., Su, L., Chen, L., Lin, Z.: Lcv2: a universal pretraining-free framework for grounded visual question answering. Electronics13(11), 2061 (2024)

  11. [11]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., Dai, J.: InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 24185–24198 (June 2024)

  12. [12]

    ArXivabs/2512.10791 (2025),https://api.semanticscholar.org/CorpusID:283737312

    Cheng, A., Jacovi, A., Globerson, A., Golan, B., Kwong, C., Alberti, C., Tao, C., Ben-David, E., Tomar, G.S., Haas, L., Bitton, Y., Bloniarz, A.E., Bai, A., Wang, A., Siddiqui, A., Castillo, A.B., Atias, A., Liu, C., Fry, C., Balle, D., Ghosal, D., Kukliansky, D., Marcus, D., Gribovskaya, E., Ofek, E.O., Zhuang, H., Laish, I., Ackermann, J., Wang, L., Ris...

  13. [13]

    Cheng, F., Li, H., Liu, F., van Rooij, R., Zhang, K., Lin, Z.: Empowering LLMs with Logical Reasoning: A Comprehensive Survey (2025),https://arxiv.org/ abs/2502.15652

  14. [14]

    Dang,Y.,Huang,K.,Huo,J.,Yan,Y.,Huang,S.,Liu,D.,Gao,M.,Zhang,J.,Qian, C., Wang, K., Liu, Y., Shao, J., Xiong, H., Hu, X.: Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey (2024),https:// arxiv.org/abs/2412.02104

  15. [15]

    In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion

    Di, S., Xie, W.: Grounded question-answering in long egocentric videos. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 12934–12943 (2024)

  16. [16]

    In: The Thirteenth International Confer- ence on Learning Representations (2025),https://openreview.net/forum?id= 7PGluppo4k

    Diego, C., Stefano, T., Antonio, V.: Logically Consistent Language Mod- els via Neuro-Symbolic Integration. In: The Thirteenth International Confer- ence on Learning Representations (2025),https://openreview.net/forum?id= 7PGluppo4k

  17. [17]

    Batch-ICL: Effective, efficient, and order-agnostic in-context learning

    Feng, J., Xu, R., Hao, J., Sharma, H., Shen, Y., Zhao, D., Chen, W.: Lan- guage Models can be Deductive Solvers. In: Duh, K., Gomez, H., Bethard, S. Supplementary Material 29 (eds.) Findings of the Association for Computational Linguistics: NAACL 2024. pp. 4026–4042. Association for Computational Linguistics, Mexico City, Mex- ico (June 2024).https://doi....

  18. [18]

    In: The Thirteenth International Confer- ence on Learning Representations (2025),https://openreview.net/forum?id= fAAaT826Vv

    Feng, Y., Zhou, B., Lin, W., Roth, D.: BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models. In: The Thirteenth International Confer- ence on Learning Representations (2025),https://openreview.net/forum?id= fAAaT826Vv

  19. [19]

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., He, R.: Mme: A comprehensive evaluation benchmark for multimodal large language models (2025),https://arxiv.org/ abs/2306.13394

  20. [20]

    Fu, C., Zhang, Y.F., Yin, S., Li, B., Fang, X., Zhao, S., Duan, H., Sun, X., Liu, Z., Wang, L., Shan, C., He, R.: MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs (2024),https://arxiv.org/abs/2411.15296

  21. [21]

    Gemini Team: Gemini: A Family of Highly Capable Multimodal Models (2025), https://arxiv.org/abs/2312.11805

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Han, Z., Zhu, F., Lao, Q., Jiang, H.: Zero-shot referring expression comprehen- sion via structural similarity between images and captions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14364– 14374 (2024)

  23. [23]

    In: International Conference on Learning Representations (2025),https : / / api

    Huang, J., Li, Z., Jacobs, D., Naik, M., Lim, S.N.: Laser: A neuro-symbolic framework for learning spatio-temporal scene graphs with weak supervision. In: International Conference on Learning Representations (2025),https : / / api . semanticscholar.org/CorpusID:258179880

  24. [24]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,

    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., Liu, T.: A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on In- formation Systems43(2), 1–55 (Jan 2025).https://doi.org/10.1145/3703155, http://dx.doi.org/10.1145/3703155

  25. [25]

    In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual rea- soning and compositional question answering. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6693–6702 (2019). https://doi.org/10.1109/CVPR.2019.00686

  26. [26]

    Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., Duerig, T.: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (2021),https://arxiv.org/abs/2102.05918

  27. [27]

    Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.R., Lachaux, M.A., Stock, P., Scao, T.L., Lavril, T., Wang, T., Lacroix, T., Sayed, W.E.: Mistral 7B (2023),https://arxiv.org/abs/2310.06825

  28. [28]

    In: Leonardis, A., Ricci, E., Roth, S., Rus- sakovsky, O., Sattler, T., Varol, G

    Kamath, A., Hsieh, C.Y., Chang, K.W., Krishna, R.: The hard positive truth about vision-language compositionality. In: Leonardis, A., Ricci, E., Roth, S., Rus- sakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. pp. 37–54. Springer Nature Switzerland, Cham (2025)

  29. [29]

    Derf: Decomposed radiance fields,

    Khan, A.U., Kuehne, H., Duarte, K., Gan, C., Lobo, N., Shah, M.: Found a rea- son for me? weakly-supervised grounded visual question answering using capsules. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8461–8470 (2021).https://doi.org/10.1109/CVPR46437.2021. 00836 30 Leong, et al

  30. [30]

    In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

    Khan, A.U., Kuehne, H., Gan, C., Lobo, N.D.V., Shah, M.: Weakly supervised grounding for vqa in vision-language transformers. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. pp. 652–670. Springer Nature Switzerland, Cham (2022)

  31. [31]

    In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition (CVPR)

    Khan, Z., Fu, Y.: Consistency and uncertainty: Identifying unreliable responses from black-box vision-language models for selective visual question answering. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition (CVPR). pp. 10854–10863 (June 2024)

  32. [32]

    Kil, J., Mai, Z., Lee, J., Wang, Z., Cheng, K., Wang, L., Liu, Y., Chowdhury, A., Chao, W.L.: Mllm-compbench: A comparative reasoning benchmark for multi- modal llms (2025),https://arxiv.org/abs/2407.16837

  33. [33]

    In: International Conference on Learning Representations (2023)

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: International Conference on Learning Representations (2023)

  34. [34]

    In: Advances in Neural Infor- mation Processing Systems

    Li, B., Lin, Z., Peng, W., Nyandwi, J.d.D., Jiang, D., Ma, Z., Khanuja, S., Krishna, R., Neubig, G., Ramanan, D.: Naturalbench: Evaluating vision- language models on natural adversarial samples. In: Advances in Neural Infor- mation Processing Systems. vol. 37, pp. 17044–17068. Curran Associates, Inc. (2024),https : / / proceedings . neurips . cc / paper _...

  35. [35]

    Li, J., Lu, W., Fei, H., Luo, M., Dai, M., Xia, M., Jin, Y., Gan, Z., Qi, D., Fu, C., Tai, Y., Yang, W., Wang, Y., Wang, C.: A survey on benchmarks of multimodal large language models (2024),https://arxiv.org/abs/2408.08632

  36. [36]

    In: Chaud- huri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S

    Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping Language-Image Pre- training for Unified Vision-Language Understanding and Generation. In: Chaud- huri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Pro- ceedings of the 39th International Conference on Machine Learning. Proceed- ings of Machine Learning Research, vol. 162, p...

  37. [37]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, Y., Wang, X., Xiao, J., Ji, W., Chua, T.S.: Invariant grounding for video ques- tion answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2928–2937 (2022)

  38. [38]

    Li, Z., Huang, J., Naik, M.: Scallop: A Language for Neurosymbolic Programming. Proc. ACM Program. Lang7, 166:1–166:25 (2023)

  39. [39]

    Li, Z., Huang, J., Naik, M.: Scallop: A language for neurosymbolic programming (2023),https://arxiv.org/abs/2304.04812

  40. [40]

    In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence (CVPR) Workshops

    Li, Z., Wu, X., Du, H., Liu, F., Nghiem, H., Shi, G.: A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Chal- lenges. In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence (CVPR) Workshops. pp. 1587–1606 (June 2025)

  41. [41]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops

    Li, Z., Wu, X., Du, H., Liu, F., Nghiem, H., Shi, G.: A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges. In: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops. pp. 1587–1606 (June 2025)

  42. [42]

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual Instruction Tuning (2023),https:// arxiv.org/abs/2304.08485

  43. [43]

    Advances in neural information processing systems36, 34892–34916 (2023) Supplementary Material 31

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) Supplementary Material 31

  44. [44]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., Lin, D.: Mmbench: Is your multi-modal model an all-around player? (2024),https://arxiv.org/abs/2307.06281

  45. [45]

    CoRRabs/2509.11862(2025).https: //doi.org/10.48550/ARXIV.2509.11862,https://doi.org/10.48550/arXiv

    Ma, H., Pathak, V., Wang, D.Z.: Bridging vision language models and symbolic grounding for video question answering. CoRRabs/2509.11862(2025).https: //doi.org/10.48550/ARXIV.2509.11862,https://doi.org/10.48550/arXiv. 2509.11862

  46. [46]

    Uncertainty estimation in autoregressive structured prediction, 2021

    Malinin, A., Gales, M.: Uncertainty estimation in autoregressive structured pre- diction. arXiv preprint arXiv:2002.07650 (2020)

  47. [47]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016)

  48. [48]

    Muhammad Ahsan Awais Muhammad Ahsan Awais, Peter Redmond, T.E.W., Healy, G.: Amber: advancing multimodal brain-computer interfaces for enhanced robustness—a dataset for naturalistic settings. Front. Neuroergonomics4(2023). https://doi.org/10.3389/fnrgo.2023.1216440

  49. [49]

    In: Bouamor, H., Pino, J., Bali, K

    Olausson, T., Gu, A., Lipkin, B., Zhang, C., Solar-Lezama, A., Tenenbaum, J., Levy, R.: LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 5153–5176. Associati...

  50. [50]

    Pournemat, M., Rezaei, K., Sriramanan, G., Zarei, A., Fu, J., Wang, Y., Egh- balzadeh, H., Feizi, S.: Reasoning under uncertainty: Exploring probabilistic rea- soning capabilities of llms (2025),https://arxiv.org/abs/2509.10739

  51. [51]

    In: Meila, M., Zhang, T

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Trans- ferable Visual Models From Natural Language Supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Res...

  52. [52]

    Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models.Artificial Intelligence Review, 59(2):70, January 2026

    Rahman, S.S., Islam, M.A., Alam, M.M., Zeba, M., Rahman, M.A., Chowa, S.S., Raiaan, M.A.K., Azam, S.: Hallucination to truth: a review of fact-checking and factuality evaluation in large language models. Artificial Intelligence Review59(2) (Jan 2026).https://doi.org/10.1007/s10462-025-11454-w,http://dx.doi. org/10.1007/s10462-025-11454-w

  53. [53]

    Reich, D., Putze, F., Schultz, T.: Visually grounded vqa by lattice-based retrieval (2022),https://arxiv.org/abs/2211.08086

  54. [54]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen

    Reich, D., Putze, F., Schultz, T.: Measuring faithful and plausible visual grounding in VQA. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 3129–3144. Association for Compu- tational Linguistics, Singapore (Dec 2023).https://doi.org/10.18653/v1/2023. findings-emnlp.206,https://aclantho...

  55. [55]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Shen, H., Zhao, T., Zhu, M., Yin, J.: Groundvlp: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detec- tion. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4766–4775 (2024)

  56. [56]

    arXiv preprint arXiv:2402.04252 (2023) 32 Leong, et al

    Sun, Q., Wang, J., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, X.: Eva-clip-18b: Scaling clip to 18 billion parameters. arXiv preprint arXiv:2402.04252 (2023) 32 Leong, et al

  57. [57]

    Team, G.: Gemma 3 (2025),https://goo.gle/Gemma3Report

  58. [58]

    Team, Q.: Qwen3 technical report (2025),https://arxiv.org/abs/2505.09388

  59. [59]

    Wang, J., Jiang, H., Liu, Y., Ma, C., Zhang, X., Pan, Y., Liu, M., Gu, P., Xia, S., Li, W., Zhang, Y., Wu, Z., Liu, Z., Zhong, T., Ge, B., Zhang, T., Qiang, N., Hu, X., Jiang, X., Zhang, X., Zhang, W., Shen, D., Liu, T., Zhang, S.: A comprehensive review of multimodal large language models: Performance and challenges across different tasks (2024),https://...

  60. [60]

    Frontiers in Artificial IntelligenceV olume 5 - 2022(2022).https://doi

    Wang, P., Hahm, C., Hammer, P.: A Model of Unified Perception and Cogni- tion. Frontiers in Artificial IntelligenceV olume 5 - 2022(2022).https://doi. org/10.3389/frai.2022.806403,https://www.frontiersin.org/journals/ artificial-intelligence/articles/10.3389/frai.2022.806403

  61. [61]

    2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp

    Wang, T., Lin, K., Li, L., Lin, C.C., Yang, Z., Zhang, H., Liu, Z., Wang, L.: Equivariant similarity for vision-language foundation models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 11964–11974 (2023), https://api.semanticscholar.org/CorpusID:257766245

  62. [62]

    Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models (2023),https://arxiv.org/abs/2203.11171

  63. [63]

    Wang, Y., Wang, M., Manzoor, M.A., Liu, F., Georgiev, G., Das, R.J., Nakov, P.: Factuality of large language models: A survey (2024),https://arxiv.org/abs/ 2402.02420

  64. [64]

    In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C

    Wei,F.,Zhao,J.,Yan,K.,Zhang,H.,Xu,C.:ALarge-ScaleHuman-CentricBench- mark for Referring Expression Comprehension in the LMM Era. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Ad- vances in Neural Information Processing Systems. vol. 37, pp. 69566–69587. Curran Associates, Inc. (2024)

  65. [65]

    In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR)

    Xiao, J., Yao, A., Li, Y., Chua, T.S.: Can i trust your answer? visually grounded video question answering. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR). pp. 13204–13214 (June 2024)

  66. [66]

    Xiao, L., Yang, X., Lan, X., Wang, Y., Xu, C.: Towards Visual Grounding: A Survey (2024),https://arxiv.org/abs/2412.20206

  67. [67]

    IEEE Trans- actions on Multimedia26, 3331–3340 (2024)

    Yang, X., Liu, F., Lin, G.: Neural Logic Vision Language Explainer. IEEE Trans- actions on Multimedia26, 3331–3340 (2024)

  68. [68]

    The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023),https://arxiv. org/abs/2309.17421

  69. [69]

    In: Proceedings of the 37th International Conference on Neural Infor- mation Processing Systems

    Yarom, M., Bitton, Y., Changpinyo, S., Aharoni, R., Herzig, J., Lang, O., Ofek, E., Szpektor, I.: What you see is what you read? improving text-image alignment evaluation. In: Proceedings of the 37th International Conference on Neural Infor- mation Processing Systems. NIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

  70. [70]

    National Science Review11(12) (2024).https://doi.org/ 10.1093/nsr/nwae403,http://dx.doi.org/10.1093/nsr/nwae403

    Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12) (2024).https://doi.org/ 10.1093/nsr/nwae403,http://dx.doi.org/10.1093/nsr/nwae403

  71. [71]

    In: European conference on computer vision

    Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in refer- ring expressions. In: European conference on computer vision. pp. 69–85. Springer (2016)

  72. [72]

    Yuan, J., Peng, T., Jiang, Y., Lu, Y., Zhang, R., Feng, K., Fu, C., Chen, T., Bai, L., Zhang, B., Yue, X.: Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms (2025),https://arxiv.org/abs/2505.21327 Supplementary Material 33

  73. [73]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9556– 9567 (2024)

  74. [74]

    Zhao, J.X., Xie, Y., Kawaguchi, K., He, J., Xie, M.Q.: Automatic model selection with large language models for reasoning (2023),https://arxiv.org/abs/2305. 14333

  75. [75]

    Zhao, T.Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate before use: Im- proving few-shot performance of language models (2021),https://arxiv.org/ abs/2102.09690

  76. [76]

    Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.Y., Wen, J.R.: A Survey of Large Language Models (2025),https://arxiv.org/abs/2303.18223

  77. [77]

    Cordts, M

    Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: Grounded question an- swering in images. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4995–5004 (2016).https://doi.org/10.1109/CVPR. 2016.540