
arxiv: 2606.25701 · v1 · pith:25ATGYAZnew · submitted 2026-06-24 · 💻 cs.CV

Falcon: Functional Assembly and Language for Compositional Reasoning in X-ray

Pith reviewed 2026-06-25 20:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords compositional threat reasoning · X-ray baggage screening · structured safety state · multimodal reasoning · functional compatibility · vision-language models · threat assessment

The pith

Falcon injects an explicit structured safety state into language models to reason about relational threats rather than isolated objects in X-ray scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that safety threats in X-ray baggage screening arise from functional relations among dispersed components such as batteries, detonators, and charges, not from detecting any single item. It introduces Falcon to extract segmentation-aware region features, assemble them into a structured safety state that records component presence, pairwise compatibility, and scene risk, and then pass this state as an intermediate layer to the language model. A new benchmark, Falcon-X, supplies dense grounding labels plus structured supervision on completeness and risk inference. Experiments indicate that standard multimodal models handle appearance but fail at the relational part, while Falcon yields better functional grounding and more consistent threat judgments.

Core claim

Falcon abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk; this state is injected into the language model as an explicit intermediate interface to encourage relationally consistent and safety-aware reasoning.

What carries the argument

The structured safety state that encodes component presence, pairwise functional compatibility, and scene-level risk, serving as an explicit intermediate interface between vision features and the language model.
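To make the three parts of the state concrete, here is a minimal sketch of how such a record could be laid out and handed to a language model. Everything in it is an assumption for illustration: the class name `SafetyState`, the field names, the three-component vocabulary (taken from the abstract's examples), and the plain-text serialization. The paper's Figure 5 caption describes structured tokens fused with visual and textual tokens, not a text prompt, so this should be read as a stand-in and not as the authors' interface.

```python
from dataclasses import dataclass, field
from itertools import combinations

# Component names follow the abstract's examples; the layout is hypothetical.
COMPONENTS = ("battery", "detonator", "charge")


@dataclass
class SafetyState:
    """Toy structured safety state: presence, pairwise compatibility, scene risk."""
    presence: dict[str, float]  # component -> probability it is present
    # Unordered pair (stored as a sorted tuple) -> compatibility score.
    compatibility: dict[tuple[str, str], float] = field(default_factory=dict)
    scene_risk: float = 0.0  # scene-level risk in [0, 1]

    def pair_score(self, a: str, b: str) -> float:
        """Look up a pairwise score regardless of argument order."""
        key = tuple(sorted((a, b)))
        return self.compatibility.get(key, 0.0)

    def to_prompt(self) -> str:
        """Serialize the state as plain text (an assumption, see lead-in)."""
        lines = ["[SAFETY_STATE]"]
        for name in COMPONENTS:
            lines.append(f"presence {name}={self.presence.get(name, 0.0):.2f}")
        for a, b in combinations(sorted(COMPONENTS), 2):
            lines.append(f"compat {a}-{b}={self.pair_score(a, b):.2f}")
        lines.append(f"scene_risk={self.scene_risk:.2f}")
        lines.append("[/SAFETY_STATE]")
        return "\n".join(lines)


# Illustrative values only; these are not results from the paper.
state = SafetyState(
    presence={"battery": 0.9, "detonator": 0.8, "charge": 0.1},
    compatibility={("battery", "detonator"): 0.7},
    scene_risk=0.4,
)
print(state.to_prompt())
```

The point of the sketch is only that the state is explicit and inspectable: a reader (or a downstream model) can see which components were judged present and which pairs were judged compatible before any free-text explanation is generated.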

If this is right

  • Existing multimodal models adapt to appearance but struggle with compositional safety reasoning.
  • Falcon improves functional grounding and produces more coherent threat assessments.
  • Compositional safety reasoning becomes a distinct evaluation paradigm for multimodal systems.
  • Risk is modeled as a relational property of grounded regions rather than an independent detection outcome.
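The last point, risk as a relational property, can be illustrated with a toy contrast between an object-centric score and a relational one. Both functions below are hypothetical and are not the paper's risk model; the chain of links (battery to detonator, detonator to charge) is an assumption drawn from the component examples in the abstract.

```python
# Hypothetical toy scoring; neither function is the paper's risk model.
CHAIN = [("battery", "detonator"), ("detonator", "charge")]


def detection_risk(presence: dict[str, float]) -> float:
    """Object-centric baseline: risk is the strongest single detection."""
    return max(presence.values(), default=0.0)


def relational_risk(presence: dict[str, float],
                    links: dict[tuple[str, str], float]) -> float:
    """Relational score: the weakest link along the assumed component chain.

    Each link is capped by the presence of both endpoints and by their
    pairwise compatibility, so one missing part or one implausible pairing
    pulls the whole scene score down.
    """
    link_scores = []
    for a, b in CHAIN:
        compat = links.get((a, b), 0.0)
        link_scores.append(min(presence.get(a, 0.0), presence.get(b, 0.0), compat))
    return min(link_scores)


# Illustrative inputs: all three parts are visible in both scenes.
presence = {"battery": 0.9, "detonator": 0.9, "charge": 0.9}
unrelated = {}  # parts present but no functional links between them
linked = {("battery", "detonator"): 0.8, ("detonator", "charge"): 0.8}

print(detection_risk(presence))              # same for both scenes
print(relational_risk(presence, unrelated))  # low: nothing connects the parts
print(relational_risk(presence, linked))     # high: a complete compatible chain
```

Under these assumptions the two scenes are indistinguishable to the detection-style score, while the relational score separates them, which is the distinction the bullet draws.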

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explicit-state approach could transfer to other settings where function depends on part relations, such as medical device inspection or assembly verification.
  • Falcon-X offers a template for benchmarks that jointly test dense localization and structured relational inference.
  • If the intermediate state proves reliable, it may reduce the need for purely end-to-end training on complex safety tasks.

Load-bearing premise

Abstracting segmentation-aware region features into an explicit structured safety state and injecting it into the language model will encourage relationally consistent and safety-aware reasoning.

What would settle it

A controlled test in which language models receive only raw region features without the explicit safety state yet match Falcon's performance on compositional threat inference tasks using the Falcon-X benchmark.
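One way such a controlled comparison could be scored is sketched below: the same test items are answered by a variant with the safety state and a variant with raw region features only, and the accuracy gap is reported with a paired bootstrap interval. The function names, the binary-label setup, and the toy predictions are all assumptions for illustration; the paper's actual metrics and the Falcon-X data format are not specified here.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def paired_gap(preds_state, preds_raw, labels, n_boot=2000, seed=0):
    """Accuracy gap (with state minus raw features) and a 95% paired bootstrap interval.

    Items are resampled jointly for both variants so that per-item
    difficulty cancels out of the comparison.
    """
    rng = random.Random(seed)
    n = len(labels)
    gap = accuracy(preds_state, labels) - accuracy(preds_raw, labels)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_state = sum(preds_state[i] == labels[i] for i in idx) / n
        acc_raw = sum(preds_raw[i] == labels[i] for i in idx) / n
        samples.append(acc_state - acc_raw)
    samples.sort()
    lo = samples[int(0.025 * n_boot)]
    hi = samples[int(0.975 * n_boot) - 1]
    return gap, lo, hi


# Made-up toy predictions, not results from the paper.
labels = [1, 0, 1, 1, 0, 1, 0, 0]
with_state = [1, 0, 1, 1, 0, 1, 0, 1]
raw_only = [1, 1, 0, 1, 0, 1, 0, 1]
gap, lo, hi = paired_gap(with_state, raw_only, labels)
print(f"gap={gap:.3f}, 95% interval=[{lo:.3f}, {hi:.3f}]")
```

If the interval for the gap comfortably included zero on the real benchmark, that would be the outcome described above: raw features alone matching the structured variant.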

Figures

Figures reproduced from arXiv: 2606.25701 by Andreas Henschel, Mohamad Alansari, Naoufel Werghi, Natnael Takele, Yonathan Michael.

Figure 1: Falcon enables segmentation-aware functional threat reasoning in X-ray imagery. It performs instance grounding, component presence recognition, referring functional grounding and more, while generating grounded and domain-aware natural-language explanations. Falcon reasons over spatially distant components of an IED under heavy clutter, supporting structured threat analysis beyond object detection.
Figure 2: Qualitative comparison of structured threat reasoning.
Figure 3: Overview of the Falcon-X data collection pipeline.
Figure 4: Falcon-X Task Suite. Left: Example of multi-level tasks on Falcon-X base image. Right: Samples of compositional grounding on controlled synthetic Falcon-X images for functional reasoning.
  5 Falcon-X Task Suite. Falcon-X introduces a hierarchical evaluation framework for compositional safety reasoning. The task suite progressively probes three levels of capability: (i) Grounded perception under X-ray superpo…
Figure 5: Falcon architecture. Segmentation-aware perception extracts mask-aligned region embeddings, which are aggregated into structured component slots by the SSA to predict presence, functional links, and scene risk. The resulting structured tokens are fused with visual and textual tokens, introducing a relational perception before LLM decoding.
  6.1 Falcon Architecture. Falcon integrates structured relational sig…
Figure 6: Falcon qualitative results on compositional grounding and scene-level reasoning.
  7.4 Ablations. We conduct controlled ablations to quantify the contribution of (i) structured prediction heads and (ii) perception versus reasoning bottlenecks. All variants are evaluated on the primary task of Referring Functional Grounding (RFG). Additional ablations are reported in Appendix 14. Prediction head ablations. We …
Figure 7: Acquisition setup and collection protocol.
Figure 8: Examples of counterfactual synthetic samples used in Falcon-X. Starting from a real X-ray image (left), controlled variants are generated by removing one functional component using mask-guided inpainting. From left to right, each row shows: the original image, missing battery, missing detonator, and missing main charge variants. The background is filled using a texture sampled from surrounding regions of …
Figure 9: Qualitative examples of Falcon on RFG. Given a functionally constrained query, the model grounds all visible components that could jointly participate in a potential IED assembly.
Figure 10: Qualitative examples of Falcon on domain-specific VQA in X-ray.
Original abstract

Conventional vision-language models are largely object-centric, focusing on detecting and describing individual entities. In safety-critical X-ray baggage screening, however, threat often emerges not from a single object but from the functional compatibility of spatially dispersed components, such as batteries, detonators, and explosive charges. We formalize this setting as compositional threat reasoning, where risk is modeled as a relational property of grounded regions rather than an independent detection outcome. We introduce Falcon, a multimodal framework that abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk. This structured representation is injected into the language model as an explicit intermediate interface, encouraging relationally consistent and safety-aware reasoning. To evaluate this problem, we present Falcon-X, a benchmark that unifies dense grounding with structured supervision over component completeness and risk inference in cluttered X-ray imagery. Experiments show that while existing multimodal models adapt to appearance, they struggle with compositional safety reasoning. Falcon improves functional grounding and produces more coherent threat assessments, establishing compositional safety reasoning as a distinct evaluation paradigm for multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Falcon, a multimodal framework for compositional threat reasoning in X-ray baggage screening. It abstracts segmentation-aware region features into a structured safety state capturing component presence, pairwise functional compatibility, and scene-level risk; this state is injected into the language model as an explicit intermediate interface. The paper presents the Falcon-X benchmark unifying dense grounding with structured supervision over component completeness and risk inference, and reports that Falcon improves functional grounding and produces more coherent threat assessments than existing multimodal models, establishing compositional safety reasoning as a distinct evaluation paradigm.

Significance. If the central mechanism is shown to causally improve relational consistency, the work could meaningfully advance safety-critical multimodal reasoning by shifting focus from object-centric detection to functional assembly and relational risk modeling. The introduction of Falcon-X as a benchmark with structured supervision is a concrete contribution that could support future research in this area.

major comments (1)
  1. [Abstract] Abstract (paragraph describing the framework): the claim that abstracting segmentation-aware region features into an explicit structured safety state (component presence, pairwise compatibility, scene risk) and injecting it 'encourages relationally consistent and safety-aware reasoning' is not supported by any described ablation or isolation experiment. No comparison is provided against a baseline that receives the same region features without the structured state, so it remains unclear whether observed gains on Falcon-X are attributable to the claimed mechanism rather than segmentation quality, benchmark supervision, or other factors.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract claim. We address the point below.

  1. Referee: [Abstract] Abstract (paragraph describing the framework): the claim that abstracting segmentation-aware region features into an explicit structured safety state (component presence, pairwise compatibility, scene risk) and injecting it 'encourages relationally consistent and safety-aware reasoning' is not supported by any described ablation or isolation experiment. No comparison is provided against a baseline that receives the same region features without the structured state, so it remains unclear whether observed gains on Falcon-X are attributable to the claimed mechanism rather than segmentation quality, benchmark supervision, or other factors.

    Authors: We agree that the current manuscript lacks an ablation that directly isolates the contribution of the structured safety state by comparing against a baseline receiving identical segmentation-aware region features without the structured interface. The reported experiments compare Falcon to existing multimodal models but do not rule out that gains could arise from segmentation quality or benchmark supervision alone. In the revised version we will add this ablation experiment to provide causal evidence for the mechanism. revision: yes

Circularity Check

0 steps flagged

No derivations or equations present; framework proposal has no circular derivation chain


The manuscript introduces Falcon as a multimodal framework that abstracts segmentation-aware region features into a structured safety state (component presence, pairwise compatibility, scene risk) and injects it into the language model. The abstract and description contain no equations, no first-principles derivations, no fitted parameters renamed as predictions, and no self-citation chains invoked to justify a mathematical result. The central claim is an empirical assertion that the framework improves functional grounding and threat assessment coherence on the Falcon-X benchmark. This does not reduce by construction to its inputs, nor does it rely on any of the enumerated circularity patterns. The absence of a derivation chain means the circularity score defaults to 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities (beyond the named framework) are identifiable.

pith-pipeline@v0.9.1-grok · 5741 in / 1047 out tokens · 22645 ms · 2026-06-25T20:50:30.033839+00:00 · methodology

