arxiv: 2605.18988 · v1 · pith:KMLQ56I3new · submitted 2026-05-18 · 💻 cs.CR · cs.AI

Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks

Pith reviewed 2026-05-20 09:03 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords multimodal attacks · multi-turn defense · TRIAD framework · survival analysis · adversarial robustness · trajectory dynamics · anomaly detection

The pith

The TRIAD framework models multi-turn multimodal conversations as trajectories to bound expected time until an attack succeeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current static defenses miss attacks that spread malicious intent across many conversation turns and modalities because they treat each input in isolation. It introduces the TRIAD framework to represent the full conversational flow as a continuous trajectory and to track covariance shifts and topological acceleration as signs of accumulating malicious drift. These signals feed into a time-varying Cox survival model through a Bayesian HMM loop to deliver a mathematically bounded prediction of time-to-failure. A sympathetic reader would care because this could let agentic AI systems anticipate progressive poisoning in real time rather than reacting only after damage occurs.

Core claim

The TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively, by mapping multimodal multi-turn flow to a continuous trajectory monitored with structural anomaly detection, Ledoit-Wolf regularized Mahalanobis distance, and topological trajectory acceleration, all integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model feedback loop.

What carries the argument

The Triple-tier Anomaly Defense (TRIAD) framework, which maps conversational flow to a continuous trajectory and integrates covariance-shift monitoring, regularized Mahalanobis distance, and topological acceleration into a Cox proportional-hazards model through a Bayesian HMM feedback loop.
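
As a rough illustration of the distance component named above, the following sketch computes a Ledoit-Wolf shrinkage covariance (the 2004 estimator that shrinks toward a scaled identity) and the resulting Mahalanobis distance. The embedding dimensions, the synthetic `baseline` data, and the function names are assumptions made for illustration; the paper's own feature pipeline is not available here.

```python
import numpy as np

def ledoit_wolf_covariance(X):
    """Ledoit-Wolf (2004) shrinkage of the sample covariance toward mu * I."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                               # sample covariance
    mu = np.trace(S) / p                            # scale of the identity target
    d2 = np.sum((S - mu * np.eye(p)) ** 2) / p      # dispersion of S around the target
    # Average squared distance between per-sample outer products and S.
    outers = Xc[:, :, None] * Xc[:, None, :]
    b2_bar = np.sum((outers - S) ** 2) / (p * n ** 2)
    b2 = min(b2_bar, d2)
    shrinkage = 0.0 if d2 == 0 else b2 / d2
    sigma = shrinkage * mu * np.eye(p) + (1.0 - shrinkage) * S
    return sigma, shrinkage

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from mean under covariance cov."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical benign turn embeddings (synthetic, for illustration only).
rng = np.random.default_rng(0)
baseline = rng.normal(size=(40, 8))
cov, shrinkage = ledoit_wolf_covariance(baseline)
center = baseline.mean(axis=0)
d_near = mahalanobis(center, center, cov)           # a point at the benign centroid
d_far = mahalanobis(center + 5.0, center, cov)      # a point shifted away from it
```

Shrinkage keeps the covariance estimate well conditioned when the number of turns is small relative to the embedding dimension, which is presumably why a regularized estimator is preferred over the raw sample covariance.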

If this is right

  • Detects cumulative structural poisoning across longitudinal trajectories that turn-by-turn Markov guards miss.
  • Supplies a predictive, real-time safeguard for autonomous agentic workflows.
  • Differentiates benign creative exploration from continuous malicious drift using kinematic and geometric features.
  • Supports continuous safety alignment without requiring periodic empirical retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-monitoring idea could extend to tracking intent drift in multi-agent systems where goals evolve over successive exchanges.
  • If the positive divergence property holds, hazard thresholds could trigger automated pauses or clarifications before failure occurs.
  • A natural next test would measure how well the bounds survive when conversations mix more than two modalities or run for dozens of turns.
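
The hazard-threshold idea in the list above can be sketched as follows. This is a minimal toy, assuming unit-length turns, a constant baseline hazard, and hypothetical weights `beta` and covariate vectors; none of these values come from the paper.

```python
import math

def cox_hazard(baseline_hazard, beta, covariates):
    """h(t | z) = h0 * exp(beta . z) for one turn's covariate vector z."""
    linear = sum(b * z for b, z in zip(beta, covariates))
    return baseline_hazard * math.exp(linear)

def monitor(turn_covariates, beta, baseline_hazard=0.01, pause_threshold=0.2):
    """Return (index of the first flagged turn or None, survival after each turn)."""
    cumulative = 0.0
    survival = []
    flagged = None
    for turn, z in enumerate(turn_covariates):
        h = cox_hazard(baseline_hazard, beta, z)
        cumulative += h                      # unit-length turns: the integral becomes a sum
        survival.append(math.exp(-cumulative))
        if flagged is None and h > pause_threshold:
            flagged = turn                   # a caller could pause or ask for clarification here
    return flagged, survival

beta = [1.0, 2.0]                            # hypothetical feature weights
benign = [[0.1, 0.0], [0.2, 0.1], [0.1, 0.0]]
drifting = [[0.1, 0.0], [0.8, 0.6], [1.5, 1.2], [2.0, 1.8]]

benign_flag, benign_survival = monitor(benign, beta)
drift_flag, drift_survival = monitor(drifting, beta)
```

With these toy inputs the benign conversation is never flagged, while the drifting one crosses the threshold at its third turn (index 2).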

Load-bearing premise

The assumption that multimodal multi-turn conversational flow can be usefully represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift.
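
To make the premise concrete, one simple stand-in for a kinematic feature is the discrete second difference of the per-turn embedding sequence. This is an assumption for illustration only: the abstract gives no formal definition of "topological trajectory acceleration", and the toy trajectories below are hypothetical.

```python
import math

def second_differences(points):
    """Discrete acceleration a_t = x_{t+1} - 2 x_t + x_{t-1} at each interior turn."""
    acc = []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        acc.append([n - 2 * c + p for p, c, n in zip(prev, cur, nxt)])
    return acc

def norms(vectors):
    """Euclidean norm of each vector."""
    return [math.sqrt(sum(v * v for v in vec)) for vec in vectors]

# Hypothetical 2-D trajectories of turn embeddings.
steady = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]     # constant step size
speeding = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0]]   # growing step size

steady_acc = norms(second_differences(steady))      # [0.0, 0.0]
speeding_acc = norms(second_differences(speeding))  # [1.0, 1.0]
```

A trajectory that moves at a constant rate has zero discrete acceleration, while one whose steps keep growing does not. Whether that separation survives discrete turn boundaries and modality switches is exactly what the premise leaves open.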

What would settle it

Run controlled multi-turn attacks on an MLLM where malicious content is injected gradually across turns and check whether the predicted time-to-failure bound is violated or whether the model misclassifies clearly benign trajectories as high-hazard.
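
A check of this kind could be scored roughly as below. The sketch assumes the bound is a lower bound on the expected number of turns until failure and that a hazard threshold defines "high-hazard"; both are assumptions, since the abstract states neither. All numbers are hypothetical placeholders, not results.

```python
from statistics import mean

def bound_violated(failure_turns, predicted_lower_bound):
    """True if the mean observed time-to-failure falls below the predicted bound.
    Runs in which the attack never succeeded are right-censored and excluded here."""
    return mean(failure_turns) < predicted_lower_bound

def false_alarm_rate(benign_peak_hazards, hazard_threshold):
    """Fraction of benign conversations whose peak hazard exceeds the threshold."""
    flagged = sum(1 for h in benign_peak_hazards if h > hazard_threshold)
    return flagged / len(benign_peak_hazards)

# Placeholder inputs for illustration only.
observed_failures = [6, 8, 10]
benign_peaks = [0.02, 0.05, 0.3, 0.01]

ok_case = bound_violated(observed_failures, predicted_lower_bound=5)
bad_case = bound_violated(observed_failures, predicted_lower_bound=9)
alarm_rate = false_alarm_rate(benign_peaks, hazard_threshold=0.2)
```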

Figures

Figures reproduced from arXiv: 2605.18988 by Doohee You.

Figure 1: Operational Flow of the TRIAD Framework.

Original abstract

The expansion of Multimodal Large Language Models (MLLMs) and their integration into autonomous agentic workflows has introduced a non-stationary attack surface. Empirical observations indicate that adversaries employ progressive, cross-modal perturbations that evade turn-specific guardrails by distributing malicious intent across longitudinal conversational trajectories. Static defense mechanisms, constrained by the Markov property, evaluate inputs in isolation and fail to detect cumulative structural poisoning. To handle this limitation, this paper formulates safety verification as a dynamic survival prediction and trajectory dynamics problem. The Triple-tier Anomaly Defense (TRIAD) framework is proposed as a predictive model that maps multimodal and multi-turn conversational flow as a continuous trajectory. The framework integrates structural anomaly detection to monitor covariance shifts, a Ledoit-Wolf regularized Mahalanobis distance to monitor covariance shifts in high-dimensional spaces, and topological trajectory acceleration to differentiate benign creative exploration from continuous malicious drift. These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively. This framework provides a computationally efficient, interpretable, and predictive safeguard for real-time agentic AI systems, establishing a rigorous foundation for continuous safety alignment without relying on empirical retraining.
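
The abstract does not spell out the Bayesian HMM feedback loop. As a hedged illustration of what a belief update over hidden conversation states can look like, the sketch below runs a standard forward-filter step for a two-state discrete HMM (state 0 benign, state 1 malicious). The transition matrix, likelihoods, and prior are hypothetical.

```python
def belief_update(prior, transition, likelihoods):
    """One forward-filter step for a discrete HMM.

    prior[i]:         P(state = i | observations so far)
    transition[i][j]: P(next state = j | current state = i)
    likelihoods[j]:   P(new observation | state = j)
    """
    n = len(prior)
    predicted = [sum(prior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnorm = [predicted[j] * likelihoods[j] for j in range(n)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

transition = [[0.95, 0.05], [0.10, 0.90]]   # hypothetical state persistence
belief = [0.99, 0.01]                        # start almost certain the session is benign
history = [belief[1]]
for lik in ([0.2, 0.8], [0.2, 0.8], [0.2, 0.8]):   # three turns that look suspicious
    belief = belief_update(belief, transition, lik)
    history.append(belief[1])
```

Repeated suspicious observations move the posterior toward the malicious state turn by turn. A feedback loop of the kind the abstract describes could pass that posterior to the hazard model as a covariate, though the paper's exact coupling is not given.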

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the TRIAD framework to defend against progressive, cross-modal multi-turn attacks on MLLMs in agentic workflows. It models conversational flows as continuous trajectories, combining structural anomaly detection, a Ledoit-Wolf regularized Mahalanobis distance that tracks covariance shifts, and topological trajectory acceleration that distinguishes benign exploration from malicious drift. These kinematic and geometric features are fed into a time-varying Cox proportional hazards model through a Bayesian HMM feedback loop. The central claim is that this integration yields a mathematically bounded expected time-to-failure under adversarial perturbations, with malicious acceleration diverging positively, providing a predictive, interpretable safeguard without empirical retraining.

Significance. If the claimed bound on expected time-to-failure can be rigorously derived from the feature integration and the continuous-trajectory assumption holds without unmodeled discontinuities, the work would advance dynamic safety verification by adapting survival analysis to non-stationary multimodal attack surfaces, offering an interpretable alternative to static guardrails.

major comments (2)
  1. [Abstract] The assertion that 'Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively' is presented without any derivation, explicit model equations, proof sketch, or integration details showing how the Ledoit-Wolf Mahalanobis and topological acceleration features, when mapped via the Bayesian HMM, enforce the bound or positive divergence in the Cox proportional-hazards form. This is load-bearing for the central claim.
  2. [Framework description (Bayesian HMM feedback loop)] The assumption that multimodal multi-turn flow can be represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift is invoked when mapping inputs to the time-varying Cox model, yet no analysis addresses potential non-stationarities from discrete turn boundaries or cross-modal switches that could invalidate the kinematic features and prevent the claimed bound from following.
minor comments (2)
  1. [Methods] The notation for 'topological trajectory acceleration' and its computation from the continuous trajectory is introduced without a formal definition or pseudocode, hindering reproducibility of the geometric features.
  2. [Evaluation] No empirical validation or simulation results are referenced to support the separation of benign vs. malicious trajectories under the proposed features.
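
For orientation, the explicit model equations the report asks for would presumably start from the textbook form of a Cox model with time-varying covariates. The block below gives that standard form only. Reading $Z(t)$ as the vector of Mahalanobis and acceleration features is an assumption, and these are not the paper's own equations.

```latex
h\big(t \mid Z(t)\big) = h_0(t)\,\exp\!\big(\beta^{\top} Z(t)\big), \qquad
S(t) = \exp\!\Big(-\int_0^{t} h\big(s \mid Z(s)\big)\,ds\Big), \qquad
\mathbb{E}[T] = \int_0^{\infty} S(t)\,dt .
```

Here $h_0$ is the baseline hazard, $S(t)$ the probability that no failure has occurred by time $t$, and $T \ge 0$ the time-to-failure. Any bound on $\mathbb{E}[T]$ has to come from constraints on $h_0$, $\beta$, and $Z(t)$.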

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We address the major comments point by point below, indicating the revisions we plan to make to enhance the rigor and clarity of the presentation.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively' is presented without any derivation, explicit model equations, proof sketch, or integration details showing how the Ledoit-Wolf Mahalanobis and topological acceleration features, when mapped via the Bayesian HMM, enforce the bound or positive divergence in the Cox proportional-hazards form. This is load-bearing for the central claim.

    Authors: We agree with the referee that the abstract presents the theoretical claim without sufficient supporting details. The manuscript includes a high-level description of the theoretical analysis, but to fully substantiate the central claim, we will revise the abstract to reference the key theoretical components and add a dedicated subsection or appendix providing the derivation, explicit equations, and proof sketch for how the Ledoit-Wolf regularized Mahalanobis distance and topological acceleration features, integrated through the Bayesian HMM, lead to the bounded expected time-to-failure and positive divergence in the Cox proportional hazards model.
    revision: yes

  2. Referee: [Framework description (Bayesian HMM feedback loop)] The assumption that multimodal multi-turn flow can be represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift is invoked when mapping inputs to the time-varying Cox model, yet no analysis addresses potential non-stationarities from discrete turn boundaries or cross-modal switches that could invalidate the kinematic features and prevent the claimed bound from following.

    Authors: The continuous trajectory representation is a foundational assumption of the TRIAD framework, and the Bayesian HMM feedback loop is designed to capture and adapt to non-stationarities, including those from discrete turn boundaries and cross-modal switches, by dynamically updating the state and feature mappings. However, we acknowledge that a more explicit analysis of these potential invalidations is warranted. In the revision, we will include additional discussion and analysis in the framework description section to demonstrate that the kinematic features remain reliable and the bound holds under such conditions.
    revision: yes

Circularity Check

1 steps flagged

Bounded E[time-to-failure] presented as theoretical result but constructed directly from TRIAD's own trajectory features and Cox-HMM integration

specific steps
  1. fitted input called prediction [Abstract]
    "These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively."

    The 'theoretical analysis' is invoked immediately after describing the feature extraction and Cox-HMM integration. The bounded E[time-to-failure] and positive divergence are therefore outputs of the same continuous-trajectory representation and covariance/topological features that the framework introduces; the survival bound is statistically forced by the model definition rather than derived from independent premises.

full rationale

The paper's central theoretical claim reduces to a restatement of its modeling assumptions. The abstract defines the TRIAD framework by mapping conversational flow to a continuous trajectory, extracting covariance-shift and topological-acceleration features, and feeding them into a time-varying Cox model via Bayesian HMM. It then asserts that this same construction 'provides a mathematically bounded expected time-to-failure' with positive divergence for malicious cases. No independent derivation, external benchmark, or parameter-free proof is supplied; the bound is therefore equivalent to the input modeling choices by construction. This matches the fitted-input-called-prediction pattern at the level of the survival outcome itself.
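
A short worked example shows how a bound of this kind can follow directly from modeling choices. It is an illustration under stated assumptions, not the paper's derivation: suppose the baseline hazard satisfies $h_0(t) \le \bar{h}$ and the linear predictor satisfies $\beta^{\top} Z(t) \le c$ for all $t$, where $\bar{h}$ and $c$ are hypothetical constants.

```latex
h\big(t \mid Z(t)\big) = h_0(t)\,e^{\beta^{\top} Z(t)} \le \bar{h}\,e^{c}
\;\Longrightarrow\;
S(t) = \exp\!\Big(-\int_0^{t} h\big(s \mid Z(s)\big)\,ds\Big) \ge e^{-\bar{h} e^{c} t}
\;\Longrightarrow\;
\mathbb{E}[T] = \int_0^{\infty} S(t)\,dt \ge \frac{1}{\bar{h}\,e^{c}} .
```

The lower bound on $\mathbb{E}[T]$ is a one-line consequence of the assumed caps on $h_0$ and $\beta^{\top} Z(t)$. It says nothing about whether real conversations satisfy those caps, which is the gap the circularity check points at.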

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Only the abstract is available, so the ledger records the modeling choices and new constructs explicitly named or implied in the abstract; no empirical or formal support is provided for any of them.

axioms (2)
  • domain assumption: Conversational flow can be represented as a continuous trajectory in a high-dimensional multimodal space.
    Invoked when the paper states that inputs are mapped as a continuous trajectory for the Cox model.
  • domain assumption: Covariance shifts and topological acceleration distinguish malicious drift from benign exploration.
    Central to the structural anomaly detection and kinematic features described.
invented entities (2)
  • Triple-tier Anomaly Defense (TRIAD) framework (no independent evidence)
    purpose: Predictive safety verification for multi-turn multimodal attacks
    Newly proposed integrated system combining anomaly detection, distance metrics, and survival modeling.
  • Topological trajectory acceleration (no independent evidence)
    purpose: Differentiate benign creative exploration from continuous malicious drift
    New kinematic feature introduced for the trajectory model.

pith-pipeline@v0.9.0 · 5766 in / 1543 out tokens · 33090 ms · 2026-05-20T09:03:39.587609+00:00 · methodology
