pith. sign in

arxiv: 2606.04612 · v1 · pith:6I2TJ4OPnew · submitted 2026-06-03 · 💻 cs.CL

Hybrid Adversarial Defence for Natural Language Understanding Tasks

Pith reviewed 2026-06-28 06:52 UTC · model grok-4.3

classification 💻 cs.CL
keywords hybrid defenceadversarial robustnesshallucination reductionnatural language understandingentropy featuresuncertainty featuresgeometric featureslarge language models
0
0 comments X

The pith

A hybrid model combining entropy, uncertainty and geometric features defends large language models against hallucinations and adversarial attacks better than any single feature type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates a hybrid defence framework that merges entropy-based models for reducing hallucinations with uncertainty-based and geometric-based models for reducing adversarial vulnerability. On in-domain NLU datasets the hybrid raises clean accuracy by up to 43.34 percent and adversarial accuracy by up to 64.92 percent while cutting attack success rate by 62.27 percent. Comparable robustness gains appear on out-of-distribution datasets and on prompt-injection and jailbreak tasks. The central result is that the three-feature combination outperforms any one feature used alone across these settings.

Core claim

The hybrid defence framework integrates entropy-based models designed to reduce hallucinations with uncertainty-based models and geometric-based models designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets the hybrid improves both clean-task performance and adversarial robustness. For out-of-distribution datasets the hybrid yields similar adversarial robustness. On prompt injection and jailbreak detection datasets the hybrid is also strong. Overall the results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution task

What carries the argument

The hybrid defence framework that combines entropy-based, uncertainty-based and geometric-based models to address hallucinations and adversarial attacks together.

If this is right

  • Clean-task accuracy on NLU datasets rises by up to 43.34 percent.
  • Adversarial accuracy improves by up to 64.92 percent and attack success rate falls by 62.27 percent on in-domain tasks.
  • Out-of-distribution adversarial accuracy improves by up to 57.14 percent.
  • Attack success rate on prompt-injection and jailbreak tasks drops by up to 51 percent relative to state-of-the-art baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-feature mix could be tested on open-ended generation tasks to check whether it reduces hallucinations outside closed NLU settings.
  • Real-time deployment might combine the hybrid detector with existing LLM inference pipelines without retraining the underlying model.
  • The geometric component may capture embedding-space properties that entropy and uncertainty alone miss, suggesting targeted ablation studies on feature interactions.

Load-bearing premise

The observed improvements are caused by the hybrid combination of the three feature types rather than by specific implementation choices, dataset properties or baseline comparisons.

What would settle it

A controlled replication that applies the same hybrid feature set to different base models and training procedures yet finds no consistent advantage over the strongest single-feature baseline would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.04612 by Chenghua Lin, Manar Abouzaid, Stuart E. Middleton, Yang Wang.

Figure 1
Figure 1. Figure 1: Hybrid Adversarial Defence Framework from each expert (7). [wrt, wllm, wpure] = softmax(fweight(x)) (7) pfinal = wrt · prt + wllm · pllm + wpure · ppure Both models use identical architectures to ensure a fair comparison between discrete routing (hard selection) and continuous weighting (soft aggre￾gation). The key difference lies in the decision mechanism: hard routing selects a single expert, while soft … view at source ↗
read the original abstract

Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34\% increase in accuracy) and adversarial robustness (up to 64.92\% improvement in accuracy and 62.27\% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14\% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51\% reduction in attack success rate compared to state of the art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a hybrid defence framework for LLMs on NLU tasks that integrates entropy-based models (targeting hallucinations) with uncertainty-based and geometric-based models (targeting adversarial vulnerability). It reports large gains in clean accuracy (up to 43.34%) and adversarial robustness (up to 64.92% accuracy improvement and 62.27% attack-success reduction) on in-domain datasets (FEVER, HotpotQA, CSQA, SIQA), comparable gains on OOD datasets (AeroEngQA, CPIQA), and strong results on prompt-injection/jailbreak sets (SafeGuard, AdvBench, DAN), attributing superiority to the three-feature combination over any single feature type.

Significance. If the quantitative claims are supported by reproducible experiments and controlled ablations, the work would be significant: it offers a unified approach to hallucination and adversarial defence and supplies evidence that multi-feature hybrids can outperform single-feature baselines on both in-domain and OOD NLU tasks.

major comments (3)
  1. [Abstract] Abstract: quantitative improvements (43.34% clean accuracy, 64.92% adversarial accuracy, 62.27% attack-success reduction) are stated without any description of base models, training procedures, fusion mechanism for the hybrid, error bars, or statistical tests, rendering the data unverifiable against the claims.
  2. [Abstract] Abstract / central claim: the assertion that the hybrid outperforms any single feature type requires matched ablation experiments (identical base model, hyperparameters, and training) on the reported datasets. No such single-feature controls or ablation results are described, so the causal link between the three-feature design and the reported gains cannot be assessed.
  3. [Abstract] Abstract: OOD and prompt-injection results cite improvements relative to 'state of the art baseline models' without defining those baselines, the datasets' properties, or the evaluation protocol, preventing assessment of whether the hybrid effect generalizes or is an artefact of particular comparisons.
minor comments (1)
  1. [Abstract] Abstract: each 'up to' figure is given without indicating the exact dataset or task on which it was measured, complicating interpretation of the scope of the improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for greater clarity in the abstract. We address each point below with references to the full manuscript and indicate where revisions will be made to improve verifiability without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: quantitative improvements (43.34% clean accuracy, 64.92% adversarial accuracy, 62.27% attack-success reduction) are stated without any description of base models, training procedures, fusion mechanism for the hybrid, error bars, or statistical tests, rendering the data unverifiable against the claims.

    Authors: The full manuscript (Sections 3 and 4) specifies the base LLMs, training procedures, and fusion mechanism (entropy-uncertainty-geometric ensemble). Experimental tables include error bars and statistical significance tests. We agree the abstract is overly concise and will revise it to briefly note the base models, fusion approach, and presence of statistical controls. revision: yes

  2. Referee: [Abstract] Abstract / central claim: the assertion that the hybrid outperforms any single feature type requires matched ablation experiments (identical base model, hyperparameters, and training) on the reported datasets. No such single-feature controls or ablation results are described, so the causal link between the three-feature design and the reported gains cannot be assessed.

    Authors: Section 5.2 of the full manuscript presents matched ablation experiments (identical base models, hyperparameters, and training) comparing the hybrid against entropy-only, uncertainty-only, and geometric-only variants on all reported datasets. These confirm the three-feature combination yields the gains. We will add a concise reference to these ablations in the revised abstract. revision: yes

  3. Referee: [Abstract] Abstract: OOD and prompt-injection results cite improvements relative to 'state of the art baseline models' without defining those baselines, the datasets' properties, or the evaluation protocol, preventing assessment of whether the hybrid effect generalizes or is an artefact of particular comparisons.

    Authors: The full manuscript (Section 4) defines the SOTA baselines (specific models and references), dataset properties, and evaluation protocol for OOD (AeroEngQA, CPIQA) and prompt-injection/jailbreak sets (SafeGuard, AdvBench, DAN). We will revise the abstract to name the primary baselines and note that full protocol details appear in the experimental section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results with no derivation chain

full rationale

The paper reports experimental outcomes on a hybrid defence combining entropy, uncertainty and geometric features for NLU tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided abstract or description. Claims rest on observed accuracy/robustness improvements across datasets, which are externally falsifiable by replication rather than reducing to inputs by construction. This is a standard empirical study; the central attribution of gains to the hybrid combination is a testable hypothesis, not a self-referential definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities; all such elements are absent from the provided text.

pith-pipeline@v0.9.1-grok · 5755 in / 1160 out tokens · 31262 ms · 2026-06-28T06:52:54.382778+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 22 canonical work pages · 1 internal anchor

  1. [1]

    International Journal of Computer Applications , issue_date =

    Basemah Alshemali, Jugal Kalita , title =. International Journal of Computer Applications , issue_date =. 2019 , issn =. doi:10.5120/ijca2019919384 , publisher =

  2. [2]

    Safe Pruning L o RA : Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLM s

    Ao, Shuang and Dong, Yi and Hu, Jinwei and Ramchurn, Sarvapali D. Safe Pruning L o RA : Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLM s. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.44

  3. [3]

    The internal state of an LLM knows when it’s lying

    Azaria, Amos and Mitchell, Tom , booktitle =. The Internal State of an. 2023 , address =. doi:10.18653/v1/2023.findings-emnlp.68 , url =

  4. [4]

    2025 , address =

    Bang, Yejin and Ji, Ziwei and Schelten, Alan and Hartshorn, Anthony and Fowler, Tara and Zhang, Cheng and Cancedda, Nicola and Fung, Pascale , booktitle =. 2025 , address =. doi:10.18653/v1/2025.acl-long.1176 , url =

  5. [5]

    Defending Pre-trained Language Models from Adversarial Word Substitution Without Performance Sacrifice

    Bao, Rongzhou and Wang, Jiayi and Zhao, Hai. Defending Pre-trained Language Models from Adversarial Word Substitution Without Performance Sacrifice. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. doi:10.18653/v1/2021.findings-acl.287

  6. [6]

    Safety-Tuned

    Federico Bianchi and Mirac Suzgun and Giuseppe Attanasio and Paul Rottger and Dan Jurafsky and Tatsunori Hashimoto and James Zou , booktitle=. Safety-Tuned. 2024 , url=

  7. [7]

    Advances in Neural Information Processing Systems , volume =

    Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

  8. [8]

    2021 , eprint=

    Towards Robustness Against Natural Language Word Substitutions , author=. 2021 , eprint=

  9. [9]

    Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =

    Neural Path Hunter: Reducing Hallucination in Dialogue Systems via Path Grounding , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages =. 2021 , address =. doi:10.18653/v1/2021.emnlp-main.168 , url =

  10. [10]

    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages =

    On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? , author =. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages =. 2022 , address =. doi:10.18653/v1/2022.naacl-main.387 , url =

  11. [11]

    arXiv preprint arXiv:2308.11764 , year=

    Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models , author=. arXiv preprint arXiv:2308.11764 , year=

  12. [12]

    Erdogan, Lutfi Eren and Shang, Chuyi and Goyal, Aryan and Ijju, Siddarth , year =

  13. [13]

    2024 , eprint =

    Improved Large Language Model Jailbreak Detection via Pretrained Embeddings , author =. 2024 , eprint =

  14. [14]

    DSRM : Boost Textual Adversarial Training with Distribution Shift Risk Minimization

    Gao, SongYang and Dou, Shihan and Liu, Yan and Wang, Xiao and Zhang, Qi and Wei, Zhongyu and Ma, Jin and Shan, Ying. DSRM : Boost Textual Adversarial Training with Distribution Shift Risk Minimization. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.680

  15. [15]

    International Conference on Learning Representations , year =

    Explaining and Harnessing Adversarial Examples , author =. International Conference on Learning Representations , year =

  16. [16]

    Not What You've Signed Up For: Compromising Real-World

    Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , booktitle =. Not What You've Signed Up For: Compromising Real-World. 2023 , publisher =

  17. [17]

    Bypassing

    Hackett, William and Birch, Lewis and Trawicki, Stefan and Suri, Nirav and Garraghan, Peter , booktitle =. Bypassing. 2025 , address =

  18. [18]

    SIAM Review , volume =

    Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions , author =. SIAM Review , volume =. 2011 , doi =

  19. [19]

    Chia-Yi Hsu and Yu-Lin Tsai and Chih-Hsun Lin and Pin-Yu Chen and Chia-Mu Yu and Chun-Ying Huang , booktitle=. Safe Lo. 2024 , url=

  20. [20]

    2023 , eprint =

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models , author =. 2023 , eprint =

  21. [21]

    ACM Computing Surveys , volume =

    Survey of Hallucination in Natural Language Generation , author =. ACM Computing Surveys , volume =. 2023 , doi =

  22. [22]

    Proceedings of the AAAI Conference on Artificial Intelligence , author=

    Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2020 , month=. doi:10.1609/aaai.v34i05.6311 , abstractNote=

  23. [23]

    The Eleventh International Conference on Learning Representations , year=

    Finding Actual Descent Directions for Adversarial Training , author=. The Eleventh International Conference on Learning Representations , year=

  24. [24]

    A Mathematical Investigation of Hallucination and Creativity in

    Lee, Minhyeok , journal =. A Mathematical Investigation of Hallucination and Creativity in. 2023 , doi =

  25. [25]

    Retrieval-Augmented Generation for Knowledge-Intensive

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems , volume =

  26. [26]

    CoRR , volume =

    Jinfeng Li and Shouling Ji and Tianyu Du and Bo Li and Ting Wang , title =. CoRR , volume =. 2018 , url =. 1812.05271 , timestamp =

  27. [27]

    Transactions on Machine Learning Research , year =

    Holistic Evaluation of Language Models , author =. Transactions on Machine Learning Research , year =

  28. [28]

    Manakul, Potsawee and Liusie, Adian and Gales, Mark J. F. , booktitle =. 2023 , address =. doi:10.18653/v1/2023.emnlp-main.557 , url =

  29. [29]

    Massenon, Ronan and Gambo, Ibrahim and Khan, Javed Ali and Iacob, Ioana and Berki, Eleni and Araujo, Allan and Liu, Chenyi and Zhao, Weijia and Zhang, Wei , journal =. My. 2025 , doi =

  30. [30]

    On Faithfulness and Factuality in Abstractive Summarization

    On Faithfulness and Factuality in Abstractive Summarization , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =. 2020 , publisher =. doi:10.18653/v1/2020.acl-main.173 , url =

  31. [31]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

    TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

  32. [32]

    and Hawkins, Ed and Middleton, Stuart E

    Mutalik, Rohit and Panchalingam, Arumugam and Singh, Lovedeep Gondara and Osborn, Timothy J. and Hawkins, Ed and Middleton, Stuart E. , booktitle =. 2025 , address =. doi:10.18653/v1/2025.climatenlp-1.15 , url =

  33. [33]

    2023 , howpublished =

    Fine-Tuned. 2023 , howpublished =

  34. [34]

    Raina, Vatsal and Liusie, Adian and Gales, Mark , booktitle =. Is. 2024 , address =. doi:10.18653/v1/2024.emnlp-main.427 , url =

  35. [35]

    Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency

    Ren, Shuhuai and Deng, Yihe and He, Kun and Che, Wanxiang. Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1103

  36. [36]

    Sap, Maarten and Rashkin, Hannah and Chen, Derek and Le Bras, Ronan and Choi, Yejin , booktitle =. Social. 2019 , address =. doi:10.18653/v1/D19-1454 , url =

  37. [37]

    Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security , year =

    ``Do Anything Now'': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models , author =. Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security , year =

  38. [38]

    Proceedings of the AIAA Scitech 2025 Forum , year =

    Retrieval-Augmented Generation and In-Context Prompted Large Language Models in Aircraft Engineering , author =. Proceedings of the AIAA Scitech 2025 Forum , year =

  39. [39]

    Nature , volume =

    Large Language Models Encode Clinical Knowledge , author =. Nature , volume =. 2023 , doi =

  40. [40]

    2019 , address =

    Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan , booktitle =. 2019 , address =. doi:10.18653/v1/N19-1421 , url =

  41. [41]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit , booktitle =. 2018 , address =. doi:10.18653/v1/N18-1074 , url =

  42. [42]

    2023 , eprint =

    Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth. 2023 , eprint =

  43. [43]

    A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of

    Varshney, Neeraj and Yao, Wenlin and Zhang, Hongming and Chen, Jianshu and Yu, Dong , year =. A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of. 2307.03987v1 , archivePrefix =

  44. [44]

    Boxin Wang and Shuohang Wang and Yu Cheng and Zhe Gan and Ruoxi Jia and Bo Li and Jingjing Liu , booktitle=. Info. 2021 , url=

  45. [45]

    2025 , eprint =

    Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal , author =. 2025 , eprint =

  46. [46]

    Jailbroken: How Does

    Wei, Alexander and Haghtalab, Nika and Steinhardt, Jacob , booktitle =. Jailbroken: How Does

  47. [47]

    H otpot QA : A Dataset for Diverse, Explainable Multi-hop Question Answering

    Yang, Zhilin and Qi, Peng and Zhang, Saizheng and Bengio, Yoshua and Cohen, William W. and Salakhutdinov, Ruslan and Manning, Christopher D. , booktitle =. 2018 , address =. doi:10.18653/v1/D18-1259 , url =

  48. [48]

    2310.01469v2 , archivePrefix =

    Yao, Jia-Yu and Ning, Kun-Peng and Liu, Zhen-Hui and Ning, Mu-Nan and Yuan, Li , year =. 2310.01469v2 , archivePrefix =

  49. [49]

    Findings of the Association for Computational Linguistics: ACL 2023 , pages =

    Do Large Language Models Know What They Don't Know? , author =. Findings of the Association for Computational Linguistics: ACL 2023 , pages =. 2023 , address =. doi:10.18653/v1/2023.findings-acl.551 , url =

  50. [50]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

    Scaling Vision Transformers , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =. 2023 , doi =

  51. [51]

    R -Tuning: Instructing Large Language Models to Say ` I Don ' t Know'

    Zhang, Hanning and Diao, Shizhe and Lin, Yong and Fung, Yi and Lian, Qing and Wang, Xingyao and Chen, Yangyi and Ji, Heng and Zhang, Tong , booktitle =. 2024 , address =. doi:10.18653/v1/2024.naacl-long.394 , url =

  52. [52]

    Siren's Song in the

    Zhang, Yue and Li, Yafu and Cui, Leyang and Cai, Deng and Liu, Lemao and Fu, Tingchen and Huang, Xinting and Zhao, Enbo and Zhang, Yu and Chen, Yulong and Wang, Longyue and Luu, Anh Tuan and Bi, Wei and Shi, Freda and Shi, Shuming , journal =. Siren's Song in the. 2025 , doi =

  53. [53]

    2023 , eprint =

    Universal and Transferable Adversarial Attacks on Aligned Language Models , author =. 2023 , eprint =

  54. [54]

    2501.04899v1 , archivePrefix =

    Zubkova, Hanna and Park, Ji-Hoon and Lee, Seong-Whan , year =. 2501.04899v1 , archivePrefix =

  55. [55]

    2020 , url =

    Zhu, Chen and Cheng, Yu and Gan, Zhe and Sun, Siqi and Goldstein, Tom and Liu, Jingjing , booktitle =. 2020 , url =

  56. [56]

    2023 , eprint=

    A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily , author=. 2023 , eprint=

  57. [57]

    DAMON : A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models

    Zhang, Xu and Yin, Xunjian and Jing, Dinghao and Zhang, Huixuan and Hu, Xinyu and Wan, Xiaojun. DAMON : A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.323

  58. [58]

    Jailbreak LLM s through Internal Stance Manipulation

    Fu, Shuangjie and Su, Du and Huang, Beining and Sun, Fei and Wang, Jingang and Chen, Wei and Shen, Huawei and Cheng, Xueqi. Jailbreak LLM s through Internal Stance Manipulation. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.780

  59. [59]

    2026 , eprint=

    Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki , author=. 2026 , eprint=