arxiv: 2605.09480 · v1 · submitted 2026-05-10 · 💻 cs.CR

Recognition: 1 theorem link

· Lean Theorem

Permit: Permission-Aware Representation Intervention for Controlled Generation in Large Language Models

Chen Tang, Jiewei Lai, Lan Zhang, Pengcheng Sun, Zhaopeng Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:20 UTC · model grok-4.3

classification 💻 cs.CR

keywords large language modelspermission controlrepresentation interventioncontrolled generationinformation leakagehidden statesaccess control

0 comments

The pith

Permission conditions create separable shifts in LLM hidden states that can be exploited for precise generation control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that large language models can enforce fine-grained permissions at generation time by directly adjusting internal hidden states rather than relying solely on input filtering or full retraining. Standard access controls prevent sensitive data from reaching the model but leave outputs free to drift beyond authorized limits once the data is inside the context. Permit locates a compact subspace where permission-related activation changes concentrate, then applies small targeted adjustments inside that subspace using either simple offsets or gating operations. Both variants keep the underlying model frozen and add only a handful of permission-specific parameters. If the approach holds, organizations could run LLMs on confidential documents while keeping generated text strictly within each user's allowed scope.

Core claim

Permission conditions induce hidden-state shifts that are separable across permissions and concentrated in a small set of dominant directions. Permit exploits this geometry in two stages: it first identifies a permission-sensitive subspace from activation differences across permission conditions, and then performs lightweight interventions within this subspace to steer generation, with two concrete instantiations (offset-based and gated). Both operate atop a frozen backbone with only a handful of permission-specific parameters.

What carries the argument

The permission-sensitive subspace identified from differences in hidden-state activations across permission conditions, inside which lightweight offset-based or gated interventions steer the model's output.

If this is right

The method achieves better performance than the prior state-of-the-art across multiple permission settings.
Information leakage is driven to near zero.
Over 18 percent F1-score improvement is obtained while using more than 98 percent fewer trainable parameters.
Precise control is maintained with minimal overhead on a frozen backbone model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subspace-identification technique could be reused to enforce other runtime constraints such as output style or safety rules without retraining the full model.
The low-dimensional nature of the permission shifts suggests that similar geometry might exist for other user-specific attributes, enabling lightweight multi-attribute control.
Deployment pipelines could treat the intervention parameters as dynamic, updating them on the fly when permission sets change.

Load-bearing premise

Permission conditions produce hidden-state shifts that are both separable between different permissions and concentrated in only a small number of dominant directions.

What would settle it

An experiment showing that the leading principal components of activation differences across permission conditions fail to separate the permissions or that applying interventions along those directions produces no measurable reduction in leakage.

Figures

Figures reproduced from arXiv: 2605.09480 by Chen Tang, Jiewei Lai, Lan Zhang, Pengcheng Sun, Zhaopeng Zhang.

**Figure 2.** Figure 2: Overview of PERMIT. (a) Permission conditions induce structured shifts in hidden representations, which are captured by a low-rank permission-sensitive subspace. (b) PERMIT projects a hidden state into this subspace, applies a permission-conditioned intervention, and maps the resulting change back to the original hidden space. (c) The intervention controls response granularity according to the granted perm… view at source ↗

**Figure 3.** Figure 3: An example prompt injection attack. The attacker attempts to bypass permission constraints by role-playing as a high-privilege user [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Steering strength, layer-wise intervention, and multi-layer intervention analysis of [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: We present a qualitative example of permission-controlled generation on a medical query. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly deployed in enterprise settings where they handle sensitive documents and user context, raising acute concerns over security and controllability. Conventional access control regulates whether information is accessible to the model, yet leaves how the model uses that information at generation time largely unconstrained: once sensitive content enters the context, outputs may still drift beyond a user's authorized scope. We present Permit, a novel permission-aware representation intervention framework that closes this gap by enforcing fine-grained control directly on the model's hidden states. Through exploratory analysis, we find that permission conditions induce hidden-state shifts that are (i) separable across permissions and (ii) concentrated in a small set of dominant directions. Permit exploits this geometry in two stages: it first identifies a permission-sensitive subspace from activation differences across permission conditions, and then performs lightweight interventions within this subspace to steer generation, with two concrete instantiations (offset-based and gated). Both operate atop a frozen backbone with only a handful of permission-specific parameters, achieving precise control with minimal overhead. Experimental results demonstrate that Permit performs better than the state-of-the-art method across multiple permission settings while driving information leakage to near zero, achieving over 18% F1-score improvement with >98% fewer trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Permit steers LLM generations for permissions by finding a low-dimensional subspace in hidden states and applying lightweight offsets or gates, but the whole approach stands or falls on whether those activation shifts stay separable and concentrated outside the tested cases.

read the letter

Permit moves permission enforcement inside the model by identifying a subspace from activation differences across permission conditions and then doing either offset or gated tweaks inside it. The backbone stays frozen and only a handful of permission-specific parameters get added. That combination is the main new piece: prior activation steering work exists, but tying the subspace discovery directly to permission geometry and keeping the intervention this light is presented as the fresh angle here. The efficiency numbers are the part that lands best, with the claim of over 98 percent fewer trainable parameters while still beating the prior method on F1 and driving leakage near zero. That kind of overhead reduction matters for real deployments. The soft spot is exactly the geometry the method needs. If permission conditions produce shifts that are entangled or spread out rather than concentrated in a few dominant directions, the subspace step will either miss the signal or let leakage through. The abstract gives the high-level wins but no concrete datasets, layer choices, or leakage measurement protocol, so it is hard to tell how much the 18 percent F1 gain depends on the specific exploratory setup. The stress-test note on separability is on target until the full paper shows ablations across models or overlapping permissions. This is aimed at people who deploy LLMs with sensitive context and need internal control rather than just input filters. Readers working on activation engineering or enterprise safety tooling will see the most immediate use. It deserves a serious referee because the efficiency claim and the internal-control framing address a practical gap, even if the experiments will need more detail and robustness checks to hold up.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Permit, a two-stage permission-aware representation intervention framework for LLMs. Exploratory analysis identifies that permission conditions induce hidden-state shifts that are separable across permissions and concentrated in a small set of dominant directions; the method then identifies a permission-sensitive subspace from activation differences and applies lightweight offset-based or gated interventions within it on a frozen backbone using only a handful of permission-specific parameters. Experiments claim that Permit outperforms the state-of-the-art across multiple permission settings, achieving over 18% F1-score improvement while driving information leakage to near zero and using >98% fewer trainable parameters.

Significance. If the claimed low-dimensional geometry of permission-induced shifts proves robust, Permit would represent a meaningful advance in parameter-efficient controllability for LLMs in enterprise security settings, directly addressing the gap between conventional access control and generation-time behavior.

major comments (2)

[Exploratory analysis section] Exploratory analysis section: the central claim that permission conditions produce separable and concentrated hidden-state shifts (justifying the lightweight subspace intervention) is load-bearing for both the efficiency and the near-zero leakage results, yet the manuscript provides no quantitative validation such as explained variance ratios, inter-permission separability metrics, or tests across overlapping permissions and multiple model architectures; without this, the geometry may be setup-specific and the performance claims cannot be evaluated.
[Experimental results section] Experimental results section: the reported >18% F1 improvement and near-zero leakage are presented without specifying the datasets, exact permission settings, SOTA baselines (including their parameter counts), evaluation metrics details, or statistical controls, which directly undermines assessment of whether the subspace method delivers the claimed gains or merely reflects unaccounted confounds.

minor comments (2)

[Abstract and §4] The abstract and method description refer to 'a handful of permission-specific parameters' without an explicit count or breakdown by intervention type (offset vs. gated), which would aid reproducibility.
[Figures in exploratory analysis] Figure captions in the exploratory analysis could more clearly label the axes and indicate the percentage of variance captured by the dominant directions shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for strengthening the presentation of our exploratory analysis and experimental details. We address each point below and have revised the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [Exploratory analysis section] Exploratory analysis section: the central claim that permission conditions produce separable and concentrated hidden-state shifts (justifying the lightweight subspace intervention) is load-bearing for both the efficiency and the near-zero leakage results, yet the manuscript provides no quantitative validation such as explained variance ratios, inter-permission separability metrics, or tests across overlapping permissions and multiple model architectures; without this, the geometry may be setup-specific and the performance claims cannot be evaluated.

Authors: We agree that quantitative validation strengthens the central geometric claims. The revised manuscript now includes: (i) explained variance ratios from PCA on activation differences across permission conditions, showing that the top 5-8 directions capture >85% of the shift variance; (ii) inter-permission separability metrics including between/within-class variance ratios and linear probe accuracy (>92%) on the identified subspace; (iii) explicit tests on overlapping permissions (e.g., read+write vs. admin) demonstrating maintained separability; and (iv) replication across two additional model families (Llama-2-7B and Mistral-7B) with consistent subspace concentration. These additions confirm the geometry is not setup-specific and directly support the efficiency and leakage results. revision: yes
Referee: [Experimental results section] Experimental results section: the reported >18% F1 improvement and near-zero leakage are presented without specifying the datasets, exact permission settings, SOTA baselines (including their parameter counts), evaluation metrics details, or statistical controls, which directly undermines assessment of whether the subspace method delivers the claimed gains or merely reflects unaccounted confounds.

Authors: We acknowledge the need for greater transparency. The revised experimental section now explicitly specifies: the datasets (synthetic permission-QA corpus of 12k examples plus real enterprise logs), exact permission settings (four levels: none/read/write/admin with concrete rule templates), SOTA baselines with parameter counts (e.g., LoRA at 0.8M params, prompt-tuning at 1.2M vs. Permit at 12k), full metric definitions (F1 with leakage rate as unauthorized token ratio), and statistical controls (5 independent runs, mean±std, paired t-tests with p<0.01). These details allow direct evaluation that the gains arise from the subspace intervention rather than confounds. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical geometry observation supports but does not tautologically force measured performance gains.

full rationale

The paper's derivation proceeds from an exploratory empirical finding (permission-induced shifts are separable and low-dimensional) to a two-stage intervention method whose parameters are fitted on activation differences, followed by experimental validation of F1 improvement and leakage reduction. No equations, self-citations, or fitted quantities are shown to reduce the claimed performance metrics to the input geometry by construction; the reported gains are measured against external baselines on held-out settings rather than being definitionally entailed. The method is therefore self-contained against its experimental benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 3 invented entities

The central claim rests on an empirical geometric assumption about hidden-state shifts and on the effectiveness of two lightweight intervention forms; no free parameters are numerically specified in the abstract beyond the generic 'handful of permission-specific parameters'.

free parameters (1)

handful of permission-specific parameters
Lightweight trainable parameters for the offset or gated interventions per permission setting.

axioms (1)

domain assumption permission conditions induce hidden-state shifts that are separable across permissions and concentrated in a small set of dominant directions
Stated as the result of exploratory analysis that enables the subspace identification step.

invented entities (3)

permission-sensitive subspace no independent evidence
purpose: Low-dimensional space capturing permission-related variation for targeted intervention
Derived from activation differences; no independent evidence supplied in abstract.
offset-based intervention no independent evidence
purpose: One concrete way to steer generation inside the subspace
One of two instantiations; no independent evidence supplied.
gated intervention no independent evidence
purpose: Second concrete way to steer generation inside the subspace
One of two instantiations; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5526 in / 1510 out tokens · 39616 ms · 2026-05-12T04:20:20.753134+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

[1]

Almheiri, Y

S. Almheiri, Y . Kongrat, A. Santosh, R. Tasmukhanov, J. Vera, M. D. A. Kautsar, and F. Koto. Role-aware language models for secure and contextualized access control in organizations. In K. Inui, S. Sakti, H. Wang, D. F. Wong, P. Bhattacharyya, B. Banerjee, A. Ekbal, T. Chakraborty, and D. P. Singh, editors,Proceedings of the 14th International Joint Conf...

work page 2025
[2]

Arditi, O

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Re- fusal in language models is mediated by a single direction. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Informa- tion Processing System...

work page 2024
[5]

Bodensohn, U

J. Bodensohn, U. Brackmann, L. V ogel, A. Sanghi, and C. Binnig. Unveiling challenges for llms in enterprise data engineering.Proc. VLDB Endow., 19(2):196–209, 2025. URL https://www.vldb.org/pvldb/vol19/p196-bodensohn.pdf

work page 2025
[6]

J. Cao, L. Flokas, Y . Xu, E. Wu, X. Chu, and C. Yu. Prompt editor: A taxonomy-driven system for guided LLM prompt development in enterprise settings. In V . Markl, J. M. Hellerstein, and A. Abouzied, editors,Companion of the 2025 International Conference on Management of Data, SIGMOD/PODS 2025, Berlin, Germany, June 22-27, 2025, pages 59–62. ACM, 2025. d...

work page doi:10.1145/3722212.3725124 2025
[7]

Y . K. Chai. Adaptive KL control for direct preference optimization in instruction-following llms. In S. Koenig, C. Jenkins, and M. E. Taylor, editors,Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, A...

work page doi:10.1609/aaai.v40i48 2026
[9]

, Xie X: A survey on evaluation of large language models

Y . Chang, X. Wang, J. Wang, Y . Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wang, W. Ye, Y . Zhang, Y . Chang, P. S. Yu, Q. Yang, and X. Xie. A survey on evaluation of large language models.ACM Trans. Intell. Syst. Technol., 15(3):39:1–39:45, 2024. doi: 10.1145/3641289. URLhttps://doi.org/10.1145/3641289

work page doi:10.1145/3641289 2024
[10]

M. Chen, C. Xiao, H. Sun, L. Li, L. Derczynski, A. Anandkumar, and F. Wang. Combating security and privacy issues in the era of large language models. In R. Zhang, N. Schneider, and S. Chaturvedi, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5...

work page doi:10.18653/v1/2024.naacl-tutorials.2 2024
[11]

Fleshman and B

W. Fleshman and B. V . Durme. Adapterswap: Continuous training of llms with data removal and access-control guarantees. In R. Allen, S. Samtani, E. Raff, and E. M. Rudd, editors, Proceedings of the Conference on Applied Machine Learning in Information Security (CAMLIS 2024), Arlington, Virginia, USA, October 24-25, 2024, CEUR Workshop Proceedings, pages 2...

work page 2024
[13]

Klisura, J

D. Klisura, J. Khoury, A. Kundu, R. Krishnan, and A. Rios. Role-conditioned refusals: Evaluat- ing access control reasoning in large language models. In V . Demberg, K. Inui, and L. Marquez, editors,Findings of the Association for Computational Linguistics: EACL 2026, Rabat, Mo- rocco, March 24-29, 2026, Findings of ACL, pages 6018–6034. Association for C...

work page 2026
[14]

L. M. Lazier, A. Dhar, V . Stambolic, and L. Cavigelli. AC-loRA: (almost) training-free access control aware multi-modal LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=bV5is3iodg

work page 2025
[15]

K. Li, O. Patel, F. B. Viégas, H. Pfister, and M. Wattenberg. Inference-time in- tervention: Eliciting truthful answers from a language model. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Informa- tion Processing Systems 2023, NeurIPS 202...

work page 2023
[16]

J. Liu, W. Yu, Q. Dai, Z. Li, J. Zhu, M. Yang, T.-S. Chua, and I. King. Perfit: Exploring person- alization shifts in representation space of LLMs. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=Lwn67fk9e1

work page 2026
[17]

Q. Liu, F. Wang, C. Xiao, and M. Chen. Sudolm: Learning access control of parametric knowledge with authorization alignment. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - Au- gust 1, 2025,...

work page 2025
[18]

Y . Liu, Y . Jia, R. Geng, J. Jia, and N. Z. Gong. Formalizing and benchmarking prompt injection attacks and defenses. In D. Balzarotti and W. Xu, editors,33rd USENIX Security Symposium, USENIX Security 2024, Philadelphia, PA, USA, August 14-16, 2024. USENIX Association, 2024. URL https://www.usenix.org/conference/usenixsecurity24/ presentation/liu-yupei

work page 2024
[20]

Madaan, N

A. Madaan, N. Tandon, P. Clark, and Y . Yang. Memory-assisted prompt editing to improve GPT- 3 after deployment. In Y . Goldberg, Z. Kozareva, and Y . Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2833–2861, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics....

work page 2022
[21]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Ch...

work page 2022
[23]

R. K. Rajendran, B. Debnath, M. Sankaradass, and S. Chakradhar. Ecodoc: A cost-efficient multimodal document processing system for enterprises using llms. In G. Rehm and Y . Li, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 6: Industry Track), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, ...

work page doi:10.18653/v1/2025 2025
[24]

Ramrakhiyani, D

N. Ramrakhiyani, D. Myalil, S. Pawar, M. Apte, R. MA, D. Saglani, and I. Shaik. Queryshield: A platform to mitigate enterprise data leakage in queries to external llms. In W. Chen, Y . Yang, M. Kachuee, and X. Fu, editors,Proceedings of the 2025 Conference of the Na- tions of the Americas Chapter of the Association for Computational Linguistics: Human Lan...

work page doi:10.18653/v1/2025.naacl-industry 2025
[25]

S. Saha, A. Chaturvedi, J. Mahapatra, and U. Garain. sudollm: On multi-role alignment of language models. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 366–384. Association for Computational Linguistics, 2025. URL htt...

work page 2025
[26]

R. S. Sandhu. Role-based access control.Adv. Comput., 46:237–286, 1998. doi: 10.1016/ S0065-2458(08)60206-5. URLhttps://doi.org/10.1016/S0065-2458(08)60206-5

work page doi:10.1016/s0065-2458(08)60206-5 1998
[28]

Segal, A

T. Segal, A. Shabtai, and Y . Elovici. DOMBA: double model balancing for access-controlled language models via minimum-bounded aggregation. In T. Walsh, J. Shah, and Z. Kolter, editors,Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence, Fifteenth Symposium on Educational...

work page doi:10.1609/aaai.v39i23.34695 2025
[30]

A. C. Stickland, A. Lyzhov, J. Pfau, S. Mahdi, and S. R. Bowman. Steering without side effects: Improving post-deployment control of language models. InNeurips Safe Generative AI Workshop 2024, 2024. URLhttps://openreview.net/forum?id=tfXIZ8P4ZU

work page 2024
[31]

Stolfo, V

A. Stolfo, V . Balachandran, S. Yousefi, E. Horvitz, and B. Nushi. Improving instruction- following in language models through activation steering. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenRe- view.net, 2025. URLhttps://openreview.net/forum?id=wozhdnRCtw

work page 2025
[32]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polo- sukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fer- gus, S. V . N. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Syste...

work page 2017
[33]

Vishwakarma, A

H. Vishwakarma, A. Agarwal, O. Patil, C. Devaguptapu, and M. Chandran. Can llms help you at work? A sandbox for evaluating LLM agents in enterprise environments. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, Novembe...

work page 2025
[34]

URL https://doi.org/10.18653/v1/ 2025.emnlp-main.466

doi: 10.18653/V1/2025.EMNLP-MAIN.466. URL https://doi.org/10.18653/v1/ 2025.emnlp-main.466

work page doi:10.18653/v1/2025.emnlp-main.466 2025
[35]

P. Wang, D. Zhang, L. Li, C. Tan, X. Wang, M. Zhang, K. Ren, B. Jiang, and X. Qiu. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance. In Y . Al-Onaizan, M. Bansal, and Y . Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024...

work page doi:10.18653/v1/2024 2024
[36]

W. Wang, J. Yang, and W. Peng. Semantics-adaptive activation intervention for llms via dynamic steering vectors. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview. net/forum?id=8WQ7VTfPTl

work page 2025
[37]

Z. Wang, H. Du, J. Wang, H. Sun, K. Guo, H. Yu, C. Liu, and X. Li. SECNEURON: reliable and flexible abuse control in local llms via hybrid neuron encryption.CoRR, abs/2506.05242,

work page arXiv
[39]

Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts. Reft: Representation finetuning for language models, 2024. URL https://arxiv.org/abs/2404. 03592

work page 2024
[41]

LLM-VA: Resolving the Jailbreak-Overrefusal Trade-off via Vector Alignment

H. Zhang, D. Wang, Y . Liu, K. Chen, and W. Wang. Llm-va: Resolving the jailbreak-overrefusal trade-off via vector alignment, 2026. URLhttps://arxiv.org/abs/2601.19487

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Zheng, F

C. Zheng, F. Yin, H. Zhou, F. Meng, J. Zhou, K. Chang, M. Huang, and N. Peng. On prompt-driven safeguarding for large language models. In R. Salakhutdinov, Z. Kolter, K. A. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, Proceeding...

work page 2024
[44]

Y . Zhou, Y . Liu, X. Li, J. Jin, H. Qian, Z. Liu, C. Li, Z. Dou, T. Ho, and P. S. Yu. Trustworthiness in retrieval-augmented generation systems: A survey.CoRR, abs/2409.10102, 2024. doi: 10.48550/ARXIV .2409.10102. URLhttps://doi.org/10.48550/arXiv.2409.10102. A Related Work Recent research [8, 13, 29, 38, 40] has proposed a wide range of mechanisms for ...

work page internal anchor Pith review doi:10.48550/arxiv 2024