pith. sign in

arxiv: 2605.24552 · v1 · pith:4JYQQ2F5new · submitted 2026-05-23 · 💻 cs.CR

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

Pith reviewed 2026-06-30 13:08 UTC · model grok-4.3

classification 💻 cs.CR
keywords jailbreak defenseLLM safetyrepresentation engineeringwhite-list defenselatent space ellipsoidprojected gradient descentbenign data modeling
0
0 comments X

The pith

Ellipsoid Control elicits refusal on jailbreaks by constraining updates inside a benign latent ellipsoid fitted from safe data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that RepE defenses fail when they must learn from incomplete harmful samples that attackers can always expand. It therefore switches to a white-list strategy that uses only abundant benign activations. An anisotropic ellipsoid is fitted to those activations to mark the safe latent region. Projected gradient descent then pushes any input toward a refusal direction, but the ellipsoid keeps every step inside the original benign geometry so utility is not lost. Experiments on multiple models and attacks show higher safety with smaller utility drops than black-list baselines.

Core claim

Ellipsoid Control is a test-time defense that fits an anisotropic ellipsoid from benign activations to tightly constrain the search space of projected gradient descent, allowing the model to generate refusal responses on arbitrary inputs while keeping the benign latent geometry intact and thus preserving utility on harmless tasks.

What carries the argument

An anisotropic benign-geometry ellipsoid fitted from benign activations that bounds the latent-space updates performed by projected gradient descent to elicit refusal.

If this is right

  • Refusal can be elicited on unseen jailbreaks without any coverage of harmful distributions.
  • Utility on benign tasks degrades less than with defenses that fit directly to harmful examples.
  • The defense separates protection of the benign region from estimation of the harmful region.
  • Safety gains remain consistent across different LLMs, attack methods, and safety-boundary tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Task-specific ellipsoids could be fitted to give finer control over different kinds of safe behavior.
  • The result suggests that preserving latent geometry may matter more for utility than directly modeling attacks.
  • Periodic refitting on new benign data could adapt the boundary as the model's normal usage evolves.

Load-bearing premise

That an anisotropic ellipsoid fitted from abundant benign data can constrain projected gradient descent updates tightly enough to minimize distortion of the benign latent geometry while still allowing refusal to be elicited on arbitrary inputs.

What would settle it

A jailbreak input on which the constrained projected gradient descent either fails to produce refusal or produces a measurable accuracy drop on standard benign benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.24552 by Ahmed Asiri, Feng Wu, Luoyu Chen, Shui Yu, Weiqi Wang, Zhiyi Tian.

Figure 1
Figure 1. Figure 1: Illustration of Ellipsoid Control: EC performs benign-preserving defense by increasing the refusal likelihood of its hidden state h via projected gradient descent. More specifically, for a potentially malicious input whose h initially has low refusal likelihood, EC first increases the likelihood of a refusal response by optimizing a linear transform matrix ∆. This refusal objective is a fixed model-output … view at source ↗
Figure 2
Figure 2. Figure 2: Gray axis: original benign ellipsoid semantic directions in latent space. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Refusal Log-likelihood score distribution across attack categories [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Negative Log-likelihood loss dynamics for samples from benign safety [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: TSNE visualization on benign, boundary and jailbreak samples [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: KDE distribution of refusal-elicitation drift norms for in-distribution [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: t-SNE visualization of hidden-state activations from the benign [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 9
Figure 9. Figure 9: AUROC curves for jailbreak Defense Success Rate (DSR) and boundary ORR for LLAMA3-8B. 1) Safety–Over-Refusal Pareto Front: To evaluate how data volume interacts with the safety–over-refusal trade-off, we randomly sample 300 boundary questions from ORBench and compare them with 300 jailbreak inputs from Harmbench. For each sample, we compute the final log-likelihood of the corresponding refusal phrase after… view at source ↗
Figure 8
Figure 8. Figure 8: ERR trends when data size scales up, indicating more diverse semantic [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
read the original abstract

Representation engineering (RepE) defenses have shown strong robustness against jailbreak attacks on large language models (LLMs). However, these methods fundamentally rely on black-list supervision: they learn jailbreak-to-refusal activation transformations from harmful or jailbreak data that are inherently incomplete and continuously evolving. Hence, the performance of RepE-based defenses becomes tightly coupled to the quality and coverage of collected harmful samples, leaving models vulnerable to unseen attacks. This reliance also obscures the distinction between defenses that fit known harmful distributions and defenses that protect a benign latent region without estimating the harmful distribution. We adopt the opposite, the white-list perspective, by leveraging the accessibility and abundance of benign data. The goal is to elicit refusal on arbitrary inputs while ensuring that harmless inputs are not falsely rejected. This shifts the core research question to: How can we design a robust benign-latent preservation mechanism such that the benign latent distribution remains intact while refusal is elicited? To answer this, we propose Ellipsoid Control, a test-time defense. It performs projected gradient descent that can elicit refusal on arbitrary inputs, aiming to improve defense effectiveness. At the same time, an anisotropic benign-geometry ellipsoid is fitted from abundant benign data to constrain the update to minimize distortion of the benign latent geometry. This tight constraint helps preserve model utility. Across multiple LLMs, jailbreak attacks, benign tasks, and safety-boundary evaluations, Ellipsoid Control consistently enhances safety while better preserving utility, demonstrating the effectiveness of the white-list approach for jailbreak defense

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Ellipsoid Control, a test-time white-list defense for LLMs against jailbreaks. It fits an anisotropic ellipsoid from abundant benign data to constrain PGD updates that elicit refusal on arbitrary inputs, aiming to preserve benign latent geometry and model utility while enhancing safety. The paper claims this approach consistently improves safety across multiple LLMs, jailbreak attacks, benign tasks, and safety-boundary evaluations while better preserving utility compared to black-list methods.

Significance. If the empirical claims hold, the work would be significant for shifting RepE defenses from black-list to white-list paradigms, potentially offering better robustness to evolving and unseen attacks by focusing on preserving benign regions rather than modeling harmful ones. The geometric constraint via ellipsoid could provide a principled way to balance safety and utility.

major comments (1)
  1. [Abstract] Abstract: The abstract states that Ellipsoid Control 'consistently enhances safety while better preserving utility' across evaluations but supplies no quantitative results, baselines, error bars, or experimental details; the central empirical claim cannot be assessed from the provided text alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their feedback. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract states that Ellipsoid Control 'consistently enhances safety while better preserving utility' across evaluations but supplies no quantitative results, baselines, error bars, or experimental details; the central empirical claim cannot be assessed from the provided text alone.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports these metrics (including baselines, safety gains, utility preservation rates, and error bars) in the experimental sections. In the revised version we will update the abstract to incorporate representative numerical findings from those sections so that the central claims are directly assessable from the abstract alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper fits an anisotropic ellipsoid to abundant external benign activation data and uses the resulting region to constrain projected gradient descent updates that elicit refusal. The central claim is an empirical demonstration of improved safety-utility trade-off across multiple LLMs, attacks, and tasks. No derivation step reduces by construction to its own fitted parameters, no load-bearing self-citation chain is invoked, and the white-list framing is independent of the target refusal behavior. The method remains falsifiable on held-out data.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that benign data is sufficiently abundant and representative to define a constraining ellipsoid that preserves utility while enabling refusal elicitation.

free parameters (1)
  • ellipsoid parameters
    Fitted from benign activation data to define the geometry constraint; exact fitting procedure and dimensionality not specified in abstract.
axioms (1)
  • domain assumption Benign data is abundant and representative of the benign latent distribution
    Invoked to justify fitting the ellipsoid that constrains updates.

pith-pipeline@v0.9.1-grok · 5817 in / 1219 out tokens · 25582 ms · 2026-06-30T13:08:08.779168+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 25 canonical work pages · 15 internal anchors

  1. [1]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lam- ple, “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  2. [2]

    GPT-4 Technical Report

    OpenAI, “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Qwen Technical Report

    A. G. Qwen Team, “Qwen technical report,”arXiv preprint arXiv:2309.16609, 2023

  4. [4]

    Iron: Private inference on transformers,

    M. Hao, H. Li, H. Chen, P. Xing, G. Xu, and T. Zhang, “Iron: Private inference on transformers,” inAdvances in Neural Information Processing Systems, vol. 35, 2022

  5. [5]

    Efficient and privacy-enhanced federated learning for industrial artificial intelligence,

    M. Hao, H. Li, X. Luo, G. Xu, H. Yang, and S. Liu, “Efficient and privacy-enhanced federated learning for industrial artificial intelligence,”IEEE Transactions on Industrial Informatics, vol. 16, no. 10, pp. 6532–6542, 2020

  6. [6]

    Scalable zero-knowledge proofs for non- linear functions in machine learning,

    M. Hao, H. Chen, H. Li, C. Weng, Y . Zhang, H. Yang, and T. Zhang, “Scalable zero-knowledge proofs for non- linear functions in machine learning,” in33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, 2024, pp. 3819–3836

  7. [7]

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    S. Yi, Y . Liu, Z. Sun, T. Cong, X. He, J. Song, K. Xu, and Q. Li, “Jailbreak attacks and defenses against large language models: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2407.04295

  8. [8]

    Improving alignment and robustness with circuit breakers,

    A. Zou, L. Phan, J. Wang, D. Duenas, M. Lin, M. Andriushchenko, J. Z. Kolter, M. Fredrikson, and D. Hendrycks, “Improving alignment and robustness with circuit breakers,” inAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, A. G...

  9. [9]

    Latent adver- sarial training improves robustness to persistent harmful behaviors in llms.ArXiv preprint, abs/2407.15549, 2024

    A. Sheshadri, A. Ewart, P. Guo, A. Lynch, C. Wu, V . Hebbar, H. Sleight, A. C. Stickland, E. Perez, D. Hadfield-Menell, and S. Casper, “Targeted latent adversarial training improves robustness to persistent harmful behaviors in llms,”CoRR, vol. abs/2407.15549,

  10. [10]

    Meng, C., Choi, K., Song, J., and Ermon, S

    [Online]. Available: https://doi.org/10.48550/arXiv. 2407.15549

  11. [11]

    Refusal in Language Models Is Mediated by a Single Direction

    A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Pan- ickssery, W. Gurnee, and N. Nanda, “Refusal in language models is mediated by a single direction, 2024,”URL https://arxiv. org/abs/2406.11717

  12. [12]

    Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language models,

    G. Shen, D. Zhao, Y . Dong, X. He, and Y . Zeng, “Jailbreak antidote: Runtime safety-utility balance via sparse representation adjustment in large language models,”CoRR, vol. abs/2410.02298, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2410.02298

  13. [13]

    Programming refusal with conditional activation steering,

    B. W. Lee, I. Padhi, K. N. Ramamurthy, E. Miehling, P. L. Dognin, M. Nagireddy, and A. Dhurandhar, “Programming refusal with conditional activation steering,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. [Online]. Available: https://openreview.net/forum?id=Oi47wc10sm

  14. [14]

    Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space,

    L. Schwinn, D. Dobre, S. Xhonneux, G. Gidel, and S. G ¨unnemann, “Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space,” inAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 20...

  15. [15]

    arXiv preprint arXiv:2405.20947 , year=

    J. Cui, W.-L. Chiang, I. Stoica, and C.-J. Hsieh, “Or- bench: An over-refusal benchmark for large language models,”arXiv preprint arXiv:2405.20947, 2024

  16. [16]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,

    M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. A. Forsyth, and D. Hendrycks, “Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,” inForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21- 27, 2024. OpenReview.net, 2024. [Online]. ...

  17. [17]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” arXiv preprint arXiv:2309.00614, 2023

  18. [18]

    Defending large language models against jailbreak attacks via semantic smoothing,

    J. Ji, B. Hou, A. Robey, G. J. Pappas, H. Hassani, THIS PAPER IS SUBMITTED TO TIFS 14 Y . Zhang, E. Wong, and S. Chang, “Defending large language models against jailbreak attacks via semantic smoothing,”CoRR, vol. abs/2402.16192, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.16192

  19. [19]

    SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,”arXiv preprint arXiv:2310.03684, 2023

  20. [20]

    Defending against alignment-breaking attacks via robustly aligned llm,

    B. Cao, Y . Cao, L. Lin, and J. Chen, “Defending against alignment-breaking attacks via robustly aligned llm,” arXiv preprint arXiv:2309.14348, 2023

  21. [21]

    Intention analysis makes llms a good jailbreak defender,

    Y . Zhang, L. Ding, L. Zhang, and D. Tao, “Intention analysis makes llms a good jailbreak defender,”CoRR abs/2401.06561, vol. 12, p. 14, 2024

  22. [22]

    Gradient cuff: De- tecting jailbreak attacks on large language models by exploring refusal loss landscapes,

    X. Hu, P.-Y . Chen, and T.-Y . Ho, “Gradient cuff: De- tecting jailbreak attacks on large language models by exploring refusal loss landscapes,” inNeural Information Processing Systems, 2024

  23. [23]

    Gradsafe: detect- ing unsafe prompts for llms via safety-critical gradient analysis,

    Y . Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: detect- ing unsafe prompts for llms via safety-critical gradient analysis,” inProc. 62nd Annual Meeting of the Associa- tion for Computational Linguistics (Long Papers), 2024

  24. [24]

    HSF: defending against jailbreak attacks with hidden state filtering,

    C. Qian, H. Zhang, L. Sha, and Z. Zheng, “HSF: defending against jailbreak attacks with hidden state filtering,” inCompanion Proceedings of the ACM on Web Conference 2025, WWW 2025, Sydney, NSW, Australia, 28 April 2025 - 2 May 2025, G. Long, M. Blumestein, Y . Chang, L. Lewin-Eytan, Z. H. Huang, and E. Yom- Tov, Eds. ACM, 2025, pp. 2078–2087. [Online]. A...

  25. [25]

    Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states,

    Y . Jiang, X. Gao, T. Peng, Y . Tan, X. Zhu, B. Zheng, and X. Yue, “Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states,”CoRR, vol. abs/2502.14744, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502. 14744

  26. [26]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnieret al., “Mistral 7b,”arXiv preprint arXiv:2310.06825, 2023

  27. [27]

    Qwen2.5 Technical Report

    Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

  28. [28]

    The llama 3 herd of models,

    A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,”arXiv e-prints, pp. arXiv–2407, 2024

  29. [29]

    Enhancing chat language models by scaling high-quality instructional conversations,

    N. Ding, Y . Chen, B. Xu, Y . Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou, “Enhancing chat language models by scaling high-quality instructional conversations,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computation...

  30. [30]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,”arXiv preprint arXiv:2009.03300, 2020

  31. [31]

    Judging llm-as-a-judge with mt-bench and chatbot arena,

    L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xinget al., “Judging llm-as-a-judge with mt-bench and chatbot arena,”Ad- vances in neural information processing systems, vol. 36, pp. 46 595–46 623, 2023

  32. [32]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023

  33. [33]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    X. Liu, N. Xu, M. Chen, and C. Xiao, “Autodan: Generat- ing stealthy jailbreak prompts on aligned large language models,”arXiv preprint arXiv:2310.04451, 2023

  34. [34]

    GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

    J. Yu, X. Lin, Z. Yu, and X. Xing, “GPTFUZZER: red teaming large language models with auto-generated jailbreak prompts,”CoRR, vol. abs/2309.10253, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309. 10253

  35. [35]

    Jailbreaking Black Box Large Language Models in Twenty Queries

    P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,”arXiv preprint arXiv:2310.08419, 2023

  36. [36]

    T. B. Thompson and M. Sklar. Breaking circuit breakers. [Online]. Available: https://confirmlabs.org/posts/circuit breaking.html

  37. [37]

    Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025

    L. Sheng, C. Shen, W. Zhao, J. Fang, X. Liu, Z. Liang, X. Wang, A. Zhang, and T.-S. Chua, “Alphasteer: Learn- ing refusal steering with principled null-space con- straint,”arXiv preprint arXiv:2506.07022, 2025