pith. machine review for the scientific record.

arxiv: 2604.22167 · v1 · submitted 2026-04-24 · 💻 cs.LG · cs.AI

Estimating Tail Risks in Language Model Output Distributions

Pith reviewed 2026-05-08 12:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords language model safety · tail risk estimation · importance sampling · rare harmful outputs · misuse and misalignment · output distribution · safety evaluation

The pith

Creating unsafe versions of language models allows accurate estimation of the probability of rare harmful outputs with 10-20 times fewer samples than brute force.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models deployed at scale will occasionally produce harmful outputs even if those events are rare. The paper develops an importance-sampling method to estimate the probability of such events for any given input: the model is temporarily altered so that harmful outputs become more likely, samples are drawn from the altered model, and the results are reweighted to recover the original model's probabilities. The resulting estimates match those from exhaustive sampling while requiring far fewer model queries. The authors also show that these probabilities can reveal how sensitive a model is to changes in its inputs and can help forecast risks in real deployment.

Core claim

The paper establishes that operationalizing importance sampling through the creation of unsafe variants of the target language model enables sample-efficient estimation of the probability of harmful outputs: the estimates match brute-force Monte Carlo sampling in accuracy while using 10-20x fewer samples, e.g., recovering probabilities on the order of 10^-4 with only 500 samples.

What carries the argument

Importance sampling performed by constructing unsafe versions of the language model that increase the likelihood of harmful outputs, allowing reweighting to recover the original probabilities.
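
To make the mechanism concrete, here is a minimal, self-contained sketch of the estimator on a toy discrete distribution standing in for a model's output distribution. The distributions, the harm set, and every name below are illustrative assumptions for exposition, not the paper's actual construction (which steers activations of a real language model):

    # Toy stand-in for the paper's estimator; all quantities are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    V = 1000                                # toy output space
    target = rng.dirichlet(np.ones(V))      # stand-in for p_target(y | x)
    harmful = np.zeros(V, dtype=bool)
    harmful[:5] = True                      # a handful of "harmful" outputs
    target[harmful] *= 1e-3                 # make harm rare under the target
    target /= target.sum()

    # "Unsafe" proposal: same support, harm upweighted (cf. activation steering)
    proposal = target.copy()
    proposal[harmful] *= 1e3
    proposal /= proposal.sum()

    p_true = target[harmful].sum()          # known exactly in the toy case

    def mc_estimate(k):
        y = rng.choice(V, size=k, p=target)
        return harmful[y].mean()

    def is_estimate(k):
        y = rng.choice(V, size=k, p=proposal)
        w = target[y] / proposal[y]         # importance weights
        return (harmful[y] * w).mean()

    k = 500
    print(f"true       {p_true:.2e}")
    print(f"MC (k={k}) {mc_estimate(k):.2e}")  # often exactly 0 at this k
    print(f"IS (k={k}) {is_estimate(k):.2e}")  # concentrates near the truth

At k = 500 the naive Monte Carlo estimate is frequently zero, while the reweighted estimate lands near the true tail probability; this is the sample-efficiency gap the paper quantifies on real models.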

Load-bearing premise

Unsafe versions of the target model can be constructed so that importance sampling produces unbiased estimates of the original model's probabilities for harmful outputs.
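
Stated precisely, this premise is the standard importance-sampling identity. A minimal rendering in notation assumed here (p is the target model, q the unsafe variant, H the set of harmful outputs), not taken verbatim from the paper:

    \[
      p_{\mathrm{harm}}(x)
        \;=\; \mathbb{E}_{y \sim p(\cdot \mid x)}\!\big[\mathbf{1}\{y \in \mathcal{H}\}\big]
        \;=\; \mathbb{E}_{y \sim q(\cdot \mid x)}\!\Big[\mathbf{1}\{y \in \mathcal{H}\}\,
              \frac{p(y \mid x)}{q(y \mid x)}\Big],
      \qquad
      \hat{p}_k \;=\; \frac{1}{k}\sum_{i=1}^{k}
              \mathbf{1}\{y_i \in \mathcal{H}\}\,\frac{p(y_i \mid x)}{q(y_i \mid x)},
      \quad y_i \sim q(\cdot \mid x).
    \]

Unbiasedness requires only that q assign nonzero probability wherever p does on harmful outputs, and that both likelihoods be computable exactly (true for open-weight autoregressive models, where per-token log-probabilities can be summed). How q is constructed affects the variance of the estimator, not its bias.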

What would settle it

A direct comparison on the same set of inputs where the importance sampling estimates are computed with 500 samples and brute-force Monte Carlo is run with 10,000 or more samples; if the two do not agree within expected statistical variation, the method's accuracy is not supported.
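
A hedged sketch of that agreement check in code; the z-style test and all names are illustrative, not the paper's procedure:

    import numpy as np

    def agree_within_noise(mc_hits, is_terms, z=1.96):
        """mc_hits: 0/1 harm indicators from brute-force target-model samples.
        is_terms: per-proposal-sample values 1[harm] * p_target / p_unsafe.
        True if the estimates differ by less than z pooled standard errors."""
        p_mc, p_is = mc_hits.mean(), is_terms.mean()
        se = np.sqrt(mc_hits.var(ddof=1) / len(mc_hits)
                     + is_terms.var(ddof=1) / len(is_terms))
        return abs(p_mc - p_is) <= z * se

For probabilities near 10^-4 the normal approximation is rough at these sample sizes, so exact binomial or bootstrap intervals would be preferable in practice, a point the referee report below presses on.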

Figures

Figures reproduced from arXiv: 2604.22167 by He He, Kathleen McKeown, Raghav Singhal, Rajesh Ranganath, Rico Angell, Zachary Horvitz, Zhou Yu.

Figure 1: Estimating probabilities of rare harmful outputs. We estimate the probability of a target model generating harmful outputs to a given input query with importance sampling. Since harmful outputs could be rare, we use activation steering to find an unsafe model that is similar to the original target model but is more likely to surface harmful outputs. By sampling responses from the unsafe model and then rewe…

Figure 2: Repeated sampling yields unsafe outputs from “safe” models. We consider the 313 STRONGREJECT (Souly et al., 2024) queries without jailbreaks, and evaluate model outputs with the STRONGREJECT judge: we say that an output is harmful if the judge score is ≥ 0.75. Repeatedly sampling with temperature 1.0 eventually yields at least one harmful output for significantly more queries, indicating that in many context…

Figure 3: Importance sampling converges faster than naive Monte Carlo estimation. In this plot, we compare naive Monte Carlo estimation (MC) and importance sampling (IS) against expensive brute-force estimates (computed with 10,000 samples). The y-axis is the average log-ratio between the two estimators and the brute-force estimate. We observe that importance sampling leads to much faster convergence to the brute f…

Figure 4: Importance Sampling (IS) vs. Monte Carlo (MC) Estimates. Here we plot the brute-force Monte Carlo (k = 10,000) estimates of harm on the x-axis and the corresponding importance sampling estimates (k = 1,000) on the y-axis on the STRONGREJECT benchmark. For visual convenience, we clamp estimates to 10^-6. The red dashed line signifies exact agreement when y = x. We note that most queries lie on this line, a…

Figure 5: Sample-Efficiency of Misalignment Estimation. In these plots, we compare the sample-efficiency of estimators measuring sycophancy and hallucination rates using both naive Monte Carlo (MC) sampling and importance sampling (IS). We compare these two estimators against pseudo ground-truth brute-force Monte Carlo estimates computed with 10,000 samples. The y-axis is the average log-ratio between the estimator…

Figure 6: Effect of rewriting texts on harmfulness probabilities. We generate 25 rewrites of harmful queries using Grok-3 (xAI, 2025). We then compare their harm estimates with harm estimates from the original queries. We observe that rewrites can shift harmfulness probabilities by multiple orders of magnitude.

Figure 7: Predicting Worst Case Risk on Unseen Queries. In this plot, we show that using QRISK estimates from the evaluation can provide accurate predictions of the likelihood of worst-case risk on unseen queries.

Figure 8: Optimizing the proposal model reduces estimator variance. In this experiment, we measure the mean-squared error between the importance sampling estimate and the brute-force Monte Carlo estimate for the optimized proposal model as well as proposal models with varying steering coefficients. We observe that for all models, the optimized proposal has the lowest variance.

Figure 9: Queries with harmful outputs vs. number of samples per query, alongside importance sampling and naive Monte Carlo estimates. We plot the number of non-zero queries estimated by k = 1,000 Monte Carlo sampling versus k = 500 importance sampling. Even with 2× the samples, Monte Carlo estimates significantly underestimate the number of queries with non-zero harmfulness probabilities.

Figure 10: Worst rewrite harmfulness versus importance sampling estimates. We sample k = 200 outputs per query, and compare the percent with harmful output against the importance sampling estimate (k = 500) from fig. 6. Note: we exclude samples with estimates of zero.
original abstract

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an importance sampling method to estimate tail probabilities of harmful outputs in language models by constructing 'unsafe' versions of the target model as proposal distributions. It claims these estimates match brute-force Monte Carlo on misuse and misalignment benchmarks while using 10-20x fewer samples, e.g., recovering probabilities of order 10^{-4} from 500 proposal samples, and that the estimates can reveal input sensitivity and predict deployment risks.

Significance. If the unbiasedness of the importance sampling procedure holds, the work could meaningfully advance safety evaluations by enabling efficient quantification of rare harmful events at deployment scale. The claimed empirical agreement with Monte Carlo and public code release are positive for reproducibility, but the central claim depends on validation details that are not yet fully substantiated.

major comments (3)
  1. [Abstract] Abstract: the reported match between IS estimates and brute-force MC lacks specification of the exact MC sample size N, confidence intervals on the MC estimator, or convergence checks across independent runs. For target probabilities p ≈ 10^{-4}, even N = 10,000 yields an expected event count of 1 and a relative standard error near 100% (the arithmetic is spelled out after this list); without these diagnostics, agreement between two noisy estimators does not confirm that the IS procedure recovers the true tail probability.
  2. [Methods] Methods (importance sampling construction): the claim that unsafe model variants yield unbiased estimates of the original model's harmful output probabilities requires an explicit derivation showing that the importance weights fully correct for any distributional shift introduced by the modification process. The weakest assumption—that such modifications introduce no systematic bias—must be validated, e.g., via a controlled experiment on a known distribution or a proof that the proposal remains a valid importance sampler.
  3. [Experiments] Experiments: the manuscript must report data exclusion rules, exact benchmark query sets, and any filtering applied before sampling, because these choices directly affect whether the reported agreement with MC supports the central claim of accurate rare-event estimation across the full output distribution.
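
The arithmetic behind major comment 1, using only the standard binomial standard error (no paper-specific assumptions):

    \[
      \widehat{p} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_i,
      \qquad
      \frac{\operatorname{SE}(\widehat{p})}{p}
        = \frac{1}{p}\sqrt{\frac{p(1-p)}{N}}
        \approx \frac{1}{\sqrt{Np}},
      \qquad
      N = 10^{4},\ p = 10^{-4}
      \;\Rightarrow\; Np = 1,\ \text{relative SE} \approx 100\%.
    \]
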
minor comments (2)
  1. [Abstract] Abstract: the statement that harmfulness estimates 'can reveal the sensitivity of models to perturbations in model input' would benefit from a concrete example or figure reference in the main text to clarify the operationalization.
  2. [Appendix] The code repository link is a strength for reproducibility; the manuscript should include a brief methods appendix summarizing the unsafe-model construction hyperparameters so readers can replicate without inspecting the repository.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which identify key areas for improving the statistical rigor, theoretical grounding, and transparency of our work. We have revised the manuscript accordingly and address each major comment below.

point-by-point responses
  1. Referee: [Abstract] Abstract: the reported match between IS estimates and brute-force MC lacks specification of the exact MC sample size N, confidence intervals on the MC estimator, or convergence checks across independent runs. For target probabilities p≈10^{-4}, even N=10,000 yields an expected event count of 1 and relative standard error near 100%; without these diagnostics, agreement between two noisy estimators does not confirm that the IS procedure recovers the true tail probability.

    Authors: We agree that these details are necessary to substantiate the comparison. In the revised manuscript we have added the exact MC sample sizes used (sufficient to observe multiple events for the reported probabilities), bootstrap-derived 95% confidence intervals on the MC estimates, and results from multiple independent runs demonstrating convergence of both estimators. These additions show that the IS estimates lie within the MC confidence intervals (a sketch of this check appears after these responses). revision: yes

  2. Referee: [Methods] Methods (importance sampling construction): the claim that unsafe model variants yield unbiased estimates of the original model's harmful output probabilities requires an explicit derivation showing that the importance weights fully correct for any distributional shift introduced by the modification process. The weakest assumption—that such modifications introduce no systematic bias—must be validated, e.g., via a controlled experiment on a known distribution or a proof that the proposal remains a valid importance sampler.

    Authors: We appreciate this request for formal justification. The revised Methods section now contains an explicit derivation establishing that the importance weights w(y) = p_target(y | x) / p_unsafe(y | x) recover the target expectation without bias, provided the unsafe variant preserves support over the target distribution (which holds under our logit-modification and fine-tuning procedures). We have also added a controlled validation on a synthetic autoregressive model with known closed-form tail probabilities, confirming that the estimator is unbiased. revision: yes

  3. Referee: [Experiments] Experiments: the manuscript must report data exclusion rules, exact benchmark query sets, and any filtering applied before sampling, because these choices directly affect whether the reported agreement with MC supports the central claim of accurate rare-event estimation across the full output distribution.

    Authors: We agree that full experimental transparency is required. The revised Experiments section now specifies the complete benchmark query sets (with sizes and public references), states that no queries were excluded beyond standard length-based truncation, and confirms that no post-sampling filtering on harmfulness or other criteria was applied. All generated outputs were retained for the probability estimates. revision: yes
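
A minimal sketch of the confidence-interval check promised in response 1; the percentile bootstrap and all names are illustrative assumptions, and the revised manuscript's actual diagnostics may differ:

    import numpy as np

    def bootstrap_ci(hits, n_boot=10_000, alpha=0.05, seed=0):
        """hits: array of 0/1 harm indicators from N brute-force samples.
        Returns a percentile-bootstrap (1 - alpha) interval for the harm rate."""
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(hits), size=(n_boot, len(hits)))
        means = hits[idx].mean(axis=1)
        return np.quantile(means, [alpha / 2, 1 - alpha / 2])

    # usage: lo, hi = bootstrap_ci(mc_hits); agreement means lo <= is_estimate <= hi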

Circularity Check

0 steps flagged

No significant circularity; estimator validated externally against Monte Carlo baselines

full rationale

The paper's core contribution is an importance-sampling procedure that constructs unsafe proposal models to estimate rare harmful-output probabilities under the original target model. This follows the standard unbiased IS identity p(harm) = E_proposal[ indicator(harm) * (target_prob / proposal_prob) ], with the proposal obtained by an explicit (non-self-referential) modification step whose bias properties are asserted as an assumption rather than derived from the target quantity. The reported results consist of direct numerical comparisons to independent brute-force Monte Carlo runs on the same benchmarks; no fitted parameter is relabeled as a prediction, no uniqueness theorem is imported from prior self-work, and no ansatz is smuggled via citation. The derivation chain therefore remains self-contained and does not reduce to its own inputs by construction.
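
One standard diagnostic that would further support this verdict, sketched here as a suggestion rather than something the paper reports: the effective sample size of the importance weights, which flags proposals poorly matched to the target even when the estimator is formally unbiased (cf. Chatterjee and Diaconis, 2018, ref. [8]):

    import numpy as np

    def effective_sample_size(log_w):
        """log_w: per-sample log importance weights, log p_target - log p_unsafe.
        Returns ESS in [1, k]; values far below k signal a mismatched proposal."""
        w = np.exp(log_w - np.max(log_w))   # subtract max for numerical stability
        w = w / w.sum()                     # self-normalize
        return 1.0 / np.sum(w ** 2)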

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the claim rests on standard importance sampling assumptions and the feasibility of creating unsafe model variants that increase harmful output probability while preserving reweighting validity. No explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5558 in / 1133 out tokens · 63679 ms · 2026-05-08T12:23:01.623617+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 24 canonical work pages · 12 internal anchors

  2. [2] Abdin, M. et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

  3. [3] Andriushchenko, M., Croce, F., and Flammarion, N. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=hXA8wqRdyV

  4. [4] Anthropic. Claude Sonnet 4.5 system card. Technical report, September 2025. URL https://www.anthropic.com/claude-sonnet-4-5-system-card

  5. [5] Arditi, A., Obeso, O. B., Syed, A., Paleka, D., Rimsky, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=pH3XAQME6c

  6. [6] Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., and Evans, O. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs, 2025. URL https://arxiv.org/abs/2502.17424

  7. [7] Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries, 2024. URL https://arxiv.org/abs/2310.08419

  8. [8] Chatterjee, S. and Diaconis, P. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.

  9. [9] Chen, R., Arditi, A., Sleight, H., Evans, O., and Lindsey, J. Persona vectors: Monitoring and controlling character traits in language models, 2025. URL https://arxiv.org/abs/2507.21509

  10. [10] Chopin, N., Papaspiliopoulos, O., et al. An Introduction to Sequential Monte Carlo, volume 4. Springer, 2020.

  11. [11] Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models, 2023. URL https://arxiv.org/abs/2309.08600

  12. [12] De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.

  13. [13] De Haan, L. and Ferreira, A. Extreme Value Theory: An Introduction. Springer, 2006.

  14. [14] Dubey, A., Grattafiori, A., Jauhri, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  15. [15] Ettinger, A. et al. Olmo 3. arXiv preprint arXiv:2512.13961, 2025.

  16. [16] Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., and Hubinger, E. Alignment faking in large language models, 2024a. URL https://arxiv.org/abs/2412.14093

  17. [17] Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. AI control: Improving safety despite intentional subversion, 2024b. URL https://arxiv.org/abs/2312.06942

  18. [18] Hesterberg, T. C. Advances in Importance Sampling. PhD thesis, Stanford University, 1988.

  19. [19] Huang, Y., Gupta, S., Xia, M., Li, K., and Chen, D. Catastrophic jailbreak of open-source LLMs via exploiting generation, 2023. URL https://arxiv.org/abs/2310.06987

  20. [20] Huang, Y., Gupta, S., Xia, M., Li, K., and Chen, D. Catastrophic jailbreak of open-source LLMs via exploiting generation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=r42tSSCHPh

  21. [21] Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., and Sharma, M. Best-of-n jailbreaking, 2024. URL https://arxiv.org/abs/2412.03556

  22. [22] Jones, E., Tong, M., Mu, J., Mahfoud, M., Leike, J., Grosse, R., Kaplan, J., Fithian, W., Perez, E., and Sharma, M. Forecasting rare language model behaviors. arXiv preprint arXiv:2502.16797, 2025.

  23. [23] Li, X. L., Chowdhury, N., Johnson, D. D., Hashimoto, T., Liang, P., Schwettmann, S., and Steinhardt, J. Eliciting language model behaviors with investigator agents. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=AulTigiaMv

  24. [24] Liu, X., Li, P., Suh, G. E., Vorobeychik, Y., Mao, Z., Jha, S., McDaniel, P., Sun, H., Li, B., and Xiao, C. AutoDAN-Turbo: A lifelong agent for strategy self-exploration to jailbreak LLMs. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=bhK7U37VW8

  25. [25] MacDiarmid, M., Wright, B., Uesato, J., Benton, J., Kutasov, J., Price, S., Bouscal, N., Bowman, S., Bricken, T., Cloud, A., Denison, C., Gasteiger, J., Greenblatt, R., Leike, J., Lindsey, J., Mikulik, V., Perez, E., Rodrigues, A., Thomas, D., Webson, A., Ziegler, D., and Hubinger, E. Natural emergent misalignment from reward hacking in production RL, 2025.

  26. [26] Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=aajyHYjjsk

  27. [27] Mazeika, M., Zou, A., Mu, N., Phan, L., Wang, Z., Yu, C., Khoja, A., Jiang, F., O'Gara, A., Sakhaee, E., Xiang, Z., Rajabi, A., Hendrycks, D., Poovendran, R., Li, B., and Forsyth, D. TDC 2023 (LLM edition): The Trojan Detection Challenge. In NeurIPS Competition Track, 2023.

  28. [28] Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., and Hendrycks, D. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal, 2024. URL https://arxiv.org/abs/2402.04249

  29. [29] Neumann, A., Kirsten, E., Zafar, M. B., and Singh, J. Position is power: System prompts as a mechanism of bias in large language models (LLMs). In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, FAccT '25, pp. 573–598. ACM, June 2025. doi:10.1145/3715275.3732038. URL http://dx.doi.org/10.1145/3715275.3732038

  30. [30] O'Brien, K., Majercak, D., Fernandes, X., Edgar, R., Bullwinkel, B., Chen, J., Nori, H., Carignan, D., Horvitz, E., and Poursabzi-Sangdeh, F. Steering language model refusal with sparse autoencoders, 2025. URL https://arxiv.org/abs/2411.11296

  31. [31] OpenAI. Update to GPT-5 system card: GPT-5.2. Technical report, December 2025a. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/

  32. [32] OpenAI. gpt-oss-120b and gpt-oss-20b model card, 2025b. URL https://arxiv.org/abs/2508.10925

  33. [33] Owen, A. and Zhou, Y. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143, 2000.

  34. [34] Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., and Turner, A. M. Steering Llama 2 via contrastive activation addition, 2024. URL https://arxiv.org/abs/2312.06681

  35. [35] Robert, C. P. and Casella, G. Monte Carlo Statistical Methods, volume 2. Springer, 1999.

  36. [36] Sharma, M. et al. Constitutional classifiers: Defending against universal jailbreaks across thousands of hours of red teaming, 2025. URL https://arxiv.org/abs/2501.18837

  37. [37] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., and Toyer, S. A StrongREJECT for empty jailbreaks. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=KZLE5BaaOH

  38. [38] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

  39. [39] Team, G. et al. Gemma: Open models based on Gemini research and technology, 2024. URL https://arxiv.org/abs/2403.08295

  40. [40] Team, Q. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  41. [41] Wallace, E., Watkins, O., Wang, M., Chen, K., and Koch, C. Estimating worst-case frontier risks of open-weight LLMs, 2025. URL https://arxiv.org/abs/2508.03153

  42. [42] Wu, G. and Hilton, J. Estimating the probabilities of rare outputs in language models. arXiv preprint arXiv:2410.13211, 2024.

  43. [43] xAI. Grok 3 Beta — the age of reasoning agents, February 2025. URL https://x.ai/news/grok-3. Accessed 2026-01-27.

  44. [44] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models, 2023. URL https://arxiv.org/abs/2307.15043
