pith. machine review for the scientific record.

arxiv: 2605.12813 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI · cs.CR · cs.LG

Recognition: no theorem link

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Buyun Liang, Darshan Thaker, Fengrui Tian, Hamed Hassani, Jinqi Luo, Kaleab A. Kinfu, Kwan Ho Ryan Chan, Liangzu Peng, René Vidal

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:08 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CR · cs.LG
keywords adversarial attacks · LLM hallucinations · latent space optimization · realistic prompts · semantic equivalence · editing directions · constrained optimization

The pith

Optimizing continuous combinations of input-dependent latent editing directions produces realistic adversarial prompts that elicit hallucinations in large language models, including reasoning models where prior realistic attacks fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates hallucination elicitation as finding semantically coherent adversarial prompts equivalent to benign inputs. It builds an input-dependent dictionary of editing directions in latent space, where each direction corresponds to a valid rephrasing, then optimizes continuous linear combinations of those directions. This hybrid approach aims to retain the flexibility of continuous latent search while ensuring decoded outputs stay coherent and equivalent, unlike purely discrete rephrasings or unconstrained latent attacks. A sympathetic reader would care because the method reaches large reasoning models under free-form response settings, where existing realistic attacks do not succeed.
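The hybrid search described above can be sketched in miniature: instead of optimizing a free latent vector, the attack optimizes only the weights of a combination of precomputed editing directions, so every candidate stays inside the span of verified rephrasings. The 3-D latent space, `toy_loss`, and the gradient-free random search below are illustrative stand-ins, not the paper's implementation.

```python
# Hedged sketch of a REALISTA-style search loop: optimize a weight vector
# alpha over precomputed editing directions rather than a free latent vector.
# toy_loss is a placeholder for the real hallucination-elicitation objective.
import random

def combine(z0, directions, alpha):
    """Latent edit: z = z0 + sum_i alpha_i * d_i (element-wise, pure Python)."""
    z = list(z0)
    for a, d in zip(alpha, directions):
        for j in range(len(z)):
            z[j] += a * d[j]
    return z

def toy_loss(z):
    """Stand-in objective: distance to an arbitrary target latent. A real
    attack would instead score how strongly the decoded prompt hallucinates."""
    target = [1.0, -0.5, 0.25]
    return sum((zj - tj) ** 2 for zj, tj in zip(z, target))

def random_search(z0, directions, iters=200, seed=0):
    """Gradient-free search over weights normalized to sum to 1, so each
    candidate edit is a convex combination of the valid directions."""
    rng = random.Random(seed)
    best_alpha, best_val = None, float("inf")
    for _ in range(iters):
        raw = [rng.random() for _ in directions]
        s = sum(raw)
        alpha = [r / s for r in raw]
        val = toy_loss(combine(z0, directions, alpha))
        if val < best_val:
            best_alpha, best_val = alpha, val
    return best_alpha, best_val

z0 = [0.0, 0.0, 0.0]
directions = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.5, 0.0, 0.5]]
alpha, val = random_search(z0, directions)
```

Restricting the search to the directions' span is what distinguishes this from an unconstrained latent attack; the paper's actual optimizer and constraints (problem (12) in Figure 3) are more involved.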

Core claim

REALISTA constructs an input-dependent dictionary of valid editing directions in latent space, each corresponding to a semantically equivalent and coherent rephrasing of the original prompt, and optimizes continuous combinations of these directions to generate adversarial prompts that elicit LLM hallucinations while preserving realism and equivalence.

What carries the argument

Input-dependent dictionary of valid editing directions in latent space, which enables optimization of continuous combinations that decode to coherent rephrasings.
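Following the notation in the paper's Figure 2, each column of the dictionary D is a direction z(i) = c(i) − z0 from the original prompt's latent toward a verified rephrasing's latent. A minimal sketch, assuming a hypothetical `encode()` stand-in for the model's actual prompt encoder:

```python
# Minimal sketch of input-dependent edit dictionary construction.
# encode() is a hypothetical stand-in, not the paper's encoder.

def encode(prompt):
    """Toy deterministic encoder: hash characters into a fixed 4-D latent."""
    vec = [0.0] * 4
    for k, ch in enumerate(prompt):
        vec[k % 4] += (ord(ch) % 32) / 32.0
    return vec

def build_edit_dictionary(original_prompt, rephrasings):
    """Return (z0, D): each column of D is the direction from the original
    latent z0 toward the latent of one verified equivalent rephrasing."""
    z0 = encode(original_prompt)
    D = [[c - z for c, z in zip(encode(r), z0)] for r in rephrasings]
    return z0, D

z0, D = build_edit_dictionary(
    "What year did the Berlin Wall fall?",
    ["In which year did the Berlin Wall come down?",
     "The Berlin Wall fell in what year?"],
)
```

By construction, z0 plus any single column lands exactly on a valid rephrasing's latent; the load-bearing question, raised in the referee report, is whether combinations of columns stay valid too.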

If this is right

  • The method succeeds in attacking large reasoning models under free-form response settings where prior realistic attacks fail.
  • It achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs.
  • The constrained optimization framework combines the search richness of continuous latent attacks with the semantic guarantees of discrete rephrasing attacks.
  • It provides a practical way to generate realistic adversarial prompts for testing hallucination vulnerabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-direction approach could be applied to elicit other model failures such as biased outputs or unsafe responses.
  • Generated adversarial prompts could serve as synthetic data for training more robust LLMs against hallucinations.
  • The input-dependent dictionary construction might reveal which kinds of subtle rephrasings most reliably surface hallucination triggers in reasoning models.

Load-bearing premise

Continuous combinations of the input-dependent editing directions in latent space will decode to prompts that remain semantically equivalent and coherent rephrasings of the original benign prompt.

What would settle it

Decoding the optimized latent vectors and checking whether the resulting prompts are mostly incoherent or semantically divergent from the original input; if they are, or if they elicit no more hallucinations than discrete baselines on reasoning models, the central claim would not hold.
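The settling test could be automated along these lines. A real audit would use an NLI entailment model or human ratings; the word-overlap score below is only a cheap illustrative proxy, and the threshold is an assumption.

```python
# Hedged sketch of the proposed check: given decoded adversarial prompts,
# estimate what fraction remain equivalent to the original. Word-set
# Jaccard overlap is a toy proxy for a proper entailment-based metric.

def word_jaccard(a, b):
    """Similarity proxy in [0, 1]: overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def audit_equivalence(original, decoded_prompts, threshold=0.5):
    """Fraction of decoded prompts judged equivalent to the original; a low
    value would undercut the attack's realism claim."""
    kept = [p for p in decoded_prompts
            if word_jaccard(original, p) >= threshold]
    return len(kept) / len(decoded_prompts)

rate = audit_equivalence(
    "what year did the berlin wall fall",
    ["in what year did the berlin wall fall",
     "the capital of france is paris"],
)
```

Here one of the two decoded prompts survives the check, so `rate` is 0.5; in the paper's setting the same ratio, computed with a real equivalence metric, is exactly what the referee asks to see alongside attack success rates.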

Figures

Figures reproduced from arXiv: 2605.12813 by Buyun Liang, Darshan Thaker, Fengrui Tian, Hamed Hassani, Jinqi Luo, Kaleab A. Kinfu, Kwan Ho Ryan Chan, Liangzu Peng, René Vidal.

Figure 1: Illustrative example of attack generation in REALISTA.
Figure 2: (Left) Input-dependent edit dictionary construction. We employ a concept optimization procedure to construct a set of latent concepts c(1), …, c(n) conditioned on the original prompt x0 and WordNet (Miller, 1995). These concepts are assembled into an edit dictionary D, where each column corresponds to an interpretable editing direction z(i) = c(i) − z0. See §3.1 and Appendix §C for details on the…
Figure 3: Optimization trajectory when solving (12). At each optimization iteration, the objective value and the maximum constraint violation are reported as bootstrap means (10,000 resamples) computed over the MMLU subset.
Figure 6: Normalized word edit distance across layers between original prompts and their reconstructions. Dots indicate the best-performing layer for each model, and shaded regions show standard deviation over 10,000 bootstrap samples. Reconstruction quality degrades as depth increases. The reconstruction quality of decoder ψ directly affects our latent-space optimization in REALISTA. To assess this, we conduct an e…
Figure 7: Top 50 most frequently used concepts.
Figure 8: Top 50–100 most frequently used concepts.
Original abstract

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes REALISTA, a latent-space adversarial attack framework to elicit hallucinations in LLMs. It constructs an input-dependent dictionary of discrete valid rephrasing directions in latent space and optimizes continuous combinations of these directions to produce semantically coherent adversarial prompts that remain equivalent to the original benign input. Experiments claim that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds against large reasoning models in free-form response settings where prior realistic attacks fail. Code is released.

Significance. If the central assumption holds, the work would meaningfully advance realistic adversarial testing of LLMs by combining the search flexibility of continuous latent methods with the semantic guarantees of discrete rephrasing attacks. The reported success on reasoning models under free-form conditions addresses a documented gap in prior realistic attacks and could inform robustness evaluation practices. Reproducibility via released code is a positive factor.

major comments (2)
  1. [§3.2] §3.2 (Method): the optimization of continuous (linear) combinations of input-dependent editing directions assumes that arbitrary convex combinations remain within the decoder's region of valid, semantically equivalent rephrasings. No analysis or bound is provided showing that the latent directions form a subspace closed under the decoder; non-linear interactions could produce drifted or incoherent outputs while still optimizing the attack objective. This assumption is load-bearing for the superiority claim on reasoning models.
  2. [§5.3] §5.3 (Experiments on reasoning models): the reported success rates lack accompanying metrics confirming that the generated prompts remain semantically equivalent to the originals (e.g., via entailment scores, human ratings, or automatic similarity thresholds). Without such checks, it is unclear whether the performance gain stems from the continuous search or from inadvertently relaxed equivalence constraints.
minor comments (2)
  1. [Figure 2] Figure 2: axis labels and legend are too small for readability; increase font size and add a caption clarifying what the plotted directions represent.
  2. [Related Work] Related Work: the distinction between REALISTA and prior continuous latent attacks (e.g., those using direct latent optimization without the dictionary constraint) could be made more explicit with a short comparison table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our method's theoretical grounding and experimental validation. We address each point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Method): the optimization of continuous (linear) combinations of input-dependent editing directions assumes that arbitrary convex combinations remain within the decoder's region of valid, semantically equivalent rephrasings. No analysis or bound is provided showing that the latent directions form a subspace closed under the decoder; non-linear interactions could produce drifted or incoherent outputs while still optimizing the attack objective. This assumption is load-bearing for the superiority claim on reasoning models.

    Authors: We agree that a formal analysis of closure under convex combinations would strengthen the presentation. Our construction derives each editing direction from a valid decoder output (i.e., a semantically equivalent rephrasing), and the optimization is performed only over directions that individually decode to coherent text. In practice, the resulting combinations remain within the valid region, as evidenced by the high semantic similarity scores we observe. In the revision we will add to §3.2 an explicit discussion of this assumption together with empirical measurements (entailment scores and cosine similarity in embedding space) across a range of combination coefficients, providing quantitative support for the validity of the interpolated prompts. revision: yes

  2. Referee: [§5.3] §5.3 (Experiments on reasoning models): the reported success rates lack accompanying metrics confirming that the generated prompts remain semantically equivalent to the originals (e.g., via entailment scores, human ratings, or automatic similarity thresholds). Without such checks, it is unclear whether the performance gain stems from the continuous search or from inadvertently relaxed equivalence constraints.

    Authors: We acknowledge that additional quantitative checks would make the equivalence claim more transparent. The original experiments relied on the fact that each basis direction is a verified valid rephrasing and on manual verification of coherence for the final outputs. For the revised version we will augment §5.3 with automatic semantic-equivalence metrics (entailment probability from a fine-tuned NLI model and sentence-level embedding similarity) computed on all adversarial prompts used in the reasoning-model experiments. These metrics will be reported alongside the attack success rates to confirm that equivalence constraints were not relaxed. revision: yes
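The measurement the rebuttal promises, similarity tracked across a range of combination coefficients, might look like the sketch below. The toy latent space and cosine metric are assumptions for illustration; the rebuttal's actual proposed metrics (NLI entailment probability, sentence-embedding similarity) are commitments about the revision, not released code.

```python
# Sketch of a coefficient sweep: scale an editing direction and track how
# far the edited latent drifts from the original, via cosine similarity.
# Toy 3-D latents stand in for real model activations.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sweep_similarity(z0, direction, scales):
    """Similarity of z0 + s * d to z0 for each scale s. A smooth, monotone
    decay would support bounding how far an edit can be pushed before the
    decoded prompt risks drifting from the original meaning."""
    out = []
    for s in scales:
        z = [a + s * b for a, b in zip(z0, direction)]
        out.append(cosine(z0, z))
    return out

sims = sweep_similarity([1.0, 1.0, 0.0], [0.0, 0.0, 1.0],
                        [0.0, 0.5, 1.0, 2.0])
```

With an orthogonal direction, similarity starts at 1.0 at scale 0 and decreases monotonically as the coefficient grows, which is the qualitative pattern the added measurements would need to exhibit for the equivalence claim to hold.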

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper constructs an input-dependent dictionary of discrete valid rephrasing directions and optimizes continuous combinations in latent space to generate adversarial prompts. Performance claims rest on direct experimental comparisons to prior realistic attacks across open-source LLMs and large reasoning models, without any reduction of success metrics to fitted parameters, self-definitional equivalences, or load-bearing self-citations. The assumption that convex combinations decode to coherent equivalents is presented as an empirical property tested in the evaluation, not a definitional tautology or imported uniqueness result. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the constructed editing directions preserve semantic equivalence when combined continuously; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Editing directions in the input-dependent dictionary correspond to semantically equivalent and coherent rephrasings.
    This premise is required for the optimized prompts to remain realistic and is stated as the basis for the dictionary construction.

pith-pipeline@v0.9.0 · 5561 in / 1147 out tokens · 56198 ms · 2026-05-14T20:08:45.207424+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

287 extracted references · 175 canonical work pages · 38 internal anchors
