{"total":22,"items":[{"citing_arxiv_id":"2606.23716","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice","primary_cat":"cs.CY","submitted_at":"2026-06-16T14:19:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Legal AI benchmarks must evaluate robustness to pro se litigant inputs rather than expert-preprocessed ones to support access-to-justice claims.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12978","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trajectory-Level Redirection Attacks on Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-11T07:12:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A prompt-only attack called command-preserving trajectory redirection can steer VLA robot behavior to attacker-chosen physical outcomes while the text still appears to match the intended task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06924","ref_index":158,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Sampled Outcomes to Capability Distributions: Rethinking Supervision for LLM Routing","primary_cat":"cs.LG","submitted_at":"2026-06-05T05:42:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DARS replaces single-shot response labels with distribution-aware supervision derived from input and output uncertainty to produce more reliable LLM routing 
policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01441","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts","primary_cat":"cs.AI","submitted_at":"2026-05-31T20:20:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An A*-inspired multi-agent framework with hierarchical rewriting and a dynamic gamma parameter generates obfuscated prompts that achieve higher LLM attack success rates with fewer attempts than exhaustive search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01210","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can we trust LLM Self-Explanations for Entity Resolution?","primary_cat":"cs.DB","submitted_at":"2026-05-31T13:00:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM self-explanations for entity resolution are unstable and weakly faithful to causal evidence, but a hybrid framework using them as priors matches post-hoc quality at up to 10x lower cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10516","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability","primary_cat":"cs.AI","submitted_at":"2026-05-11T13:06:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 
rates in diagnosing failures.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[33] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. 2023. URLhttps://arxiv.org/abs/2210. 03629. [34] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan.τ-bench: A benchmark for tool-agent-user interaction in real-world domains. 2024. URLhttps://arxiv.org/abs/2406.12045. [35] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y . Wang, L. Yang, W. Ye, Y . Zhang, N. Z. Gong, and X. Xie. Promptrobust: Towards evaluating the robustness of large language models on adversarial prompts. 2024. URLhttps://arxiv.org/abs/2306.04528. 11 A Proofs A.1 Proof of Theorem 1 Under Assumption 2, the instance-level U-statistics{U m n }M m=1 are independent and identically dis-"},{"citing_arxiv_id":"2605.09041","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence","primary_cat":"cs.CL","submitted_at":"2026-05-09T16:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Prior work has established that LLM benchmark scores are unstable under surface-level prompt variation. 
For LLMs specifically, adversarial prompt perturbations can substantially degrade performance [33], formatting choices alone can shift accuracy by tens of points on chain-of-thought tasks [28], and single-prompt scores can exhibit high variance across semantically equivalent phrasings [27], [34]. A broader robustness literature further shows that model behavior and evaluation outcomes can be highly sensitive to input perturbations, distribution shifts, and evaluation protocols across adversarial-learning and vision-language settings [35]-[41]. This literature frames prompt sensitivity primarily as a property of the model under test, however, rather than as a property of the evaluation"},{"citing_arxiv_id":"2605.04665","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs","primary_cat":"cs.CL","submitted_at":"2026-05-06T09:11:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"cessing tasks, achieving human-level or near-human performance on numerous benchmarks [1]-[3]. However, recent studies have revealed a critical vulnerability: these models exhibit significant brittleness to input variations that preserve the task content while changing its surface realization, producing inconsistent outputs when presented with rewritten prompts [4], [5]. This phenomenon, termed surface-form sensitivity, poses fundamental challenges for the reliable deployment of LLMs in real-world applications where users naturally express queries in diverse linguistic forms. 
The implications of this inconsistency extend beyond mere performance metrics. In conversational AI systems, users frequently reword the substantive content of a task when"},{"citing_arxiv_id":"2604.24712","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation","primary_cat":"cs.SE","submitted_at":"2026-04-27T17:21:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"and out-of-distribution attacks largely intact. The second is robust optimization: adversarial training [73] and randomized smoothing [74] give formal guarantees on restricted input spaces, but extending these guarantees to free-text inputs remains open. 
The third is runtime filtering: classifier-based safeguards such as Llama Guard [75], perplexity thresholds, and content classifiers [76], [77] catch syntactically anomalous jailbreaks but, by construction, fail against semantically coherent attacks (AutoDAN, PAIR) and against gradient-optimized low-perplexity suffixes (GCG). Red-teaming [78], [79] remains the primary empirical method for discovering L1 vulnerabilities before deployment, and watermarking [80] and differentially private training [81] address attribution and extraction respec-"},{"citing_arxiv_id":"2604.23135","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization","primary_cat":"cs.LG","submitted_at":"2026-04-25T04:26:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Paraphrase sensitivity in Lean 4 autoformalization is dominated by code-generation failures that differ between undergraduate and Olympiad datasets across multiple models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16421","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring Representation Robustness in Large Language Models for Geometry","primary_cat":"cs.CL","submitted_at":"2026-04-03T11:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tasks [3, 22], including mathematical problem solving [7, 12], symbolic manipulation [16], 
and logical inference [36]. Recent advances in scale, instruction tuning, and chain-of-thought prompting have led to substantial gains on benchmarks spanning arithmetic, algebra, and geometry [3, 36, 6, 22]. However, growing evidence (sensitivity to prompt phrasing [40], adversarial perturbations [11], and surface-level rewording [27]) suggests LLM performance can be brittle under representational changes [32]. Geometry provides a uniquely structured testbed: the same problem can be expressed via Euclidean, coordinate, or vector representations without altering its semantic content. 1 arXiv:2604.16421v1 [cs.CL] 3 Apr 2026"},{"citing_arxiv_id":"2603.10477","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses","primary_cat":"cs.CL","submitted_at":"2026-03-11T07:00:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PEEM is a multi-criteria LLM-based evaluator for prompts and responses that aligns with standard accuracy while enabling zero-shot prompt optimization via feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03332","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations","primary_cat":"cs.CL","submitted_at":"2026-02-11T03:11:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and ExtraSteps causing minimal 
degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.10102","ref_index":92,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trustworthiness in Retrieval-Augmented Generation Systems: A Survey","primary_cat":"cs.IR","submitted_at":"2024-09-16T09:06:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gaps between LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Goyal, S. Doddapaneni, M. M. Khapra, and B. Ravindran, \"A survey of adversarial defenses and robustness in NLP,\" ACM Comput. Surv. , vol. 55, no. 14s, pp. 332:1-332:39, 2023. [91] Z. Zhang, G. Zhang, B. Hou, W. Fan, Q. Li, S. Liu, Y. Zhang, and S. Chang, \"Certified robustness for large language models with self-denoising,\" CoRR, vol. abs/2307.07171, 2023. [92] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, N. Z. Gong, Y. Zhang, and X. Xie, \"Promptbench: Towards evaluating the robustness of large language models on adversarial prompts,\" CoRR, vol. abs/2306.04528, 2023. [93] Y. Du, A. Bosselut, and C. D. 
Manning, \"Synthetic disinformation attacks on automated fact verification systems,\" in AAAI."},{"citing_arxiv_id":"2406.04244","ref_index":191,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmark Data Contamination of Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-06-06T16:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06922","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Whispers in the Machine: Confidentiality in Agentic Systems","primary_cat":"cs.CR","submitted_at":"2024-02-10T11:07:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05561","ref_index":172,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TrustLLM: Trustworthiness in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-10T22:07:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt 
utility.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"inclusiveness, transparency, and accountability. Moreover, it has proposed DecodingTrust [71], a comprehensive assessment of trustworthiness in GPT models, which considers diverse perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Moreover, PromptBench [172] comprehensively evaluated the robustness of LLMs on prompts with both natural (e.g., typos and synonyms) and adversarial perturbations. Google. Google has also proposed many measures to improve the trustworthiness of their LLMs. For instance, for the Palm API, Google provides users with safety filters [173] to prevent generating harmful content."},{"citing_arxiv_id":"2309.08532","ref_index":137,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers","primary_cat":"cs.CL","submitted_at":"2023-09-15T16:50:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EvoPrompt uses LLMs to run evolutionary operators on populations of prompts, outperforming human-engineered prompts by up to 25% on BIG-Bench Hard tasks across 31 datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.01219","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-09-03T16:56:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A literature survey that taxonomizes hallucination phenomena in LLMs, reviews evaluation 
benchmarks, and analyzes approaches for their detection, explanation, and mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.00614","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Baseline Defenses for Adversarial Attacks Against Aligned Language Models","primary_cat":"cs.LG","submitted_at":"2023-09-01T17:59:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.15043","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-27T17:49:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}