When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Gelei Deng; Han Qiu; Haoran Ou; Jie Zhang; Kangjie Chen; Tianwei Zhang; Xingshuo Han

arxiv: 2510.09689 · v3 · submitted 2025-10-09 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-18 09:26 UTC · model grok-4.3

keywords red-teaming · web-augmented LLMs · safety vulnerabilities · adversarial search queries · harmful content retrieval · black-box attacks

The pith

Web-augmented LLMs can be tricked into citing harmful content via queries that look harmless.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that adding web search to large language models opens a safety gap that standard red-teaming for standalone models does not cover. It builds CREST-Search around three attack strategies that craft ordinary-looking search queries meant to retrieve unsafe web pages and then cite them in the model's reply. The approach adds an iterative refinement step that works when the attacker cannot see inside the retrieval or safety systems. A reader should care because many deployed systems now rely on live search for up-to-date answers, so any weakness in the retrieval step can reach users even when the model itself refuses to generate harm directly.

Core claim

The central claim is that three novel attack strategies, paired with iterative in-context refinement, can produce search queries that stay effective against black-box web-augmented LLMs, bypass their safety filters, and cause the models to cite harmful or low-credibility web content.

What carries the argument

CREST-Search framework built on three attack strategies that turn benign search queries into vectors for unsafe web citations, plus a WebSearch-Harm dataset used to fine-tune a red-teaming model.

If this is right

Safety design for web-augmented LLMs must cover the full search-and-citation workflow rather than generation alone.
A dedicated harmful search dataset improves the quality of queries that surface vulnerabilities.
Current filters built for standalone models leave measurable gaps when live web results are involved.
Systematic red-teaming can map out which parts of the retrieval pipeline are easiest to exploit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future safety benchmarks for LLMs should include search-augmented test cases by default.
The same query-generation idea could be tested on other external tools such as code interpreters or database queries.
Companies might add separate credibility scoring on retrieved pages before they are shown to the user.
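
The last extension above, scoring retrieved pages for credibility before they reach the user, can be illustrated with a small sketch. Everything here is hypothetical: the `RetrievedPage` structure, the `filter_citations` helper, the threshold, and the toy allowlist scorer are assumptions made for illustration, not components described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class RetrievedPage:
    """A search result the model might cite (hypothetical structure)."""
    url: str
    text: str


def filter_citations(
    pages: List[RetrievedPage],
    score: Callable[[RetrievedPage], float],
    threshold: float = 0.5,
) -> List[RetrievedPage]:
    """Return only the pages whose credibility score is at least `threshold`."""
    kept = []
    for page in pages:
        value = score(page)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score out of range for {page.url}: {value}")
        if value >= threshold:
            kept.append(page)
    return kept


# Toy scorer for illustration only: a fixed host allowlist stands in for
# a real credibility model, which the paper does not specify.
TRUSTED_HOSTS = {"trusted.example"}


def allowlist_score(page: RetrievedPage) -> float:
    host = urlparse(page.url).hostname or ""
    return 1.0 if host in TRUSTED_HOSTS else 0.0
```

Passing the scorer in as a callable keeps the gate independent of how credibility is estimated, so a learned classifier could replace the allowlist without changing the filtering step.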

Load-bearing premise

The three attack strategies will keep working against real deployed web-augmented systems even though the researchers have no direct access to the retrieval or safety components.

What would settle it

Take the queries produced by CREST-Search and submit them to commercial web-augmented LLMs; record whether the models return or cite harmful content that their normal safety filters are supposed to block.
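
A minimal sketch of how the outcomes of such a test could be tallied, assuming each trial is logged with the URLs the model cited and the subset an independent reviewer flagged as harmful. The `TrialRecord` fields and the rate definition are assumptions for illustration; the paper's own metric definitions may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TrialRecord:
    """One submitted query and what came back (hypothetical log format)."""
    query_id: str
    refused: bool                       # the model declined to answer
    cited_urls: Tuple[str, ...] = ()    # URLs cited in the reply
    flagged_urls: Tuple[str, ...] = ()  # URLs a reviewer judged harmful


def unsafe_citation_rate(records: Iterable[TrialRecord]) -> float:
    """Fraction of trials in which the model cited at least one flagged URL."""
    records = list(records)
    if not records:
        raise ValueError("no trial records")
    hits = 0
    for record in records:
        # Only count flagged URLs that the model actually cited.
        cited_and_flagged = set(record.cited_urls) & set(record.flagged_urls)
        if not record.refused and cited_and_flagged:
            hits += 1
    return hits / len(records)
```

Keeping refusals in the denominator means the rate reflects all submitted queries, not only the ones the model chose to answer.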

Figures

Figures reproduced from arXiv: 2510.09689 by Gelei Deng, Han Qiu, Haoran Ou, Jie Zhang, Kangjie Chen, Tianwei Zhang, Xingshuo Han.

**Figure 1.** Overview of CREST-Search, consisting of three main phases. (1) Adversarial search queries generation: it generates the adversarial search queries based on various harmful content categories and strategies; (2) Web search execution and risk evaluation: it executes the search query and evaluates the toxicity of the cited webpages; (3) Adversarial search queries refinement: it optimizes the query based on the…

**Figure 2.** Detailed risks analysis across baseline models.

**Figure 3.** Transferability of CREST-Search across various victim models.

**Figure 4.** The impact of refinement rounds on risk detection rate (a), optimization cost (b), and optimization time (c) by five…

**Figure 5.** Detection rates for five harmful-content categories.
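
The three phases named in the Figure 1 caption (generate, execute and evaluate, refine) suggest a simple control loop. The sketch below is only a skeleton under stated assumptions: `run_search`, `score_risk`, and `refine` are hypothetical callables supplied by the evaluator, the stopping threshold and round limit are invented defaults, and nothing here reproduces the paper's strategies or prompts.

```python
from typing import Callable, List, Tuple


def red_team_loop(
    seed_query: str,
    run_search: Callable[[str], List[str]],
    score_risk: Callable[[str], float],
    refine: Callable[[str, List[Tuple[str, float]]], str],
    max_rounds: int = 3,
    risk_threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Run up to `max_rounds` of search, evaluation, and refinement.

    Returns the (query, risk) pair recorded in each round.
    """
    query = seed_query  # phase 1 output: a candidate query from the generator
    history: List[Tuple[str, float]] = []
    for _ in range(max_rounds):
        cited_pages = run_search(query)  # phase 2a: black-box call to the system under test
        # Phase 2b: the round's risk is the highest score among cited pages.
        risk = max((score_risk(page) for page in cited_pages), default=0.0)
        history.append((query, risk))
        if risk >= risk_threshold:
            break  # a risky citation was found; stop refining
        query = refine(query, history)  # phase 3: in-context refinement from feedback
    return history
```

Because the loop sees only queries and cited pages, it matches the black-box setting the paper describes: no access to the retrieval or safety components is assumed.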

Original abstract

Large Language Models (LLMs) have been augmented with web search to overcome the limitations of the static knowledge boundary by accessing up-to-date information from the open Internet. While this integration enhances model capability, it also introduces a distinct safety threat surface: the retrieval and citation process has the potential risk of exposing users to harmful or low-credibility web content. Existing red-teaming methods are largely designed for standalone LLMs as they primarily focus on unsafe generation, ignoring risks emerging from the complex search workflow. To address this gap, we propose CREST-Search, a pioneering red-teaming framework for LLMs with web search. The cornerstone of CREST-Search is three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations. It also employs an iterative in-context refinement mechanism to strengthen adversarial effectiveness under black-box constraints. In addition, we construct a search-specific harmful dataset, WebSearch-Harm, which enables fine-tuning a specialized red-teaming model to improve query quality. Our experiments demonstrate that CREST-Search can effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems, underscoring the necessity of the development of robust search models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper identifies a gap in red-teaming for web-augmented LLMs with CREST-Search but the evaluation details are too sparse to assess the claims fully.

This paper flags an under-studied risk in LLMs that use web search and proposes CREST-Search to exploit it, though the experiments are described at too high a level to judge their strength.

The new part is the focus on the search workflow itself. Existing red-teaming mostly targets direct generation of unsafe text, but here the attacks aim to produce queries that retrieve and cite harmful material. The three strategies for making queries look benign, the iterative refinement loop, and the WebSearch-Harm dataset for fine-tuning all seem like fresh pieces aimed at this gap. They handle the black-box constraint reasonably by not assuming access to the models. That matches how these systems are deployed in practice.

The main weakness is the lack of detail on results. The abstract states that the approach bypasses safety filters effectively, but there are no numbers on success rates, no baseline comparisons, and no explanation of how they verified unsafe citations. The concern about whether these were tested on actual production systems or on open proxies is important to resolve, because proxy results could be easier to achieve.

This work is for researchers and engineers working on AI safety for retrieval-augmented generation. A reader who wants to explore attack surfaces in search-enabled LLMs would find the strategies worth trying out. I would send this to peer review. The topic matters for current systems, and getting referee input on the methods and evaluation would strengthen it.

Referee Report

2 major / 2 minor

Summary. The paper introduces CREST-Search, a red-teaming framework for web-augmented LLMs. It proposes three novel attack strategies that craft seemingly benign search queries to induce unsafe citations, an iterative in-context refinement mechanism to improve effectiveness under black-box constraints, and the WebSearch-Harm dataset to fine-tune a specialized red-teaming model. Experiments are presented to demonstrate that the framework bypasses safety filters and exposes vulnerabilities in web search-based LLM systems.

Significance. If the empirical results are robust, the work is significant because it addresses a previously underexplored attack surface arising from the retrieval and citation workflow in web-augmented LLMs, rather than focusing solely on direct generation. The black-box setting, new attack strategies, iterative refinement, and dedicated dataset provide concrete tools that could help developers of production search-augmented systems improve safety; the emphasis on falsifiable red-teaming outcomes is a strength.

major comments (2)

[§5] Experimental Setup: The central claim that CREST-Search can 'effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems' rests on black-box experiments, yet the manuscript does not clarify whether evaluations used live commercial APIs with proprietary retrieval and post-retrieval filtering or open-source proxies/simulations. This distinction is load-bearing; success rates observed on proxies may not transfer to production pipelines where ranking and safety mechanisms are unknown and potentially stronger.
[§4.2 and Table 2] The three attack strategies and iterative refinement are presented as novel, but the paper provides limited ablation showing their individual contributions versus a simple baseline of direct harmful queries or existing red-teaming methods adapted to search. Without these controls, it is difficult to establish that the reported bypass rates are attributable to the proposed components rather than the underlying LLM's weaknesses.
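
One way to make the ablation comparison requested in the second major comment concrete is to report each configuration's bypass rate with an uncertainty interval, so that differences between components can be judged against sampling noise. The sketch below uses the standard Wilson score interval; the `compare_configs` helper and its input format are assumptions for illustration and do not come from the paper.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def compare_configs(counts: Dict[str, Tuple[int, int]]) -> Dict[str, Tuple[float, float, float]]:
    """Map each configuration name to (rate, lower bound, upper bound).

    `counts` maps a configuration name to (successful bypasses, total queries).
    """
    table = {}
    for name, (successes, trials) in counts.items():
        low, high = wilson_interval(successes, trials)
        table[name] = (successes / trials, low, high)
    return table
```

If the intervals for a direct-query baseline and for the full framework overlap heavily, the data would not support attributing the bypass rate to the proposed components, which is the concern the comment raises.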

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., bypass rate or comparison to baseline) to support the claim of effectiveness.
[§3.3] Notation for the iterative refinement loop (e.g., how many iterations and the exact in-context prompt template) should be formalized in a figure or algorithm box for reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the recognition of the work's potential significance in highlighting an underexplored attack surface in web-augmented LLMs. We respond to each major comment below and outline the revisions we will make to address the concerns.

Referee: [§5] Experimental Setup: The central claim that CREST-Search can 'effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems' rests on black-box experiments, yet the manuscript does not clarify whether evaluations used live commercial APIs with proprietary retrieval and post-retrieval filtering or open-source proxies/simulations. This distinction is load-bearing; success rates observed on proxies may not transfer to production pipelines where ranking and safety mechanisms are unknown and potentially stronger.

Authors: We agree that explicitly distinguishing between live commercial APIs and open-source proxies is necessary for evaluating the robustness and transferability of our results. The current §5 description does not provide sufficient detail on this point. We will revise the experimental setup section to clearly specify the exact APIs, retrieval systems, and any filtering mechanisms employed in each set of experiments, along with a discussion of how these choices relate to production environments. Revision: yes.

Referee: [§4.2 and Table 2] The three attack strategies and iterative refinement are presented as novel, but the paper provides limited ablation showing their individual contributions versus a simple baseline of direct harmful queries or existing red-teaming methods adapted to search. Without these controls, it is difficult to establish that the reported bypass rates are attributable to the proposed components rather than the underlying LLM's weaknesses.

Authors: We acknowledge that stronger ablations would better isolate the contributions of the proposed attack strategies and iterative refinement. While direct harmful queries are often filtered prior to retrieval (making them less relevant as a baseline for search-specific attacks) and existing red-teaming approaches target generation rather than query crafting, we agree that additional controls would strengthen the claims. We will add new ablation experiments comparing against adapted baselines and expand Table 2 to report the individual and combined effects of each component. Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical red-teaming framework with independent experimental validation

The paper presents CREST-Search as an empirical red-teaming approach consisting of three attack strategies, iterative refinement, and the WebSearch-Harm dataset for fine-tuning. No equations, derivations, or load-bearing self-citations are present that would reduce any claimed result to a fitted parameter or prior input by construction. The central claims rest on experimental outcomes in a stated black-box setting, which are externally testable against deployed systems and do not rely on self-referential definitions or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the premise that web retrieval introduces a distinct safety threat surface not covered by existing LLM red-teaming. No free parameters are visible in the abstract. One domain assumption is invoked: that 'harmful or low-credibility web content' can be reliably identified for dataset construction and evaluation.

axioms (1)

Domain assumption: Web search integration in LLMs creates a distinct safety threat surface from retrieval and citation of harmful content that existing red-teaming methods do not address.
Stated directly in the abstract as the motivation for the new framework.

invented entities (2)

CREST-Search framework (no independent evidence)
Purpose: Red-teaming web-augmented LLMs via benign-looking queries that induce unsafe citations.
Introduced as the main contribution; no independent evidence outside the paper is provided in the abstract.

WebSearch-Harm dataset (no independent evidence)
Purpose: Enables fine-tuning a specialized red-teaming model.
Constructed for this work; no external validation or public release details given in the abstract.

pith-pipeline@v0.9.0 · 5753 in / 1609 out tokens · 34395 ms · 2026-05-18T09:26:41.748343+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations... iterative in-context refinement mechanism"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Risk detection rate... citation risk... combined risk"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
cs.LG 2026-05 unverdicted novelty 6.0

NodeSynth generates evidence-anchored synthetic queries that trigger up to five times higher failure rates in mainstream LLMs than human-authored benchmarks.
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
cs.LG 2026-05 unverdicted novelty 6.0

NodeSynth creates evidence-based synthetic queries via a taxonomy generator to evaluate LLMs, revealing up to 5x higher failure rates than human benchmarks and gaps in guard models.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] Bang An, Shiyue Zhang, and Mark Dredze. 2025. Rag llms are not safer: A safety analysis of retrieval-augmented generation for large language models. arXiv preprint arXiv:2504.18041 (2025)

[2] Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662 (2023)

[3, 4] Nicholas Boucher, Luca Pajola, Ilia Shumailov, Ross Anderson, and Mauro Conti. Boosting big brother: Attacking search engines with encodings. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses. 700–713

[5] Kangjie Chen, Li Muyang, Guanlin Li, Shudong Zhang, Shangwei Guo, and Tianwei Zhang. [n. d.]. TRUST-VLM: Thorough Red-Teaming for Uncovering Safety Threats in Vision-Language Models. In Forty-second International Conference on Machine Learning

[6] Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34. 24066–24080

[7] Google DeepMind. 2024. Gemini: A family of multimodal models. https://deepmind.google/technologies/gemini/

[8] Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2023. How abilities in large language models are affected by supervised fine-tuning data composition. arXiv preprint arXiv:2310.05492 (2023)

[9] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234 (2022)

[10] Abhimanyu Dubey et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024)

[11] Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, et al. 2024. Llama guard 3-1b-int4: Compact and efficient safeguard for human-ai conversations. arXiv preprint arXiv:2411.17713 (2024)

[12] Michael Feffer, Anusha Sinha, Wesley H Deng, Zachary C Lipton, and Hoda Heidari. 2024. Red-teaming for generative AI: Silver bullet or security theater? In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 7. 421–437

[13] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

[14] Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. 2023. Mart: Improving llm safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689 (2023)

[15] Google. 2025. Grounding with Google Search. https://ai.google.dev/gemini-api/docs/google-search

[16] Ken Hillis, Michael Petit, and Kylie Jarrett. 2012. Google and the Culture of Search. Routledge

[17] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674 (2023)

[18] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, et al. 2023. Survey of hallucination in natural language generation. Comput. Surveys (2023)

[19] Matthew Joslin, Neng Li, Shuang Hao, Minhui Xue, and Haojin Zhu. 2019. Measuring and analyzing search engine poisoning of linguistic collisions. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 1311–1325

[20] Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. 2023. Propile: Probing privacy leakage in large language models. Advances in Neural Information Processing Systems 36 (2023), 20750–20762

[21] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. 2023. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023)

[22] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. 2019. Natural Questions: a Benchmark for Question Answering Research. In Transactions of the Association for Computational Linguistics (TACL), Vol. 7. 453–466

[23] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 9459–9474

[24] Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, and Tianwei Zhang. 2024. ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users. arXiv:2405.19360 [cs.CR] https://arxiv.org/abs/2405.19360

[25] Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, et al. 2024. A safe harbor for ai evaluation and red teaming. arXiv preprint arXiv:2403.04893 (2024)

[26] Zeren Luo, Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Jingyi Zheng, and Xinlei He. 2025. Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search. arXiv preprint arXiv:2502.04951 (2025)

[27] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024)

[28] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34. 5662–5674

[29] OpenAI. 2024. GPT-4o System Card. https://openai.com/research/gpt-4o-system-card

[30] OpenAI. 2024. Moderation. https://platform.openai.com/docs/guides/moderation

[31] OpenAI. 2025. GPT-4o Search Preview. https://platform.openai.com/docs/models/gpt-4o-search-preview

[32] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computa... doi:10.18653/v1/2022.emnlp-main.225

[33] Chris Rowlands. 2025. Goodbye Google? People are increasingly switching to the likes of ChatGPT, according to major survey–here’s why. Techradar, https://www.techradar.com/tech/people-are-increasingly-swapping-google-for-the-likesof-chatgpt-according-to-a-major-survey-heres-why (2025)

[34] Nathalie A Smuha. 2025. Regulation 2024/1689 of the Eur. Parl. & Council of June 13, 2024 (EU Artificial Intelligence Act). International Legal Materials (2025), 1–148

[35] Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next-Generation Agentic Capabilities. Technical Report. Google DeepMind

[36] Anuraj J Thirunavukarasu et al. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940

[37] GOV UK. 2023. Safety and security risks of generative artificial intelligence to 2025 (annex b). GOV.UK, Nov (2023)

[38] Y Wang et al. 2023. AI for education: Opportunities and challenges. Computers and Education: Artificial Intelligence 5 (2023), 100128

[39] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur) 53, 3 (2020), 1–34

[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35. 24824–24837

[41] Nursel Yalçın and Utku Köse. 2010. What is search engine optimization: SEO? Procedia-Social and Behavioral Sciences 9 (2010), 487–493

[42] Shuo Yang, Shiyu Wu, Jiajun Chen, et al. 2024. Large Language Models in Finance: Applications, Risks, and Opportunities. arXiv preprint arXiv:2402.06196 (2024)

[43] Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, and Li Yuan. 2023. Llm lies: Hallucinations are not bugs, but features as adversarial examples. arXiv preprint arXiv:2310.01469 (2023)

[44] Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 (2024)

[45] Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st international ACM SIGIR conference on research & development in information retrieval. 1097–1100

[46] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

[1] [1]

Bang An, Shiyue Zhang, and Mark Dredze. 2025. Rag llms are not safer: A safety analysis of retrieval-augmented generation for large language models.arXiv preprint arXiv:2504.18041(2025)

work page arXiv 2025

[2] [2]

Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment.arXiv preprint arXiv:2308.09662 (2023)

work page arXiv 2023

[3] [3]

Nicholas Boucher, Luca Pajola, Ilia Shumailov, Ross Anderson, and Mauro Conti

work page

[4] [4]

InProceed- ings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses

Boosting big brother: Attacking search engines with encodings. InProceed- ings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses. 700–713

work page

[5] [5]

Kangjie Chen, Li Muyang, Guanlin Li, Shudong Zhang, Shangwei Guo, and Tian- wei Zhang. [n. d.]. TRUST-VLM: Thorough Red-Teaming for Uncovering Safety Threats in Vision-Language Models. InForty-second International Conference on Machine Learning

work page

[6] [6]

Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 34. 24066–24080

work page 2021

[7] [7]

Google DeepMind. 2024. Gemini: A family of multimodal models. https:// deepmind.google/technologies/gemini/

work page 2024

[8] [8]

Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayi- heng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2023. How abilities in large language models are affected by supervised fine-tuning data composition.arXiv preprint arXiv:2310.05492(2023)

work page arXiv 2023

[9] [9]

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [10]

Abhimanyu Dubey et al . 2024. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Igor Fedorov, Kate Plawiak, Lemeng Wu, Tarek Elgamal, Naveen Suda, Eric Smith, Hongyuan Zhan, Jianfeng Chi, Yuriy Hulovatyy, Kimish Patel, et al. 2024. Llama guard 3-1b-int4: Compact and efficient safeguard for human-ai conversations. arXiv preprint arXiv:2411.17713(2024)

work page arXiv 2024

[12] [12]

Michael Feffer, Anusha Sinha, Wesley H Deng, Zachary C Lipton, and Hoda Heidari. 2024. Red-teaming for generative AI: Silver bullet or security theater?. InProceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 7. 421–437

work page 2024

[13] [13]

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Joh...

[14]

Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. 2023. Mart: Improving llm safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689 (2023).

[15]

Google. 2025. Grounding with Google Search. https://ai.google.dev/gemini-api/docs/google-search

[16]

Ken Hillis, Michael Petit, and Kylie Jarrett. 2012. Google and the Culture of Search. Routledge.

[17]

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674 (2023).

[18]

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, et al. 2023. Survey of hallucination in natural language generation. Comput. Surveys (2023).

[19]

Matthew Joslin, Neng Li, Shuang Hao, Minhui Xue, and Haojin Zhu. 2019. Measuring and analyzing search engine poisoning of linguistic collisions. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 1311–1325.

[20]

Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. 2023. Propile: Probing privacy leakage in large language models. Advances in Neural Information Processing Systems 36 (2023), 20750–20762.

[21]

Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. 2023. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705 (2023).

[22]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. 2019. Natural Questions: a Benchmark for Question Answering Research. In Transactions of the Association for Computational Linguistics (TACL), Vol. 7. 453–466.

[23]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 9459–9474.

[24]

Guanlin Li, Kangjie Chen, Shudong Zhang, Jie Zhang, and Tianwei Zhang. 2024. ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users. arXiv:2405.19360 [cs.CR] https://arxiv.org/abs/2405.19360

[25]

Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, et al. 2024. A safe harbor for ai evaluation and red teaming. arXiv preprint arXiv:2403.04893 (2024).

[26]

Zeren Luo, Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Jingyi Zheng, and Xinlei He. 2025. Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search. arXiv preprint arXiv:2502.04951 (2025).

[27]

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249 (2024).

[28]

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34. 5662–5674.

[29]

OpenAI. 2024. GPT-4o System Card. https://openai.com/research/gpt-4o-system-card

[30]

OpenAI. 2024. Moderation. https://platform.openai.com/docs/guides/moderation

[31]

OpenAI. 2025. GPT-4o Search Preview. https://platform.openai.com/docs/models/gpt-4o-search-preview

[32]

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red Teaming Language Models with Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computa...

doi:10.18653/v1/2022.emnlp-main.225

[33]

Chris Rowlands. 2025. Goodbye Google? People are increasingly switching to the likes of ChatGPT, according to major survey–here’s why. Techradar, https://www.techradar.com/tech/people-are-increasingly-swapping-google-for-the-likesof-chatgpt-according-to-a-major-survey-heres-why (2025).

[34]

Nathalie A Smuha. 2025. Regulation 2024/1689 of the Eur. Parl. & Council of June 13, 2024 (EU Artificial Intelligence Act). International Legal Materials (2025), 1–148.

[35]

Gemini Team. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next-Generation Agentic Capabilities. Technical Report. Google DeepMind.

[36]

Anuraj J Thirunavukarasu et al. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940.

[37]

GOV UK. 2023. Safety and security risks of generative artificial intelligence to 2025 (annex b). GOV.UK, Nov (2023).

[38]

Y Wang et al. 2023. AI for education: Opportunities and challenges. Computers and Education: Artificial Intelligence 5 (2023), 100128.

[39]

Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur) 53, 3 (2020), 1–34.

[40]

Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35. 24824–24837.

[41]

Nursel Yalçın and Utku Köse. 2010. What is search engine optimization: SEO? Procedia-Social and Behavioral Sciences 9 (2010), 487–493.

[42]

Shuo Yang, Shiyu Wu, Jiajun Chen, et al. 2024. Large Language Models in Finance: Applications, Risks, and Opportunities. arXiv preprint arXiv:2402.06196 (2024).

[43]

Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, and Li Yuan. 2023. Llm lies: Hallucinations are not bugs, but features as adversarial examples. arXiv preprint arXiv:2310.01469 (2023).

[44]

Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407.04295 (2024).

[45]

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st international ACM SIGIR conference on research & development in information retrieval. 1097–1100.

[46]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).

<user query 3> Below is the user prompt that specifies the concrete context for generation, including the harmful content category, its description, the selected construction strategy, and a brief explanation of that strategy. Through the crafted prompts, we can ensure that the generated queries are both diverse and targeted, comprehensively uncovering th...
