When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
Pith reviewed 2026-05-18 09:26 UTC · model grok-4.3
The pith
Web-augmented LLMs can be tricked into citing harmful content via queries that look harmless.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that three novel attack strategies, paired with iterative in-context refinement, can produce search queries that stay effective against black-box web-augmented LLMs, bypass their safety filters, and cause the models to cite harmful or low-credibility web content.
What carries the argument
The CREST-Search framework, built on three attack strategies that turn benign-looking search queries into vectors for unsafe web citations, together with the WebSearch-Harm dataset used to fine-tune a red-teaming model.
If this is right
- Safety design for web-augmented LLMs must cover the full search-and-citation workflow rather than generation alone.
- A dedicated harmful search dataset improves the quality of queries that surface vulnerabilities.
- Current filters built for standalone models leave measurable gaps when live web results are involved.
- Systematic red-teaming can map out which parts of the retrieval pipeline are easiest to exploit.
Where Pith is reading between the lines
- Future safety benchmarks for LLMs should include search-augmented test cases by default.
- The same query-generation idea could be tested on other external tools such as code interpreters or database queries.
- Companies might add separate credibility scoring on retrieved pages before they are shown to the user.
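The credibility-scoring idea in the last bullet can be sketched as a filter between retrieval and citation. This is only an illustration: the score table, the default score, the threshold, and the names `RetrievedPage`, `credibility`, and `filter_citations` are assumptions made for this sketch and do not come from the paper or from any deployed system.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical per-domain credibility scores in [0, 1]. A real system would
# need a maintained source of ratings; these values are placeholders.
DOMAIN_SCORES = {
    "example.org": 0.9,
    "example.com": 0.2,
}
DEFAULT_SCORE = 0.5  # assumed score for domains missing from the table


@dataclass
class RetrievedPage:
    url: str
    snippet: str


def credibility(page: RetrievedPage) -> float:
    """Look up the page's host in the illustrative score table."""
    host = urlparse(page.url).hostname or ""
    return DOMAIN_SCORES.get(host, DEFAULT_SCORE)


def filter_citations(pages, threshold=0.5):
    """Keep only pages whose score meets the threshold, preserving order."""
    return [p for p in pages if credibility(p) >= threshold]


pages = [
    RetrievedPage("https://example.org/a", "..."),
    RetrievedPage("https://example.com/b", "..."),
    RetrievedPage("https://unknown.test/c", "..."),
]
kept = filter_citations(pages, threshold=0.5)
```

A domain lookup is the simplest possible scorer. Whether it would stop the attacks described in the paper is an open question, since the queries are designed to surface pages that look harmless to upstream filters.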
Load-bearing premise
The three attack strategies will keep working against real deployed web-augmented systems even though the researchers have no direct access to the retrieval or safety components.
What would settle it
Take the queries produced by CREST-Search and submit them to commercial web-augmented LLMs; record whether the models return or cite harmful content that their normal safety filters are supposed to block.
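The bookkeeping for such a test could look like the sketch below. It assumes that each query has already been submitted to a system and that a human or a separate classifier has judged whether the response cited harmful content. The `Trial` record, the system labels, and the sample entries are placeholders for illustration, not data from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    system: str          # label for the system under test (hypothetical)
    query_id: str        # identifier of the submitted query
    cited_harmful: bool  # judgment made outside this script


def harmful_citation_rate(trials):
    """Return {system: fraction of trials that cited harmful content}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t in trials:
        totals[t.system] += 1
        hits[t.system] += int(t.cited_harmful)
    return {s: hits[s] / totals[s] for s in totals}


# Placeholder records, not experimental results.
trials = [
    Trial("system-A", "q1", True),
    Trial("system-A", "q2", False),
    Trial("system-B", "q1", False),
]
rates = harmful_citation_rate(trials)
```

Keeping the judgment step separate from the tally makes it possible to re-score the same responses with a different labeling policy without resubmitting any queries.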
Original abstract
Large Language Models (LLMs) have been augmented with web search to overcome the limitations of the static knowledge boundary by accessing up-to-date information from the open Internet. While this integration enhances model capability, it also introduces a distinct safety threat surface: the retrieval and citation process has the potential risk of exposing users to harmful or low-credibility web content. Existing red-teaming methods are largely designed for standalone LLMs as they primarily focus on unsafe generation, ignoring risks emerging from the complex search workflow. To address this gap, we propose CREST-Search, a pioneering red-teaming framework for LLMs with web search. The cornerstone of CREST-Search is three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations. It also employs an iterative in-context refinement mechanism to strengthen adversarial effectiveness under black-box constraints. In addition, we construct a search-specific harmful dataset, WebSearch-Harm, which enables fine-tuning a specialized red-teaming model to improve query quality. Our experiments demonstrate that CREST-Search can effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems, underscoring the necessity of the development of robust search models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CREST-Search, a red-teaming framework for web-augmented LLMs. It proposes three novel attack strategies that craft seemingly benign search queries to induce unsafe citations, an iterative in-context refinement mechanism to improve effectiveness under black-box constraints, and the WebSearch-Harm dataset to fine-tune a specialized red-teaming model. Experiments are presented to demonstrate that the framework bypasses safety filters and exposes vulnerabilities in web search-based LLM systems.
Significance. If the empirical results are robust, the work is significant because it addresses a previously underexplored attack surface arising from the retrieval and citation workflow in web-augmented LLMs, rather than focusing solely on direct generation. The black-box setting, new attack strategies, iterative refinement, and dedicated dataset provide concrete tools that could help developers of production search-augmented systems improve safety; the emphasis on falsifiable red-teaming outcomes is a strength.
major comments (2)
- [§5] §5 (Experimental Setup): The central claim that CREST-Search 'can effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems' rests on black-box experiments, yet the manuscript does not clarify whether evaluations used live commercial APIs with proprietary retrieval and post-retrieval filtering or open-source proxies/simulations. This distinction is load-bearing; success rates observed on proxies may not transfer to production pipelines where ranking and safety mechanisms are unknown and potentially stronger.
- [§4.2 and Table 2] §4.2 and Table 2: The three attack strategies and iterative refinement are presented as novel, but the paper provides limited ablation showing their individual contributions versus a simple baseline of direct harmful queries or existing red-teaming methods adapted to search. Without these controls, it is difficult to establish that the reported bypass rates are attributable to the proposed components rather than the underlying LLM's weaknesses.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., bypass rate or comparison to baseline) to support the claim of effectiveness.
- [§3.3] Notation for the iterative refinement loop (e.g., how many iterations and the exact in-context prompt template) should be formalized in a figure or algorithm box for reproducibility.
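One generic shape such an algorithm box could take is sketched below. This is not the authors' procedure: the paper's iteration count and prompt template are not given in this review, so `propose`, `score`, `max_iters`, and `target` are hypothetical parameters, and the toy stand-ins exist only to make the loop runnable.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (candidate, score) pairs seen so far


def refine(seed: str,
           propose: Callable[[str, History], str],
           score: Callable[[str], float],
           max_iters: int = 5,
           target: float = 1.0) -> Tuple[str, float, History]:
    """Iteratively propose candidates, keeping the best-scoring one.

    `history` plays the role of the in-context examples that a proposer
    could condition on; `max_iters` and `target` are the stopping rules.
    """
    best, best_score = seed, score(seed)
    history: History = [(best, best_score)]
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = propose(best, history)
        s = score(candidate)
        history.append((candidate, s))
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score, history


# Toy stand-ins so the loop can run end to end; they carry no meaning.
def toy_propose(current: str, history: History) -> str:
    return current + "x"


def toy_score(text: str) -> float:
    return min(len(text) / 5, 1.0)


best, best_score, history = refine("ab", toy_propose, toy_score)
```

Writing the loop this way makes the quantities the referee asks about explicit: the iteration budget, the stopping threshold, and what the proposer sees at each step.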
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the recognition of the work's potential significance in highlighting an underexplored attack surface in web-augmented LLMs. We respond to each major comment below and outline the revisions we will make to address the concerns.
- Referee: [§5] §5 (Experimental Setup): The central claim that CREST-Search 'can effectively bypass safety filters and systematically expose vulnerabilities in web search-based LLM systems' rests on black-box experiments, yet the manuscript does not clarify whether evaluations used live commercial APIs with proprietary retrieval and post-retrieval filtering or open-source proxies/simulations. This distinction is load-bearing; success rates observed on proxies may not transfer to production pipelines where ranking and safety mechanisms are unknown and potentially stronger.
Authors: We agree that explicitly distinguishing between live commercial APIs and open-source proxies is necessary for evaluating the robustness and transferability of our results. The current §5 description does not provide sufficient detail on this point. We will revise the experimental setup section to clearly specify the exact APIs, retrieval systems, and any filtering mechanisms employed in each set of experiments, along with a discussion of how these choices relate to production environments. revision: yes
- Referee: [§4.2 and Table 2] §4.2 and Table 2: The three attack strategies and iterative refinement are presented as novel, but the paper provides limited ablation showing their individual contributions versus a simple baseline of direct harmful queries or existing red-teaming methods adapted to search. Without these controls, it is difficult to establish that the reported bypass rates are attributable to the proposed components rather than the underlying LLM's weaknesses.
Authors: We acknowledge that stronger ablations would better isolate the contributions of the proposed attack strategies and iterative refinement. While direct harmful queries are often filtered prior to retrieval (making them less relevant as a baseline for search-specific attacks) and existing red-teaming approaches target generation rather than query crafting, we agree that additional controls would strengthen the claims. We will add new ablation experiments comparing against adapted baselines and expand Table 2 to report the individual and combined effects of each component. revision: yes
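An expanded ablation of the kind promised here could be organized as a grid of configurations, as in the sketch below. The component labels are placeholders, since this review does not name the three strategies, and `ablation_grid` is a hypothetical helper rather than anything from the paper.

```python
from itertools import combinations

# Placeholder labels; the review does not name the three strategies.
COMPONENTS = ["strategy_1", "strategy_2", "strategy_3", "refinement"]


def ablation_grid(components):
    """List every on/off configuration, from empty baseline to full system."""
    grid = []
    for k in range(len(components) + 1):
        for subset in combinations(components, k):
            grid.append({c: (c in subset) for c in components})
    return grid


grid = ablation_grid(COMPONENTS)
```

Four components give 16 configurations. Some of them may not be meaningful in practice (for example, refinement with no strategy enabled), so a real study would likely prune the grid before running it.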
Circularity Check
No circularity: empirical red-teaming framework with independent experimental validation
The paper presents CREST-Search as an empirical red-teaming approach consisting of three attack strategies, iterative refinement, and the WebSearch-Harm dataset for fine-tuning. No equations, derivations, or load-bearing self-citations are present that would reduce any claimed result to a fitted parameter or prior input by construction. The central claims rest on experimental outcomes in a stated black-box setting, which are externally testable against deployed systems and do not rely on self-referential definitions or renamings of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Web search integration in LLMs creates a distinct safety threat surface from retrieval and citation of harmful content that existing red-teaming methods do not address.
invented entities (2)
- CREST-Search framework: no independent evidence
- WebSearch-Harm dataset: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "three novel attack strategies that generate seemingly benign search queries yet induce unsafe citations... iterative in-context refinement mechanism"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Risk detection rate... citation risk... combined risk"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
  NodeSynth generates evidence-anchored synthetic queries that trigger up to five times higher failure rates in mainstream LLMs than human-authored benchmarks.
- NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
  NodeSynth creates evidence-based synthetic queries via a taxonomy generator to evaluate LLMs, revealing up to 5x higher failure rates than human benchmarks and gaps in guard models.