hub Mixed citations

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si · 2024 · cs.CL · arXiv 2406.06608

Mixed citation behavior. The most common role is background (60% of classified citations).

44 Pith papers cite this work.

abstract

Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 vocabulary terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 3 · support: 1 · use method: 1

representative citing papers

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

cs.CR · 2026-06-16 · unverdicted · novelty 7.0

Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

AtelierEval is the first unified benchmark that quantifies prompting proficiency of humans and MLLMs across 360 tasks using a cognitive taxonomy, with AtelierJudge providing scalable evaluation that correlates 0.79 with experts and shows mimicry outperforming planning.

TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.

Can Vision Language Models Judge Action Quality? An Empirical Evaluation

cs.CV · 2026-04-09 · conditional · novelty 7.0

Vision-language models perform only marginally above random on action quality assessment and retain systematic biases even after targeted prompting and contrastive reformulation.

The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies

cs.CL · 2025-09-22 · conditional · novelty 7.0

A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.

PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

cs.CR · 2025-09-03 · unverdicted · novelty 7.0

PromptCOS is a content-only watermarking method for LLM system prompts that embeds detectable cyclic signals via auxiliary tokens while preserving fidelity and resisting removal attacks.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

cs.CV · 2025-01-07 · unverdicted · novelty 7.0

PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.

Automated Design of Agentic Systems

cs.AI · 2024-08-15 · conditional · novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

RECOM dataset shows automatic metrics for open-ended Reddit QA exhibit a validity-discrimination tradeoff, with cosine similarity strong on validity but weak on model ranking, and BERTScore showing the reverse pattern after length control.

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Persona prompting trades expertise depth for reduced clarity in LLM answers and works best on advisory questions in medicine and psychology.

Analogies between Transformer Layers and Power Method

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

Transformer layers are analogous to power method steps, tilting tokens toward the principal eigenvector of the output-value weight product, with stronger analytical and empirical alignment in shared-weight models and a proposed steering method.

Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction

cs.HC · 2026-05-24 · unverdicted · novelty 6.0

Intent Signal Theory formalizes four distinct intent-related objects in human-AI interaction, introduces a theorem on irreversible private intent loss, and reports supporting patterns from studies across LLMs, languages, and tasks.

Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

cs.IR · 2026-05-18 · unverdicted · novelty 6.0

Introduces TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP) as training-free structured prompting frameworks that improve LLM performance on table question answering over baselines on TableBench and achieve SOTA on FeTaQA.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.

Alignment has a Fantasia Problem

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

cs.CR · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

Arbiter-K is a governance-first architecture that turns probabilistic agent reasoning into discrete instructions with runtime taint propagation to block unsafe actions, reporting 76-95% interception rates and a 92.79% gain over baseline policies on two test systems.

LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

cs.SE · 2026-04-12 · unverdicted · novelty 6.0

LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.

LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

cs.CL · 2025-10-14 · unverdicted · novelty 6.0

Prompt Duel Optimizer uses dueling bandits and LLM-as-judge pairwise feedback with Double Thompson Sampling and top-performer mutation to find stronger prompts than label-free baselines on BBH and MS MARCO under limited comparison budgets.

Evidence-Supported Credit Risk Report Generation Using News-Centric Financial Knowledge Graphs

cs.CL · 2026-07-01 · unverdicted · novelty 5.0

FinKG-News constructs news-centric financial knowledge graphs to support in-context learning for credit risk report generation across three dimensions, claiming 19-34% quality gains and fewer hallucinations than baselines.

A Taxonomy of Single-Turn Textual Prompt Patterns

cs.SE · 2026-06-29 · unverdicted · novelty 5.0

A taxonomy that consolidates prompt patterns from prior surveys into 30 unique canonical forms organized by two dimensions.

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.

citing papers explorer

Showing 44 of 44 citing papers.

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks cs.CR · 2026-06-16 · unverdicted · none · ref 11 · internal anchor
Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.
Self-Harness: Harnesses That Improve Themselves cs.CL · 2026-06-08 · unverdicted · none · ref 21 · internal anchor
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters cs.AI · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
AtelierEval is the first unified benchmark that quantifies prompting proficiency of humans and MLLMs across 360 tasks using a cognitive taxonomy, with AtelierJudge providing scalable evaluation that correlates 0.79 with experts and shows mimicry outperforming planning.
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data cs.AI · 2026-04-30 · unverdicted · none · ref 38 · internal anchor
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
Can Vision Language Models Judge Action Quality? An Empirical Evaluation cs.CV · 2026-04-09 · conditional · none · ref 27 · internal anchor
Vision-language models perform only marginally above random on action quality assessment and retain systematic biases even after targeted prompting and contrastive reformulation.
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies cs.CL · 2025-09-22 · conditional · none · ref 32 · internal anchor
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs cs.CR · 2025-09-03 · unverdicted · none · ref 40 · internal anchor
PromptCOS is a content-only watermarking method for LLM system prompts that embeds detectable cyclic signals via auxiliary tokens while preserving fidelity and resisting removal attacks.
PRIMETIME : Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 81 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models cs.CV · 2025-01-07 · unverdicted · none · ref 29 · internal anchor
PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 206 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering cs.CL · 2026-06-17 · unverdicted · none · ref 43 · internal anchor
RECOM dataset shows automatic metrics for open-ended Reddit QA exhibit a validity-discrimination tradeoff, with cosine similarity strong on validity but weak on model ranking, and BERTScore showing the reverse pattern after length control.
When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs cs.AI · 2026-05-28 · unverdicted · none · ref 5 · internal anchor
Persona prompting trades expertise depth for reduced clarity in LLM answers and works best on advisory questions in medicine and psychology.
Analogies between Transformer Layers and Power Method cs.LG · 2026-05-25 · unverdicted · none · ref 29 · internal anchor
Transformer layers are analogous to power method steps, tilting tokens toward the principal eigenvector of the output-value weight product, with stronger analytical and empirical alignment in shared-weight models and a proposed steering method.
Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction cs.HC · 2026-05-24 · unverdicted · none · ref 4 · internal anchor
Intent Signal Theory formalizes four distinct intent-related objects in human-AI interaction, introduces a theorem on irreversible private intent loss, and reports supporting patterns from studies across LLMs, languages, and tasks.
Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting cs.IR · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
Introduces TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP) as training-free structured prompting frameworks that improve LLM performance on table question answering over baselines on TableBench and achieve SOTA on FeTaQA.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits cs.LG · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.
Alignment has a Fantasia Problem cs.AI · 2026-04-23 · unverdicted · none · ref 51 · internal anchor
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers cs.CR · 2026-04-20 · unverdicted · none · ref 18 · 2 links · internal anchor
Arbiter-K is a governance-first architecture that turns probabilistic agent reasoning into discrete instructions with runtime taint propagation to block unsafe actions, reporting 76-95% interception rates and a 92.79% gain over baseline policies on two test systems.
LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments cs.SE · 2026-04-12 · unverdicted · none · ref 49 · internal anchor
LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.
LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization cs.CL · 2025-10-14 · unverdicted · none · ref 2 · internal anchor
Prompt Duel Optimizer uses dueling bandits and LLM-as-judge pairwise feedback with Double Thompson Sampling and top-performer mutation to find stronger prompts than label-free baselines on BBH and MS MARCO under limited comparison budgets.
Evidence-Supported Credit Risk Report Generation Using News-Centric Financial Knowledge Graphs cs.CL · 2026-07-01 · unverdicted · none · ref 19 · internal anchor
FinKG-News constructs news-centric financial knowledge graphs to support in-context learning for credit risk report generation across three dimensions, claiming 19-34% quality gains and fewer hallucinations than baselines.
A Taxonomy of Single-Turn Textual Prompt Patterns cs.SE · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
A taxonomy that consolidates prompt patterns from prior surveys into 30 unique canonical forms organized by two dimensions.
It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers cs.AI · 2026-05-26 · unverdicted · none · ref 6 · internal anchor
A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models cs.SE · 2026-05-12 · conditional · none · ref 17 · internal anchor
LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation cs.AI · 2026-05-11 · unverdicted · none · ref 27 · internal anchor
LLARS is a new integrated platform that combines collaborative prompt authoring, cost-controlled batch generation, and hybrid evaluation to help domain experts and developers jointly build and assess LLM systems.
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning cs.AI · 2026-05-04 · unverdicted · none · ref 102 · internal anchor
U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.
Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors cs.MM · 2026-04-23 · unverdicted · none · ref 24 · internal anchor
Eye movements during Holocaust survivor interviews vary by episodic, semantic, affective and temporal memory dimensions, with pre-onset gaze sufficient to predict sentence temporal context.
OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting cs.HC · 2026-04-21 · unverdicted · none · ref 39 · internal anchor
OOPrompt reifies user intents into structured manipulable artifacts to enable modular and iterative prompting in LLM-based interactive systems.
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis cs.AI · 2026-04-12 · unverdicted · none · ref 20 · internal anchor
Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
Confidence Without Competence in AI-Assisted Knowledge Work cs.HC · 2026-04-10 · unverdicted · none · ref 61 · internal anchor
Standard LLM chats produce high perceived understanding but low objective learning in students, while future-self explanations best align confidence with actual gains and guided hints maximize learning with moderate workload.
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure cs.CL · 2026-04-03 · accept · none · ref 3 · internal anchor
PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.
Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation cs.CL · 2026-03-28 · unverdicted · none · ref 15 · internal anchor
SDSR places human metadata at file primacy and combines it with prompt routing rules to reach 100% primary category accuracy on a 119-category benchmark, far above the 65% no-guidance baseline.
Teaching Astronomy with Large Language Models physics.ed-ph · 2025-06-07 · unverdicted · none · ref 51 · internal anchor
Structured integration of LLMs in astronomy education, including a domain-specific tutor and documentation requirements, leads to improved AI literacy and reduced student reliance on AI over the semester.
Comparing BERT Sentence-Pair Classification and Few-Shot LLM Prompting for Detecting Threat and Solution Framing in German Climate News cs.CL · 2026-06-25 · unverdicted · none · ref 32 · internal anchor
Fine-tuned BERT sentence-pair classifiers reach F1 0.83 while few-shot LLM prompting reaches F1 0.78 on threat and solution framing detection in 440 manually coded German climate news articles.
Characterizing Students' LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks cs.HC · 2026-05-06 · unverdicted · none · ref 27 · 2 links · internal anchor
Refined bottom-up categorization of LLM usage types in critical thinking homework, labeled by student initiative, shows associations with midterm performance across two course offerings.
Hint-Writing with Deferred AI Assistance: Fostering Critical Engagement in Data Science Education cs.HC · 2026-04-21 · unverdicted · none · ref 53 · internal anchor
In a randomized experiment with 97 graduate students, deferred AI assistance produced the highest-quality hints and helped students spot more code mistakes than independent writing or immediate AI help.
ClinQueryAgent: A Conversational Agent for Population Health Management cs.IR · 2026-04-13 · unverdicted · none · ref 248 · internal anchor
The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 staff across 15 NHS practices covering 148,319 patients.
Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study cs.SE · 2026-05-08 · unverdicted · none · ref 11 · internal anchor
Multi-shot prompting raises agreement with humans for Claude Haiku but not DeepSeek-Chat or Gemini 2.5 Flash, with models showing different stability and a consistent bias toward over-labeling negative feedback.
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse cs.CL · 2026-05-04 · unverdicted · none · ref 25 · internal anchor
An LLM ensemble reached 80 macro-F1 on 3-class clarity detection and 59 on 9-class evasion detection, with partial layer unfreezing and multilingual ensembles improving encoder results while enriched context helped only LLMs.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation cs.IR · 2026-04-21 · unverdicted · none · ref 40 · internal anchor
MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences cs.CL · 2025-09-14 · unverdicted · none · ref 107 · 2 links · internal anchor
The paper reduces a broad set of prompt engineering techniques to six core approaches and applies them to life sciences use cases while addressing common LLM pitfalls.
LLMs in Qualitative Research: Opportunities, Limitations, and Practical Considerations cs.HC · 2026-05-15 · unverdicted · none · ref 60 · internal anchor
The paper outlines opportunities, limitations, and practical parameters for integrating LLMs into qualitative research while aligning with epistemological commitments like reflexivity and interpretive judgment.
MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025) cs.CL · 2025-09-11 · unreviewed · ref 45 · internal anchor
