Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
Mixed citation behavior. Most common role is background (60%).
abstract
Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.
representative citing papers
AtelierEval is the first unified benchmark that quantifies prompting proficiency of humans and MLLMs across 360 tasks using a cognitive taxonomy, with AtelierJudge providing scalable evaluation that correlates 0.79 with experts and shows mimicry outperforming planning.
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
Vision-language models perform only marginally above random on action quality assessment and retain systematic biases even after targeted prompting and contrastive reformulation.
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
PromptCOS is a content-only watermarking method for LLM system prompts that embeds detectable cyclic signals via auxiliary tokens while preserving fidelity and resisting removal attacks.
The PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
Persona prompting gains expertise depth at the cost of reduced clarity in LLM answers and works best on advisory questions in medicine and psychology.
Transformer layers are analogous to power method steps, tilting tokens toward the principal eigenvector of the output-value weight product, with stronger analytical and empirical alignment in shared-weight models and a proposed steering method.
Intent Signal Theory formalizes four distinct intent-related objects in human-AI interaction, introduces a theorem on irreversible private intent loss, and reports supporting patterns from studies across LLMs, languages, and tasks.
Introduces TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP), training-free structured prompting frameworks that improve LLM performance on table question answering over baselines on TableBench and achieve SOTA on FeTaQA.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
Arbiter-K is a governance-first architecture that turns probabilistic agent reasoning into discrete instructions with runtime taint propagation to block unsafe actions, reporting 76-95% interception rates and a 92.79% gain over baseline policies on two test systems.
LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.
Prompt Duel Optimizer uses dueling bandits and LLM-as-judge pairwise feedback with Double Thompson Sampling and top-performer mutation to find stronger prompts than label-free baselines on BBH and MS MARCO under limited comparison budgets.
FinKG-News constructs news-centric financial knowledge graphs to support in-context learning for credit risk report generation across three dimensions, claiming 19-34% quality gains and fewer hallucinations than baselines.
A taxonomy that consolidates prompt patterns from prior surveys into 30 unique canonical forms organized by two dimensions.
A 432-run experiment across capability tiers refutes the assumption of a monotone inverse relationship between LLM capability and optimal harness complexity, showing model-type-specific patterns instead.
LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.
LLARS is a new integrated platform that combines collaborative prompt authoring, cost-controlled batch generation, and hybrid evaluation to help domain experts and developers jointly build and assess LLM systems.