Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
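The abstract's recommendation to report a range of performance across plausible prompt formats, rather than a single number, can be illustrated with a minimal sketch. This is not the FormatSpread algorithm itself, which the abstract does not specify in detail; it only shows the reporting idea. The function name `performance_range`, the format strings, and the accuracy values are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, Tuple


def performance_range(acc_by_format: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (lowest, highest, spread) of accuracy across prompt formats."""
    if not acc_by_format:
        raise ValueError("need at least one evaluated format")
    lo = min(acc_by_format.values())
    hi = max(acc_by_format.values())
    return lo, hi, hi - lo


# Hypothetical accuracies for three meaning-preserving formats of one task.
# These numbers are made up for the example, not taken from the paper.
example = {
    "Question: {q}\nAnswer: {a}": 0.25,
    "QUESTION: {q} ANSWER: {a}": 0.5,
    "Q: {q} || A: {a}": 0.75,
}

lo, hi, spread = performance_range(example)
print(f"accuracy range: [{lo:.2f}, {hi:.2f}], spread = {spread:.2f}")
```

Reporting the interval alongside the spread makes it visible when a single-format score sits at one extreme of what the model can achieve on the same task.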
citation-role summary
roles: background 6
citation-polarity summary
polarities: background 6
representative citing papers
citing papers explorer
- Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
  Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
- CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
  LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
- The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval
  LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.
- Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
  A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
- SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models
  SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.
- Activation Steering with a Feedback Controller
  Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
- Towards Context-Invariant Safety Alignment for Large Language Models
  Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
- Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits
  Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.
- Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML
  Global Bradley-Terry rankings of LLMs are misleading due to structured heterogeneity in user preferences, and small (λ, ν)-portfolios recover coherent subpopulations that cover over 96% of votes with just five rankings.
- Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
  LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
- Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
  LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.
- What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
  Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.
- Compared to What? Baselines and Metrics for Counterfactual Prompting
  Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
- Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
  POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
- Causal Evidence that Language Models use Confidence to Drive Behavior
  Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.
- Collective AI can amplify tiny perturbations into divergent decisions
  Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.
- PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
  PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.
- Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
  Fine-tuning LLMs on the SubPOP dataset of 3,362 questions and 70K pairs reduces the gap between LLM predictions and human survey responses by up to 46% and generalizes to unseen surveys and subpopulations.
- Lessons from the Trenches on Reproducible Evaluation of Language Models
  The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
- Benchmarking Local Language Models for Social Robots using Edge Devices
  Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.
- The Cartesian Cut in Agentic AI
  LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
- The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
  PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.
- Position: AI Evaluations Should be Grounded on a Theory of Capability
  AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
- Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
  MLLMs achieve only 42% accuracy on a new audio-visual task requiring second-order spatial ToM under perceptual limits, while a proposed sensory-bounded CoT outperforms egocentric and allocentric baselines.
- CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging