The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
iPOE generates and optimizes annotation guidelines from explanations to produce interpretable prompts, reporting up to 39% gains over baselines on four datasets with LLM explanations substituting for human ones.
citing papers explorer
-
Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.
-
iPOE: Interpretable Prompt Optimization via Explanations
iPOE generates and optimizes annotation guidelines from explanations to produce interpretable prompts, reporting up to 39% gains over baselines on four datasets with LLM explanations substituting for human ones.