Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
super hub Canonical reference
A General Language Assistant as a Laboratory for Alignment
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. We find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. In contrast, binary discrimination typically performs and scales very similarly to imitation learning. Finally we study a `preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. As an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. We find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. Next we investigate scaling trends for several training objectives relevant to alignment,
authors
co-cited works
representative citing papers
XSTest is a benchmark for detecting exaggerated safety refusals in large language models on clearly safe prompts.
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
A new benchmark reveals that language models including GPT-3 are truthful on only 58% of questions designed to elicit popular misconceptions, far below human performance of 94%, with larger models performing worse.
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
The paper presents OverEager-Gen, a 500-scenario benchmark showing that removing consent declarations from prompts increases overeager actions by 11.9-17.2 percentage points across models, with agent framework choice dominating base-model effects.
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summarization.
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
EuropeMedQA is presented as the first comprehensive multilingual and multimodal medical examination dataset drawn from official regulatory exams in four European countries.
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
H-Elo is an FHE-based protocol that enables private rating-based matchmaking while achieving accuracy comparable to plaintext implementations.
The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
Varying decoding strategies such as temperature and sampling methods jailbreaks safety alignments in open-source LLMs, raising misalignment from 0% to over 95% at 30x lower cost than prior attacks.
citing papers explorer
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
-
Benchmark Data Contamination of Large Language Models: A Survey
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.