Tradeoffs between alignment and helpfulness in language models

3 Pith papers cite this work. Polarity classification is still indexing.

fields: cs.LG (3)

years: 2026 (2), 2024 (1)

representative citing papers

Selective Safety Steering via Value-Filtered Decoding

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Value-filtered decoding steers LLM outputs for safety at decoding time using a value criterion with an explicit bound on false interventions controlled by one threshold hyperparameter.
