Cascade reward sampling for efficient decoding-time alignment
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Representative citing papers
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
citing papers explorer
- Gradient-Guided Reward Optimization for Inference-time Alignment
GGRO monitors token entropy to trigger gradient-guided token injection from reward models, improving LLM alignment on safety, helpfulness, and reasoning tasks at inference time.
- Multi-Objective Exploration and Preference Optimization via Mutual Information
MI-EPO maximizes joint conditional mutual information among responses, feedback, and preference vectors, using probabilistic routing to improve alignment and controllability in multi-objective LLM optimization.