Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
representative citing papers
citing papers explorer
-
Mitigating Cognitive Bias in RLHF by Altering Rationality
Dynamically adjusting beta via LLM-as-judge downweights biased comparisons to learn more rational reward models from flawed human preferences.
- Lessons from the Trenches on Reproducible Evaluation of Language Models