VAC replaces scalar rewards with natural language feedback in an alternating training loop between a feedback model and a policy model, yielding better personalized QA on the LaMP-QA benchmark.
arXiv:2410.09923 [cs.IR] https://arxiv.org/abs/2410.09923
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 2
representative citing papers
POEM constructs dynamic partial-order sequences from multi-task ranking scores to enhance real-time sequential recommendation, reporting 0.2% watch-time lifts when deployed on Kuaishou.
citing papers explorer
- Learning from Natural Language Feedback for Personalized Question Answering