
arXiv: 2310.15288 · v3 · submitted 2023-10-23 · cs.AI · cs.LG


Active teacher selection for reward learning

Justin Svegliato, Kyle Wray, Rachel Freedman, Stuart Russell

classification: cs.AI · cs.LG
keywords: teacher, learning, framework, feedback, reward, selection, systems, active
Original abstract

Reward learning techniques enable machine learning systems to learn objectives from human feedback. A core limitation of these systems is their assumption that all feedback comes from a single human teacher, despite gathering feedback from large and heterogeneous populations. We propose the Hidden Utility Bandit (HUB) framework to model differences in teacher rationality, expertise, and costliness, formalizing the problem of learning from multiple teachers. We develop a variety of solution algorithms and apply them to two real-world domains: paper recommendation systems and COVID-19 vaccine testing. We find that Active Teacher Selection (ATS) algorithms outperform baselines by actively selecting when and which teacher to query. Our key contributions are 1) the HUB framework: a novel mathematical framework for modeling the teacher selection problem, 2) ATS: an active-learning-based algorithmic approach that demonstrates the utility of modeling teacher heterogeneity, and 3) proof-of-concept application of the HUB framework and ATS approaches to model and solve multiple real-world problems with complex trade-offs between reward learning and optimization.
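To make the setup concrete, below is a minimal Python sketch of the two ingredients the abstract names: Boltzmann-rational teachers that differ in rationality (beta) and query cost, and an active rule for deciding which teacher to ask. This is not the paper's ATS algorithm (the paper develops dedicated solution algorithms for the HUB framework); the arm utilities, teacher parameters, and the gap-based selection heuristic here are illustrative assumptions.

# Minimal illustrative sketch (not the paper's ATS algorithm): learn hidden
# arm utilities from preference queries to teachers who differ in
# rationality (beta) and query cost. All numbers below are made up.
import numpy as np

rng = np.random.default_rng(0)
true_utils = np.array([0.2, 0.5, 0.8])          # hidden per-arm utilities
teachers = [{"beta": 0.5, "cost": 0.01},        # cheap but noisy teacher
            {"beta": 5.0, "cost": 0.10}]        # costly but near-rational teacher

def prefers_first(teacher, i, j):
    # Boltzmann-rational feedback: P(i preferred over j) = sigmoid(beta * (u_i - u_j)).
    p = 1.0 / (1.0 + np.exp(-teacher["beta"] * (true_utils[i] - true_utils[j])))
    return rng.random() < p

u_hat = np.zeros(3)                              # utility estimates (identified up to a shift)
lr, total_cost = 0.1, 0.0
for _ in range(500):
    i, j = np.argsort(u_hat)[-2:]                # compare the current top two arms
    gap = abs(u_hat[i] - u_hat[j])
    # Toy active-selection rule: pay for the reliable teacher only when the
    # estimates are too close for the noisy teacher to disambiguate.
    t = teachers[1] if gap < 0.1 else teachers[0]
    total_cost += t["cost"]
    y = 1.0 if prefers_first(t, i, j) else 0.0
    # Gradient step on the Bradley-Terry log-likelihood with the teacher's known beta.
    p_i = 1.0 / (1.0 + np.exp(-t["beta"] * (u_hat[i] - u_hat[j])))
    u_hat[i] += lr * t["beta"] * (y - p_i)
    u_hat[j] -= lr * t["beta"] * (y - p_i)

print("estimated utilities:", np.round(u_hat, 2), "| query cost:", round(total_cost, 2))

The qualitative point matches the abstract: modeling teacher heterogeneity lets the learner trade query cost against feedback reliability, rather than treating all feedback as if it came from a single teacher.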

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforcement Learning from Human Feedback: A Statistical Perspective

stat.ML · 2026-04 · accept · novelty 2.0

    A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.
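For reference, the Bradley-Terry-Luce model mentioned in this summary is the standard pairwise-preference model in this literature; it scores preferences from latent utilities, the same functional form as the Boltzmann-rational teacher above with rationality beta = 1:

P(i \succ j) = \frac{\exp(u_i)}{\exp(u_i) + \exp(u_j)} = \sigma(u_i - u_j)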