Helping or Herding? Reward Model Ensembles Mitigate but Do Not Eliminate Reward Hacking

15 Pith papers cite this work. Polarity classification is still indexing.

Citation roles: background 1, method 1

Years: 2026 (12), 2025 (3)

Representative citing papers

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

FUSE: Ensembling Verifiers with Zero Labeled Data

stat.ML · 2026-04-20 · unverdicted · novelty 6.0

FUSE ensembles verifiers without labeled data by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
