Pith: machine review for the scientific record


Spurious Rewards: Rethinking Training Signals in RLVR. arXiv preprint arXiv:2506.10947

16 Pith papers cite this work. Polarity classification is still indexing.



years: 2026 (15) · 2025 (1)


representative citing papers

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

Hölder Policy Optimisation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
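The HölderPO summary above builds on the Hölder (power) mean, which interpolates between min, geometric, arithmetic, and max aggregation as the exponent p varies. A minimal sketch of that mean for positive values follows; the dynamic p annealing schedule and the GRPO integration are the cited paper's own and are not reproduced here.

```python
import math

def holder_mean(values, p):
    """Hölder (power) mean M_p(x) = ((1/n) * sum(x_i ** p)) ** (1/p)
    for positive values.  Limiting cases: p -> -inf gives the minimum,
    p = 1 the arithmetic mean, p -> 0 the geometric mean, and
    p -> +inf the maximum."""
    n = len(values)
    if p == 0:
        # geometric mean, the p -> 0 limit of the power mean
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)
```

Annealing p during training moves the aggregation smoothly from a pessimistic (min-like) to an average (arithmetic) token weighting, which is the unification the summary refers to.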

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Reward Hacking in Rubric-Based Reinforcement Learning cs.AI · 2026-05-12 · unverdicted · none · ref 26

    Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

  • Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 9

    AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.

  • The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling cs.AI · 2026-05-04 · unverdicted · none · ref 9 · 2 links

    APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.

  • Characterizing Model-Native Skills cs.AI · 2026-04-19 · conditional · none · ref 75

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
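The particle power sampling entry above targets a sharpened distribution proportional to p(x)^alpha. A toy sketch of the underlying power-target reweighting, using self-normalized importance weights w_i proportional to p(x_i)^(alpha - 1) over samples already drawn from p, is below; all names are illustrative, and the cited APPS method additionally uses proposal-corrected reweighting and future-value-guided selection at block boundaries, which this sketch omits.

```python
import math
import random

def power_resample(samples, logps, alpha, rng=None):
    """Given samples x_i ~ p with log-probabilities logps, resample them
    toward the power target p(x)^alpha / Z using self-normalized
    importance weights w_i proportional to p(x_i)^(alpha - 1)."""
    rng = rng or random.Random()
    logw = [(alpha - 1.0) * lp for lp in logps]
    m = max(logw)  # subtract the max log-weight for numerical stability
    weights = [math.exp(v - m) for v in logw]
    return rng.choices(samples, weights=weights, k=len(samples))
```

With alpha = 1 the weights are uniform and the target is p itself; larger alpha concentrates the resampled particles on high-probability samples.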