Alleviating attention hacking in discriminative reward modeling through interaction distillation

Zang, J · 2025 · arXiv 2508.02618

4 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

FORCEBENCH shows model judges often violate expected ordering on evidence-calibrated vs force-raised claim pairs, with standard support prompting yielding 47.2% MVR and explicit warrant prompting reducing it to 24.5%.

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

cs.SE · 2026-05-30 · unverdicted · novelty 6.0

About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

cs.LG · 2026-06-27 · unverdicted · novelty 5.0

MOSAIC combines frozen-LLM semantic embeddings with hierarchical consistency objectives to report up to 3.4% AUC gains on knowledge-tracing benchmarks including a new MOOC dataset.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

cs.AI · 2026-05-27 · unverdicted · none · ref 5

FORCEBENCH shows model judges often violate expected ordering on evidence-calibrated vs force-raised claim pairs, with standard support prompting yielding 47.2% MVR and explicit warrant prompting reducing it to 24.5%.

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · none · ref 123

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
