Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, and J

URL https://openreview · 2025 · arXiv:2503.01067

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

cs.SE · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing scaling trends and cross-lingual transfer.

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs generalize well spatially to unseen maps in shortest-path tasks but fail to scale to longer paths because of recursive instability, with data coverage setting hard limits.

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.

citing papers explorer

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring · cs.SE · 2026-05-01 · unverdicted · none · ref 16 · 2 links
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation · cs.CL · 2026-04-21 · unverdicted · none · ref 18
Generalization in LLM Problem Solving: The Case of the Shortest Path · cs.AI · 2026-04-16 · unverdicted · none · ref 45
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping · cs.LG · 2026-04-13 · unverdicted · none · ref 6
