Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Introduces the Block-R1 benchmark, the Block-R1-41K dataset, and a conflict score for handling domain-specific optimal block sizes in RL post-training of diffusion LLMs.
arXiv preprint arXiv:2310.12036.
6 Pith papers cite this work; polarity classification is still in progress.
Citation-role summary: method (1). Citation-polarity summary: use (1).
Citing papers explorer
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman-divergence ratio matching; the resulting objective generalizes DPO and improves alignment quality.
- Response Time Enhances Alignment with Heterogeneous Preferences
  Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous, anonymous binary choices (see the drift-diffusion sketch after this list).
- Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
  Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization (see the gating sketch after this list).
- Process Reinforcement through Implicit Rewards
  PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data (see the implicit-reward sketch after this list).
- YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
  YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.
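For the response-time entry above, the underlying intuition is that a binary choice and its response time are generated by the same latent drift (preference strength), so fast answers tend to signal strong preferences. The sketch below is a minimal illustration using standard drift-diffusion first-passage results (unbiased start, absorbing bounds); it is not the paper's estimator, and all names and parameter values are illustrative.

```python
import numpy as np

def ddm_choice_prob(drift, bound, sigma=1.0):
    """P(hit the upper bound) for an unbiased drift-diffusion process.

    Standard first-passage result for a Wiener process with drift `drift`,
    noise scale `sigma`, absorbing bounds at 0 and `bound`, start at bound/2:
    the choice probability is logistic in drift * bound / sigma**2.
    """
    return 1.0 / (1.0 + np.exp(-drift * bound / sigma**2))

def simulate_choice_and_rt(drift, bound, sigma=1.0, dt=1e-3, rng=None):
    """Simulate one binary choice and its response time from the same process."""
    rng = np.random.default_rng() if rng is None else rng
    x, t = bound / 2.0, 0.0
    while 0.0 < x < bound:
        x += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return int(x >= bound), t  # larger |drift| -> more decisive choice, shorter t
```

Because the expected response time shrinks as |drift| grows, observing it alongside the choice helps separate strong preferences from noisy ones across heterogeneous respondents, which is the effect the summary describes.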
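For the Gate-DPO entry, the only mechanism taken from the summary is "attenuate the gradient on low-probability rejected responses". Below is one hedged way to write that on top of the standard DPO loss: a sigmoid gate (threshold and temperature are made-up illustrative values) scales the gradient of the rejected-response term while leaving its forward value unchanged. The paper's actual gate may differ.

```python
import torch
import torch.nn.functional as F

def gated_dpo_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, gate_threshold=-200.0, gate_temp=20.0):
    """Standard DPO loss with a gradient gate on low-probability rejected responses.

    Inputs are summed sequence log-probs under the policy and a frozen reference.
    The gate shrinks toward 0 as logp_rejected falls far below `gate_threshold`;
    mixing with .detach() keeps the forward value of the rejected term intact
    while multiplying its gradient by the gate.
    """
    with torch.no_grad():
        gate = torch.sigmoid((logp_rejected - gate_threshold) / gate_temp)

    # Gradient-only scaling: value == logp_rejected, d/d(logp_rejected) == gate.
    logp_rejected_gated = gate * logp_rejected + (1.0 - gate) * logp_rejected.detach()

    chosen_term = beta * (logp_chosen - ref_logp_chosen)
    rejected_term = beta * (logp_rejected_gated - ref_logp_rejected)
    return -F.logsigmoid(chosen_term - rejected_term).mean()
```

When a rejected response is already nearly impossible under the policy, the gate stops DPO from pushing its probability further down, which is one way to avoid the probability-collapse failure mode the summary mentions.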
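For the PRIME entry, a common way to write the implicit-reward idea is as a scaled log-probability ratio between the learned model and a frozen reference, with the sequence-level sum trained against binary rollout outcomes. The sketch below follows that reading; function names, the beta value, and the absence of padding masks are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def implicit_process_rewards(logp_model, logp_ref, beta=0.05):
    """Per-token implicit rewards as scaled log-prob ratios, shape (batch, seq)."""
    return beta * (logp_model - logp_ref)

def implicit_prm_update_loss(logp_model, logp_ref, outcome_labels, beta=0.05):
    """Online update of the implicit PRM from rollout outcome labels (1 = correct).

    The sequence-level implicit reward (sum of per-token ratios) is used as the
    logit of a Bernoulli likelihood over the outcome, so correct rollouts raise
    the token-level ratios and incorrect rollouts lower them. Padding tokens
    should be masked out of the sum in practice.
    """
    seq_reward = implicit_process_rewards(logp_model, logp_ref, beta).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(seq_reward, outcome_labels.float())
```

The per-token values from `implicit_process_rewards` can then serve as dense process rewards in the RL update, which is the sense in which the PRM is refreshed online from rollouts plus outcome labels.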