hub

Advances in Neural Information Processing Systems , volume=

Simpo: Simple preference optimization with a reference-free reward , author=

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

browse 10 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

cs.LG · 2026-05-11 · conditional · novelty 7.0

DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Convex Optimization for Alignment and Preference Learning on a Single GPU

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.

Threshold-Guided Optimization for Visual Generative Models

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

Multilingual Safety Alignment via Self-Distillation

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

AlignCultura: Towards Culturally Aligned Large Language Models?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

citing papers explorer

Showing 10 of 10 citing papers.

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models cs.CV · 2026-05-20 · unverdicted · none · ref 7
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment cs.AI · 2026-05-20 · conditional · none · ref 2
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling cs.LG · 2026-05-11 · conditional · none · ref 17
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 19
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 52
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 80
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
Convex Optimization for Alignment and Preference Learning on a Single GPU cs.LG · 2026-05-22 · unverdicted · none · ref 17
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
Threshold-Guided Optimization for Visual Generative Models cs.LG · 2026-05-06 · unverdicted · none · ref 26
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
Multilingual Safety Alignment via Self-Distillation cs.LG · 2026-05-03 · unverdicted · none · ref 21 · 2 links
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
AlignCultura: Towards Culturally Aligned Large Language Models? cs.CL · 2026-04-21 · unverdicted · none · ref 75
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

Advances in Neural Information Processing Systems , volume=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer