PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
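Rectified flow models construct the intermediate sample as a straight-line interpolation between prior noise and data, with a constant velocity target along that line; this is the quantity a surrogate alignment objective like PNAPO's is built on. A minimal sketch under those standard rectified-flow conventions (the variable names and the shared-noise pairing are illustrative assumptions, not details taken from the paper):

```python
import torch

def straight_line_interpolation(x1, x0, t):
    """Rectified-flow interpolant: x_t = (1 - t) * x0 + t * x1.

    x1: clean data batch, shape (B, ...)
    x0: prior noise batch, same shape as x1
    t:  per-sample times in [0, 1], shape (B,)
    """
    # Broadcast t over the non-batch dimensions.
    t = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1.0 - t) * x0 + t * x1
    # The straight-line path has a constant target velocity x1 - x0,
    # which a rectified-flow model is trained to regress.
    target_velocity = x1 - x0
    return xt, target_velocity

# Usage sketch: pair the preferred and dispreferred samples with the same prior
# noise so both trajectories share a starting point (an assumption about how
# "prior noise pairs" might be formed, not a claim about PNAPO's exact recipe).
B, C, H, W = 4, 3, 64, 64
x1_win, x1_lose = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
x0 = torch.randn(B, C, H, W)
t = torch.rand(B)
xt_win, v_win = straight_line_interpolation(x1_win, x0, t)
xt_lose, v_lose = straight_line_interpolation(x1_lose, x0, t)
```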
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 11170–11189
17 papers cite this work. Polarity classification of the citations is still indexing.
Representative citing papers
The cumulative token importance-sampling (IS) ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
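The contrast between a full-sequence importance ratio and a cumulative per-token (prefix) ratio can be made concrete in a few lines; this is a generic sketch of that idea, not CTPO's exact estimator (shapes and names are assumptions):

```python
import torch

def token_importance_ratios(logp_new, logp_old):
    """Compare a full-sequence IS ratio with a cumulative (prefix) token ratio.

    logp_new, logp_old: (B, T) per-token log-probs of the same sampled tokens
    under the current policy and under the policy that generated them.
    """
    log_ratio = logp_new - logp_old
    # Full-sequence ratio: one scalar per sequence, reused for every token's gradient.
    seq_ratio = log_ratio.sum(dim=-1, keepdim=True).exp().expand_as(log_ratio)
    # Cumulative token ratio: token t is weighted only by the ratio of its prefix,
    # so tokens that come later cannot inflate the weight applied to earlier ones.
    prefix_ratio = log_ratio.cumsum(dim=-1).exp()
    return seq_ratio, prefix_ratio

# Toy usage with placeholder log-probs.
B, T = 2, 8
logp_new = -torch.rand(B, T)
logp_old = -torch.rand(B, T)
seq_ratio, prefix_ratio = token_importance_ratios(logp_new, logp_old)
```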
HopRank is a self-supervised LLM-tuning method that turns node classification into link prediction via hierarchical hop-based preference sampling, matching supervised GNN performance with zero labeled data on text-attributed graphs.
DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.
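Turning reward scores over a finite candidate set into a KL-regularized target distribution has a standard closed form: a reference-policy-tilted softmax of the rewards. A minimal sketch under that assumption, not a claim about DDO-RM's exact mirror-descent projection:

```python
import torch

def kl_regularized_target(rewards, ref_logprobs, beta=0.1):
    """Target distribution over a finite candidate set:
    pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta).

    rewards:      (B, K) reward-model scores for K candidates per prompt
    ref_logprobs: (B, K) log-probs of those candidates under the reference policy
    beta:         KL-regularization strength (smaller = sharper target)
    """
    logits = ref_logprobs + rewards / beta
    return torch.softmax(logits, dim=-1)

# Usage sketch: the policy can then be trained to match this target over the
# K candidates, e.g. with a cross-entropy or KL loss.
rewards = torch.tensor([[1.0, 0.2, -0.5]])
ref_logprobs = torch.log(torch.tensor([[0.5, 0.3, 0.2]]))
target = kl_regularized_target(rewards, ref_logprobs, beta=0.5)
```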
KTO (Model Alignment as Prospect Theoretic Optimization) aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
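Because the KTO objective is published, its value function can be sketched compactly. The snippet below simplifies the reference point z0 to a clamped, detached batch mean rather than the paper's mismatched-pair KL estimator, and the names are illustrative:

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, beta=0.1,
             lambda_d=1.0, lambda_u=1.0):
    """Sketch of a KTO-style value function on binary feedback.

    policy_logps, ref_logps: (B,) sequence log-probs under policy / reference
    is_desirable: (B,) bool, True for desirable outputs, False for undesirable
    """
    reward = beta * (policy_logps - ref_logps)          # implicit reward r_theta
    # Reference point z0: a crude batch estimate of beta * KL(policy || ref),
    # clamped at zero and detached (a simplification of the paper's estimator).
    z0 = torch.clamp(reward.mean(), min=0.0).detach()
    value = torch.where(
        is_desirable,
        lambda_d * torch.sigmoid(reward - z0),   # gains on desirable outputs
        lambda_u * torch.sigmoid(z0 - reward),   # losses on undesirable outputs
    )
    # Maximizing utility = minimizing (lambda - value); constants dropped here.
    return -value.mean()

# Toy usage with placeholder log-probs and binary labels.
policy_logps = torch.tensor([-12.0, -15.0, -20.0])
ref_logps = torch.tensor([-13.0, -14.0, -18.0])
is_desirable = torch.tensor([True, False, True])
loss = kto_loss(policy_logps, ref_logps, is_desirable)
```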
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
Anomaly Preference Optimization reformulates anomalous image synthesis as preference learning with implicit alignment from real anomalies and a time-aware capacity allocation module for diffusion models to balance diversity and fidelity.
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
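Restricting updates to the top transformer layers is easy to express in PyTorch-style code; the sketch below is a generic illustration, where the layer count, attribute paths, and checkpoint are assumptions rather than PERSA's actual configuration:

```python
def freeze_all_but_top_layers(model, num_trainable_layers=4):
    """Freeze every parameter, then re-enable only the last N decoder blocks
    (plus the LM head), so RLHF updates touch a small slice of the network."""
    for p in model.parameters():
        p.requires_grad = False
    # Llama-style Hugging Face models expose decoder blocks at model.model.layers;
    # adjust the attribute path for other architectures.
    for block in model.model.layers[-num_trainable_layers:]:
        for p in block.parameters():
            p.requires_grad = True
    for p in model.lm_head.parameters():
        p.requires_grad = True
    return model

# Usage sketch (assuming a Llama-style causal LM checkpoint):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("...")
# model = freeze_all_but_top_layers(model, num_trainable_layers=4)
```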
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on the forget-retain trade-off in LLM benchmarks.
PhyMix unifies a new multi-aspect physics evaluator with implicit policy optimization and explicit test-time correction to produce single-image 3D indoor scenes that are both visually faithful and physically plausible.
YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.
A unified Pair-GRPO framework extends GRPO with soft and hard pairwise preference variants, proving gradient equivalence under Taylor expansion and delivering improved stability and performance in RLHF.
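The GRPO baseline that Pair-GRPO extends replaces a learned critic with group-relative advantages, i.e. rewards standardized within the group of completions sampled for the same prompt. A minimal sketch of that baseline computation (not of the paper's pairwise variants):

```python
import torch

def grpo_group_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO.

    rewards: (B, G) scalar rewards for G sampled completions per prompt.
    Each completion's advantage is its reward standardized within its group,
    so no learned value function / critic is needed.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: two prompts, four completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
adv = grpo_group_advantages(rewards)
```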
PoliLegalLM, trained with continued pretraining, progressive SFT, and preference RL on a legal corpus, outperforms similar-scale models on LawBench, LexEval, and a real-world PoliLegal dataset while staying competitive with much larger models.
LlamaFactory provides a unified no-code framework for efficient fine-tuning of 100+ LLMs via an integrated web UI and has been released on GitHub.