ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Canonical reference
Title resolution pending
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
OLSF-TRS is a generalized sequential decision framework using structured combinatorial optimization and multi-agent reinforcement learning for order-tote-robot coordination in tote-handling robotic systems, with near-optimal performance on small scales and 8-30%+ improvements over heuristics onlarge
Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
LIMEN discovers effective RL interfaces by using LLMs to evolve observation and reward programs together from raw state, guided by policy training success, outperforming single-component optimization.
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Discrete Tilt Matching recasts dLLM fine-tuning as state-level matching of tilted local unmasking posteriors, producing a stable weighted cross-entropy loss that improves Sudoku and Countdown performance when applied to LLaDA-8B-Instruct.
JAXenstein ports the Wolfenstein 3D engine to JAX to create a fast, scalable benchmark for first-person visual RL that is several times quicker than existing vision-based alternatives.
GoLongRL releases a 23K-sample open long-context RL dataset spanning 9 tasks and introduces TMN-Reweight to improve multitask optimization, achieving performance comparable to much larger models under GRPO.
GRLO shows RLHF from scratch on 5K open-ended prompts raises average performance from 24.1 to 63.1 across domains on Qwen3-4B-Base using 46x less data and 68x less compute than in-domain RLVR while remaining competitive with heavily post-trained models.
A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.
GAGPO computes step-aligned temporal advantages from grouped rollout samples without a learned critic, enabling stable policy optimization in multi-turn agent environments.
ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.
FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.
Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
MARLaaS enables concurrent RL fine-tuning across up to 32 tasks using LoRA adapters and a disaggregated asynchronous architecture, matching single-task performance while improving accelerator utilization by 4.3x and cutting end-to-end time by 85%.
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
POETS uses compute-efficient LLM policy ensembles to implicitly perform KL-regularized Thompson sampling, delivering O(sqrt(T gamma_T)) regret bounds and state-of-the-art sample efficiency in scientific discovery tasks such as protein search and quantum circuit design.
BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon ALFWorld tasks while raising effective sample use from 23.7% to 82%.
Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.
PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
citing papers explorer
-
Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration
Prepending stochastic sequences from Lorem Ipsum vocabulary to prompts during GRPO resampling broadens reasoning exploration and outperforms standard resampling on hard tasks for 1.7B-7B models.
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
- Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
- Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning