Title resolution pending

UltraFeedback: Boosting Language Models with High-quality Feedback , author= · 2023

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

browse 6 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.

Response Time Enhances Alignment with Heterogeneous Preferences

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

Hybrid Policy Distillation for LLMs

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

InternLM2 Technical Report

cs.CL · 2024-03-26 · unverdicted · novelty 5.0

InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

citing papers explorer

Showing 6 of 6 citing papers.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 20 · 2 links
ODRPO decomposes discrete rewards into ordinal binary indicators to create robust, variance-aware advantage estimators for noisy RLAIF in LLM alignment.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 73
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
Hybrid Policy Distillation for LLMs cs.CL · 2026-04-22 · unverdicted · none · ref 53
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 189
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 136
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning cs.LG · 2024-02-18 · unverdicted · none · ref 120
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer