hub

Controlled decoding from language models

Controlled decoding from language models , author= · 2023 · arXiv 2310.17022

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

cs.AI · 2025-05-25 · unverdicted · novelty 7.0

UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.

TRAM: Test-Time Risk Adaptation with Mixture of Agents

cs.LG · 2024-08-16 · unverdicted · novelty 7.0

TRAM is a test-time mixture method that scores and composes risk-neutral source policies using reward and occupancy-based risk to achieve new reward-risk tradeoffs without parameter updates.

Towards Diverse Scientific Hypothesis Search with Large Language Models

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

A parallel-tempering evolutionary framework for LLM hypothesis search improves both quality and diversity of candidates in molecular, equation, and algorithm discovery under fixed validation budgets.

Spectral Souping: A Unified Framework for Online Preference Alignment

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.

Selective Safety Steering via Value-Filtered Decoding

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Value-filtered decoding steers LLM outputs for safety at decoding time using a value criterion with an explicit bound on false interventions controlled by one threshold hyperparameter.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

Synergistic Benefits of Joint Molecule Generation and Property Prediction

cs.LG · 2025-04-23 · unverdicted · novelty 5.0

Hyformer jointly models molecule generation and property prediction via alternating attention and joint pre-training, showing synergistic gains in conditional sampling, OOD prediction, and a drug design case for antimicrobial peptides.

Multi-Objective Exploration and Preference Optimization via Mutual Information

cs.CL · 2026-07-01 · unverdicted · novelty 4.0

MI-EPO maximizes joint conditional mutual information among responses, feedback, and preference vectors, using probabilistic routing to improve alignment and controllability in multi-objective LLM optimization.

Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling

cs.CL · 2025-07-08

citing papers explorer

Showing 13 of 13 citing papers.

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion cs.LG · 2026-05-22 · unverdicted · none · ref 51
CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 152 · 2 links
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs cs.AI · 2025-05-25 · unverdicted · none · ref 26
UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
TRAM: Test-Time Risk Adaptation with Mixture of Agents cs.LG · 2024-08-16 · unverdicted · none · ref 27
TRAM is a test-time mixture method that scores and composes risk-neutral source policies using reward and occupancy-based risk to achieve new reward-risk tradeoffs without parameter updates.
Towards Diverse Scientific Hypothesis Search with Large Language Models cs.LG · 2026-06-09 · unverdicted · none · ref 44
A parallel-tempering evolutionary framework for LLM hypothesis search improves both quality and diversity of candidates in molecular, equation, and algorithm discovery under fixed validation budgets.
Spectral Souping: A Unified Framework for Online Preference Alignment cs.LG · 2026-05-19 · unverdicted · none · ref 18
Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
Selective Safety Steering via Value-Filtered Decoding cs.LG · 2026-05-14 · unverdicted · none · ref 14
Value-filtered decoding steers LLM outputs for safety at decoding time using a value criterion with an explicit bound on false interventions controlled by one threshold hyperparameter.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement cs.CL · 2026-05-11 · conditional · none · ref 27
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
Generalization in LLM Problem Solving: The Case of the Shortest Path cs.AI · 2026-04-16 · unverdicted · none · ref 34
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 221
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
Synergistic Benefits of Joint Molecule Generation and Property Prediction cs.LG · 2025-04-23 · unverdicted · none · ref 50
Hyformer jointly models molecule generation and property prediction via alternating attention and joint pre-training, showing synergistic gains in conditional sampling, OOD prediction, and a drug design case for antimicrobial peptides.
Multi-Objective Exploration and Preference Optimization via Mutual Information cs.CL · 2026-07-01 · unverdicted · none · ref 34
MI-EPO maximizes joint conditional mutual information among responses, feedback, and preference vectors, using probabilistic routing to improve alignment and controllability in multi-objective LLM optimization.
Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling cs.CL · 2025-07-08 · unreviewed · ref 19

Controlled decoding from language models

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer