Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible, efficient rlhf framework , author=

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation

cs.LG · 2026-05-01 · unverdicted · novelty 4.0

Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

cs.AI · 2026-05-01 · unverdicted · novelty 4.0 · 2 refs

AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.

citing papers explorer

Showing 8 of 8 citing papers.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL cs.LG · 2026-05-17 · unverdicted · none · ref 42
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective cs.LG · 2026-05-08 · unverdicted · none · ref 36
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning cs.LG · 2026-05-01 · unverdicted · none · ref 28 · 2 links
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
Unified Data Selection for LLM Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 9
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination cs.LG · 2026-05-01 · unverdicted · none · ref 3
TeamTR is a trust-region framework for multi-agent LLM fine-tuning that resamples trajectories after each update to convert quadratic compounding occupancy shift into linear scaling and yields per-update improvement lower bounds.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CV · 2026-04-21 · unverdicted · none · ref 40
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
Exploring Pass-Rate Reward in Reinforcement Learning for Code Generation cs.LG · 2026-05-01 · unverdicted · none · ref 59
Pass-rate rewards in critic-free RL for code generation fail to outperform binary rewards because partial-pass solutions induce conflicting gradient directions that do not consistently favor full correctness.
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning cs.AI · 2026-05-01 · unverdicted · none · ref 8 · 2 links
AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.

Proceedings of the Twentieth European Conference on Computer Systems , pages=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer