InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

Qiaosheng Zhang; Tiancheng Han; Wenqi Shao; Wuzhou Yu; Yong Li

arXiv: 2606.03329 · v1 · pith:2MZV54T7new · submitted 2026-06-02 · 💻 cs.AI

Tiancheng Han, Yong Li, Wuzhou Yu, Qiaosheng Zhang, Wenqi Shao

Pith reviewed 2026-06-28 10:28 UTC · model grok-4.3

classification 💻 cs.AI

keywords long-context tasks · memory agents · reinforcement learning · information gain · answer-conditioned reward · chunk-wise agents · GRPO

The pith

InfoMem improves long-context memory agents by using a reward that measures how much the final memory increases the likelihood of the ground-truth answer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InfoMem as a reward mechanism for training chunk-wise memory agents that handle long contexts by sequentially processing document chunks and updating a compact memory. It evaluates the final memory's utility through the increase in the model's per-token log-likelihood of the correct answer, rather than sparse answer rewards or lexical overlap measures. The signal is restricted to successful trajectories and normalized before use in the reward. Experiments show this yields better performance than comparable RL baselines under the same GRPO framework and training budget. Separate analyses establish that effective rewards for memory updates should be answer-conditioned, normalized, and limited to successes.

Core claim

InfoMem quantifies final-memory utility as the increase in per-token log-likelihood of the ground-truth answer caused by the memory. The method applies this measure exclusively to successful trajectories and normalizes the resulting value before reward composition. Under identical GRPO training conditions and budget, agents trained with InfoMem outperform memory-agent RL baselines on long-context tasks.
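As a rough illustration of the quantity in the core claim, the sketch below averages the difference between two scorings of the same ground-truth answer tokens: one with the final memory in the context and one without. The function name, the toy numbers, and the choice of a no-memory baseline are assumptions made here for illustration; the paper's exact formulation is not reproduced on this page.

```python
from typing import Sequence


def per_token_gain(logp_with_memory: Sequence[float],
                   logp_without_memory: Sequence[float]) -> float:
    """Mean per-token log-likelihood lift of the answer due to the memory.

    Both inputs hold log p(token_t | context, earlier answer tokens) for the
    same ground-truth answer; only the conditioning context differs
    (hypothetical setup, not taken from the paper).
    """
    if len(logp_with_memory) != len(logp_without_memory):
        raise ValueError("both scorings must cover the same answer tokens")
    if not logp_with_memory:
        raise ValueError("answer must contain at least one token")
    n = len(logp_with_memory)
    return (sum(logp_with_memory) - sum(logp_without_memory)) / n


# Toy numbers (made up): a three-token answer scored twice.
with_mem = [-0.2, -0.1, -0.3]
without_mem = [-1.2, -0.9, -1.5]
gain = per_token_gain(with_mem, without_mem)
```

A positive value means the memory made the correct answer more likely per token; a negative value means the memory made it less likely.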

What carries the argument

Answer-conditioned information gain, the measured increase in the model's per-token log-likelihood of the ground-truth answer attributable to the final memory.

If this is right

Chunk-wise agents learn to preserve answer-relevant information more effectively across long documents.
RL training for memory updates improves when the reward is conditioned on the answer rather than the query.
Restricting the reward signal to successful trajectories and normalizing it before composition stabilizes optimization.
Comparable performance gains are possible without changes to the underlying GRPO framework or added training compute.
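The gating and normalization steps listed above can be sketched as follows. This is a minimal sketch under stated assumptions: a binary success flag, z-score normalization computed over the successful rollouts of one group, and additive composition with a weight `beta`. None of these specifics are confirmed by this page, and `compose_rewards` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


def compose_rewards(success: List[bool], gains: List[float],
                    beta: float = 0.5, eps: float = 1e-6) -> List[float]:
    """Outcome reward plus a normalized gain bonus on successful rollouts only."""
    kept = [g for s, g in zip(success, gains) if s]
    mu = mean(kept) if kept else 0.0
    sigma = pstdev(kept) if len(kept) > 1 else 0.0
    rewards = []
    for s, g in zip(success, gains):
        outcome = 1.0 if s else 0.0
        # Failed rollouts never receive the gain term, however large it is.
        bonus = beta * (g - mu) / (sigma + eps) if s else 0.0
        rewards.append(outcome + bonus)
    return rewards


# Made-up group of three rollouts; the second one failed.
rewards = compose_rewards([True, False, True], [2.0, 5.0, 1.0])
```

In this toy group the failed rollout has the largest raw gain but receives no bonus, which is the behavior the success restriction is meant to enforce.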

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reward structure could be tested in other sequential state-update settings where final outcome quality must guide intermediate decisions.
Extending the approach beyond chunk-wise agents to alternative memory architectures might reveal broader applicability.
Evaluating the method on additional long-context benchmarks outside the current experiments could identify task-specific patterns.

Load-bearing premise

The increase in per-token log-likelihood of the ground-truth answer due to the final memory is a reliable and superior signal for supervising memory updates.

What would settle it

Reproducing the experiments and observing no performance gain for InfoMem over the compared memory-agent RL baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.03329 by Qiaosheng Zhang, Tiancheng Han, Wenqi Shao, Wuzhou Yu, Yong Li.

**Figure 2.** Example of synthetic hallucinated evidence.

**Figure 3.** Effect of information-gain supervision side.

**Figure 4.** Fraction of rollouts whose final memory re

**Figure 5.** Training curves for the main comparison runs

**Figure 6.** Validation curves for the main comparison

**Figure 7.** ReMemR1 training dynamics under its callback-retrieval chunk-wise framework. Both curves are

Original abstract

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model's per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.
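The chunk-wise reading loop described in the abstract can be sketched as below. `update_memory` and `answer_from_memory` are hypothetical stand-ins for LLM calls, replaced here by trivial string functions so the snippet runs; fixed-size character chunks are a simplification, not the paper's chunking scheme.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Cut the document into fixed-size character chunks (a simplification)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def run_memory_agent(document: str, question: str,
                     update_memory: Callable[[str, str, str], str],
                     answer_from_memory: Callable[[str, str], str],
                     chunk_size: int = 40) -> str:
    memory = ""  # compact state carried across chunks
    for chunk in split_into_chunks(document, chunk_size):
        # The agent sees only the question, the current memory, and one chunk.
        memory = update_memory(question, memory, chunk)
    # The final answer is produced from the memory alone, not the document.
    return answer_from_memory(question, memory)


# Toy stand-ins for the LLM calls (hypothetical): keep chunks with a keyword.
def toy_update(question: str, memory: str, chunk: str) -> str:
    return (memory + " " + chunk).strip() if "code" in chunk else memory


def toy_answer(question: str, memory: str) -> str:
    return memory


doc = "filler text " * 5 + "the code is 7421. " + "more filler " * 5
result = run_memory_agent(doc, "What is the code?", toy_update, toy_answer)
```

Because the answer step never sees the document, anything the update step drops is lost for good, which is why the quality of the final memory is the thing worth rewarding.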

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

InfoMem adds a clean answer-conditioned information-gain reward for memory updates, restricted to successful trajectories and normalized, with reported gains over lexical and sparse baselines under a fixed GRPO budget.

The new piece is the reward that scores a memory by the increase it produces in the model's per-token log-likelihood of the ground-truth answer, applied only on successful trajectories and normalized before mixing with other terms. That formulation is not in the cited prior work and gives a more direct signal than lexical overlap or final-answer sparsity.

The paper does the obvious next step well: it isolates the effect of conditioning on the answer rather than the query, restricting to successes, and normalizing, then shows each choice matters in their ablations. The GRPO setup and fixed training budget make the comparison straightforward.

The main limitation is that the abstract gives no numbers, confidence intervals, or exact baseline scores, so the size of the improvement and its robustness are still unclear from what is visible. The core assumption, that likelihood lift on the answer is a reliably better training signal than existing options, holds up in their controlled tests but could be sensitive to model scale or task distribution; the paper does not claim otherwise.

This is useful reading for anyone training chunk-wise or memory-augmented agents on long documents. The design is coherent and the experiments appear properly scoped, so it deserves a serious referee even if the gains turn out modest on closer inspection.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes InfoMem, a reward mechanism for training chunk-wise memory agents on long-context tasks via RL. InfoMem defines the reward as the increase in the model's per-token log-likelihood of the ground-truth answer attributable to the final memory (answer-conditioned information gain). The signal is restricted to successful trajectories, normalized before composition, and used within the GRPO framework. The paper claims performance gains over lexical-overlap and sparse final-answer reward baselines under matched training budgets, and provides analyses showing that effective rewards should operate on successful trajectories, be normalized, and be conditioned on the answer rather than the query. Code is released.

Significance. If the reported gains hold under the stated controls, the work supplies a more direct utility signal for memory updates than existing sparse or lexical alternatives, which could improve training of memory agents. The accompanying analyses on reward properties (successful trajectories, normalization, answer conditioning) offer reusable design guidance. Releasing code supports reproducibility and is a clear strength.

minor comments (3)

Abstract: the performance claim would be strengthened by including at least one key quantitative result (e.g., accuracy delta or win rate) alongside the qualitative statement of improvement.
Method section: the precise definition of the per-token log-likelihood difference (including how the 'attributable to final memory' term is isolated) should be given as an explicit equation for clarity and reproducibility.
Experiments: confirm that all baselines use identical GRPO hyperparameters, context lengths, and success criteria so that the 'same training budget' comparison is unambiguous.
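For reference, one plausible explicit form of the definition requested in the Method comment is given below. The notation is introduced here and is not taken from the paper: q is the question, m_T the final memory, a* the ground-truth answer with |a*| tokens, and π_θ the scoring model. The no-memory baseline in the second term is an assumption.

```latex
\[
\mathrm{IG}(m_T) \;=\; \frac{1}{|a^\ast|} \sum_{t=1}^{|a^\ast|}
\Big[ \log \pi_\theta\!\left(a^\ast_t \mid q,\, m_T,\, a^\ast_{<t}\right)
    - \log \pi_\theta\!\left(a^\ast_t \mid q,\, a^\ast_{<t}\right) \Big]
\]
```

Dividing by |a*| keeps the value comparable across answers of different lengths, which is what "per-token" refers to in the summary above.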

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of InfoMem, including recognition of the performance gains, the analyses on reward properties (successful trajectories, normalization, and answer conditioning), and the value of releasing code. The recommendation for minor revision is noted, and we will incorporate any minor adjustments in the revised version.

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines InfoMem as an external reward signal computed from the increase in the model's per-token log-likelihood of a fixed ground-truth answer when the final memory is provided. This quantity is independent of the RL training loop itself, is restricted to successful trajectories, and is normalized before composition with other rewards. No derivation step reduces a claimed prediction or result to a fitted parameter, self-citation, or input by construction; the central performance claim is an empirical comparison under a fixed GRPO budget against lexical and sparse-reward baselines. The design is self-contained against external benchmarks and contains no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim relies on the domain assumption that the proposed reward is effective. No free parameters are explicitly mentioned in the abstract; the method itself is the invented entity.

axioms (1)

domain assumption: GRPO is an appropriate RL algorithm for training these agents
The paper uses it as the training framework without questioning its validity.
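For context on what this axiom takes for granted, GRPO scores each rollout against the other rollouts sampled for the same prompt rather than against a learned value baseline. The sketch below assumes the commonly used mean and standard deviation form; the function name, the use of the population standard deviation, and the reward values are illustrative choices, not details from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four rollouts of the same prompt.
advantages = group_relative_advantages([1.5, 0.0, 0.5, 0.0])
```

Because the advantage depends only on differences within a group, any extra reward term changes training only to the extent that it separates rollouts of the same prompt.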

invented entities (1)

InfoMem reward signal (no independent evidence)
purpose: To provide a utility measure for final memory based on answer likelihood increase
Newly introduced concept without independent validation outside the paper.
