Composing Policy Gradients and Prompt Optimization for Language Model Programs
Abstract
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, with 5% gains against prompt optimization on its own. We open-source multi-module GRPO in the DSPy library at https://dspy.ai.
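To make the "groups from module-level invocations" idea concrete, here is a minimal, illustrative sketch, not the paper's released DSPy implementation. The `Rollout` container and `module_level_advantages` helper are hypothetical names introduced for illustration; the sketch assumes that each rollout of the program (for a given input) records its per-module (prompt, completion) invocations and receives a single scalar reward from the task metric, and that invocations of the same module across rollouts form one GRPO group.

```python
"""Illustrative sketch of module-level grouping for GRPO advantages in a
multi-module LM program. All names here are hypothetical, not DSPy APIs."""

from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Rollout:
    reward: float                   # program-level reward for this rollout
    # module name -> list of (prompt, completion) invocations in this rollout
    invocations: dict = field(default_factory=dict)


def module_level_advantages(rollouts, eps=1e-6):
    """Group invocations of the same module across rollouts of one input and
    normalize the rollout-level reward within each group (GRPO-style)."""
    groups = defaultdict(list)      # module -> [(prompt, completion, reward)]
    for rollout in rollouts:
        for module, calls in rollout.invocations.items():
            for prompt, completion in calls:
                groups[module].append((prompt, completion, rollout.reward))

    advantages = defaultdict(list)  # module -> [(prompt, completion, advantage)]
    for module, items in groups.items():
        rewards = [r for _, _, r in items]
        mu, sigma = mean(rewards), pstdev(rewards)
        for prompt, completion, r in items:
            advantages[module].append((prompt, completion, (r - mu) / (sigma + eps)))
    return advantages


# Tiny usage example: two rollouts of a two-module program on the same input.
rollouts = [
    Rollout(reward=1.0, invocations={"generate_query": [("q-prompt", "query A")],
                                     "answer": [("a-prompt", "answer A")]}),
    Rollout(reward=0.0, invocations={"generate_query": [("q-prompt", "query B")],
                                     "answer": [("a-prompt", "answer B")]}),
]
per_module = module_level_advantages(rollouts)
# Each module's (prompt, completion, advantage) triples can then feed a
# standard GRPO policy-gradient update for the shared underlying LM.
```

Under this view, trajectory-level grouping is the alternative instantiation in which whole program rollouts, rather than per-module invocations, form the groups over which rewards are normalized.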
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
discussion (0)