Composing Policy Gradients and Prompt Optimization for Language Model Programs
Abstract
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, with 5% gains against prompt optimization on its own. We open-source multi-module GRPO in the DSPy library at https://dspy.ai.
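To make the "groups from module-level invocations" idea concrete, here is a minimal, illustrative sketch, not the paper's released DSPy implementation. The `Rollout` container and `module_level_advantages` helper are hypothetical names introduced for illustration; the sketch assumes that each rollout of the program (for a given input) records its per-module (prompt, completion) invocations and receives a single scalar reward from the task metric, and that invocations of the same module across rollouts form one GRPO group.

```python
"""Illustrative sketch of module-level grouping for GRPO advantages in a
multi-module LM program. All names here are hypothetical, not DSPy APIs."""

from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Rollout:
    reward: float                   # program-level reward for this rollout
    # module name -> list of (prompt, completion) invocations in this rollout
    invocations: dict = field(default_factory=dict)


def module_level_advantages(rollouts, eps=1e-6):
    """Group invocations of the same module across rollouts of one input and
    normalize the rollout-level reward within each group (GRPO-style)."""
    groups = defaultdict(list)      # module -> [(prompt, completion, reward)]
    for rollout in rollouts:
        for module, calls in rollout.invocations.items():
            for prompt, completion in calls:
                groups[module].append((prompt, completion, rollout.reward))

    advantages = defaultdict(list)  # module -> [(prompt, completion, advantage)]
    for module, items in groups.items():
        rewards = [r for _, _, r in items]
        mu, sigma = mean(rewards), pstdev(rewards)
        for prompt, completion, r in items:
            advantages[module].append((prompt, completion, (r - mu) / (sigma + eps)))
    return advantages


# Tiny usage example: two rollouts of a two-module program on the same input.
rollouts = [
    Rollout(reward=1.0, invocations={"generate_query": [("q-prompt", "query A")],
                                     "answer": [("a-prompt", "answer A")]}),
    Rollout(reward=0.0, invocations={"generate_query": [("q-prompt", "query B")],
                                     "answer": [("a-prompt", "answer B")]}),
]
per_module = module_level_advantages(rollouts)
# Each module's (prompt, completion, advantage) triples can then feed a
# standard GRPO policy-gradient update for the shared underlying LM.
```

Under this view, trajectory-level grouping is the alternative instantiation in which whole program rollouts, rather than per-module invocations, form the groups over which rewards are normalized.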
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
discussion (0)