Pith · machine review for the scientific record

arxiv: 2605.06474 · v2 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML

Recognition: no theorem link

Q-MMR: Off-Policy Evaluation via Recursive Reweighting and Moment Matching


Pith reviewed 2026-05-11 01:53 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords off-policy evaluation · moment matching · reweighting · finite-sample bounds · offline reinforcement learning · general function approximation · realizability

The pith

Q-MMR learns data-point weights inductively so that reweighted returns match the target policy value under only Q-function realizability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Q-MMR as a new method for off-policy evaluation in finite-horizon MDPs. It assigns a scalar weight to each observed data point so that the weighted sum of rewards approximates the expected return of a target policy. The weights are constructed recursively from the end of the horizon by solving a moment-matching problem against a class of value-function discriminators. This construction delivers a data-dependent finite-sample error bound whose size does not grow with the statistical complexity of the function class, provided only that the target Q-function is realizable. The same framework recovers importance sampling and linear fitted Q-evaluation as special cases and clarifies what coverage must hold for reliable off-policy estimates.

Core claim

Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of Q^π, with a dimension-free bound.

What carries the argument

The top-down recursive reweighting procedure that solves a moment-matching objective against the value-function discriminator class at each time step to produce data-point weights.
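The pith does not reproduce the paper's algorithm, but the shape of such a procedure can be sketched. The toy recursion below uses a linear discriminator class, propagates the target policy's feature moment through the horizon, and reweights each step's logged data to match it. All names, shapes, the minimum-norm solver, and the forward direction of the recursion are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged data: n transitions per step over horizon H, d-dim features.
n, H, d = 200, 3, 4
phi = rng.normal(size=(H, n, d))       # phi(s_t^i, a_t^i) for logged actions
phi_next = rng.normal(size=(H, n, d))  # phi(s_{t+1}^i, pi(s_{t+1}^i))
rewards = rng.normal(size=(H, n))
mu0 = rng.normal(size=d)               # assumed-known E[phi(s_0, pi(s_0))]

def match_moments(Phi, mu):
    """Minimum-norm weights w solving Phi.T @ w = mu (linear moment matching)."""
    return Phi @ np.linalg.pinv(Phi.T @ Phi) @ mu

value, mu = 0.0, mu0
for t in range(H):
    w = match_moments(phi[t], mu)  # reweight step-t data to the target moment
    value += w @ rewards[t]        # reweighted reward contribution of step t
    mu = w @ phi_next[t]           # propagate the target moment to step t+1

print(value)  # a Q-MMR-style estimate of the target policy's return
```

With a richer, non-linear discriminator class the per-step matching becomes a minimax problem rather than a linear solve, which is exactly where the referee's questions below about the structure of F bite.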

If this is right

  • The error bound holds for any function class whenever the target Q-function lies in that class.
  • Importance sampling and linear fitted Q-evaluation arise as special cases of the same weighting procedure.
  • Coverage requirements in offline RL can be characterized directly through the realizability condition rather than through explicit density ratios.
  • The method applies to general function approximation without incurring statistical penalties that scale with class complexity.
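The linear-FQE special case admits a quick numerical sanity check: in a one-step (H = 1) linear problem, moment-matching weights and ordinary least-squares evaluation give the same number. This verifies the algebra behind the claimed equivalence; it is an illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
Phi = rng.normal(size=(n, d))  # logged features phi(s_i, a_i)
r = rng.normal(size=n)         # logged rewards
mu = rng.normal(size=d)        # target feature mean E_pi[phi(s, a)]

# Route 1: one-step linear FQE, i.e. least-squares fit, then evaluate at mu.
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ r)
v_fqe = mu @ theta

# Route 2: moment-matching weights, then the sum of reweighted rewards.
w = Phi @ np.linalg.solve(Phi.T @ Phi, mu)
v_mmr = w @ r

print(np.isclose(v_fqe, v_mmr))  # True: the two routes agree algebraically
```

The identity behind the check is w·r = μᵀ(ΦᵀΦ)⁻¹Φᵀr = μᵀθ̂, so the two estimators coincide up to floating-point error whenever the empirical covariance is invertible.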

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dimension-free guarantee may allow practitioners to use richer function approximators in offline settings without worsening sample complexity.
  • The recursive structure could be adapted to infinite-horizon or continuous-time problems by replacing the finite-horizon induction with a fixed-point equation.
  • If the moment-matching step can be solved efficiently, the approach offers a practical alternative to explicit importance-weight estimation in high-dimensional state spaces.

Load-bearing premise

The weights produced by moment matching at each step make the reweighted return equal the target policy value even when the discriminator class is arbitrary.
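A plausible formalization of this premise, with all notation assumed for illustration rather than copied from the paper: let $w^t \in \mathbb{R}^n$ be the step-$t$ weights, $\mathcal{F}_t$ the discriminator class, and $\widehat{\mu}_t(f)$ the propagated estimate of $\mathbb{E}_{d^\pi_t}[f]$.

```latex
% Step-t moment-matching condition (illustrative notation):
\sum_{i=1}^{n} w_i^{t}\, f(s_i^{t}, a_i^{t}) \;=\; \widehat{\mu}_t(f)
\qquad \text{for all } f \in \mathcal{F}_t .
% Under realizability, Q^\pi_t \in \mathcal{F}_t, so instantiating f = Q^\pi_t
% controls each step's contribution, and the evaluation error telescopes:
\widehat{v} - v^{\pi} \;=\; \sum_{t=0}^{H-1} \varepsilon_t ,
\qquad
|\varepsilon_t| \;\le\; \big(\text{matching error at } f = Q^\pi_t\big).
```

If the premise fails for some admissible $\mathcal{F}_t$, the step-$t$ error is no longer controlled by realizability alone, which is exactly the failure mode a counterexample would need to exhibit.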

What would settle it

An MDP and function class in which Q^π is realizable yet the Q-MMR estimate deviates from the true value by more than the stated dimension-free bound on a finite dataset.

read the original abstract

We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Q-MMR, a recursive reweighting method for off-policy evaluation in finite-horizon MDPs. Scalar weights are learned top-down by solving a moment-matching problem at each time step against a value-function discriminator class F. Under the sole assumption that the target Q^π is realizable in F, the authors claim a data-dependent finite-sample bound on the reweighted return whose error term is dimension-free (independent of any measure of statistical complexity of F). Connections to importance sampling and linear fitted Q-evaluation are derived, together with new observations on the role of coverage.

Significance. If the dimension-free guarantee can be established rigorously for general (possibly infinite) discriminator classes, the result would be a meaningful theoretical advance. Most existing finite-sample bounds for off-policy evaluation or offline RL incur explicit dependence on covering numbers, Rademacher complexity, or other measures of F; a bound free of such terms would simplify analysis and potentially broaden the set of function classes for which reliable guarantees exist. The recursive construction and its explicit links to classical estimators are also of independent interest.

major comments (3)
  1. [§4.2] Inductive step of the main theorem: the argument that moment matching against F controls the reweighted conditional expectation of the next-step value function by the realizability error alone must be made fully explicit. For arbitrary (non-convex, unbounded) F, the optimization problem may admit solutions whose reweighting error is not controlled solely by realizability, which would introduce hidden dependence on the complexity of F and undermine the dimension-free claim.
  2. [§3.2] Definition of the discriminator class F: the precise regularity conditions required on F (convexity, boundedness, closure under the Bellman operator, etc.) are not stated clearly enough to verify that the inductive step closes without complexity terms. The abstract asserts generality, yet the proof appears to rely on additional structure that should be listed explicitly.
  3. [§4.3] Main bound: the data-dependent finite-sample guarantee is stated without an explicit remainder term that could arise from the moment-matching optimization; please derive or bound any such term to confirm that it does not re-introduce dependence on the statistical complexity of F.
minor comments (3)
  1. [§3.1] Notation: the recursive definition of the weights w_t should be written as an explicit optimization problem with the exact moment-matching loss, rather than described only in prose.
  2. [Figure 1] Figure 1 (recursive construction diagram): add explicit time-step labels and indicate which value functions are being matched at each layer to improve readability.
  3. [§5] References: the discussion of coverage would benefit from citing recent works on concentrability coefficients in offline RL (e.g., the 2022–2024 literature on single-policy concentrability).
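For concreteness, a standard minimax form that such a weight-learning step could take (a guess at the shape of the objective, not the manuscript's equation):

```latex
w^{t} \;\in\; \operatorname*{arg\,min}_{w \in \mathbb{R}^{n}}
\;\sup_{f \in \mathcal{F}_t}
\Big|\, \sum_{i=1}^{n} w_i\, f(s_i^{t}, a_i^{t}) \;-\; \widehat{\mu}_t(f) \Big| ,
```

where $\widehat{\mu}_t(f)$ denotes the target moment propagated from the previous step of the recursion.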

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. The comments raise important points about the explicitness of the proof and the assumptions on the discriminator class F. We address each major comment below and will incorporate clarifications and expansions in the revised version to strengthen the presentation.

read point-by-point responses
  1. Referee: [§4.2] Inductive step of the main theorem: the argument that moment matching against F controls the reweighted conditional expectation of the next-step value function by the realizability error alone must be made fully explicit. For arbitrary (non-convex, unbounded) F, the optimization problem may admit solutions whose reweighting error is not controlled solely by realizability, which would introduce hidden dependence on the complexity of F and undermine the dimension-free claim.

    Authors: We appreciate the referee's suggestion to make the inductive step more explicit. In the revised manuscript, we will expand §4.2 with a detailed derivation showing that the moment-matching condition directly implies that the difference between the reweighted conditional expectation and the true one is bounded solely by the realizability error of Q^π in F. This follows because the weights are chosen to match all moments in F, and since Q^π is realizable, the error for the value function is controlled without additional terms. Regarding arbitrary F, we acknowledge that for non-convex or unbounded classes the optimization landscape may be complex; however, the manuscript implicitly assumes F allows for a solution where the matching controls the error (e.g., via convex relaxation or specific solvers). We will explicitly state the required conditions on F to ensure this, and note that the dimension-free property holds as the proof does not rely on uniform convergence or complexity measures of F but on the per-instance matching. revision: yes

  2. Referee: [§3.2] Definition of the discriminator class F: the precise regularity conditions required on F (convexity, boundedness, closure under the Bellman operator, etc.) are not stated clearly enough to verify that the inductive step closes without complexity terms. The abstract asserts generality, yet the proof appears to rely on additional structure that should be listed explicitly.

    Authors: We agree that the conditions on F need to be stated more precisely. In the revision, we will update §3.2 to clearly specify that F is a convex and bounded class of functions with Q^π ∈ F (realizability), and that no closure under the Bellman operator is required since the recursive reweighting handles the value propagation. The inductive step closes without complexity terms because the moment matching enforces equality for functions in F at each step, and the bound propagates the realizability error recursively without invoking statistical complexity. This maintains the generality claimed in the abstract under these mild regularity conditions. revision: yes

  3. Referee: [§4.3] Main bound: the data-dependent finite-sample guarantee is stated without an explicit remainder term that could arise from the moment-matching optimization; please derive or bound any such term to confirm that it does not re-introduce dependence on the statistical complexity of F.

    Authors: The current presentation in §4.3 assumes exact solution of the moment-matching optimization, leading to no remainder term from it. We will revise the main bound to explicitly include a potential optimization error term ε_opt and derive that the total error is bounded by the data-dependent term plus ε_opt times a bound on the value range. We show that ε_opt can be controlled independently of the complexity of F by the convexity of the optimization problem or by the specific moment-matching formulation, thus not re-introducing dependence on covering numbers or Rademacher complexity. This will confirm the dimension-free nature of the overall bound. revision: partial
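The decomposition the authors describe could take a form like the following, where the optimization-error symbol $\varepsilon_{\mathrm{opt}}$, the value range $V_{\max}$, and the data-dependent term $\Delta_n(D)$ are assumed names, not the manuscript's notation:

```latex
|\widehat{v} - v^{\pi}|
\;\le\; \underbrace{\Delta_n(D)}_{\text{data-dependent term under exact matching}}
\;+\; H\, \varepsilon_{\mathrm{opt}}\, V_{\max} .
```

The dimension-free claim survives exactly when $\varepsilon_{\mathrm{opt}}$ can be driven down by computation alone, without uniform-convergence arguments over F.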

Circularity Check

0 steps flagged

Moment-matching objective yields dimension-free bound under realizability without reducing target to fitted input

full rationale

The derivation proceeds via an explicit top-down inductive construction in which weights are obtained by solving a moment-matching problem against the discriminator class F at each step; the finite-sample guarantee is then proved directly from this construction plus the realizability assumption on Q^π. No equation equates the target policy value to a quantity defined by the same moment-matching objective, nor does any load-bearing step rely on a self-citation whose content is itself unverified. Connections to importance sampling and linear FQE are presented as corollaries rather than as the sole justification for the main result. The bound's dimension-free character follows from the inductive control of reweighting error by realizability alone, which is an independent mathematical claim rather than a renaming or self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central guarantee rests on the domain assumption of Q^π realizability and the validity of the inductive moment-matching construction in finite-horizon MDPs; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Realizability of Q^π (the true action-value function of the target policy lies in the chosen function class)
    The finite-sample guarantee is stated to hold under only this condition.

pith-pipeline@v0.9.0 · 5441 in / 1234 out tokens · 37401 ms · 2026-05-11T01:53:53.793649+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. Philip Amortila, Audrey Huang, Akshay Krishnamurthy, and Nan Jiang. A unifying view of coverage in linear off-policy evaluation. arXiv preprint arXiv:2601.19030.
  2. Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. On oracle-efficient PAC RL with rich observations. Advances in Neural Information Processing Systems, 2018:1422–1432.
  3. Daniel Hsu, Sham M. Kakade, and Tong Zhang. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363.
  4. Nan Jiang and Tengyang Xie. Offline reinforcement learning in large state spaces: Algorithms and guarantees. arXiv preprint arXiv:2510.04088.
  5. Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, and Nan Jiang. Model selection for off-policy evaluation: New algorithms and experimental protocol. arXiv preprint arXiv:2502.08021.
  6. Hamed Rahimian and Sanjay Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.
  7. Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. arXiv preprint arXiv:1206.4650.
