On Equivalence Between Decentralized Policy-Profile Mixtures and Behavioral Coordination Policies in Multi-Agent Systems

Nouman Khan; Vijay G. Subramanian

arxiv: 2504.12635 · v2 · pith:NIXFQXTWnew · submitted 2025-04-17 · 🧮 math.OC


Pith reviewed 2026-05-22 20:18 UTC · model grok-4.3

keywords: decentralized policies, occupation measures, multi-agent systems, behavioral coordination, policy mixtures, Lagrangian duality, partially observed systems, randomization

The pith

In partially observed multi-agent systems, independently randomized decentralized policy-profiles induce the same occupation measures as decentralized behavioral policy-profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that, for systems with a Borel hidden state, countable observations, and finite actions, independently randomized decentralized policy-profiles (pure or behavioral) induce the same occupation measures on joint-history and joint-action pairs as decentralized behavioral policy-profiles. It also shows that jointly randomized behavioral policy-profiles and jointly randomized pure policy-profiles induce the same occupation measures. When observations are finite, joint mixtures of decentralized policy-profiles are equivalent in occupation measures to common-information-based behavioral coordination policies. These equivalences generalize earlier results on pure policies and support further work on duality and learning in constrained team problems.

Core claim

Independently randomized decentralized policy-profiles, whether behavioral or pure, induce the same occupation measures on joint-history and joint-action pairs as decentralized behavioral policy-profiles. Jointly randomized behavioral and pure decentralized policy-profiles induce the same occupation measures. Restricting to finite observations, joint mixtures of decentralized policy-profiles and common information based behavioral coordination policies induce the same occupation measures.
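The first equivalence can be checked numerically on a toy case. The sketch below is an illustration only, not the paper's construction: it assumes a hypothetical one-step system with two agents, binary observations and actions, and a made-up joint observation distribution `P_OBS`. Each agent privately samples a pure decision rule from its own mixture; the behavioral rule is obtained by marginalizing that mixture, and the two induced distributions on (observations, actions) are compared.

```python
from itertools import product

# Hypothetical toy model (an assumption, not from the paper):
# two agents, one decision step, binary observations and actions.
ACT = (0, 1)
OBS = (0, 1)

# Made-up joint observation distribution P(o1, o2); observations are correlated.
P_OBS = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def behavioral_from_mixture(mix):
    """Marginalize a mixture over pure rules into a behavioral rule b[o][a].

    A pure rule is a tuple (action at obs 0, action at obs 1).
    """
    b = {o: {a: 0.0 for a in ACT} for o in OBS}
    for rule, weight in mix.items():
        for o in OBS:
            b[o][rule[o]] += weight
    return b


def occupation_independent_mixture(mix1, mix2):
    """P(o1, o2, a1, a2) when each agent privately samples a pure rule."""
    occ = {}
    for (o1, o2), p in P_OBS.items():
        for (r1, w1), (r2, w2) in product(mix1.items(), mix2.items()):
            key = (o1, o2, r1[o1], r2[o2])
            occ[key] = occ.get(key, 0.0) + p * w1 * w2
    return occ


def occupation_behavioral(b1, b2):
    """P(o1, o2, a1, a2) when each agent randomizes its action given its observation."""
    occ = {}
    for (o1, o2), p in P_OBS.items():
        for a1, a2 in product(ACT, ACT):
            q = p * b1[o1][a1] * b2[o2][a2]
            if q > 0:
                occ[(o1, o2, a1, a2)] = q
    return occ


def max_gap(occ_a, occ_b):
    """Largest pointwise difference between two occupation measures."""
    keys = set(occ_a) | set(occ_b)
    return max(abs(occ_a.get(k, 0.0) - occ_b.get(k, 0.0)) for k in keys)


# Arbitrary independent mixtures over pure rules, one per agent.
mix1 = {(0, 0): 0.5, (0, 1): 0.3, (1, 1): 0.2}
mix2 = {(1, 0): 0.6, (0, 1): 0.4}

occ_mix = occupation_independent_mixture(mix1, mix2)
occ_beh = occupation_behavioral(behavioral_from_mixture(mix1),
                                behavioral_from_mixture(mix2))
gap = max_gap(occ_mix, occ_beh)
print("max pointwise gap:", gap)
```

In this one-step case the match follows from a simple factorization: summing an agent's mixture weights over the pure rules that pick a given action at a given observation yields exactly the behavioral probability. The multi-step, partially observed setting treated in the paper needs more care than this sketch shows.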

What carries the argument

Occupation measures on joint-history and joint-action pairs. Two policy classes are treated as equivalent when they induce the same such measures, so every comparison between mixtures and randomization schemes is made at this level.
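For orientation, one common way to write such an object is sketched below. The notation is an assumption made for illustration: $H_t$ (joint history at time $t$), $A_t$ (joint action at time $t$), and the per-time-step form may differ from the paper's own definition, which could, for example, aggregate over time. With countable observations and finite actions, the set of joint histories at any finite time is countable, so the measure is determined by its point masses.

```latex
% Hypothetical notation, not taken from the paper.
\mu_t^{\pi}(h, a) \;=\; \mathbb{P}^{\pi}\bigl(H_t = h,\; A_t = a\bigr),
\qquad h \in \mathcal{H}_t,\ a \in \mathcal{A}.

% Two policies (or mixtures) \pi and \pi' are equivalent in this sense when
\mu_t^{\pi}(h, a) \;=\; \mu_t^{\pi'}(h, a)
\quad \text{for all } t,\ h \in \mathcal{H}_t,\ a \in \mathcal{A}.
```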

If this is right

Lagrangian duality results for constrained decentralized team problems can be strengthened using the equivalences.
The minimum number of randomizations required in an optimal behavioral coordination policy can be characterized.
Learning algorithms can target mixtures of decentralized policy-profiles to approximate optimal solutions.
The known equivalence between pure decentralized policy-profiles and pure coordination policies extends to randomized and behavioral cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Computation of optimal policies might be simplified by restricting search to mixtures of pure policies when observations are finite.
The results could extend to approximate equivalence under weaker measurability conditions if continuity arguments are added.
Similar measure equivalences might hold in infinite-horizon discounted settings if the occupation measures are replaced by discounted versions.

Load-bearing premise

The hidden state is Borel, observations are countable (finite for the coordination part), and actions are finite so that the induced measures on joint histories and actions can be equated.

What would settle it

Construct a counterexample with uncountable observations where the occupation measure from an independently randomized pure decentralized policy-profile differs from that of a decentralized behavioral policy-profile.

Original abstract

Constrained decentralized team problem formulations are good models for many cooperative multi-agent systems. Constraints necessitate randomization when solving for optimal solutions -- past results show that joint randomization in the team is in general necessary for (strong) Lagrangian duality to hold --, but a better understanding of randomization still remains. For a partially observed multi-agent system with a Borel hidden state, countable observations, and finite actions, we prove the following: (i) independently randomized decentralized policy-profiles -- whether behavioral or pure -- induce the same occupation measures (on joint-history and joint-action pairs) as decentralized behavioral policy-profiles; and (ii) jointly randomized behavioral and pure decentralized policy-profiles induce the same occupation measures. Restricting to finite observations, we also prove that joint mixtures of decentralized policy-profiles (both pure and behavioral) and common information based behavioral coordination policies (also mixtures of them) induce the same occupation measures. This generalizes past work that shows equivalence between pure decentralized policy-profiles and pure coordination policies. These results can be used to develop further results on Lagrangian duality, the minimum number of randomizations needed in an optimal behavioral coordination policy, and learning based schemes that can find approximately optimal solutions.
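The gap between joint and independent randomization mentioned in the abstract can be seen in a very small example. The sketch below is not the paper's argument or any cited counterexample; it assumes a hypothetical two-agent setting with no observations, so that a pure policy-profile is just a pair of actions. A joint mixture that puts half its weight on (0, 0) and half on (1, 1) induces a correlated action distribution, which no pair of independent randomizations can reproduce.

```python
from itertools import product

# Hypothetical setting (an assumption): two agents, binary actions, no observations.
ACT = (0, 1)


def marginals(joint):
    """Per-agent action marginals of a joint action distribution."""
    m1 = {a: sum(p for (a1, _), p in joint.items() if a1 == a) for a in ACT}
    m2 = {a: sum(p for (_, a2), p in joint.items() if a2 == a) for a in ACT}
    return m1, m2


def is_product(joint, tol=1e-12):
    """True if the joint distribution equals the product of its marginals,
    i.e. if it can be produced by independent randomization."""
    m1, m2 = marginals(joint)
    return all(
        abs(joint.get((a1, a2), 0.0) - m1[a1] * m2[a2]) <= tol
        for a1, a2 in product(ACT, ACT)
    )


# Jointly randomized pure profile: both agents play 0, or both play 1.
joint_mix = {(0, 0): 0.5, (1, 1): 0.5}

# Independently randomized profile with the same per-agent marginals.
indep = {(a1, a2): 0.25 for a1, a2 in product(ACT, ACT)}

print("joint mixture is a product distribution:", is_product(joint_mix))
print("independent mixture is a product distribution:", is_product(indep))
```

Both distributions have uniform per-agent marginals, yet only the second factorizes, which is why shared randomness enlarges the set of achievable occupation measures in this toy case.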

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper extends prior equivalence results on pure decentralized policies to randomized and behavioral policy-profiles, and to the equivalence between their mixtures and coordination policies, under standard Borel and cardinality conditions.

The main point is that under a Borel hidden state, countable observations, and finite actions, independently randomized decentralized policy profiles (pure or behavioral) induce the same occupation measures as decentralized behavioral profiles, and the jointly randomized pure and behavioral versions match each other as well. When observations are finite, joint mixtures of those profiles induce the same occupation measures as common-information behavioral coordination policies. This generalizes earlier pure-policy equivalences and is positioned to help with Lagrangian duality and learning in constrained team problems.

Referee Report

1 major / 2 minor

Summary. The paper proves equivalence of occupation measures on joint histories and actions for a partially observed multi-agent system with Borel hidden state, countable observations, and finite actions. Specifically, (i) independently randomized decentralized policy profiles (behavioral or pure) induce the same measures as decentralized behavioral policy profiles; (ii) jointly randomized behavioral and pure decentralized profiles induce identical measures; and, when observations are finite, (iii) joint mixtures of decentralized profiles with common-information behavioral coordination policies (and their mixtures) also coincide. The results generalize prior equivalences for pure policies and rest on existence of regular conditional distributions and measurable selections.

Significance. If the equivalences hold, they clarify when and how randomization can be localized or shared without changing achievable occupation measures, directly supporting further work on Lagrangian duality for constrained team problems, bounds on the number of randomizations required in optimal coordination policies, and learning algorithms that optimize over simpler policy classes. The explicit invocation of standard Borel/countable/finite conditions to guarantee disintegration is a methodological strength.

major comments (1)

[§4, Theorem 2] The proof that joint mixtures of decentralized profiles and coordination policies induce identical measures relies on the finite-observation assumption to construct a common-information sigma-algebra; it is unclear whether the same construction extends verbatim when observations are only countable, which would affect the scope of the claimed generalization.

minor comments (2)

[§2] Notation for occupation measures (e.g., μ_π) is introduced in §2 but used without explicit reminder in the statements of Theorems 1 and 3; adding a one-sentence cross-reference would improve readability.
[Introduction and Theorem 3] The abstract claims the results apply to 'countable observations' while the coordination-policy equivalence requires 'finite observations'; this distinction should be stated explicitly in the introduction and in the theorem statements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading, positive summary, and recommendation of minor revision. We address the single major comment below.

Referee: [§4, Theorem 2] The proof that joint mixtures of decentralized profiles and coordination policies induce identical measures relies on the finite-observation assumption to construct a common-information sigma-algebra; it is unclear whether the same construction extends verbatim when observations are only countable, which would affect the scope of the claimed generalization.

Authors: We agree that the proof of the equivalence in Theorem 2 (joint mixtures of decentralized profiles and common-information behavioral coordination policies) relies on the finite-observation assumption to ensure the common-information sigma-algebra is well-defined and to apply the required measurable selection arguments. The manuscript already scopes this result explicitly to the finite-observation case (see abstract and the statement of Theorem 2), while the earlier equivalences (i) and (ii) hold for countable observations. We do not claim, and the construction does not extend verbatim to, the countable-observation setting for the coordination-policy mixtures; the common information would then be generated by a countable product of observation spaces, which introduces technical obstacles to the disintegration and selection steps used in the proof. No change to the manuscript is required, as the stated scope matches the proof.

revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from system model

The paper proves measure-theoretic equivalences of occupation measures induced by independent vs. joint mixtures of decentralized policy profiles (pure and behavioral) and common-information coordination policies, under Borel hidden states, countable/finite observations, and finite actions. These equivalences are derived directly via disintegration of measures on joint histories, existence of regular conditional distributions, and measurable selection arguments applied to the underlying partially observed stochastic process. No step reduces a claimed result to a fitted parameter, a self-referential definition, or a load-bearing self-citation whose validity depends on the present work; the generalization of prior pure-policy results is an extension rather than a foundational premise. The central claims therefore remain independent of the inputs they organize.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is a pure equivalence proof in stochastic control; it invokes standard measure-theoretic background but introduces no fitted parameters or new postulated entities.

axioms (2)

domain assumption: Hidden state space is Borel; observations are countable; actions are finite.
Invoked to guarantee that occupation measures on joint histories and actions are well-defined and comparable across policy classes.
standard math: Standard results from stochastic control and measure theory on occupation measures.
Used without proof to equate measures induced by different randomization schemes.

pith-pipeline@v0.9.0 · 5747 in / 1331 out tokens · 54389 ms · 2026-05-22T20:18:10.553847+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

We prove the equivalence between joint mixtures of decentralized policy-profiles (both pure and behavioral) and common-information based behavioral coordination policies

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] H. S. Witsenhausen, “On the structure of real-time source coders,” Bell System Technical Journal, vol. 58, no. 6, pp. 1437–1451, 1979.

[2] J. C. Walrand and P. Varaiya, “Optimal causal coding-decoding problems,” IEEE Trans. Inf. Theory, vol. 29, no. 6, pp. 814–819, 1983.

[3] H. S. Witsenhausen, “A standard form for sequential stochastic control,” Mathematical Systems Theory, vol. 7, no. 1, pp. 5–11, 1973.

[4] A. Nayyar, A. Mahajan, and D. Teneketzis, “Optimal control strategies in delayed sharing information structures,” IEEE Transactions on Automatic Control, vol. 56, no. 7, pp. 1606–1620, 2011.

[5] A. Nayyar, A. Mahajan, and D. Teneketzis, “Decentralized stochastic control with partial history sharing: A common information approach,” IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644–1658, 2013.

[6] N. Khan and V. Subramanian, “A strong duality result for cooperative decentralized constrained POMDPs,” in 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 5410–5415, 2023.

[7] P. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Classics in Applied Mathematics, SIAM, Society for Industrial and Applied Mathematics, 2015.

[8] E. Altman, Constrained Markov Decision Processes. Chapman and Hall, 1999.

[9] V. S. Borkar, “A convex analytic approach to Markov decision processes,” Probability Theory and Related Fields, vol. 78, pp. 583–602, Aug 1988.

[10] V. Anantharam and V. Borkar, “Common randomness and distributed control: A counterexample,” Systems & Control Letters, vol. 56, no. 7-8, pp. 568–572, 2007.

[11] D. Tang, H. Tavafoghi, V. Subramanian, A. Nayyar, and D. Teneketzis, “Dynamic games among teams with delayed intra-team information sharing,” Dynamic Games and Applications, vol. 13, pp. 353–411, Mar 2023.

[12] Z. Chen, Y. Zhou, and H. Huang, “On the hardness of constrained cooperative multi-agent reinforcement learning.” Poster presented at the International Conference on Learning Representations (ICLR), 2024.

[13] H. W. Kuhn, 11. Extensive Games and the Problem of Information, pp. 193–216. Princeton: Princeton University Press, 1953.

[14] P. Bernhard, “Information and strategies in dynamic games,” SIAM Journal on Control and Optimization, vol. 30, pp. 212–228, 1992.

[15] R. J. Aumann, Mixed and behavior strategies in infinite extensive games. Princeton University Princeton, 1961.

[16] A. Nayyar, A. Mahajan, and D. Teneketzis, “Decentralized stochastic control with partial history sharing: A common information approach,” IEEE Trans. Automatic Control, vol. 58, no. 7, pp. 1644–1658, 2013.

[17] A. Nayyar, A. Mahajan, and D. Teneketzis, The Common-Information Approach to Decentralized Stochastic Control, pp. 123–156. Cham: Springer International Publishing, 2014.
