arXiv: 2604.22982 · v1 · submitted 2026-04-24 · econ.EM


Stacked Triple Differences

Meng Hsuan Hsieh

Pith reviewed 2026-05-08 08:53 UTC · model grok-4.3

classification: econ.EM

keywords: stacked triple differences, triple differences, staggered adoption, difference-in-differences, fixed effects regression, causal inference, event study


The pith

A linear regression on stacked four-cell triple-difference data identifies a cell-size-weighted average of treatment effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops stacked triple differences to handle staggered treatment adoption without the forbidden comparisons that arise in standard three-way fixed effects implementations. Each stack consists of treated and clean comparison cohorts, each crossed with treatment-eligible and ineligible units, over an event window; these self-contained stacks are appended into one dataset. A fully saturated fixed-effects regression on the stacked data recovers, at each post-treatment event time, a strictly positive weighted average of the stack-specific conditional average treatment effects, where the weights are the relative cell sizes. A sympathetic reader cares because the approach keeps estimation transparent and regression-based while allowing explicit control over how effects are aggregated across groups and relying only on pairwise parallel trends within stacks rather than global assumptions.
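The stack construction described above can be sketched in a few lines. This is a minimal illustration and not the author's code: the function name `build_stacks`, the field names, and the rule that a comparison cohort counts as clean when it adopts strictly after the end of the event window (or never) are all assumptions made for the example.

```python
import math

def build_stacks(units, periods, k_pre, k_post):
    """Append one self-contained stack per treated cohort (hypothetical helper).

    units: list of dicts with 'id', 'adopt' (adoption period, math.inf if never
    treated) and 'eligible' (1 or 0). periods: observed calendar periods.
    """
    cohorts = sorted({u["adopt"] for u in units if u["adopt"] != math.inf})
    rows = []
    for g in cohorts:
        window = [t for t in periods if g - k_pre <= t <= g + k_post]
        if len(window) < k_pre + k_post + 1:
            continue  # event window not fully observed for this cohort
        members = []
        for u in units:
            treated = u["adopt"] == g
            clean = u["adopt"] > g + k_post  # assumed rule: untreated throughout the window
            if treated or clean:
                members.append((u, int(treated)))
        cells = {(tr, u["eligible"]) for u, tr in members}
        if len(cells) < 4:
            continue  # need all four cells: treated/comparison x eligible/ineligible
        for u, tr in members:
            for t in window:
                rows.append({"stack": g, "id": u["id"], "event_time": t - g,
                             "treated_cohort": tr, "eligible": u["eligible"]})
    return rows

# Toy panel: cohorts adopting in periods 3 and 5, plus never-treated units.
units = [
    {"id": "a", "adopt": 3, "eligible": 1}, {"id": "b", "adopt": 3, "eligible": 0},
    {"id": "c", "adopt": 5, "eligible": 1}, {"id": "d", "adopt": 5, "eligible": 0},
    {"id": "e", "adopt": math.inf, "eligible": 1}, {"id": "f", "adopt": math.inf, "eligible": 0},
]
stacked = build_stacks(units, periods=range(1, 9), k_pre=1, k_post=1)
```

In this toy panel the period-5 cohort appears twice: as a clean comparison in the period-3 stack and as the treated cohort in its own stack. The already-treated units `a` and `b` never enter the period-5 stack, which is how the sketch rules out forbidden comparisons.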

Core claim

By constructing self-contained stacks each containing four cells over an event window—treated and clean comparison cohorts crossed with treatment-eligible and ineligible units—and appending these stacks, a linear regression with fully saturated fixed effects applied to the pooled dataset identifies, at each post-treatment event time, a strictly positive cell-size-weighted average of the stack-level conditional average treatment effects, with stack weights proportional to stack-level cell sizes.
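In symbols, the claim can be written roughly as follows. The notation is assumed for illustration and may differ from the paper's: g indexes stacks (treated cohorts), e is a post-treatment event time, beta_e is the population regression coefficient at event time e, tau_g(e) is the stack-level conditional average treatment effect, and omega_g(e) is the regression-implied weight. The exact expression for the weights in terms of cell sizes is given in the paper and is not reproduced here.

```latex
\beta_e \;=\; \sum_{g \in \mathcal{G}} \omega_g(e)\, \tau_g(e),
\qquad \omega_g(e) > 0 \ \text{for all } g,
\qquad \sum_{g \in \mathcal{G}} \omega_g(e) = 1 .
```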

What carries the argument

The stacked dataset of self-contained four-cell groups (treated and clean comparison cohorts crossed with eligible and ineligible units) and the linear regression with fully saturated fixed effects applied to it.

If this is right

Alternative weighting schemes applied to the same stacked data recover distinct, transparent causal estimands with clear interpretations.
The estimator complements GMM and imputation frameworks by trading efficiency for regression transparency and pairwise rather than global parallel trends.
Empirical illustrations show that stacked DDD produces substantially different quantitative conclusions than conventional three-way fixed effects procedures.
Researchers obtain direct control over aggregation weights when combining effects across stacks.
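To make the point about aggregation weights concrete, the sketch below combines the same stack-level estimates under two weighting schemes. All numbers are invented for illustration, and the two schemes (equal weights, and weights equal to the number of treated eligible units) are hypothetical examples rather than the paper's own list of alternatives.

```python
def aggregate(estimates, weights):
    """Weighted average of stack-level effects; weights must be positive."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be strictly positive")
    total = sum(weights[g] for g in estimates)
    return sum(estimates[g] * weights[g] / total for g in estimates)

# Invented stack-level effect estimates at one event time, keyed by cohort.
estimates = {2019: 2.0, 2021: -1.0}
treated_eligible_counts = {2019: 40, 2021: 5}

equal_weighted = aggregate(estimates, {g: 1 for g in estimates})
size_weighted = aggregate(estimates, treated_eligible_counts)
```

With these made-up inputs the equal-weighted summary is 0.5 and the size-weighted summary is about 1.67, so the choice of weights changes the estimand and the reported number while the stack-level estimates stay fixed.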

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Varying how clean comparison cohorts are defined when forming stacks could serve as a robustness check for sensitivity to the choice of counterfactual groups.
The method may extend naturally to other multi-dimensional difference designs that face similar staggered-adoption complications.
Explicit control over cell-size weights allows targeting averages that match specific policy populations rather than defaulting to the implicit weighting of the regression.

Load-bearing premise

Clean comparison cohorts satisfy pairwise parallel trends with treated cohorts within each stack, and treatment-ineligible units provide valid counterfactuals without anticipation effects or spillovers.
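One common way to formalize a pairwise parallel-trends condition for triple differences is shown below. This is a sketch under assumed notation, and the paper's own statement may differ (for instance in the reference period or in conditioning): Y_{i,t}(0) is the untreated potential outcome, S_i is the adoption cohort, Q_i is the eligibility indicator, g is the treated cohort, g_c is a clean comparison cohort in the same stack, and g-1 is taken as the reference period.

```latex
\begin{aligned}
&\mathbb{E}\big[Y_{i,g+e}(0) - Y_{i,g-1}(0) \mid S_i = g,\, Q_i = 1\big]
 - \mathbb{E}\big[Y_{i,g+e}(0) - Y_{i,g-1}(0) \mid S_i = g,\, Q_i = 0\big] \\
&\quad = \mathbb{E}\big[Y_{i,g+e}(0) - Y_{i,g-1}(0) \mid S_i = g_c,\, Q_i = 1\big]
 - \mathbb{E}\big[Y_{i,g+e}(0) - Y_{i,g-1}(0) \mid S_i = g_c,\, Q_i = 0\big].
\end{aligned}
```

The condition restricts only the eligible-minus-ineligible gap in untreated trends, and only within a given stack, which is what distinguishes it from a global assumption imposed across all cohorts at once.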

What would settle it

Monte Carlo simulations in which stack-specific treatment effects are known and cell sizes differ across stacks, followed by checking whether the regression coefficient exactly equals the cell-size-weighted average of those known effects.
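A noise-free version of this check can be run with exact arithmetic. The sketch below is a simplification and not the paper's design: each stack has only one pre period and one post period, fixed effects are at the cell level and interacted with the stack, outcomes contain no noise, and all names are hypothetical. In this simplified setting the Frisch-Waugh-Lovell weight of a stack works out to the inverse of the sum of reciprocal cell sizes; whether that coincides with the paper's weight formula is an assumption this summary cannot confirm.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # pivot row
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def cell_rows(stacks):
    """One row per (stack, cell, period): (regressors, outcome, n_units)."""
    k = 7 * len(stacks) + 1  # 7 stack-specific nuisance terms each, plus one common D
    rows = []
    for s, (sizes, tau) in enumerate(stacks):
        for (T, Q), n in sizes.items():
            for post in (0, 1):
                x = [Fraction(0)] * k
                x[7 * s:7 * s + 7] = [Fraction(v) for v in
                                      (1, T, Q, post, T * Q, T * post, Q * post)]
                x[-1] = Fraction(T * Q * post)  # treated x eligible x post
                # Arbitrary cell levels and trends, plus the known effect tau.
                y = Fraction(1 + 2 * T - 3 * Q + T * Q + post * (4 + T + 7 * Q))
                rows.append((x, y + tau * T * Q * post, n))
    return rows

def pooled_coefficient(stacks):
    """Coefficient on the common treatment dummy from the pooled OLS fit."""
    rows = cell_rows(stacks)
    k = len(rows[0][0])
    XtX = [[sum(n * x[i] * x[j] for x, _, n in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(n * x[i] * y for x, y, n in rows) for i in range(k)]
    return solve(XtX, Xty)[-1]

def implied_weights(stacks):
    """Normalized 1 / sum(1 / n_cell) per stack (two-period simplification)."""
    raw = [1 / sum(Fraction(1, n) for n in sizes.values()) for sizes, _ in stacks]
    return [r / sum(raw) for r in raw]

# Two stacks with unequal cell sizes, keyed by (treated cohort, eligible).
stacks = [
    ({(1, 1): 40, (1, 0): 10, (0, 1): 25, (0, 0): 25}, Fraction(2)),
    ({(1, 1): 5, (1, 0): 5, (0, 1): 60, (0, 0): 30}, Fraction(-1)),
]
tau_hat = pooled_coefficient(stacks)
weights = implied_weights(stacks)
target = sum(w * tau for w, (_, tau) in zip(weights, stacks))
```

Because the arithmetic is exact, `tau_hat == target` can be tested directly with no tolerance. A full Monte Carlo would add noise, unit fixed effects, and longer event windows, and would compare the average estimate against the paper's own weight formula.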

Original abstract

Triple differences (DDD) is a workhorse quasi-experimental design in applied economics. But, under staggered adoption, its conventional three-way fixed-effects (3WFE) implementation inherits the forbidden-comparison and interpretation issues now well understood in the difference-in-differences literature. To resolve these issues, I introduce stacked DDD. I extend the stacked difference-in-differences approach to the DDD setting by creating self-contained stacks, each consisting of four cells over an event window: treated and clean comparison cohorts, each with treatment-eligible and treatment-ineligible units. Appending these stacks yields a unified dataset for estimating treatment effects without making forbidden comparisons. I prove that, at each post-treatment event-time, a linear regression with fully saturated fixed-effects applied to the stacked dataset identifies a strictly positive, cell-size-weighted average of stack-level conditional average treatment effects, with stack weights proportional to stack-level cell sizes. Building on this characterization, I outline alternative weighting schemes that recover distinct, transparent causal estimands with clear interpretations. Stacked DDD complements recent GMM and imputation-based frameworks by trading efficiency for regression-based transparency, pairwise (rather than global) parallel trends, and direct control over aggregation weights. I provide two empirical illustrations where stacked DDD yields substantially different quantitative conclusions compared to existing procedures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Stacked DDD gives a regression-based way to handle staggered triple differences by recovering a cell-size weighted average of stack-level effects.


The main thing to know is that this paper builds self-contained four-cell stacks (treated and clean comparison cohorts, each split into eligible and ineligible units), then appends them so that a saturated fixed-effects regression identifies a strictly positive, cell-size-weighted average of the stack-level CATEs at each event time. The proof works by substituting the within-stack pairwise parallel trends assumption into the OLS normal equations, which avoids the forbidden comparisons that plague standard three-way fixed effects DDD under staggered adoption. That characterization, and the option to reweight for other transparent estimands, look new relative to the stacked DiD and DDD literature it cites. The two empirical illustrations are a plus because they show the method can produce different numbers from conventional approaches in real data. The assumptions are the usual ones: pairwise parallel trends within each stack and no anticipation or spillovers for the ineligible units. The result rests on those assumptions, and the paper does not overclaim. One minor limitation is that the efficiency cost relative to GMM or imputation methods is acknowledged but not quantified in the examples, so users will need to judge that trade-off themselves. Finding clean comparison cohorts for every treated group could also be constraining in some applications. The paper is for applied econometricians who run triple differences with staggered timing and want a transparent regression alternative. Readers working on policy evaluation with multiple periods and ineligible units will get practical value from the weight control and clear interpretation. It deserves a serious referee because the identification argument is direct and the contribution addresses a documented gap without internal contradictions.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces stacked triple differences (DDD) for settings with staggered treatment adoption. It constructs self-contained stacks, each consisting of a treated cohort and a clean comparison cohort, with each cohort split into treatment-eligible and treatment-ineligible units over an event window. Appending these stacks creates a unified dataset on which a linear regression with fully saturated fixed effects is shown to identify, at each post-treatment event time, a strictly positive cell-size-weighted average of the stack-level conditional average treatment effects (with weights proportional to stack-level cell sizes). The paper also outlines alternative weighting schemes to recover other transparent causal estimands and provides two empirical illustrations comparing results to conventional procedures.

Significance. If the identification result holds under the stated assumptions, this provides a regression-based, transparent alternative to GMM and imputation-based DDD estimators. The approach avoids forbidden comparisons by design, relies on pairwise (rather than global) parallel trends within stacks, and gives the researcher direct control over aggregation weights. The explicit characterization as a positive weighted average of CATEs enhances interpretability and complements recent work on staggered designs.

major comments (2)

[Identification section] The central claim is that the saturated fixed-effects regression identifies the cell-size-weighted average of stack-level CATEs by substituting the pairwise parallel-trends assumption into the OLS normal equations. The manuscript should expand the derivation with the explicit normal equations and verify that no negative weights or cross-stack forbidden comparisons arise, as this step is load-bearing for the identification result.
[Assumptions section] Assumptions and robustness: The result relies on clean comparison cohorts satisfying pairwise parallel trends with treated cohorts within each stack and on treatment-ineligible units providing valid counterfactuals without anticipation or spillovers. The paper should discuss whether violations of the no-spillover assumption across stacks would invalidate the no-forbidden-comparison property, since this directly affects the scope of the main theorem.

minor comments (3)

[Abstract and Introduction] The abstract and introduction could more explicitly preview the number of stacks and event-time horizons used in the two empirical illustrations.
[Empirical illustrations] A side-by-side table of point estimates, standard errors, and implied weights for stacked DDD versus conventional 3WFE and other methods would improve the clarity of the empirical comparisons.
[Setup section] Notation for 'clean comparison cohorts' and the event-window cells should be formalized with equations in the setup section to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have helped us clarify and strengthen the identification and robustness arguments in the paper. We address each major comment below.


Referee: [Identification section] The central claim is that the saturated fixed-effects regression identifies the cell-size-weighted average of stack-level CATEs by substituting the pairwise parallel-trends assumption into the OLS normal equations. The manuscript should expand the derivation with the explicit normal equations and verify that no negative weights or cross-stack forbidden comparisons arise, as this step is load-bearing for the identification result.

Authors: We agree that a more explicit derivation will improve transparency. In the revised manuscript, we will expand the identification section to include the full set of OLS normal equations for the saturated fixed-effects regression on the stacked data. We will then substitute the pairwise parallel-trends assumption directly into these equations and derive the cell-size-weighted average of the stack-level CATEs. We will also add a dedicated verification step showing that all weights are strictly positive (proportional to cell sizes) and that the stacked design precludes cross-stack forbidden comparisons by construction, since each stack relies exclusively on its own clean comparison cohort.
Revision: yes
Referee: [Assumptions section] Assumptions and robustness: The result relies on clean comparison cohorts satisfying pairwise parallel trends with treated cohorts within each stack and on treatment-ineligible units providing valid counterfactuals without anticipation or spillovers. The paper should discuss whether violations of the no-spillover assumption across stacks would invalidate the no-forbidden-comparison property, since this directly affects the scope of the main theorem.

Authors: We appreciate this point on the scope of the no-forbidden-comparison property. The main identification result assumes no spillovers across stacks (in addition to the within-stack assumptions). In the revised version, we will add a new subsection in the assumptions and robustness discussion that explicitly addresses potential cross-stack spillovers. We will note that while the no-forbidden-comparison property is preserved by design within each stack, cross-stack violations could indirectly affect counterfactual validity for treatment-ineligible units; we will contrast this with global approaches and clarify the conditions under which the main theorem continues to hold.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper's central result is an identification theorem obtained by substituting the stated pairwise parallel-trends assumption (within each self-constructed stack of treated and clean comparison cohorts) directly into the normal equations of a fully saturated fixed-effects OLS regression on the appended dataset. This algebraic substitution produces the claimed cell-size-weighted average of stack-level CATEs as the coefficient on the treatment indicator at each post-treatment event time. The stack construction, the four-cell structure, and the weighting (proportional to cell sizes) are defined explicitly in the manuscript prior to the proof and do not presuppose the target estimand. Prior stacked DiD literature is referenced only for methodological context; the DDD extension, the self-contained stack definition, and the specific weighted-average characterization are derived independently. No step reduces the claimed result to a fitted parameter by construction, a self-referential definition, or a load-bearing self-citation whose content is itself unverified. The result remains falsifiable under the maintained assumptions and does not rename an existing empirical pattern.
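The algebraic step described here has a familiar Frisch-Waugh-Lovell shape. As a sketch under assumed notation (not the paper's own equations), let D-tilde denote the residual of the event-time-e treatment indicator after projecting on the fixed effects, and suppose that "fully saturated" means every fixed effect is interacted with the stack, so that the projection is carried out stack by stack. The pooled coefficient then decomposes into a weighted average of stack-level coefficients, with weights that are normalized sums of squares and hence strictly positive whenever each stack has residual treatment variation:

```latex
\hat{\tau}_e
= \frac{\sum_{g} \sum_{i,t} \tilde{D}_{g,i,t}\, Y_{g,i,t}}
       {\sum_{g} \sum_{i,t} \tilde{D}_{g,i,t}^{2}}
= \sum_{g} \hat{\omega}_g(e)\, \hat{\tau}_g(e),
\qquad
\hat{\omega}_g(e) = \frac{\sum_{i,t} \tilde{D}_{g,i,t}^{2}}
                         {\sum_{g'} \sum_{i,t} \tilde{D}_{g',i,t}^{2}},
\qquad
\hat{\tau}_g(e) = \frac{\sum_{i,t} \tilde{D}_{g,i,t}\, Y_{g,i,t}}
                       {\sum_{i,t} \tilde{D}_{g,i,t}^{2}} .
```

The parallel-trends assumption then enters only through the interpretation of each stack-level coefficient as a treatment effect; the positivity of the weights is a purely mechanical property of the regression in this sketch.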

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The identification result rests on standard quasi-experimental assumptions for staggered designs; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Pairwise parallel trends hold within each stack between treated and clean comparison cohorts for eligible and ineligible units.
Required for the regression on stacked data to identify the weighted average of conditional treatment effects at each event time.
domain assumption: No anticipation effects and clean comparisons (no spillovers or contamination across stacks).
Ensures each stack remains self-contained and avoids the forbidden comparisons the method aims to eliminate.

