pith. sign in

arxiv: 2604.15631 · v1 · submitted 2026-04-17 · 💻 cs.CV

Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification

Pith reviewed 2026-05-10 08:26 UTC · model grok-4.3

classification 💻 cs.CV
keywords unsupervised learningvideo person re-identificationvisible-infrared re-idcausal interventioncross-modality alignmentpseudo-labelingmodality bias
0
0 comments X

The pith

Causal Bootstrapped Alignment learns reliable identity features from unlabeled video tracklets for visible-infrared re-identification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that extending image-based unsupervised visible-infrared re-identification methods to video settings fails because pretrained encoders produce weak identity discrimination and strong modality bias. This leads to confused intra-modality clusters and mismatched granularity between visible and infrared data, which in turn makes pseudo-labels unreliable for cross-modality alignment. CBA counters these problems by first applying sequence-level causal interventions that use temporal identity consistency and cross-modality identity consistency to remove spurious modality and motion correlations. It then applies a prototype-guided refinement step that reorganizes under-clustered infrared features around reliable visible prototypes with uncertainty-aware supervision. The result is cleaner representations that support effective unsupervised clustering and alignment on video benchmarks.

Core claim

The Causal Bootstrapped Alignment framework exploits inherent video priors through two stages. Causal Intervention Warm-up performs sequence-level causal interventions that leverage temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics. Prototype-Guided Uncertainty Refinement then follows a coarse-to-fine strategy that resolves cross-modality granularity mismatch by reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Together these steps yield cleaner representations that improve the

What carries the argument

Causal Bootstrapped Alignment framework, whose core mechanisms are sequence-level causal interventions that clean representations using video consistency priors and prototype-guided uncertainty refinement that aligns modalities despite initial clustering imbalance.

If this is right

  • Unsupervised video-based visible-infrared re-identification becomes feasible without cross-modality annotations.
  • Pseudo-label reliability rises because intra-modality identity confusion and granularity imbalance are reduced.
  • Cross-modality alignment succeeds even when initial clusters are imbalanced between visible and infrared modalities.
  • All-day surveillance systems can scale to new camera networks without incurring expensive labeling costs.
  • The same sequence-level priors can be reused for other video multi-modal unsupervised tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same causal-intervention step could be tested on single-modality video re-identification to isolate how much the cross-modality consistency term contributes.
  • If the visible prototypes prove more reliable than infrared ones, the framework may naturally favor one modality as the anchor in future extensions.
  • Applying the uncertainty-aware supervision to other granularity-mismatch problems, such as day-night object detection, would test whether the refinement stage generalizes beyond person re-identification.

Load-bearing premise

That temporal identity consistency and cross-modality identity consistency in unlabeled video tracklets are strong enough to guide causal interventions that reliably separate identity semantics from modality and motion biases.

What would settle it

Running CBA on the HITSZ-VCM benchmark and finding no meaningful gain in re-identification accuracy or reduction in visible-infrared clustering imbalance compared with directly extended image-based unsupervised baselines would show the interventions failed to remove the spurious correlations.

Figures

Figures reproduced from arXiv: 2604.15631 by Changjiang Kuang, Jiaxu Leng, Mingpi Tan, Shuang Li, Xinbo Gao, Yu Yuan.

Figure 1
Figure 1. Figure 1: Challenges of USL-VVI-ReID with generic pre-trained encoders. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Structural causal model (SCM) for representation purification. The [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: First, identity representations become poorly separated, [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall architecture of the proposed CBA framework for USL-VVI-ReID. The framework comprises three stages: [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the Modality-Perturbation Bootstrapping (MPB). [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of the PGUR. PGUR addresses cross-modality granularity [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distributions of intra- and inter-class distances on HITSZ-VCM. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: t-SNE visualization of cross-modality embeddings on HITSZ-VCM. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative cross-modality retrieval results on HITSZ-VCM. Top-6 matches (Rank@1–Rank@6) are shown for Infrared-to-Visible (top) and Visible [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
read the original abstract

VVI-ReID is a critical technique for all-day surveillance, where temporal information provides additional cues beyond static images. However, existing approaches rely heavily on fully supervised learning with expensive cross-modality annotations, limiting scalability. To address this issue, we investigate Unsupervised Learning for VVI-ReID (USL-VVI-ReID), which learns identity-discriminative representations directly from unlabeled video tracklets. Directly extending image-based USL-VI-ReID methods to this setting with generic pretrained encoders leads to suboptimal performance. Such encoders suffer from weak identity discrimination and strong modality bias, resulting in severe intra-modality identity confusion and pronounced clustering granularity imbalance between visible and infrared modalities. These issues jointly degrade pseudo-label reliability and hinder effective cross-modality alignment. To address these challenges, we propose a Causal Bootstrapped Alignment (CBA) framework that explicitly exploits inherent video priors. First, we introduce Causal Intervention Warm-up (CIW), which performs sequence-level causal interventions by leveraging temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics, yielding cleaner representations for unsupervised clustering. Second, we propose Prototype-Guided Uncertainty Refinement (PGUR), which employs a coarse-to-fine alignment strategy to resolve cross-modality granularity mismatch, reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks demonstrate that CBA significantly outperforms existing USL-VI-ReID methods when extended to the USL-VVI-ReID setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are quantified. The approach implicitly assumes video tracklets contain usable temporal and cross-modality identity consistency.

axioms (1)
  • domain assumption Temporal identity consistency and cross-modality identity consistency hold sufficiently in unlabeled video tracklets to enable causal interventions that remove spurious correlations.
    Invoked to justify the Causal Intervention Warm-up step.

pith-pipeline@v0.9.0 · 5604 in / 1275 out tokens · 83197 ms · 2026-05-10T08:26:15.870984+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  1. [1]

    Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification,

    X. Lin, J. Li, Z. Ma, H. Li, S. Li, K. Xu, G. Lu, and D. Zhang, “Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification,” inCVPR, 2022, pp. 20 973–20 982. SUBMIT TO IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY 13

  2. [2]

    Video-level language-driven video-based visible-infrared person re-identification,

    S. Li, J. Leng, C. Kuang, M. Tan, and X. Gao, “Video-level language-driven video-based visible-infrared person re-identification,” IEEE Trans. Inf. Forensics Security, 2025

  3. [3]

    Intermediary-guided bidi- rectional spatial-temporal aggregation network for video-based visible- infrared person re-identification,

    H. Li, M. Liu, Z. Hu, F. Nie, and Z. Yu, “Intermediary-guided bidi- rectional spatial-temporal aggregation network for video-based visible- infrared person re-identification,”IEEE Trans. Circuits Syst. Video Technol., 2023

  4. [4]

    Video-based visible- infrared person re-identification with auxiliary samples,

    Y . Du, C. Lei, Z. Zhao, Y . Dong, and F. Su, “Video-based visible- infrared person re-identification with auxiliary samples,”IEEE Trans. Inf. Forensics Security, vol. 19, pp. 1313–1325, 2023

  5. [5]

    Dinov2-driven gait representation learning for video-based visible–infrared person re- identification,

    Y . Yang, S. Li, J. Ye, N. Dong, F. Li, and H. Li, “Dinov2-driven gait representation learning for video-based visible–infrared person re- identification,” inACM MM, 2025, pp. 8283–8292

  6. [6]

    Dual-space video person re-identification,

    J. Leng, C. Kuang, S. Li, J. Gan, H. Chen, and X. Gao, “Dual-space video person re-identification,”Int. J. Comput. Vis., vol. 133, no. 6, pp. 3667–3688, 2025

  7. [7]

    Logical relation inference and multiview information interaction for domain adaptation person re-identification,

    S. Li, F. Li, J. Li, H. Li, B. Zhang, D. Tao, and X. Gao, “Logical relation inference and multiview information interaction for domain adaptation person re-identification,”IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 10, pp. 14 770–14 782, 2023

  8. [8]

    Hierarchical prompt learning for image- and text-based person re-identification,

    L. Zhou, S. Li, N. Dong, Y . Tai, Y . Zhang, and H. Li, “Hierarchical prompt learning for image- and text-based person re-identification,” in AAAI, 2026, accepted

  9. [9]

    Body part-level domain alignment for domain-adaptive person re-identification with transformer framework,

    Y . Wang, G. Qi, S. Li, Y . Chai, and H. Li, “Body part-level domain alignment for domain-adaptive person re-identification with transformer framework,”IEEE Trans. Inf. Forensics Security, vol. 17, pp. 3321–3334, 2022

  10. [10]

    Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,

    B. Yang, M. Ye, J. Chen, and Z. Wu, “Augmented dual-contrastive aggregation learning for unsupervised visible-infrared person re- identification,” inACM MM, 2022, pp. 2843–2851

  11. [11]

    Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning,

    Z. Wu and M. Ye, “Unsupervised visible-infrared person re-identification via progressive graph matching and alternate learning,” inCVPR, 2023, pp. 9548–9558

  12. [12]

    Towards grand unified representation learning for unsupervised visible-infrared person re-identification,

    B. Yang, J. Chen, and M. Ye, “Towards grand unified representation learning for unsupervised visible-infrared person re-identification,” in ICCV, 2023, pp. 11 069–11 079

  13. [13]

    Unsupervised visible-infrared person reid by collaborative learning with neighbor- guided label refinement,

    D. Cheng, X. Huang, N. Wang, L. He, Z. Li, and X. Gao, “Unsupervised visible-infrared person reid by collaborative learning with neighbor- guided label refinement,” inACM MM, 2023, pp. 7085–7093

  14. [14]

    Relieving universal label noise for unsupervised visible-infrared person re-identification by inferring from neighbors,

    X. Teng, L. Lan, D. Chen, K. Xu, and N. Yin, “Relieving universal label noise for unsupervised visible-infrared person re-identification by inferring from neighbors,” inAAAI, vol. 39, no. 7, 2025, pp. 7356–7364

  15. [15]

    Robust pseudo-label learning with neighbor relation for unsupervised visible- infrared person re-identification,

    X. Yin, J. Shi, Y . Zhang, Y . Lu, Z. Zhang, Y . Xie, and Y . Qu, “Robust pseudo-label learning with neighbor relation for unsupervised visible- infrared person re-identification,” inACM MM, 2024, pp. 2242–2251

  16. [16]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021, pp. 8748–8763

  17. [17]

    Clip-driven fine-grained text- image person re-identification,

    S. Yan, N. Dong, L. Zhang, and J. Tang, “Clip-driven fine-grained text- image person re-identification,”IEEE Trans. Image Process., vol. 32, pp. 6032–6046, 2023

  18. [18]

    Clip-driven semantic discovery network for visible-infrared person re-identification,

    X. Yu, N. Dong, L. Zhu, H. Peng, and D. Tao, “Clip-driven semantic discovery network for visible-infrared person re-identification,”IEEE Trans. Multimedia, 2025

  19. [19]

    Diverse semantics- guided feature alignment and decoupling for visible-infrared person re- identification,

    N. Dong, S. Yan, L. Zhang, and J. Tang, “Diverse semantics- guided feature alignment and decoupling for visible-infrared person re- identification,”IEEE Trans. Inf. Forensics Security, vol. 20, pp. 12 245– 12 259, 2025

  20. [20]

    Boosting temporal sentence grounding via causal inference,

    K. Tang, L. He, J. Dang, and X. Gao, “Boosting temporal sentence grounding via causal inference,” inACM MM, 2025, pp. 8701–8710

  21. [21]

    Causality-inspired invariant representation learning for text-based person retrieval,

    Y . Liu, G. Qin, H. Chen, Z. Cheng, and X. Yang, “Causality-inspired invariant representation learning for text-based person retrieval,” in AAAI, vol. 38, no. 12, 2024, pp. 14 052–14 060

  22. [22]

    Causality inspired representation learning for domain generalization,

    F. Lv, J. Liang, S. Li, B. Zang, C. H. Liu, Z. Wang, and D. Liu, “Causality inspired representation learning for domain generalization,” inCVPR, 2022, pp. 8046–8056

  23. [23]

    Deep learning for person re-identification: A survey and outlook,

    M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. C. H. Hoi, “Deep learning for person re-identification: A survey and outlook,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 2872–2893, 2021

  24. [24]

    Dynamic dual-attentive aggregation learning for visible-infrared person re- identification,

    M. Ye, J. Shen, D. J. Crandall, L. Shao, and J. Luo, “Dynamic dual-attentive aggregation learning for visible-infrared person re- identification,” inECCV, 2020, pp. 229–247

  25. [25]

    Discover cross-modality nuances for visible-infrared person re- identification,

    Q. Wu, P. Dai, J. Chen, C.-W. Lin, Y . Wu, F. Huang, B. Zhong, and R. Ji, “Discover cross-modality nuances for visible-infrared person re- identification,” inCVPR, 2021, pp. 4330–4339

  26. [26]

    Shape-centered repre- sentation learning for visible-infrared person re-identification,

    S. Li, J. Leng, J. Gan, M. Mo, and X. Gao, “Shape-centered repre- sentation learning for visible-infrared person re-identification,”Pattern Recognition, p. 111756, 2025

  27. [27]

    Cross-modality spatial-temporal transformer for video-based visible-infrared person re-identification,

    Y . Feng, F. Chen, J. Yu, Y . Ji, F. Wu, T. Liu, S. Liu, X.-Y . Jing, and J. Luo, “Cross-modality spatial-temporal transformer for video-based visible-infrared person re-identification,”IEEE Trans. Multimedia, 2024

  28. [28]

    Video-based visible- infrared person re-identification via style disturbance defense and dual interaction,

    C. Zhou, J. Li, H. Li, G. Lu, Y . Xu, and M. Zhang, “Video-based visible- infrared person re-identification via style disturbance defense and dual interaction,” inACM MM, 2023, pp. 46–55

  29. [29]

    Hierarchical disturbance and group inference for video-based visible-infrared person re-identification,

    C. Zhou, Y . Zhou, T. Ren, H. Li, J. Li, and G. Lu, “Hierarchical disturbance and group inference for video-based visible-infrared person re-identification,”Information Fusion, vol. 117, p. 102882, 2025

  30. [30]

    X-reid: Multi-granularity informa- tion interaction for video-based visible-infrared person re-identification,

    C. Yu, X. Liu, P. Zhang, and H. Lu, “X-reid: Multi-granularity informa- tion interaction for video-based visible-infrared person re-identification,” inAAAI, 2025

  31. [31]

    Stepwise metric promotion for unsuper- vised video person re-identification,

    Z. Liu, D. Wang, and H. Lu, “Stepwise metric promotion for unsuper- vised video person re-identification,” inICCV, 2017, pp. 2429–2438

  32. [32]

    Robust anchor embedding for unsupervised video person re-identification in the wild,

    M. Ye, X. Lan, and P. C. Yuen, “Robust anchor embedding for unsupervised video person re-identification in the wild,” inECCV, 2018, pp. 170–186

  33. [33]

    Dynamic graph co- matching for unsupervised video-based person re-identification,

    M. Ye, J. Li, A. J. Ma, L. Zheng, and P. C. Yuen, “Dynamic graph co- matching for unsupervised video-based person re-identification,”IEEE Trans. Image Process., vol. 28, no. 6, pp. 2976–2990, 2019

  34. [34]

    Exploiting global camera network constraints for unsupervised video person re-identification,

    X. Wang, R. Panda, M. Liu, Y . Wang, and A. K. Roy-Chowdhury, “Exploiting global camera network constraints for unsupervised video person re-identification,”IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 10, pp. 4020–4030, 2020

  35. [35]

    Unsupervised person re-identification by deep learning tracklet association,

    M. Li, X. Zhu, and S. Gong, “Unsupervised person re-identification by deep learning tracklet association,” inECCV, 2018, pp. 737–753

  36. [36]

    Unsupervised tracklet person re- identification,

    M. Li, X. Zhu, and S. Gong, “Unsupervised tracklet person re- identification,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 7, pp. 1770–1782, 2019

  37. [37]

    Person re-identification by unsupervised video matching,

    X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K.-M. Lam, and Y . Zhong, “Person re-identification by unsupervised video matching,”Pattern Recognition, vol. 65, pp. 197–210, 2017

  38. [38]

    Sampling and re- weighting: Towards diverse frame-aware unsupervised video person re- identification,

    P. Xie, X. Xu, Z. Wang, and T. Yamasaki, “Sampling and re- weighting: Towards diverse frame-aware unsupervised video person re- identification,”IEEE Trans. Multimedia, vol. 24, pp. 4250–4261, 2022

  39. [39]

    Anchor association learning for unsupervised video person re-identification,

    S. Zeng, X. Wang, M. Liu, Q. Liu, and Y . Wang, “Anchor association learning for unsupervised video person re-identification,”IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 1, pp. 1013–1024, 2022

  40. [40]

    Unsupervised person re- identification: Clustering and fine-tuning,

    H. Fan, L. Zheng, C. Yan, and Y . Yang, “Unsupervised person re- identification: Clustering and fine-tuning,”ACM Trans. Multimedia Comput. Commun. Appl., vol. 14, no. 4, pp. 1–18, 2018

  41. [41]

    Robust duality learning for unsupervised visible-infrared person re-identification,

    Y . Li, Y . Sun, Y . Qin, D. Peng, X. Peng, and P. Hu, “Robust duality learning for unsupervised visible-infrared person re-identification,”IEEE Trans. Inf. Forensics Security, 2025

  42. [42]

    Multi-memory matching for unsupervised visible-infrared person re- identification,

    J. Shi, X. Yin, Y . Chen, Y . Zhang, Z. Zhang, Y . Xie, and Y . Qu, “Multi-memory matching for unsupervised visible-infrared person re- identification,” inECCV. Springer, 2024, pp. 456–474

  43. [43]

    Csanet: Cross-modality self-paced association network for unsupervised visible- infrared person re-identification,

    R. Xi, Z. Fu, N. Huang, X. Zhao, Q. Zhang, and J. Han, “Csanet: Cross-modality self-paced association network for unsupervised visible- infrared person re-identification,”IEEE Trans. Inf. Forensics Security, 2025

  44. [44]

    Dual-level matching with outlier filtering for unsupervised visible-infrared person re-identification,

    M. Ye, Z. Wu, and B. Du, “Dual-level matching with outlier filtering for unsupervised visible-infrared person re-identification,”IEEE Trans. Pattern Anal. Mach. Intell., 2025

  45. [45]

    Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,

    B. Yang, J. Chen, and M. Ye, “Shallow-deep collaborative learning for unsupervised visible-infrared person re-identification,” inCVPR, 2024, pp. 16 870–16 879

  46. [46]

    Unveiling the power of clip in unsupervised visible-infrared person re-identification,

    Z. Chen, Z. Zhang, X. Tan, Y . Qu, and Y . Xie, “Unveiling the power of clip in unsupervised visible-infrared person re-identification,” inACM MM, 2023, pp. 3667–3675

  47. [47]

    Peters, D

    J. Peters, D. Janzing, and B. Sch ¨olkopf,Elements of causal inference: foundations and learning algorithms. The MIT press, 2017

  48. [48]

    On causal and anticausal learning,

    B. Sch ¨olkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij, “On causal and anticausal learning,”ICML, 2012

  49. [49]

    Arbitrary style transfer in real-time with adaptive instance normalization,

    X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” inICCV, 2017, pp. 1501–1510

  50. [50]

    Channel augmented joint learning for visible-infrared recognition,

    M. Ye, W. Ruan, B. Du, and M. Z. Shou, “Channel augmented joint learning for visible-infrared recognition,” inICCV, 2021, pp. 13 567– 13 576

  51. [51]

    Fa-net: A feature alignment network for video-based visible-infrared person re- identification,

    X. Yang, W. Dong, X. Wang, D. Cheng, and N. Wang, “Fa-net: A feature alignment network for video-based visible-infrared person re- identification,”IEEE Trans. Image Process., 2025