Enhancing the Discriminative Feature Learning for Visible-Thermal Cross-Modality Person Re-Identification
Pith reviewed 2026-05-24 18:09 UTC · model grok-4.3
The pith
Skip connections for mid-level features plus a dual-modality triplet loss enhance discriminative learning in visible-thermal person re-identification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A two-stream CNN equipped with skip connections that incorporate mid-level features and trained with a dual-modality triplet loss reduces both cross-modality discrepancy and intra-modality variations, yielding person features that are more discriminative and robust for visible-thermal re-identification.
What carries the argument
The EDFL method that adds skip connections for mid-level feature incorporation and a dual-modality triplet loss inside a two-stream CNN.
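As a rough illustration of what "mid-level feature incorporation" can mean in practice, the sketch below pools a mid-level and a high-level feature map and joins them into one descriptor. The fusion by concatenation, the L2 normalization, and the names `global_avg_pool`, `l2_normalize`, and `fuse_mid_and_high` are assumptions made for this example. The summary above does not specify how the paper fuses the two levels.

```python
import math
from typing import List

# feature_map[c] holds the flattened spatial values of channel c (toy layout, assumed).
FeatureMap = List[List[float]]


def global_avg_pool(feature_map: FeatureMap) -> List[float]:
    """Collapse each channel to its spatial mean."""
    return [sum(channel) / len(channel) for channel in feature_map]


def l2_normalize(vec: List[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def fuse_mid_and_high(mid_map: FeatureMap, high_map: FeatureMap) -> List[float]:
    """Skip-connection style fusion (assumed: concatenation of pooled features).

    The mid-level map bypasses the remaining layers and is joined with the
    high-level map to form the final person descriptor.
    """
    mid_vec = l2_normalize(global_avg_pool(mid_map))
    high_vec = l2_normalize(global_avg_pool(high_map))
    return mid_vec + high_vec  # list concatenation, not element-wise addition


# Toy example: two mid-level channels and one high-level channel.
descriptor = fuse_mid_and_high([[1.0, 3.0], [2.0, 2.0]], [[4.0, 4.0]])
print(len(descriptor))
```

Normalizing each level before concatenation keeps one level from dominating the distance computation purely because of its scale. Other fusion choices, such as summation after a projection, would fit the same description.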
If this is right
- Mid-level features passed via skip connections add robustness that high-level features alone do not provide across modalities.
- The dual-modality triplet loss simultaneously shrinks distances between visible and thermal images of the same person and expands distances between different people within each modality.
- A two-stream architecture produces shared features usable by both visible and thermal inputs.
- The combined changes produce measurable accuracy lifts on existing visible-thermal benchmarks.
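The behaviour attributed to the dual-modality triplet loss in the list above can be sketched as follows. This is a minimal interpretation, not the paper's formulation: for each anchor, the positive is taken from the other modality (same person) and the negative from the same modality (different person). The hardest-sample mining, the margin value of 0.3, and the function names are all assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

# (person_id, modality, feature vector); modality is "visible" or "thermal".
Sample = Tuple[int, str, Sequence[float]]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dual_modality_triplet_loss(samples: List[Sample], margin: float = 0.3) -> float:
    """Hinge-style triplet loss over a batch (assumed form, see lead-in)."""
    total, count = 0.0, 0
    for pid_a, mod_a, feat_a in samples:
        # Cross-modality positives: same person, other modality.
        pos = [euclidean(feat_a, f) for pid, mod, f in samples
               if pid == pid_a and mod != mod_a]
        # Intra-modality negatives: different person, same modality.
        neg = [euclidean(feat_a, f) for pid, mod, f in samples
               if pid != pid_a and mod == mod_a]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        # Hardest positive (farthest) against hardest negative (closest).
        total += max(0.0, margin + max(pos) - min(neg))
        count += 1
    return total / count if count else 0.0


batch = [
    (0, "visible", [0.0, 0.0]), (0, "thermal", [0.0, 1.0]),
    (1, "visible", [3.0, 0.0]), (1, "thermal", [3.0, 1.0]),
]
print(dual_modality_triplet_loss(batch))
```

In this toy batch the two identities are already well separated within each modality and close across modalities, so every hinge term is inactive. Moving the thermal feature of a person away from the visible one, or moving two identities together within one modality, makes the loss positive.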
Where Pith is reading between the lines
- The emphasis on mid-level features suggests that identity cues useful across modalities sit at intermediate depths rather than only at the deepest layers.
- The same pair of modifications could be tested on other cross-modal re-identification pairs such as RGB-infrared or visible-depth without changing the overall training recipe.
- If the dual triplet loss proves decisive, future work could explore weighting the cross-modality and intra-modality terms separately on different datasets.
Load-bearing premise
These two lightweight changes alone are enough to close the cross-modality gap and handle intra-modality variations, without deeper redesigns or more training data.
What would settle it
Running the same network on a new visible-thermal dataset where accuracy gains shrink to within a few percent of the best prior method would show the enhancements are not generally sufficient.
Original abstract
Existing person re-identification has achieved great progress in the visible domain, where all person images are captured with visible cameras. However, in a 24-hour intelligent surveillance system, visible cameras may be ineffective at night. In this situation, thermal cameras are the best supplemental components, since they capture images without depending on visible light. Therefore, in this paper, we investigate the visible-thermal cross-modality person re-identification (VT Re-ID) problem. In VT Re-ID, two knotty problems must be handled well: cross-modality discrepancy and intra-modality variations. To address these two issues, we propose enhancing the discriminative feature learning (EDFL) with two extremely simple means from two core aspects: (1) skip-connections for mid-level feature incorporation, to improve the discriminability and robustness of the person features, and (2) a dual-modality triplet loss to guide the training procedure by simultaneously considering the cross-modality discrepancy and intra-modality variations. Additionally, a two-stream CNN structure is adopted to learn multi-modality sharable person features. The experimental results on two datasets show that our proposed EDFL approach distinctly outperforms state-of-the-art methods by large margins, demonstrating the effectiveness of EDFL in enhancing discriminative feature learning for VT Re-ID.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an EDFL method for visible-thermal cross-modality person re-identification that adopts a two-stream CNN backbone, adds skip-connections to incorporate mid-level features for improved discriminability, and introduces a dual-modality triplet loss to jointly address cross-modality discrepancy and intra-modality variations. The central claim, stated in the abstract and §4, is that these two simple enhancements enable the method to distinctly outperform state-of-the-art approaches by large margins on two datasets.
Significance. If the reported gains are reproducible and attributable to the proposed components rather than the backbone or training protocol, the work would be significant for 24-hour surveillance applications, as it suggests that lightweight architectural and loss modifications can mitigate modality gaps without complex models or extra data.
major comments (1)
- [§4] §4 (Experiments) and associated tables: no ablation studies are presented that isolate the contribution of the mid-level skip-connection or the dual-modality triplet loss (e.g., full EDFL vs. two-stream baseline with standard triplet loss). Without these controls, the claim that the large margins over SOTA are caused by the two proposed enhancements cannot be verified and remains load-bearing for the paper's central assertion.
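The controls requested in this comment amount to a small two-factor grid. The sketch below enumerates it. The variant names and dictionary keys are hypothetical labels chosen for this example, not identifiers from the paper.

```python
from itertools import product
from typing import Dict, List, Union

Variant = Dict[str, Union[str, bool]]


def ablation_variants() -> List[Variant]:
    """Enumerate the 2 x 2 grid: skip-connection on/off times loss type."""
    variants: List[Variant] = []
    losses = ["standard_triplet", "dual_modality_triplet"]  # hypothetical labels
    for use_skip, loss in product([False, True], losses):
        name = ("skip" if use_skip else "no_skip") + "+" + loss
        variants.append({"name": name, "skip_connection": use_skip, "loss": loss})
    return variants


for variant in ablation_variants():
    print(variant["name"])
```

The first variant is the two-stream baseline with a standard triplet loss and the last is the full EDFL configuration. The two intermediate variants isolate each component on its own, provided the backbone and training protocol are held fixed across all four runs.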
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback. We address the major comment below and will incorporate the suggested changes in the revised manuscript.
Point-by-point responses
- Referee: [§4] §4 (Experiments) and associated tables: no ablation studies are presented that isolate the contribution of the mid-level skip-connection or the dual-modality triplet loss (e.g., full EDFL vs. two-stream baseline with standard triplet loss). Without these controls, the claim that the large margins over SOTA are caused by the two proposed enhancements cannot be verified and remains load-bearing for the paper's central assertion.
  Authors: We agree that the manuscript would benefit from explicit ablation studies to isolate the contributions of the mid-level skip-connections and the dual-modality triplet loss. The current experiments focus on overall performance against state-of-the-art methods but do not include direct controls such as a two-stream baseline with standard triplet loss or variants without skip-connections. In the revision, we will add these ablation experiments to the tables in §4, which will allow verification that the reported gains are attributable to the proposed components rather than the backbone or training protocol alone.
  Revision: yes
Circularity Check
No significant circularity detected
The paper proposes an empirical method (two-stream CNN + mid-level skip connections + dual-modality triplet loss) for VT Re-ID and supports its claims solely via reported performance on two external datasets. No equations, derivations, or predictions are present that reduce to inputs by construction; no self-citations are invoked as load-bearing uniqueness theorems; and the approach does not rename known results or smuggle ansatzes. The derivation chain is therefore self-contained empirical evaluation rather than tautological.