Interaction-and-Aggregation Network for Person Re-identification

Bingpeng Ma; Hong Chang; Ruibing Hou; Shiguang Shan; Xilin Chen; Xinqian Gu

arxiv: 1907.08435 · v1 · pith:WODK2SPMnew · submitted 2019-07-19 · 💻 cs.CV

Interaction-and-Aggregation Network for Person Re-identification

Ruibing Hou , Bingpeng Ma , Hong Chang , Xinqian Gu , Shiguang Shan , Xilin Chen This is my paper

Pith reviewed 2026-05-24 19:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords person re-identificationconvolutional neural networkfeature representationspatial interactionchannel interactiondeep learningpose variation

0 comments

The pith

The Interaction-and-Aggregation network enhances CNN feature representations for person re-identification by adaptively modeling spatial and channel interdependencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes adding Interaction-and-Aggregation blocks to convolutional neural networks to better handle variations in person pose and scale during re-identification. The Spatial IA module models interdependencies between spatial features and aggregates those from the same body parts, allowing adaptive receptive fields unlike fixed CNN regions. The Channel IA module selectively aggregates channel features to highlight small-scale cues. These blocks can be inserted at any depth in CNNs, leading to better embeddings validated on three benchmark datasets.

Core claim

The paper claims that the Interaction-and-Aggregation (IA) network structure, built from Spatial IA (SIA) and Channel IA (CIA) modules, enhances the feature representation capability of CNNs for person re-identification by modeling interdependencies and aggregating correlated features adaptively according to input pose and scale, outperforming state-of-the-art methods on benchmark datasets.

What carries the argument

The Interaction-and-Aggregation (IA) block consisting of Spatial IA (SIA) module for spatial feature interdependencies and Channel IA (CIA) module for channel feature aggregation.

If this is right

Standard CNNs gain the ability to adapt receptive fields based on person pose and scale instead of using fixed regions.
Small-scale visual cues are enhanced through selective channel feature aggregation.
IA blocks can be integrated into existing CNN architectures at multiple depths to improve reID performance.
Feature embeddings become more robust, leading to higher accuracy on person re-identification benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar modules might improve performance in other computer vision tasks involving variable object poses and scales.
The approach could reduce the need for complex data augmentation strategies in reID training.
Inserting these blocks might have computational trade-offs that depend on network depth.

Load-bearing premise

That the SIA module can adaptively determine receptive fields according to input person pose and scale and that inserting IA blocks at any depth produces measurable gains on standard reID benchmarks without dataset-specific adjustments.

What would settle it

Running the IA network on the three benchmark datasets and finding it does not outperform state-of-the-art methods would falsify the effectiveness claim.

Figures

Figures reproduced from arXiv: 1907.08435 by Bingpeng Ma, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen, Xinqian Gu.

**Figure 4.** Figure 4: Visualization of the receptive fields in SIA with single [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 3.** Figure 3: The multi-context interaction operation of SIA. For clarity, we omit the channel dimensions of the input feature map and the softmax layer. The number of context levels is 3 in this figure. 3. Interaction-and-Aggregation Network In this section, we first introduce SIA and CIA modules, respectively. Then, IA block, which integrates SIA and CIA modules, is illustrated, followed by IANet for person reID. Fin… view at source ↗

**Figure 5.** Figure 5: The architecture of Channel Interaction-and [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

**Figure 6.** Figure 6: (a) The structure of IA block, which is sequentially con [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Parameter analysis for location relation map. (a) top [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization results of SIA and CIA on Market-1501. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

read the original abstract

Person re-identification (reID) benefits greatly from deep convolutional neural networks (CNNs) which learn robust feature embeddings. However, CNNs are inherently limited in modeling the large variations in person pose and scale due to their fixed geometric structures. In this paper, we propose a novel network structure, Interaction-and-Aggregation (IA), to enhance the feature representation capability of CNNs. Firstly, Spatial IA (SIA) module is introduced. It models the interdependencies between spatial features and then aggregates the correlated features corresponding to the same body parts. Unlike CNNs which extract features from fixed rectangle regions, SIA can adaptively determine the receptive fields according to the input person pose and scale. Secondly, we introduce Channel IA (CIA) module which selectively aggregates channel features to enhance the feature representation, especially for smallscale visual cues. Further, IA network can be constructed by inserting IA blocks into CNNs at any depth. We validate the effectiveness of our model for person reID by demonstrating its superiority over state-of-the-art methods on three benchmark datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an Interaction-and-Aggregation (IA) network for person re-identification. It introduces a Spatial IA (SIA) module that models interdependencies between spatial features and aggregates correlated features from the same body parts, enabling adaptive receptive fields based on input pose and scale (unlike fixed CNN grids), and a Channel IA (CIA) module that selectively aggregates channel features to enhance representation of small-scale cues. IA blocks can be inserted into CNNs at arbitrary depths, and the resulting model is claimed to outperform state-of-the-art methods on three benchmark datasets.

Significance. If the SIA aggregation mechanism is shown to produce receptive fields that genuinely vary with pose and scale geometry (rather than acting as a generic capacity boost), the approach would address a recognized limitation of CNNs in reID and offer a flexible way to enhance feature robustness. The arbitrary-depth insertion property could increase practical utility across architectures.

major comments (2)

[Abstract] Abstract, paragraph 2: the claim that SIA 'can adaptively determine the receptive fields according to the input person pose and scale' is load-bearing for the central novelty argument, yet the provided description supplies no derivation, conditioning variable, or constraint ensuring that the interdependency weights respond to pose/scale geometry rather than learning a static or capacity-driven pattern. If the module reduces to a non-local or attention block whose effective field is independent of input geometry, benchmark gains cannot be attributed to the stated mechanism.
[Method (SIA)] Method section (SIA module): the explicit formulation of how spatial interdependencies are computed and aggregated (e.g., the weight matrix or aggregation operator) must be shown to enforce dynamic response to pose/scale; without this, the superiority claim over standard CNNs rests on an unverified assumption.

minor comments (1)

[Abstract] The abstract states validation on 'three benchmark datasets' but does not name them; this should be added for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below, clarifying the input-dependent formulation of the SIA module while acknowledging where additional exposition would strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract, paragraph 2: the claim that SIA 'can adaptively determine the receptive fields according to the input person pose and scale' is load-bearing for the central novelty argument, yet the provided description supplies no derivation, conditioning variable, or constraint ensuring that the interdependency weights respond to pose/scale geometry rather than learning a static or capacity-driven pattern. If the module reduces to a non-local or attention block whose effective field is independent of input geometry, benchmark gains cannot be attributed to the stated mechanism.

Authors: The SIA module derives its spatial interdependency weights from a learned function applied directly to the input feature map; the resulting correlation matrix therefore varies with the specific activations that encode pose and scale. This input conditioning distinguishes the mechanism from a static pattern. We will revise the abstract and add a short clarifying sentence in Section 3.2 that explicitly identifies the input feature tensor as the conditioning variable. revision: partial
Referee: [Method (SIA)] Method section (SIA module): the explicit formulation of how spatial interdependencies are computed and aggregated (e.g., the weight matrix or aggregation operator) must be shown to enforce dynamic response to pose/scale; without this, the superiority claim over standard CNNs rests on an unverified assumption.

Authors: Equation (2) in Section 3.2 defines the weight matrix as a softmax-normalized similarity computed between feature vectors extracted from the current input tensor; the subsequent aggregation (Equation (3)) therefore selects body-part features according to input-specific correlations. Because the similarity computation is performed anew for every forward pass, the effective receptive field changes with pose and scale geometry. We will insert a brief paragraph contrasting this behavior with fixed CNN grids and, if space permits, add a qualitative visualization of the learned weights on sample poses. revision: partial

Circularity Check

0 steps flagged

No circularity: architecture proposal with empirical validation only

full rationale

The paper introduces a new CNN augmentation (IA blocks containing SIA and CIA modules) whose claimed benefits are design assertions about adaptive receptive fields and channel aggregation, followed by benchmark comparisons. No equations, fitted parameters, or derivations are presented that could reduce a result to its own inputs by construction. The adaptivity statement is a descriptive claim about module behavior rather than a mathematical prediction derived from prior fitted quantities or self-citations. Self-contained empirical evaluation on standard reID datasets supplies the support; no load-bearing step collapses into a tautology or renamed input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review is abstract-only, so the ledger records only the explicit domain assumptions stated in the provided text. The paper introduces two new modules whose internal mechanics are not detailed.

axioms (1)

domain assumption CNNs are inherently limited in modeling large variations in person pose and scale due to their fixed geometric structures.
Stated in the first sentence of the abstract as the core motivation for the new modules.

invented entities (2)

Spatial IA (SIA) module no independent evidence
purpose: Models interdependencies between spatial features and aggregates correlated features corresponding to the same body parts.
New module introduced to adapt receptive fields to pose and scale.
Channel IA (CIA) module no independent evidence
purpose: Selectively aggregates channel features to enhance representation especially for small-scale visual cues.
New module introduced to improve feature representation.

pith-pipeline@v0.9.0 · 5727 in / 1377 out tokens · 23169 ms · 2026-05-24T19:31:27.034344+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 8 internal anchors

[1]

S. Bai, X. Bai, and Q. Tian. Scalable person re-identiﬁcation on supervised smoothed manifold. In CVPR, pages 2530– 2539, 2017

work page 2017
[2]

Bak and P

S. Bak and P. Carr. One-shot metric learning for person re- identiﬁcation. In CVPR, pages 2990–2999, 2017

work page 2017
[3]

R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior. The relation between the roc curve and the cmc. In AUTOID, pages 15–20, 2005

work page 2005
[4]

Chang, T

X. Chang, T. M. Hospedales, and T. Xiang. Multi-level fac- torisation net for person re-identiﬁcation. In CVPR, pages 2109–2118, 2018

work page 2018
[5]

D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang. Group consistent similarity learning via deep crf for person re- identiﬁcation. In CVPR, pages 8649–8658, 2018

work page 2018
[6]

Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning

Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, pages 5659–5667, 2017

work page 2017
[7]

Y . Chen, X. Zhu, and S. Gong. Person re-identiﬁcation by deep learning multi-scale representations. In ICCV, pages 2590–2600, 2017

work page 2017
[8]

Cheng, Y

D. Cheng, Y . Gong, S. Zhou, J. Wang, and N. Zheng. Per- son re-identiﬁcation by multi-channel parts-based cnn with improved triplet loss function. In CVPR, pages 1335 – 1344, 2016

work page 2016
[9]

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. In ICCV, pages 764– 773, 2017

work page 2017
[10]

Y . Du, C. Yuan, B. Li, L. Zhao, Y . Li, and W. Hu. Interaction- aware spatio-temporal pyramid attention networks for action classiﬁcation. In ECCV, pages 373–389, 2018

work page 2018
[11]

Gens and P

R. Gens and P. M. Domingos. Deep symmetry networks. In NIPS, pages 2537–2545, 2014

work page 2014
[12]

Guo and N

Y . Guo and N. M. Cheung. Efﬁcient and deep person re- identiﬁcation using multi-level similarity. In CVPR, pages 2335–2344, 2018

work page 2018
[13]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770 – 778, 2016

work page 2016
[14]

Hochreiter and J

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

work page 1997
[15]

J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation net- works. arXiv preprint arXiv:1709.01507, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

Huang, D

H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang. Ad- versarially occluded samples for person re-identiﬁcation. In CVPR, pages 5098–5107, 2018

work page 2018
[17]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[18]

Jaderberg, K

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015

work page 2017
[19]

Jeon and J

Y . Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classiﬁcation. In CVPR, pages 4201–4209, 2017

work page 2017
[20]

M. M. Kalayeh, E. Basaran, M. Gkmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re- identiﬁcation. In CVPR, pages 1062–1071, 2018

work page 2018
[21]

Locally Scale-Invariant Convolutional Neural Networks

A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale- invariant convolutional neural networks. arXiv preprint arXiv:1412.5104, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[22]

Karpathy, G

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classiﬁcation with convo- lutional neural networks. In CVPR, pages 1725–1732, 2014

work page 2014
[23]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[24]

D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identiﬁcation. In CVPR, pages 384–393, 2017

work page 2017
[25]

W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep ﬁlter pairing neural network for person re-identiﬁcation. InCVPR, pages 152–159, 2014

work page 2014
[26]

W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identiﬁcation. In CVPR, pages 2285 – 2294, 2018

work page 2018
[27]

J. Liu, Z. J. Zha, Q. Tian, D. Liu, T. Yao, Q. Ling, and T. Mei. Multi-scale triplet cnn for person re-identiﬁcation. In ACM, pages 192–196, 2016

work page 2016
[28]

X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In ICCV, pages 350–359, 2017

work page 2017
[29]

Z. Liu, D. Wang, and H. Lu. Stepwise metric promotion for unsupervised video person re-identiﬁcation. In ICCV, pages 2429–2438, 2017

work page 2017
[30]

D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999

work page 1999
[31]

Newell, K

A. Newell, K. Yang, and J. Deng. Stacked hourglass net- works for human pose estimation. In ECCV, pages 483 – 499, 2016

work page 2016
[32]

Paisitkriangkrai, C

S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Learn- ing to rank in person re-identiﬁcation with metric ensembles. In CVPR, pages 1846–1855, 2015

work page 2015
[33]

X. Qian, Y . Fu, W. Wang, T. Xiang, Y . Wu, Y . G. Jiang, and X. Xue. Pose-normalized image generation for person re- identiﬁcation. In ECCV, pages 650–667, 2018

work page 2018
[34]

Rublee, V

E. Rublee, V . Rabaud, K. Konolige, and G. Bradski. Orb: an efﬁcient alternative to sift or surf. In ICCV, pages 2564– 2571, 2011

work page 2011
[35]

M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A pose-sensitive embedding for person re-identiﬁcation with expanded cross neighborhood re-ranking. In CVPR, pages 420–429, 2018

work page 2018
[36]

Y . Shen, T. Xiao, H. Li, S. Yi, and X. Wang. End-to-end deep kronecker-product matching for person re-identiﬁcation. In CVPR, pages 6886–6895, 2018

work page 2018
[37]

C. Song, Y . Huang, W. Ouyang, and L. Wang. Mask-guided contrastive attention model for person reidentiﬁcation. In CVPR, pages 1179–1188, 2018

work page 2018
[38]

C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose- driven deep convolutional model for person re-identiﬁcation. arXiv preprint arXiv:1709.08325, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[39]

Y . Sun, L. Zheng, W. Deng, and S. Wang. Svdnet for pedes- trian retrieval. In ICCV, pages 3800–3808, 2017

work page 2017
[40]

Y . Sun, L. Zheng, Y . Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with reﬁned part pooling (and a strong convolutional baseline). In ECCV, pages 480–496, 2018

work page 2018
[41]

Szegedy, W

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich. Go- ing deeper with convolutions. In CVPR, pages 1–9, 2015

work page 2015
[42]

M. Tian, S. Yi, H. Li, S. Li, X. Zhang, J. Shi, J. Yan, and X. Wang. Eliminating background-bias for robust person re- identiﬁcation. In CVPR, pages 5794–5803, 2018

work page 2018
[43]

R. R. Varior, M. Haloi, and G. Wang. Gated siamese convo- lutional neural network architecture for human reidentiﬁca- tion. In ECCV, pages 791–808, 2016

work page 2016
[44]

C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identiﬁcation. In ECCV, pages 365 – 381, 2018

work page 2018
[45]

Residual Attention Network for Image Classification

Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classiﬁcation. arXiv preprint arXiv:1704.06904, 2017, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[46]

X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, pages 7794–7803, 2018

work page 2018
[47]

Y . Wang, Z. Chen, F. Wu, and G. Wang. Person re- identiﬁcation with cascaded pairwise convolutions. In CVPR, pages 1470–1478, 2018

work page 2018
[48]

Y . Wang, L. Wang, Y . You, X. Zou, V . Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger. Resource aware person re-identiﬁcation across multiple resolutions. In CVPR, pages 8042–8051, 2018

work page 2018
[49]

L. Wei, S. Zhang, W. Gao, and Q. Tian. Person trasfer gan to bridge domain gap for person re-identiﬁcation. In CVPR, pages 79–88, 2018

work page 2018
[50]

L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian. Glad: global- local-alignment descriptor for pedestrian retrieval. In ACM, pages 420–428, 2017

work page 2017
[51]

Cbam: Convolutional block attention module

Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In ECCV, pages 3–19, 2018

work page 2018
[52]

D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. arXiv preprint arXiv:1612.04642, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[53]

T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep fea- ture representations with domain guided dropout for person re-identiﬁcation. In CVPR, pages 1249–1258, 2016

work page 2016
[54]

J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Quyang. Attention- aware compositional network for person re-identiﬁcation. In CVPR, pages 2119–2128, 2018

work page 2018
[55]

Yanbei, Z

C. Yanbei, Z. Xiatian, and G. Shaogang. Person reidentiﬁca- tion by deep learning multi-scale representations. In ICCV, pages 2590–2600, 2017

work page 2017
[56]

H. X. Yu, A. Wu, and W. S. Zhen. Cross-view asymmetric metric learning for unsupervised person re-identiﬁcation. In ICCV, pages 994–1002, 2017

work page 2017
[57]

R. Yu, Z. Dou, S. Bai, Z. Zhang, Y . Xu, and X. Bai. Hard- aware point-to-set deep metric for person re-identiﬁcation. In ECCV, pages 188–204, 2018

work page 2018
[58]

Zhang, J

S. Zhang, J. Yang, and B. Schiele. Occluded pedestrian de- tection through guided attention in cnns. In CVPR, pages 6995 – 7003, 2018

work page 2018
[59]

H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identiﬁcation with hu- man body region guided feature decomposition and fusion. In CVPR, pages 1077–1085, 2017

work page 2017
[60]

L. Zhao, X. Li, J. Wang, and Y . Zhuang. Deeply-learned part-aligned representations for person re-identiﬁcation. In ICCV, pages 3239 – 3248, 2017

work page 2017
[61]

Pose Invariant Embedding for Deep Person Re-identification

L. Zheng, Y . Huang, H. Lu, and Y . Yang. Pose invariant embedding for deep person re-identiﬁcation. arXiv preprint arXiv:1701.07732, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[63]

Zheng, L

L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identiﬁcation: A benchmark. In ICCV, pages 1116–1124, 2015

work page 2015
[64]

Zheng, L

Z. Zheng, L. Zheng, and Y . Yang. Unlabeled samples gener- ated by gan improve the person re-identiﬁcation baseline in vitro. In ICCV, pages 3754–3762, 2017

work page 2017

[1] [1]

S. Bai, X. Bai, and Q. Tian. Scalable person re-identiﬁcation on supervised smoothed manifold. In CVPR, pages 2530– 2539, 2017

work page 2017

[2] [2]

Bak and P

S. Bak and P. Carr. One-shot metric learning for person re- identiﬁcation. In CVPR, pages 2990–2999, 2017

work page 2017

[3] [3]

R. M. Bolle, J. H. Connell, S. Pankanti, N. K. Ratha, and A. W. Senior. The relation between the roc curve and the cmc. In AUTOID, pages 15–20, 2005

work page 2005

[4] [4]

Chang, T

X. Chang, T. M. Hospedales, and T. Xiang. Multi-level fac- torisation net for person re-identiﬁcation. In CVPR, pages 2109–2118, 2018

work page 2018

[5] [5]

D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang. Group consistent similarity learning via deep crf for person re- identiﬁcation. In CVPR, pages 8649–8658, 2018

work page 2018

[6] [6]

Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning

Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, pages 5659–5667, 2017

work page 2017

[7] [7]

Y . Chen, X. Zhu, and S. Gong. Person re-identiﬁcation by deep learning multi-scale representations. In ICCV, pages 2590–2600, 2017

work page 2017

[8] [8]

Cheng, Y

D. Cheng, Y . Gong, S. Zhou, J. Wang, and N. Zheng. Per- son re-identiﬁcation by multi-channel parts-based cnn with improved triplet loss function. In CVPR, pages 1335 – 1344, 2016

work page 2016

[9] [9]

J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. In ICCV, pages 764– 773, 2017

work page 2017

[10] [10]

Y . Du, C. Yuan, B. Li, L. Zhao, Y . Li, and W. Hu. Interaction- aware spatio-temporal pyramid attention networks for action classiﬁcation. In ECCV, pages 373–389, 2018

work page 2018

[11] [11]

Gens and P

R. Gens and P. M. Domingos. Deep symmetry networks. In NIPS, pages 2537–2545, 2014

work page 2014

[12] [12]

Guo and N

Y . Guo and N. M. Cheung. Efﬁcient and deep person re- identiﬁcation using multi-level similarity. In CVPR, pages 2335–2344, 2018

work page 2018

[13] [13]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770 – 778, 2016

work page 2016

[14] [14]

Hochreiter and J

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

work page 1997

[15] [15]

J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation net- works. arXiv preprint arXiv:1709.01507, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[16] [16]

Huang, D

H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang. Ad- versarially occluded samples for person re-identiﬁcation. In CVPR, pages 5098–5107, 2018

work page 2018

[17] [17]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[18] [18]

Jaderberg, K

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015

work page 2017

[19] [19]

Jeon and J

Y . Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classiﬁcation. In CVPR, pages 4201–4209, 2017

work page 2017

[20] [20]

M. M. Kalayeh, E. Basaran, M. Gkmen, M. E. Kamasak, and M. Shah. Human semantic parsing for person re- identiﬁcation. In CVPR, pages 1062–1071, 2018

work page 2018

[21] [21]

Locally Scale-Invariant Convolutional Neural Networks

A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale- invariant convolutional neural networks. arXiv preprint arXiv:1412.5104, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[22] [22]

Karpathy, G

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classiﬁcation with convo- lutional neural networks. In CVPR, pages 1725–1732, 2014

work page 2014

[23] [23]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[24] [24]

D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identiﬁcation. In CVPR, pages 384–393, 2017

work page 2017

[25] [25]

W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep ﬁlter pairing neural network for person re-identiﬁcation. InCVPR, pages 152–159, 2014

work page 2014

[26] [26]

W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identiﬁcation. In CVPR, pages 2285 – 2294, 2018

work page 2018

[27] [27]

J. Liu, Z. J. Zha, Q. Tian, D. Liu, T. Yao, Q. Ling, and T. Mei. Multi-scale triplet cnn for person re-identiﬁcation. In ACM, pages 192–196, 2016

work page 2016

[28] [28]

X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, S. Yi, J. Yan, and X. Wang. Hydraplus-net: Attentive deep features for pedestrian analysis. In ICCV, pages 350–359, 2017

work page 2017

[29] [29]

Z. Liu, D. Wang, and H. Lu. Stepwise metric promotion for unsupervised video person re-identiﬁcation. In ICCV, pages 2429–2438, 2017

work page 2017

[30] [30]

D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999

work page 1999

[31] [31]

Newell, K

A. Newell, K. Yang, and J. Deng. Stacked hourglass net- works for human pose estimation. In ECCV, pages 483 – 499, 2016

work page 2016

[32] [32]

Paisitkriangkrai, C

S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Learn- ing to rank in person re-identiﬁcation with metric ensembles. In CVPR, pages 1846–1855, 2015

work page 2015

[33] [33]

X. Qian, Y . Fu, W. Wang, T. Xiang, Y . Wu, Y . G. Jiang, and X. Xue. Pose-normalized image generation for person re- identiﬁcation. In ECCV, pages 650–667, 2018

work page 2018

[34] [34]

Rublee, V

E. Rublee, V . Rabaud, K. Konolige, and G. Bradski. Orb: an efﬁcient alternative to sift or surf. In ICCV, pages 2564– 2571, 2011

work page 2011

[35] [35]

M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen. A pose-sensitive embedding for person re-identiﬁcation with expanded cross neighborhood re-ranking. In CVPR, pages 420–429, 2018

work page 2018

[36] [36]

Y . Shen, T. Xiao, H. Li, S. Yi, and X. Wang. End-to-end deep kronecker-product matching for person re-identiﬁcation. In CVPR, pages 6886–6895, 2018

work page 2018

[37] [37]

C. Song, Y . Huang, W. Ouyang, and L. Wang. Mask-guided contrastive attention model for person reidentiﬁcation. In CVPR, pages 1179–1188, 2018

work page 2018

[38] [38]

C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose- driven deep convolutional model for person re-identiﬁcation. arXiv preprint arXiv:1709.08325, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[39] [39]

Y . Sun, L. Zheng, W. Deng, and S. Wang. Svdnet for pedes- trian retrieval. In ICCV, pages 3800–3808, 2017

work page 2017

[40] [40]

Y . Sun, L. Zheng, Y . Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with reﬁned part pooling (and a strong convolutional baseline). In ECCV, pages 480–496, 2018

work page 2018

[41] [41]

Szegedy, W

C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich. Go- ing deeper with convolutions. In CVPR, pages 1–9, 2015

work page 2015

[42] [42]

M. Tian, S. Yi, H. Li, S. Li, X. Zhang, J. Shi, J. Yan, and X. Wang. Eliminating background-bias for robust person re- identiﬁcation. In CVPR, pages 5794–5803, 2018

work page 2018

[43] [43]

R. R. Varior, M. Haloi, and G. Wang. Gated siamese convo- lutional neural network architecture for human reidentiﬁca- tion. In ECCV, pages 791–808, 2016

work page 2016

[44] [44]

C. Wang, Q. Zhang, C. Huang, W. Liu, and X. Wang. Mancs: A multi-task attentional network with curriculum sampling for person re-identiﬁcation. In ECCV, pages 365 – 381, 2018

work page 2018

[45] [45]

Residual Attention Network for Image Classification

Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classiﬁcation. arXiv preprint arXiv:1704.06904, 2017, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[46] [46]

X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, pages 7794–7803, 2018

work page 2018

[47] [47]

Y . Wang, Z. Chen, F. Wu, and G. Wang. Person re- identiﬁcation with cascaded pairwise convolutions. In CVPR, pages 1470–1478, 2018

work page 2018

[48] [48]

Y . Wang, L. Wang, Y . You, X. Zou, V . Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger. Resource aware person re-identiﬁcation across multiple resolutions. In CVPR, pages 8042–8051, 2018

work page 2018

[49] [49]

L. Wei, S. Zhang, W. Gao, and Q. Tian. Person trasfer gan to bridge domain gap for person re-identiﬁcation. In CVPR, pages 79–88, 2018

work page 2018

[50] [50]

L. Wei, S. Zhang, H. Yao, W. Gao, and Q. Tian. Glad: global- local-alignment descriptor for pedestrian retrieval. In ACM, pages 420–428, 2017

work page 2017

[51] [51]

Cbam: Convolutional block attention module

Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In ECCV, pages 3–19, 2018

work page 2018

[52] [52]

D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. arXiv preprint arXiv:1612.04642, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[53] [53]

T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep fea- ture representations with domain guided dropout for person re-identiﬁcation. In CVPR, pages 1249–1258, 2016

work page 2016

[54] [54]

J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Quyang. Attention- aware compositional network for person re-identiﬁcation. In CVPR, pages 2119–2128, 2018

work page 2018

[55] [55]

Yanbei, Z

C. Yanbei, Z. Xiatian, and G. Shaogang. Person reidentiﬁca- tion by deep learning multi-scale representations. In ICCV, pages 2590–2600, 2017

work page 2017

[56] [56]

H. X. Yu, A. Wu, and W. S. Zhen. Cross-view asymmetric metric learning for unsupervised person re-identiﬁcation. In ICCV, pages 994–1002, 2017

work page 2017

[57] [57]

R. Yu, Z. Dou, S. Bai, Z. Zhang, Y . Xu, and X. Bai. Hard- aware point-to-set deep metric for person re-identiﬁcation. In ECCV, pages 188–204, 2018

work page 2018

[58] [58]

Zhang, J

S. Zhang, J. Yang, and B. Schiele. Occluded pedestrian de- tection through guided attention in cnns. In CVPR, pages 6995 – 7003, 2018

work page 2018

[59] [59]

H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang. Spindle net: Person re-identiﬁcation with hu- man body region guided feature decomposition and fusion. In CVPR, pages 1077–1085, 2017

work page 2017

[60] [60]

L. Zhao, X. Li, J. Wang, and Y . Zhuang. Deeply-learned part-aligned representations for person re-identiﬁcation. In ICCV, pages 3239 – 3248, 2017

work page 2017

[61] [61]

Pose Invariant Embedding for Deep Person Re-identification

L. Zheng, Y . Huang, H. Lu, and Y . Yang. Pose invariant embedding for deep person re-identiﬁcation. arXiv preprint arXiv:1701.07732, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[62] [63]

Zheng, L

L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identiﬁcation: A benchmark. In ICCV, pages 1116–1124, 2015

work page 2015

[63] [64]

Zheng, L

Z. Zheng, L. Zheng, and Y . Yang. Unlabeled samples gener- ated by gan improve the person re-identiﬁcation baseline in vitro. In ICCV, pages 3754–3762, 2017

work page 2017