arXiv: 2606.02022 · v1 · pith:BR3PYYTJnew · submitted 2026-06-01 · 💻 cs.CV · cs.AI · cs.LG

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Pith reviewed 2026-06-28 15:40 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords multi-view object association · ranking metrics · assignment problem · average precision · FPR-95 · Sinkhorn normalization · evaluation metrics

The pith

Pairwise ranking metrics like AP and FPR-95 do not match the assignment objective in multi-view object association.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that evaluation metrics based on pairwise ranking, such as average precision and false positive rate at 95 percent recall, can give imperfect scores even when the one-to-one assignment of objects across views is already correct. It also shows that the reverse is possible: an optimal ranking can produce an incorrect assignment. Using Sinkhorn normalization on similarity scores as a post-processing step improves the ranking metrics without altering the assignment accuracy. This highlights that optimizing for ranking metrics may not optimize the actual task performance.

Core claim

AP and FPR-95 can be imperfect even when the assignment is already correct, and Sinkhorn-based normalization can make them perfect, while optimal pairwise ranking can still lead to incorrect assignments.
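
To make the first half of this claim concrete, here is a minimal sketch on a hypothetical 2×2 similarity matrix whose true matches lie on the diagonal. The values are illustrative and not taken from the paper. The metric conventions are also assumptions of this sketch: AP is the mean precision at the rank of each positive pair, and FPR-95 is the share of negatives that pass the loosest threshold recalling at least 95% of positives. The brute-force solver merely stands in for a Hungarian-style solver.

```python
import math
from itertools import permutations

def best_assignment(S):
    """One-to-one assignment with maximal total similarity (brute force, small n only)."""
    n = len(S)
    return max(permutations(range(n)),
               key=lambda p: sum(S[i][p[i]] for i in range(n)))

def average_precision(scores, labels):
    """Mean precision measured at the rank of each positive pair."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    hits, precisions = 0, []
    for rank, k in enumerate(order, start=1):
        if labels[k]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

def fpr_at_95(scores, labels):
    """Share of negatives kept at the threshold that recalls >= 95% of positives."""
    pos = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    neg = [s for s, y in zip(scores, labels) if not y]
    threshold = pos[math.ceil(0.95 * len(pos)) - 1]
    return sum(s >= threshold for s in neg) / len(neg)

# Hypothetical similarities: rows are objects in view A, columns are objects in view B.
S = [[0.9, 0.6],
     [0.2, 0.4]]
scores = [s for row in S for s in row]
labels = [i == j for i in range(2) for j in range(2)]  # diagonal = true matches

assignment = best_assignment(S)         # (0, 1): the correct identity matching
ap = average_precision(scores, labels)  # 5/6: negative 0.6 outranks positive 0.4
fpr95 = fpr_at_95(scores, labels)       # 0.5: one of the two negatives passes the threshold
print(assignment, round(ap, 3), fpr95)
```

The assignment is already correct because the diagonal has the larger total (1.3 against 0.8), yet both ranking metrics are penalized because a single negative pair outscores the weaker positive pair.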

What carries the argument

Sinkhorn-based normalization applied to similarity matrices as a controlled post-processing step to isolate the effect on ranking metrics versus assignment metrics.
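
As a rough illustration of this mechanism, the sketch below applies a basic Sinkhorn iteration (alternating row and column normalization of exp(S/τ)) to a hypothetical 2×2 matrix whose true matches lie on the diagonal. The temperature `tau`, the iteration count, and all function names are assumptions; the paper's exact variant is not described on this page. The check for the assignment is done on log-scores, which is also an assumption about how a solver would be applied.

```python
import math
from itertools import permutations

def sinkhorn(S, tau=0.1, n_iters=200):
    """Alternately normalize the rows and columns of exp(S / tau)."""
    P = [[math.exp(s / tau) for s in row] for row in S]
    n, m = len(P), len(P[0])
    for _ in range(n_iters):
        P = [[v / sum(row) for v in row] for row in P]
        col_sums = [sum(P[i][j] for i in range(n)) for j in range(m)]
        P = [[P[i][j] / col_sums[j] for j in range(m)] for i in range(n)]
    return P

def best_assignment(M):
    """Brute-force maximum-sum one-to-one assignment (small n only)."""
    n = len(M)
    return max(permutations(range(n)),
               key=lambda p: sum(M[i][p[i]] for i in range(n)))

S = [[0.9, 0.6],
     [0.2, 0.4]]
P = sinkhorn(S)

# Does every positive (diagonal) pair outscore every negative pair?
raw_separated = min(S[0][0], S[1][1]) > max(S[0][1], S[1][0])   # False
post_separated = min(P[0][0], P[1][1]) > max(P[0][1], P[1][0])  # True

log_P = [[math.log(v) for v in row] for row in P]
same_assignment = best_assignment(S) == best_assignment(log_P)  # True
print(raw_separated, post_separated, same_assignment)
```

When every positive pair outscores every negative pair, a pairwise ranking is perfect by construction, so in this toy case the normalization fixes the ranking while the matching stays the same.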

If this is right

  • Models trained to maximize AP or FPR-95 may not achieve the best possible assignments.
  • Assignment metrics such as ACC and IPAA provide a more direct measure of task success.
  • Simple post-processing can boost reported ranking scores independently of model improvements.
  • Evaluation protocols should include both ranking and assignment metrics to avoid misleading conclusions.
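
A minimal sketch of the assignment-level side of such a protocol follows. This page does not define ACC or IPAA, so the definitions here are assumptions: `acc` is the fraction of objects whose predicted partner equals the true partner, pooled over image pairs, and `ipaa` is the fraction of image pairs in which at least `x` percent of objects are matched correctly, in the spirit of the MessyTable benchmark. The predictions are made up for illustration.

```python
def pair_accuracy(pred, truth):
    """Fraction of objects in one image pair whose predicted partner is correct."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def acc(preds, truths):
    """Object-level accuracy pooled over all image pairs (assumed definition)."""
    correct = sum(p == t
                  for pred, truth in zip(preds, truths)
                  for p, t in zip(pred, truth))
    total = sum(len(truth) for truth in truths)
    return correct / total

def ipaa(preds, truths, x=100):
    """Share of image pairs with at least x percent of objects matched correctly."""
    ok = [pair_accuracy(p, t) * 100 >= x for p, t in zip(preds, truths)]
    return sum(ok) / len(ok)

# Hypothetical matchings for three image pairs; entry i is the column matched to row i.
truths = [(0, 1, 2), (0, 1), (0, 1, 2, 3)]
preds = [(0, 1, 2), (1, 0), (0, 1, 3, 2)]

print(acc(preds, truths))        # 5 of 9 objects are matched correctly
print(ipaa(preds, truths, 100))  # 1 of 3 pairs is fully correct
print(ipaa(preds, truths, 50))   # 2 of 3 pairs are at least half correct
```

Neither function looks at score values, only at the final matching, which is why rescaling the scores can move a ranking metric while leaving these numbers untouched.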

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar metric mismatches could exist in other bipartite matching problems in computer vision.
  • Future work might explore training methods that directly optimize assignment objectives instead of ranking proxies.
  • Practitioners should verify if their models suffer from this mismatch by testing Sinkhorn post-processing on their outputs.

Load-bearing premise

The mismatch between ranking and assignment that appears in theory and controlled experiments will also occur with the similarity matrices generated by actual trained models on real data.

What would settle it

A trained model where applying Sinkhorn normalization improves AP and FPR-95 but does not change ACC or IPAA would support the claim, as would a case where ranking is perfect but the assignment is wrong. The absence of such improvement would challenge it.

Figures

Figures reproduced from arXiv: 2606.02022 by Aleksandr Chukhrov, Karina Kvanchiani, Matvei Shelukhan, Timur Mamedov.

Figure 1: Pairwise ranking metrics and assignment-level metrics evaluate different aspects ...
Figure 2: Overview of our Sinkhorn-based post-processing stress test. The raw affinity ma...
Figure 3: Mean metric change across all evaluated methods on WILDTRACK under the ...
Figure 4: Distribution of pairwise scores before and after the Sinkhorn-based post...
Figure 5: Temperature sensitivity for Self-MVA on WILDTRACK under the test-to-test ...
Original abstract

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims a fundamental mismatch between pairwise ranking metrics (AP, FPR-95) commonly used to evaluate multi-view object association models and the underlying constrained one-to-one assignment objective. It provides theoretical constructions demonstrating that AP/FPR-95 can remain imperfect for already-correct assignments (with Sinkhorn normalization able to perfect the metrics without changing the assignment) and that optimal pairwise ranking can still produce incorrect assignments. Empirically, a Sinkhorn-based post-processing stress test is used to show that optimizing a few parameters can significantly improve ranking metrics without corresponding gains in assignment-level metrics such as ACC and IPAA.

Significance. If the mismatch generalizes beyond constructed matrices to the structured similarity matrices produced by real multi-view embedding models, the result would indicate that current evaluation practices are misaligned with the task objective and could be driving suboptimal model development in multi-camera perception. The controlled post-processing experiment is a methodological strength for isolating metric effects.

major comments (2)
  1. [Empirical validation / stress test] The theoretical counter-examples rely on constructed similarity matrices; no measurement or argument is provided showing that the reported mismatch structures (or the effect of Sinkhorn on them) arise in the non-uniform similarity matrices generated by actual feature embeddings from multi-view detectors (see the empirical validation section and the stress-test description).
  2. [Empirical validation] The Sinkhorn post-processing is presented as isolating metric mismatch, but the experiment does not include controls or analysis confirming that the observed AP/FPR-95 gains occur without confounding changes to the underlying pairwise similarities or model outputs (see the description of the post-processing parameters and results on ACC/IPAA).
minor comments (1)
  1. Notation for the assignment problem and the ranking metrics could be introduced more explicitly with a small running example to aid readability.
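
One possible form of the notation this comment asks for is sketched below. The symbols ($S$, $\pi$, $u$, $v$, $\tau$) are ours and are not taken from the paper; the sketch assumes a square $n \times n$ similarity matrix and a Sinkhorn output written as a row and column scaling of $\exp(S/\tau)$.

```latex
% Assignment objective over permutations \pi of {1, ..., n}
\[
  \pi^\star = \arg\max_{\pi \in \Pi_n} \sum_{i=1}^{n} S_{i,\pi(i)}
\]

% Sinkhorn output with positive scaling vectors u, v and temperature \tau
\[
  P_{ij} = u_i \, \exp\!\left(S_{ij}/\tau\right) v_j
\]

% Log-score of any permutation: the raw score plus a constant
\[
  \sum_{i=1}^{n} \log P_{i,\pi(i)}
    = \frac{1}{\tau} \sum_{i=1}^{n} S_{i,\pi(i)}
    + \sum_{i=1}^{n} \log u_i
    + \sum_{j=1}^{n} \log v_j
\]
```

The last two sums do not depend on $\pi$, so the maximizing permutation of the log-scores is the same before and after scaling. A pairwise ranking metric, by contrast, compares $P_{ij}$ across different rows and columns, where the factors $u_i v_j$ differ, so its value can change.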

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below, clarifying the role of the empirical stress test on real data while acknowledging where additional details could strengthen the presentation.

  1. Referee: [Empirical validation / stress test] The theoretical counter-examples rely on constructed similarity matrices; no measurement or argument is provided showing that the reported mismatch structures (or the effect of Sinkhorn on them) arise in the non-uniform similarity matrices generated by actual feature embeddings from multi-view detectors (see the empirical validation section and the stress-test description).

    Authors: The theoretical constructions use synthetic matrices solely to prove the mathematical possibility of the mismatch. The empirical validation section applies the identical Sinkhorn post-processing directly to the non-uniform similarity matrices produced by trained multi-view embedding models on standard real-world datasets. The observed outcome—that a few optimized parameters produce large gains in AP and FPR-95 while ACC and IPAA remain unchanged—constitutes direct evidence that the mismatch structures are exploitable in the matrices arising from actual detectors. We are prepared to add a short quantitative comparison of matrix statistics (e.g., row-sum variation, off-diagonal concentration) before and after normalization in a revised version. revision: partial

  2. Referee: [Empirical validation] The Sinkhorn post-processing is presented as isolating metric mismatch, but the experiment does not include controls or analysis confirming that the observed AP/FPR-95 gains occur without confounding changes to the underlying pairwise similarities or model outputs (see the description of the post-processing parameters and results on ACC/IPAA).

    Authors: The procedure modifies only the output similarity matrix through Sinkhorn normalization; the underlying feature embeddings and model weights are untouched. Isolation is achieved by holding the model fixed and reporting that assignment-level metrics (ACC, IPAA) are invariant while ranking metrics improve. Because the theoretical section proves that Sinkhorn can perfect ranking scores without altering the optimal assignment, the empirical stability of ACC/IPAA confirms that no confounding change to the assignment occurs. The optimized parameters are strictly those of the normalization routine. If desired, we can include an auxiliary experiment comparing Sinkhorn to random score perturbations of matched magnitude. revision: partial
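
The two additions offered in these responses, matrix statistics before and after normalization and a random control of matched magnitude, could look roughly like the following sketch. Every definition in it is an assumption, since the page names these quantities without defining them: row-sum variation is taken as the population standard deviation of row sums, off-diagonal concentration as the share of total mass off the diagonal, and "matched magnitude" as equal root-mean-square change with Gaussian noise. `S_post` is a hand-written stand-in, rounded from roughly what a Sinkhorn step at temperature 0.1 gives for the hypothetical `S`.

```python
import math
import random
from statistics import pstdev

def row_sum_variation(M):
    """Population standard deviation of the row sums."""
    return pstdev([sum(row) for row in M])

def off_diagonal_share(M):
    """Share of total mass off the diagonal (true matches assumed on the diagonal)."""
    total = sum(sum(row) for row in M)
    diag = sum(M[i][i] for i in range(min(len(M), len(M[0]))))
    return (total - diag) / total

def rms_change(A, B):
    """Root-mean-square difference between two equally shaped matrices."""
    diffs = [a - b for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def matched_random_perturbation(S, S_post, seed=0):
    """Gaussian noise added to S, rescaled to the RMS size of S_post - S."""
    rng = random.Random(seed)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in S]
    zeros = [[0.0] * len(row) for row in S]
    scale = rms_change(S_post, S) / rms_change(noise, zeros)
    return [[s + scale * e for s, e in zip(rs, rn)] for rs, rn in zip(S, noise)]

# Hypothetical raw scores and a stand-in for their Sinkhorn-normalized version.
S = [[0.9, 0.6], [0.2, 0.4]]
S_post = [[0.924, 0.076], [0.076, 0.924]]
S_rand = matched_random_perturbation(S, S_post)

for name, M in (("raw", S), ("sinkhorn", S_post)):
    print(name, round(row_sum_variation(M), 3), round(off_diagonal_share(M), 3))
print(round(rms_change(S_post, S), 6), round(rms_change(S_rand, S), 6))
```

Feeding both `S_post` and `S_rand` to the same ranking and assignment metrics would then indicate whether any gain in AP or FPR-95 comes from the structure of the normalization or simply from perturbing the scores by that amount.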

Circularity Check

0 steps flagged

No circularity; theoretical constructions and post-processing experiment are independent

The paper's core argument rests on explicit construction of similarity matrices that separate ranking metrics (AP/FPR-95) from assignment correctness, plus a controlled Sinkhorn post-processing experiment that improves the former without the latter. No step reduces to a self-citation chain, a fitted parameter renamed as a prediction, or a quantity defined in terms of the target result. The derivation is self-contained against the stated assumptions and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5685 in / 1002 out tokens · 26705 ms · 2026-06-28T15:40:19.552202+00:00 · methodology

