pith. sign in

arxiv: 2606.03568 · v1 · pith:TJVJGQS5new · submitted 2026-06-02 · 💻 cs.CV · cs.AI· cs.LG· cs.RO

Learned Non-Maximum Suppression for 3D Object Detection

Pith reviewed 2026-06-28 10:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LGcs.RO
keywords learned nms3d object detectionlidartransformer attentionmessage passingnuscenespost-processingnon-maximum suppression
0
0 comments X

The pith

Two learned modules replace heuristic NMS and raise mAP plus NDS on nuScenes 3D detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces D2D-Rescore and GossipNet3D as replacements for traditional non-maximum suppression in LiDAR 3D object detection. These modules use detection-to-detection relations via transformer attention and localized bird's-eye-view message passing to filter overlapping proposals. A metric-aware matching process keeps training aligned with the nuScenes evaluation protocol. The learned filters lift mean average precision, nuScenes detection score, and true-positive quality over CircleNMS, with the biggest gains on small and infrequent classes and only minor added cost. The work shows that post-processing can be learned to improve detector output without altering the base network.

Core claim

D2D-Rescore employs transformer-based detection-to-detection attention while GossipNet3D adapts localized message passing to three dimensions; both modules, trained with metric-aware matching, outperform CircleNMS on mAP, NDS, and true-positive metrics, especially for rare object classes, while adding negligible computation.

What carries the argument

Transformer attention across detections and localized message passing in bird's-eye view for rescor ing and suppressing 3D proposals.

If this is right

  • Detection scores rise without any change to the underlying 3D detector architecture.
  • Small and infrequent object classes receive the largest accuracy gains.
  • Added runtime cost stays minimal compared with standard CircleNMS.
  • Training remains aligned with final evaluation through the shared matching rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering approach could be attached to other 3D detectors that output dense proposals.
  • Adapting the matching strategy might allow the modules to improve results on benchmarks that use different overlap criteria.
  • Reduced false positives from better suppression could directly lower collision risk in downstream planning modules.

Load-bearing premise

The metric-aware matching strategy keeps training and validation behavior consistent without introducing bias toward the specific benchmark protocol.

What would settle it

An experiment on a different dataset or with a mismatched evaluation metric in which the learned modules produce equal or lower scores than CircleNMS would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2606.03568 by Stefan Sch\"utte, Timo Osterburg, Torsten Bertram.

Figure 1
Figure 1. Figure 1: Overview of the post-processing pipelines: classical filtering (I) uses score thresholding and non-maximum [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Precision-recall curves of D2D-Rescore (blue), [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
read the original abstract

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes two learned post-processing modules to replace heuristic CircleNMS in LiDAR-based 3D object detection: D2D-Rescore, which uses transformer-based detection-to-detection attention, and GossipNet3D, which adapts 2D GossipNet via localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol is used during training. The manuscript reports that both modules improve mAP, NDS, and true-positive quality metrics over CircleNMS, with larger gains on small and infrequent classes and only minimal added compute; code is released.

Significance. If the reported gains prove robust, the work supplies a general, detector-agnostic route to better suppression that does not require retraining the base network. The public code release is a clear strength that aids reproducibility.

major comments (1)
  1. [Abstract and §4] Abstract and §4 (experiments): the central empirical claim of consistent mAP/NDS/TP-quality gains rests on training with a metric-aware matcher that deliberately mirrors the nuScenes evaluation protocol. No ablation is described that retrains the same modules with a protocol-agnostic matcher (e.g., standard 3D IoU). Without this control, it remains possible that part of the reported improvement is an artifact of metric alignment rather than a genuine advance in learned suppression; this directly affects the interpretation of the headline results, especially the gains on small/infrequent classes.
minor comments (1)
  1. [Abstract] The abstract states that both modules add “minimal computational overhead” but supplies no concrete latency or FLOPs numbers; a table or sentence in §4.3 would make the overhead claim verifiable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the role of the metric-aware matcher. We address this point directly below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (experiments): the central empirical claim of consistent mAP/NDS/TP-quality gains rests on training with a metric-aware matcher that deliberately mirrors the nuScenes evaluation protocol. No ablation is described that retrains the same modules with a protocol-agnostic matcher (e.g., standard 3D IoU). Without this control, it remains possible that part of the reported improvement is an artifact of metric alignment rather than a genuine advance in learned suppression; this directly affects the interpretation of the headline results, especially the gains on small/infrequent classes.

    Authors: We agree that an explicit control experiment would strengthen the interpretation of the results. The metric-aware matcher was introduced to ensure training and evaluation operate under identical assignment criteria, which is a deliberate design choice to avoid train-test mismatch in the suppression objective. Nevertheless, the referee is correct that this leaves open the possibility that some gains arise from the alignment itself rather than from the learned D2D-Rescore or GossipNet3D modules. In the revised manuscript we will add an ablation that retrains both modules using a standard 3D IoU matcher (with the same hyperparameters otherwise) and report the resulting mAP, NDS, and TP metrics on nuScenes. This will allow readers to isolate the contribution of the learned suppression from the effect of metric alignment, particularly for the small and rare classes highlighted in the paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on independent training and evaluation

full rationale

The paper reports empirical mAP/NDS/TP-quality improvements from two learned post-processing modules (D2D-Rescore, GossipNet3D) versus CircleNMS on nuScenes. The metric-aware matching is presented as a training design choice aligned with the benchmark protocol to ensure consistent behavior; it does not appear in any equation that reduces a reported gain to a quantity fitted inside the same experiment. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The central claims are therefore self-contained against external benchmarks and receive the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training details, or model specifications; therefore no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5727 in / 976 out tokens · 20572 ms · 2026-06-28T10:52:15.989760+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 2 canonical work pages

  1. [1]

    Center-based 3d object detection and tracking,

    T. Yin, X. Zhou, and P. Krähenbühl, “Center-based 3d object detection and tracking,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021

  2. [2]

    Pointpillars: Fast encoders for object detection from point clouds,

    A. H. Lang et al., “Pointpillars: Fast encoders for object detection from point clouds,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019

  3. [3]

    Dsvt: Dynamic sparse voxel trans- former with rotated sets,

    H. Wang et al., “Dsvt: Dynamic sparse voxel trans- former with rotated sets,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  4. [4]

    Soft-nms–improving object detection with one line of code,

    N. Bodla et al., “Soft-nms–improving object detection with one line of code,” inProceedings of the IEEE international conference on computer vision, 2017

  5. [5]

    End-to-end object detection with transformers,

    N. Carion et al., “End-to-end object detection with transformers,” inEuropean conference on computer vision, Springer, 2020

  6. [6]

    Detr3d: 3d object detection from multi- view images via 3d-to-2d queries,

    Y . Wang et al., “Detr3d: 3d object detection from multi- view images via 3d-to-2d queries,” inConference on robot learning, PMLR, 2022

  7. [7]

    Li3detr: A lidar based 3d detection transformer,

    G. K. Erabati and H. Araujo, “Li3detr: A lidar based 3d detection transformer,” inProceedings of the IEEE/CVF Winter conference on applications of computer vision, 2023

  8. [8]

    Learning non- maximum suppression,

    J. Hosang, R. Benenson, and B. Schiele, “Learning non- maximum suppression,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017

  9. [9]

    Relation networks for object detection,

    H. Hu et al., “Relation networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018

  10. [10]

    End-to-end single shot detector using graph-based learnable duplicate removal,

    S. Ding et al., “End-to-end single shot detector using graph-based learnable duplicate removal,” inDAGM German Conference on Pattern Recognition, Springer, 2022

  11. [11]

    Internimage: Exploring large-scale vision foundation models with deformable convolu- tions,

    W. Wang et al., “Internimage: Exploring large-scale vision foundation models with deformable convolu- tions,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

  12. [12]

    Rtmdet: An empirical study of designing real- time object detectors

    C. Lyu et al., “Rtmdet: An empirical study of de- signing real-time object detectors,”arXiv preprint arXiv:2212.07784, 2022

  13. [13]

    Neural attention-driven non- maximum suppression for person detection,

    C. Symeonidis et al., “Neural attention-driven non- maximum suppression for person detection,”IEEE transactions on image processing, vol. 32, 2023

  14. [14]

    Petrv2: A unified framework for 3d perception from multi-camera images,

    Y . Liu et al., “Petrv2: A unified framework for 3d perception from multi-camera images,” inProceedings of the IEEE/CVF international conference on computer vision, 2023

  15. [15]

    Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspec- tive supervision,

    C. Yang et al., “Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspec- tive supervision,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023

  16. [16]

    Nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar et al., “Nuscenes: A multimodal dataset for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020

  17. [17]

    Visibility guided nms: Efficient boosting of amodal object detection in crowded traffic scenes,

    N. Gählert et al., “Visibility guided nms: Efficient boosting of amodal object detection in crowded traffic scenes,”arXiv preprint arXiv:2006.08547, 2020

  18. [18]

    Fourier features let networks learn high frequency functions in low dimensional domains,

    M. Tancik et al., “Fourier features let networks learn high frequency functions in low dimensional domains,” Advances in neural information processing systems, vol. 33, 2020