Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain
Pith reviewed 2026-05-08 12:34 UTC · model grok-4.3
The pith
The railway domain now has its first dedicated perception benchmark suite, RAIL-BENCH: five challenges with curated datasets, public scoreboards, and a new LineAP metric for rail track detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAIL-BENCH is the first perception benchmark suite for the railway domain, consisting of five challenges each with curated training and test datasets from diverse real-world scenarios, along with evaluation metrics and public scoreboards. The rail track detection challenge introduces LineAP, a segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, overcoming limitations of prior line detection metrics.
What carries the argument
RAIL-BENCH benchmark suite and the LineAP metric, which evaluates polyline predictions for rail tracks based on segment-level geometric accuracy rather than instance matching.
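The paper's formal definition of LineAP is not reproduced here, but its stated properties (segment-based average precision, geometric accuracy, independence from instance-level grouping) pin down the general shape of such a metric. A minimal sketch of that idea follows; the functions `sample_polyline` and `line_ap`, the 2-pixel tolerance, and the greedy nearest-point matching are all illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

def sample_polyline(points, step=1.0):
    """Resample a polyline (an (N, 2) array) into points spaced ~step apart."""
    pts = np.asarray(points, dtype=float)
    seg = np.diff(pts, axis=0)
    dist = np.hypot(seg[:, 0], seg[:, 1])
    t = np.concatenate([[0.0], np.cumsum(dist)])  # arc length at each vertex
    s = np.arange(0.0, t[-1], step)
    return np.stack([np.interp(s, t, pts[:, 0]),
                     np.interp(s, t, pts[:, 1])], axis=1)

def line_ap(preds, gts, tol=2.0, step=1.0):
    """Hypothetical segment-level AP: pool sampled points from all predicted
    polylines (each polyline carries a confidence), greedily match each point
    to the nearest unmatched ground-truth point within `tol` pixels, and
    integrate precision over recall. Instance grouping plays no role here:
    points are matched individually, regardless of which polyline they
    came from."""
    gt_pts = np.concatenate([sample_polyline(g, step) for g in gts])
    samples = []  # (confidence, point) for every predicted sample point
    for poly, conf in preds:
        for p in sample_polyline(poly, step):
            samples.append((conf, p))
    samples.sort(key=lambda s: -s[0])  # descend by confidence, as in AP

    matched = np.zeros(len(gt_pts), dtype=bool)
    tp, fp, prec, rec = 0, 0, [], []
    for conf, p in samples:
        d = np.hypot(*(gt_pts - p).T)
        d[matched] = np.inf  # each ground-truth point may match once
        j = int(np.argmin(d))
        if d[j] <= tol:
            matched[j] = True
            tp += 1
        else:
            fp += 1
        prec.append(tp / (tp + fp))
        rec.append(tp / len(gt_pts))

    ap, prev_r = 0.0, 0.0  # step-wise integration of the PR curve
    for p_, r_ in zip(prec, rec):
        ap += p_ * (r_ - prev_r)
        prev_r = r_
    return ap
```

Under this reading, a prediction that traces the ground truth exactly scores 1.0 whatever the instance split, which is what "independent of instance-level grouping" would buy.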
If this is right
- Standardized datasets and metrics will allow direct comparison of different perception algorithms for railways.
- LineAP will provide a more reliable way to assess track detection performance by focusing on geometric fidelity.
- The five challenges cover key perception tasks needed for automated train operations.
- Public scoreboards will promote transparency and reproducibility in railway AI research.
Where Pith is reading between the lines
- Adoption of RAIL-BENCH could highlight the need for domain-specific adaptations in vision models trained on general datasets.
- Success on these benchmarks might predict performance in long-distance, high-speed railway scenarios better than success on urban road benchmarks does.
- Extending the benchmark to include additional railway-specific conditions like tunnels or extreme weather could further improve its utility.
Load-bearing premise
The curated datasets from diverse real-world scenarios are representative of the perception challenges that deployed railway systems will actually face.
What would settle it
A perception model that scores highly on all RAIL-BENCH challenges but performs poorly when tested on additional real railway footage not included in the benchmark datasets.
Original abstract
Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (https://www.mrt.kit.edu/railbench). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents RAIL-BENCH as the first public benchmark suite for camera-based perception tasks in the railway domain. It defines five challenges (rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry), supplies curated training/test datasets from diverse real-world scenarios, introduces the LineAP metric for evaluating polyline predictions in track detection, and provides public scoreboards at https://www.mrt.kit.edu/railbench.
Significance. If the datasets prove representative of operational railway conditions and LineAP is shown to be a valid improvement, the benchmark would fill a documented gap in standardized, reproducible evaluation for safety-critical railway perception, analogous to the role of KITTI or BDD100K in autonomous driving. The public leaderboards and focus on railway-specific characteristics (e.g., track geometry) are concrete strengths that could accelerate community progress.
major comments (2)
- [Section 3] Dataset description (Section 3): the claim that the curated datasets are drawn from 'diverse real-world scenarios' and will support transfer to deployed systems is not backed by quantitative coverage statistics (scene counts per lighting/weather/curvature/sensor-degradation regime, label distributions, or comparison to operational railway data). This directly affects the central claim that results on RAIL-BENCH will advance robust perception for automated train operation.
- [Section 4.1] Rail track detection challenge and LineAP definition (Section 4.1): the manuscript introduces LineAP as a segment-based average precision metric that evaluates geometric accuracy independently of instance grouping, yet provides no validation (correlation with existing metrics, error analysis on sample predictions, or ablation showing it resolves the stated limitations of prior line-detection metrics). Without such evidence the novelty and utility of the new metric remain unestablished.
minor comments (2)
- [Abstract] The abstract lists the five challenges but the full manuscript should ensure that the challenge ordering and metric definitions are presented with consistent notation and cross-references to the public scoreboard.
- [Figures] Figure captions for dataset examples and metric illustrations could be expanded to include explicit definitions of the polyline segments used by LineAP.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of RAIL-BENCH's potential impact and for the constructive feedback. We address the two major comments below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Section 3] Dataset description (Section 3): the claim that the curated datasets are drawn from 'diverse real-world scenarios' and will support transfer to deployed systems is not backed by quantitative coverage statistics (scene counts per lighting/weather/curvature/sensor-degradation regime, label distributions, or comparison to operational railway data). This directly affects the central claim that results on RAIL-BENCH will advance robust perception for automated train operation.
Authors: We agree that including quantitative coverage statistics would provide stronger support for the diversity claims. In the revised manuscript, we will expand Section 3 with a dedicated subsection or table detailing scene counts across different lighting conditions, weather regimes, track curvatures, and sensor degradation levels. We will also include label distributions and, where possible, comparisons to operational railway data. This addition will better justify the transferability to deployed systems.
revision: yes
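The coverage statistics promised here could be as simple as per-axis frame tallies. A hedged sketch of that bookkeeping; the attribute names (`lighting`, `weather`, `curvature`) and values are invented for illustration, not taken from RAIL-BENCH metadata:

```python
from collections import Counter

# Hypothetical per-frame metadata; field names are illustrative,
# not drawn from the RAIL-BENCH release.
frames = [
    {"lighting": "day",   "weather": "clear", "curvature": "straight"},
    {"lighting": "day",   "weather": "rain",  "curvature": "curve"},
    {"lighting": "night", "weather": "clear", "curvature": "straight"},
]

def coverage_table(frames, axes=("lighting", "weather", "curvature")):
    """Count frames per regime along each axis, as the referee requests."""
    return {axis: Counter(f[axis] for f in frames) for axis in axes}

print(coverage_table(frames))
```

A table like this makes the "diverse real-world scenarios" claim auditable: a reader can see at a glance whether, say, night or rain frames are vanishingly rare.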
Referee: [Section 4.1] Rail track detection challenge and LineAP definition (Section 4.1): the manuscript introduces LineAP as a segment-based average precision metric that evaluates geometric accuracy independently of instance grouping, yet provides no validation (correlation with existing metrics, error analysis on sample predictions, or ablation showing it resolves the stated limitations of prior line-detection metrics). Without such evidence the novelty and utility of the new metric remain unestablished.
Authors: We recognize that empirical validation is important to establish the novelty and utility of LineAP. In the revised version, we will add a validation analysis in Section 4.1, including: correlation coefficients between LineAP and standard line detection metrics on the provided datasets, qualitative error analysis with sample predictions illustrating the advantages, and an ablation study comparing LineAP to prior metrics to show how it resolves issues like dependence on instance grouping. These additions will substantiate the metric's improvements.
revision: yes
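One of the promised validation steps, rank correlation between LineAP and a prior line-detection metric across a set of models, reduces to a plain Spearman computation. A self-contained sketch (the score lists below are invented placeholders, not RAIL-BENCH results, and the helper assumes no tied scores):

```python
def spearman(xs, ys):
    """Spearman rank correlation between two metric score lists
    (one score per model), assuming no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative scores for five hypothetical models under LineAP
# versus a prior metric; a high correlation would show agreement,
# while disagreements flag the cases LineAP claims to handle better.
lineap = [0.41, 0.55, 0.62, 0.70, 0.83]
prior  = [0.38, 0.51, 0.66, 0.64, 0.80]
print(spearman(lineap, prior))
```

A correlation near 1 with targeted disagreements on grouping-heavy cases would be the cleanest evidence for the rebuttal's claim.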
Circularity Check
No significant circularity; benchmark release with independent metric definition
Rationale
The paper introduces RAIL-BENCH as a new benchmark suite comprising five challenges, curated datasets from real-world scenarios, and a novel LineAP metric for rail track detection. No derivations, equations, fitted parameters, or predictions are present that reduce to inputs by construction. The central claims rest on the provision of public datasets, evaluation protocols, and scoreboards rather than any self-referential or self-citational chain. LineAP is explicitly positioned as addressing limitations of existing metrics, with no evidence of smuggling ansatzes or renaming known results via self-citation. The representativeness of datasets is an unvalidated assumption but does not constitute circularity under the defined patterns.
Axiom & Free-Parameter Ledger
invented entities (1)
- LineAP metric (no independent evidence)
Reference graph
Works this paper leans on
- [1] R. Tagiew, D. Leinhos, H. von der Haar, C. Klotz, D. Sprute, J. Ziehn, A. Schmelter, S. Witte, and P. Klasek, "Sensor system for development of perception systems for ATO," Discover Artif. Intell., vol. 3, no. 22, 2023.
- [2] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in 2012 IEEE Conf. on Comput. Vis. and Pattern Recognit., 2012, pp. 3354–3361.
- [3] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov, "Scalability in perception for autonomous driving: Waymo open dataset," in Proc. of the IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 2020.
- [4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in Proc. of the IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit., 2020, pp. 11621–11631.
- [5] O. Zendel, M. Murschitz, M. Zeilinger, D. Steininger, S. Abbasi, and C. Beleznai, "RailSem19: A dataset for semantic rail scene understanding," in 2019 IEEE/CVF Conf. on Comput. Vis. and Pattern Recognit. Workshops, 2019, pp. 1221–1229.
- [6] A. Zouaoui, A. Mahtani, M. A. Hadded, S. Ambellouis, J. Boonaert, and H. Wannous, "RailSet: A unique dataset for railway anomaly detection," in 2022 IEEE 5th Int. Conf. on Image Process. Appl. and Syst. (IPAS), 2022, pp. 1–6.
- [7] X. Li and X. Peng, "Rail detection: An efficient row-based network and a new benchmark," in Proc. of the 30th ACM Int. Conf. on Multimedia, 2022, pp. 6455–6463.
- [8] T. Hiebert, F. Hofstetter, C. Thomas, S. Mushtaq, E. Kaan, and B. Parameswaran, "Labels4Rails: A railway image annotation tool and associated reference dataset," Data, vol. 10, no. 12, 2025.
- [9] J. Harb, N. Rébéna, R. Chosidow, G. Roblin, R. Potarusov, and H. Hajri, "FRSign: A large-scale traffic light dataset for autonomous trains," 2020, arXiv:2002.05665.
- [10] P. Leibner, F. Hampel, and C. Schindler, "GERALD: A novel dataset for the detection of German mainline railway signals," Proc. of the Inst. of Mech. Engineers, Part F: J. of Rail and Rapid Transit, vol. 237, no. 10, pp. 1332–1342, 2023.
- [11] R. Tagiew, I. Wunderlich, M. Sastuba, K. Göller, and S. Seitz, "RailGoerl24: Görlitz rail test center CV dataset 2024," in 2025 IEEE Eng. Reliable Auton. Syst. (ERAS), 2025, pp. 1–4.
- [12] R. Khemmar, A. Mauri, C. Dulompont, J. Gajula, V. Vauchey, M. Haddad, and R. Boutteau, "Road and railway smart mobility: A high-definition ground truth hybrid dataset," Sensors, vol. 22, no. 10, p. 3922, 2022.
- [13] Y. Chen, N. Zhu, Q. Wu, C. Wu, W. Niu, and Y. Wang, "MRSI: A multimodal proximity remote sensing data set for environment perception in rail transit," Int. J. of Intell. Syst., vol. 37, no. 9, pp. 5530–5556, 2022.
- [14] R. Tagiew, M. Köppel, K. Schwalbe, P. Denzler, P. Neumaier, T. Klockau, M. Boekhoff, P. Klasek, and R. Tilly, "OSDaR23: Open sensor data for rail 2023," in 2023 8th Int. Conf. on Robot. and Automat. Eng. (ICRAE), 2023, pp. 270–276.
- [15] P. Klasek, S.-Y. Ham, K. Schwalbe, J. Schloßhauer, H. Wakkaf, H. Grad, P. Neumaier, Z. Ilknur-Öz, T. Cronauer, and M. Köppel, "OSDaR26: Open sensor data for rail 2026," submitted to IEEE/RSJ Int. Conf. on Intell. Robots and Syst., 2026.
- [16] N. Sünderhauf, P. Neubert, and P. Protzel, "Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons," in Proc. ICRA Workshop on Long-Term Autonomy, 2013.
- [17] CVAT.ai, "CVAT: Computer vision annotation tool," software, https://github.com/cvat-ai/cvat, 2026.
- [18] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," 2021, arXiv:2105.15203.
- [19] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in 2016 IEEE Conf. on Comput. Vis. and Pattern Recognit., 2016, pp. 3213–3223.
- [20] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes challenge: A retrospective," Int. J. of Comput. Vis., vol. 111, no. 1, pp. 98–136, 2015.
- [21] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, "HOTA: A higher order metric for evaluating multi-object tracking," Int. J. of Comput. Vis., vol. 129, no. 2, pp. 548–578, 2021.
- [22] S. Umeyama, "Least-squares estimation of transformation parameters between two point patterns," IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 13, no. 4, pp. 376–380, 1991.
- [23] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in 2012 IEEE/RSJ Int. Conf. on Intell. Robots and Syst., 2012, pp. 573–580.
- [24] CodaLab Competitions, "COCO detection challenge (bounding box)," https://competitions.codalab.org/competitions/20794.
- [25] Q. Li, Y. Wang, Y. Wang, and H. Zhao, "HDMapNet: An online HD map construction and evaluation framework," 2022, arXiv:2107.06307.
- [26] R. Padilla, W. L. Passos, T. L. B. Dias, S. L. Netto, and E. A. B. da Silva, "A comparative analysis of object detection metrics with a companion open-source toolkit," Electronics, vol. 10, no. 3, p. 279, 2021.
- [27] S. Yoo, H. Lee, H. Myeong, S. Yun, H. Park, J. Cho, and D. H. Kim, "End-to-end lane marker detection via row-wise classification," 2020, arXiv:2005.08630.
- [28] X. Pan, X. Zhan, J. Shi, P. Luo, X. Wang, and X. Tang, "Spatial as deep: Spatial CNN for traffic scene understanding," 2017, arXiv:1712.06080.
- [29] G. Jocher and J. Qiu, "Ultralytics YOLO11," 2024, version 11.0.0. Available: https://github.com/ultralytics/ultralytics.
- [30] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, "YOLO-World: Real-time open-vocabulary object detection," 2024, arXiv:2401.17270.
- [31] Y. Ko, Y. Lee, S. Azam, F. Munir, M. Jeon, and W. Pedrycz, "Key points estimation and point instance segmentation approach for lane detection," IEEE Trans. on Intell. Transp. Syst., vol. 23, no. 7, pp. 8949–8958, 2022.
- [32] A. Meyer, P. Skudlik, J.-H. Pauls, and C. Stiller, "YOLinO: Generic single shot polyline detection in real time," in Proc. of the IEEE/CVF Int. Conf. on Comput. Vis. Workshops, 2021, pp. 2916–2925.