arxiv: 2605.11674 · v1 · submitted 2026-05-12 · 💻 cs.RO


A Proprioceptive-Only Benchmark for Quadruped State Estimation: ATE, RPE, and Runtime Trade-offs Between Filters and Smoothers

Ylenia Nisticò, João Carlos Virgolino Soares, Joan Solà, Claudio Semini


Pith reviewed 2026-05-13 00:58 UTC · model grok-4.3


keywords: quadruped state estimation, proprioceptive sensors, benchmark, absolute trajectory error, relative pose error, invariant extended Kalman filter, invariant smoother


The pith

IEKF and invariant smoother achieve lower long-term trajectory error than MUSE on proprioceptive quadruped data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares three proprioceptive-only state estimators for quadruped robots (MUSE, the invariant extended Kalman filter, and the invariant smoother) on one sequence from the GrandTour dataset. It measures long-term accuracy with absolute trajectory error, short-term accuracy with relative pose error, and the time required for each update on identical hardware. A reader would care because these choices directly affect how well a legged robot can maintain its pose estimate during locomotion without external references like cameras or GPS. The results indicate that relative pose errors stay comparable across the three methods, that IEKF and the invariant smoother achieve lower absolute trajectory error than MUSE, and that per-update computational cost differs among all three. This gives concrete numbers for selecting an estimator that matches a given robot's accuracy needs and onboard processing limits.

Core claim

On the CYN-1 sequence, the relative pose errors remain broadly similar across MUSE, the invariant extended Kalman filter, and the invariant smoother. The invariant extended Kalman filter and invariant smoother produce lower absolute trajectory error than MUSE. Computation times per update differ among the three, making the accuracy versus latency trade-offs visible when all methods run on the same fixed hardware and software stack.
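For readers less familiar with the two accuracy metrics, the block below writes out the commonly used definitions from the RGB-D SLAM benchmark listed as reference [10]. It is a reading aid only: the symbols (Q_i for ground-truth poses, P_i for estimated poses, S for the alignment transform, Δ for the evaluation window) are assumptions introduced here, and whether the paper uses exactly these variants, alignment choices, or window lengths is not stated on this page.

```latex
% Q_i: ground-truth pose, P_i: estimated pose (both in SE(3)), i = 1..n
% S: rigid-body transform aligning the estimate to the ground truth
% \Delta: fixed window over which relative motion is compared, m = n - \Delta
\begin{align}
F_i &= Q_i^{-1}\, S\, P_i, &
\mathrm{ATE} &= \Bigl( \tfrac{1}{n} \sum_{i=1}^{n}
    \lVert \operatorname{trans}(F_i) \rVert^{2} \Bigr)^{1/2}, \\
E_i &= \bigl( Q_i^{-1} Q_{i+\Delta} \bigr)^{-1}
       \bigl( P_i^{-1} P_{i+\Delta} \bigr), &
\mathrm{RPE}_{\mathrm{trans}} &= \Bigl( \tfrac{1}{m} \sum_{i=1}^{m}
    \lVert \operatorname{trans}(E_i) \rVert^{2} \Bigr)^{1/2}.
\end{align}
% A rotational RPE can be formed analogously from the angle of rot(E_i):
% \theta_i = \arccos\bigl( (\operatorname{tr}(\operatorname{rot}(E_i)) - 1) / 2 \bigr).
```

Because ATE compares absolute poses, it accumulates drift over the whole trajectory, while RPE only sees the error made within each window of length Δ. This is why the two metrics can rank estimators differently.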

What carries the argument

Side-by-side reporting of absolute trajectory error, relative pose error, and per-update runtime for MUSE, IEKF, and IS on the same proprioceptive dataset sequence.
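As a rough illustration of what computing these metrics involves, here is a minimal NumPy sketch of translational ATE and translational/rotational RPE on time-matched 4x4 pose arrays. It is not the paper's evaluation code (the released repository is the authority there); the function names `make_pose`, `ate_rmse`, and `rpe_rmse` are hypothetical, the trajectories are synthetic, and no alignment step is performed.

```python
import numpy as np


def make_pose(yaw, xyz):
    """Build a 4x4 homogeneous transform from a yaw angle (rad) and a position."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = xyz
    return T


def ate_rmse(gt, est):
    """Translational ATE: RMSE of position differences between matched poses.

    Assumes both (N, 4, 4) trajectories are already in the same frame;
    no alignment is applied in this sketch.
    """
    diff = gt[:, :3, 3] - est[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))


def rpe_rmse(gt, est, delta=1):
    """Translational and rotational RPE (RMSE) over an index offset `delta` >= 1."""
    # Relative motion over the window, for ground truth and for the estimate.
    gt_rel = np.linalg.inv(gt[:-delta]) @ gt[delta:]
    est_rel = np.linalg.inv(est[:-delta]) @ est[delta:]
    err = np.linalg.inv(gt_rel) @ est_rel
    trans = np.linalg.norm(err[:, :3, 3], axis=1)
    # Rotation angle recovered from the trace of the residual rotation.
    cos_angle = (np.trace(err[:, :3, :3], axis1=1, axis2=2) - 1.0) / 2.0
    rot = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return float(np.sqrt(np.mean(trans**2))), float(np.sqrt(np.mean(rot**2)))


# Synthetic example: the estimate overshoots each 1 m step by 1 %.
n = 101
gt = np.stack([make_pose(0.0, [float(i), 0.0, 0.0]) for i in range(n)])
est = np.stack([make_pose(0.0, [1.01 * i, 0.0, 0.0]) for i in range(n)])

ate = ate_rmse(gt, est)
rpe_t, rpe_r = rpe_rmse(gt, est, delta=1)
print(f"ATE = {ate:.4f} m, RPE_trans = {rpe_t:.4f} m, RPE_rot = {rpe_r:.2e} rad")
```

In this toy case every step carries the same 1 cm error, so the per-step RPE stays at 1 cm while the ATE grows with trajectory length.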

If this is right

Applications that prioritize low long-term drift without external corrections can favor IEKF or the invariant smoother over MUSE.
When only short-horizon accuracy matters, any of the three methods may suffice since their relative pose errors are similar.
The measured runtimes allow direct selection of an estimator whose speed fits a robot's real-time control loop constraints.
Releasing the full evaluation code makes it possible for others to rerun the comparison on new hardware or additional sequences.
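To make the runtime point concrete, the sketch below times each call of an update function and checks the worst case against a per-update budget. Everything in it is an assumption for illustration: `time_updates` and `dummy_update` are hypothetical names, the workload is a placeholder rather than any of the three estimators, and the budget value is arbitrary.

```python
import statistics
import time


def time_updates(update_fn, measurements, budget_s):
    """Time each call to `update_fn` and compare against a per-update budget.

    `update_fn` is a hypothetical stand-in for one estimator update step.
    """
    durations = []
    for z in measurements:
        start = time.perf_counter()
        update_fn(z)
        durations.append(time.perf_counter() - start)
    ordered = sorted(durations)
    # Simple empirical 99th percentile by index into the sorted durations.
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {
        "mean_s": statistics.fmean(durations),
        "p99_s": p99,
        "max_s": ordered[-1],
        "fits_budget": ordered[-1] <= budget_s,
    }


def dummy_update(z):
    """Placeholder workload; replace with a real estimator step."""
    return sum(x * x for x in z)


# Arbitrary synthetic inputs and a deliberately generous 1 s budget.
measurements = [[float(i), 1.0, 2.0] for i in range(1000)]
report = time_updates(dummy_update, measurements, budget_s=1.0)
print(report)
```

Reporting the tail (p99 or max) alongside the mean matters for control loops, since a single slow update can miss a deadline even when the average is comfortably within budget.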

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The pattern of similar short-term but differing long-term errors suggests that drift accumulation is the main differentiator, which could be tested by adding controlled disturbances to the dataset.
These results could inform whether hybrid filters that borrow drift-correction ideas from the smoother would narrow the absolute error gap without increasing runtime much.
Extending the benchmark to include sequences with faster gaits or uneven terrain would show whether the observed trade-offs hold under more dynamic conditions.
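The drift-accumulation reading can be illustrated with a toy one-dimensional simulation: two dead-reckoning estimators share the same per-step noise, but one adds a small constant bias. The numbers (`sigma`, `bias`, step size) are illustrative assumptions and have no connection to the paper's data or to any of the three estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps = 10_000
sigma = 0.01   # per-step noise in metres (illustrative value)
bias = 0.002   # per-step systematic error in metres (illustrative value)

true_steps = np.full(n_steps, 0.05)  # constant forward motion per step
noise = rng.normal(0.0, sigma, n_steps)

est_a_steps = true_steps + noise          # zero-mean error only
est_b_steps = true_steps + noise + bias   # same noise plus a small bias


def rmse(x):
    """Root-mean-square of an error signal."""
    return float(np.sqrt(np.mean(np.square(x))))


truth = np.cumsum(true_steps)
# Per-step (relative) error: nearly identical for both estimators.
rpe_a = rmse(est_a_steps - true_steps)
rpe_b = rmse(est_b_steps - true_steps)
# Integrated (absolute) error: the bias accumulates linearly over time.
ate_a = rmse(np.cumsum(est_a_steps) - truth)
ate_b = rmse(np.cumsum(est_b_steps) - truth)

print(f"RPE: A = {rpe_a:.4f} m, B = {rpe_b:.4f} m")
print(f"ATE: A = {ate_a:.3f} m, B = {ate_b:.3f} m")
```

A bias that is small relative to the per-step noise barely changes the short-horizon metric yet dominates the long-horizon one, which is one plausible mechanism behind similar RPE and differing ATE.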

Load-bearing premise

The single chosen sequence and the evaluation protocol produce a fair comparison of the three estimators without biases from data selection or metric implementation.

What would settle it

Running the identical code on the identical CYN-1 sequence and hardware and obtaining an absolute trajectory error for MUSE that is lower than or equal to that of IEKF and IS.

Figures

Figures reproduced from arXiv: 2605.11674 by Claudio Semini, Joan Solà, João Carlos Virgolino Soares, Ylenia Nisticò.

**Figure 1.** Comparison of the ground truth (GT) trajectory, position and orientation vs. the estimates obtained with MUSE, ...

**Figure 2.** (a) Comparison of ground-truth (GT) and estimated velocities from MUSE, IEKF, and IS. (b) Zoomed-in view of ...

**Figure 3.** Computation time per iteration for MUSE, IEKF, and ...

**Figure 4.** Convergence behavior with large initial orientation ...

**Figure 5.** Compact supplementary workflow: dataset processing, estimator execution, and metric evaluation.

read the original abstract

We compare three state-of-the-art proprioceptive state estimators for quadruped robots: MUSE [1], the Invariant Extended Kalman Filter (IEKF) [2], and the Invariant Smoother (IS) [3], on the CYN-1 sequence of the GrandTour Dataset [4]. Our goal is to give practitioners clear guidance on accuracy and computation time: we report long-term accuracy (Absolute Trajectory Error, ATE), short-term accuracy (translational and rotational Relative Pose Error, RPE), and per-update computation time on a fixed hardware/software stack. On this dataset, RPEs are broadly similar across methods, while IEKF and IS achieve a lower ATE than MUSE. Runtime results highlight the accuracy-latency trade-offs across the three approaches. In the discussion, we outline the evaluation choices used to ensure a fair comparison and analyze factors that influence short-horizon metrics. Overall, this study provides a concise snapshot of accuracy and cost, helping readers choose an estimator that fits their application constraints, with all evaluation code and documentation released open-source at https://github.com/iit-DLSLab/state_estimation_benchmark for full reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A clean benchmark giving new ATE, RPE, and runtime numbers for MUSE, IEKF, and IS on one quadruped sequence, with open code and explicit fairness steps.


This paper runs three existing proprioceptive estimators—MUSE, the invariant EKF, and the invariant smoother—on the CYN-1 sequence and reports ATE, RPE, and per-update times on fixed hardware. The headline result is that RPE stays similar across the methods while IEKF and IS show lower ATE than MUSE, with runtime numbers that make the accuracy-latency trade-offs concrete for practitioners. The code and scripts are released, which lets anyone reproduce the exact figures. The authors also spell out how they kept preprocessing, initial conditions, and covariance settings consistent, and they discuss what influences the short-horizon metrics. That level of transparency is better than most comparison papers deliver. The obvious limit is the single sequence. Nothing new is derived or invented, and there are no tests on other robots, terrains, or sensor suites, so the numbers are best treated as a snapshot rather than a general ranking. No load-bearing assumptions or circular claims appear in the evaluation. This is useful reading for anyone who has to pick or tune a proprioceptive estimator for a quadruped and wants verifiable numbers plus runnable code. It is worth sending to peer review because the execution is careful and the reproducibility package is solid, even though the scope stays narrow.

Referee Report

0 major / 2 minor

Summary. The paper compares three proprioceptive state estimators for quadruped robots—MUSE, the Invariant Extended Kalman Filter (IEKF), and the Invariant Smoother (IS)—on the CYN-1 sequence of the GrandTour Dataset. It reports Absolute Trajectory Error (ATE) for long-term accuracy, translational and rotational Relative Pose Error (RPE) for short-term accuracy, and per-update runtime on fixed hardware. The central claims are that RPE values are broadly similar across the three methods while IEKF and IS achieve lower ATE than MUSE, with runtime results illustrating accuracy-latency trade-offs; all evaluation code is released open-source.

Significance. If the results hold, the work supplies practitioners with actionable guidance on selecting among existing proprioceptive estimators under accuracy and latency constraints. The use of standard metrics (ATE, RPE), explicit discussion of fairness measures (consistent preprocessing, identical initial conditions, shared covariance tuning), and full release of scripts and documentation constitute a reproducible empirical contribution that is valuable in a field where such benchmarks are often incomplete or non-reproducible.

minor comments (2)

[Discussion] The paragraph addressing evaluation choices for fairness is helpful, but a short table summarizing the exact preprocessing steps, initial covariance values, and tuning parameters applied identically to MUSE, IEKF, and IS would make the fairness claim immediately verifiable without inspecting the repository.
[Results] The reported ATE and RPE numbers would benefit from an explicit statement of the number of runs or seeds used (if any) and of whether the single CYN-1 sequence was the only one processed; this does not affect the central comparison but improves clarity for readers wishing to replicate on additional sequences.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our benchmark study, the accurate summary of our claims, and the recommendation for minor revision. The significance section correctly identifies the value of our reproducible comparison using standard metrics and open-source code. Since the report lists no major comments, we have no points to address individually.

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark only


The manuscript is a direct empirical comparison of three pre-existing estimators (MUSE, IEKF, IS) on public CYN-1 data. It computes standard ATE/RPE/runtime metrics under shared preprocessing and tuning, with released code. No derivations, fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations appear. All claims rest on observable outputs from external dataset sequences rather than internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark study; the central claim rests on no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5532 in / 1052 out tokens · 33757 ms · 2026-05-13T00:58:03.158032+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Y. Nisticò, J. C. V. Soares, L. Amatucci, G. Fink, and C. Semini, “MUSE: A real-time multi-sensor state estimator for quadruped robots,” IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4620–4627, 2025, DOI: 10.1109/LRA.2025.3553047

[2] R. Hartley, M. Ghaffari, R. M. Eustice, and J. W. Grizzle, “Contact-aided invariant extended Kalman filtering for robot state estimation,” The International Journal of Robotics Research, vol. 39, no. 4, pp. 402–430, 2020, DOI: 10.1177/0278364919894385

[3] Z. Yoon, J.-H. Kim, and H.-W. Park, “Invariant smoother for legged robot state estimation with dynamic contact event information,” IEEE Transactions on Robotics, vol. 40, pp. 193–212, 2024, DOI: 10.1109/TRO.2023.3328202

[4] J. Frey, T. Tuna, F. Fu, K. Patterson, T. Xu, M. Fallon, C. Cadena, and M. Hutter, “GrandTour: A legged robotics dataset in the wild for multi-modal perception and state estimation,” arXiv preprint arXiv:2602.18164, 2026

[5] M. Bloesch, M. Hutter, M. A. Hoepflinger, S. Leutenegger, C. Gehring, C. D. Remy, and R. Siegwart, “State estimation for legged robots: consistent fusion of leg kinematics and IMU,” Robotics, vol. 17, pp. 17–24, 2013, DOI: 10.15607/RSS.2012.VIII.003

[6] M. Bloesch, M. Burri, H. Sommer, R. Siegwart, and M. Hutter, “The two-state implicit filter recursive estimation for mobile robots,” IEEE Robot. Autom. Lett., vol. 3, no. 1, pp. 573–580, 2018, DOI: 10.1109/LRA.2017.2776340

[7] G. Fink and C. Semini, “Proprioceptive sensor fusion for quadruped robot state estimation,” in 2020 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2020, pp. 10914–10920, DOI: 10.1109/IROS45743.2020.9341521

[8] V. Agrawal, S. Bertrand, R. Griffin, and F. Dellaert, “Proprioceptive state estimation of legged robots with kinematic chain modeling,” in 2022 IEEE-RAS 21st International Conference on Humanoid Robots (Humanoids), 2022, pp. 178–185, DOI: 10.1109/Humanoids53995.2022.10000099

[9] H. M. S. Santana, J. C. V. Soares, Y. Nisticò, M. A. Meggiolaro, and C. Semini, “Proprioceptive state estimation for quadruped robots using invariant Kalman filtering and scale-variant robust cost functions,” in 2024 IEEE-RAS Int. Conf. Humanoid Robots, 2024, pp. 213–220, DOI: 10.1109/Humanoids58906.2024.10769911

[10] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 573–580, DOI: 10.1109/IROS.2012.6385773

[11] H. F. Grip, T. I. Fossen, T. A. Johansen, and A. Saberi, “Globally exponentially stable attitude and gyro bias estimation with application to GNSS/INS integration,” Automatica, vol. 51, pp. 158–166, 2015, DOI: 10.1016/j.automatica.2014.10.076

[12] T. A. Johansen and T. I. Fossen, “The eXogenous Kalman filter (XKF),” International Journal of Control, vol. 90, no. 2, pp. 161–167, 2017, DOI: 10.1080/00207179.2016.1172390

[13] M. Grupp, “evo: Python package for the evaluation of odometry and SLAM.” https://github.com/MichaelGrupp/evo, 2017