pith. machine review for the scientific record.

arxiv: 2605.09677 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: no theorem link

VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement

Berend Jan van der Zwaag, Hao Cheng, Ozlem Durmaz Incel, Qingyu Xian, Rolands Kromanis

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:05 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision foundation models · structural displacement measurement · training-free · marker-free · calibration-free · triangulation · point tracking · structural health monitoring

The pith

Vision foundation models enable training-free, marker-free structural displacement measurement from video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VFM-SDM, a framework that uses pre-trained vision foundation models to estimate camera parameters and track points in structural videos, then reconstructs multi-directional displacements through triangulation. Structural geometry constraints are added to keep results physically realistic. This removes the usual requirements for task-specific training, marker installation, or manual camera calibration. The approach is tested on real video from an in-service pedestrian bridge, showing close agreement with expected displacement patterns. If the method holds, it supports faster, lower-cost non-contact monitoring for bridges and similar infrastructure without on-site setup.

Core claim

The VFM-SDM framework integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation. Structural geometry constraints suppress physically implausible deviations and improve estimation consistency. On a multi-modal field dataset from a pedestrian bridge, it reports NRMSE_range of 0.11/0.12, correlation coefficients of 0.86/0.88, and RPPAE of 0.01/0.02 for vertical and lateral displacements.
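One way to picture a structural geometry constraint (an illustrative sketch of the general idea, not necessarily the paper's formulation): if motion along the bridge's longitudinal axis is physically implausible, the raw triangulated displacements can be projected off that axis.

```python
import numpy as np

def constrain_displacements(disp, longitudinal_axis):
    """Remove the displacement component along a (unit) longitudinal axis.

    disp              : (T, 3) raw triangulated displacement time series.
    longitudinal_axis : 3-vector along which structural motion is implausible.

    Illustrative constraint only; the paper's exact geometry constraints
    are not specified in this summary.
    """
    a = np.asarray(longitudinal_axis, dtype=float)
    a = a / np.linalg.norm(a)
    disp = np.asarray(disp, dtype=float)
    # Subtract the projection of each displacement vector onto the axis.
    return disp - np.outer(disp @ a, a)
```

The surviving components then live in the vertical-lateral plane, which matches the two displacement directions the paper evaluates.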

What carries the argument

VFM-inferred camera parameter estimation and point tracking combined with triangulation, regularized by structural geometry constraints.
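The triangulation at the heart of this machinery is standard two-view geometry. A minimal linear (DLT) sketch, assuming pinhole projection matrices are available from the VFM stage; this is our illustration, not the authors' exact implementation:

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2 : 3x4 camera projection matrices (K [R | t]).
    x1, x2 : 2D pixel coordinates of the same tracked point in each view.
    Returns the 3D point in the common world frame.
    """
    # Each view contributes two rows of the homogeneous system A X = 0.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

Displacement then falls out as the per-frame difference of triangulated positions; the geometry constraints act on that raw series.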

If this is right

  • Allows immediate deployment of non-contact displacement monitoring on existing structures without preparation steps.
  • Produces consistent vertical and lateral displacement time series that match observed structural behavior.
  • Provides a reproducible evaluation protocol using multi-modal field data for future comparisons.
  • Supports scaling toward automated monitoring in digital twin and data-centric construction settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same VFM components could be applied to other civil infrastructure like buildings or towers if the geometry constraints transfer.
  • Processing speed improvements might enable near-real-time monitoring on standard hardware.
  • Integration with existing structural analysis software could allow direct input of measured displacements into finite element models.

Load-bearing premise

Pre-trained vision foundation models can accurately estimate camera parameters and track points in real-world structural videos without fine-tuning or extra preparation.

What would settle it

A simultaneous reference-sensor deployment on the same bridge would settle it: if, under ordinary lighting and motion conditions, the framework's displacement estimates showed NRMSE_range above 0.3 or correlation below 0.7 against the reference, the core claim would fail.
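The metrics invoked in this threshold are not defined in the summary; under common readings (RMSE normalized by the reference signal's range, Pearson correlation, and relative peak-to-peak amplitude error), they could be computed as:

```python
import numpy as np

def nrmse_range(ref, est):
    """RMSE normalized by the range of the reference signal (assumed definition)."""
    ref, est = np.asarray(ref, dtype=float), np.asarray(est, dtype=float)
    rmse = np.sqrt(np.mean((est - ref) ** 2))
    return rmse / (np.max(ref) - np.min(ref))

def pearson_corr(ref, est):
    """Pearson correlation coefficient between two displacement time series."""
    return np.corrcoef(ref, est)[0, 1]

def rppae(ref, est):
    """Relative peak-to-peak amplitude error (assumed definition)."""
    pp_ref = np.max(ref) - np.min(ref)
    pp_est = np.max(est) - np.min(est)
    return abs(pp_est - pp_ref) / pp_ref
```

With these readings, the reported 0.11/0.12 and 0.86/0.88 values sit comfortably inside the pass region sketched above.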

Figures

Figures reproduced from arXiv: 2605.09677 by Berend Jan van der Zwaag, Hao Cheng, Ozlem Durmaz Incel, Qingyu Xian, Rolands Kromanis.

Figure 1: Pipeline comparison of vision-based methods capable of measuring normal …
Figure 2: Architecture overview of VGGT. Segment Anything (SAM) [54], Track Anything (TAM) Yang et al. [20] and Visual Geometry Grounded Transformer (VGGT) [33], which have demonstrated remarkable generalization across domains and data modalities. VGGT is a transformer-based 3D perception vision foundation model that unifies multiple geometric estimation tasks within a single feed-forward framework. Pretrained on la…
Figure 3: Overall pipeline of VFM-SDM for structural displacement measurement. The …
Figure 4: Complete computational pipeline of VFM-SDM. Capital letters A–D correspond …
Figure 5: Geometric illustration of triangulation. The 3D point …
Figure 6: Overview of the data acquisition setup. From left to right: bridge overview, …
Figure 7: Overview of the three stereo sequences in the proposed dataset. Each row shows …
Figure 8: The first ROIs at the 1/4-span location of the bridge in Video-1 (left) and …
Figure 9: VGGT-based tracking results of the 1/4-span point in the stereo video pair …
Figure 10: VGGT-based tracking results of the 1/4-span point in the second video (Data …
Figure 11: 3D displacement measurement at the 1/4-span location (Data-2) in the struc…
Figure 12: Comparison between accelerometer-derived displacement measurements and …
Figure 13: Correlation between accelerometer-derived displacement measurements and …
Figure 14: Comparison of camera poses obtained from calibration and estimated by VGGT …
original abstract

Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents VFM-SDM, a framework that uses vision foundation models to estimate camera parameters and track points in videos of structures, enabling multi-directional displacement measurement through triangulation. This is done without task-specific training, markers, or calibration, with structural geometry constraints to improve consistency. A new dataset from an in-service pedestrian bridge is introduced, and results show NRMSE_range of 0.11/0.12, correlations of 0.86/0.88, and low peak-to-peak errors.

Significance. If validated, the approach has high significance for structural health monitoring by providing a scalable, preparation-free method for displacement measurement using off-the-shelf models. The new field dataset and benchmarking protocol are valuable contributions that can facilitate reproducible research in applying VFMs to civil engineering tasks. It supports the shift towards data-centric and digital twin applications in construction.

major comments (2)
  1. [Methodology] The central claim depends on VFM zero-shot performance for camera intrinsics/extrinsics and point tracking being sufficiently accurate for triangulation to mm-scale displacements on low-texture bridge surfaces. However, no ablation is presented quantifying the raw VFM errors versus the final constrained estimates (see framework description and results), making it difficult to assess how much the geometry constraints suppress deviations versus how much systematic bias remains in the VFM outputs.
  2. [Experiments and Results] Results section, reported metrics (NRMSE_range 0.11/0.12, correlation 0.86/0.88): while the numbers are promising, the evaluation on a single bridge dataset lacks sensitivity analysis to VFM choice, illumination variation, or repetitive elements, and provides no baseline comparisons (e.g., calibrated traditional methods), which is load-bearing for the claim of reliable training-free performance.
minor comments (2)
  1. [Abstract] Abstract: the two values in NRMSE_range (0.11/0.12) and correlation (0.86/0.88) are not explicitly mapped to vertical versus lateral displacements; this should be clarified for immediate readability.
  2. [Introduction] Notation for RPPAE and other metrics could be defined more explicitly on first use in the main text to aid readers unfamiliar with structural monitoring conventions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We have carefully reviewed the major concerns and provide detailed point-by-point responses below. Where appropriate, we will incorporate revisions to strengthen the presentation of the methodology and experimental validation.

point-by-point responses
  1. Referee: [Methodology] The central claim depends on VFM zero-shot performance for camera intrinsics/extrinsics and point tracking being sufficiently accurate for triangulation to mm-scale displacements on low-texture bridge surfaces. However, no ablation is presented quantifying the raw VFM errors versus the final constrained estimates (see framework description and results), making it difficult to assess how much the geometry constraints suppress deviations versus how much systematic bias remains in the VFM outputs.

    Authors: We agree that quantifying the contribution of the geometry constraints is important for validating the framework. In the revised manuscript, we will add a dedicated ablation subsection that directly compares raw VFM-derived displacements (prior to constraint application) against the final geometry-constrained estimates. This will report per-axis errors, NRMSE, and correlation metrics on the bridge dataset to illustrate the degree of deviation suppression and any residual systematic biases. revision: yes

  2. Referee: [Experiments and Results] Results section, reported metrics (NRMSE_range 0.11/0.12, correlation 0.86/0.88): while the numbers are promising, the evaluation on a single bridge dataset lacks sensitivity analysis to VFM choice, illumination variation, or repetitive elements, and provides no baseline comparisons (e.g., calibrated traditional methods), which is load-bearing for the claim of reliable training-free performance.

    Authors: We acknowledge the value of additional robustness checks. The revised version will include sensitivity analysis across multiple VFM variants for both camera estimation and point tracking, along with qualitative discussion of performance under varying illumination and repetitive texture conditions observed in the dataset. For baselines, we will add quantitative comparisons against a calibrated traditional vision-based method applied to the same video sequences (where ground-truth calibration data is available), while noting that our approach avoids such preparation. Full exhaustive sensitivity across all environmental factors is limited by the single in-service dataset; we will explicitly discuss this as a limitation and outline directions for future multi-site validation. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external pre-trained models and new data

full rationale

The paper presents VFM-SDM as an integration of off-the-shelf vision foundation models for camera parameter estimation and point tracking, followed by triangulation and structural geometry constraints on a newly collected multi-modal bridge dataset. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The central claims rest on external pre-trained VFMs and a reproducible benchmarking protocol rather than reducing to the paper's own inputs by construction. This is the expected honest non-finding for a method that delegates core vision tasks to independently trained models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The approach uses standard techniques like triangulation and pre-trained models, with structural geometry constraints as domain knowledge.

pith-pipeline@v0.9.0 · 5571 in / 1302 out tokens · 71023 ms · 2026-05-12T04:05:32.175710+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 1 internal anchor

  1. [1] S. Abu Dabous, S. Feroz, Condition monitoring of bridges with non-contact testing technologies, Automation in Construction 116 (2020) 103224. doi:10.1016/j.autcon.2020.103224
  2. [2] N. S. Gulgec, M. Takáč, S. N. Pakzad, Convolutional neural network approach for robust structural damage detection and localization, Journal of Computing in Civil Engineering 33 (2019). doi:10.1061/(asce)cp.1943-5487.0000820
  3. [3] S. Ereiz, I. Duvnjak, J. Fernando Jiménez-Alonso, Review of finite element model updating methods for structural applications, Structures 41 (2022) 684–723. doi:10.1016/j.istruc.2022.05.041
  4. [4] V. Nicoletti, D. Arezzo, S. Carbonari, F. Gara, Dynamic monitoring of buildings as a diagnostic tool during construction phases, Journal of Building Engineering 46 (2022) 103764. doi:10.1016/j.jobe.2021.103764
  5. [5] S. W. Doebling, C. R. Farrar, M. B. Prime, D. W. Shevitz, Damage Identification and Health Monitoring of Structural and Mechanical Systems from Changes in Their Vibration Characteristics: A Literature Review, Technical Report LA-13070-MS, Los Alamos National Laboratory, 1996. doi:10.2172/249299
  6. [6] A. Moreno-Gomez, C. A. Perez-Ramirez, A. Dominguez-Gonzalez, M. Valtierra-Rodriguez, O. Chavez-Alegria, J. P. Amezquita-Sanchez, Sensors used in structural health monitoring, Archives of Computational Methods in Engineering 25 (2018) 901–918. doi:10.1007/s11831-017-9217-4
  7. [7] P. Garg, F. Moreu, A. Ozdagli, M. R. Taha, D. Mascareñas, Noncontact dynamic displacement measurement of structures using a moving laser doppler vibrometer, Journal of Bridge Engineering 24 (2019). doi:10.1061/(asce)be.1943-5592.0001472
  8. [8] Y. Xiong, Z. Peng, G. Xing, W. Zhang, G. Meng, Accurate and robust displacement measurement for fmcw radar vibration monitoring, IEEE Sensors Journal 18 (2018) 1131–1139. doi:10.1109/jsen.2017.2778294
  9. [10] M. Huang, B. Zhang, W. Lou, A. Kareem, A deep learning augmented vision-based method for measuring dynamic displacements of structures in harsh environments, Journal of Wind Engineering and Industrial Aerodynamics 217 (2021) 104758. doi:10.1016/j.jweia.2021.104758
  10. [11] Y. Weng, J. Shan, Z. Lu, X. Lu, B. F. Spencer, Homography-based structural displacement measurement for large structures using unmanned aerial vehicles, Computer-Aided Civil and Infrastructure Engineering 36 (2021) 1114–1128. doi:10.1111/mice.12645
  11. [12] M. Bolognini, G. Izzo, D. Marchisotti, L. Fagiano, M. P. Limongelli, E. Zappa, Vision-based modal analysis of built environment structures with multiple drones, Automation in Construction 143 (2022) 104550. doi:10.1016/j.autcon.2022.104550
  12. [13] Q. Li, Y. Shao, L. Li, J. Li, H. Hao, Advancements in 3d displacement measurement for civil structures: A monocular vision approach with moving cameras, Measurement 242 (2025) 116060. doi:10.1016/j.measurement.2024.116060
  13. [14] S. Zhang, P. Ni, J. Wen, Q. Han, X. Du, K. Xu, Automated vision-based multi-plane bridge displacement monitoring, Automation in Construction 166 (2024) 105619. doi:10.1016/j.autcon.2024.105619
  14. [15] Z. Zhang, A flexible new technique for camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1330–1334. doi:10.1109/34.888718
  15. [16] J. Wang, C. Rupprecht, D. Novotny, Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment, in: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2023, p. 9739–9749. doi:10.1109/iccv51070.2023.00896
  16. [17] C. Sun, D. Gu, Y. Zhang, X. Lu, Vision-based displacement measurement enhanced by super-resolution using generative adversarial networks, Structural Control and Health Monitoring 29 (2022). doi:10.1002/stc.3048
  17. [18] D. Marchisotti, E. Zappa, Feasibility of drone-based modal analysis using tof-grayscale and tracking cameras, IEEE Transactions on Instrumentation and Measurement 72 (2023) 1–10. doi:10.1109/tim.2023.3281628
  18. [19] C. Zhang, Z. Lu, X. Li, Y. Zhang, X. Guo, A two-stage correction method for uav movement-induced errors in non-target computer vision-based displacement measurement, Mechanical Systems and Signal Processing 224 (2025) 112131. doi:10.1016/j.ymssp.2024.112131
  19. [20] J. Yang, M. Gao, Z. Li, S. Gao, F. Wang, F. Zheng, Track anything: Segment anything meets videos, arXiv preprint arXiv:2304.11968 (2023). doi:10.48550/arXiv.2304.11968
  20. [21] C. Doersch, Y. Yang, M. Vecerik, D. Gokay, A. Gupta, Y. Aytar, J. Carreira, A. Zisserman, Tapir: Tracking any point with per-frame initialization and temporal refinement, in: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2023, p. 10027–10038. doi:10.1109/iccv51070.2023.00923
  21. [22] J. Zhao, F. Hu, Y. Xu, W. Zuo, J. Zhong, H. Li, Structure-posenet for identification of dense dynamic displacement and three-dimensional poses of structures using a monocular camera, Computer-Aided Civil and Infrastructure Engineering 37 (2022) 704–725. doi:10.1111/mice.12761
  22. [23] Y. Shao, L. Li, J. Li, Q. Li, S. An, H. Hao, Out-of-plane full-field vibration displacement measurement with monocular computer vision, Automation in Construction 165 (2024) 105507. doi:10.1016/j.autcon.2024.105507
  23. [24] Y. Ruan, T. Huang, C. Yuan, G. Zong, Q. Kong, A lightweight binocular vision-supported framework for 3d structural dynamic response monitoring, Computer-Aided Civil and Infrastructure Engineering 40 (2025) 4364–4377. doi:10.1111/mice.13452
  24. [26] Y. Shao, L. Li, J. Li, Q. Li, S. An, H. Hao, Dimmc: A 3d vision approach for structural displacement measurement using a moving camera, Engineering Structures 338 (2025) 120566. doi:10.1016/j.engstruct.2025.120566
  25. [27] J. Jiao, J. Guo, K. Fujita, I. Takewaki, Displacement measurement and nonlinear structural system identification: A vision-based approach with camera motion correction using planar structures, Structural Control and Health Monitoring 28 (2021). doi:10.1002/stc.2761
  26. [28] L. Xing, W. Dai, Y. Zhang, Improving displacement measurement accuracy by compensating for camera motion and thermal effect on camera sensor, Mechanical Systems and Signal Processing 167 (2022) 108525. doi:10.1016/j.ymssp.2021.108525
  27. [29] T. Panigati, A. Abbozzo, M. A. Pace, E. Temur, F. Cigan, R. Kromanis, Dynamic identification of bridges using multiple synchronized cameras and computer vision, Infrastructures 10 (2025) 37. doi:10.3390/infrastructures10020037
  28. [30] Y. Shao, L. Li, J. Li, S. An, H. Hao, Target-free 3d tiny structural vibration measurement based on deep learning and motion magnification, Journal of Sound and Vibration 538 (2022) 117244. doi:10.1016/j.jsv.2022.117244
  29. [31] X. Pan, T. Yang, 3d vision-based out-of-plane displacement quantification for steel plate structures using structure-from-motion, deep learning, and point-cloud processing, Computer-Aided Civil and Infrastructure Engineering 38 (2023) 547–561. doi:10.1111/mice.12906
  30. [33] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, D. Novotny, Vggt: Visual geometry grounded transformer, in: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2025, p. 5294–5306. doi:10.1109/cvpr52734.2025.00499
  31. [34] A. Lin, J. Y. Zhang, D. Ramanan, S. Tulsiani, Relpose++: Recovering 6d poses from sparse-view observations, in: 2024 International Conference on 3D Vision (3DV), IEEE, 2024, p. 106–115. doi:10.1109/3dv62453.2024.00126
  32. [35] Q. Xian, W. Jiao, H. Cheng, B. J. van der Zwaag, Y. Huang, T-graph: Enhancing sparse-view camera pose estimation by pairwise translation graph, ISPRS Journal of Photogrammetry and Remote Sensing 230 (2025) 109–125. doi:10.1016/j.isprsjprs.2025.08.031
  33. [36] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2003. doi:10.1017/CBO9780511811685
  34. [37] X. Ji, Z. Miao, R. Kromanis, Vision-based measurements of deformations and cracks for rc structure tests, Engineering Structures 212 (2020) 110508
  35. [38] R. Kromanis, P. Kripakaran, A multiple camera position approach for accurate displacement measurement using computer vision, Journal of Civil Structural Health Monitoring 11 (2021) 661–678. doi:10.1007/s13349-021-00473-0
  36. [39] X. Pan, T. Yang, Y. Xiao, H. Yao, H. Adeli, Vision-based real-time structural vibration measurement through deep-learning-based detection and tracking methods, Engineering Structures 281 (2023) 115676. doi:10.1016/j.engstruct.2023.115676
  37. [40] J. Jeong, H. Jo, Real-time generic target tracking for structural displacement monitoring under environmental uncertainties via deep learning, Structural Control and Health Monitoring 29 (2021). doi:10.1002/stc.2902
  38. [41] Q. He, S. Wang, Improving 2d displacement accuracy in bridge vibration measurement with color space fusion and super resolution, Advanced Engineering Informatics 65 (2025) 103248
  39. [42] H.-P. Wan, T.-L. Fang, Y.-K. Zhu, C. Wang, N.-B. Wang, An improved sift-based method for non-contact bridge displacement measurement, Advanced Engineering Informatics 71 (2026) 104145
  40. [43] L. Luo, M. Q. Feng, Z. Y. Wu, Robust vision sensor for multi-point displacement monitoring of bridges in the field, Engineering Structures 163 (2018) 255–266. doi:10.1016/j.engstruct.2018.02.014
  41. [44] Y. Xu, J. Zhang, J. Brownjohn, An accurate and distraction-free vision-based structural displacement measurement method integrating siamese network based tracker and correlation-based template matching, Measurement 179 (2021) 109506. doi:10.1016/j.measurement.2021.109506
  42. [46] Z. Ma, J. Choi, P. Liu, H. Sohn, Structural displacement estimation by fusing vision camera and accelerometer using hybrid computer vision algorithm and adaptive multi-rate kalman filter, Automation in Construction 140 (2022) 104338. doi:10.1016/j.autcon.2022.104338
  43. [47] M. Wang, W. Kei Ao, J. Brownjohn, F. Xu, Completely non-contact modal testing of full-scale bridge in challenging conditions using vision sensing systems, Engineering Structures 272 (2022) 114994. doi:10.1016/j.engstruct.2022.114994
  44. [48] D. Cui, W. Wang, Y. He, Y. Zhang, X. Zhang, Y. Zhao, Y. Zhang, Research on wide-area displacement monitoring based on rotating platform and binocular vision, Measurement 253 (2025) 117463. doi:10.1016/j.measurement.2025.117463
  45. [49] C. Xie, B. Huang, Z. Wu, Y. Hu, K. Liang, J. Chen, A. Garg, G. Mei, A new economical approach for measurement of 3d structural displacement and motion trajectory: Utilizing binocular vision and subpixel enhancement with square feature recognition, Structures 77 (2025) 109178. doi:10.1016/j.istruc.2025.109178
  46. [50] Y. Shao, L. Li, J. Li, S. An, H. Hao, Computer vision based target-free 3d vibration displacement measurement of structures, Engineering Structures 246 (2021) 113040. doi:10.1016/j.engstruct.2021.113040
  47. [51] J. Wu, Z. Ma, Y. Xue, J. Qin, D. You, G. Sun, Displacement monitoring and modal parameter identification of cable net structure based on feature optical flow and binocular stereo vision, Structures 76 (2025) 108914. doi:10.1016/j.istruc.2025.108914
  48. [52] Y. Narazaki, F. Gomez, V. Hoskere, M. D. Smith, B. F. Spencer, Efficient development of vision-based dense three-dimensional displacement measurement algorithms using physics-based graphics models, Structural Health Monitoring 20 (2020) 1841–1863. doi:10.1177/1475921720939522
  49. [53] K. Lin, L. Wang, Z. Liu, End-to-end human pose and mesh reconstruction with transformers, in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2021, p. 1954–1963. doi:10.1109/cvpr46437.2021.00199
  50. [54] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, R. Girshick, Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4015–4026. doi:10.1109/ICCV51070.2023.00371
  51. [55] J. Sturm, N. Engelhard, F. Endres, W. Burgard, D. Cremers, A benchmark for the evaluation of rgb-d slam systems, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2012, p. 573–580. doi:10.1109/iros.2012.6385773
  52. [56] J. L. Schonberger, J.-M. Frahm, Structure-from-motion revisited, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, p. 4104–4113. doi:10.1109/cvpr.2016.445
  53. [57] B. Triggs, P. F. McLauchlan, R. I. Hartley, A. W. Fitzgibbon, Bundle Adjustment – A Modern Synthesis, Springer Berlin Heidelberg, 2000, p. 298–372. doi:10.1007/3-540-44480-7_21
  54. [58] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, H. Zhao, Depth anything: Unleashing the power of large-scale unlabeled data, in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2024, p. 10371–10381. doi:10.1109/cvpr52733.2024.00987
  55. [59] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., Dinov2: Learning robust visual features without supervision, arXiv preprint arXiv:2304.07193 (2023). doi:10.48550/arXiv.2304.07193
  56. [60] R. Ranftl, A. Bochkovskiy, V. Koltun, Vision transformers for dense prediction, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2021, p. 12159–12168. doi:10.1109/iccv48922.2021.01196
  57. [61] F. Ponsi, E. Buoli, G. E. Varzaneh, E. Bassoli, B. Briseghella, L. Vincenzi, Vision-based approach for the static and dynamic monitoring of bridges, Procedia Structural Integrity 62 (2024) 946–954. doi:10.1016/j.prostr.2024.09.127
  58. [62] R. J. Hyndman, A. B. Koehler, Another look at measures of forecast accuracy, International Journal of Forecasting 22 (2006) 679–688. doi:10.1016/j.ijforecast.2006.03.001
  59. [63] J. L. Rodgers, W. A. Nicewander, Thirteen ways to look at the correlation coefficient, The American Statistician 42 (1988) 59. doi:10.2307/2685263
  60. [64] C.-J. Kat, P. S. Els, Validation metric based on relative error, Mathematical and Computer Modelling of Dynamical Systems 18 (2012) 487–520. doi:10.1080/13873954.2012.663392
  61. [65] C. Doersch, P. Luc, Y. Yang, D. Gokay, S. Koppula, A. Gupta, J. Heyward, I. Rocco, R. Goroshin, J. Carreira, A. Zisserman, BootsTAP: Bootstrapped Training for Tracking-Any-Point, Springer Nature Singapore, 2024, p. 483–500. doi:10.1007/978-981-96-0901-7_28
  62. [66] D. DeTone, T. Malisiewicz, A. Rabinovich, Superpoint: Self-supervised interest point detection and description, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2018, p. 337–33712. doi:10.1109/cvprw.2018.00060
  63. [67] P.-E. Sarlin, D. DeTone, T. Malisiewicz, A. Rabinovich, Superglue: Learning feature matching with graph neural networks, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, p. 4937–4946. doi:10.1109/cvpr42600.2020.00499