pith. machine review for the scientific record.

arxiv: 2605.05854 · v1 · submitted 2026-05-07 · 💻 cs.AI

Recognition: unknown

AirQualityBench: A Realistic Evaluation Benchmark for Global Air Quality Forecasting

Huilin Zhao, Xing Xu, Xu Wang, Yang Wang, Yudong Zhang, Zhengyang Zhou

Pith reviewed 2026-05-08 11:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords air quality forecasting · benchmark dataset · spatio-temporal modeling · missing data handling · global monitoring · evaluation benchmark · multi-pollutant prediction

The pith

AirQualityBench shows that models that excel on preprocessed air quality datasets do not reliably transfer to global, fragmented monitoring data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AirQualityBench, a benchmark built from real hourly observations of six pollutants at 3,720 stations worldwide from 2021 through 2025, with the original missing-data patterns kept intact. It evaluates forecasting models by reporting errors only on valid future observations, after converting predictions back to physical units, and never fills gaps artificially. This reveals that strong results on cleaned regional datasets often do not carry over to realistic conditions with uneven coverage and heterogeneous scales. A sympathetic reader cares because it exposes the gap between lab-like testing and practical deployment for air quality predictions that affect public health.
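The evaluation rule summarized above can be sketched in a few lines. This is an illustrative sketch, not the benchmark's released evaluation script: the function name and the z-score inverse transform are assumptions for demonstration.

```python
import numpy as np

def masked_mae_physical(pred_norm, target_norm, mask, mean, std):
    """MAE in physical units, computed only where future observations exist.

    pred_norm, target_norm: normalized arrays of shape (stations, horizon).
    mask: boolean array, True where an observation is valid.
    mean, std: normalization statistics (illustrative z-score assumption).
    """
    # Inverse-transform back to physical concentration scales (e.g. ug/m3).
    pred = pred_norm * std + mean
    target = target_norm * std + mean
    # Score only valid future observations; never fill gaps artificially.
    return np.abs(pred - target)[mask].mean()

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 24))
target = rng.normal(size=(4, 24))
mask = rng.random((4, 24)) > 0.3   # roughly 30% missing, as in fragmented networks
print(masked_mae_physical(pred, target, mask, mean=35.0, std=20.0))
```

The key property is that missing future values contribute nothing to the score, so a model is never rewarded or penalized on imputed targets.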

Core claim

The central discovery is that representative spatio-temporal models, when evaluated under a unified protocol on this global benchmark with preserved missingness and physical-scale errors, exhibit performance that does not transfer from sanitized datasets, establishing AirQualityBench as a realistic testbed for mask-aware and physically interpretable forecasting.

What carries the argument

The provider-native observation masks that expose missingness as part of the forecasting problem rather than imputing dense tensors.
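One common way to expose provider-native masks to a model, rather than imputing a dense tensor, is to pass the mask as an explicit input channel alongside placeholder-filled values. This is a minimal sketch of that idea under assumed array shapes, not the paper's implementation:

```python
import numpy as np

def mask_aware_batch(values, mask, fill=0.0):
    """Build model input that exposes missingness instead of imputing it.

    values: (stations, time) raw observations with NaNs at gaps.
    mask:   (stations, time) boolean, True where observed.
    Returns a (stations, time, 2) tensor: [placeholder-filled value, mask flag].
    """
    filled = np.where(mask, values, fill)          # placeholder only; the mask
    indicator = mask.astype(values.dtype)          # channel tells the model which
    return np.stack([filled, indicator], axis=-1)  # entries are real observations

vals = np.array([[1.0, np.nan, 3.0]])
m = ~np.isnan(vals)
x = mask_aware_batch(vals, m)
print(x.shape)  # (1, 3, 2)
```

Because the fill value is paired with a zero in the indicator channel, the model can learn to discount it, which is the opposite of treating an imputed value as a real observation.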

If this is right

  • Forecasting models need to incorporate mechanisms for handling structured missing data in global settings.
  • Evaluation must use inverse transformations to physical concentration scales for meaningful comparisons.
  • Scalable models that work with uneven station coverage become a priority for real-world use.
  • Multi-pollutant forecasting requires addressing heterogeneous scales without normalization assumptions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks could be developed for other environmental forecasting tasks like weather or traffic to test real-world robustness.
  • Model developers might prioritize architectures that explicitly model observation masks to improve transfer.
  • Extending the benchmark to include deployment cost metrics could guide practical model selection.

Load-bearing premise

The 3,720 selected stations and the 2021–2025 time period with their provider-native masks sufficiently represent the dominant real-world conditions in global air quality monitoring networks.

What would settle it

A model that maintained strong performance on both traditional sanitized datasets and this benchmark, or evidence that the selected stations miss key patterns in global monitoring, would challenge the claim that sanitized-dataset performance fails to transfer.

Figures

Figures reproduced from arXiv: 2605.05854 by Huilin Zhao, Xing Xu, Xu Wang, Yang Wang, Yudong Zhang, Zhengyang Zhou.

Figure 1
Figure 1. Overview of AirQualityBench. The benchmark combines a global network of 3,720 monitoring stations, synchronized observations of six pollutants, authentic missingness patterns, and physical-scale evaluation, providing a realistic testbed for large-scale spatio-temporal air quality forecasting.
Figure 2
Figure 2. Pollutant-specific missingness in AirQualityBench. Station-level missingness distributions show heterogeneous coverage across pollutants, with substantially sparser observations for gaseous species.
Figure 3
Figure 3. Multi-scale temporal dynamics in AirQualityBench. Diurnal and seasonal climatologies show that all six pollutants preserve regular temporal structure with pollutant-dependent rhythms.
Figure 4
Figure 4. Accuracy–efficiency trade-off on AirQualityBench. Each bubble denotes a forecasting model, with position determined by global aggregate MAE and inference latency, and bubble size proportional to parameter count.
Figure 5
Figure 5. Spatial correlation decay with Haversine distance. Left: aggregated relationship between pairwise correlation and inter-station distance, summarized with distance-binned means and 95% confidence intervals. Right: pollutant-specific decay patterns for CO, NO2, O3, PM10, PM2.5, and SO2. Across pollutants, pairwise correlation generally weakens with distance.
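The Haversine distance underlying Figure 5, and the spherical nearest-neighbor lookups it motivates, can be computed as follows. The helper names are illustrative and not taken from the benchmark's codebase:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

def knn_on_sphere(lats, lons, i, k=3):
    """Indices of the k nearest stations (on the sphere) to station i."""
    d = haversine_km(lats[i], lons[i], lats, lons)
    d[i] = np.inf                      # exclude the station itself
    return np.argsort(d)[:k]
```

Using great-circle rather than Euclidean distance matters at planetary scale, where longitude degrees shrink toward the poles.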
read the original abstract

Air-quality forecasting models are commonly evaluated on regional, preprocessed, and normalized datasets, where missing observations are removed or artificially completed. Such protocols simplify comparison but hide the conditions that dominate real monitoring networks: uneven global coverage, structured missingness, heterogeneous pollutant scales, and deployment cost. We introduce AirQualityBench, a global multi-pollutant benchmark designed to evaluate forecasting models under these realistic conditions. The benchmark contains hourly observations from 3,720 monitoring stations over 2021–2025, covers six major pollutants, and preserves provider-native observation masks. Rather than imputing a dense data tensor, AirQualityBench exposes missingness as part of the forecasting problem and reports errors on valid future observations after inverse transformation to physical concentration scales. Evaluating representative spatio-temporal models under this unified protocol shows that strong performance on sanitized datasets does not reliably transfer to global, fragmented monitoring streams. AirQualityBench therefore serves as a realistic testbed for scalable, mask-aware, and physically interpretable air-quality forecasting. All benchmark data, code, evaluation scripts, and baseline implementations are available on GitHub: https://github.com/Star-Learning/AirQualityBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AirQualityBench, a global multi-pollutant forecasting benchmark built from hourly observations at 3,720 real monitoring stations over 2021–2025. It preserves provider-native missingness masks rather than imputing or discarding data, reports errors after inverse transformation to physical units, and shows that representative spatio-temporal models that perform well on sanitized regional datasets do not transfer reliably to this fragmented global setting.

Significance. If the station selection and masks are representative of dominant real-world monitoring conditions, the benchmark supplies a much-needed testbed that forces models to handle uneven coverage, structured missingness, and heterogeneous scales. The public release of data, code, evaluation scripts, and baselines is a clear strength that lowers the barrier for future work on mask-aware and physically interpretable forecasters.

major comments (2)
  1. [§3] §3 (Dataset Construction): The selection of the 3,720 stations is described only by the final count and time window; no explicit sampling criteria, spatial-density statistics (stations per continent or per 1,000 km²), or missingness histograms are provided, nor is any comparison made to global inventories such as the full OpenAQ or EEA networks. This directly undermines the central claim that the observed performance gap demonstrates failure under “dominant real-world conditions” rather than an artifact of a non-representative subset.
  2. [§5] §5 (Experimental Results): The non-transfer conclusion rests on the assumption that the native masks in the chosen stations embody the typical fragmentation patterns; without quantitative characterization of those patterns (e.g., fraction of stations with >30 % missingness, spatial clustering of gaps), it is impossible to judge whether the reported degradation is general or specific to the selected subset.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 repeatedly use “global” without qualification; a single sentence clarifying the continental coverage of the 3,720 stations would improve precision.
  2. [Table 1] Table 1 (baseline results) would benefit from an additional column reporting the average missingness rate per pollutant or per region to contextualize the error numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments both highlight the need for more quantitative characterization of station selection and missingness patterns to support claims of representativeness. We agree these details strengthen the manuscript and will add them in revision.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The selection of the 3,720 stations is described only by the final count and time window; no explicit sampling criteria, spatial-density statistics (stations per continent or per 1,000 km²), or missingness histograms are provided, nor is any comparison made to global inventories such as the full OpenAQ or EEA networks. This directly undermines the central claim that the observed performance gap demonstrates failure under “dominant real-world conditions” rather than an artifact of a non-representative subset.

    Authors: We acknowledge that the current manuscript provides only the final station count and time window without explicit selection criteria or comparative statistics. Stations were drawn from public sources (primarily OpenAQ) by retaining all locations with at least one valid hourly observation in the 2021–2025 window to preserve native global coverage and missingness. To address the concern, the revised §3 will include: (i) explicit sampling criteria (minimum temporal coverage threshold), (ii) spatial-density statistics (stations per continent and per 1,000 km²), (iii) missingness histograms, and (iv) a direct comparison against the full OpenAQ and EEA inventories. These additions will allow readers to evaluate representativeness directly. revision: yes

  2. Referee: [§5] §5 (Experimental Results): The non-transfer conclusion rests on the assumption that the native masks in the chosen stations embody the typical fragmentation patterns; without quantitative characterization of those patterns (e.g., fraction of stations with >30 % missingness, spatial clustering of gaps), it is impossible to judge whether the reported degradation is general or specific to the selected subset.

    Authors: We agree that the non-transfer claim would be more robust with explicit quantification of the missingness regime. The revised §5 (and a new supplementary section) will report: the distribution of per-station missingness rates, the exact fraction of stations exceeding 30 % missingness, and metrics of spatial clustering of gaps (e.g., Moran’s I on missingness indicators). These statistics will be computed on the released benchmark data so that readers can assess how typical the observed fragmentation is. revision: yes
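The statistics promised in this response are straightforward to compute from the released masks. A minimal sketch, assuming per-station boolean masks and a binary station-adjacency weight matrix for Moran's I (both assumptions, not details from the paper):

```python
import numpy as np

def missingness_stats(mask):
    """Per-station missingness rate and fraction of stations above 30% missing.

    mask: (stations, time) boolean, True where observed.
    """
    miss = 1.0 - mask.mean(axis=1)
    return miss, float((miss > 0.30).mean())

def morans_i(x, w):
    """Moran's I spatial autocorrelation of x under weight matrix w (zero diagonal).

    I = (n / W) * sum_ij w_ij (x_i - xbar)(x_j - xbar) / sum_i (x_i - xbar)^2
    """
    n = len(x)
    z = x - x.mean()
    num = n * (w * np.outer(z, z)).sum()
    den = w.sum() * (z ** 2).sum()
    return num / den
```

Applied to station missingness rates as `x`, a positive Moran's I would indicate that gaps cluster spatially, exactly the "spatial clustering of gaps" the referee asks to see quantified.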

Circularity Check

0 steps flagged

No circularity; benchmark release and empirical evaluation are self-contained

full rationale

The paper introduces AirQualityBench as a data release and unified evaluation protocol for air-quality forecasting under realistic missingness and global coverage conditions. Its central claim is an empirical observation from running representative models on the released data: performance on sanitized datasets does not transfer. No derivation chain, equations, fitted parameters renamed as predictions, or first-principles results are present. The contribution does not reduce to any self-citation, ansatz, or input by construction; it is an independent dataset and protocol whose validity rests on external representativeness checks rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on empirical data collection rather than mathematical axioms or derivations; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Provider-native observation masks and station data accurately reflect real monitoring network conditions.
    Invoked in the description of preserved missingness and global coverage.

pith-pipeline@v0.9.0 · 5514 in / 1130 out tokens · 46545 ms · 2026-05-08T11:12:47.858090+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 7 canonical work pages

  1. [1] National Research Council, Division on Earth and Life Studies, Board on Atmospheric Sciences and Climate, Committee on the Significance of International Transport of Air Pollutants. Global sources of local pollution: An assessment of long-range transport of key air pollutants to and from the United States. National Academies Press, 2010.

  2. [2] Shengdong Du, Tianrui Li, Yan Yang, and Shi-Jinn Horng. Deep air quality forecasting using hybrid deep learning framework. IEEE Transactions on Knowledge and Data Engineering, 33(6):2412–2424, 2019.

  3. [3] World Health Organization et al. WHO global air quality guidelines: particulate matter (PM2.5 and PM10), ozone, nitrogen dioxide, sulfur dioxide and carbon monoxide. World Health Organization, 2021.

  4. [4] Chujun Chen, Weihua Chen, Linhao Guo, Yongkang Wu, Xianzhong Duan, Xuemei Wang, and Min Shao. A comprehensive review of tropospheric background ozone: definitions, estimation methods, and meta-analysis of its spatiotemporal distribution in China. Atmospheric Chemistry and Physics, 25(21):15145–15169, 2025.

  5. [5] Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, and Roger Zimmermann. LargeST: A benchmark dataset for large-scale traffic forecasting. Advances in Neural Information Processing Systems, 36:75354–75371, 2023.

  6. [6] Shuo Wang, Yanran Li, Jiang Zhang, Qingye Meng, Lingwei Meng, and Fei Gao. PM2.5-GNN: A domain knowledge enhanced graph neural network for PM2.5 forecasting. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems, pages 163–166, 2020.

  7. [7] Yachuan Liu, Jiaqi Ma, Paramveer Dhillon, and Qiaozhu Mei. A new benchmark of graph learning for PM2.5 forecasting under distribution shift. In ACM, page 6, 2021.

  8. [8] Shuo Wang, Yun Cheng, Qingye Meng, Olga Saukh, Jiang Zhang, Jingfang Fan, Yuanting Zhang, Xingyuan Yuan, and Lothar Thiele. PCDCNet: A surrogate model for air quality forecasting with physical-chemical dynamics and constraints. arXiv preprint arXiv:2505.19842, 2025.

  9. [9] José F. Vicent, Manuel Curado, and Marc Semper. Spatio-temporal graph neural network for inter-city air quality forecasting. International Journal of Environmental Science and Technology, 23(1):63, 2026.

  10. [10] Zhiyuan Li, Kin-Fai Ho, Harry Fung Lee, and Steve Hung Lam Yim. Development of an integrated model framework for multi-air-pollutant exposure assessments in high-density cities and the implications for epidemiological research. EGUsphere, 2023:1–20, 2023.

  11. [11] Kieran M. R. Hunt. Stop using root-mean-square error as a precipitation target! arXiv preprint arXiv:2509.08369, 2025.

  12. [12] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.

  13. [13] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.

  14. [14] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.

  15. [15] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929, 2019.

  16. [16] Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.

  17. [17] Jiawei Jiang, Chengkai Han, Wayne Xin Zhao, and Jingyuan Wang. PDFormer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4365–4373, 2023.

  18. [18] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. Advances in Neural Information Processing Systems, 33:17804–17815, 2020.

  19. [19] Zezhi Shao, Zhao Zhang, Wei Wei, Fei Wang, Yongjun Xu, Xin Cao, and Christian S. Jensen. Decoupled dynamic spatial-temporal graph neural network for traffic forecasting. arXiv preprint arXiv:2206.09112, 2022.

  20. [20] Jiaming Ma, Binwu Wang, Guanjun Wang, Kuo Yang, Zhengyang Zhou, Pengkun Wang, Xu Wang, and Yang Wang. Less but more: Linear adaptive graph learning empowering spatiotemporal forecasting. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  21. [21] Jiaming Ma, Binwu Wang, Pengkun Wang, Zhengyang Zhou, Xu Wang, and Yang Wang. BiST: A lightweight and efficient bi-directional model for spatiotemporal prediction. Proceedings of the VLDB Endowment, 18(6):1663–1676, 2025.

  22. [22] Lixiang Fan, Bohao Li, Tao Zou, Junchen Ye, and Bowen Du. Incident-guided spatiotemporal traffic forecasting. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, pages 243–254, 2026.

  23. [23] Yuxuan Liang, Yutong Xia, Songyu Ke, Yiwei Wang, Qingsong Wen, Junbo Zhang, Yu Zheng, and Roger Zimmermann. AirFormer: Predicting nationwide air quality in China with transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 14329–14337, 2023.

  24. [24] Christa A. Hasenkopf, J. C. Flasher, Olaf Veerman, and Helen Langley DeWitt. OpenAQ: a platform to aggregate and freely share global air quality data. In AGU Fall Meeting Abstracts, volume 2015, pages A31D–0097, 2015.