TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Chao Chen; Hanyu Guo; Jiedong Yang; Kaikui Liu; Longfei Xu; Xiangxiang Chu

arxiv: 2605.22355 · v1 · pith:HSBE5JWWnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI · cs.LG

Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu

Pith reviewed 2026-05-22 05:44 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: transit route planning, map-free generation, large language models, origin destination queries, public transportation dataset, data-driven routing, GPS grounding

The pith

An LLM trained on historical transit records can generate valid routes and ground GPS points to stations without any maps or routing engines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset of more than 13 million real transit route records from four cities to serve as training material for language models. Experiments demonstrate that a model trained on this data produces routes that respect line connections and transfers at high rates of structural correctness. The same model also learns to associate arbitrary location coordinates with the correct nearby stations on its own. This shows that the entire task of planning public transit journeys can be handled as a data-driven prediction problem rather than an engineering task built on explicit maps. If the approach holds, route generation becomes a single learned function that takes only origin and destination inputs.

Core claim

The paper establishes that transit route planning reduces to a sequence-generation task that a large language model can master when given a large corpus of past origin-destination records. After training, the model outputs complete, structurally valid itineraries and, without any separate location database, maps raw GPS coordinates to the nearest appropriate stations. The results indicate that the implicit regularities in historical travel data are sufficient to support generalization to new queries.

What carries the argument

The TransitLM dataset of 13 million historical route records, used as continual pre-training data, supplies the implicit network structure that lets the model learn valid sequences and coordinate-to-station mappings directly.

If this is right

Route generation becomes possible with only origin and destination text or coordinates as input.
No separate map database or graph engine is required at inference time.
The same model can handle both textual addresses and raw GPS locations in one forward pass.
New cities or lines can be incorporated simply by adding more historical records to the training corpus.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be tested on other sequential planning tasks such as delivery routing or multi-modal journey planning where explicit graphs are costly to maintain.
Accuracy might improve further if the training data were augmented with synthetic but realistic route variations that preserve line constraints.
Deployment in low-resource settings becomes feasible because the only required runtime component is the trained model weights.

Load-bearing premise

Historical route records already contain enough hidden regularities that a model can learn to connect arbitrary new origin and destination points without ever seeing an explicit map.

What would settle it

Test the trained model on a held-out set of origin-destination pairs that require transfers or use stations never seen in training; if the fraction of routes that violate line connectivity or station existence exceeds a small threshold, the claim does not hold.
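
The check described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the network representation (`lines` as a mapping from line name to an ordered station list), the leg format, and the function names `route_violations` and `violation_rate` are all assumptions made for the example. It also simplifies transfers by requiring that each leg starts at the station where the previous leg ended, which ignores walking transfers between nearby stations.

```python
from typing import Dict, List, Tuple

# Hypothetical network description: line name -> ordered list of station names.
Lines = Dict[str, List[str]]
# Hypothetical route format: one (line, board_station, alight_station) per leg.
Leg = Tuple[str, str, str]


def route_violations(route: List[Leg], lines: Lines) -> List[str]:
    """Return a list of structural problems; an empty list means the route passes."""
    if not route:
        return ["empty route"]
    problems: List[str] = []
    for i, (line, board, alight) in enumerate(route):
        stations = lines.get(line)
        if stations is None:
            problems.append(f"leg {i}: unknown line {line!r}")
            continue
        # Station existence: both ends of the leg must be served by the line.
        for station in (board, alight):
            if station not in stations:
                problems.append(f"leg {i}: station {station!r} is not on line {line!r}")
        if board == alight:
            problems.append(f"leg {i}: boards and alights at the same station")
        # Connectivity (simplified): a leg must start where the previous one ended.
        if i > 0 and route[i - 1][2] != board:
            problems.append(f"leg {i}: does not start where leg {i - 1} ended")
    return problems


def violation_rate(routes: List[List[Leg]], lines: Lines) -> float:
    """Fraction of routes with at least one structural violation."""
    if not routes:
        raise ValueError("no routes to evaluate")
    bad = sum(1 for route in routes if route_violations(route, lines))
    return bad / len(routes)
```

Comparing `violation_rate` on the held-out set against a pre-agreed threshold gives the pass/fail criterion; the threshold itself is a choice the source does not specify.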

Figures

Figures reproduced from arXiv: 2605.22355 by Chao Chen, Hanyu Guo, Jiedong Yang, Kaikui Liu, Longfei Xu, Xiangxiang Chu.

**Figure 2.** Overview of TransitLM. Left: Data sources from Amap comprising route plans, station information, station connectivity, and line information across four cities. Center: TransitBench defines three evaluation tasks (ORG, PRG, DRG) with 10K test samples each, assessed by 10 metrics across five categories. Right: TransitLM addresses the limitations of general LLMs through continual pre-training on three knowled…

**Figure 3.** Geographic distribution of route planning origins across the four cities. Density reflects …

**Figure 4.** CPT training loss curves for Qwen3-0.6B, Qwen3-1.7B, and Qwen3-4B over …

**Figure 5.** Optimal Route Generation example from the 4B-Joint model in Beijing. Given a natural …

**Figure 6.** Preference-Aware Planning example on the same OD pair with an added “bus first” …

**Figure 7.** Multi-Route Generation example on the same OD pair. The model produces three alterna…

**Figure 8.** GPS-only Optimal Route Generation on the same OD pair with the textual query removed.

Original abstract

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real value is the new large dataset and benchmark for map-free transit routing, though the LLM results are hard to judge without metrics or evaluation details.

The main thing to know is that this work releases a dataset of over 13 million historical transit records from four Chinese cities, covering more than 120,000 stations and 13,000 lines, and uses it to train an LLM that generates routes directly from origin-destination GPS coordinates without maps or routing engines. That scale and the explicit map-free framing are new for this corner of AI applied to transportation. Releasing the data on Hugging Face and the evaluation code on GitHub is a practical step that lets others test similar ideas quickly. The claim that the model implicitly grounds arbitrary GPS coordinates to stations from route history alone is the interesting angle, if it generalizes.

On the soft spots, the abstract gives no accuracy numbers, no baseline comparisons, no error analysis, and no specifics on how GPS coordinates are handled or how structural validity is checked. That makes the central results difficult to assess. The stress-test concern that test points may sit too close to training locations is reasonable given the lack of detail on geographic spread, and it would be easy to check with a simple hold-out by city or distance. If the full paper shows clear out-of-distribution results and reproducible protocols, that would fix most of the issue.

This is mainly for researchers in urban computing or data-driven mobility planning who need new sequence data or benchmarks. A reader working on applied LLMs for structured prediction would get something concrete from the release. It deserves peer review because the dataset itself is a usable contribution even if the modeling side needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines. It is positioned as a continual pre-training corpus and benchmark for three evaluation tasks with complementary metrics. The central claim is that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping, enabling end-to-end map-free route generation directly from origin-destination information.

Significance. If substantiated with quantitative evidence, the result would be significant for demonstrating that complex spatial and routing tasks can be learned entirely from historical data without maps or engines. The public release of the dataset and evaluation code is a clear strength supporting reproducibility and further work in data-driven transit planning.

major comments (2)

Abstract: The assertion that the LLM 'produces structurally valid routes at high accuracy' supplies no quantitative metrics, error analysis, baseline comparisons, or details on how structural validity was measured. This leaves the central claim without visible supporting evidence and requires explicit results in the experiments section to be load-bearing.
Experiments section: The claim of implicit GPS-to-station grounding without explicit mapping is central to the map-free contribution. The manuscript provides no details on GPS tokenization, whether station coordinates are leaked into inputs, or the geographic spread of test queries relative to training data. This raises a correctness risk that success may stem from proximity to training locations rather than learned general structure, directly testing the assumption that historical records contain sufficient implicit structure for generalization to new queries.
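
One way to probe the proximity concern in the second major comment is to bucket test queries by their distance to the nearest training origin and compare accuracy across buckets. The sketch below is an illustration under stated assumptions: points are `(latitude, longitude)` pairs in degrees, `correct` is a per-query boolean from whatever grounding metric is used, and the 500 m threshold and all function names are hypothetical rather than taken from the paper. The nearest-neighbour search is a brute-force scan, which is adequate for a sample but would need a spatial index at the scale of the full corpus.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance in metres on a spherical Earth (radius 6,371 km)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for near-antipodal points.
    return 2 * 6_371_000.0 * math.asin(math.sqrt(min(1.0, h)))


def nearest_training_distance(query: Point, train: Sequence[Point]) -> float:
    """Distance from a test query to the closest training point (brute force)."""
    return min(haversine_m(query, t) for t in train)


def accuracy_by_distance(
    test_points: Sequence[Point],
    correct: Sequence[bool],
    train: Sequence[Point],
    threshold_m: float = 500.0,  # hypothetical cut-off between "near" and "far"
) -> Dict[str, Optional[float]]:
    """Accuracy for queries near to vs. far from any training point."""
    buckets: Dict[str, List[bool]] = {"near": [], "far": []}
    for point, ok in zip(test_points, correct):
        near = nearest_training_distance(point, train) <= threshold_m
        buckets["near" if near else "far"].append(ok)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

A large gap between the "near" and "far" accuracies would suggest that success depends on proximity to training locations; similar accuracies would be weak evidence for learned structure.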

minor comments (1)

Clarify the exact definitions and metrics for the three evaluation tasks in the benchmark description to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our central claims. We address each major comment below and have prepared revisions to strengthen the manuscript's evidentiary support and technical details.

Referee: Abstract: The assertion that the LLM 'produces structurally valid routes at high accuracy' supplies no quantitative metrics, error analysis, baseline comparisons, or details on how structural validity was measured. This leaves the central claim without visible supporting evidence and requires explicit results in the experiments section to be load-bearing.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we have added a sentence to the abstract reporting the primary accuracy figures for structural validity (e.g., route validity rate and station grounding accuracy) along with a brief note on the evaluation metrics. We have also expanded the experiments section (Section 4) to include explicit baseline comparisons, error analysis broken down by route complexity, and a dedicated paragraph detailing how structural validity is operationalized via the three complementary tasks and their metrics. These changes make the central claim directly supported by visible evidence.
Revision: yes
Referee: Experiments section: The claim of implicit GPS-to-station grounding without explicit mapping is central to the map-free contribution. The manuscript provides no details on GPS tokenization, whether station coordinates are leaked into inputs, or the geographic spread of test queries relative to training data. This raises a correctness risk that success may stem from proximity to training locations rather than learned general structure, directly testing the assumption that historical records contain sufficient implicit structure for generalization to new queries.

Authors: We appreciate the referee highlighting the need for greater transparency on the map-free mechanism. In the revision, we have added a new subsection in the experiments section that describes the GPS tokenization scheme (coordinate discretization into tokens without any station ID or coordinate leakage into the input sequence), confirms that station coordinates are never provided as input features, and reports the geographic distribution of test queries (including a split showing performance on queries from regions with low overlap to training data). We further include results from an out-of-distribution evaluation set to demonstrate generalization beyond proximity to training locations, supporting that the model learns implicit structure from historical records.
Revision: yes
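
The simulated rebuttal mentions "coordinate discretization into tokens" without giving a scheme. As a rough illustration of what such a scheme could look like, the sketch below maps a coordinate to a pair of grid-cell tokens. Everything here is an assumption for the example: the uniform degree grid, the default cell size of 0.001 degrees, the token spelling, and the function names. It is not the paper's tokenizer.

```python
import math
from typing import List, Tuple


def coord_tokens(lat: float, lon: float, cell_deg: float = 0.001) -> List[str]:
    """Discretize a coordinate into two grid-cell tokens (hypothetical format)."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinate out of range")
    # Shift to non-negative values, then take the index of the containing cell.
    row = math.floor((lat + 90.0) / cell_deg)
    col = math.floor((lon + 180.0) / cell_deg)
    return [f"<lat_{row}>", f"<lon_{col}>"]


def cell_center(row: int, col: int, cell_deg: float = 0.001) -> Tuple[float, float]:
    """Approximate inverse: the centre of the grid cell, in degrees."""
    return ((row + 0.5) * cell_deg - 90.0, (col + 0.5) * cell_deg - 180.0)
```

With a scheme like this, nearby points share tokens and no station identifier enters the input, which is the property the rebuttal describes. The cell size trades vocabulary size against spatial resolution.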

Circularity Check

0 steps flagged

No significant circularity; results from external dataset and standard training

The paper collects a large external dataset of over 13 million real-world transit records from four Chinese cities, releases it as a pre-training corpus and benchmark, and reports empirical results from training an LLM on this data. The central claim of implicit GPS-to-station grounding and valid route generation is presented as an observed experimental outcome on held-out evaluation tasks, not as a mathematical derivation that reduces to its own inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described methodology. The approach is self-contained against the externally gathered data and follows standard ML dataset/benchmark practices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that historical route data encodes enough structure to substitute for explicit maps.

axioms (1)

domain assumption: Transit route planning patterns can be learned implicitly from historical route data without explicit map or routing engine knowledge.
This premise is required for the map-free claim to hold.

pith-pipeline@v0.9.0 · 5700 in / 1062 out tokens · 53510 ms · 2026-05-22T05:44:05.885346+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 3 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Fast routing in very large public transportation networks using transfer patterns

Hannah Bast, Erik Carlsson, Arno Eigenwillig, Robert Geisberger, Chris Harrelson, Veselin Raychev, and Fabien Viger. Fast routing in very large public transportation networks using transfer patterns. In European Symposium on Algorithms, pages 290–301, 2010

work page 2010
[3]

Route planning in transportation networks

Hannah Bast, Daniel Delling, Andrew Goldberg, Matthias Müller-Hannemann, Thomas Pajor, Peter Sanders, Dorothea Wagner, and Renato F Werneck. Route planning in transportation networks. In Algorithm engineering: Selected results and surveys, pages 19–80. 2016

work page 2016
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

work page 2020
[5]

TripCraft: A benchmark for spatio-temporally fine grained travel planning

Soumyabrata Chaudhuri, Pranav Purkar, Ritwik Raghav, Shubhojit Mallick, Manish Gupta, Abhik Jana, and Shreya Ghosh. TripCraft: A benchmark for spatio-temporally fine grained travel planning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pages 17035–17064, 2025

work page 2025
[6]

Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Zheng Pan, Xin Li, and Yong Liu. TravelBench: A real-world benchmark for multi-turn and tool-augmented travel planning. arXiv preprint arXiv:2512.22673, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Unique in the crowd: The privacy bounds of human mobility

Yves-Alexandre De Montjoye, César A Hidalgo, Michel Verleysen, and Vincent D Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3(1):1376, 2013

work page 2013
[8]

Round-based public transit routing

Daniel Delling, Thomas Pajor, and Renato F Werneck. Round-based public transit routing. Transportation Science, 49(3):591–604, 2015

work page 2015
[9]

Fast and exact public transit routing with restricted pareto sets

Daniel Delling, Julian Dibbelt, and Thomas Pajor. Fast and exact public transit routing with restricted pareto sets. In Proceedings of the Twenty-First Workshop on Algorithm Engineering and Experiments, pages 54–65, 2019

work page 2019
[10]

Connection scan algorithm

Julian Dibbelt, Thomas Pajor, Ben Strasser, and Dorothea Wagner. Connection scan algorithm. Journal of Experimental Algorithmics, 23:1–56, 2018

work page 2018
[11]

A note on two problems in connexion with graphs

Edsger W Dijkstra. A note on two problems in connexion with graphs. In Edsger Wybe Dijkstra: his life, work, and legacy, pages 287–290. 2022

work page 2022
[12]

Bowen Fang, Zixiao Yang, and Xuan Di. TraveLLM: Could you plan my public transit alternatives in face of a network disruption? In 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), pages 4711–4717, 2025

work page 2025
[13]

CityBench: Evaluating the capabilities of large language models for urban tasks

Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. CityBench: Evaluating the capabilities of large language models for urban tasks. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5413–5424, 2025

work page 2025
[14]

Can MLLMs guide me home? A benchmark study on fine-grained visual reasoning from transit maps

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can MLLMs guide me home? A benchmark study on fine-grained visual reasoning from transit maps. arXiv preprint arXiv:2505.18675, 2025

work page arXiv 2025
[15]

Don’t stop pretraining: Adapt language models to domains and tasks

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020

work page 2020
[16]

OpenStreetMap: User-generated street maps

Mordechai Haklay and Patrick Weber. OpenStreetMap: User-generated street maps. IEEE Pervasive Computing, 7(4):12–18, 2008

work page 2008
[17]

A formal basis for the heuristic determination of minimum cost paths

Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

work page 1968
[18]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, et al. Training compute-optimal large language models. In Advances in Neural Information Processing Systems, pages 30016–30030, 2022.

work page 2022
[19]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[20]

Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks

Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. In International Conference on Machine Learning, 2024

work page 2024
[21]

USTBench: Benchmarking and dissecting spatiotemporal reasoning of LLMs as urban agents

Siqi Lai, Yansong Ning, Zirui Yuan, Zhixi Chen, and Hao Liu. USTBench: Benchmarking and dissecting spatiotemporal reasoning of LLMs as urban agents. arXiv preprint arXiv:2505.17572, 2025

work page arXiv 2025
[22]

GridRoute: A benchmark for LLM-based route planning with cardinal movement in grid environments

Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, and Tianbo Ji. GridRoute: A benchmark for LLM-based route planning with cardinal movement in grid environments. arXiv preprint arXiv:2505.24306, 2025

work page arXiv 2025
[23]

GeoLM: Empowering language models for geospatially grounded language understanding

Zekun Li, Wenxuan Zhou, Yao-Yi Chiang, and Muhao Chen. GeoLM: Empowering language models for geospatially grounded language understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5227–5240, 2023

work page 2023
[24]

LLM-A*: Large language model enhanced incremental heuristic search on path planning

Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang. LLM-A*: Large language model enhanced incremental heuristic search on path planning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1087–1102, 2024

work page 2024
[25]

Predicting taxi–passenger demand using streaming data

Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. Predicting taxi–passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393–1402, 2013

work page 2013
[26]

TP-RAG: Benchmarking retrieval-augmented large language model agents for spatiotemporal-aware travel planning

Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, and Hao Liu. TP-RAG: Benchmarking retrieval-augmented large language model agents for spatiotemporal-aware travel planning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12403–12429, 2025

work page 2025
[27]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022

work page 2022
[28]

MapTrace: Scalable data generation for route tracing on maps

Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, and Mohit Goyal. MapTrace: Scalable data generation for route tracing on maps. arXiv preprint arXiv:2512.19609, 2025

work page arXiv 2025
[29]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

work page 2020
[30]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023

work page 2023
[31]

ChinaTravel: An open-ended benchmark for language agents in Chinese travel planning

Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, and Yu-Feng Li. ChinaTravel: An open-ended benchmark for language agents in Chinese travel planning. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle, 2025

work page 2025
[32]

TRIP-Bench: A benchmark for long-horizon interactive agents in real-world scenarios

Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, et al. TRIP-Bench: A benchmark for long-horizon interactive agents in real-world scenarios. arXiv preprint arXiv:2602.01675, 2026

work page arXiv 2026
[33]

MobilityBench: A benchmark for evaluating route-planning agents in real-world mobility scenarios

Zhiheng Song, Jingshuai Zhang, Chuan Qin, et al. MobilityBench: A benchmark for evaluating route-planning agents in real-world mobility scenarios. arXiv preprint arXiv:2602.22638, 2026

work page arXiv 2026
[34]

On the planning abilities of large language models – a critical investigation

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models – a critical investigation. In Advances in Neural Information Processing Systems, volume 36, pages 75993–76005, 2023

work page 2023
[35]

TripTailor: A real-world benchmark for personalized travel planning

Kaimin Wang, Yuanzhe Shen, Changze Lv, Xiaoqing Zheng, and Xuan-Jing Huang. TripTailor: A real-world benchmark for personalized travel planning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9705–9723, 2025

work page 2025
[36]

China public transport operation network dataset (CPTOND-2025): National-scale bus-metro vector dataset

Liang Wang, He Wei, Yu Guan, Libin Ouyang, DanDan Xu, Xuehua Han, Min Zhang, Meng Chen, Daosheng Sun, Daqing Gong, et al. China public transport operation network dataset (CPTOND-2025): National-scale bus-metro vector dataset. Scientific Data, 2026.

work page 2025
[37]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

work page 2022
[38]

Leveraging the general transit feed specification for efficient transit analysis

James Wong. Leveraging the general transit feed specification for efficient transit analysis. Transportation Research Record, 2338(1):11–19, 2013

work page 2013
[39]

TravelPlanner: A benchmark for real-world planning with language agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. TravelPlanner: A benchmark for real-world planning with language agents. In International Conference on Machine Learning, pages 54590–54613, 2024

work page 2024
[40]

MapBench: Can large vision language models read maps like a human?

Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, and Zhengzhong Tu. MapBench: Can large vision language models read maps like a human? arXiv preprint arXiv:2503.14607, 2025

work page arXiv 2025
[41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

T-drive: Driving directions based on taxi trajectories

Jing Yuan, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, and Yan Huang. T-drive: Driving directions based on taxi trajectories. In Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 99–108, 2010

work page 2010
[43]

NATURAL PLAN: Benchmarking LLMs on natural language planning

Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. NATURAL PLAN: Benchmarking LLMs on natural language planning. arXiv preprint arXiv:2406.04520, 2024

work page arXiv 2024
[44]

Trajectory data mining: an overview

Yu Zheng. Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology, 6(3):1–41, 2015

work page 2015
[45]

GeoLife: A collaborative social networking service among user, location and trajectory

Yu Zheng, Xing Xie, and Wei-Ying Ma. GeoLife: A collaborative social networking service among user, location and trajectory. IEEE Data Engineering Bulletin, 33(2):32–39, 2010.

work page arXiv 2010


Fast and exact public transit routing with restricted pareto sets

Daniel Delling, Julian Dibbelt, and Thomas Pajor. Fast and exact public transit routing with restricted pareto sets. InProceedings of the Twenty-First Workshop on Algorithm Engineering and Experiments, pages 54–65, 2019

work page 2019

[10] [10]

Connection scan algorithm.Journal of Experimental Algorithmics, 23:1–56, 2018

Julian Dibbelt, Thomas Pajor, Ben Strasser, and Dorothea Wagner. Connection scan algorithm.Journal of Experimental Algorithmics, 23:1–56, 2018

work page 2018

[11] [11]

A note on two problems in connexion with graphs

Edsger W Dijkstra. A note on two problems in connexion with graphs. InEdsger Wybe Dijkstra: his life, work, and legacy, pages 287–290. 2022

work page 2022

[12] [12]

Bowen Fang, Zixiao Yang, and Xuan Di. TraveLLM: Could you plan my public transit alternatives in face of a network disruption? In2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), pages 4711–4717, 2025

work page 2025

[13] [13]

CityBench: Evaluating the capabilities of large language models for urban tasks

Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, and Yong Li. CityBench: Evaluating the capabilities of large language models for urban tasks. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5413–5424, 2025

work page 2025

[14] [14]

Can mllms guide me home? a benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, and Xinchao Wang. Can MLLMs guide me home? A benchmark study on fine-grained visual reasoning from transit maps.arXiv preprint arXiv:2505.18675, 2025

work page arXiv 2025

[15] [15]

Don’t stop pretraining: Adapt language models to domains and tasks

Suchin Gururangan, Ana Marasovi ´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, 2020

work page 2020

[16] [16]

OpenStreetMap: User-generated street maps.IEEE Pervasive Computing, 7(4):12–18, 2008

Mordechai Haklay and Patrick Weber. OpenStreetMap: User-generated street maps.IEEE Pervasive Computing, 7(4):12–18, 2008

work page 2008

[17] [17]

A formal basis for the heuristic determination of minimum cost paths.IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths.IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968

work page 1968

[18] [18]

Training compute- optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, et al. Training compute- optimal large language models. InAdvances in Neural Information Processing Systems, pages 30016– 30030, 2022. 10

work page 2022

[19] [19]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025

[20] [20]

Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks

Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. InInternational Conference on Machine Learning, 2024

work page 2024

[21] [21]

USTBench: Benchmarking and dissecting spatiotemporal reasoning of LLMs as urban agents.arXiv preprint arXiv:2505.17572, 2025

Siqi Lai, Yansong Ning, Zirui Yuan, Zhixi Chen, and Hao Liu. USTBench: Benchmarking and dissecting spatiotemporal reasoning of LLMs as urban agents.arXiv preprint arXiv:2505.17572, 2025

work page arXiv 2025

[22] [22]

GridRoute: A benchmark for LLM-based route planning with cardinal movement in grid environments

Kechen Li, Yaotian Tao, Ximing Wen, Quanwei Sun, Zifei Gong, Chang Xu, Xizhe Zhang, and Tianbo Ji. GridRoute: A benchmark for LLM-based route planning with cardinal movement in grid environments. arXiv preprint arXiv:2505.24306, 2025

work page arXiv 2025

[23] [23]

GeoLM: Empowering language models for geospatially grounded language understanding

Zekun Li, Wenxuan Zhou, Yao-Yi Chiang, and Muhao Chen. GeoLM: Empowering language models for geospatially grounded language understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5227–5240, 2023

work page 2023

[24] [24]

LLM-A*: Large language model enhanced incremental heuristic search on path planning

Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang. LLM-A*: Large language model enhanced incremental heuristic search on path planning. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 1087–1102, 2024

work page 2024

[25] [25]

Predicting taxi–passenger demand using streaming data.IEEE Transactions on Intelligent Transportation Systems, 14 (3):1393–1402, 2013

Luis Moreira-Matias, Joao Gama, Michel Ferreira, Joao Mendes-Moreira, and Luis Damas. Predicting taxi–passenger demand using streaming data.IEEE Transactions on Intelligent Transportation Systems, 14 (3):1393–1402, 2013

work page 2013

[26] [26]

TP-RAG: Benchmarking retrieval-augmented large language model agents for spatiotemporal-aware travel planning

Hang Ni, Fan Liu, Xinyu Ma, Lixin Su, Shuaiqiang Wang, Dawei Yin, Hui Xiong, and Hao Liu. TP-RAG: Benchmarking retrieval-augmented large language model agents for spatiotemporal-aware travel planning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12403–12429, 2025

work page 2025

[27] [27]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022

work page 2022

[28] [28]

MapTrace: Scalable data generation for route tracing on maps.arXiv preprint arXiv:2512.19609, 2025

Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, and Mohit Goyal. MapTrace: Scalable data generation for route tracing on maps.arXiv preprint arXiv:2512.19609, 2025

work page arXiv 2025

[29] [29]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. InSC20: international conference for high performance computing, networking, storage and analysis, pages 1–16. IEEE, 2020

work page 2020

[30] [30]

Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dess`ı, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in Neural Information Processing Systems, 36:68539–68551, 2023

work page 2023

[31] [31]

ChinaTravel: An open-ended benchmark for language agents in Chinese travel planning

Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, and Yu-Feng Li. ChinaTravel: An open-ended benchmark for language agents in Chinese travel planning. InNeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle, 2025

work page 2025

[32] [32]

TRIP-Bench: A benchmark for long-horizon interactive agents in real-world scenarios.arXiv preprint arXiv:2602.01675, 2026

Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, et al. TRIP-Bench: A benchmark for long-horizon interactive agents in real-world scenarios.arXiv preprint arXiv:2602.01675, 2026

work page arXiv 2026

[33] [33]

MobilityBench: A benchmark for evaluating route- planning agents in real-world mobility scenarios.arXiv preprint arXiv:2602.22638, 2026

Zhiheng Song, Jingshuai Zhang, Chuan Qin, et al. MobilityBench: A benchmark for evaluating route- planning agents in real-world mobility scenarios.arXiv preprint arXiv:2602.22638, 2026

work page arXiv 2026

[34] [34]

On the planning abilities of large language models – a critical investigation

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models – a critical investigation. InAdvances in Neural Information Processing Systems, volume 36, pages 75993–76005, 2023

work page 2023

[35] [35]

TripTailor: A real- world benchmark for personalized travel planning

Kaimin Wang, Yuanzhe Shen, Changze Lv, Xiaoqing Zheng, and Xuan-Jing Huang. TripTailor: A real- world benchmark for personalized travel planning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 9705–9723, 2025

work page 2025

[36] [36]

China public transport operation network dataset (CPTOND-2025): National-scale bus-metro vector dataset.Scientific Data, 2026

Liang Wang, He Wei, Yu Guan, Libin Ouyang, DanDan Xu, Xuehua Han, Min Zhang, Meng Chen, Daosheng Sun, Daqing Gong, et al. China public transport operation network dataset (CPTOND-2025): National-scale bus-metro vector dataset.Scientific Data, 2026. 11

work page 2025

[37] [37]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. InInternational Conference on Learning Representations, 2022

work page 2022

[38] [38]

Leveraging the general transit feed specification for efficient transit analysis.Transportation Research Record, 2338(1):11–19, 2013

James Wong. Leveraging the general transit feed specification for efficient transit analysis.Transportation Research Record, 2338(1):11–19, 2013

work page 2013

[39] [39]

TravelPlanner: A benchmark for real-world planning with language agents

Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. TravelPlanner: A benchmark for real-world planning with language agents. InInternational Conference on Machine Learning, pages 54590–54613, 2024

work page 2024

[40] [40]

Can large vision language models read maps like a human?arXiv preprint arXiv:2503.14607, 2025

Shuo Xing, Zezhou Sun, Shuangyu Xie, Kaiyuan Chen, Yanjia Huang, Yuping Wang, Jiachen Li, Dezhen Song, and Zhengzhong Tu. MapBench: Can large vision language models read maps like a human?arXiv preprint arXiv:2503.14607, 2025

work page arXiv 2025

[41] [41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

T-drive: Driving directions based on taxi trajectories

Jing Yuan, Yu Zheng, Chengyang Zhang, Wenlei Xie, Xing Xie, Guangzhong Sun, and Yan Huang. T-drive: Driving directions based on taxi trajectories. InProceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 99–108, 2010

work page 2010

[43] [43]

falling piece

Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V Le, Ed H Chi, et al. NATURAL PLAN: Benchmarking LLMs on natural language planning.arXiv preprint arXiv:2406.04520, 2024

work page arXiv 2024

[44] [44]

Trajectory data mining: an overview.ACM Transactions on Intelligent Systems and Technology, 6(3):1–41, 2015

Yu Zheng. Trajectory data mining: an overview.ACM Transactions on Intelligent Systems and Technology, 6(3):1–41, 2015

work page 2015

[45] [45]

query":

Yu Zheng, Xing Xie, and Wei-Ying Ma. GeoLife: A collaborative social networking service among user, location and trajectory.IEEE Data Engineering Bulletin, 33(2):32–39, 2010. 12 A Data Visualization Figure 3 visualizes the spatial distribution of route planning origins across the four cities. The heatmaps reveal dense coverage in urban cores with natural ...

work page arXiv 2010