
arxiv: 2606.01745 · v1 · pith:KT3QW47Onew · submitted 2026-06-01 · 💻 cs.SI

Enhancing the Socioeconomic Understanding of Foundation Models with Urban Mobility

Pith reviewed 2026-06-28 12:08 UTC · model grok-4.3

classification 💻 cs.SI
keywords foundation models · urban mobility · socioeconomic prediction · multimodal fusion · LLM prompting · graph connectors · urban analytics

The pith

Incorporating mobility networks improves foundation models' socioeconomic predictions for urban areas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether human mobility patterns can help foundation models understand urban socioeconomic conditions better than static inputs alone. Models currently rely on POI text, satellite imagery, and geospatial descriptions, but these miss how places connect through people's movements. The authors introduce MobFusion, a framework with three integration methods: using mobility as context for zero-shot LLM prompts, as graph connectors between visual and text embeddings, and as structured tokens in multimodal LLMs. Experiments with large-scale anonymized mobility data from three U.S. metropolitan areas show gains on tasks predicting median household income, population density, and crime.

Core claim

Mobility networks can elicit the geospatial capabilities of foundation models by explicitly encoding connectivity among urban entities that static attributes such as POI text and satellite imagery do not capture. MobFusion, instantiated in three complementary designs on anonymized large-scale mobility datasets from three U.S. metropolitan areas, improves urban prediction tasks including median household income, population density, and crime prediction.

What carries the argument

MobFusion, a modular mobility-enhanced foundation model fusion paradigm with three designs: mobility networks as contexts for zero-shot LLM prompting, as graph connectors for fusing geospatial visual embeddings with textual embeddings, and as structured tokens for multimodal LLM reasoning.
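The first design, mobility as context for zero-shot prompting, can be pictured with a small sketch. Everything below is an assumption made for illustration: the function name `build_mobility_prompt`, the `(origin_cbg, destination_poi, visit_count)` flow schema, and the prompt wording are hypothetical and are not taken from the paper's templates.

```python
from collections import defaultdict


def build_mobility_prompt(cbg_id, poi_names, flows, top_k=3):
    """Compose a zero-shot prompt from intrinsic POI names plus top-k mobility links.

    flows: iterable of (origin_cbg, destination_poi, visit_count) tuples.
    This schema is a hypothetical stand-in for an anonymized mobility dataset.
    """
    # Sum visits from this CBG to each destination POI.
    visits = defaultdict(int)
    for origin, poi, count in flows:
        if origin == cbg_id:
            visits[poi] += count

    # Most visited first; ties broken by name so the output is deterministic.
    top = sorted(visits.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

    lines = [
        f"Census block group: {cbg_id}",
        "POIs inside this area: " + ", ".join(poi_names),
    ]
    if top:
        lines.append("Most visited destinations from this area:")
        lines.extend(f"- {poi}: {count} visits" for poi, count in top)
    lines.append("Estimate the median household income of this area in USD.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up toy flows, used only to show the prompt shape.
    toy_flows = [("cbg_1", "Cafe A", 5), ("cbg_1", "Gym B", 12), ("cbg_2", "Cafe A", 7)]
    print(build_mobility_prompt("cbg_1", ["Library", "Bakery"], toy_flows))
```

When the flow list has no entries for the target CBG, the sketch falls back to a POI-only prompt, which mirrors the POI-only versus mobility-aware contrast described in the figure captions.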

If this is right

  • Mobility integration improves accuracy on socioeconomic prediction tasks such as income, density, and crime across multiple cities.
  • Three fusion designs offer complementary ways to combine mobility with existing foundation model inputs.
  • Foundation models acquire better geospatial understanding when mobility patterns are explicitly included.
  • Urban applications that rely on socioeconomic forecasts can use mobility-enhanced models for higher performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Mobility fusion techniques could extend to other spatial prediction domains such as traffic flow or land-use change.
  • Testing on cities outside the U.S. or with different mobility data resolutions would clarify the scope of the gains.
  • The results imply that dynamic network data can serve as a general complement to static geospatial features in multimodal models.

Load-bearing premise

Mobility networks provide connectivity information among urban places that static attributes like POI text and satellite imagery cannot capture.
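A toy sketch can make this premise concrete: two areas with identical static vectors become distinguishable once their visit patterns are mixed in. The function `fuse_with_mobility` and the one-step visit-weighted averaging are assumptions chosen for illustration. They stand in for, and are much simpler than, whatever graph connector the paper actually uses, which this page does not specify.

```python
def fuse_with_mobility(static_vec, poi_vecs, visit_weights):
    """Concatenate a CBG's static vector with a visit-weighted mean of POI vectors.

    static_vec:    list of floats (e.g. a visual embedding), hypothetical.
    poi_vecs:      dict mapping POI name -> list of floats (e.g. a text embedding).
    visit_weights: dict mapping POI name -> visit count from this CBG.
    """
    dim = len(next(iter(poi_vecs.values())))
    total = sum(visit_weights.values())
    aggregated = [0.0] * dim
    if total > 0:
        for poi, weight in visit_weights.items():
            for i, value in enumerate(poi_vecs[poi]):
                aggregated[i] += (weight / total) * value
    return list(static_vec) + aggregated


if __name__ == "__main__":
    # Made-up two-dimensional embeddings for two POIs.
    poi_vecs = {"grocery": [1.0, 0.0], "museum": [0.0, 1.0]}
    same_static = [0.5, 0.5]  # both CBGs look identical without mobility

    cbg_a = fuse_with_mobility(same_static, poi_vecs, {"grocery": 9, "museum": 1})
    cbg_b = fuse_with_mobility(same_static, poi_vecs, {"grocery": 1, "museum": 9})

    print(cbg_a)  # static part equal to cbg_b, mobility part differs
    print(cbg_b)
```

If the mobility-derived part carried no extra information, the two fused vectors would coincide, which is the situation the premise rules out.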

What would settle it

An experiment on the same urban prediction tasks where adding mobility networks produces no accuracy gain or produces lower accuracy than the static-attribute baselines.
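One way such an experiment could be scored is a paired comparison of the static-only and mobility-enhanced runs on the same evaluation units. The sketch below is an assumption, not the paper's evaluation protocol: `paired_gain_ci` is a hypothetical helper, scores are assumed to be higher-is-better (for example R² per fold or per city), and a percentile bootstrap is one of several reasonable choices.

```python
import random
from statistics import mean


def paired_gain_ci(baseline, with_mobility, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean gain, CI lower, CI upper) for paired higher-is-better scores."""
    if not baseline or len(baseline) != len(with_mobility):
        raise ValueError("need two non-empty score lists of equal length")

    # Per-unit improvement from adding mobility.
    diffs = [m - b for b, m in zip(baseline, with_mobility)]

    # Percentile bootstrap over the paired differences.
    rng = random.Random(seed)
    boots = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boots[int(alpha / 2 * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), lower, upper


if __name__ == "__main__":
    # Placeholder numbers invented for this sketch; they are NOT results from the paper.
    static_only = [0.40, 0.42, 0.38, 0.41]
    mobility = [0.45, 0.44, 0.43, 0.46]
    gain, lower, upper = paired_gain_ci(static_only, mobility)
    print(f"mean gain={gain:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Under this reading, an interval that includes zero or lies below it would correspond to the "no accuracy gain or lower accuracy" outcome described above.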

Figures

Figures reproduced from arXiv: 2606.01745 by Alok Prakash, Baoshen Guo, Donghang Li, Heye Huang, Kailai Sun, Shenhao Wang, Zhiqing Hong.

Figure 1: Framework of MobFusion. Intrinsic census block group (CBG) features (Vision, POI text) and relational mobility network are encoded by a set of foundation models Φ and fused via f_θ for downstream geospatial prediction.
    Adjacent paper text (truncated in the source): …tual and visual knowledge of urban entities, while human mobility provides complementary relational signals about how these entities are connected. Specifically, as shown in…
Figure 2: Zero-shot prompt templates: (a) prompt with only intrinsic POI features (sampled POI names and…
Figure 3: CBG-POI mobility network as the connector
Figure 5: Spatial distribution of median household in…
Figure 6: UMAP visualization of CBG embeddings in Chicago, colored by percentile rank of income, density, and crime (columns).
    Adjacent paper text (Section 5.5, Mobility-aware MLLM Performance; truncated in the source): To evaluate whether mobility graph tokens enhance MLLMs, we compare several input variants of MobFusion-T along three dimensions: the visual feature used to construct the graph token, the visual input, and the textual prompt. For the graph token, CBG nodes…
Figure 7: Spatial distribution of population density per…
Figure 9: UMAP visualization of CBG embeddings in Boston, colored by percentile rank of income, density, and crime (columns). Rows compare AlphaEarth, POI embedding, and MobFusion-G embedding
Figure 11: Full POI-only prompt template instantiated on a representative. No mobility information is provided.
Figure 12: Full mobility-aware prompt template (MobFusion-C) on the same Boston CBG as…
Original abstract

Foundation models have recently been applied to urban socioeconomic prediction using POI text, satellite imagery, and geospatial descriptions. However, these models mostly rely on static attributes of individual places, while ignoring the mobility patterns that reveal how places are functionally connected. To address this gap, we explore whether mobility networks can elicit the geospatial capabilities of foundation models by explicitly encoding connectivity among urban entities. We propose MobFusion, a modular mobility-enhanced foundation model fusion paradigm, and instantiate it through three complementary designs: (i) mobility networks as contexts for zero-shot LLM prompting, (ii) as graph connectors for fusing geospatial visual embeddings with textual embeddings, and (iii) as structured tokens for multimodal LLM reasoning. Using anonymized large-scale mobility datasets from three U.S. metropolitan areas, we find that MobFusion improves urban prediction tasks (e.g., median household income, population density, and crime prediction) across three instantiations, demonstrating that incorporating human mobility can effectively improve the socioeconomic understanding of foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MobFusion, a modular paradigm for fusing urban mobility networks with foundation models to enhance their socioeconomic understanding. It instantiates the approach in three complementary ways—mobility networks as contexts for zero-shot LLM prompting, as graph connectors for fusing visual and textual embeddings, and as structured tokens for multimodal LLM reasoning—and reports that these yield performance gains on urban prediction tasks (median household income, population density, crime) using anonymized large-scale mobility data from three U.S. metropolitan areas.

Significance. If the empirical gains prove robust, the work would indicate that mobility-derived functional connectivity supplies geospatial signals absent from static POI text or satellite imagery, thereby extending foundation-model applications in urban socioeconomic modeling. The modular design, which supports multiple fusion strategies, is a constructive contribution that could facilitate further experimentation.

major comments (2)
  1. §5 (Experimental Evaluation): The reported improvements lack control experiments that preserve input volume, dimensionality, and architecture while destroying mobility structure (e.g., random rewiring of the mobility graph that retains degree sequence). Without such ablations, it remains unclear whether gains arise from the claimed connectivity signal or from the added fusion mechanisms and data volume themselves; this directly bears on the central claim that mobility networks elicit unique geospatial capabilities.
  2. §4 (MobFusion Instantiations): The descriptions of the three fusion designs do not specify the precise encoding of mobility networks (e.g., how edges are tokenized or how graph connectors are constructed) or include ablation variants that isolate connectivity from other mobility-derived features, making it difficult to attribute performance differences to the functional connectivity asserted in the abstract.
minor comments (2)
  1. The abstract would be strengthened by the inclusion of concrete quantitative deltas, baseline models, and statistical significance measures to support the performance claims.
  2. Notation for the three instantiations could be made more consistent across the text and figures to improve readability.
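The degree-preserving rewiring control named in major comment 1 can be sketched with double-edge swaps. This is a minimal illustration under stated assumptions: the graph is treated as unweighted and simple, edges are stored as `(cbg, poi)` pairs, and `degree_preserving_rewire` is a hypothetical helper rather than anything from the paper. Visit weights, which a real mobility graph would carry, are not handled here.

```python
import random
from collections import Counter


def degree_preserving_rewire(edges, n_swaps, seed=0, max_tries=10000):
    """Shuffle who connects to whom while keeping every node's degree fixed.

    Each accepted swap turns (a, b), (c, d) into (a, d), (c, b). Because the
    first and second positions are never mixed, edges stored as (cbg, poi)
    stay bipartite.
    """
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = {frozenset(e) for e in edge_list}
    rng = random.Random(seed)

    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        # Reject swaps that would create a self-loop or a duplicate edge.
        if len({a, b, c, d}) < 4:
            continue
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def degree_counts(edges):
    """Number of incident edges per node."""
    return Counter(node for edge in edges for node in edge)


if __name__ == "__main__":
    # Made-up toy CBG-POI edges.
    toy = [("cbg_1", "poi_a"), ("cbg_1", "poi_b"), ("cbg_2", "poi_c"), ("cbg_3", "poi_d")]
    rewired = degree_preserving_rewire(toy, n_swaps=3)
    print(rewired)
    print(degree_counts(toy) == degree_counts(rewired))
```

Feeding the rewired edges through the same fusion pipeline keeps input volume and architecture fixed, so any remaining gain over the static baseline could not be attributed to the specific connectivity pattern.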

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested controls and clarifications.

  1. Referee: §5 (Experimental Evaluation): The reported improvements lack control experiments that preserve input volume, dimensionality, and architecture while destroying mobility structure (e.g., random rewiring of the mobility graph that retains degree sequence). Without such ablations, it remains unclear whether gains arise from the claimed connectivity signal or from the added fusion mechanisms and data volume themselves; this directly bears on the central claim that mobility networks elicit unique geospatial capabilities.

    Authors: We agree that such controls are necessary to isolate the contribution of mobility structure. In the revised manuscript we will add random-rewiring ablations that preserve degree sequence, input volume, and model architecture for all three fusion designs. Results will be reported alongside the original experiments to directly test whether performance gains depend on functional connectivity rather than data volume or fusion mechanics alone.

    Revision: yes

  2. Referee: §4 (MobFusion Instantiations): The descriptions of the three fusion designs do not specify the precise encoding of mobility networks (e.g., how edges are tokenized or how graph connectors are constructed) or include ablation variants that isolate connectivity from other mobility-derived features, making it difficult to attribute performance differences to the functional connectivity asserted in the abstract.

    Authors: We acknowledge that greater technical detail is required. The revision will expand §4 with explicit specifications of edge tokenization, graph-connector construction, and embedding fusion steps for each of the three designs. We will also add ablation variants that disrupt connectivity (e.g., random edge permutation or feature ablation) while retaining other mobility-derived statistics, allowing clearer attribution of gains to functional connectivity.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical fusion experiments with measured outcomes


The paper proposes the MobFusion paradigm and its three instantiations, then reports empirical performance gains on socioeconomic prediction tasks using real mobility datasets from three metropolitan areas. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The central claim rests on experimental results rather than any quantity defined in terms of its own inputs, satisfying the default expectation of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only the abstract is available; no explicit free parameters, mathematical axioms, or externally validated invented entities are described.

invented entities (1)
  • MobFusion (no independent evidence)
    purpose: Modular mobility-enhanced foundation model fusion paradigm instantiated in three designs
    Introduced as the central technical contribution without independent external evidence cited in the abstract.

pith-pipeline@v0.9.1-grok · 5720 in / 1124 out tokens · 33687 ms · 2026-06-28T12:08:32.711212+00:00 · methodology

