From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

Bolei Zhou; Honglin He; Yukai Ma; Zhizheng Liu

arxiv: 2606.12603 · v1 · pith:D5DEA7PCnew · submitted 2026-06-10 · 💻 cs.RO · cs.AI


Honglin He, Zhizheng Liu, Yukai Ma, Bolei Zhou

Pith reviewed 2026-06-27 09:28 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords sidewalk navigation · flow matching · imitation learning · human preference learning · autonomous navigation · monocular camera · long-horizon tasks · social compliance


The pith

FlowPilot pre-trains a sidewalk navigation policy with anchored flow matching on robot fleet data then aligns it via human preference tuning on intervention data to handle long-horizon tasks with only a monocular camera.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets long-horizon sidewalk navigation for micro-mobility uses such as delivery robots, where unpredictable terrain and pedestrians must be handled with a minimal single-camera sensor stack. Standard imitation learning produces policies prone to compounding errors, weak social compliance, and poor handling of complex situations. The authors therefore pre-train with anchored flow matching to capture multimodal action distributions from large fleet data, then apply human-in-the-loop preference learning on limited intervention traces to strengthen counterfactual reasoning. A sympathetic reader would care because the two-stage process turns raw imitation into a more aligned policy without requiring dense human labels or heavy perception hardware.

Core claim

FlowPilot achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. Anchored flow matching serves as the action representation for policy pre-training on large-scale robot fleet data to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. A subsequent human-in-the-loop preference learning scheme tunes the policy on a small amount of human intervention data, strengthening counterfactual reasoning and social compliance on sidewalks.

What carries the argument

Anchored flow matching as the action representation for pre-training, combined with human-in-the-loop preference learning on intervention data for alignment.
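As a rough illustration of what a flow-matching action representation involves, the sketch below builds one training pair in the linear-interpolation (rectified-flow) form and integrates a velocity field with Euler steps. How FlowPilot anchors the flow is not spelled out on this page, so drawing the source sample as Gaussian noise around an anchor trajectory is an assumption, as are all names used here (`sample_source`, `make_training_pair`, `euler_sample`, `sigma`).

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]  # sequence of (x, y) waypoints


def sample_source(anchor: Trajectory, sigma: float, rng: random.Random) -> Trajectory:
    """ASSUMPTION: the source sample is Gaussian noise centred on an anchor trajectory."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in anchor]


def make_training_pair(anchor: Trajectory, expert: Trajectory, t: float,
                       sigma: float, rng: random.Random) -> Tuple[Trajectory, Trajectory]:
    """Return (x_t, target_velocity) for linear-interpolation flow matching."""
    x0 = sample_source(anchor, sigma, rng)
    # Point at time t on the straight path from the source sample x0 to the expert x1.
    x_t = [((1 - t) * a[0] + t * b[0], (1 - t) * a[1] + t * b[1])
           for a, b in zip(x0, expert)]
    # Regression target for a velocity network: the constant displacement x1 - x0.
    velocity = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(x0, expert)]
    return x_t, velocity


def euler_sample(x0: Trajectory,
                 velocity_fn: Callable[[Trajectory, float], Trajectory],
                 steps: int = 10) -> Trajectory:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [(p[0] + dt * u[0], p[1] + dt * u[1]) for p, u in zip(x, v)]
    return x
```

In a trained policy `velocity_fn` would be a learned network conditioned on the camera observation; running `euler_sample` from several anchors is one plausible way to obtain multiple trajectory candidates.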

If this is right

FlowPilot reaches 42 percent success rate and 66 percent route completion in simulation.
FlowPilot-HP reduces the intervention rate (IR) by 40.0 percent and the Normalized Intervention Rate (NIR) by 52.1 percent relative to the base model in real-world tests.
The preference-tuned policy improves real-world robustness and social compliance over imitation-only training.
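The real-world figures are relative reductions, so they fix only the ratio between the tuned and base rates, not the absolute values. A minimal sketch of the arithmetic follows; the base rate of 0.10 is a made-up placeholder, not a number from the paper, and the function names are illustrative.

```python
def relative_reduction(base: float, tuned: float) -> float:
    """Fraction by which `tuned` is lower than `base`."""
    if base <= 0:
        raise ValueError("base rate must be positive")
    return (base - tuned) / base


def apply_reduction(base: float, reduction: float) -> float:
    """Rate implied by applying a relative reduction to `base`."""
    return base * (1.0 - reduction)


# Hypothetical base intervention rate, for illustration only.
base_ir = 0.10
tuned_ir = apply_reduction(base_ir, 0.40)  # the reported 40.0% relative reduction
```

Without the absolute base rates, a 40.0 percent reduction could correspond to a large or a small change in metres driven under takeover.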

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-stage recipe could be applied to other long-horizon robotic tasks that require social awareness, such as indoor service robots.
Small preference datasets may suffice to correct imitation failures in many navigation domains, lowering the cost of human data collection.
The monocular-camera constraint suggests the method might combine with additional low-cost sensors without redesigning the core policy architecture.

Load-bearing premise

The small amount of human intervention data is representative enough to strengthen counterfactual reasoning and social compliance without introducing new biases or distribution shifts that degrade performance on unseen sidewalk scenarios.

What would settle it

A controlled test on a new collection of diverse sidewalk routes where the human-preference-tuned policy shows higher intervention rates or lower success than the base FlowPilot model would falsify the alignment benefit.

Figures

Figures reproduced from arXiv: 2606.12603 by Bolei Zhou, Honglin He, Yukai Ma, Zhizheng Liu.

**Figure 1.** Long-horizon sidewalk navigation for a food delivery robot (the pink bot shown in the upper-left corner). Equipped with only a monocular camera and a GPS, the wheeled robot navigates complex sidewalk terrains, avoids pedestrians and obstacles, and follows social norms to reach its destination safely. We show the first-person views of the robot and the trajectories from FlowPilot.

**Figure 2.** Architecture and training pipeline of FlowPilot. Given RGB observations and an optional goal, the encoder first extracts scene tokens, which are processed by self-attention layers and then used as keys and values in gated cross-attention. A set of noisy trajectory queries, together with learnable anchor registers, is denoised to generate multiple trajectory candidates. During pretraining, the model is su…

**Figure 3.** Mitigating attention sink with gated attention. The proportion of attention allocated to the goal / anchor across layers. The baseline exhibits attention-sink behavior, in which a large fraction of attention is concentrated on the goal, whereas gated attention substantially reduces this concentration and encourages better context utilization.
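The quantity plotted in Figure 3 can be approximated as the share of softmax attention that lands on the goal or anchor keys. A minimal sketch, assuming raw attention scores are available per query for a single head; the function names and data layout are illustrative and not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_share(score_rows: Sequence[Sequence[float]],
                    key_ids: Sequence[int]) -> float:
    """Mean fraction of attention that queries place on the keys in `key_ids`."""
    shares = []
    for row in score_rows:
        weights = softmax(row)
        shares.append(sum(weights[k] for k in key_ids))
    return sum(shares) / len(shares)
```

A value close to 1.0 for the goal key would indicate the sink behavior the caption describes; computing it per layer gives a curve comparable in spirit to the figure.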

**Figure 5.** Effectiveness of pretraining on mixed dataset.

| Method | minMOE ↓ | minADE ↓ | L2 ↓ | mAP ↑ |
| --- | --- | --- | --- | --- |
| GNM [5] | 7.31 | 0.63 | 1.32 | 0.63 |
| ViNT [6] | 8.51 | 0.74 | 1.54 | 0.74 |
| NoMaD [8] | 13.77 | 1.34 | 2.71 | 0.63 |
| CityWalker [9] | 8.94 | 0.71 | 1.48 | 0.77 |
| MIMIC [27] | 9.31 | 0.52 | 1.43 | 0.69 |
| S2E [15] | 6.77 | 0.46 | 1.73 | 0.81 |
| DiffusionDrive [47] | 6.96 | 0.63 | 1.71 | 0.77 |
| FlowPilot-Base | 6.63 | 0.49 | 1.04 | 0.87 |
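For readers unfamiliar with the minADE column, the sketch below computes it in its usual form: the lowest average displacement error over a set of candidate trajectories. The paper's exact horizon, units, and candidate count are not given on this page, so treat this as the common definition rather than the authors' evaluation code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average Euclidean displacement between matched waypoints."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("trajectories must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_ade(candidates: Sequence[Sequence[Point]], gt: Sequence[Point]) -> float:
    """Lowest ADE over all candidate trajectories (the best candidate wins)."""
    return min(ade(c, gt) for c in candidates)
```

Because the minimum is taken over candidates, a multimodal policy is credited when any one of its proposals matches the demonstrated path.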

**Figure 6.** Qualitative comparison in simulated sidewalk environments.

| Method | SR ↑ | RC ↑ | CR ↓ | OBR ↓ |
| --- | --- | --- | --- | --- |
| GNM [24] | 0.19 | 0.39 | 0.68 | 0.01 |
| ViNT [6] | 0.27 | 0.51 | 0.60 | 0.04 |
| NoMaD [8] | 0.21 | 0.45 | 0.74 | 0.04 |
| MBRA [11] | 0.31 | 0.50 | 0.54 | 0.15 |
| CityWalker [25] | 0.28 | 0.48 | 0.70 | 0.02 |
| MIMIC [27] | 0.25 | 0.44 | 0.62 | 0.03 |
| S2E [15] | 0.35 | 0.55 | 0.51 | 0.13 |
| FlowPilot w/o GCA | 0.29 | 0.59 | 0.61 | 0.09 |
| FlowPilot-Base | 0.42 | 0.66 | 0.43 | 0.01 |
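The SR and RC columns can be read as per-episode aggregates. The sketch below uses the common definitions (fraction of episodes that reach the goal, and mean fraction of the route covered); the paper's precise success criterion is not stated on this page, so the `Episode` fields and both functions are assumptions for illustration. CR and OBR are left out because their definitions do not appear here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    reached_goal: bool
    distance_completed: float  # metres travelled along the route
    route_length: float        # total route length in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that reached the goal."""
    return sum(e.reached_goal for e in episodes) / len(episodes)


def route_completion(episodes: Sequence[Episode]) -> float:
    """Mean fraction of the route completed, capped at 1.0 per episode."""
    fractions = [min(1.0, e.distance_completed / e.route_length) for e in episodes]
    return sum(fractions) / len(fractions)
```

Under these definitions RC is never below SR, which is consistent with every row of the table.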

**Figure 8.** Qualitative results in diverse real-world sidewalk environments. We visualize the trajectories predicted by FlowPilot and executed by the robot during real-world deployment.

… of Gated Cross Attention (GCA); compared with the one without GCA, FlowPilot-Base improves SR from 0.29 to 0.42 and reduces CR from 0.61 to 0.43, highlighting the importance of GCA for robust context understanding for navigation. …

**Figure 9.** Kilometer-scale long-horizon sidewalk navigation. We evaluate FlowPilot in a long-horizon, complex real-world scenario, demonstrating its robust progress over long distances.

**Figure 10.** Long-horizon sidewalk navigation under challenging nighttime illumination. We evaluate FlowPilot on a long-horizon route at night with low-light regions and complex illumination changes, demonstrating its robustness to degraded visual conditions.

… Normalized Intervention Rate (NIR) [62]. Specifically, IR is the fraction of distance driven under human takeover, i.e., $\mathrm{IR} = d_{\text{intervention}} / d_{\text{total}}$, while NIR m…
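The IR definition quoted above translates directly into code. A minimal sketch, assuming a run is logged as distance segments flagged by whether a human had taken over; the segment format and function name are hypothetical. NIR is not sketched because its definition is cut off on this page.

```python
from typing import Iterable, Tuple


def intervention_rate(segments: Iterable[Tuple[float, bool]]) -> float:
    """IR = d_intervention / d_total for one run.

    `segments` holds (distance_m, under_human_takeover) pairs.
    """
    d_total = 0.0
    d_intervention = 0.0
    for distance, takeover in segments:
        if distance < 0:
            raise ValueError("segment distance must be non-negative")
        d_total += distance
        if takeover:
            d_intervention += distance
    if d_total == 0:
        raise ValueError("run has zero total distance")
    return d_intervention / d_total
```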

**Figure 11.** Robot platforms used in real-world experiments. We use a wheeled robot for the majority of experiments, and deploy a legged robot to assess cross-embodiment generalization.

**Figure 12.** Sidewalk lane keeping in diverse real-world scenes. FlowPilot maintains a stable position within the sidewalk under varying appearances and layouts, producing smooth, safe trajectories.

During deployment, we use a laptop as the operator interface to visualize the live camera stream, predicted trajectory, robot state, and system status. A joystick controller is connected to the deployment laptop and is u…

**Figure 13.** Reactive obstacle avoidance in cluttered sidewalks. FlowPilot selects safe bypass maneuvers around static obstacles and returns to the nominal path after passing them.

… a kilometer-scale daytime route, where the robot maintains consistent goal progress over an extended distance while performing stable sidewalk lane keeping, smooth turning, and obstacle avoidance.

**Figure 14.** Pedestrian-aware navigation without interventions. FlowPilot keeps comfortable clearance, yields in crowded situations, and re-centers once the path is clear.

**Figure 15.** Representative failure cases in crowded or constrained scenes. When faced with ambiguous right-of-way or insufficient clearance, FlowPilot may be overly conservative and stop, requiring human intervention to continue.

**Figure 16.** Cross-embodiment generalization in simulated environments. We further demonstrate the cross-embodiment generalization ability of FlowPilot in closed-loop simulation.

**Figure 17.** Overview of the large-scale pretraining dataset. We pretrain FlowPilot on ∼300 hours of diverse real-world navigation data spanning varied lighting, weather, and scene layouts, including frequent interactions with obstacles and pedestrians.

**Figure 18.** Qualitative comparison between FlowPilot predictions and human interventions. We visualize representative real-world frames with the model-predicted path (cyan) and the trajectory executed during human intervention (green). Across diverse sidewalk scenes, the intervention traces highlight where humans deviate from the model to ensure safety and progress.

Abstract

Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.
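The abstract does not state the preference objective, so the sketch below substitutes the pairwise loss from direct preference optimization [22] to show how one intervention could be turned into a training signal: the human's corrective trajectory is treated as preferred over the policy's own prediction. This is a stand-in rather than the authors' method; the scalar log-probability inputs, the function name, and `beta` are all assumptions, and a flow policy would need some surrogate for these log-probabilities.

```python
import math


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Pairwise preference loss: -log(sigmoid(margin)).

    `win` is the human intervention trajectory, `lose` the policy's prediction.
    `ref_*` are scores from the frozen pre-trained policy (hypothetical inputs).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the tuned policy raises the human trajectory's score relative to the reference and lowers that of its own rejected prediction.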

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pairs anchored flow matching pre-training with small-scale human preference tuning and reports concrete gains on sidewalk navigation, but the abstract supplies almost no experimental details to assess those gains.


The main thing to know is that the authors take a standard imitation-learning setup for mapless sidewalk navigation, replace the usual action head with anchored flow matching to better capture multimodal behaviors from fleet data, then run a second stage of human-in-the-loop preference tuning on a small intervention set. They claim this lifts real-world robustness and social compliance, with FlowPilot-HP cutting the intervention rate (IR) by 40% and the Normalized Intervention Rate (NIR) by 52% relative to the base model.

What is actually new is the concrete ordering: flow-matching pre-training on large robot data followed by preference alignment on sidewalk-specific human corrections. The problem itself (long-horizon navigation with only monocular RGB, no map) is well-motivated for micro-mobility, and the pipeline is straightforward enough that a practitioner could try to reproduce the high-level steps.

The soft spots are all in the evidence. The abstract states the simulation numbers (42% success, 66% route completion) and the relative real-world reductions but gives no baselines, no definition of success or intervention, no trial counts, and no description of how the small human-intervention set was collected or how it overlaps with the test distribution. Without those, the 40% and 52% figures are hard to interpret, and the worry about unmeasured distribution shift is reasonable.

The work is aimed at robotics groups already running imitation pipelines on delivery or assistive platforms who want a lightweight way to add social compliance. A reader who needs a ready-to-adapt recipe for camera-only sidewalk policies will get value; someone looking for a general theoretical advance will not.

It deserves a serious referee. The application is narrow but the method is described at a level that can be reviewed and the empirical claims are falsifiable once the missing details are supplied.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FlowPilot, a mapless long-horizon sidewalk navigation policy that relies solely on monocular RGB input. It pre-trains an anchored flow-matching policy on large-scale robot fleet demonstrations to capture multimodal behaviors, then applies human-in-the-loop preference tuning on a small set of human intervention trajectories to improve counterfactual reasoning and social compliance. Simulation results are reported as 42% success rate and 66% route completion; the preference-tuned FlowPilot-HP variant is claimed to reduce real-world intervention rate (IR) by 40.0% and Normalized Intervention Rate (NIR) by 52.1% relative to the base model.

Significance. If the reported gains are shown to be robust and not artifacts of uncharacterized data shifts, the two-stage imitation-to-alignment pipeline would constitute a practical advance for lightweight, camera-only navigation in unstructured pedestrian environments. The use of flow matching as an action representation and the explicit human-preference stage are technically interesting; however, the absence of baseline comparisons, statistical reporting, and data-coverage metrics in the provided abstract substantially weakens the ability to judge whether the central claim holds.

major comments (2)

[Abstract] Abstract: the headline numerical claims (42% success, 66% route completion; IR −40.0%, NIR −52.1%) are presented without any baseline algorithms, statistical significance tests, definitions of success/intervention, or data-exclusion criteria. These quantities are load-bearing for the central claim that FlowPilot-HP improves robustness and social compliance.
[Abstract] Abstract / evaluation description: the human-intervention dataset is characterized only as “small” with no quantitative information on its size, diversity, coverage of pedestrian/terrain types, or distributional overlap with the test sidewalks. This directly bears on the assumption that preference tuning strengthens counterfactual reasoning without introducing new biases or failure modes.
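To make the request for statistical reporting concrete, the sketch below computes a Wilson score interval for a success rate. The trial count is not given on this page, so the 100-episode example is a hypothetical chosen only to match the reported 42% rate; the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical: 42 successes in 100 simulated episodes.
low, high = wilson_interval(42, 100)
```

With only 100 hypothetical episodes the interval spans close to twenty percentage points, which is why the trial count matters when comparing against baselines a few points apart.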

minor comments (1)

[Abstract] The abstract refers to “anchored flow matching” and “human-preference flow policies” without a brief parenthetical gloss on the key technical distinction from standard flow matching or behavioral cloning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the abstract accordingly to improve clarity and completeness while preserving the manuscript's core claims.


Referee: [Abstract] Abstract: the headline numerical claims (42% success, 66% route completion; IR −40.0%, NIR −52.1%) are presented without any baseline algorithms, statistical significance tests, definitions of success/intervention, or data-exclusion criteria. These quantities are load-bearing for the central claim that FlowPilot-HP improves robustness and social compliance.

Authors: We agree the abstract would benefit from additional context. In revision we will add concise references to the baseline algorithms evaluated in the full paper (standard behavior cloning and other IL policies), note that metrics are averaged over multiple random seeds with statistical details reported in Section 5, and include brief definitions of success rate, route completion, IR, and NIR along with data-exclusion criteria from the experimental protocol.

Revision: yes

Referee: [Abstract] Abstract / evaluation description: the human-intervention dataset is characterized only as “small” with no quantitative information on its size, diversity, coverage of pedestrian/terrain types, or distributional overlap with the test sidewalks. This directly bears on the assumption that preference tuning strengthens counterfactual reasoning without introducing new biases or failure modes.

Authors: We acknowledge that 'small' is insufficiently precise in the abstract. We will revise to report the dataset size (number of trajectories and interventions), note coverage across pedestrian densities and terrain types, and reference the distributional analysis already present in Section 4.2 that demonstrates overlap with test environments and supports improved counterfactual reasoning without new failure modes.

Revision: yes

Circularity Check

0 steps flagged

No circularity: results are measured experimental outcomes


The paper reports performance metrics (42% success rate, 66% route completion, IR/NIR reductions) as direct measurements from simulation and real-world experiments on policies trained with anchored flow matching pre-training followed by human-in-the-loop preference tuning. These quantities are not derived by construction from the paper's own equations, fitted parameters, or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the central claims to inputs appear in the provided text. The evaluation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.


Reference graph

Works this paper leans on

67 extracted references · 20 canonical work pages · 9 internal anchors

[1] V. Engesser, E. Rombaut, L. Vanhaverbeke, and P. Lebeau. Autonomous delivery solutions for last-mile logistics operations: A literature review and research agenda. Sustainability, 15(3):2774, 2023.
[2] X. Liu, L. Zhang, and T. Zhu. Service robots in my workplace: Effects of employee-service robot co-work experiences on psychological empowerment. Journal of Hospitality Marketing & Management, 34(2):175–203, 2025.
[3] A. Tuomi, I. P. Tussyadiah, and J. Stienmetz. Applications and implications of service robots in hospitality. Cornell Hospitality Quarterly, 62(2):232–247, 2021.
[4] E. Arntz, J. Van Duin, A. Van Binsbergen, L. Tavasszy, and T. Klein. Assessment of readiness of a traffic environment for autonomous delivery robots. Frontiers in Future Transportation, 4:1102302, 2023.
[5] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine. Gnm: A general navigation model to drive any robot. arXiv preprint arXiv:2210.03370, 2022.
[6] D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine. Vint: A foundation model for visual navigation. arXiv preprint arXiv:2306.14846, 2023.
[7] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023.
[8] A. Sridhar, D. Shah, C. Glossop, and S. Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024.
[9] X. Liu, J. Li, Y. Jiang, N. Sujay, Z. Yang, J. Zhang, J. Abanes, J. Zhang, and C. Feng. Citywalker: Learning embodied urban navigation from web-scale videos. arXiv preprint arXiv:2411.17820, 2024.
[10] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
[11] N. Hirose, L. Ignatova, K. Stachowicz, C. Glossop, S. Levine, and D. Shah. Learning to drive anywhere with model-based reannotation. IEEE Robotics and Automation Letters, 11(2):1242–1249, 2025.
[12] H. Karnan, A. Nair, X. Xiao, G. Warnell, S. Pirk, A. Toshev, J. Hart, J. Biswas, and P. Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.
[13] X. Pan, T. Zhang, B. Ichter, A. Faust, J. Tan, and S. Ha. Zero-shot imitation learning from demonstrations for legged robot visual navigation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 679–685. IEEE, 2020.
[14] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.
[15] H. He, Y. Ma, W. Wu, and B. Zhou. From seeing to experiencing: Scaling navigation foundation models with reinforcement learning. arXiv preprint arXiv:2507.22028, 2025.
[16] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
[17] C. Celemin, R. Pérez-Dattari, E. Chisari, G. Franzese, L. de Souza Rosa, R. Prakash, Z. Ajanović, M. Ferraz, A. Valada, J. Kober, et al. Interactive imitation learning in robotics: A survey. Foundations and Trends® in Robotics, 10(1-2):1–197, 2022.
[18] Z. M. Peng, W. Mo, C. Duan, Q. Li, and B. Zhou. Learning from active human involvement through proxy value propagation. Advances in neural information processing systems, 36:77969–77992, 2023.
[19] H. Cai, Z. Peng, and B. Zhou. Predictive preference learning from human interventions. In Advances in Neural Information Processing Systems, 2025.
[20] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
[21] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
[22] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2305.18290.
[23] S. Thrun. Probabilistic robotics. Communications of the ACM, 45(3):52–57, 2002.
[24] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.
[25] X. Liu, J. Li, Y. Jiang, N. Sujay, Z. Yang, J. Zhang, J. Abanes, J. Zhang, and C. Feng. Citywalker: Learning embodied urban navigation from web-scale videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6875–6885, 2025.
[26] Z. Chen, Y. Guo, Z. Chu, M. Luo, Y. Shen, M. Sun, J. Hu, S. Xie, K. Yang, P. Shi, et al. Socialnav: Training human-inspired foundation model for socially-aware embodied navigation. arXiv preprint arXiv:2511.21135, 2025.
[27] H. He, Y. Ma, B. Squicciarini, W. Wu, and B. Zhou. Learning sidewalk autopilot from multi-scale imitation with corrective behavior expansion. arXiv preprint arXiv:2603.22527, 2026.
[28] W. Cai, J. Peng, Y. Yang, Y. Zhang, M. Wei, H. Wang, Y. Chen, T. Wang, and J. Pang. Navdp: Learning sim-to-real navigation diffusion policy with privileged information guidance. arXiv preprint arXiv:2505.08712, 2025.
[29] A.-C. Cheng, Y. Ji, Z. Yang, Z. Gongye, X. Zou, J. Kautz, E. Bıyık, H. Yin, S. Liu, and X. Wang. Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453, 2024.
[30] M. Wei, C. Wan, J. Peng, X. Yu, Y. Yang, D. Feng, W. Cai, C. Zhu, T. Wang, J. Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025.
[31] A. Bar, G. Zhou, D. Tran, T. Darrell, and Y. LeCun. Navigation world models. arXiv preprint arXiv:2412.03572, 2024.
[32] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 35(11):1289–1307, 2016.
[33] J. Ho and S. Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.
[34] B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
[35] G. Seneviratne, J. An, S. Ellahy, K. Weerakoon, M. B. Elnoor, J. D. Kannan, A. T. Sunil, and D. Manocha. Halo: Human preference aligned offline reward learning for robot navigation. arXiv preprint arXiv:2508.01539, 2025.
[36] H. Cai, Z. Peng, and B. Zhou. Robot-gated interactive imitation learning with adaptive intervention mechanism. In International Conference on Machine Learning, 2025.
[37] D. Sadigh, A. D. Dragan, S. S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.
[38] J. Choi, C. Dance, J.-e. Kim, K.-s. Park, J. Han, J. Seo, and M. Kim. Fast adaptation of deep reinforcement learning-based navigation skills to human preference. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3363–3370. IEEE, 2020.
[39] W. B. Knox, S. Hatgis-Kessell, S. Booth, S. Niekum, P. Stone, and A. Allievi. Models of human preference for learning reward functions. arXiv preprint arXiv:2206.02231, 2022.
[40] R. Wang, W. Wang, and B.-C. Min. Feedback-efficient active preference learning for socially aware robot navigation. In 2022 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 11336–11343. IEEE, 2022.
[41] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[42] NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Y... GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv, 2025.
[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[44] P. K. A. Vasu, J. Gabriel, J. Zhu, O. Tuzel, and A. Ranjan. Fastvit: A fast hybrid vision transformer using structural reparameterization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5785–5795, 2023.
[45] J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.
[46] J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, M. Xia, X. Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.
[47] B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y. Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025.
[48] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[49] A. Zhang, C. Eranki, C. Zhang, J.-H. Park, R. Hong, P. Kalyani, L. Kalyanaraman, A. Gamare, A. Bagad, M. Esteva, et al. Toward robust robot 3-d perception in urban environments: The ut campus object dataset. IEEE Transactions on Robotics, 40:3322–3340, 2024.
[50] T. Akhtyamov, M. A. Mdfaa, J. A. R. Benavides, A. Nigmatzyanov, S. Bakulin, G. Devchich, D. Fatykhov, D. R. Salinas, A. Mazurov, K. Zipa, et al. Egowalk: A multimodal dataset for robot navigation in the wild. arXiv preprint arXiv:2505.21282, 2025.
[51] J. Liang, D. Das, D. Song, M. N. H. Shuvo, M. Durrani, K. Taranath, I. Penskiy, D. Manocha, and X. Xiao. Gnd: Global navigation dataset with multi-modal perception and multi-category traversability in outdoor campus environments. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2383–2390. IEEE, 2025.
[52] N. Hirose, D. Shah, A. Sridhar, and S. Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56, 2023.
[53] D. M. Nguyen, M. Nazeri, A. Payandeh, A. Datar, and X. Xiao. Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In 2023 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 7442–
[54] D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024.
[55] NVIDIA Corporation. Physicalai autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025. Hugging Face dataset. Accessed: 2026-06-10.
[56] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[57] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.
[58] K. Menda, K. Driggs-Campbell, and M. J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5041–5048. IEEE, 2019.
[59] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.
[60]

W. Wu, H. He, C. Zhang, J. He, S. Z. Zhao, R. Gong, Q. Li, and B. Zhou. Towards autonomous micromobility through scalable urban simulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27553–27563, 2025

2025
[61]

Ettinger, S

S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y . Chai, B. Sapp, C. R. Qi, Y . Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021

2021
[62]

Zhang, H

A. Zhang, H. Sikchi, A. Zhang, and J. Biswas. Creste: Scalable mapless navigation with internet scale priors and counterfactual guidance. InRobotics: Science and Systems (RSS), 2025

2025
[63]

Van Den Berg, S

J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha. Reciprocal n-body collision avoidance. InRobotics research: the 14th international symposium ISRR, pages 3–19. Springer, 2011. 13 Appendix We organize the appendix as follows. We first introduce the demonstration video in Sec. A, followed by additional real-world results in Sec. B. We then provide additi...

2011
[64]

FlowPilot capabilities demonstrationFlowPilot demonstrates robust navigation in complex real-world sidewalk environments. It successfully negotiates narrow passages, cluttered layouts, and broken curbs while maintaining safe and socially compliant behaviors, including effective obstacle avoidance and pedestrian awareness
[65]

Long-horizon sidewalk navigationFlowPilot completes GPS-guided long-horizon navigation with only a few human interventions. It maintains stable sidewalk lane keeping and consistent goal progress over time, while remaining robust to lighting changes and transient disturbances in challenging sidewalk environments
[66]

The comparisons highlight the advan- tages of FlowPilot in trajectory smoothness, navigation stability, and safety

Comparison with SOTA methodsWe present side-by-side real-world evaluations against rep- resentative state-of-the-art methods under the same setting. The comparisons highlight the advan- tages of FlowPilot in trajectory smoothness, navigation stability, and safety
[67]

Cross-embodiment generalityFlowPilot transfers effectively across different robot platforms both without finetuning and with only a few embodiment-specific examples. It preserves reasonable navigation behaviors under changes in platform dynamics and sensing configurations, demonstrating strong generalization and rapid adaptation across embodiments. B Real...

[1] V. Engesser, E. Rombaut, L. Vanhaverbeke, and P. Lebeau. Autonomous delivery solutions for last-mile logistics operations: A literature review and research agenda. Sustainability, 15(3):2774, 2023.

[2] X. Liu, L. Zhang, and T. Zhu. Service robots in my workplace: Effects of employee-service robot co-work experiences on psychological empowerment. Journal of Hospitality Marketing & Management, 34(2):175–203, 2025.

[3] A. Tuomi, I. P. Tussyadiah, and J. Stienmetz. Applications and implications of service robots in hospitality. Cornell Hospitality Quarterly, 62(2):232–247, 2021.

[4] E. Arntz, J. Van Duin, A. Van Binsbergen, L. Tavasszy, and T. Klein. Assessment of readiness of a traffic environment for autonomous delivery robots. Frontiers in Future Transportation, 4:1102302, 2023.

[5] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine. Gnm: A general navigation model to drive any robot. arXiv preprint arXiv:2210.03370, 2022.

[6] D. Shah, A. Sridhar, N. Dashora, K. Stachowicz, K. Black, N. Hirose, and S. Levine. Vint: A foundation model for visual navigation. arXiv preprint arXiv:2306.14846, 2023.

[7] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023.

[8] A. Sridhar, D. Shah, C. Glossop, and S. Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 63–70. IEEE, 2024.

[9] X. Liu, J. Li, Y. Jiang, N. Sujay, Z. Yang, J. Zhang, J. Abanes, J. Zhang, and C. Feng. Citywalker: Learning embodied urban navigation from web-scale videos. arXiv preprint arXiv:2411.17820, 2024.

[10] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.

[11] N. Hirose, L. Ignatova, K. Stachowicz, C. Glossop, S. Levine, and D. Shah. Learning to drive anywhere with model-based reannotation. IEEE Robotics and Automation Letters, 11(2):1242–1249, 2025.

[12] H. Karnan, A. Nair, X. Xiao, G. Warnell, S. Pirk, A. Toshev, J. Hart, J. Biswas, and P. Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.

[13] X. Pan, T. Zhang, B. Ichter, A. Faust, J. Tan, and S. Ha. Zero-shot imitation learning from demonstrations for legged robot visual navigation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 679–685. IEEE, 2020.

[14] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[15] H. He, Y. Ma, W. Wu, and B. Zhou. From seeing to experiencing: Scaling navigation foundation models with reinforcement learning. arXiv preprint arXiv:2507.22028, 2025.

[16] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.

[17] C. Celemin, R. Pérez-Dattari, E. Chisari, G. Franzese, L. de Souza Rosa, R. Prakash, Z. Ajanović, M. Ferraz, A. Valada, J. Kober, et al. Interactive imitation learning in robotics: A survey. Foundations and Trends® in Robotics, 10(1-2):1–197, 2022.

[18] Z. M. Peng, W. Mo, C. Duan, Q. Li, and B. Zhou. Learning from active human involvement through proxy value propagation. Advances in neural information processing systems, 36:77969–77992, 2023.

[19] H. Cai, Z. Peng, and B. Zhou. Predictive preference learning from human interventions. In Advances in Neural Information Processing Systems, 2025.

[20] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

[21] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.

[22] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2305.18290

[23] S. Thrun. Probabilistic robotics. Communications of the ACM, 45(3):52–57, 2002.

[24] D. Shah, A. Sridhar, A. Bhorkar, N. Hirose, and S. Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.

[25] X. Liu, J. Li, Y. Jiang, N. Sujay, Z. Yang, J. Zhang, J. Abanes, J. Zhang, and C. Feng. Citywalker: Learning embodied urban navigation from web-scale videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6875–6885, 2025.

[26] Z. Chen, Y. Guo, Z. Chu, M. Luo, Y. Shen, M. Sun, J. Hu, S. Xie, K. Yang, P. Shi, et al. Socialnav: Training human-inspired foundation model for socially-aware embodied navigation. arXiv preprint arXiv:2511.21135, 2025.

[27] H. He, Y. Ma, B. Squicciarini, W. Wu, and B. Zhou. Learning sidewalk autopilot from multi-scale imitation with corrective behavior expansion. arXiv preprint arXiv:2603.22527, 2026.

[28] W. Cai, J. Peng, Y. Yang, Y. Zhang, M. Wei, H. Wang, Y. Chen, T. Wang, and J. Pang. Navdp: Learning sim-to-real navigation diffusion policy with privileged information guidance. arXiv preprint arXiv:2505.08712, 2025.

[29] A.-C. Cheng, Y. Ji, Z. Yang, Z. Gongye, X. Zou, J. Kautz, E. Bıyık, H. Yin, S. Liu, and X. Wang. Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453, 2024.

[30] M. Wei, C. Wan, J. Peng, X. Yu, Y. Yang, D. Feng, W. Cai, C. Zhu, T. Wang, J. Pang, et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186, 2025.

[31] A. Bar, G. Zhou, D. Tran, T. Darrell, and Y. LeCun. Navigation world models. arXiv preprint arXiv:2412.03572, 2024.

[32] H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard. Socially compliant mobile robot navigation via inverse reinforcement learning. The International Journal of Robotics Research, 35(11):1289–1307, 2016.

[33] J. Ho and S. Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016.

[34] B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.

[35] G. Seneviratne, J. An, S. Ellahy, K. Weerakoon, M. B. Elnoor, J. D. Kannan, A. T. Sunil, and D. Manocha. Halo: Human preference aligned offline reward learning for robot navigation. arXiv preprint arXiv:2508.01539, 2025.

[36] H. Cai, Z. Peng, and B. Zhou. Robot-gated interactive imitation learning with adaptive intervention mechanism. In International Conference on Machine Learning, 2025.

[37] D. Sadigh, A. D. Dragan, S. S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems, 2017.

[38] J. Choi, C. Dance, J.-e. Kim, K.-s. Park, J. Han, J. Seo, and M. Kim. Fast adaptation of deep reinforcement learning-based navigation skills to human preference. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3363–3370. IEEE, 2020.

[39] W. B. Knox, S. Hatgis-Kessell, S. Booth, S. Niekum, P. Stone, and A. Allievi. Models of human preference for learning reward functions. arXiv preprint arXiv:2206.02231, 2022.

[40] R. Wang, W. Wang, and B.-C. Min. Feedback-efficient active preference learning for socially aware robot navigation. In 2022 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 11336–11343. IEEE, 2022.

[41] X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

[42] NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Y... GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv, 2025.

[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[44] P. K. A. Vasu, J. Gabriel, J. Zhu, O. Tuzel, and A. Ranjan. Fastvit: A fast hybrid vision transformer using structural reparameterization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5785–5795, 2023.

[45] J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470, 2025.

[46] J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, M. Xia, X. Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.

[47] B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y. Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025.

[48] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.

[49] A. Zhang, C. Eranki, C. Zhang, J.-H. Park, R. Hong, P. Kalyani, L. Kalyanaraman, A. Gamare, A. Bagad, M. Esteva, et al. Toward robust robot 3-d perception in urban environments: The ut campus object dataset. IEEE Transactions on Robotics, 40:3322–3340, 2024.

[50] T. Akhtyamov, M. A. Mdfaa, J. A. R. Benavides, A. Nigmatzyanov, S. Bakulin, G. Devchich, D. Fatykhov, D. R. Salinas, A. Mazurov, K. Zipa, et al. Egowalk: A multimodal dataset for robot navigation in the wild. arXiv preprint arXiv:2505.21282, 2025.

[51] J. Liang, D. Das, D. Song, M. N. H. Shuvo, M. Durrani, K. Taranath, I. Penskiy, D. Manocha, and X. Xiao. Gnd: Global navigation dataset with multi-modal perception and multi-category traversability in outdoor campus environments. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 2383–2390. IEEE, 2025.

[52] N. Hirose, D. Shah, A. Sridhar, and S. Levine. Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters, 9(1):49–56, 2023.

[53] D. M. Nguyen, M. Nazeri, A. Payandeh, A. Datar, and X. Xiao. Toward human-like social robot navigation: A large-scale, multi-modal, social human navigation dataset. In 2023 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 7442–

[54] D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024.

[55] NVIDIA Corporation. Physicalai autonomous vehicles dataset. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025. Hugging Face dataset. Accessed: 2026-06-10.

[56] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[57] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. Hg-dagger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.

[58] K. Menda, K. Driggs-Campbell, and M. J. Kochenderfer. Ensembledagger: A bayesian approach to safe imitation learning. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5041–5048. IEEE, 2019.

[59] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning. arXiv preprint arXiv:2511.04831, 2025.

[60] W. Wu, H. He, C. Zhang, J. He, S. Z. Zhao, R. Gong, Q. Li, and B. Zhou. Towards autonomous micromobility through scalable urban simulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27553–27563, 2025.

[61] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021.

[62] A. Zhang, H. Sikchi, A. Zhang, and J. Biswas. Creste: Scalable mapless navigation with internet scale priors and counterfactual guidance. In Robotics: Science and Systems (RSS), 2025.

[63] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha. Reciprocal n-body collision avoidance. In Robotics research: the 14th international symposium ISRR, pages 3–19. Springer, 2011.

Appendix

We organize the appendix as follows. We first introduce the demonstration video in Sec. A, followed by additional real-world results in Sec. B. We then provide additi...

FlowPilot capabilities demonstration. FlowPilot demonstrates robust navigation in complex real-world sidewalk environments. It successfully negotiates narrow passages, cluttered layouts, and broken curbs while maintaining safe and socially compliant behaviors, including effective obstacle avoidance and pedestrian awareness.

Long-horizon sidewalk navigation. FlowPilot completes GPS-guided long-horizon navigation with only a few human interventions. It maintains stable sidewalk lane keeping and consistent goal progress over time, while remaining robust to lighting changes and transient disturbances in challenging sidewalk environments.

Comparison with SOTA methods. We present side-by-side real-world evaluations against representative state-of-the-art methods under the same setting. The comparisons highlight the advantages of FlowPilot in trajectory smoothness, navigation stability, and safety.

Cross-embodiment generality. FlowPilot transfers effectively across different robot platforms both without finetuning and with only a few embodiment-specific examples. It preserves reasonable navigation behaviors under changes in platform dynamics and sensing configurations, demonstrating strong generalization and rapid adaptation across embodiments.

B Real...