GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories
Pith reviewed 2026-05-08 11:23 UTC · model grok-4.3
The pith
Neural network policies trained on optimal trajectories reach goals with high success rates while running thousands of times faster than the solvers that generated their training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By generating datasets of optimal trajectories through trajectory optimization and augmenting them by designating intermediate states as goals, the approach trains compact goal-conditioned neural network policies. These policies control dynamical systems toward arbitrary goals, delivering high success rates and near-optimal control inputs across tasks including cart-pole stabilization, planar and 3D quadcopter stabilization, and 6-DoF robot arm point reaching. The resulting controllers contain fewer than 80,000 parameters and execute up to more than 6,000 times faster than the trajectory optimization solver used to create the training data.
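To make the parameter budget concrete, here is a minimal sketch of a sub-80k-parameter goal-conditioned policy in Equinox (the JAX library cited in the paper's reference list); the dimensions, width, and depth are illustrative assumptions, not the paper's reported architecture.

```python
import jax
import equinox as eqx

# Illustrative dimensions for a planar-quadcopter-like task;
# the paper's actual architecture and sizes may differ.
STATE_DIM, GOAL_DIM, ACTION_DIM = 6, 6, 2

# A goal-conditioned policy is a network over the concatenation [state; goal].
policy = eqx.nn.MLP(
    in_size=STATE_DIM + GOAL_DIM,
    out_size=ACTION_DIM,
    width_size=128,
    depth=3,
    key=jax.random.PRNGKey(0),
)

# Count trainable parameters to check the sub-80k budget.
n_params = sum(
    p.size for p in jax.tree_util.tree_leaves(eqx.filter(policy, eqx.is_array))
)
print(n_params)  # roughly 35k at this width/depth, comfortably under 80,000
```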
What carries the argument
Efficient trajectory optimization for creating optimal demonstration datasets combined with data augmentation that designates intermediate states as additional goals for imitation learning.
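The augmentation idea is simple enough to sketch. For additive cost functionals, a cut-and-paste argument shows that the prefix of an optimal trajectory up to any intermediate state is itself an optimal demonstration for reaching that state, so each prefix can be relabeled as a new training example. The function below is an illustrative reconstruction, not the paper's exact scheme:

```python
import numpy as np

def augment_with_intermediate_goals(states, actions, stride=10):
    """Relabel intermediate states along one optimal trajectory as goals.

    `states` has shape (T + 1, state_dim) and `actions` (T, action_dim).
    Every `stride`-th intermediate state becomes the goal for all
    (state, action) pairs on the prefix leading to it.
    """
    examples = []
    T = len(actions)
    for g in range(stride, T + 1, stride):
        goal = states[g]          # an intermediate state, treated as a goal
        for t in range(g):        # the prefix that optimally reaches it
            examples.append((states[t], goal, actions[t]))
    return examples
```

A trajectory with T steps then yields on the order of T²/(2·stride) training tuples rather than T, which is how a batch of optimal solves can be stretched into an order-of-magnitude-larger dataset.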
Load-bearing premise
The trajectories produced by the optimization method are both optimal and sufficiently varied to allow the augmented imitation learning to produce policies that work for any goal in the state space.
What would settle it
Evaluating the trained policies on a broad set of randomly sampled goals outside the distribution of the generated trajectories and observing a sharp drop in success rates or large deviations from optimal control profiles would falsify the central claim.
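That test is concrete enough to write down. A minimal evaluation loop, where `policy`, `env`, and `sample_goal` are hypothetical stand-ins for the interfaces in the released code:

```python
import numpy as np

def held_out_success_rate(policy, env, sample_goal, n_goals=500,
                          horizon=1000, tol=1e-2, seed=0):
    """Success rate on goals sampled independently of the training data.

    Assumed interfaces: `policy(state, goal) -> action`,
    `env.reset(rng) -> state`, `env.step(state, action) -> state`;
    `tol` is a task-specific distance threshold.
    """
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(n_goals):
        goal = sample_goal(rng)
        state = env.reset(rng)
        for _ in range(horizon):
            state = env.step(state, policy(state, goal))
            if np.linalg.norm(state - goal) < tol:
                successes += 1
                break
    return successes / n_goals
```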
Original abstract
Imitation learning is a well-established approach for machine-learning-based control. However, its applicability depends on having access to demonstrations, which are often expensive to collect and/or suboptimal for solving the task. In this work, we present GCImOpt, an approach to learn efficient goal-conditioned policies by training on datasets generated by trajectory optimization. Our approach for dataset generation is computationally efficient, can generate thousands of optimal trajectories in minutes on a laptop computer, and produces high-quality demonstrations. Further, by means of a data augmentation scheme that treats intermediate states as goals, we are able to increase the training dataset size by an order of magnitude. Using our generated datasets, we train goal-conditioned neural network policies that can control the system towards arbitrary goals. To demonstrate the generality of our approach, we generate datasets and then train policies for various control tasks, namely cart-pole stabilization, planar and three-dimensional quadcopter stabilization, and point reaching using a 6-DoF robot arm. We show that our trained policies can achieve high success rates and near-optimal control profiles, all while being small (less than 80,000 neural network parameters) and fast enough (up to more than 6,000 times faster than a trajectory optimization solver) that they could be deployed onboard resource-constrained controllers. We provide videos, code, datasets and pre-trained policies under a free software license; see our project website https://jongoiko.github.io/gcimopt/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GCImOpt, a method to learn goal-conditioned neural network policies by imitating datasets of trajectories generated via trajectory optimization. Datasets are produced efficiently for cart-pole stabilization, planar/3D quadcopter stabilization, and 6-DoF robot arm point reaching; a data-augmentation step treats every intermediate state along each trajectory as a new goal, increasing dataset size by an order of magnitude. The resulting policies (under 80k parameters) are claimed to achieve high success rates and near-optimal control profiles for arbitrary goals while running up to 6,000 times faster than the underlying trajectory optimizer, enabling onboard deployment on resource-constrained hardware. Code, datasets, and pre-trained models are released openly.
Significance. If the empirical claims hold under rigorous validation, the work provides a practical pipeline for converting offline trajectory optimization into lightweight, real-time goal-conditioned controllers. The combination of efficient dataset generation, order-of-magnitude augmentation, and demonstrated applicability across four distinct dynamical systems is potentially useful for robotics control. The open release of code, datasets, and pre-trained policies is a clear strength that supports reproducibility and follow-on work.
Major comments (2)
- [Experiments / Abstract] Experiments section (and abstract): the central claim that the trained policies achieve 'high success rates and near-optimal control profiles' for arbitrary unseen goals is not supported by quantitative metrics, baseline comparisons (e.g., against standard RL or other imitation methods), optimality-gap measurements, or held-out goal evaluation in the provided text. Without these, the generalization assertion remains only partially substantiated.
- [Dataset generation and augmentation] Dataset generation and augmentation (Section 3): the data-augmentation scheme that treats intermediate states as goals is load-bearing for the 'arbitrary goals' claim, yet no analysis of goal-space coverage, trajectory diversity, or verification that the optimizer returns globally (or near-globally) optimal trajectories (e.g., multi-start comparisons or cost lower bounds) is reported. If the generated trajectories are locally optimal or leave large regions of the joint state-goal space undersampled, the near-optimal performance for unseen goals does not follow.
Minor comments (2)
- [Implementation details] Ensure that all experimental details (network architectures, training hyperparameters, success-rate definitions, and timing measurements) are fully specified so that the reported speed-ups and parameter counts can be reproduced from the released code.
- [Figures] Figure captions and axis labels should explicitly state the units and definitions of 'success rate' and 'control profile' metrics to avoid ambiguity when comparing across the four tasks.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to provide stronger quantitative support and analysis where possible.
Point-by-point responses
-
Referee: [Experiments / Abstract] Experiments section (and abstract): the central claim that the trained policies achieve 'high success rates and near-optimal control profiles' for arbitrary unseen goals is not supported by quantitative metrics, baseline comparisons (e.g., against standard RL or other imitation methods), optimality-gap measurements, or held-out goal evaluation in the provided text. Without these, the generalization assertion remains only partially substantiated.
Authors: We agree that the current presentation would benefit from more explicit quantitative metrics and comparisons. In the revised manuscript, we will expand the experiments section to include success rates with statistics over multiple random seeds on held-out goal sets, direct comparisons to standard imitation learning baselines (e.g., behavioral cloning) and RL methods where applicable, and optimality-gap measurements (e.g., relative cost to the trajectory optimizer). This will more rigorously substantiate the claims of high success rates and near-optimal performance for arbitrary goals. revision: yes
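As a sketch of the promised optimality-gap metric (the helper below is illustrative, not part of the released code): the gap is the learned policy's rollout cost relative to the optimizer's cost on the same start-goal pair, evaluated under the same cost functional.

```python
import numpy as np

def relative_optimality_gap(policy_costs, solver_costs):
    """Per-episode relative gap between the learned policy's rollout
    cost and the trajectory optimizer's cost on the same (start, goal)
    pair. Both inputs are arrays of episode costs computed under the
    same cost functional.
    """
    policy_costs = np.asarray(policy_costs, dtype=float)
    solver_costs = np.asarray(solver_costs, dtype=float)
    return (policy_costs - solver_costs) / solver_costs

# e.g. report np.median(gaps) and np.quantile(gaps, 0.95) over held-out goals
```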
-
Referee: [Dataset generation and augmentation] Dataset generation and augmentation (Section 3): the data-augmentation scheme that treats intermediate states as goals is load-bearing for the 'arbitrary goals' claim, yet no analysis of goal-space coverage, trajectory diversity, or verification that the optimizer returns globally (or near-globally) optimal trajectories (e.g., multi-start comparisons or cost lower bounds) is reported. If the generated trajectories are locally optimal or leave large regions of the joint state-goal space undersampled, the near-optimal performance for unseen goals does not follow.
Authors: We acknowledge that additional analysis of the dataset would strengthen the paper. In the revision, we will add visualizations and quantitative measures of goal-space coverage and trajectory diversity (e.g., histograms of sampled states and goals). We will also include multi-start optimization results for a subset of trajectories to support near-optimality. Formal global optimality guarantees or tight cost lower bounds are difficult to obtain for these non-convex problems, but we will explicitly discuss this limitation and the reliance on the underlying solver's local optimality. revision: partial
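A sketch of the promised multi-start check, with `solve`, `problem`, and `random_initial_guess` as hypothetical stand-ins for the solver interface: if independent restarts converge to costs within a small relative spread, the dataset's trajectories are plausibly near-globally optimal.

```python
import numpy as np

def multistart_cost_spread(solve, problem, n_starts=16, seed=0):
    """Solve one boundary-value problem from several random initial
    guesses and report each local solution's relative gap to the best
    cost found. A small spread is evidence (not proof) of near-global
    optimality of the solver's solutions.
    """
    rng = np.random.default_rng(seed)
    costs = np.array([
        solve(problem, problem.random_initial_guess(rng)).cost
        for _ in range(n_starts)
    ])
    return (costs - costs.min()) / costs.min()
```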
Circularity Check
No circularity; standard data-generation + imitation pipeline
Full rationale
The paper describes generating demonstration trajectories via an external trajectory optimizer, augmenting the dataset by re-labeling intermediate states as goals, and training a goal-conditioned neural network by behavioral cloning. No equations, uniqueness theorems, or predictions reduce to fitted parameters or self-referential definitions by construction. All load-bearing steps rely on the optimizer's output and standard supervised learning, with performance claims supported by empirical evaluation rather than internal re-derivation.
Axiom & Free-Parameter Ledger
Free parameters (1)
- neural network parameter count
Axioms (1)
- Domain assumption: Trajectory optimization solvers can efficiently produce high-quality optimal trajectories for the listed control tasks.
Reference graph
Works this paper leans on
-
[1]
Survey of Numerical Methods for Trajectory Optimization,
John T. Betts. Survey of Numerical Methods for Trajectory Optimization. Journal of Guidance, Control, and Dynamics, 21(2):193–207, 1998. doi: 10.2514/2.4231. URL https://arc.aiaa.org/doi/10.2514/2.4231. See also: John T. Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. SIAM.
-
[2]
End-to-end driving via conditional imitation learning
Felipe Codevilla, Matthias Müller, Antonio López, Vladlen Koltun, and Alexey Dosovitskiy. End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 4693–4700. IEEE, 2018.
-
[3]
Learning to reach goals via iterated supervised learning
Dibya Ghosh, Abhishek Gupta, Ashwin Reddy, Justin Fu, Coline Devin, Benjamin Eysenbach, and Sergey Levine. Learning to reach goals via iterated supervised learning. arXiv preprint arXiv:1912.06088, 2020.
-
[4]
Robot Dynamics with URDF & CasADi
Lill Maria Gjerde Johannessen, Mathias Hauan Arbo, and Jan Tommy Gravdahl. Robot Dynamics with URDF & CasADi. In 2019 7th International Conference on Control, Mechatronics and Automation (ICCMA). IEEE, 2019. URL https://academica-e.unavarra.es/handle/2454/55206 (Public University of Navarre bachelor's theses repository).
-
[5]
PLATO: Policy learning using adaptive trajectory optimization
Gregory Kahn, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. PLATO: Policy learning using adaptive trajectory optimization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3342–3349. IEEE, 2017.
-
[6]
Equinox: neural networks in JAX via callable PyTrees and filtered transformations
Patrick Kidger and Cristian Garcia. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. arXiv preprint arXiv:2111.00254, 2021.
-
[7]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-
[8]
Goal-conditioned reinforcement learning: Problems and solutions
Minghuan Liu, Menghui Zhu, and Weinan Zhang. Goal-conditioned reinforcement learning: Problems and solutions. arXiv preprint arXiv:2201.08299, 2022.
-
[9]
Searching for Activation Functions
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941, 7(1):5, 2017.
-
[10]
Goal-conditioned imitation learning using score-based diffusion policies
Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.
-
[11]
GNM: A general navigation model to drive any robot
Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. GNM: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7226–7233. IEEE, 2023.
-
[12]
Learning dynamic-objective policies from a class of optimal trajectories
Christopher Iliffe Sprague, Dario Izzo, and Petter Ögren. Learning dynamic-objective policies from a class of optimal trajectories. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 597–602. IEEE, 2020.
-
[13]
Optimal real-time landing using deep networks
Carlos Sánchez-Sánchez, Dario Izzo, and Daniel Hennes. Optimal real-time landing using deep networks. In Proceedings of the Sixth International Conference on Astrodynamics Tools and Techniques (ICATT), volume 12, pages 2493–2537. ISSN 0731-5090, 1533-3884. doi: 10.2514/1.G002357. URL https://arc.aiaa.org/doi/10.2514/1.G002357.
-
[14]
FATROP: A fast constrained optimal control problem solver for robot trajectory optimization and control
Lander Vanroye, Ajay Sathya, Joris De Schutter, and Wilm Decré. FATROP: A fast constrained optimal control problem solver for robot trajectory optimization and control. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10036–10043. IEEE, 2023.
-
[15]
Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search
Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 528–535. IEEE, 2016.