pith. sign in

arxiv: 2601.05383 · v4 · submitted 2026-01-08 · 💻 cs.LG · math.OC

Imitation Learning for Combinatorial Optimisation under Uncertainty

Pith reviewed 2026-05-16 15:53 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords imitation learningcombinatorial optimizationuncertaintyexpert taxonomyDAggerstochastic expertssequential decision problemspolicy learning
0
0 comments X

The pith

Policies learned from stochastic experts outperform those from deterministic experts in imitation learning for combinatorial optimization under uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a taxonomy to classify experts used in imitation learning for solving large combinatorial optimization problems that involve uncertainty. The classification considers how experts handle uncertainty, whether they aim for optimal or approximate solutions, and the way they interact with the learning algorithm, from single demonstrations to ongoing feedback. Based on the taxonomy, the authors develop a generalized version of the DAgger algorithm that supports querying multiple experts and aggregating their advice. In tests on assigning physicians to patients with random arrivals and limited capacity, policies trained using stochastic experts performed better than those using deterministic experts, and interactive methods required fewer expert examples to reach good performance.

Core claim

The paper establishes that for imitation learning in sequential decision problems modeling combinatorial optimization under uncertainty, a generalized DAgger framework using experts classified by uncertainty treatment, optimality, and interaction mode leads to policies that benefit from stochastic expert demonstrations and interactive learning regimes, as evidenced by superior performance on a dynamic physician-to-patient assignment instance compared to deterministic or full-information baselines.

What carries the argument

A taxonomy of expert constructions along three dimensions—treatment of uncertainty, level of optimality, and interaction mode—together with a generalized Dataset Aggregation (DAgger) framework accommodating multiple expert queries and flexible interaction strategies.

If this is right

  • Policies derived from stochastic experts achieve higher solution quality than those from deterministic experts across tested interaction modes.
  • Interactive learning schemes attain comparable or superior results with significantly fewer expert demonstrations than one-shot approaches.
  • When stochastic optimization is too expensive, aggregating multiple deterministic experts offers a viable alternative for training effective policies.
  • The taxonomy enables systematic comparison and selection of expert types for future applications in uncertain combinatorial settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicitly incorporating uncertainty into expert behavior during training may be crucial for learning policies that generalize to real stochastic environments beyond the tested case.
  • The generalized framework could be adapted to other sequential decision problems in operations research, such as dynamic scheduling or resource allocation.
  • Validating the taxonomy on a broader set of problems would strengthen claims about its comprehensiveness for classifying all expert constructions.

Load-bearing premise

That the performance advantages observed for stochastic and interactive experts in the physician-to-patient assignment problem will hold for other combinatorial optimization problems under uncertainty.

What would settle it

Demonstrating on a separate problem, such as stochastic vehicle routing or inventory management, that deterministic expert-trained policies match or exceed the quality of those trained on stochastic experts.

read the original abstract

Imitation learning (IL) provides a data-driven framework for approximating policies for large-scale combinatorial optimisation problems formulated as sequential decision problems (SDPs), where exact solution methods are computationally intractable. A central but underexplored aspect of IL in this context is the role of the \emph{expert} that generates training demonstrations. Existing studies employ a wide range of expert constructions, yet lack a unifying framework to characterise their modelling assumptions, computational properties, and impact on learning performance. This paper introduces a systematic taxonomy of experts for imitation learning in combinatorial optimisation under uncertainty. The literature is classified along three principal dimensions: (i) treatment of uncertainty; (ii) level of optimality, distinguishing task-optimal and approximate experts; and (iii) interaction mode with the learner, ranging from one-shot supervision to iterative, interactive schemes. We further identify additional categories capturing other relevant expert characteristics. Building on this taxonomy, we propose a generalised Dataset Aggregation (DAgger) framework that accommodates multiple expert queries, expert aggregation, and flexible interaction strategies. The proposed framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints. Computational experiments compare learning outcomes across expert types and interaction regimes. The results show that policies learned from stochastic experts consistently outperform those learned from deterministic or full-information experts, while interactive learning improves solution quality using fewer expert demonstrations. Aggregated deterministic experts provide an effective alternative when stochastic optimisation becomes computationally challenging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a taxonomy classifying experts for imitation learning in combinatorial optimization under uncertainty along three dimensions (treatment of uncertainty, level of optimality, interaction mode) plus additional categories. It proposes a generalized DAgger framework supporting multiple expert queries, aggregation, and flexible interactions. The framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints; results indicate that policies learned from stochastic experts outperform those from deterministic or full-information experts, interactive regimes improve quality with fewer demonstrations, and aggregated deterministic experts serve as a practical alternative when stochastic optimization is expensive.

Significance. If the performance ordering generalizes, the taxonomy and extended DAgger would supply a much-needed organizing framework for expert design in IL for sequential decision problems, directly addressing an underexplored aspect of the literature. The empirical demonstration that explicit stochastic modeling in experts yields better policies, together with the computational trade-off finding for aggregated experts, would be of immediate practical value to researchers applying IL to large-scale CO under uncertainty.

major comments (2)
  1. [Experiments] Experiments section: the claim that stochastic experts 'consistently outperform' deterministic and full-information experts rests exclusively on results for the single dynamic physician-to-patient assignment instance. No other CO problems under uncertainty (e.g., stochastic TSP, vehicle routing, or scheduling) are evaluated, so the reported ordering may be driven by the specific structure of assignment rather than being general. This is load-bearing for the paper's central empirical conclusion and for the asserted broad applicability of the taxonomy.
  2. [Section 3] Generalized DAgger framework (Section 3): while the extension to multiple experts, aggregation, and interactive strategies is described, the manuscript provides no convergence analysis, sample-complexity bound, or comparison to standard DAgger under the new interaction modes. Without such guarantees, the methodological contribution remains primarily descriptive.
minor comments (2)
  1. [Abstract] The abstract states that 'additional categories' are identified but does not enumerate them; a short explicit list would improve readability.
  2. [Preliminaries] Notation for the sequential decision process (SDP) formulation could be aligned more closely with standard references in stochastic programming to reduce cognitive load for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and positive assessment of the significance of the taxonomy and generalized DAgger framework. We address the two major comments below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that stochastic experts 'consistently outperform' deterministic and full-information experts rests exclusively on results for the single dynamic physician-to-patient assignment instance. No other CO problems under uncertainty (e.g., stochastic TSP, vehicle routing, or scheduling) are evaluated, so the reported ordering may be driven by the specific structure of assignment rather than being general. This is load-bearing for the paper's central empirical conclusion and for the asserted broad applicability of the taxonomy.

    Authors: We agree that the empirical results are confined to a single problem class and that this constrains the strength of the 'consistently outperform' claim. The dynamic physician assignment instance was selected because it is a realistic large-scale SDP with stochastic arrivals, capacity constraints, and online decisions, allowing clear demonstration of the taxonomy dimensions. We will revise the manuscript to (i) explicitly qualify the empirical findings as illustrative for this problem class, (ii) add a dedicated limitations paragraph discussing potential problem-specific effects, and (iii) outline how the taxonomy itself remains problem-agnostic. New experiments on additional domains (e.g., stochastic VRP) are beyond the scope of the current revision but will be noted as future work. revision: partial

  2. Referee: [Section 3] Generalized DAgger framework (Section 3): while the extension to multiple experts, aggregation, and interactive strategies is described, the manuscript provides no convergence analysis, sample-complexity bound, or comparison to standard DAgger under the new interaction modes. Without such guarantees, the methodological contribution remains primarily descriptive.

    Authors: The generalized DAgger is presented as a practical algorithmic extension that inherits the convergence properties of standard DAgger when the underlying no-regret learner and expert assumptions hold. We will add a short subsection in Section 3 that (i) recalls the standard DAgger regret bound, (ii) explains how multiple-expert aggregation and interactive querying preserve the same regret guarantees under the same assumptions, and (iii) explicitly states that no new sample-complexity analysis is derived. This will clarify the methodological status without overstating theoretical novelty. revision: partial

Circularity Check

0 steps flagged

No circularity: taxonomy and generalized DAgger are independently defined and empirically tested

full rationale

The paper defines a new taxonomy along three explicit dimensions (uncertainty treatment, optimality level, interaction mode) plus additional categories, then describes a generalized DAgger variant that adds multiple-expert queries and aggregation. These constructions are presented as extensions of prior IL literature rather than reductions to self-referential inputs. The central empirical claims (stochastic experts outperform deterministic/full-info ones; interactive regimes need fewer demonstrations) are obtained from computational experiments on the physician-to-patient assignment instance and do not reduce by construction to fitted parameters or self-citations. No self-definitional loops, fitted-input predictions, or ansatzes smuggled via self-citation appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard assumptions from imitation learning and combinatorial optimization literature plus the domain assumption that the proposed taxonomy dimensions are sufficient to organize the field.

axioms (1)
  • domain assumption The three principal dimensions plus additional categories capture the relevant characteristics of experts for imitation learning in combinatorial optimization under uncertainty.
    Invoked to justify the taxonomy construction and classification of literature.

pith-pipeline@v0.9.0 · 5554 in / 1199 out tokens · 31499 ms · 2026-05-16T15:53:11.093156+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.