A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Yu Zhang , Dongjiang Zhuang , Qu Zhou , Zheng Huang , Junhe Wu , Jing Cao , Kai Chen


Pith reviewed 2026-05-15 03:18 UTC · model grok-4.3

classification 💻 cs.AI cs.IR

keywords: notes, classification, rules, six-digit, tariff, top-1, workflow, agentic


The pith

A deterministic six-stage agentic workflow for HS tariff classification reaches 64.2% top-1 accuracy at six digits on HSCodeComp and flags possible inconsistencies in some ground-truth labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

HS tariff classification requires assigning a precise code to any product based on international rules covering its material, shape, function, and essential character. These rules often conflict, so a correct answer must satisfy several constraints at once rather than just matching one feature. The authors argue that simply prompting a large language model end-to-end tends to resolve one constraint while violating others. Their solution uses a fixed sequence of six stages. Each stage handles one narrow aspect such as checking material composition or determining whether an item is a part or a whole. The model is only called inside these stages and must output structured answers that include verbatim quotes from the official notes. An offline knowledge base of Chinese HS rules supports the stages. On a benchmark called HSCodeComp the system reaches 64.2 percent top-1 accuracy at the six-digit level and 78.3 percent top-3. A manual review of disagreements suggests some benchmark labels may not fully follow the general interpretive rules. Because every decision cites its supporting rule, the output is easier to audit than a single free-form answer.
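The control-flow idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stage names, the `StageResult` fields, the `NOTES` store, and the stub stage functions are all hypothetical, and only two stages are shown in place of the paper's six. It shows a fixed stage order plus a local check that rejects any citation that is not a verbatim substring of a stored note.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StageResult:
    stage: str     # name of the stage that produced this output
    decision: str  # the narrow decision made by the stage
    quote: str     # text the stage claims to quote verbatim from a note


# Hypothetical stand-in for the offline knowledge base: note id -> note text.
NOTES: Dict[str, str] = {
    "note-A": "Example note text about the material of an article.",
    "note-B": "Example note text about parts of a machine.",
}

# A stage sees the product description and the trace of earlier stages.
Stage = Callable[[str, List[StageResult]], StageResult]


def quote_is_verbatim(quote: str, notes: Dict[str, str]) -> bool:
    """Accept a citation only if it is a non-empty substring of some note."""
    return bool(quote) and any(quote in text for text in notes.values())


def run_pipeline(
    description: str,
    stages: List[Tuple[str, Stage]],
    notes: Dict[str, str],
) -> List[StageResult]:
    """Run the stages in their fixed order; the model never picks the next step."""
    trace: List[StageResult] = []
    for name, stage in stages:
        result = stage(description, trace)
        if not quote_is_verbatim(result.quote, notes):
            raise ValueError(f"stage {name!r} cited text not found in any note")
        trace.append(result)
    return trace


# Stub stages (assumption): a real stage would wrap one narrow model call.
def material_stage(description: str, trace: List[StageResult]) -> StageResult:
    return StageResult("material", "decision-about-material",
                       "about the material of an article")


def part_or_whole_stage(description: str, trace: List[StageResult]) -> StageResult:
    return StageResult("part_or_whole", "decision-about-part-or-whole",
                       "about parts of a machine")


STAGES = [("material", material_stage), ("part_or_whole", part_or_whole_stage)]

trace = run_pipeline("example product description", STAGES, NOTES)
for r in trace:
    print(r.stage, "->", r.decision, "| cites:", r.quote)
```

Because the returned trace holds one structured record per stage, an auditor can read the decision path top to bottom, which is the property the summary attributes to the design.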

Core claim

On HSCodeComp, the workflow with Qwen3.6-plus reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model.
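The two kinds of numbers in this claim (top-k accuracy against benchmark labels, and top-1 agreement between two backbones) can be computed as in the sketch below. It assumes predictions are ranked lists of code strings and that a four-digit or six-digit match means equality of the leading digits; the paper's exact scoring script is not shown in this review, and the codes used here are placeholders, not real HS codes or benchmark data.

```python
from typing import List, Sequence


def digits_only(code: str) -> str:
    """Strip dots and other separators so '1111.11' becomes '111111'."""
    return "".join(ch for ch in code if ch.isdigit())


def topk_accuracy(ranked_preds: Sequence[Sequence[str]], gold: Sequence[str],
                  digits: int, k: int) -> float:
    """Share of items whose gold prefix appears among the first k predictions."""
    hits = 0
    for preds, g in zip(ranked_preds, gold):
        target = digits_only(g)[:digits]
        candidates = [digits_only(p)[:digits] for p in preds[:k]]
        hits += target in candidates
    return hits / len(gold)


def top1_agreement(preds_a: Sequence[Sequence[str]],
                   preds_b: Sequence[Sequence[str]], digits: int) -> float:
    """Share of items where two systems give the same top-ranked prefix."""
    same = sum(
        digits_only(a[0])[:digits] == digits_only(b[0])[:digits]
        for a, b in zip(preds_a, preds_b)
    )
    return same / len(preds_a)


# Placeholder data (assumption): four items, three ranked guesses each.
gold: List[str] = ["1111.11", "2222.22", "3333.33", "4444.44"]
ranked: List[List[str]] = [
    ["1111.11", "1111.19", "9999.99"],
    ["2222.29", "2222.22", "9999.99"],
    ["9999.99", "3333.33", "8888.88"],
    ["9999.99", "8888.88", "7777.77"],
]
other_model: List[List[str]] = [["1111.11"], ["2222.22"], ["9999.99"], ["1234.56"]]

for digits in (4, 6):
    print(digits, "digits:",
          "top-1", topk_accuracy(ranked, gold, digits, 1),
          "top-3", topk_accuracy(ranked, gold, digits, 3),
          "agreement", top1_agreement(ranked, other_model, digits))
```

Agreement is measured against another system's output rather than against the labels, so a high agreement figure does not by itself say how accurate the smaller model is.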

Load-bearing premise

That a fixed six-stage decomposition of multi-dimensional HS rule reasoning, supported by offline knowledge engineering, will correctly resolve all priority conflicts among material, function, essential character, and part-versus-whole axes without missing interactions that require dynamic reordering.

Original abstract

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable fixed six-stage workflow for HS tariff classification that reaches 64.2% top-1 at six digits on HSCodeComp and adds interpretability through structured outputs and rule citations, but the static sequence may still miss case-specific priority shifts among competing HS rules.


The main takeaway is that this deterministic agentic setup is positioned as an improvement over plain end-to-end prompting on the HSCodeComp benchmark while keeping every step traceable to specific chapter or section notes. The authors break the task into six fixed stages, use narrow LLM calls per stage, and back the outputs with verbatim rule quotes. That produces the reported numbers with Qwen3.6-plus: 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits. The open-weight 27B variant agrees with the frontier model's top-1 answer 84.2% of the time at four digits and 77.4% at six digits; these are agreement rates, not accuracy against the benchmark labels. The authors also ran a manual audit of 226 six-digit disagreements and released the records, which is useful for anyone who wants to check label quality themselves.

The offline knowledge engineering of the Chinese HS tariff plus the online pipeline is a clean separation that keeps the control flow predictable rather than letting the model plan its own steps. That design choice is the clearest difference from self-planning agents and explains the interpretability claim.

The soft spot is the reliance on a predetermined sequence to resolve all priority conflicts among material, function, essential character, and part-versus-whole rules. HS notes sometimes require reordering or simultaneous checks that a static pipeline cannot catch without backtracking. The paper does not show explicit tests for those edge cases or provide the exact stage prompts and decision logic, so it is hard to judge how often the fixed order produces locally correct but globally wrong results. The audit helps, but it does not directly measure missed reorderings.

This work is aimed at trade-compliance teams and regulatory AI groups who need auditable outputs more than raw accuracy. A reader working on applied agentic systems or customs automation will find the benchmark numbers and the released audit records worth looking at. The paper is coherent on its own terms and reports concrete results against an external dataset, so it deserves a serious referee rather than a desk reject. I would send it out for review with a request for more detail on the stage implementations and any handling of dynamic rule ordering.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a deterministic six-stage agentic workflow for HS tariff classification that decomposes multi-dimensional rule reasoning (material, function, essential character, part-versus-whole) into fixed stages supported by offline knowledge engineering of the Chinese HS tariff. It contrasts this with end-to-end LLM prompting, claims interpretability via structured outputs and verbatim rule citations, and reports 75.0% top-1 / 91.5% top-3 accuracy at four digits and 64.2% top-1 / 78.3% top-3 at six digits on HSCodeComp with Qwen3.6-plus. An open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode reaches 84.2% four-digit / 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements is included, with full adjudication records released in the appendix.

Significance. If the central performance claims hold under scrutiny, the work demonstrates a practical, interpretable alternative to black-box prompting for high-stakes regulatory classification tasks, with the release of adjudication records providing a concrete contribution to benchmark quality assessment in the domain. The deterministic control flow and stage-wise verifiability address a known failure mode of self-planning agents on priority-conflict problems.

major comments (2)

[Abstract and §3] Six-stage pipeline description: the claim that the fixed sequence correctly resolves all priority conflicts among material, function, essential character, and part-versus-whole axes is load-bearing for the reported accuracies, yet the manuscript provides no mechanism for dynamic reordering or backtracking; HS GIR 1–6 and chapter notes frequently require case-specific ordering that a static pipeline cannot guarantee without explicit conditional logic.
[§4 and abstract] Evaluation: the 64.2% top-1 six-digit accuracy and the interpretation of the 226-disagreement audit rest on the assumption that the workflow's decisions align with HS rules rather than benchmark noise, but no quantitative breakdown of deviation types, exact prompting templates, or stage-level implementation details is supplied, preventing verification that the pipeline actually performs the claimed multi-dimensional reasoning.

minor comments (2)

[Appendix] The appendix release of adjudication records is a strength for reproducibility; however, the paper should specify the exact criteria and inter-annotator process used in the two-stage manual audit to allow readers to assess consistency.
[§3] Notation for the six stages and their input/output schemas could be presented in a single table for clarity, as the current prose description makes it difficult to trace data flow across stages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where the manuscript will be revised to improve clarity and verifiability.


Referee: [Abstract and §3] Six-stage pipeline description: the claim that the fixed sequence correctly resolves all priority conflicts among material, function, essential character, and part-versus-whole axes is load-bearing for the reported accuracies, yet the manuscript provides no mechanism for dynamic reordering or backtracking; HS GIR 1–6 and chapter notes frequently require case-specific ordering that a static pipeline cannot guarantee without explicit conditional logic.

Authors: We agree that the manuscript would benefit from greater explicitness on this point. The six-stage pipeline is deliberately ordered to follow the standard priority hierarchy encoded in the HS GIR and chapter notes (specific before general, material before function where applicable, etc.), with local verification steps that check for conflicts at each stage. However, we acknowledge that the current text does not include pseudocode or concrete examples of how these verifications enforce ordering without backtracking. We will revise §3 to add a detailed flow diagram and stage-wise conditional logic descriptions showing how the fixed sequence covers the required cases.
Revision: yes

Referee: [§4 and abstract] Evaluation: the 64.2% top-1 six-digit accuracy and the interpretation of the 226-disagreement audit rest on the assumption that the workflow's decisions align with HS rules rather than benchmark noise, but no quantitative breakdown of deviation types, exact prompting templates, or stage-level implementation details is supplied, preventing verification that the pipeline actually performs the claimed multi-dimensional reasoning.

Authors: We accept this criticism. While the appendix already releases the full adjudication records for the 226 cases (including stage-wise citations), the main text lacks a quantitative breakdown of deviation categories and the exact stage prompts. We will add to §4 a table summarizing deviation types (e.g., material-priority conflicts, part-whole boundary issues, essential-character mismatches) derived from the audit, include the precise prompting templates for each of the six stages, and expand the implementation details to make the multi-dimensional reasoning verifiable.
Revision: yes
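A deviation-type table of the kind promised here could be tallied from adjudication records along the following lines. This is only a sketch: the record layout (`case_id`, `deviation`) is an assumption, the four records are invented placeholders rather than audit data, and the category labels are taken from the examples named in the response.

```python
from collections import Counter

# Placeholder records (assumption): one dict per adjudicated disagreement.
records = [
    {"case_id": "c1", "deviation": "material-priority conflict"},
    {"case_id": "c2", "deviation": "part-whole boundary issue"},
    {"case_id": "c3", "deviation": "material-priority conflict"},
    {"case_id": "c4", "deviation": "essential-character mismatch"},
]

# Count how many audited cases fall into each deviation category.
counts = Counter(r["deviation"] for r in records)
total = sum(counts.values())

# Print one row per category: label, count, share of all audited cases.
for label, n in counts.most_common():
    print(f"{label:<32}{n:>4}{n / total:>8.1%}")
```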

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that HS rules can be decomposed into a fixed sequence of narrow stages supported by offline knowledge engineering; no free parameters are fitted and no new entities are postulated.

axioms (2)

domain assumption: HS tariff classification rules can be decomposed into a fixed six-stage pipeline that resolves all priority conflicts among material, form, function, and part-versus-whole axes.
Invoked to justify the deterministic control flow over self-planning agents.
domain assumption: Offline knowledge engineering of the Chinese HS tariff produces a complete and accurate base for stage-wise reasoning.
Required for the online pipeline to cite correct chapter and section notes.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear

Paper passage: "the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms... the stage structure is dictated by the tariff itself (chapter, heading, subheading)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.