pith. machine review for the scientific record.

arxiv: 2604.16986 · v1 · submitted 2026-04-18 · 💻 cs.PL

Recognition: unknown

Shift schema drift left: policy-aware compile-time contracts for typed JVM and Spark pipelines

Vittal Mirji

Pith reviewed 2026-05-10 06:29 UTC · model grok-4.3

classification 💻 cs.PL
keywords schema drift · compile-time contracts · Scala 3 · Spark · data pipelines · structural compatibility · typed JVM · policy-aware

The pith

A Scala 3 framework proves producer-to-contract structural compatibility at compile time for Spark pipelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework designed to detect schema drift earlier by using compile-time contracts in Scala 3. It establishes structural compatibility between producers and contracts under explicit policies before any code runs. The framework also derives Spark schemas directly from these contract types and performs additional runtime checks on DataFrames at the write boundary. This fills a gap between fully typed dataset approaches that require broad adoption and runtime-only enforcement systems. The work focuses on the mechanism itself, with broader productivity claims noted as future research.
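
The schema-derivation idea can be illustrated with a minimal Scala 3 sketch. This is not the paper's API: `DType`, `schemaOf`, and `OrderContract` are invented stand-ins, and a real implementation would target Spark's `StructType` and handle nullability, nesting, and collections.

```scala
import scala.compiletime.{constValue, erasedValue}
import scala.deriving.Mirror

// Toy stand-in for Spark's DataType/StructType (hypothetical, not Spark's API).
enum DType:
  case SLong, SDouble, SString
  case SStruct(fields: List[(String, DType)])

// Map a handful of Scala field types to the toy schema types.
inline def dtypeOf[T]: DType = inline erasedValue[T] match
  case _: Long   => DType.SLong
  case _: Double => DType.SDouble
  case _: String => DType.SString

// Walk the mirrored labels and element types of a case class in lockstep.
inline def fieldsOf[Ls <: Tuple, Ts <: Tuple]: List[(String, DType)] =
  inline erasedValue[Ls] match
    case _: EmptyTuple => Nil
    case _: (l *: ls) =>
      inline erasedValue[Ts] match
        case _: (t *: ts) =>
          (constValue[l].toString -> dtypeOf[t]) :: fieldsOf[ls, ts]

// The same contract type a compile-time proof would inspect also
// determines the schema handed to the write path.
inline def schemaOf[T](using m: Mirror.ProductOf[T]): DType.SStruct =
  DType.SStruct(fieldsOf[m.MirroredElemLabels, m.MirroredElemTypes])

case class OrderContract(id: Long, amount: Double, currency: String)
```

The point of this shape is that the schema is computed from the contract type itself, so it cannot drift from the contract independently.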

Core claim

We present a small Scala 3 framework that occupies the seam between compile-time and runtime: it proves producer-to-contract structural compatibility under explicit policies at compile time, derives Spark schemas from the same contract types, and re-checks the actual DataFrame schema at the sink boundary before write.
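
The compile-time side of such a claim can be sketched with Scala 3 match types. This is an illustration, not the paper's encoding: `compatible` below only checks that the producer's field labels cover the contract's, whereas the paper's policies also compare field types, order, and nested optionality.

```scala
import scala.compiletime.constValue
import scala.compiletime.ops.boolean.&&
import scala.deriving.Mirror

// Does the tuple of labels contain L? Reduces to a Boolean at compile time.
type HasLabel[Labels <: Tuple, L] <: Boolean = Labels match
  case EmptyTuple => false
  case L *: _     => true
  case _ *: tail  => HasLabel[tail, L]

// Does Labels contain every label in Required?
type CoversAll[Labels <: Tuple, Required <: Tuple] <: Boolean = Required match
  case EmptyTuple => true
  case l *: rest  => HasLabel[Labels, l] && CoversAll[Labels, rest]

// Evaluate the type-level result as a constant; nothing is reflected at runtime.
inline def compatible[P, C](using p: Mirror.ProductOf[P], c: Mirror.ProductOf[C]): Boolean =
  constValue[CoversAll[p.MirroredElemLabels, c.MirroredElemLabels]]

case class ContractRow(id: Long, amount: Double)
case class ProducerRow(id: Long, amount: Double, debugTag: String)
```

Here `compatible[ProducerRow, ContractRow]` reduces to `true` at compile time, while the reverse direction reduces to `false` because `ContractRow` lacks `debugTag`; a real witness would turn the failing case into a compile error via `scala.compiletime.error` rather than a Boolean.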

What carries the argument

The fusion of a compile-time witness for policy-aware structural proofs with a runtime comparator that adds nested-collection-optionality checks and implements structural subset semantics.

If this is right

  • Compile-time proofs ensure compatibility without needing to run the job.
  • Schemas derived from contract types maintain consistency across the pipeline.
  • Runtime checks catch issues that Spark's standard comparators miss, such as nested collection optionality.
  • Structural subset semantics allow for backward and forward compatible changes to field sets.
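
The subset and optionality semantics above can be sketched as a small runtime comparator over a toy schema ADT (a hypothetical stand-in for Spark's `StructType`/`ArrayType`, so the example stays Spark-free): subset semantics let the actual schema carry extra fields, while the `containsNull` flag on arrays is compared explicitly.

```scala
// Hypothetical schema ADT; names are invented, not Spark's API.
sealed trait DT
case object DLong extends DT
case object DString extends DT
final case class DArray(elem: DT, containsNull: Boolean) extends DT
final case class DField(name: String, dtype: DT, nullable: Boolean)
final case class DStruct(fields: List[DField]) extends DT

// Structural subset: every contract field must exist in the actual schema,
// and the actual side may not be more nullable than the contract allows.
// Array containsNull is checked recursively — the "nested collection
// optionality" case that plain element-type equality would miss.
def subsetOk(actual: DT, contract: DT): Boolean = (actual, contract) match
  case (DStruct(af), DStruct(cf)) =>
    cf.forall { c =>
      af.exists(a => a.name == c.name && subsetOk(a.dtype, c.dtype) && (c.nullable || !a.nullable))
    }
  case (DArray(ae, aNull), DArray(ce, cNull)) =>
    subsetOk(ae, ce) && (cNull || !aNull)
  case (a, c) => a == c
```

The direction of the nullability check matters: an actual schema that is more nullable than the contract allows is rejected, while a stricter actual schema passes.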

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The contracts could enable safer evolution of data pipelines by catching mismatches at development time.
  • Similar techniques might apply to other typed data processing systems beyond Spark.
  • Adoption could reduce debugging time spent on schema-related failures in production.

Load-bearing premise

Explicit policies accurately represent the desired compatibility rules, and the compile-time proofs combined with the custom runtime comparator suffice to prevent the targeted schema-drift cases.

What would settle it

Observe a scenario where the framework approves a producer-contract pair at compile time under a given policy, but a schema drift still occurs during actual Spark job execution that the runtime check does not catch.

Figures

Figures reproduced from arXiv: 2604.16986 by Vittal Mirji.

Figure 1
Figure 1. The same contract type R drives both the compile-time proof and the runtime sink pin. Compile time rejects declared producer-to-contract drift in code; runtime re-checks the actual Spark schema before write, because external data boundaries can still violate the declared type path.
Figure 2
Figure 2. Policy family supported by the artifact. Exact-style policies differ on name and order sensitivity; subset-style policies make direction explicit. Nested collection optionality is checked across all policies except Full.
original abstract

Schema drift in data pipelines is often caught only when a job touches real data. Typed-Dataset layers close part of this gap but require wholesale adoption; table-level enforcement systems close another part but operate at write time against a stored schema. We present a small Scala 3 framework that occupies the seam: it proves producer-to-contract structural compatibility under explicit policies at compile time, derives Spark schemas from the same contract types, and re-checks the actual DataFrame schema at the sink boundary before write. The artifact fuses the compile-time witness with a policy-aware runtime comparator that adds a nested-collection-optionality check Spark's built-in comparators omit and implements structural subset semantics for backward- and forward-compatible field sets. Evaluation covers compile-time proofs, runtime policy tests, builder-path end-to-end tests, and reproducible benchmarks on two environments. This is a narrow, honest mechanism artifact; the broader claim that compile-time structural contracts deliver measurable productivity or reliability in practice is stated as motivation and left for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents a small Scala 3 framework that proves producer-to-contract structural compatibility under explicit policies at compile time, derives Spark schemas from the same contract types, and augments this with a policy-aware runtime comparator (handling nested-collection optionality and structural subset semantics) that re-checks DataFrame schemas before write. Evaluation covers compile-time proofs, runtime policy tests, builder-path end-to-end tests, and reproducible benchmarks on two environments; the author positions the work as a narrow mechanism artifact without claiming broad productivity gains.

Significance. If the type encoding is sound, the artifact provides a practical seam between compile-time contracts and Spark runtime enforcement, addressing schema drift in typed JVM pipelines without requiring wholesale adoption of typed datasets or purely post-hoc table-level checks. The reproducible benchmarks and explicit separation of compile-time witness from runtime backstop are concrete strengths that could inform similar policy-aware type systems in data-processing languages.

major comments (1)
  1. [mechanism description (post-abstract) and evaluation of compile-time proofs] The central claim that compile-time structural proofs prevent schema drift rests on the soundness of the Scala 3 type-level encoding of policies, structural subset, and nested-collection optionality (described in the mechanism section following the abstract). The manuscript supplies no formal model, no embedding in a proof assistant, and no exhaustive case analysis of the encoding; an off-by-one error in optional nested fields or forward-compatible sets would allow an incompatible producer to type-check while the runtime comparator serves only as a partial backstop. This directly undermines the advertised shift-left guarantee.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive review and for acknowledging the practical value of the compile-time/runtime seam in addressing schema drift. We address the major comment below.

point-by-point responses
  1. Referee: [mechanism description (post-abstract) and evaluation of compile-time proofs] The central claim that compile-time structural proofs prevent schema drift rests on the soundness of the Scala 3 type-level encoding of policies, structural subset, and nested-collection optionality (described in the mechanism section following the abstract). The manuscript supplies no formal model, no embedding in a proof assistant, and no exhaustive case analysis of the encoding; an off-by-one error in optional nested fields or forward-compatible sets would allow an incompatible producer to type-check while the runtime comparator serves only as a partial backstop. This directly undermines the advertised shift-left guarantee.

    Authors: We agree that the soundness of the type-level encoding is foundational and that the manuscript provides neither a formal model nor a proof-assistant embedding. The implementation relies on standard Scala 3 features (match types for structural subset checks, type classes for policy application, and implicit evidence for compile-time proofs). The evaluation section already exercises compile-time proofs and runtime policy tests for nested-collection optionality and structural subsets, and the runtime comparator is explicitly designed to catch discrepancies that the compile-time layer might miss. In the revised version we will expand the mechanism section with additional worked examples of the encoding (including optional nested fields and forward-compatible sets), insert a limitations subsection that states the absence of a formal proof and the complementary role of the runtime backstop, and augment the test suite description with further edge-case coverage. These changes will make the shift-left claim more precise while remaining consistent with the paper's framing as a narrow mechanism artifact.

    revision: partial

standing simulated objections not resolved
  • A formal soundness proof or embedding of the Scala 3 type encoding in a proof assistant, which would require a separate formal-methods contribution beyond the scope of this practical mechanism paper.

Circularity Check

0 steps flagged

No circularity; paper is a self-contained description of a software artifact with no derivations, fits, or self-referential reductions.

full rationale

The manuscript presents a Scala 3 framework for compile-time policy-aware contracts and runtime schema checks in JVM/Spark pipelines. It contains no equations, no parameter fitting, no predictions derived from prior outputs, and no load-bearing self-citations or uniqueness theorems. The central mechanism is an implementation of type-level structural subset checks and a custom comparator; these are described directly rather than derived from the paper's own inputs. Evaluation consists of concrete tests and benchmarks on the artifact itself. No step reduces to its inputs by construction, satisfying the default expectation that most papers are non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the existence and soundness of Scala 3's type-level programming facilities for structural proofs and on the correctness of the custom runtime comparator implementation. No free parameters are fitted to data; the one invented entity is the framework artifact itself, not a new physical or mathematical construct.

axioms (2)
  • standard math: Scala 3's type system can express and prove structural compatibility under user-defined policies at compile time.
    Invoked when the framework is said to 'prove producer-to-contract structural compatibility under explicit policies at compile time'.
  • domain assumption: The custom runtime comparator correctly implements the intended structural-subset and nested-collection-optionality semantics.
    Required for the sink-boundary re-check to be reliable; stated as part of the artifact's design.
invented entities (1)
  • policy-aware compile-time contract framework (no independent evidence)
    purpose: To occupy the seam between typed datasets and table-level enforcement by providing compile-time proofs plus runtime validation
    The paper presents this as a new small Scala 3 framework; no independent evidence outside the artifact itself is supplied.

pith-pipeline@v0.9.0 · 5467 in / 1604 out tokens · 36885 ms · 2026-05-10T06:29:17.017618+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 5 canonical work pages

  1. [1]

    Jonathan Aldrich, Joshua Sunshine, Darpan Saini, and Zachary Sparks

  2. [2]

    Typestate-oriented programming. In Proceedings of the 24th ACM SIGPLAN Conference Companion on Object Oriented Programming Systems Languages and Applications. ACM, 1015–1022. doi:10.1145/1639950.1640073

  3. [3]

    Apache Avro. 2025. Apache Avro Specification. https://avro.apache.org/docs/1.12.0/specification/ Accessed 2026-04-17

  4. [4]

    Apache Iceberg. 2020. Evolution. https://iceberg.apache.org/docs/latest/evolution/ Accessed 2026-04-17

  5. [5]

    Apache Spark. 2024. DataType Companion Object | Spark 3.5.6 Scala API. https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/types/DataType$.html Accessed 2026-04-17

  6. [6]

    Apache Spark. 2024. StructType | Spark 3.5.6 Scala API. https://spark.apache.org/docs/3.5.6/api/scala/org/apache/spark/sql/types/StructType.html Accessed 2026-04-17

  7. [7]

    Confluent. 2011. Schema evolution and compatibility for Schema Registry on Confluent Platform. https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html Accessed 2026-04-17

  8. [8]

    Delta Lake. 2020. Batch reads and writes. https://docs.delta.io/delta-batch/ Accessed 2026-04-17

  9. [9]

    Hans-Christian Estler, Carlo A. Furia, Martin Nordio, Marco Piccioni, and Bertrand Meyer. 2012. Contracts in Practice. arXiv preprint arXiv:1211.4775 (2012). https://arxiv.org/abs/1211.4775

  10. [10]

    Matthew Moy, Christos Dimoulas, and Matthias Felleisen. 2024. Effectful Software Contracts. Proceedings of the ACM on Programming Languages (2024). doi:10.1145/3632930

  11. [11]

    Matthew Moy and Matthias Felleisen. 2023. Trace Contracts. Journal of Functional Programming (2023). doi:10.1017/S0956796823000096

  12. [12]

    Mina Oraji, Gerhard Hück, and Christian Bischof. 2026. Dynamic Contract Analysis for Parallel Programming Models. arXiv preprint arXiv:2603.03023 (2026). https://arxiv.org/abs/2603.03023

  13. [13]

    Jonas Persson. 2024. Compile Time Resolved Contracts. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p3317r0.pdf WG21 P3317R0

  14. [14]

    Miles Sabin. 2014. shapeless. https://github.com/milessabin/shapeless Accessed 2026-04-17

  15. [15]

    Scala Documentation. 2026. Reflection | Macros in Scala 3. https://docs.scala-lang.org/scala3/guides/macros/reflection.html Accessed 2026-04-17

  16. [16]

    Scalalandio. 2019. Supported transformations | Chimney. https://chimney.readthedocs.io/en/stable/supported-transformations/ Accessed 2026-04-17

  17. [17]

    Typelevel. 2016. Frameless. https://typelevel.org/frameless/ Accessed 2026-04-17

  18. [18]

    VirtusLab. 2023. iskra. https://github.com/VirtusLab/iskra Accessed 2026-04-17