When Is a Columnar Scan Bandwidth-Bound? A Decode-Throughput Law and Its Cross-Hardware Validation

Madhulatha Mandarapu; Sandeep Kunkunuru

arxiv: 2606.22423 · v1 · pith:F2Z4LWUVnew · submitted 2026-06-21 · 💻 cs.DB · cs.PF

When Is a Columnar Scan Bandwidth-Bound? A Decode-Throughput Law and Its Cross-Hardware Validation

Madhulatha Mandarapu , Sandeep Kunkunuru This is my paper

Pith reviewed 2026-06-26 09:47 UTC · model grok-4.3

classification 💻 cs.DB cs.PF

keywords columnar scandecode throughputbandwidth boundSIMD decodingpredicate evaluationzone maps

0 comments

The pith

A decoder's value throughput stays constant across bit widths, so the bandwidth fraction a columnar scan achieves follows a one-parameter law.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in columnar scans the rate at which values are decoded remains essentially the same no matter how many bits each value occupies. This independence produces a simple law for the fraction of available memory bandwidth the scan actually uses: the fraction rises linearly with bit width until the scan becomes fully bandwidth-bound. Fitting a single throughput value per decode strategy reproduces the observed bandwidth fractions across encodings, selectivities, and two different CPUs with low error. The law also identifies a crossover bit width that shifts correctly when the machine's bandwidth changes. Two additional crossovers are shown for predicate evaluation and zone-map skipping.

Core claim

A decoder's value throughput T_dec is essentially independent of bit-width b because it is set by the decode layout and strategy rather than the number of bits per value. Consequently the achieved bandwidth fraction obeys f = min(1, T_dec * b / (8*beta)) with the compute-to-bandwidth ridge at b* = 8*beta/T_dec. Fitting one T_dec per strategy reproduces measured fractions with median error 0.027 on x86/AVX2 and 0.003 on Apple M4/NEON.

What carries the argument

The decode-throughput law f = min(1, T_dec * b / (8*beta)) that converts a constant per-strategy decode rate into a predicted bandwidth fraction.

If this is right

Scans become bandwidth-bound only above a predictable bit-width ridge that depends on the machine's bandwidth beta.
The same fitted T_dec predicts bandwidth usage on a second machine once its bandwidth is inserted into the law.
Higher-T_dec hand-tuned decoders push the bandwidth-bound regime to narrower bit widths.
Branch-free predicate evaluation outperforms branchy code only inside a mid-selectivity band set by the sigma(1-sigma) misprediction curve.
Zone-map skipping helps according to data clustering rather than selectivity alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The law could guide automatic choice of encoding width to reach the bandwidth roofline on a given machine.
Once T_dec is high enough that the ridge falls below the narrowest practical bit width, further decode-speed improvements yield little bandwidth gain.
The independence of T_dec from b may fail on architectures where instruction throughput scales with bit width, providing a clear test on additional hardware.

Load-bearing premise

That a decoder's throughput measured on one set of bit widths remains constant when bit width changes.

What would settle it

Measure the value decode rate T_dec at bit widths from 1 to 32 bits under fixed encoding and selectivity and test whether the rates differ by more than a few percent.

Figures

Figures reproduced from arXiv: 2606.22423 by Madhulatha Mandarapu, Sandeep Kunkunuru.

**Figure 1.** Figure 1: Bandwidth fraction f vs bit-width b (solid = measured, dashed = law min(1, Tdecb/8β)). Left: x86/AVX2. Right: Apple M4/NEON (held-out). Scalar row-major is compute-bound at all real widths; the transposed and byte-aligned strategies cross into bandwidth-bound at b ⋆ = 8β/Tdec. their “decode is free” result. The same law that describes our portable decoder (Tdec ≈4–9, ridge at b≈20–67) predicts the SOTA reg… view at source ↗

**Figure 2.** Figure 2: Left: branchy select time is the σ(1 − σ) misprediction parabola; predicated is flat (x86). Right: on sorted/clustered data zone maps read a tiny fraction of bytes; on random data they do not (x86). 7 Related Work Vectorized, bandwidth-aware scans originate with MonetDB/X100 [Boncz et al., 2005]; Data Blocks [Lang et al., 2016], BtrBlocks [Kuschewski et al., 2023], and FastLanes [Afroozeh and Boncz, 2023] … view at source ↗

read the original abstract

A columnar scan that decompresses, filters, and aggregates should be limited only by memory bandwidth (the roofline floor T >= BytesRead/beta), yet real kernels are often compute-bound and leave bandwidth idle. We give a predictive answer to when a scan is bandwidth-bound. Across encodings, predicate selectivities, and two very different machines, a decoder's value throughput T_dec (values decoded per second) is essentially independent of bit-width b: it is set by the decode layout/strategy, not by how many bits each value occupies. Hence the achieved bandwidth fraction obeys a one-parameter law, f = min(1, T_dec * b / (8*beta)), with the compute-to-bandwidth ridge at b* = 8*beta/T_dec. Fitting one T_dec per strategy reproduces measured bandwidth fractions with median error 0.027 on x86/AVX2 and 0.003 on a held-out Apple M4/NEON machine, and the ridge b* shifts correctly with each machine's bandwidth. Inserting FastLanes' reported decode throughput into the law reproduces its "decode is free at three bits" headline as the large-T_dec limit, unifying our portable decoder and hand-tuned state of the art in one curve. We add two crossovers, validated on both machines: branch-free predicate evaluation beats branchy in a mid-selectivity band (the sigma(1-sigma) misprediction parabola), and zone-map skipping is clustering-gated rather than selectivity-gated. We release the micro-benchmark, the correctness oracle, and a one-command reproduction. This is a baseline and a model, not a faster kernel: our portable C decoders reach ~2 values/cycle, far below hand-tuned SOTA, and the law holds precisely because it is parameterized by the measured T_dec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a portable one-parameter law for when columnar decoders hit bandwidth limits, derived from T_dec independence and checked on two dissimilar machines with low error.

read the letter

The main point is that a decoder's value throughput T_dec stays roughly constant across bit widths for a given strategy, which produces the law f = min(1, T_dec * b / (8*beta)) and correctly predicts the ridge where the scan becomes bandwidth-bound.

What is new is the explicit formula plus its unification of the FastLanes result as the high-T_dec limit. The work also adds validated crossovers for branch-free predicates and zone-map skipping. The authors fit one T_dec per strategy on held-out data and reproduce measured bandwidth fractions with median errors of 0.027 on x86/AVX2 and 0.003 on Apple M4/NEON, and the ridge shifts as expected with each machine's beta.

The derivation is direct from the independence assumption, and the cross-hardware check supplies an external test rather than a tautology. Releasing the micro-benchmark, oracle, and one-command script strengthens the result.

A minor soft spot is that the independence claim is demonstrated mainly through the law's predictive accuracy rather than by first showing raw T_dec values plotted against b for each strategy. That leaves the assumption a step less transparent, though the low errors and correct ridge behavior still back it. The portable decoders are slower than hand-tuned SOTA, but the paper states upfront that its goal is the model, not new peak speed.

This is for readers who build or tune columnar storage, write decoders, or model database kernels. Anyone who needs a hardware-agnostic way to set bit widths or decide when a scan will be bandwidth-bound will find it useful. The evidence is solid enough and the claim falsifiable enough that it deserves a serious referee.

Referee Report

0 major / 1 minor

Summary. The manuscript derives a decode-throughput law stating that a decoder's value throughput T_dec (values/s) is independent of bit-width b and set by the decode layout/strategy. This yields the achieved bandwidth fraction f = min(1, T_dec * b / (8*beta)) with the compute-to-bandwidth ridge at b* = 8*beta/T_dec. Fitting one T_dec per strategy reproduces measured f across encodings, predicate selectivities, and two dissimilar machines (x86/AVX2 and held-out Apple M4/NEON) with median errors 0.027 and 0.003; the ridge shifts correctly with each machine's beta. The law unifies the portable decoder results with FastLanes' "decode is free at three bits" headline. Two additional validated crossovers are presented: branch-free predicate evaluation outperforms branchy in a mid-selectivity band, and zone-map skipping is clustering-gated. The micro-benchmark, correctness oracle, and one-command reproduction script are released.

Significance. If the central independence of T_dec from b holds, the result supplies a simple, portable, one-parameter-per-strategy predictive model for when columnar scans are bandwidth-bound rather than compute-bound. Credit is due for the direct cross-hardware validation on dissimilar ISAs, the low median errors, the unification with prior SOTA, and the public release of the benchmark and reproduction artifacts, all of which make the law falsifiable and reusable for database performance modeling.

minor comments (1)

[Abstract] Abstract: does not report error bars or uncertainty on the fitted T_dec values themselves, nor the precise number of encodings tested.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and positive review. The recommendation to accept is appreciated, and we are pleased that the cross-hardware validation, low median errors, unification with prior work, and artifact release were viewed favorably.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper posits that T_dec is independent of b (set by decode layout/strategy) and derives the one-parameter law f = min(1, T_dec * b / (8*beta)) from that premise. T_dec is obtained by fitting one value per strategy; the resulting model is then shown to reproduce measured f across bit-widths, selectivities, and a held-out machine with low median error. This constitutes an empirical validation of the independence assumption rather than a reduction of the law to a tautology or to quantities defined only by the same fit. No step equates a claimed prediction to its own inputs by construction, no self-citation chain is load-bearing, and the cross-machine results supply an external check. The derivation is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The model rests on one fitted parameter per decoder strategy (T_dec) and the domain assumption that T_dec does not depend on bit width; no new entities are postulated.

free parameters (1)

T_dec per decode strategy
One scalar throughput value fitted per decoder layout/strategy from measured data; used to predict f for all bit widths.

axioms (2)

domain assumption Decoder value throughput T_dec is independent of bit-width b and set only by decode layout/strategy
Invoked when deriving the law from the observation that throughput does not vary with b.
standard math Machine bandwidth beta is an independent, measurable constant for each platform
Used as the denominator in the ridge-point formula; treated as an input from hardware measurement.

pith-pipeline@v0.9.1-grok · 5878 in / 1672 out tokens · 30431 ms · 2026-06-26T09:47:57.612565+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references

[1]

The FastLanes compression layout: Decoding 100 billion integers per second with scalar code

Azim Afroozeh and Peter Boncz. The FastLanes compression layout: Decoding 100 billion integers per second with scalar code. Proceedings of the VLDB Endowment, 16 0 (9): 0 2132--2144, 2023

2023
[2]

Boncz, Marcin Zukowski, and Niels Nes

Peter A. Boncz, Marcin Zukowski, and Niels Nes. MonetDB/X100 : Hyper-pipelining query execution. In CIDR, pages 225--237, 2005

2005
[3]

Everything you always wanted to know about compiled and vectorized queries but were afraid to ask

Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. Proceedings of the VLDB Endowment, 11 0 (13): 0 2209--2222, 2018

2018
[4]

BtrBlocks : Efficient columnar compression for data lakes

Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. BtrBlocks : Efficient columnar compression for data lakes. Proceedings of the ACM on Management of Data, 1 0 (2): 0 1--26, 2023

2023
[5]

Boncz, Thomas Neumann, and Alfons Kemper

Harald Lang, Tobias M \"u hlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In SIGMOD, pages 311--326, 2016

2016
[6]

Boncz, and Martin L

Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. Generic database cost models for hierarchical memory systems. In VLDB, pages 191--202, 2002

2002
[7]

McCalpin

John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, pages 19--25, 1995

1995
[8]

Orestis Polychroniou and Kenneth A. Ross. Efficient lightweight compression alongside fast scans. In DaMoN, pages 1--6, 2015

2015
[9]

Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. Rethinking SIMD vectorization for in-memory databases. In SIGMOD, pages 1493--1508, 2015

2015
[10]

Roofline: an insightful visual performance model for multicore architectures

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. In Communications of the ACM, volume 52, pages 65--76. ACM, 2009

2009
[11]

An empirical evaluation of columnar storage formats

Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. An empirical evaluation of columnar storage formats. Proceedings of the VLDB Endowment, 17 0 (2): 0 148--161, 2023

2023

[1] [1]

The FastLanes compression layout: Decoding 100 billion integers per second with scalar code

Azim Afroozeh and Peter Boncz. The FastLanes compression layout: Decoding 100 billion integers per second with scalar code. Proceedings of the VLDB Endowment, 16 0 (9): 0 2132--2144, 2023

2023

[2] [2]

Boncz, Marcin Zukowski, and Niels Nes

Peter A. Boncz, Marcin Zukowski, and Niels Nes. MonetDB/X100 : Hyper-pipelining query execution. In CIDR, pages 225--237, 2005

2005

[3] [3]

Everything you always wanted to know about compiled and vectorized queries but were afraid to ask

Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. Everything you always wanted to know about compiled and vectorized queries but were afraid to ask. Proceedings of the VLDB Endowment, 11 0 (13): 0 2209--2222, 2018

2018

[4] [4]

BtrBlocks : Efficient columnar compression for data lakes

Maximilian Kuschewski, David Sauerwein, Adnan Alhomssi, and Viktor Leis. BtrBlocks : Efficient columnar compression for data lakes. Proceedings of the ACM on Management of Data, 1 0 (2): 0 1--26, 2023

2023

[5] [5]

Boncz, Thomas Neumann, and Alfons Kemper

Harald Lang, Tobias M \"u hlbauer, Florian Funke, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. Data blocks: Hybrid OLTP and OLAP on compressed storage using both vectorization and compilation. In SIGMOD, pages 311--326, 2016

2016

[6] [6]

Boncz, and Martin L

Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. Generic database cost models for hierarchical memory systems. In VLDB, pages 191--202, 2002

2002

[7] [7]

McCalpin

John D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE TCCA Newsletter, pages 19--25, 1995

1995

[8] [8]

Orestis Polychroniou and Kenneth A. Ross. Efficient lightweight compression alongside fast scans. In DaMoN, pages 1--6, 2015

2015

[9] [9]

Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. Rethinking SIMD vectorization for in-memory databases. In SIGMOD, pages 1493--1508, 2015

2015

[10] [10]

Roofline: an insightful visual performance model for multicore architectures

Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. In Communications of the ACM, volume 52, pages 65--76. ACM, 2009

2009

[11] [11]

An empirical evaluation of columnar storage formats

Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. An empirical evaluation of columnar storage formats. Proceedings of the VLDB Endowment, 17 0 (2): 0 148--161, 2023

2023