arxiv: 2605.01836 · v1 · submitted 2026-05-03 · 💻 cs.AR


PipeRTL: Timing-Aware Pipeline Optimization at IR-Level for RTL Generation

Bei Yu, Chen Bai, Fangzhou Liu, Lancheng Zou, Rongliang Fu, Shuo Yin, Tsung-Yi Ho, Wenqian Zhao, Yuan Xie


Pith reviewed 2026-05-09 16:38 UTC · model grok-4.3


keywords: pipeline optimization, IR-level, RTL generation, timing-aware, register relocation, min-cost flow, hardware compilers, synthesis flow


The pith

IR-level timing-aware register relocation via min-cost flow produces RTL that synthesizes to lower delay, power, and area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern hardware compilers insert or adjust registers late in the flow after lowering to netlist, at which point much of the original operator structure is no longer visible to global decisions. This paper shows that moving the legality of register relocation into the IR, approximating downstream delays with a learned predictor, and solving the placement problem as a global min-cost flow under timing constraints yields RTL that downstream commercial tools can implement more efficiently. A sympathetic reader would care because the approach keeps high-level structure available for pipeline choices instead of losing it to early lowering. If the claim holds, hardware compilers can directly influence the sequential skeleton presented to synthesis rather than hoping backend retiming recovers the best arrangement.

Core claim

PipeRTL makes register-move legality explicit in the compiler IR, employs a learned timing predictor to estimate downstream delays, and casts timing-constrained register relocation as a global min-cost flow problem; the resulting RTL, when passed through a commercial synthesis flow, improves average critical-path delay, power, and area while supplying a stronger initial structure for later retiming passes.

What carries the argument

Global min-cost flow formulation that encodes timing constraints on register relocation made legal inside the IR, guided by a learned predictor for downstream delay behavior.
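For orientation, the classical min-area retiming problem of Leiserson and Saxe has the shape sketched below. The abstract does not give PipeRTL's actual objective or constraints, so this is the textbook form only, shown as an assumption about the general structure. Here $w(e)$ is the number of registers on edge $e$, $r(v)$ is the integer retiming label of node $v$, $W(u,v)$ is the minimum register count over paths from $u$ to $v$, $D(u,v)$ is the largest delay among those minimum-register paths, and $T$ is the target clock period.

```latex
\begin{aligned}
\min_{r:\,V\to\mathbb{Z}}\quad
  & \sum_{v\in V}\bigl(|\mathrm{FI}(v)|-|\mathrm{FO}(v)|\bigr)\,r(v) \\
\text{s.t.}\quad
  & r(u)-r(v)\le w(e)      && \forall\, e=(u\to v)\in E \\
  & r(u)-r(v)\le W(u,v)-1  && \forall\, u,v \ \text{with}\ D(u,v)>T
\end{aligned}
```

The objective equals the total retimed register count $\sum_e w_r(e)$ with $w_r(e)=w(e)+r(v)-r(u)$, up to the constant $\sum_e w(e)$. Every constraint is a difference constraint, which is why the dual of this linear program is a min-cost flow problem. In a predictor-guided variant, the delays behind $D(u,v)$ would presumably come from the learned timing model rather than from netlist-level analysis; that mapping is an assumption here, not something the abstract states.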

If this is right

Critical-path delay decreases on average across the evaluated open-source designs.
Power and area are reduced in the final synthesized implementations.
The generated RTL supplies a stronger sequential structure for subsequent backend retiming.
Pipeline decisions become an explicit compiler pass rather than a deferred backend heuristic.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same IR-level formulation could be applied to other hardware compilers that expose similar operator structure before netlist lowering.
Predictor accuracy could be iteratively improved by feeding back actual post-synthesis delay measurements from the commercial tool.
The approach opens the possibility of adding further global timing-related optimizations at the same IR stage without waiting for netlist-level information.

Load-bearing premise

The learned timing predictor must sufficiently approximate the actual delay behavior that commercial backend tools will see once the design is lowered to netlist level.

What would settle it

Synthesizing the PipeRTL-generated RTL through the commercial flow and measuring no reduction (or an increase) in critical-path delay relative to the baseline would falsify the reported improvement.

Original abstract

Modern hardware compilers increasingly rely on rich intermediate representations (IRs) to preserve optimization-relevant semantics before generating RTL code. However, one important optimization is still largely deferred to backend tools: pipeline optimization. In common RTL flows, registers are inserted by frontend heuristics or hardware designers and later adjusted by backend retiming after the design has been lowered to a much lower-level netlist representation. At that point, much of the operator-level structure originally exposed by the compiler IR has already been weakened or lost, limiting opportunities for global, compiler-level pipeline optimization. This paper presents PipeRTL, an IR-level pipeline optimization framework for hardware compilers, instantiated in CIRCT. PipeRTL makes the legality of register relocation explicit in the IR, uses a learned timing predictor to approximate downstream delay behavior, and formulates timing-aware register relocation as a global min-cost flow problem under timing constraints. Evaluation on open-source designs under a commercial backend synthesis flow shows that PipeRTL improves downstream implementation quality on average, reducing critical-path delay, power, and area across the evaluated benchmarks, while also providing a stronger starting point for backend retiming. These results indicate that exposing pipeline optimization as an explicit compiler pass can deliver backend-meaningful gains by improving the sequential structure presented to later stages and the resulting downstream implementation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PipeRTL moves register relocation to the IR level with explicit legality, a learned timing model, and min-cost flow, but the predictor's role in the reported gains is not clearly validated.


The main point is that PipeRTL shifts pipeline register placement earlier, into the compiler IR in CIRCT, rather than leaving it to backend retiming after the design is lowered to a netlist. It tracks which register moves are legal at the IR level, feeds a learned timing predictor into the decisions, and solves the placement as a global min-cost flow under timing constraints. This combination at the IR stage does not appear in the prior flows the abstract summarizes, so the formulation itself is the new piece.
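To make "legal register move" and "timing constraint" concrete, here is a minimal sketch in the classical Leiserson-Saxe retiming model. It is not PipeRTL's IR-level legality rule, which the abstract does not describe. The toy circuit, the delays, and the helper names (`retime`, `is_legal`, `clock_period`) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy circuit: node -> combinational delay (arbitrary units).
DELAY = {"a": 1, "b": 3, "c": 3, "d": 1}
# Edges as (src, dst, register count); both registers start on the edge d -> a.
EDGES = [("a", "b", 0), ("b", "c", 0), ("c", "d", 0), ("d", "a", 2)]


def retime(edges, r):
    """Apply retiming labels r: w_r(u -> v) = w + r[v] - r[u]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


def is_legal(edges):
    """A retiming is legal when no edge ends up with a negative register count."""
    return all(w >= 0 for _, _, w in edges)


def clock_period(edges, delay):
    """Longest combinational path: the longest path over zero-register edges."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in delay}
    for u, v, w in edges:
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    arrival = dict(delay)  # arrival time at each node's output
    ready = [v for v in delay if indeg[v] == 0]
    visited = 0
    while ready:  # Kahn-style topological sweep
        u = ready.pop()
        visited += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(delay):
        raise ValueError("combinational loop: some cycle carries no register")
    return max(arrival.values())


balanced = retime(EDGES, {"a": 0, "b": 0, "c": 1, "d": 1})
print(clock_period(EDGES, DELAY))                     # 8: path a-b-c-d is unbroken
print(is_legal(balanced), clock_period(balanced, DELAY))  # True 4
print(is_legal(retime(EDGES, {"a": 0, "b": 1, "c": 0, "d": 0})))  # False
```

Moving one register from `d -> a` back to `b -> c` keeps the total register count at two but halves the longest combinational path. The last call shows a move that would need a negative register count on `b -> c` and is therefore rejected.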

Referee Report

2 major / 2 minor

Summary. The paper presents PipeRTL, an IR-level pipeline optimization framework instantiated in CIRCT. It makes register relocation legality explicit, uses a learned timing predictor to approximate downstream delay, and formulates timing-aware register relocation as a global min-cost flow problem. Evaluation on open-source designs under a commercial backend synthesis flow reports average reductions in critical-path delay, power, and area, plus a stronger starting point for backend retiming.

Significance. If the central results hold after validation, the work demonstrates that exposing pipeline optimization as an explicit compiler pass can deliver measurable backend QoR gains by preserving operator-level structure that is otherwise lost after lowering to netlist. The combination of explicit legality modeling, learned predictors, and min-cost flow is a concrete step toward earlier, more global timing-aware transformations in hardware compilers.

major comments (2)

[Evaluation] Evaluation section: the reported average improvements in delay, power, and area after commercial synthesis are presented without training details for the learned timing predictor (dataset, features, model architecture), without prediction-error metrics on the evaluated designs, and without an ablation comparing predictor-guided relocation against a predictor-free or heuristic baseline. Because the central claim is that the predictor enables meaningfully better relocation decisions than alternatives when measured in the actual backend, these omissions make it impossible to determine whether the gains are attributable to timing awareness or to the explicit legality modeling and flow formulation alone.
[Methods] Methods / formulation: the integration of the learned predictor into the min-cost flow objective and timing constraints is not described with sufficient precision to verify that the resulting relocation decisions remain legal and that the flow produces decisions that are robust to predictor error. No analysis is given of how prediction inaccuracies propagate to the final netlist-level critical path.

minor comments (2)

[Abstract] Abstract: the phrase 'on average' is used without reporting the number of benchmarks, the magnitude of improvements, or variance; adding these numbers would strengthen the summary.
[Notation] Notation: the paper should define the precise interface between the IR-level timing predictor and the commercial backend (e.g., which delay model or cell library is approximated) to allow reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional information will improve the clarity and verifiability of our claims. We address each major comment below and will revise the manuscript accordingly.


Referee: [Evaluation] Evaluation section: the reported average improvements in delay, power, and area after commercial synthesis are presented without training details for the learned timing predictor (dataset, features, model architecture), without prediction-error metrics on the evaluated designs, and without an ablation comparing predictor-guided relocation against a predictor-free or heuristic baseline. Because the central claim is that the predictor enables meaningfully better relocation decisions than alternatives when measured in the actual backend, these omissions make it impossible to determine whether the gains are attributable to timing awareness or to the explicit legality modeling and flow formulation alone.

Authors: We agree that the evaluation lacks sufficient detail on the timing predictor to allow readers to fully attribute the reported QoR gains. In the revised manuscript we will add: the composition and size of the training dataset, the full set of features used, the model architecture and training procedure, quantitative prediction-error metrics (MAE and max error) measured on the evaluated designs, and an ablation that compares the complete predictor-guided min-cost flow against a predictor-free baseline that retains only the legality modeling and flow formulation. These additions will make it possible to isolate the contribution of the learned timing model. revision: yes
Referee: [Methods] Methods / formulation: the integration of the learned predictor into the min-cost flow objective and timing constraints is not described with sufficient precision to verify that the resulting relocation decisions remain legal and that the flow produces decisions that are robust to predictor error. No analysis is given of how prediction inaccuracies propagate to the final netlist-level critical path.

Authors: We acknowledge that the current description of the min-cost flow integration is not precise enough. The revised paper will present the exact objective function and timing constraints, showing how the predictor outputs enter the edge costs while legality is enforced separately via the IR-level relocation rules (independent of predicted values). We will also add a dedicated subsection analyzing robustness to prediction error, including a sensitivity study that perturbs the predictor outputs within the observed error range and reports the resulting change in final critical-path delay after synthesis. This will quantify how inaccuracies propagate to the netlist-level outcome. revision: yes
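The two promised additions, prediction-error metrics and a perturbation-based sensitivity study, could look roughly like the sketch below. Everything in it is hypothetical: the operator names, the delay values, and the choice of uniform noise bounded by the observed maximum error are illustrative assumptions. It also measures sensitivity on a predicted-delay graph rather than after synthesis, which is what the rebuttal actually proposes.

```python
import random


def error_metrics(predicted, measured):
    """Return (mean absolute error, max absolute error) over paired delays."""
    errs = [abs(p - m) for p, m in zip(predicted, measured)]
    return sum(errs) / len(errs), max(errs)


def critical_path(delay, edges, order):
    """Longest path delay in a DAG; `order` must be a topological order."""
    preds = {v: [] for v in order}
    for u, v in edges:
        preds[v].append(u)
    arrival = {}
    for v in order:
        arrival[v] = delay[v] + max((arrival[u] for u in preds[v]), default=0.0)
    return max(arrival.values())


def sensitivity(delay, edges, order, bound, trials=200, seed=0):
    """Perturb each delay uniformly within +/-bound; return (min, max) critical path."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        noisy = {v: max(0.0, d + rng.uniform(-bound, bound)) for v, d in delay.items()}
        results.append(critical_path(noisy, edges, order))
    return min(results), max(results)


# Hypothetical numbers for illustration only (not taken from the paper).
PREDICTED = {"mul": 2.0, "add": 1.0, "mux": 0.5, "cmp": 0.8}
MEASURED = {"mul": 2.3, "add": 0.9, "mux": 0.5, "cmp": 1.0}
DAG_EDGES = [("mul", "add"), ("add", "mux"), ("cmp", "mux")]
TOPO_ORDER = ["mul", "cmp", "add", "mux"]

names = list(PREDICTED)
mae, max_err = error_metrics([PREDICTED[n] for n in names],
                             [MEASURED[n] for n in names])
nominal = critical_path(PREDICTED, DAG_EDGES, TOPO_ORDER)
low, high = sensitivity(PREDICTED, DAG_EDGES, TOPO_ORDER, bound=max_err)
print(f"MAE={mae:.3f} max_err={max_err:.3f} nominal={nominal:.2f} "
      f"range=[{low:.2f}, {high:.2f}]")
```

Because the longest path here has three operators, the perturbed critical path can move by at most three times the per-operator bound. A real study would replace `critical_path` with the post-synthesis delay reported by the backend.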

Circularity Check

0 steps flagged

No circularity: external evaluation keeps claims independent of internal predictor fit


The derivation formulates IR-level register relocation as a min-cost flow whose edge costs come from a learned timing predictor; the central claim is that the resulting RTL, when fed to a commercial backend, yields measured reductions in delay, power, and area. This outcome is obtained by running the optimized RTL through an external synthesis tool on held-out open-source benchmarks, not by re-using the predictor's training targets or by renaming fitted quantities as 'predictions.' No self-citations, uniqueness theorems, or ansatzes are invoked to close the loop, and the abstract supplies no equations that equate the reported gains to quantities defined inside the same model. The chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters or axioms; the learned timing predictor is presumed to contain fitted weights, but no values or training procedure are stated.


