Parallel Context Modeling for Sliding Window Attention in Neural Video Coding

Alexander Kopte; André Kaup

arxiv: 2605.20977 · v1 · pith:FU5LD5PBnew · submitted 2026-05-20 · 📡 eess.IV

Pith reviewed 2026-05-21 02:23 UTC · model grok-4.3

classification 📡 eess.IV

keywords: neural video coding, sliding window attention, parallel decoding, context modeling, hyperprior, rate-distortion performance, video compression

The pith

Embedding a hyperprior and accumulator in parallel sliding window attention speeds decoding and improves rate-distortion performance in neural video coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to remove the sequential raster-scan restriction that limits decoding speed in sliding window attention models for neural video codecs. It does so by switching to diagonal wavefront processing and adding a hyperprior plus an accumulator that combines side information with local spatial context. If successful, this keeps the drift-free advantages of transformer approaches while cutting latency and raising compression efficiency for both I-frames and P-frames. Readers would care because current neural video codecs either propagate errors over time or run too slowly for practical use.

Core claim

The proposed P-SWA method utilizes diagonal wavefronts to enable parallel decoding. By embedding a hyperprior and introducing an accumulator to fuse side information and local spatial context, the method increases decoding speed by 36% over the parallel VCT while achieving Bjøntegaard Delta-rate savings of up to 10.0% for I-frames and 7.1% for P-frames over the SWA baseline.

What carries the argument

Diagonal wavefronts for parallel decoding together with a hyperprior and an accumulator that fuses side information and local spatial context.
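To make the wavefront idea concrete, here is a minimal Python sketch of a diagonal decoding schedule. It is an illustration under stated assumptions, not the paper's implementation: the function name `wavefront_schedule`, the `lag` parameter, and the dependency pattern (each position needs only its left neighbor and the position `lag - 1` columns to the right in the row above) are hypothetical choices made for this example.

```python
def wavefront_schedule(height, width, lag=1):
    """Group the positions of a height x width latent grid into decoding steps.

    Assumption (not taken from the paper): position (r, c) depends only on
    (r, c - 1) and (r - 1, c + lag - 1). All positions that share the same
    value of c + lag * r can then be decoded in the same parallel step.
    """
    steps = {}
    for r in range(height):
        for c in range(width):
            steps.setdefault(c + lag * r, []).append((r, c))
    # Return the groups ordered by step index.
    return [steps[t] for t in sorted(steps)]


schedule = wavefront_schedule(3, 4)
for t, positions in enumerate(schedule):
    print(t, positions)
print(len(schedule), "parallel steps instead of", 3 * 4, "sequential ones")
```

With `lag=1` each step is a plain anti-diagonal of the grid. Setting `lag=2` mimics the offset used by wavefront parallel processing in HEVC, where a row starts once two blocks of the row above are finished.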

Load-bearing premise

Switching to diagonal wavefronts for parallel decoding preserves the full context-modeling accuracy of the original sequential sliding window attention without introducing boundary artifacts or reducing prediction quality.

What would settle it

A side-by-side test that measures higher distortion or visible boundary artifacts when the same model runs with diagonal wavefronts instead of sequential raster order.

Original abstract

Most neural video codecs rely on temporal conditioning, which makes them susceptible to error propagation over long sequences. While Transformer-based architectures like the VCT offer a drift-free alternative, they suffer from high computational complexity and inferior RD performance. The recent SWA addresses these shortcomings by reducing complexity and enhancing RD performance, yet it restricts decoding to a strictly sequential raster-scan order, creating a critical bottleneck in decoding latency. To resolve this, we propose P-SWA, utilizing diagonal wavefronts to enable parallel decoding. By embedding a hyperprior and introducing an accumulator to fuse side information and local spatial context, our method increases decoding speed by 36% over the parallel VCT. Simultaneously, it achieves Bjøntegaard Delta-rate savings of up to 10.0% for I-frames and 7.1% for P-frames over the SWA baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

P-SWA parallelizes SWA decoding with wavefronts and a hyperprior-accumulator fusion, but the reported speed and BD-rate gains rest on unshown experiments that need checking for boundary effects.

The main point is that this paper takes the sequential sliding-window attention from recent neural video codecs and makes it parallel by switching to diagonal wavefront scheduling, then adds a hyperprior plus an accumulator to keep the context modeling working across the new order. That combination is presented as the fix for the latency bottleneck in SWA while still beating the baseline on rate-distortion for I- and P-frames. The 36% decoding speed claim over parallel VCT and the 10%/7.1% BD-rate savings are the concrete numbers they lead with.

If the full experiments back those up with proper controls, it is a useful engineering step for anyone trying to ship transformer-based video codecs that decode faster without extra drift. The idea itself is not a deep theoretical shift; it is an application of known wavefront parallelization to this specific attention setup, with the fusion mechanism as the added piece. What the work does cleanly is identify the raster-scan restriction as the real blocker and propose a scheduling change plus side-information fusion to remove it.

The stress-test concern about changed neighbor availability at wavefront boundaries is worth taking seriously. The abstract does not show an ablation that isolates whether the accumulator exactly restores the original conditioning distribution or whether some of the reported gains come from the altered context instead. Without dataset details, error bars, or boundary-specific checks visible here, it is hard to judge how much the quality holds. That said, the claims are empirical and falsifiable rather than circular, so the paper is not built on hidden fitting.

This is for people already working on efficient neural video compression who need lower decode latency. It will not reset the field but could save real time in practice if the numbers survive review. I would send it to peer review so the experiments can be examined in full; the core proposal is clear enough to be worth referee time even if revisions are needed on the validation side.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes P-SWA, a parallel context modeling method for sliding window attention (SWA) in neural video coding. It replaces the sequential raster-scan order of SWA with diagonal wavefront scheduling to enable parallel decoding, embeds a hyperprior, and introduces an accumulator to fuse side information with local spatial context. The central empirical claims are a 36% decoding speed increase over the parallel VCT baseline together with Bjøntegaard Delta-rate savings of up to 10.0% for I-frames and 7.1% for P-frames relative to the original SWA.

Significance. If the reported speed and rate-distortion gains are reproducible and the parallel schedule truly preserves context-modeling fidelity, the work would address a practical bottleneck in transformer-based neural codecs. The combination of wavefront parallelization with an explicit accumulator for context fusion is a concrete engineering contribution that could improve deployability of drift-free video codecs.

major comments (1)

[Parallel decoding and accumulator description (method section)] The central claim that P-SWA achieves the stated BD-rate savings while preserving (or improving) context-modeling accuracy rests on the assumption that the accumulator fully compensates for any change in available spatial neighbors at wavefront boundaries. No ablation or controlled comparison (e.g., in the experimental section) isolates the effect of diagonal versus raster ordering on the conditional entropy model; without such evidence it remains possible that part of the reported 10.0%/7.1% savings arises from altered context rather than from the proposed fusion mechanism.
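The concern about neighbor availability can be illustrated with a toy count. The sketch below assumes a square local window of radius 1 and two simplified decoding orders: a plain raster scan, and a plain anti-diagonal wavefront in which all positions with equal `r + c` are decoded together. The helper `available_context` and both orders are hypothetical simplifications for illustration; they are not the paper's attention mask.

```python
def available_context(r, c, height, width, radius, order):
    """Count already-decoded positions inside the local window around (r, c).

    Assumed orders (illustrative only):
      "raster":   everything earlier in row-major order is decoded.
      "diagonal": everything on an earlier anti-diagonal (rr + cc < r + c)
                  is decoded; positions on the same diagonal are not.
    """
    count = 0
    for rr in range(max(0, r - radius), min(height, r + radius + 1)):
        for cc in range(max(0, c - radius), min(width, c + radius + 1)):
            if order == "raster":
                decoded = (rr, cc) < (r, c)
            elif order == "diagonal":
                decoded = rr + cc < r + c
            else:
                raise ValueError(f"unknown order: {order}")
            count += int(decoded)
    return count


for order in ("raster", "diagonal"):
    print(order, available_context(2, 2, 5, 5, radius=1, order=order))
```

Under these assumptions an interior position loses its top-right neighbor, which lies on the same diagonal and is therefore decoded in the same step. Whether P-SWA's accumulator makes up for a loss of this kind is what the requested ablation would need to show.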

minor comments (2)

[Abstract] The abstract states performance numbers without reference to the exact test sequences, QP range, or number of frames; these details should be added for reproducibility.
[Results section] Figure captions and table headers should explicitly state whether the reported BD-rate figures are computed against the same anchor for both I- and P-frames.
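As background for the BD-rate figures this report discusses, the following Python sketch shows the usual Bjøntegaard Delta-rate calculation: express log-rate as a cubic polynomial of PSNR for each codec, integrate both curves over the shared PSNR range, and convert the average log-rate gap into a percentage. It is a simplified illustration restricted to exactly four rate-distortion points per curve, so the cubic fit reduces to an interpolation. The function names and the sample numbers are invented for the example and are not results from the paper.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def simpson(f, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to degree 3."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


def bd_rate(anchor_rates, anchor_psnr, test_rates, test_psnr):
    """Average bitrate change (in percent) of test vs. anchor at equal PSNR."""
    if not all(len(v) == 4 for v in (anchor_rates, anchor_psnr, test_rates, test_psnr)):
        raise ValueError("this sketch expects exactly four RD points per curve")
    log_a = [math.log(r) for r in anchor_rates]
    log_t = [math.log(r) for r in test_rates]
    # Integrate only over the PSNR range covered by both curves.
    lo = max(min(anchor_psnr), min(test_psnr))
    hi = min(max(anchor_psnr), max(test_psnr))
    int_a = simpson(lambda q: lagrange_eval(anchor_psnr, log_a, q), lo, hi)
    int_t = simpson(lambda q: lagrange_eval(test_psnr, log_t, q), lo, hi)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (math.exp(avg_log_diff) - 1.0) * 100.0


# Made-up sample points: the test codec uses 10% fewer bits at every PSNR.
psnr = [30.0, 33.0, 36.0, 39.0]
anchor = [100.0, 200.0, 400.0, 800.0]
test = [0.9 * r for r in anchor]
print(round(bd_rate(anchor, psnr, test, psnr), 2))
```

A negative value means the test codec needs fewer bits than the anchor at equal quality, which is why the choice of anchor matters when comparing I-frame and P-frame numbers.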

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding the isolation of contributions in our P-SWA method. We address the point below and have revised the manuscript to include additional analysis.

Referee: The central claim that P-SWA achieves the stated BD-rate savings while preserving (or improving) context-modeling accuracy rests on the assumption that the accumulator fully compensates for any change in available spatial neighbors at wavefront boundaries. No ablation or controlled comparison (e.g., in the experimental section) isolates the effect of diagonal versus raster ordering on the conditional entropy model; without such evidence it remains possible that part of the reported 10.0%/7.1% savings arises from altered context rather than from the proposed fusion mechanism.

Authors: We agree that an explicit ablation isolating the ordering change from the accumulator's effect would provide stronger support for the claim. The manuscript describes the accumulator as fusing the hyperprior with local spatial context specifically to compensate for reduced neighbors at wavefront boundaries in the diagonal schedule. Direct comparisons to the original SWA (raster order) already show consistent BD-rate gains, but to address the concern we have added a controlled ablation in the revised experimental section. This compares the entropy model under raster ordering, diagonal ordering without accumulator, and full P-SWA, confirming that the savings arise primarily from the fusion mechanism.

revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on empirical measurements of architectural changes

The paper proposes P-SWA with diagonal wavefront parallelization, a hyperprior, and an accumulator for fusing side information with local context. Reported gains (36% decoding speed over parallel VCT; BD-rate savings of 10.0% I-frames / 7.1% P-frames over SWA) are presented as experimental outcomes from implementation and testing. No derivation chain, equation, or result is shown reducing to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation whose content is unverified. The method description and performance numbers are independent of the target claims and do not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, mathematical axioms, or newly postulated entities; the contribution is presented as an empirical architectural change.

pith-pipeline@v0.9.0 · 5673 in / 1050 out tokens · 29402 ms · 2026-05-21T02:23:30.895587+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "By embedding a hyperprior and introducing an accumulator to fuse side information and local spatial context, our method increases decoding speed by 36% over the parallel VCT."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "we adapt the SWA mechanism to utilize a repeating s×s spatial attention mask... creating a diagonal wavefront"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 2 internal anchors

[1] INTRODUCTION. Learned video compression has achieved rate-distortion performance competitive with traditional codecs [1]. Yet, most state-of-the-art Neural Video Codecs (NVCs) rely on explicit or implicit conditioning of the feature transform on temporal context. While beneficial for compression efficiency, this dependency introduces error propagation, ...

[2] RELATED WORK. Parallelization is a fundamental feature of traditional hybrid codecs. High Efficiency Video Coding (HEVC) [4] and Versatile Video Coding (VVC) [5] employ Wavefront Parallel Processing (WPP), which leverages the fact that probability estimation for entropy coding relies primarily on the immediate spatial neighborhood. This allows the deco...

[3] PROPOSED METHOD. 3.1. Overview. The overall architecture of the proposed entropy model is illustrated in Fig. 1. First, a Context Transformer extracts temporal context from the quantized latent ŷ using time-causal SWA [3]. Briefly, SWA restricts standard self-attention to a localized 3D sliding window, significantly reducing computational complexity while ...

[4] EXPERIMENTS. 4.1. Experimental setting. For a fair comparison, we strictly adhere to the training and testing protocols established in [3]. We utilize the feature transform from the DCVC-DC image model [8], pre-trained on ImageNet [16] and subsequently frozen. Consequently, we use DCVC-DC as our anchor. The entropy model is trained in RGB space on the Ope...

[5] CONCLUSION AND OUTLOOK. In this paper, we proposed P-SWA to overcome the inherent sequentiality of the original SWA entropy model. By embedding the hyperprior within the transformer network and employing an accumulator to fuse its output with the previously decoded spatial context, alongside a decoupled channel autoregression mechanism, we enable...
[6] Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, and Yan Lu, “Towards Practical Real-Time Neural Video Compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 12543–12552

[7] Fabian Mentzer, George D. Toderici, David Minnen, Sergi Caelles, Sung Jin Hwang, Mario Lucic, and Eirikur Agustsson, “VCT: A video compression transformer,” in Advances in Neural Information Processing Systems (NeurIPS), Dec. 2022, vol. 35, pp. 13091–13103

[8] Alexander Kopte and André Kaup, “Sliding window attention for learned video compression,” in Proceedings of the Picture Coding Symposium (PCS), Dec. 2025

[9] Gary J. Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec. 2012

[10] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm, “Overview of the Versatile Video Coding (VVC) Standard and its Applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, Oct. 2021

[11] David Minnen and Saurabh Singh, “Channel-wise autoregressive entropy models for learned image compression,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), Oct. 2020, pp. 3339–3343

[12] Jiahao Li, Bin Li, and Yan Lu, “Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression,” in Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, Oct. 2022, MM ’22, pp. 1503–1511, Association for Computing Machinery

[13] Jiahao Li, Bin Li, and Yan Lu, “Neural video compression with diverse contexts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 22616–22626

[14] Jinxi Xiang, Kuan Tian, and Jun Zhang, “MIMT: Masked image modeling transformer for video compression,” in Proceedings of the International Conference on Learning Representations (ICLR), May 2023

[15] Jiahao Li, Bin Li, and Yan Lu, “Neural Video Compression with Feature Modulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 26099–26108

[16] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony H..., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” arXiv, 2023

[17] Philippe Tillet, H. T. Kung, and David Cox, “Triton: An intermediate language and compiler for tiled neural network computations,” in Proceedings of the ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), June 2019, pp. 10–19

[18] Tri Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” in Proceedings of the International Conference on Learning Representations (ICLR), May 2024

[19] JCT-VC, “High Efficiency Video Coding (HEVC) Test Model 18.0 (HM 18.0),” https://vcgit.hhi.fraunhofer.de/jvet/HM/-/tree/HM-18.0, Apr. 2023

[20] JVET, “Versatile Video Coding (VVC) Test Model 23.10 (VTM 23.10),” https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-23.10, May 2025

[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, Dec. 2015

[22] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai, “OpenVid-1M: A large-scale high-quality dataset for text-to-video generation,” in Proceedings of the International Conference on Learning Representations (ICLR), Apr. 2025

[23] Alexandre Mercat, Marko Viitanen, and Jarno Vanne, “UVG dataset: 50/120fps 4k sequences for video codec analysis and development,” in Proceedings of the ACM Multimedia Systems Conference (MMSys), June 2020, pp. 297–302

[24] Haiqiang Wang, Weihao Gan, Sudeng Hu, Joe Yuchieh Lin, Lina Jin, Longguang Song, Ping Wang, Ioannis Katsavounidis, Anne Aaron, and C.-C. Jay Kuo, “MCL-JCV: A JND-based H.264/AVC video quality assessment dataset,” in Proceedings of the IEEE International Conference on Image Processing (ICIP), Sept. 2016, pp. 1509–1513

[25] Frank Bossen, “Common test conditions and software reference configurations,” Tech. Rep. JCTVC-L1100, Joint Collaborative Team on Video Coding (JCT-VC), Geneva, CH, Jan. 2013

[26] Z.S. Hakura and A. Gupta, “The design and analysis of a cache architecture for texture mapping,” in Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA), June 1997, pp. 108–120
