BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

Jung-Hua Wang; Tsung-Wei Pan

arxiv: 2606.08674 · v3 · pith:FAYMSNSJnew · submitted 2026-06-07 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-27 18:33 UTC · model grok-4.3

keywords: video generation · autoregressive modeling · biological behavior · action duration · end-of-sequence token · tokenization · Wasserstein distance · NTU RGB+D

The pith

BioVid generates variable-length videos of biological actions by learning when to emit an end-of-sequence token, conditioned on the first frame alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an autoregressive model that treats action duration as a learnable semantic feature rather than an externally supplied parameter. A causal transformer processes frame-wise visual tokens and terminates generation when it predicts an EOS token, allowing length to arise from the distribution of real behaviors. This is enabled by a tokenizer that keeps single-frame information intact for next-token prediction while using a 3D decoder to maintain temporal coherence. On held-out drinking clips, the resulting length distribution lies only 1.24 frames from the ground-truth distribution in Wasserstein distance. Fixed-length baselines remain several times farther away even when tuned to the dataset mean.

Core claim

BioVid is a data-driven autoregressive framework for adaptive-length biological behavior generation. It employs a 2D-encode/3D-decode tokenizer that converts each frame into discrete visual tokens and a causal Transformer that, conditioned only on the first frame, models the token sequence and stops generation upon emitting an End-of-Sequence token. On 94 held-out clips of the A001 drinking action from NTU RGB+D, this yields a Wasserstein-1 distance of 1.24 frames from the real duration distribution, compared with distances of approximately 6-7 frames for fixed-length baselines configured to the dataset mean and approximately 15 frames for conventional 16-frame generation.
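The metric behind these numbers can be illustrated with a minimal sketch. It assumes the Wasserstein-1 distance is computed between two one-dimensional samples of clip lengths (in frames) as the area between their empirical CDFs; the function name and the length lists below are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right

def wasserstein1(a, b):
    """W1 between two 1-D empirical distributions: area between their CDFs."""
    a, b = sorted(a), sorted(b)
    support = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(support, support[1:]):
        cdf_a = bisect_right(a, left) / len(a)   # fraction of a that is <= left
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total

# Hypothetical clip lengths in frames; NOT data from the paper.
real_lengths = [18, 22, 25, 25, 31, 40]
adaptive_lengths = [20, 21, 27, 27, 30, 37]   # stand-in for EOS-terminated outputs
fixed_lengths = [16] * len(real_lengths)       # every clip forced to 16 frames

print("adaptive:", wasserstein1(adaptive_lengths, real_lengths))
print("fixed-16:", wasserstein1(fixed_lengths, real_lengths))
```

A fixed-length generator is a point mass, so its distance reduces to the mean absolute gap between the chosen length and the real lengths. That is why such a baseline cannot approach zero unless the real durations barely vary.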

What carries the argument

The causal Transformer that emits an End-of-Sequence token to terminate generation, conditioned solely on the first frame's visual tokens.
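A minimal sketch of EOS-terminated generation follows. The abstract does not say where in the token stream the EOS token may appear, so this sketch assumes it is only emitted at frame boundaries; `generate`, `make_toy_model`, the `EOS` id, the codebook size, and the frame cap are all hypothetical, and the toy model is a random stand-in for the causal Transformer.

```python
import random

EOS = -1  # hypothetical id reserved for the end-of-sequence token

def generate(first_frame, next_token, tokens_per_frame, max_frames=300):
    """Autoregressively extend `first_frame` until `next_token` returns EOS."""
    tokens = list(first_frame)
    while len(tokens) < max_frames * tokens_per_frame:  # safety cap, an assumption
        tok = next_token(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    # Regroup the flat sequence into frames, dropping any incomplete frame.
    n_frames = len(tokens) // tokens_per_frame
    return [tokens[i * tokens_per_frame:(i + 1) * tokens_per_frame]
            for i in range(n_frames)]

def make_toy_model(tokens_per_frame, stop_prob, seed=0):
    """Random stand-in for the Transformer; may emit EOS only at frame boundaries."""
    rng = random.Random(seed)
    def next_token(tokens):
        at_boundary = len(tokens) % tokens_per_frame == 0
        if at_boundary and rng.random() < stop_prob:
            return EOS
        return rng.randrange(1024)  # arbitrary codebook size
    return next_token

frames = generate([0, 1, 2, 3], make_toy_model(4, stop_prob=0.1), tokens_per_frame=4)
print(len(frames), "frames generated")
```

The clip length is never passed in: it is simply the number of frames produced before the stop token, which is what lets duration follow whatever distribution the model has learned.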

If this is right

Action duration can be treated as an emergent property of the visual token sequence rather than an input hyperparameter.
Generation can be conditioned on a single starting frame while still reproducing the full empirical length distribution.
The 2D-encode/3D-decode tokenizer enables both next-token prediction and temporally coherent reconstruction.
Fixed temporal windows become unnecessary when the model learns termination from data.
The approach directly compares generated length statistics to real distributions via Wasserstein distance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same EOS mechanism could be applied to actions whose durations vary strongly with context, such as reaching versus grasping.
Pairing the model with a language prompt might allow the visual component to override or refine externally suggested lengths.
Evaluating on multi-action sequences would test whether the learned termination generalizes when behaviors transition.
If the first-frame conditioning suffices, similar autoregressive termination could be tested on other sequential data like audio or motion capture.

Load-bearing premise

The intrinsic duration of a biological behavior is encoded in the frame-wise visual token sequence and recoverable by a causal transformer that emits an EOS token at the statistically appropriate moment when conditioned only on the first frame.

What would settle it

Measure the Wasserstein-1 distance between generated and real duration distributions on a different action class from the same dataset; a distance remaining near 1.24 frames while fixed-length baselines stay at 6 frames or higher would support the claim.

Original abstract

Video generation for biological behavior requires more than visually plausible motion: the duration of an action is itself a semantic property. Existing models usually rely on fixed temporal windows, external continuation, or prompt-driven stories, so length is specified externally rather than learned from behavior. To address this gap, we propose BioVid, a data-driven autoregressive framework for adaptive-length biological behavior generation. BioVid uses a 2D-encode/3D-decode tokenizer: a two-dimensional FSQ-R3GAN encoder converts each frame into discrete visual tokens, preserving single-frame information suited for next-token prediction and EOS-based termination, while a temporally inflated and video-finetuned three-dimensional decoder reconstructs generated tokens with temporal context to reduce flickering. A causal Transformer then models the frame-wise token sequence and, conditioned only on the first frame, stops generation when it emits an End-of-Sequence token, allowing duration to emerge from the learned behavior distribution. We evaluate BioVid on the A001 drinking action from NTU RGB+D. On 94 held-out clips, BioVid achieves a Wasserstein-1 distance of 1.24 frames from the real duration distribution. In comparison, fixed-length baselines yield distances of approximately 6-7 frames even when configured to the available length closest to the dataset mean, and approximately 15 frames when using the conventional 16-frame generation length. These results demonstrate the ability of BioVid to learn and reproduce the intrinsic duration distribution of biological behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BioVid uses an EOS token on autoregressive video tokens to let action duration emerge from data, with a clear win on one NTU drinking class but no ablations to confirm the mechanism.

The paper's main move is to add an EOS token to a causal transformer over frame-wise visual tokens so that video generation for biological actions stops at a learned duration instead of a fixed window. On 94 held-out A001 drinking clips from NTU RGB+D, the authors report a Wasserstein-1 distance of 1.24 frames to the real duration distribution, against approximately 6-7 frames for fixed-length baselines set to the available length closest to the dataset mean and approximately 15 frames for conventional 16-frame generation.

The tokenizer split (2D FSQ-R3GAN encoder per frame plus temporally inflated 3D decoder) is a sensible engineering choice that keeps next-token prediction clean while giving the decoder temporal context. The result is concrete and directly addresses the stated gap.
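To make the per-frame discretization concrete, here is a simplified sketch of finite scalar quantization, assuming the "FSQ" in FSQ-R3GAN refers to that technique. It handles odd level counts only and omits the straight-through gradient used in training; the level counts and function names are hypothetical and are not the paper's configuration.

```python
import math

def fsq_quantize(z, levels):
    """Bound each channel with tanh, then round to one of levels[i] integer values."""
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2              # odd n_levels -> codes in [-half, half]
        codes.append(round(math.tanh(value) * half))
    return codes

def codes_to_token(codes, levels):
    """Pack the per-channel codes into one token id (mixed-radix encoding)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + (code + (n_levels - 1) // 2)
    return token

levels = [5, 5, 5]                             # hypothetical: 5 * 5 * 5 = 125 token ids
codes = fsq_quantize([0.3, -2.0, 1.1], levels)
print(codes, "->", codes_to_token(codes, levels))
```

Because each latent vector maps to exactly one id from a fixed implicit codebook, every frame becomes a short sequence of discrete symbols that a next-token predictor can model alongside a dedicated stop symbol.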

The soft spot is the narrow test bed. Everything rests on a single action class, and the abstract supplies no ablations that would show the EOS decision is driven by the generated token content rather than a learned marginal length for drinking. Without those controls or results on additional actions, it is hard to know whether the claimed mechanism is actually operating. Training procedure, losses, and statistical tests are also absent from the text.
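One way to probe this concern is a content-blind control that ignores the video entirely and draws clip lengths from the training marginal. The sketch below is only an illustration of that idea: the helper names and all frame counts are hypothetical, and it uses the equal-sample-size form of Wasserstein-1 (mean gap between sorted values).

```python
import random

def w1_equal_size(a, b):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def sample_length_prior(train_lengths, n, seed=0):
    """Content-blind control: draw clip lengths from the training marginal."""
    rng = random.Random(seed)
    return [rng.choice(train_lengths) for _ in range(n)]

# Hypothetical frame counts; NOT data from the paper.
train_lengths = [17, 20, 22, 24, 25, 27, 29, 33, 36, 41]
heldout_lengths = [19, 23, 26, 30, 38]
model_lengths = [18, 24, 27, 29, 36]           # placeholder for EOS-terminated outputs

prior_lengths = sample_length_prior(train_lengths, len(heldout_lengths))
print("length-prior control W1:", w1_equal_size(prior_lengths, heldout_lengths))
print("model W1:               ", w1_equal_size(model_lengths, heldout_lengths))
```

If such a control scored about as well as the model on real data, a distribution-level distance alone would not show that termination depends on the generated token content.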

This is for people building autoregressive video models who need variable-length outputs that respect action semantics. The idea is clean enough and the number specific enough that it deserves a serious referee, even though the current evidence is limited in scope.

Referee Report

3 major / 2 minor

Summary. The paper proposes BioVid, an autoregressive video generation model using a 2D-encode/3D-decode tokenizer and a causal Transformer that generates frame-wise visual tokens and terminates via an EOS token conditioned only on the first frame, allowing action duration to emerge from the learned distribution. On 94 held-out NTU RGB+D clips of the A001 drinking action, it reports a Wasserstein-1 distance of 1.24 frames to the real duration distribution, outperforming fixed-length baselines (approximately 6-7 frames when set to the available length closest to the dataset mean, approximately 15 frames for 16-frame generation).

Significance. If the result holds and generalizes, the work would demonstrate a meaningful advance in video generation by learning intrinsic durations of biological behaviors without external length specification or fixed windows. The use of EOS-based termination from visual tokens is a clean idea with potential impact on semantic video models, though the single-action evaluation limits immediate broader claims.

major comments (3)

[Abstract / §3] Abstract and §3 (model description): the central quantitative claim (Wasserstein-1 = 1.24) is reported with no architecture details, training procedure, loss functions, optimizer, or statistical tests, preventing assessment of whether the EOS decision is actually driven by visual token content rather than a learned marginal length distribution for the drinking action.
[§4] §4 (evaluation): the result is shown only for the single A001 drinking action on 94 held-out clips; this is insufficient to support the broader claim of 'biological behavior semantic comprehension' across behaviors, as no cross-action or cross-dataset results are provided.
[Abstract] Abstract: no ablations (e.g., ablating visual tokens vs. frame count or action class conditioning) are reported to confirm that termination is content-dependent on the generated token sequence rather than implicit training signals or a per-action length prior.
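The statistical-testing gap raised in the first comment could be addressed in several ways; one simple option is a percentile bootstrap interval around the reported distance. The sketch below assumes one generated clip per held-out clip (so both samples have equal size) and uses hypothetical lengths and function names; it is not the authors' procedure.

```python
import random

def w1_equal_size(a, b):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def bootstrap_w1(generated, real, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile-bootstrap interval for W1."""
    rng = random.Random(seed)
    stats = sorted(
        w1_equal_size(rng.choices(generated, k=len(generated)),
                      rng.choices(real, k=len(real)))
        for _ in range(n_boot)
    )
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return w1_equal_size(generated, real), lower, upper

# Hypothetical clip lengths in frames; NOT data from the paper.
real = [18, 22, 25, 25, 31, 40, 28, 24]
generated = [20, 21, 27, 27, 30, 37, 29, 23]

point, lower, upper = bootstrap_w1(generated, real)
print(f"W1 = {point:.2f} frames, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Resampling adds noise to both samples, so the naive interval can be biased upward when the two distributions are close; it is best read as a rough indication of how much a distance estimated from 94 clips could move.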

minor comments (2)

[Abstract] Exact baseline configurations and per-baseline Wasserstein distances should be tabulated rather than described as 'approximately 6-7' and 'approximately 15'.
[§3] The tokenizer description (FSQ-R3GAN encoder, temporally inflated 3D decoder) would benefit from a diagram or explicit equations for the 2D-encode/3D-decode pipeline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below, indicating where revisions will be made to the manuscript.

Referee: [Abstract / §3] Abstract and §3 (model description): the central quantitative claim (Wasserstein-1 = 1.24) is reported with no architecture details, training procedure, loss functions, optimizer, or statistical tests, preventing assessment of whether the EOS decision is actually driven by visual token content rather than a learned marginal length distribution for the drinking action.

Authors: We agree that the abstract and §3 omit critical implementation details. In the revised manuscript we will expand §3 with the full 2D FSQ-R3GAN encoder architecture, 3D decoder configuration, causal Transformer hyperparameters, training procedure, loss functions, optimizer settings, and any statistical tests performed on the W1 distance. These additions will allow readers to evaluate whether EOS termination is driven by the token sequence.
revision: yes

Referee: [§4] §4 (evaluation): the result is shown only for the single A001 drinking action on 94 held-out clips; this is insufficient to support the broader claim of 'biological behavior semantic comprehension' across behaviors, as no cross-action or cross-dataset results are provided.

Authors: The evaluation is deliberately scoped to the A001 drinking action to isolate the duration-learning capability. We will revise the abstract, introduction, and conclusion to state the evaluation scope explicitly and moderate language regarding 'biological behavior semantic comprehension' to avoid implying cross-action generalization. No new cross-action or cross-dataset experiments will be added in this revision.
revision: partial

Referee: [Abstract] Abstract: no ablations (e.g., ablating visual tokens vs. frame count or action class conditioning) are reported to confirm that termination is content-dependent on the generated token sequence rather than implicit training signals or a per-action length prior.

Authors: We acknowledge that the absence of ablations leaves open the possibility that termination reflects a learned length prior rather than token content. The manuscript will be updated to include an explicit limitations paragraph discussing this point and the design rationale (first-frame conditioning only, autoregressive token prediction). Full ablations are not included in the current revision.
revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical match to external held-out distribution

The paper's central result is an empirical Wasserstein-1 distance of 1.24 frames between generated durations and the real duration distribution on 94 held-out NTU RGB+D A001 clips. This comparison uses an external benchmark independent of any internal fitted parameters or self-defined quantities. The model description (2D-encode/3D-decode tokenizer plus causal Transformer with EOS termination conditioned on the first frame) contains no self-definitional steps, no fitted-input-called-prediction, and no load-bearing self-citations. The duration-matching claim is tested rather than presupposed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract supplies insufficient technical detail to enumerate specific free parameters or invented entities; the core modeling assumption is treated as a domain assumption.

axioms (1)

domain assumption: Duration of biological behaviors is a semantic property that can be learned as a distribution over visual token sequences via next-token prediction with EOS termination.
This premise is required for the EOS mechanism to produce durations that match real data without external length specification.

Reference graph

Works this paper leans on

32 extracted references · 29 canonical work pages · 20 internal anchors

[1]

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

A. Blattmann et al., “Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models,” Dec. 28, 2023, arXiv: arXiv:2304.08818. doi: 10.48550/arXiv.2304.08818

work page doi:10.48550/arxiv.2304.08818 2023
[2]

Video Diffusion Models

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video Diffusion Models,” Jun. 22, 2022, arXiv: arXiv:2204.03458. doi: 10.48550/arXiv.2204.03458

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.03458 2022
[3]

Imagen Video: High Definition Video Generation with Diffusion Models

J. Ho et al., “Imagen Video: High Definition Video Generation with Diffusion Models,” Oct. 05, 2022, arXiv: arXiv:2210.02303. doi: 10.48550/arXiv.2210.02303

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.02303 2022
[4]

Scalable Diffusion Models with Transformers

W. Peebles and S. Xie, “Scalable Diffusion Models with Transformers,” Mar. 02, 2023, arXiv: arXiv:2212.09748. doi: 10.48550/arXiv.2212.09748

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.09748 2023
[5]

Make-A-Video: Text-to-Video Generation without Text-Video Data

U. Singer et al., “Make-A-Video: Text-to-Video Generation without Text-Video Data,” Sep. 29, 2022, arXiv: arXiv:2209.14792. doi: 10.48550/arXiv.2209.14792

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2209.14792 2022
[6]

Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer

S. Ge et al., “Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer,” in Computer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds., Cham: Springer Nature Switzerland, 2022, pp. 102–118. doi: 10.1007/978-3-031-19790-1_7

work page doi:10.1007/978-3-031-19790-1_7 2022
[7]

VideoPoet: A Large Language Model for Zero-Shot Video Generation

D. Kondratyuk et al., “VideoPoet: A Large Language Model for Zero-Shot Video Generation,” Jun. 04, 2024, arXiv: arXiv:2312.14125. doi: 10.48550/arXiv.2312.14125

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.14125 2024
[8]

HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator

Y. Seo, K. Lee, F. Liu, S. James, and P. Abbeel, “HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator,” in 2022 IEEE International Conference on Image Processing (ICIP), Oct. 2022, pp. 3943–3947. doi: 10.1109/ICIP46576.2022.9897982

work page doi:10.1109/icip46576.2022.9897982 2022
[9]

VideoGPT: Video Generation using VQ-VAE and Transformers

W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “VideoGPT: Video Generation using VQ-VAE and Transformers,” Sep. 14, 2021, arXiv: arXiv:2104.10157. doi: 10.48550/arXiv.2104.10157

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.10157 2021
[10]

MAGVIT: Masked Generative Video Transformer

L. Yu et al., “MAGVIT: Masked Generative Video Transformer,” Apr. 05, 2023, arXiv: arXiv:2212.05199. doi: 10.48550/arXiv.2212.05199

work page doi:10.48550/arxiv.2212.05199 2023
[11]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

R. Villegas et al., “Phenaki: Variable Length Video Generation From Open Domain Textual Description,” Oct. 05, 2022, arXiv: arXiv:2210.02399. doi: 10.48550/arXiv.2210.02399

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.02399 2022
[12]

TV2TV: A Unified Framework for Interleaved Language and Video Generation

X. Han et al., “TV2TV: A Unified Framework for Interleaved Language and Video Generation,” Dec. 08, 2025, arXiv: arXiv:2512.05103. doi: 10.48550/arXiv.2512.05103

work page doi:10.48550/arxiv.2512.05103 2025
[13]

NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis

A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019. Accessed: May 21, 2026. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/Shahroudy_NTU_RGBD_A_ CVPR_201...

2016
[14]

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen, “Latent Video Diffusion Models for High-Fidelity Long Video Generation,” Mar. 20, 2023, arXiv: arXiv:2211.13221. doi: 10.48550/arXiv.2211.13221

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.13221 2023
[15]

Taming Transformers for High-Resolution Image Synthesis

P. Esser, R. Rombach, and B. Ommer, “Taming Transformers for High-Resolution Image Synthesis,” presented at the CVPR, Computer Vision Foundation / IEEE, 2021, pp. 12873–12883. doi: 10.1109/CVPR46437.2021.01268

work page doi:10.1109/cvpr46437.2021.01268 2021
[16]

Neural Discrete Representation Learning

A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” May 30, 2018, arXiv: arXiv:1711.00937. doi: 10.48550/arXiv.1711.00937

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1711.00937 2018
[17]

Attention Is All You Need

A. Vaswani et al., “Attention Is All You Need,” Aug. 02, 2023, arXiv: arXiv:1706.03762. doi: 10.48550/arXiv.1706.03762

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023
[18]

Finite Scalar Quantization: VQ-VAE Made Simple

F. Mentzer, D. Minnen, E. Agustsson, and M. Tschannen, “Finite Scalar Quantization: VQ-VAE Made Simple,” presented at The Twelfth International Conference on Learning Representations, Oct. 2023. Accessed: Apr. 28, 2026. [Online]. Available: https://openreview.net/forum?id=8ishA3LxN8

2023
[19]

The GAN is dead; long live the GAN! A Modern GAN Baseline

Y. Huang, A. Gokaslan, V. Kuleshov, and J. Tompkin, “The GAN is dead; long live the GAN! A Modern GAN Baseline,” Adv. Neural Inf. Process. Syst., vol. 37, pp. 44177–44215, Dec. 2024, doi: 10.52202/079017-1402

work page doi:10.52202/079017-1402 2024
[20]

The relativistic discriminator: a key element missing from standard GAN

A. Jolicoeur-Martineau, “The relativistic discriminator: a key element missing from standard GAN,” Sep. 10, 2018, arXiv: arXiv:1807.00734. doi: 10.48550/arXiv.1807.00734

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1807.00734 2018
[21]

Which Training Methods for GANs do actually Converge?

L. Mescheder, A. Geiger, and S. Nowozin, “Which Training Methods for GANs do actually Converge?,” in Proceedings of the 35th International Conference on Machine Learning, PMLR, Jul. 2018, pp. 3481–3490. Accessed: Apr. 28, 2026. [Online]. Available: https://proceedings.mlr.press/v80/mescheder18a.html

2018
[22]

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” Apr. 10, 2018, arXiv: arXiv:1801.03924. doi: 10.48550/arXiv.1801.03924

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1801.03924 2018
[23]

RoFormer: Enhanced Transformer with Rotary Position Embedding

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “RoFormer: Enhanced Transformer with Rotary Position Embedding,” Nov. 08, 2023, arXiv: arXiv:2104.09864. doi: 10.48550/arXiv.2104.09864

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.09864 2023
[24]

You Only Look Once: Unified, Real-Time Object Detection

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” May 09, 2016, arXiv: arXiv:1506.02640. doi: 10.48550/arXiv.1506.02640

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1506.02640 2016
[25]

Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks,” Sep. 23, 2015, arXiv: arXiv:1506.03099. doi: 10.48550/arXiv.1506.03099

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1506.03099 2015
[26]

Classifier-Free Diffusion Guidance

J. Ho and T. Salimans, “Classifier-Free Diffusion Guidance,” Jul. 26, 2022, arXiv: arXiv:2207.12598. doi: 10.48550/arXiv.2207.12598

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2207.12598 2022
[27]

Hierarchical Neural Story Generation

A. Fan, M. Lewis, and Y. Dauphin, “Hierarchical Neural Story Generation,” May 13, 2018, arXiv: arXiv:1805.04833. doi: 10.48550/arXiv.1805.04833

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1805.04833 2018
[28]

MaskGIT: Masked Generative Image Transformer

H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “MaskGIT: Masked Generative Image Transformer,” Feb. 08, 2022, arXiv: arXiv:2202.04200. doi: 10.48550/arXiv.2202.04200

work page doi:10.48550/arxiv.2202.04200 2022
[29]

LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models,” Oct. 16, 2021, arXiv: arXiv:2106.09685. doi: 10.48550/arXiv.2106.09685

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2106.09685 2021
[30]

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” Jan. 12, 2018, arXiv: arXiv:1706.08500. doi: 10.48550/arXiv.1706.08500

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.08500 2018
[31]

Towards Accurate Generative Models of Video: A New Metric & Challenges

T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly, “Towards Accurate Generative Models of Video: A New Metric & Challenges,” Mar. 27, 2019, arXiv: arXiv:1812.01717. doi: 10.48550/arXiv.1812.01717

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1812.01717 2019
[32]

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Z. Tong, Y. Song, J. Wang, and L. Wang, “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training,” Oct. 18, 2022, arXiv: arXiv:2203.12602. doi: 10.48550/arXiv.2203.12602

work page doi:10.48550/arxiv.2203.12602 2022
