TinyHistory: Lightweight Video History Embeddings via Two-Stage Context Learning

Anyi Rao; Beijia Lu; Chong Zeng; Gordon Wetzstein; Lvmin Zhang; Maneesh Agrawala; Muyang Li; Shengqu Cai; Song Han

arxiv: 2512.23851 · v6 · pith:OSWUDAFYnew · submitted 2025-12-29 · 💻 cs.CV




classification 💻 cs.CV

keywords history, video, consistency, context, lightweight, autoregressive, content, embeddings


History context is central to autoregressive video generation, driving consistency and storytelling in both commercial models and personal use cases. Personal users, offline workflows, and individual-scale finetuning, for example, need to encode longer video histories under tight compute and memory budgets. We observe that content and identity consistency is an essential requirement, and that complete, uninterrupted history coverage, together with content query and interpretation capabilities, is broadly desired. We present TinyHistory, a lightweight history embedding learned through two-stage context learning. In the first stage, we pretrain the encoder on large-scale video data with a randomized frame query objective; in the second stage, we repurpose the pretrained encoder within an autoregressive video diffusion model to learn content-level consistency. We show that the learned lightweight embeddings achieve consistency comparable to heavier alternatives (as measured by VLM, VBench, Elo, etc.), while reducing training overhead and extending the encodable history length within a given memory budget. We conduct ablation studies to analyze the influence and trade-offs of each component.
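The claim about extending the encodable history length within a given memory budget comes down to simple arithmetic, sketched below. This is an illustration only, not the paper's method or measurements: the function name `max_history_frames`, the budget, the per-token size, and the tokens-per-frame figures are all hypothetical values chosen to make the numbers easy to follow.

```python
def max_history_frames(budget_bytes: int, tokens_per_frame: int, bytes_per_token: int) -> int:
    """Return how many whole history frames fit in a context memory budget.

    Hypothetical helper: assumes each history frame costs a fixed number of
    context tokens and each token costs a fixed number of bytes.
    """
    if budget_bytes < 0:
        raise ValueError("budget_bytes must be non-negative")
    if tokens_per_frame <= 0 or bytes_per_token <= 0:
        raise ValueError("tokens_per_frame and bytes_per_token must be positive")
    return budget_bytes // (tokens_per_frame * bytes_per_token)


# Assumed numbers, not taken from the paper:
BUDGET_BYTES = 2**30        # 1 GiB reserved for history context
BYTES_PER_TOKEN = 2048      # e.g. a 1024-dimensional token stored in 16-bit floats

# A "heavy" history encoding that spends 1024 tokens per frame...
heavy_frames = max_history_frames(BUDGET_BYTES, 1024, BYTES_PER_TOKEN)
# ...versus a "lightweight" embedding that spends 64 tokens per frame.
light_frames = max_history_frames(BUDGET_BYTES, 64, BYTES_PER_TOKEN)

if __name__ == "__main__":
    print(f"heavy encoding:       {heavy_frames} frames")
    print(f"lightweight encoding: {light_frames} frames")
    print(f"history length ratio: {light_frames // heavy_frames}x")
```

Under these assumed numbers, cutting the per-frame token cost by 16x lengthens the encodable history by the same factor. The cost model is deliberately linear; attention cost, which grows faster than linearly with context length, is ignored here.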


Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation
cs.CV 2026-06 unverdicted novelty 7.0

LongLive-RAG formulates long video generation as retrieval-augmented generation by treating self-generated latents as a dynamic searchable history and adding a Window Temporal Delta Loss for better retrieval.
ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space
cs.LG 2026-04 unverdicted novelty 7.0

ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
Efficient Video Diffusion Models: Advancements and Challenges
cs.CV 2026-04 unverdicted novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressi...
Compression and Retrieval: Implicit Memory Retrieval for Video World Models
cs.CV 2026-06 unverdicted novelty 6.0

CaR uses attention with viewpoint positional encoding and context compression for flexible memory retrieval in video world models, backed by a new SceneFly dataset, and reports SOTA results with open-domain generalization.
InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars
cs.CV 2026-06 unverdicted novelty 6.0

InteractiveAvatar uses autoregressive distillation, Long-Short Visual Memory, and a Reasoning-Reaction Module to enable real-time, consistent, intent-aware avatar video streaming.
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
cs.CV 2026-06 unverdicted novelty 6.0

PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation
cs.MM 2026-06 unverdicted novelty 6.0

Echo-Infinity replaces handcrafted KV-cache schedules with end-to-end optimized Memory Queries and a Unified Relative RoPE recipe to support real-time infinite video generation in diffusion transformers.
OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

OmniMem enables scalable long video generation via adaptive sparse KV retrieval that addresses local bias and union explosion while preserving explicit historical access.
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

DySink uses adaptive retrieval of relevant historical frames plus a sink anomaly gate to improve dynamic degree and temporal quality in minute-long autoregressive video generation.
FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching
cs.CV 2026-05 unverdicted novelty 6.0

FlowLong generates videos several times longer than native model windows by blending adjacent predictions with Tweedie matching to enforce manifold and temporal consistency while using stochastic noise injection early...
Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory
cs.CV 2026-05 unverdicted novelty 6.0

IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting prompts.
AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
cs.CV 2026-06 unverdicted novelty 5.0

AnchorWorld proposes a simulation framework that adds exogenous viewpoint supervision for full-body grounding and anchor-view text customization for dynamic world evolution in egocentric settings.