pith. machine review for the scientific record.

arxiv: 1807.03748 · v2 · submitted 2018-07-10 · cs.LG · stat.ML

Recognition: unknown

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Oriol Vinyals, Yazhe Li

classification: cs.LG, stat.ML
keywords: learning, representations, contrastive, useful, approach, coding, future, latent
original abstract

While supervised learning has enabled great progress in many applications, unsupervised learning has not seen such widespread adoption, and remains an important and challenging endeavor for artificial intelligence. In this work, we propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding. The key insight of our model is to learn such representations by predicting the future in latent space by using powerful autoregressive models. We use a probabilistic contrastive loss which induces the latent space to capture information that is maximally useful to predict future samples. It also makes the model tractable by using negative sampling. While most prior work has focused on evaluating representations for a particular modality, we demonstrate that our approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

    cs.LG 2026-04 unverdicted novelty 8.0

    CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving ...

  2. FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

    cs.CV 2026-04 unverdicted novelty 8.0

    FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.

  3. CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    cs.CV 2021-04 conditional novelty 8.0

    CLIPScore uses a web-pretrained CLIP model to evaluate image captions without references and achieves higher human correlation than CIDEr or SPICE.

  4. Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

    cs.LG 2026-05 unverdicted novelty 7.0

    CPPO is an on-policy contrastive RL method that derives advantages from contrastive Q-values for PPO optimization, outperforming prior CRL baselines in 14/18 tasks and matching or exceeding reward-based PPO in 12/18 tasks.

  5. Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

    cs.LG 2026-05 conditional novelty 7.0

    DyGFM introduces decoupled pre-training and divergence-conditioned prompts to create the first multi-domain dynamic graph foundation model that outperforms baselines on node classification and link prediction.

  6. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 conditional novelty 7.0

    ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...

  7. AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects

    cs.CV 2026-05 unverdicted novelty 7.0

    AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.

  8. Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

    eess.SP 2026-05 unverdicted novelty 7.0

    Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

  9. Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.

  10. Modulation Consistency-based Contrastive Learning for Self-Supervised Automatic Modulation Classification

    eess.SP 2026-05 unverdicted novelty 7.0

    Mod-CL uses intra-instance modulation consistency to form positive pairs from temporal signal segments in a tailored contrastive objective, outperforming baselines on RadioML datasets especially in low-label regimes.

  11. Martingale-Consistent Self-Supervised Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper develops a martingale-consistent SSL framework enforcing expected coherence between coarse and refined predictions via new objectives and a Monte Carlo estimator, improving robustness under partial observations.

  12. Pyramid Self-contrastive Learning Framework for Test-time Ultrasound Image Denoising

    cs.CV 2026-05 conditional novelty 7.0

    A2A achieves one-shot ultrasound denoising via pyramid self-contrastive learning on sub-aperture signals to disentangle anatomy from noise, yielding large SNR and CNR gains in simulations and in vivo scans.

  13. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...

  14. Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

  15. Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    LoRA adapters fix collapsed visual CLS token attention in CLIP for superior cross-domain few-shot learning, and the new Semantic Probe framework revives prompt methods to reach state-of-the-art on four benchmarks.

  16. Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

    cs.CV 2026-05 unverdicted novelty 7.0

    SkyPart uses learnable prototypes for patch grouping, altitude modulation only in training, graph-attention readout, and Kendall-weighted loss to set new state-of-the-art single-pass performance on SUES-200, Universit...

  17. Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

    cs.LG 2026-05 unverdicted novelty 7.0

    In generalized contrastive learning with imbalanced classes, optimal representations collapse to class means whose angular geometry is determined by class proportions via convex optimization, and extreme imbalance cau...

  18. Discrete Langevin-Inspired Posterior Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.

  19. Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.

  20. TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial ...

  21. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  22. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  23. TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

    cs.CV 2026-05 conditional novelty 7.0

    CA-DSSL enables effective self-supervised pretraining for 396K-parameter MCU backbones, reaching 62.7% linear-probe accuracy on CIFAR-100 and 94% of supervised performance while fitting in 378 KB INT8.

  24. DataDignity: Training Data Attribution for Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.

  25. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...

  26. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR achieves up to 70% relative accuracy gain in stem retrieval with under half the parameters and 7x faster training by using phasor-based equivariant representations, setting new SOTA on multiple datasets.

  27. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR introduces a phasor-based contrastive framework with learned spectral pooling and complex heads that enforces pitch-equivariant and phase-equivariant biases, delivering up to 70% relative accuracy gains in stem...

  28. PHALAR: Phasors for Learned Musical Audio Representations

    cs.SD 2026-05 unverdicted novelty 7.0

    PHALAR introduces a contrastive audio representation framework with spectral pooling and complex-valued processing that sets new state-of-the-art results in stem retrieval on MoisesDB, Slakh, and ChocoChorales while a...

  29. Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

    eess.AS 2026-05 unverdicted novelty 7.0

    Factor-partitioned embeddings map speech to a vector with attribute-specific subspaces, supporting signed weighted sums of per-axis cosines for conditioned retrieval that can suppress biases like same-speaker matches.

  30. SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

    cs.CV 2026-05 unverdicted novelty 7.0

    SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection...

  31. QuIVer: Rethinking ANN Graph Topology via Training-Free Binary Quantization

    cs.DB 2026-05 unverdicted novelty 7.0

    QuIVer constructs ANN graph indices entirely inside a 2-bit quantized metric space, delivering high recall and throughput on embedding datasets while using far less memory than standard HNSW implementations.

  32. Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges

    cs.LG 2026-05 unverdicted novelty 7.0

    Structured diffusion bridges with alignment constraints achieve near fully-paired quality in modality translation while working effectively in unpaired and semi-paired regimes.

  33. Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning

    cs.MA 2026-05 unverdicted novelty 7.0

    A quality-aware exploration method using return-conditioned sigmoid scheduling and per-agent RSQ metrics achieves top-tier returns on seven cooperative MARL benchmarks.

  34. MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.

  35. SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

    cs.CV 2026-05 conditional novelty 7.0

    SplAttN replaces hard projection with Gaussian soft splatting to avoid cross-modal entropy collapse, achieving SOTA point cloud completion on PCN and ShapeNet while maintaining visual cue dependency on KITTI.

  36. Multimodal Data Curation Through Ranked Retrieval

    cs.IR 2026-05 unverdicted novelty 7.0

    Symmetric Nucleus Subsampling and Expert Embedding Engine reduce modality gaps in multimodal embeddings by over 90% and outperform baselines in data curation for downstream models.

  37. Query-Efficient Quantum Approximate Optimization via Graph-Conditioned Trust Regions

    cs.LG 2026-04 unverdicted novelty 7.0

    A GNN predicts Gaussians over QAOA parameters to create graph-conditioned trust regions that reduce circuit evaluations for MaxCut from 85-343 down to 45 while keeping approximation ratios within 3 points of heuristics.

  38. Beyond Static Collision Handling: Adaptive Semantic ID Learning for Multimodal Recommendation at Industrial Scale

    cs.IR 2026-04 unverdicted novelty 7.0

    AdaSID adaptively regulates semantic ID overlaps in multimodal recommendations to improve retrieval performance, codebook utilization, and downstream metrics like GMV.

  39. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  40. Amortized Vine Copulas for High-Dimensional Density and Information Estimation

    cs.LG 2026-04 unverdicted novelty 7.0

    VDC amortizes vine copula construction by reusing a single trained denoising model across edges plus IPFP projection, yielding competitive density and mutual information estimates with faster high-dimensional fitting.

  41. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  42. Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors

    cs.LG 2026-04 unverdicted novelty 7.0

    NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.

  43. Understanding Human Actions through the Lens of Executable Models

    cs.AI 2026-04 unverdicted novelty 7.0

    EXACT is a new DSL for human motions as executable reward-generating programs, enabling compositional neuro-symbolic models that improve data efficiency and capture intuitive action relationships over monolithic approaches.

  44. Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

    cs.IR 2026-04 unverdicted novelty 7.0

    Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...

  45. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  46. HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads

    cs.IR 2026-04 unverdicted novelty 7.0

    HeadRank improves decoding-free passage reranking by preference-aligning attention heads to increase discriminability in middle-context documents, outperforming baselines on 14 benchmarks with only 211 training queries.

  47. CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

    cs.SE 2026-04 unverdicted novelty 7.0

    CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.

  48. DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

    cs.CV 2026-04 unverdicted novelty 7.0

    DETR-ViP boosts visual-prompted detection performance by learning globally discriminative prompts through integration and distillation on top of image-text contrastive learning, with a selective fusion step for stability.

  49. A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 7.0

    A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

  50. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  51. Sparse Contrastive Learning for Content-Based Cold Item Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SEMCo uses sparse entmax contrastive learning for purely content-based cold-start item recommendation, outperforming standard methods in ranking accuracy.

  52. UNIGEOCLIP: Unified Geospatial Contrastive Learning

    cs.CV 2026-04 unverdicted novelty 7.0

    UNIGEOCLIP creates a unified embedding for aerial imagery, street views, elevation, text, and coordinates via all-to-all contrastive alignment plus a scaled lat-long encoder, outperforming single-modality and coordina...

  53. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  54. EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

    cs.AI 2026-04 unverdicted novelty 7.0

    EmergentBridge improves zero-shot cross-modal transfer for unpaired modality pairs by learning noisy bridge anchors and enforcing proxy alignment only in the orthogonal subspace to preserve existing anchor alignments.

  55. One-Step Score-Based Density Ratio Estimation

    stat.ML 2026-04 unverdicted novelty 7.0

    OS-DRE performs score-based density ratio estimation in one step by approximating the temporal score component with a closed-form RBF frame and providing error bounds from approximation theory.

  56. A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics

    cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

    A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.

  57. Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.

  58. Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation

    cs.IR 2026-04 accept novelty 7.0

    Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.

  59. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  60. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    cs.CV 2023-03 conditional novelty 7.0

    BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumo...