//arxiv.org/abs/1703.00810

Shwartz-Ziv R, Tishby N · 2017 · cs.LG · arXiv 1703.00810

14 Pith papers cite this work. Polarity classification is still indexing.

abstract

Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. Our main results are: (i) Most of the training epochs in standard DL are spent on compression of the input to an efficient representation and not on fitting the training labels. (ii) The representation compression phase begins when the training error becomes small and the Stochastic Gradient Descent (SGD) epochs change from a fast drift to smaller training error into a stochastic relaxation, or random diffusion, constrained by the training error value. (iii) The converged layers lie on or very close to the Information Bottleneck (IB) theoretical bound, and the maps from the input to any hidden layer and from this hidden layer to the output satisfy the IB self-consistent equations. This generalization-through-noise mechanism is unique to Deep Neural Networks and absent in one-layer networks. (iv) The training time is dramatically reduced when adding more hidden layers. Thus the main advantage of the hidden layers is computational. This can be explained by the reduced relaxation time, as it scales super-linearly (exponentially for simple diffusion) with the information compression from the previous layer.
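
To make the Information Plane described in the abstract concrete, the following is a minimal sketch of how one point (I(X;T), I(T;Y)) could be estimated for a single layer T. It assumes a plug-in estimator over discretized activations with equal-width bins; the function names (`discrete_mutual_information`, `bin_activations`, `information_plane_point`), the bin count, and the activation range are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def discrete_mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits from paired discrete labels."""
    # Re-index both label sequences to 0..k-1 so they can address a table.
    _, a_idx = np.unique(np.asarray(a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(b), return_inverse=True)
    a_idx = a_idx.reshape(-1)
    b_idx = b_idx.reshape(-1)

    # Empirical joint distribution of the label pairs.
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)
    joint /= joint.sum()

    # Marginals and the product distribution p(a) p(b).
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    product = pa @ pb

    nz = joint > 0  # skip empty cells, where the term is defined as 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / product[nz])))


def bin_activations(t, n_bins=30, low=-1.0, high=1.0):
    """Turn each activation vector into one discrete symbol (assumed binning)."""
    t = np.asarray(t, dtype=float)
    edges = np.linspace(low, high, n_bins + 1)
    # Interior edges give bin indices in 0..n_bins-1 for every unit.
    digitized = np.digitize(t, edges[1:-1])
    # Each distinct row of bin indices becomes one symbol of T.
    _, ids = np.unique(digitized, axis=0, return_inverse=True)
    return ids.reshape(-1)


def information_plane_point(x_ids, t, y, n_bins=30):
    """Return (I(X;T), I(T;Y)) in bits for one layer's activations t."""
    t_ids = bin_activations(t, n_bins=n_bins)
    return (
        discrete_mutual_information(x_ids, t_ids),
        discrete_mutual_information(t_ids, y),
    )


if __name__ == "__main__":
    # Toy data: four distinct inputs, a two-unit layer, binary labels.
    x_ids = np.array([0, 1, 2, 3])
    t = np.array([[-0.9, -0.9], [-0.9, 0.9], [0.9, -0.9], [0.9, 0.9]])
    y = np.array([0, 0, 1, 1])
    print(information_plane_point(x_ids, t, y, n_bins=2))
```

Tracking this pair for every layer across training epochs would trace the trajectories that the abstract refers to as drift and compression phases. Plug-in estimates of this kind depend strongly on the bin width and sample size, so the numbers should be read as illustrative only.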

representative citing papers

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Channel importance splits into task relevance and local replaceability; local-axis metrics predict safe removal under pruning better than target-axis metrics across multiple CNNs and datasets.

Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining

cs.CL · 2026-04-19 · unverdicted · novelty 7.0

Multilingual pretraining develops translation in two phases: early copying driven by surface similarities, followed by generalizing mechanisms while copying is refined.

Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

All rank-monotone pruning scorers converge to identical accuracy at fixed sparsity, but non-monotone features with sparsity-dependent complexity can escape this plateau, as shown by the SICS hypothesis on ViT-Small/CIFAR-10.

How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.

Why Self-Supervised Encoders Want to Be Normal

cs.IT · 2026-04-30 · unverdicted · novelty 6.0

Self-supervised encoders prefer isotropic Gaussian latent states because the Information Bottleneck, recast as rate-distortion over the predictive manifold, makes these states optimal for target-neutral representations.

Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

Introduces integration, metastability, and dynamical stability index measures from layer activations and reports patterns distinguishing CIFAR-10 from CIFAR-100 difficulty plus early convergence signals across ResNet variants, DenseNet, MobileNetV2, VGG-16, and a Vision Transformer.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Information Plane Analysis of Binary Neural Networks

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

Binary neural networks exhibit frequent late-stage compression in the information plane, but compressed representations do not reliably correlate with better generalization performance.

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

cs.RO · 2026-04-04 · accept · novelty 5.0

A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

Personalization as a Game: Equilibrium-Guided Generative Modeling for Physician Behavior in Pharmaceutical Engagement

cs.GT · 2026-04-08 · unverdicted · novelty 3.0

EGPF treats physician engagement as an incomplete-information Bayesian game, infers behavioral types via functors, and uses equilibrium strategies to direct generative AI, reporting 34% better AUC and 28% higher relevance than baselines.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

citing papers explorer

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning cs.LG · 2026-05-13 · unverdicted · none · ref 33 · internal anchor
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence? cs.AI · 2026-05-10 · unverdicted · none · ref 22
Task Relevance Is Not Local Replaceability: A Two-Axis View of Channel Information cs.CV · 2026-05-08 · unverdicted · none · ref 10
Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining cs.CL · 2026-04-19 · unverdicted · none · ref 30
Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features cs.LG · 2026-05-10 · unverdicted · none · ref 23
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework cs.CL · 2026-04-30 · unverdicted · none · ref 3
Why Self-Supervised Encoders Want to Be Normal cs.IT · 2026-04-30 · unverdicted · none · ref 38
Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach cs.CV · 2026-04-08 · unverdicted · none · ref 9
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 252
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 175
Information Plane Analysis of Binary Neural Networks cs.LG · 2026-05-05 · unverdicted · none · ref 1
From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data cs.RO · 2026-04-04 · accept · none · ref 86
Personalization as a Game: Equilibrium-Guided Generative Modeling for Physician Behavior in Pharmaceutical Engagement cs.GT · 2026-04-08 · unverdicted · none · ref 17
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 281