Large Scale Distributed Neural Network Training through Online Distillation
11 Pith papers cite this work. Polarity classification is still indexing.
roles: background 1
polarities: background 1
citing papers explorer
- Emerging Properties in Self-Supervised Vision Transformers
  Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
- Function-Space ADMM for Decentralized Federated Learning: A Control Theoretic Perspective
  FedF-ADMM uses function-space ADMM updates projected via knowledge distillation plus a PI-like stabilization term to deliver faster, more stable convergence and higher accuracy than prior decentralized FL methods under severe non-IID conditions.
- Enabling Federated Inference via Unsupervised Consensus Embedding
  CE-FI maps heterogeneous model representations to a shared embedding space via unsupervised training on unlabeled data, enabling privacy-preserving federated inference that outperforms solo models on image classification benchmarks.
- LatentBurst: A Fast and Efficient Multi Frame Super-Resolution for Hexadeca-Bayer Pattern CIS images
  LatentBurst is a new multi-frame super-resolution network for hexadeca-Bayer CIS images that uses pyramid latent alignment, an efficient UNet, and two-step knowledge distillation to handle motion and run on mobile devices.
- HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
  HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
- The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models
  Populations of 1-4B parameter LLMs using peer verification and shared cultural memory achieve 8.8-18.9 point gains on mathematical reasoning tasks and close much of the gap to 70B+ single models.
- Vision Transformers Need Registers
  Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
- Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search
  FedKDNAS combines client-side neural architecture search with knowledge distillation from aggregated server predictions to improve accuracy and efficiency in heterogeneous federated learning.
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
  Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
- Gemma 3 Technical Report
  Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
- Gemma 2: Improving Open Language Models at a Practical Size
  Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.