Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding · 2021 · cs.LG · arXiv 2107.14795

Mixed citation behavior. The most common role is background (60% of classified citations).

41 Pith papers cite it.

abstract

A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
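The linear-scaling claim in the abstract can be illustrated with a minimal sketch. It assumes the encode, process, decode layout of Perceiver IO: a small latent array cross-attends to the inputs, self-attends a few times, and is then read out by an array of output queries. The function names (`cross_attend`, `perceiver_io_sketch`) and the sizes are hypothetical, and the attention here is single-head with no learned projections, MLPs, or normalization, so this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head attention without learned projections (illustrative only)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (num_queries, num_kv)
    return softmax(scores, axis=-1) @ keys_values   # (num_queries, d)

def perceiver_io_sketch(inputs, latents, output_queries, num_latent_layers=2):
    # Encode: N latents attend to M inputs, so cost grows as N * M.
    z = cross_attend(latents, inputs)
    # Process: attention among latents only, cost N^2, independent of M and O.
    for _ in range(num_latent_layers):
        z = cross_attend(z, z)
    # Decode: O output queries attend to N latents, so cost grows as O * N.
    return cross_attend(output_queries, z)

rng = np.random.default_rng(0)
M, N, O, D = 1000, 16, 50, 8                 # hypothetical sizes
inputs = rng.normal(size=(M, D))
latents = rng.normal(size=(N, D))
output_queries = rng.normal(size=(O, D))

outputs = perceiver_io_sketch(inputs, latents, output_queries)
print(outputs.shape)  # (50, 8)
```

Because the output shape follows the query array rather than the input array, changing the number of rows in `output_queries` changes the output size without touching the rest of the computation, which is the role the abstract assigns to the querying mechanism.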

citation-role summary

background: 3 · method: 2

citation-polarity summary

background: 3 · use method: 2

representative citing papers

ENSEMBITS: an alphabet of protein conformational ensembles

cs.LG · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

Neural Signals Generate Clinical Notes in the Wild

cs.LG · 2026-01-29 · unverdicted · novelty 8.0

CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.

Atomistic Language Models Understand and Generate Materials

cs.LG · 2026-06-19 · unverdicted · novelty 7.0

ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.

Diff-CA: Separating Common and Salient Factors with Diffusion Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

A diffusion-based contrastive analysis method that decomposes conditioning into common and salient factors with weak supervision and proves identifiability of the additive model.

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

GAMSI is a dual-pathway Geometry-Aware MLLM using Metric-Structure Decoupled Queries and Expert-Guided Visual Grounding on RGB inputs alone, trained on a new 152k-sample MTS dataset to reach SOTA on seven spatial benchmarks.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

cs.CL · 2026-05-10 · conditional · novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

cs.GR · 2026-05-09 · unverdicted · novelty 7.0

MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topology, and region limits.

A foundation model of vision, audition, and language for in-silico neuroscience

q-bio.NC · 2026-05-05 · unverdicted · novelty 7.0

TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

A Self-Supervised Framework for Space Object Behaviour Characterisation

cs.LG · 2025-04-08 · unverdicted · novelty 7.0

Self-supervised Perceiver-VAE pre-trained on 227,000 light curves from MMT-9 and fine-tuned on simulators achieves 85% accuracy and 0.92-0.95 ROC AUC in anomaly detection and motion mode prediction for space objects.

RoboDreamer: Learning Compositional World Models for Robot Imagination

cs.RO · 2024-04-18 · unverdicted · novelty 7.0

RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

High-Resolution Image Synthesis with Latent Diffusion Models

cs.CV · 2021-12-20 · conditional · novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

cs.AI · 2026-06-17 · unverdicted · novelty 6.0 · 2 refs

ITNet frames convolution, attention, and recurrence as special cases of one learnable integral transform with an MLP kernel and shows a single shared operator plus modality encoders matches specialized models on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2.

Streamlining Analysis and Design of Two-Dimensional Electronic Spectroscopy using Machine Learning

physics.chem-ph · 2026-06-17 · unverdicted · novelty 6.0

A Gaussian mixture model is used to learn spectral densities from 2DES experiments, enabling extraction of vibronic couplings, spectral extrapolation, and optimized experiment selection across simulated and experimental systems.

Revisiting Neural Processes via Fourier Transform and Volterra Series

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Introduces SFConvCNPs and SFVConvCNPs using set Fourier convolutions and Volterra expansions for translation-equivariant neural processes on irregular data with global receptive fields and linear scaling.

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

A Perceiver IO fusion architecture combines satellite and street-level imagery via DINOv2 tokens and RGB-M masking to classify roof attributes on a new dataset of 32,135 buildings across ten countries.

Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation

cs.IR · 2026-05-17 · unverdicted · novelty 6.0

TGQ-Former uses metadata-guided hybrid queries and dual-gated modulation to improve visual token selection in multimodal e-commerce retrieval, raising average Hit Rate@100 by 6.04% over baselines.

TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

HyperODE RCA integrates hypergraph learning with latent ODEs and cross-modal attention to improve root cause localization in microservice architectures on the Tianchi AIOps benchmark.

MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KITTI-2012.

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

q-bio.NC · 2026-04-20 · unverdicted · novelty 6.0

OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

citing papers explorer

Showing 12 of 12 citing papers after filters.

ENSEMBITS: an alphabet of protein conformational ensembles

cs.LG · 2026-05-13 · unverdicted · none · ref 5 · 2 links · internal anchor

Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

Neural Signals Generate Clinical Notes in the Wild

cs.LG · 2026-01-29 · unverdicted · none · ref 3 · internal anchor

CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.

Atomistic Language Models Understand and Generate Materials

cs.LG · 2026-06-19 · unverdicted · none · ref 66 · internal anchor

ALMs unify pretrained atomistic encoder, LLM, and denoising diffusion via continuous projectors and staged training to reach SOTA on text-conditioned crystal prediction and de novo generation.

A Self-Supervised Framework for Space Object Behaviour Characterisation

cs.LG · 2025-04-08 · unverdicted · none · ref 17 · internal anchor

Self-supervised Perceiver-VAE pre-trained on 227,000 light curves from MMT-9 and fine-tuned on simulators achieves 85% accuracy and 0.92-0.95 ROC AUC in anomaly detection and motion mode prediction for space objects.

Revisiting Neural Processes via Fourier Transform and Volterra Series

cs.LG · 2026-05-31 · unverdicted · none · ref 148 · internal anchor

Introduces SFConvCNPs and SFVConvCNPs using set Fourier convolutions and Volterra expansions for translation-equivariant neural processes on irregular data with global receptive fields and linear scaling.

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

cs.LG · 2026-05-29 · unverdicted · none · ref 33 · internal anchor

InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

cs.LG · 2026-05-04 · unverdicted · none · ref 263 · internal anchor

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices

cs.LG · 2026-05-01 · unverdicted · none · ref 9 · internal anchor

HyperODE RCA integrates hypergraph learning with latent ODEs and cross-modal attention to improve root cause localization in microservice architectures on the Tianchi AIOps benchmark.

LEIA: Learned Environment for Interactive Architected Materials

cs.LG · 2026-05-27 · unverdicted · none · ref 26 · internal anchor

LEIA is a world model for autoregressive 3D simulation of architected materials under interactive loading, benchmarked on MicroPlate and applied to surrogate-guided de novo design search with finite-element validation.

Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates

cs.LG · 2026-05-12 · unverdicted · none · ref 7 · internal anchor

Explicit E(3)-equivariance in neural CFD surrogates improves generalization on diverse-geometry hemodynamics benchmarks but degrades in-distribution performance on strongly aligned aerodynamics data, consistently beating data augmentation.

PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

cs.LG · 2026-04-15 · unverdicted · none · ref 6 · 2 links · internal anchor

PRiMeFlow applies flow matching in gene expression space with a U-Net velocity field and pretraining-finetuning to model perturbation-induced heterogeneity, showing strong benchmark performance on PerturBench and the ARC Virtual Cell Challenge.

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

cs.LG · 2026-05-31 · unverdicted · none · ref 7 · internal anchor

CART is a recurrent transformer with shared core, frozen prelude KV tensors, and LTI stability gate that fails to beat dense baselines at parameter parity across tested widths.
