super hub Mixed citations
Jumper, author R
Mixed citation behavior. Most common role is background (64%).
citing papers explorer
- ENSEMBITS: an alphabet of protein conformational ensembles
  Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
- Latent Process Generator Matching
  Presents a general framework for generator matching on projected image spaces from latent Markov processes, generalizing static latent results to dynamic conditional processes.
- Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers
  Derives a conditional-marginal entropy-rate objective for bridge-aware discretization that yields U-shaped schedules and improves low-NFE sample quality on 2D, CIFAR-10, and protein tasks.
- ProteinJEPA: Latent prediction complements protein language models
  Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
- TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations
  TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
- Rates of forgetting for the sequentially Markov coalescent
  SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
- Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
  Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
- Stochastic Thermodynamics of Associative Memory
  DenseAMs show tradeoffs between entropy production, retrieval accuracy, and speed at intermediate loads, with a new failure mode in higher-order networks at finite temperature.
- Accelerating Inference for Multilayer Neural Networks with Quantum Computers
  Quantum circuits for coherent multilayer neural network inference achieve quadratic to polylogarithmic speedups over classical methods depending on quantum data access models for inputs and weights.
- AlphaEvolve: A coding agent for scientific and algorithmic discovery
  AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
- Towards Understanding Self-Pretraining for Sequence Classification
  Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
- CrystalBoltz: End-to-End Protein Structure Determination via Experiment-Guided Diffusion for X-Ray Crystallography
  CrystalBoltz performs experiment-guided posterior sampling with diffusion models on structure-factor amplitudes for protein structure determination, reporting lower RMSD and R-factors than baselines with 33x faster runtime.
- ShardTensor: Domain Parallelism for Scientific Machine Learning
  ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- Supercharging Bayesian Inference with Reliable AI-Informed Priors
  Rectified AI priors, obtained by correcting AI-induced data laws before embedding them in techniques like Dirichlet process priors, reduce bias, improve credible interval coverage, and boost performance in tasks like skin disease classification.
- A physics-informed neural network approach to solve the spatially inhomogeneous electron Boltzmann equation
  A specialized PINN architecture solves the spatially inhomogeneous electron Boltzmann equation with high accuracy across gases and electric field strengths without case-specific tuning.
- Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
  Flashlight is a compiler-native PyTorch framework that generates efficient fused kernels for arbitrary and data-dependent attention variants, supporting more cases than FlexAttention with competitive performance.
- Fast and Interpretable Protein Substructure Alignment via Optimal Transport
  PLASMA applies regularized optimal transport with Sinkhorn iterations to produce fast, interpretable residue-level alignments and similarity scores between protein structures.
- Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization
  Diversity-regularized DPO fine-tuning of ProteinMPNN improves structural similarity scores by at least 8% over base model and sequence diversity by up to 20% over standard DPO for peptide inverse folding on OpenFold structures.
- Enabling Structure-Only Initialization and Out-of-Distribution Generalization in GNN-based Molecular Dynamics Simulators
  GNN-based MD simulators achieve stable structure-only initialization and reliable OOD generalization through inference-time physics optimization and a GNN barostat on elastic network compression tasks.
- Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery
  The paper introduces Experiment-as-Code Labs as a declarative stack synthesizing AI agents, systems orchestration, and physical lab control for AI-driven discovery.
- Benchmarking open-source tools for in silico antiviral drug discovery
  Boltz-2 and fine-tuned DrugFormDTA lead ML-based binding prediction while GNINA leads docking tools on a cleaned antiviral dataset, with performance varying by viral protein.
- MIRA: A Score for Conditional Distribution Accuracy and Model Comparison
  MIRA is a new analytic score for conditional distribution accuracy derived from equal probability mass assignment, enabling Bayesian model comparison via direct posterior validation.
- Sampling Parallelism for Fast and Efficient Bayesian Learning
  Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- Galactica: A Large Language Model for Science
  Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
- AIMBio-Mat: An AI-Native FAIR Platform for Closed-Loop Materials Discovery and Biomedical Translation
  AIMBio-Mat is a conceptual blueprint for an AI-native, FAIR, governance-aware decision layer that formulates biomedical-materials discovery as constrained multi-objective optimization under uncertainty.
- The Research Guide: From Informal Role to Profession
  The authors argue that guiding non-PhD learners through authentic research requires a dedicated profession with its own training, career structure, and recognition because existing models and programs fall short.
- Towards a Universal Foundation Model for Protein Dynamics: A Multi-Chain Tree-Structured Framework with Transformer Propagators
  Proposes TSCG hierarchical representation and Transformer propagator for universal coarse-grained protein MD with claimed 10k-20k times acceleration over all-atom MD while preserving statistical properties.
- On the Diffusion Time Evolution of Folding Chains in the Heteropolymer Model
  Folding chains in the heteropolymer model diffuse according to D ~ t^ν with ν decreasing from 0.666 to 0.5 as coupling randomness increases.
- NOVA: Fundamental Limits of Knowledge Discovery Through AI
- From Mechanistic to Compositional Interpretability