super hub Mixed citations
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Mixed citation behavior. Most common role is background (47%).
abstract
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
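The encode-then-decode idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The gated recurrent unit formulation, the layer sizes, the random initialisation, the choice to start the decoder from the summary vector, and the names `GRUCell`, `EncoderDecoder`, and `log_prob` are all assumptions made for illustration. The sketch only evaluates log p(target | source); training would adjust the weights to maximize this quantity over phrase pairs.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """One common gated-recurrent-unit formulation (an assumption, not the paper's code)."""

    def __init__(self, input_dim, hidden_dim, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return z * h + (1.0 - z) * h_tilde


class EncoderDecoder:
    """Hypothetical toy model: untrained weights, illustrative sizes."""

    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden_dim = hidden_dim
        self.embed = rng.normal(0.0, 0.1, (vocab_size, embed_dim))
        self.encoder = GRUCell(embed_dim, hidden_dim, rng)
        # The decoder sees the previous target symbol and the summary vector c.
        self.decoder = GRUCell(embed_dim + hidden_dim, hidden_dim, rng)
        self.W_out = rng.normal(0.0, 0.1, (vocab_size, hidden_dim))

    def encode(self, source):
        """Read the source symbols and return a fixed-length summary vector."""
        h = np.zeros(self.hidden_dim)
        for tok in source:
            h = self.encoder.step(self.embed[tok], h)
        return h

    def log_prob(self, source, target, bos=0):
        """Return log p(target | source) as a sum of per-step log-softmax terms."""
        c = self.encode(source)
        h = c  # assumption: decoder state starts from the summary vector
        prev, total = bos, 0.0
        for tok in target:
            x = np.concatenate([self.embed[prev], c])
            h = self.decoder.step(x, h)
            logits = self.W_out @ h
            logits = logits - logits.max()  # numerical stability
            log_probs = logits - np.log(np.exp(logits).sum())
            total += log_probs[tok]
            prev = tok
        return total


model = EncoderDecoder(vocab_size=10)
print(model.log_prob(source=[1, 2, 3], target=[4, 5]))
```

A score of this kind, computed for each phrase pair, is what the abstract describes adding as an extra feature to an existing log-linear translation model.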
claims ledger
- abstract In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed
- method To validate whether the quality advantage of GS scenes translates to stronger navigation agents, we train five agent groups under different scene-domain mixtures, with training budget fixed at 5×10⁷ steps: A: 100 mesh scenes, B: 100 GS scenes, C: 80M + 20G, D: 50M + 50G, and E: 20M + 80G. All agents share a unified DD-PPO [30] architecture with a ResNet [6] visual encoder and a GRU [4] policy head, receiving 256×256 RGB and depth observations, with only training scene composition varying. Each agent i
- method Recent works [54,62] have demonstrated that incorporating G-buffers and temporal cues can significantly improve reconstruction quality. Motivated by this, we propose a Geometry-Temporal Recurrent Refinement Network to further enhance the interpolated outputs. As illustrated in Fig. 3 (a), the proposed network employs a modified gated recurrent unit (GRU) [9] architecture. The network takes as input the current low-resolution frame I and its gradients G, the depth map D, the norm
- method and we map the state $H^t$ to queries, keys and values with affine projections using learned parameter matrices $W^Q \in \mathbb{R}^{d \times d/k}$, $W^K \in \mathbb{R}^{d \times d/k}$, $W^V \in \mathbb{R}^{d \times d/k}$ and $W^O \in \mathbb{R}^{d \times d}$. At step $t$, the UT then computes revised representations $H^t \in \mathbb{R}^{m \times d}$ for all $m$ input positions as follows: $H^t = \mathrm{LayerNorm}(A^t + \mathrm{Transition}(A^t))$ (4), where $A^t = \mathrm{LayerNorm}((H^{t-1} + P^t) + \mathrm{MultiHeadSelfAttention}(H^{t-1} + P^t))$ (5), where LayerNorm() is defined in Ba et al. (2016), and Transition() and $P^t$ are discussed below. Depending on the task, w
- dataset and the kinematics are deterministic; digits overlap without collision, and bouncing reflections are non-linear. (2) PhyWorld Collision 30K [Kang et al., 2025] comprises roughly 30 thousand simulated rigid-body collisions with varying ball radii and velocities. It tests OOD generalisation and strict physical consistency (momentum and kinetic energy conservation). (3) WeatherBench (2m temperature) [Rasp et al., 2020] is a processed version of the ERA5 archive [Hersbach et al., 2020], containing glob
- background Middle: attention-based learning enables adaptive and reliability-aware aggregation of heterogeneous measurements. Right: representative applications, including radio map reconstruction, LEO satellite localization, and map-informed resource allocation. To address these challenges, attention mechanisms have emerged as an effective framework for adaptive information aggregation. Originally developed for neural machine translation [29], [30] and later generalized by Transformers [31], attention c
- method $_{\theta_m}(z_t, a_t) \to \hat{z}_{t+1}$) predicts the next latent state after a primitive action, and the decoder ($\mathrm{Dec}_{\theta_d}(\hat{z}_{t+1}) \to \hat{s}_{t+1}$) reconstructs next-state information from the predicted latent. The purpose is not to use this model as a long-horizon planner, but to learn a dynamics-aware latent space where behaviorally similar states can merge and useful hubs can emerge. We use a GRU [9] memory module with clipped history so the latent can capture recent context without memorizing full trajectory identity. Ad
citing papers explorer
-
PathVQA: 30000+ Questions for Medical Visual Question Answering
PathVQA is the first public dataset of over 32,000 questions on nearly 5,000 pathology images for medical visual question answering.
-
Learned Memory Attenuation in Sage-Husa Kalman Filters for Robust UAV State Estimation
NDR-SHKF replaces the static forgetting factor in Sage-Husa Kalman Filters with a learned vector-valued memory attenuation policy from a bifurcated recurrent network trained end-to-end on whitened innovations to minimize estimation error.
-
Nested-GPT for variable-multiplicity parton showers: A case study in the resummation of non-global logarithms
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
-
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
-
Zero-shot Imitation Learning by Latent Topology Mapping
ZALT learns latent hub states and hub-to-hub dynamics from demonstrations to plan zero-shot solutions for unseen start-goal tasks, achieving 55% success in a 3D maze versus 6% for baselines.
-
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
-
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
-
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.
-
Geometry-Induced Long-Range Correlations in Recurrent Neural Network Quantum States
Dilated RNN wave functions induce power-law correlations for the critical 1D transverse-field Ising model and the Cluster state, unlike the exponential decay of conventional RNN ansatze.
-
A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs
HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete multimodal EHRs.
-
Denoising Particle Filters: Learning State Estimation with Single-Step Objectives
Denoising particle filters train state estimators on individual transitions via score matching, then use the learned denoiser with a dynamics model to approximate Bayesian filtering step-by-step, matching end-to-end baselines while preserving composability.
-
MELT: A Behavioral Trace Dataset for High-Risk Memecoin Launch Detection
MELT is the first behavioral trace dataset for high-risk memecoin launch detection on Solana, providing 122 features, risk annotations, and ML benchmarks that reduce investment loss when used for selection.
-
Cognitive Alpha Mining via LLM-Driven Code-Based Evolution
CogAlpha combines LLM reasoning with code-level evolutionary search to discover financial alphas that show higher predictive accuracy and generalization than prior methods on five stock datasets.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
Human Motion Diffusion Model
MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
-
Mastering Atari with Discrete World Models
DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
-
Brno Mobile OCR Dataset
Introduces the B-MOD dataset of 19,728 mobile-device photos of documents with precise text-line annotations, along with a neural baseline that shows high error rates on harder images.
-
Graph Attention Networks
Graph Attention Networks compute learnable attention coefficients over node neighborhoods to produce weighted feature aggregations, achieving state-of-the-art results on citation networks and inductive protein-protein interaction graphs.
-
Mixed Precision Training
Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.
-
MoCo-AIS: A Contrastive Learning Framework for Similarity Computation of Vessel Trajectories
MoCo-AIS is a MoCo-based contrastive learning framework that learns vessel trajectory embeddings and improves similarity computation over baselines on large-scale real-world AIS datasets while offering a benchmarking platform.
-
Learning Cardiac Latent Representations in Vectorcardiogram Space
LVCG is the first self-supervised framework for learning view-invariant latent VCG representations that claims to outperform ECG-space baselines with better robustness and generalization in domain shift settings.
-
Generative Recursive Reasoning
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
-
3DGS$^3$: Joint Super Sampling and Frame Interpolation for Real-Time Large-Scale 3DGS Rendering
3DGS³ adds gradient-guided super-sampling and lightweight temporal interpolation to low-resolution 3DGS renders to produce high-resolution, high-frame-rate output without retraining the underlying scene representation.
-
Graph Federated Unlearning for Privacy Preservation
Orthogonal unlearning updates plus server-side virtual clients enable effective user data removal in graph federated learning without major performance loss.
-
Deep Kernel Learning for Stratifying Glaucoma Trajectories
A deep kernel learning architecture with transformer feature extraction on clinical-BERT embeddings and Gaussian process backend identifies three glaucoma subgroups by decoupling progression trajectories from current visual acuity in multimodal EHR data.
-
IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem
IDOBE compiles over 10,000 epidemiological outbreaks into a public benchmark and shows that MLP-based models deliver the most robust short-term forecasts while statistical methods hold a slight edge pre-peak.
-
MATRIX: Multi-Layer Code Watermarking via Dual-Channel Constrained Parity-Check Encoding
MATRIX embeds multi-layer watermarks in LLM-generated code via dual-channel constrained parity-check encoding, achieving 99.2% detection accuracy with 0-0.14% functionality loss and 7.7-26.67% better attack robustness than prior methods.
-
Early-Warning Learner Satisfaction Forecasting in MOOCs via Temporal Event Transformers and LLM Text Embeddings
TET-LLM predicts MOOC satisfaction early via temporal event transformers on behavior, LLM embeddings on text, and topic distributions, beating baselines at RMSE 0.82 and AUC 0.77 for 7-day forecasts.
-
Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
-
Upper Generalization Bounds for Neural Oscillators
Upper generalization bounds for neural oscillators scale polynomially with MLP size and time length, avoiding the curse of parametric complexity, with numerical validation on a Bouc-Wen nonlinear system.
-
Beyond Static: Related Questions Retrieval Through Conversations in Community Question Answering
TeCQR retrieves related questions in cQA by generating tag-enhanced clarifying questions, using noise-tolerant semantic matching, and two-stage training to learn fine-grained representations of queries, questions, and tags.
-
Attention-Based Neural-Augmented Kalman Filter for Legged Robot State Estimation
AttenNKF augments InEKF with an attention-based neural compensator trained in latent space to correct foot-slip errors in legged robot state estimation.
-
AsarRec: Adaptive Sequential Augmentation for Robust Self-supervised Sequential Recommendation
AsarRec learns adaptive sequence augmentations via transformation matrices and Semi-Sinkhorn projection to improve robustness of self-supervised sequential recommenders under noise.
-
Cataract-LMM Large-Scale Multi-Source Multi-Task Benchmark for Deep Learning in Surgical Video Analysis
Cataract-LMM is a new multi-source dataset of 3000 annotated phacoemulsification videos enabling benchmarks for phase recognition, scene segmentation, interaction tracking, and automated skill assessment.
-
RAPTOR: A Foundation Policy for Quadrotor Control
A 2084-parameter recurrent policy trained by distilling 1000 RL teacher policies enables zero-shot control across 10 real quadrotors differing in mass, motors, frames, propellers, and flight controllers.
-
Scalable Option Learning in High-Throughput Environments
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
-
Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
Gating in RNNs couples state time-scales with parameter gradients to produce lag- and direction-dependent effective learning rates, shown via exact Jacobians and first-order expansion.
-
SpectraLLM: Uncovering the Ability of LLMs for Molecular Structure Elucidation from Multi-Spectral Data
SpectraLLM is an LLM fine-tuned to predict small-molecule structures from single or multiple spectra, reporting state-of-the-art results on four public benchmarks with gains from multi-modal input.
-
Chinese Cyberbullying Detection: Dataset, Method, and Validation
Introduces CHNCI, the first Chinese cyberbullying incident detection dataset with 220,676 comments across 91 incidents, created via ensemble pseudo-labeling from explanation-generating methods followed by human annotation.
-
Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Slot-MLLM introduces a slot-attention-based object-centric visual tokenizer with Q-Former encoder, diffusion decoder, and residual vector quantization for improved local visual comprehension and generation in multimodal LLMs.
-
Decentralized Collective World Model for Emergent Communication and Coordination
A decentralized collective world model integrates predictive coding with bidirectional communication to achieve simultaneous symbol emergence and coordination, outperforming non-communicative baselines in a two-agent trajectory task under divergent perceptions.
-
Pretraining a Foundation Model for Small-Molecule Natural Products
NaFM is a pretrained foundation model for natural products using scaffold-focused contrastive learning and masked graph objectives that achieves SOTA on taxonomy classification, gene/microbial analysis, and virtual screening tasks.
-
Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code
ByteTR recovers variable types in binary code more effectively than prior methods by decoupling unbalanced type sets, mitigating compiler optimization effects via static analysis, and modeling inter-procedural data flows with a gated GNN.
-
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
-
SAM 2: Segment Anything in Images and Videos
SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.
-
READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
READ recurrent adapters with partial video-language alignment via optimal transport outperform standard fine-tuning on low-resource temporal grounding and summarization tasks.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and Adaptive Training to enable fine-grained open-domain video generation.
-
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.