QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
Attention is all you need
Canonical reference. 89% of citing Pith papers cite this work as background.
claims ledger
- background characteristics inherent in power load time series. Data-driven approaches based on artificial intelligence have become mainstream in recent years. Early methods centered on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), which are adept at capturing temporal dependencies and inter-variable relationships [5]. With the advent of the Transformer architecture [6], attention-based models have advanced rapidly for time series forecasting, giving rise to numerous variants
- background However, these models still face challenges: their ability to explicitly model local interactions remains limited, and their interpretability is relatively weak. These drawbacks motivate our approach, which leverages physically grounded quantum walk dynamics to provide both richer local structural modeling and improved interpretability. Formally, the self-attention mechanism in the Transformer framework [20] is defined as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$ (2), where $Q, K, V \in \mathbb{R}^{n \times d}$ are the
- background Generating accurate, human-like motion requires accounting for variability in emotion and semantic emphasis, two aspects that remain underexplored. Computational efficiency is an additional requirement for real-time robotics applications. Model architectures have evolved from recurrent networks such as long short-term memory (LSTM) [16] to attention-based transformers [17]. Adversarial and diffusion-based methods have also been proposed to improve motion realism and diversity [2], [14], [18]
- background Neural Machine Translation (NMT) has emerged as a powerful end-to-end approach for automated translation, employing a single neural network to directly model the probability of a target sentence given a source sentence [1]. In recent years, NMT models have significantly improved translation quality, accompanied by a substantial expansion in model scale. Since Transformer introduced [2], the parameter count of NMT models has grown exponentially. For instance, M2M-100 (12 billion parameters) [
- background after GEMM completion while the output tiles still reside in on-chip memory (L1/L2 caches or registers), we avoid costly global memory traffic. However, conventional normalization layers operate along the feature dimension, which often misaligns with the physical data layout of GEMM outputs. To address this, we proposesBlockNorm, a normalization approach inspired by GroupNorm [71] which is originally designed to apply normalization within individual channels of a feature map. In our version of
- background bias parameters (γ, β) from conditional inputs, then modulates intermediate features via γ ⊙ x + β to achieve lightweight conditional feature selection [7]. We observe that this channel-level modulation effectively adjusts feature weights with low overhead and good trainability. In contrast, attention-based cross-modal fusion typically relies on spatial weights or token-level interactions [9], [15]-[17], which increase computational/parameter overhead and may complicate optimization in reinforcement
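
The self-attention definition quoted in the second excerpt above, softmax(QK^T / sqrt(d)) V, can be illustrated with a minimal pure-Python sketch. This is only an illustration under stated assumptions: the function names `softmax` and `attention` and the toy 2×2 inputs are hypothetical, a single head is used with no masking or learned projections, and it is not taken from any of the cited papers.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are n x d matrices given as lists of rows (illustrative layout).
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Each row of weights is non-negative and sums to 1.
        weights = softmax(scores)
        # Output row is the weighted average of the value rows.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return out


# Toy inputs (assumed for illustration): n = 2 tokens, d = 2 features.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Because each row of attention weights sums to 1, every output row is a convex combination of the rows of `V`; in this toy case the first query leans toward the first value row and the second query toward the second.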
representative citing papers
CTQWformer fuses continuous-time quantum walks into a graph transformer and recurrent module to outperform standard GNNs and graph kernels on classification benchmarks.
CAIS delivers 1.38x end-to-end LLM training speedup over NVLS and 1.61x over T3 by making in-switch computing aware of computation memory requirements instead of treating communication as an isolated phase.
Cascaded discrete diffusion generates CAD command sequences with absorbing transitions and parameters with Gaussian, scale-invariant, and prior-preserving kernels, outperforming autoregressive and continuous diffusion baselines on the DeepCAD dataset.
Tempus delivers 607 GOPS at 10.677 W using fixed 16 AIE cores on Versal AI Edge, with 211.2x better platform-aware utility than spatial SOTA ARIES and zero URAM/DSP utilization.
Temporal autocorrelation reintroduces spectral bias in KANs for time series forecasting, which DCT preprocessing can mitigate.
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
BiSplat-WRF applies 2D planar Gaussians rendered on angular domains plus a bilinear spatial transformer to capture electromagnetic interactions, outperforming prior NeRF and GS methods on SSIM for wireless radiance field reconstruction.
DEMUX achieves state-of-the-art multi-tab website fingerprinting accuracy by preserving boundary signals, modeling at multiple scales, and associating dispersed traffic fragments with a new three-component architecture.
STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.
SHIELD reduces eDRAM refresh energy by 35% for LLM inference on edge NPUs by isolating sign/exponent from mantissa bits, disabling refresh on transient QO mantissas, and relaxing it on persistent KV mantissas while keeping accuracy intact.
LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
HealthPoint represents clinical events as points in a 4D space (content, time, modality, case) and applies low-rank relational attention to achieve state-of-the-art mortality prediction from multi-level incomplete multimodal EHRs.
CBEN provides paired optical-radar images with cloud occlusion, revealing 23-33 point AP drops in clear-sky trained models and 17-29 point relative gains when models are trained on cloudy data.
D³ETOR combines debate-enhanced pseudo labeling from SAM with frequency-aware progressive debiasing in FADeNet to achieve state-of-the-art weakly-supervised camouflaged object detection using scribbles.
ExDoS uses expert-guided dual-focus distillation between source semantic graphs and bytecode control-flow graphs plus a dual-attention network to improve smart contract vulnerability detection, reporting 3-6% F1 gains over baselines.
GT-NSGDm achieves the optimal non-asymptotic convergence rate O(1/T^{(p-1)/(3p-2)}) for decentralized nonconvex stochastic optimization under zero-mean heavy-tailed noise with p-th moment.
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
MAQIU adds a memorization module and recall mechanism to update query intent dynamically in chat-based image retrieval, cutting FLOPs by 86.4% versus ChatIR while improving results.
CSI-JEPA learns temporal-spectral representations from unlabeled CSI via masked prediction and achieves up to 10.64 percentage points accuracy gain and 98% label savings on seven real-world Wi-Fi sensing tasks.
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
EKD trains lightweight NMT students progressively from a chain of teachers with rising capacity, achieving BLEU scores within 0.08 of the largest teacher on IWSLT-14.
citing papers explorer
-
Computer-Aided Design Generation by Cascaded Discrete Diffusion Model
Cascaded discrete diffusion generates CAD command sequences with absorbing transitions and parameters with Gaussian, scale-invariant, and prior-preserving kernels, outperforming autoregressive and continuous diffusion baselines on the DeepCAD dataset.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID
STFER uses LVLM-generated identity-consistent semantic text to drive visual token filtering and expert routing for improved any-time person re-identification under clothing changes and modality shifts.
-
CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation
CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.
-
LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation
LiftFormer transforms monocular depth prediction into depth-oriented geometric and edge-aware subspace representations via lifting and frame theory, achieving state-of-the-art results on standard datasets.
-
CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding
CBEN provides paired optical-radar images with cloud occlusion, revealing 23-33 point AP drops in clear-sky trained models and 17-29 point relative gains when models are trained on cloudy data.
-
Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations
D³ETOR combines debate-enhanced pseudo labeling from SAM with frequency-aware progressive debiasing in FADeNet to achieve state-of-the-art weakly-supervised camouflaged object detection using scribbles.
-
UniT: Unified Geometry Learning with Group Autoregressive Transformer
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
-
Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval
MAQIU adds a memorization module and recall mechanism to update query intent dynamically in chat-based image retrieval, cutting FLOPs by 86.4% versus ChatIR while improving results.
-
Text-to-CAD Retrieval: a Strong Baseline
Text-to-CAD retrieval is introduced as a cross-modal task with a baseline that learns joint embeddings from CAD construction sequences, point clouds, and text queries via a masked feature decoder.
-
RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation
RIHA proposes a hierarchical alignment transformer that uses multi-scale visual and textual feature pyramids plus optimal transport to generate more accurate radiology reports from medical images.
-
Boundary-Centric Active Learning for Temporal Action Segmentation
B-ACT improves label efficiency in temporal action segmentation by selecting only boundary frames for annotation via a two-stage uncertainty-driven process that fuses neighborhood uncertainty, class ambiguity, and temporal dynamics.
-
Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Light-ResKAN reaches 99.09% accuracy on MSTAR SAR images with 82.9 times fewer FLOPs and 163.78 times fewer parameters than VGG16 by combining KAN convolutions, Gram polynomials, and channel-wise parameter sharing.
-
Frozen LLMs as Map-Aware Spatio-Temporal Reasoners for Vehicle Trajectory Prediction
A framework encodes observed trajectories and HD maps into tokens for frozen LLMs to perform spatio-temporal reasoning and predict future vehicle paths with a linear decoder.
-
Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection
A new adapter module combining boundary-aware state space modeling with spatial processing boosts localization and robustness in temporal action detection.
-
Coordinate-Based Dual-Constrained Autoregressive Motion Generation
CDAMD is a new autoregressive text-to-motion framework operating on continuous motion coordinates with dual constraints and diffusion-inspired components, establishing new benchmarks and claiming SOTA fidelity plus semantic consistency.
-
FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection
FMC-DETR proposes a frequency-decoupled fusion framework with WeKat backbone, MDFC coordination, and CPF fusion modules that claims state-of-the-art results on remote sensing object detection benchmarks.
-
AMIEOD: Adaptive Multi-Experts Image Enhancement for Object Detection in Low-Illumination Scenes
AMIEOD combines a multi-expert enhancement module with detection-guided regression and selection losses to raise object detection accuracy in low-illumination images.
-
EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges
EV-CLIP introduces mask and context visual prompts to adapt CLIP for improved few-shot video action recognition under visual challenges such as low light and egocentric views, outperforming other efficient methods with backbone-scale-independent efficiency.
-
SwinTextUNet: Integrating CLIP-Based Text Guidance into Swin Transformer U-Nets for Medical Image Segmentation
SwinTextUNet integrates CLIP text guidance into Swin U-Net via cross-attention and convolutional fusion, achieving 86.47% Dice and 78.2% IoU on QaTaCOV19 medical image segmentation.
-
Video-guided Machine Translation with Global Video Context
A globally video-guided multimodal translation framework retrieves semantically related video segments with a vector database and applies attention mechanisms to improve subtitle translation accuracy in long videos.
-
Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
Zero-shot MLLMs on ShanghaiTech and CHAD exhibit strong conservative bias with high precision but collapsed recall; class-specific prompts raise peak F1 from 0.09 to 0.64 yet recall remains the bottleneck.
-
Analysis of Invasive Breast Cancer in Mammograms Using YOLO, Explainability, and Domain Adaptation
A ResNet50 OOD filter plus YOLOv8/11/12 pipeline reaches 99.77% OOD rejection accuracy and 0.947 mAP on mammograms while blocking irrelevant imaging inputs.
- RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation