{"total":35,"items":[{"citing_arxiv_id":"2605.13833","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-13T17:56:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12491","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Elastic Attention Cores for Scalable Vision Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[52] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities.arXiv preprint arXiv:1812.01243, 2018. [53] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational Conference on Machine Learning, pages 5156-5165. PMLR, 2020. [54] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020. [55] Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. 
The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry."},{"citing_arxiv_id":"2605.12464","ref_index":110,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Search Your Block Floating Point Scales!","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:50:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10875","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Compute Where it Counts: Self Optimizing Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T17:27:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or random allocation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10544","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:23:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10123","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Complex-Valued Phase-Coherent Transformer","primary_cat":"cs.LG","submitted_at":"2026-05-11T07:38:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex Transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09905","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging","primary_cat":"cs.LG","submitted_at":"2026-05-11T02:48:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than 
training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09727","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-10T19:52:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11007","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RT-Transformer: The Transformer Block as a Spherical State Estimator","primary_cat":"cs.LG","submitted_at":"2026-05-10T08:14:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08587","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kaczmarz Linear Attention","primary_cat":"cs.LG","submitted_at":"2026-05-09T01:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08308","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns","primary_cat":"cs.LG","submitted_at":"2026-05-08T13:23:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05971","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Transformers for KV Cache Compressibility","primary_cat":"cs.LG","submitted_at":"2026-05-07T10:17:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05838","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MDN: Parallelizing Stepwise Momentum for Delta Linear 
Attention","primary_cat":"cs.LG","submitted_at":"2026-05-07T08:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05806","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Retrieval from Within: An Intrinsic Capability of Attention-Based Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T07:42:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04421","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning","primary_cat":"cs.LG","submitted_at":"2026-05-06T02:27:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connections for better information flow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02568","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k","primary_cat":"cs.LG","submitted_at":"2026-05-04T13:19:29+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01711","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Linear-Time Global Visual Modeling without Explicit Attention","primary_cat":"cs.CV","submitted_at":"2026-05-03T04:51:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26353","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GateMOT: Q-Gated Attention for Dense Object Tracking","primary_cat":"cs.CV","submitted_at":"2026-04-29T07:04:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object 
tracking on benchmarks like BEE24.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25188","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation","primary_cat":"cs.CV","submitted_at":"2026-04-28T03:49:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23113","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization","primary_cat":"cs.SI","submitted_at":"2026-04-25T02:41:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DetailDPO cuts detail-level hallucination errors in LLMs on long regulatory documents by 42-61% using targeted contrastive pairs on a new 13,000-pair benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22442","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models","primary_cat":"cs.LG","submitted_at":"2026-04-24T10:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20819","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling","primary_cat":"cs.LG","submitted_at":"2026-04-22T17:46:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20789","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:14:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20311","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Seeing Further and 
Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction","primary_cat":"cs.MM","submitted_at":"2026-04-22T08:11:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19021","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control","primary_cat":"cs.LG","submitted_at":"2026-04-21T03:15:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15118","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"NFTDELTA: Detecting Permission Control Vulnerabilities in NFT Contracts through Multi-View Learning","primary_cat":"cs.CR","submitted_at":"2026-04-16T15:07:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NFTDELTA detects permission control vulnerabilities in NFT contracts by combining sequence and graph views of function CFGs, reporting 241 confirmed issues across 795 collections with 97.92% average precision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12798","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation","primary_cat":"cs.LG","submitted_at":"2026-04-14T14:28:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11983","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Geometric Algebra-informed NeRF Framework for Generalizable Wireless Channel Prediction","primary_cat":"cs.NI","submitted_at":"2026-04-13T19:20:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GAI-NeRF combines geometric algebra attention and an adaptive ray tracing module inside a NeRF model to deliver more accurate and generalizable wireless channel predictions across varied indoor environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06473","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MICA: Multivariate Infini Compressive Attention for Time Series 
Forecasting","primary_cat":"cs.LG","submitted_at":"2026-04-07T21:19:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14191","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention to Mamba: A Recipe for Cross-Architecture Distillation","primary_cat":"cs.CL","submitted_at":"2026-04-01T09:23:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.12524","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"YOLOv12: Attention-Centric Real-Time Object Detectors","primary_cat":"cs.CV","submitted_at":"2025-02-18T04:20:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.14509","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models","primary_cat":"cs.LG","submitted_at":"2023-09-25T20:15:57+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.09461","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token Merging: Your ViT But Faster","primary_cat":"cs.CV","submitted_at":"2022-10-17T22:23:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.02311","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PaLM: Scaling Language Modeling with Pathways","primary_cat":"cs.CL","submitted_at":"2022-04-05T16:11:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of 
benchmarks and outperforming humans on BIG-bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2104.09864","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","primary_cat":"cs.CL","submitted_at":"2021-04-20T09:54:06+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}