{"total":37,"items":[{"citing_arxiv_id":"2605.18530","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continuous Diffusion Scales Competitively with Discrete Diffusion for Language","primary_cat":"cs.CL","submitted_at":"2026-05-18T15:15:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18856","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:48:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11007","ref_index":148,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RT-Transformer: The Transformer Block as a Spherical State Estimator","primary_cat":"cs.LG","submitted_at":"2026-05-10T08:14:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformer components arise as the natural solution to precision-weighted directional state estimation on the 
hypersphere.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04901","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference","primary_cat":"cs.CR","submitted_at":"2026-05-06T13:31:15+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03413","ref_index":215,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning to Theorize the World from Observation","primary_cat":"cs.LG","submitted_at":"2026-05-05T06:39:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01517","ref_index":245,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation","primary_cat":"cs.CV","submitted_at":"2026-05-02T16:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00604","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-01T12:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"planation: a stateless MTP predictor cannot detect structural transitions (acc@transition ≈ 0.006), so retaining it at inference adds compute without routing benefit. The fix - conditioning on β-accumulated state - has not been proposed or tested in either paper. Anticipatory mechanisms in recurrent networks have a longer history. LSTM [24] and GRU [25] learn when to retain and forget information via gated memory, implicitly implementing prediction of future relevance. Our predictor is more explicit - it directly predicts the next embedding - and is applied specifically to the routing decision rather than the representation. 
7.5 Free Energy Principle in Machine Learning The FEP has inspired several ML architectures."},{"citing_arxiv_id":"2604.21215","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Recurrent Transformer: Greater Effective Depth and Efficient Decoding","primary_cat":"cs.LG","submitted_at":"2026-04-23T02:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17384","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards a Data-Parameter Correspondence for LLMs: A Preliminary Discussion","primary_cat":"cs.LG","submitted_at":"2026-04-19T11:18:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A data-parameter correspondence unifies data-centric and parameter-centric LLM optimizations as dual geometric operations on the statistical manifold via Fisher-Rao metric and Legendre duality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16509","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning-Based Sparsification of Dynamic Graphs in Robotic Exploration Algorithms","primary_cat":"cs.RO","submitted_at":"2026-04-15T03:39:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A PPO-trained transformer policy sparsifies dynamic graphs during RRT frontier exploration, cutting size by up to 
96% and yielding the most consistent exploration rates across environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03263","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling","primary_cat":"cs.CL","submitted_at":"2026-03-12T21:21:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":245,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? 
Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.14386","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers","primary_cat":"cs.CV","submitted_at":"2025-04-19T19:20:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LOOPE learns a patch ordering for positional embeddings in ViTs and introduces the Three Cell Experiment benchmark that shows 30-35% gaps in positional retention versus the usual 4-6%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.18970","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State-Space Architectures from S4 to Mamba","primary_cat":"cs.LG","submitted_at":"2025-03-22T01:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":0.0,"formal_verification":"none","one_line_summary":"A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.04434","ref_index":160,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient 
Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.18416","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Capabilities of Gemini Models in Medicine","primary_cat":"cs.AI","submitted_at":"2024-04-29T04:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164, 2023. J. Medina-Martínez, C. Saus-Ortega, M. M. Sánchez-Lorente, E. M. Sosa-Palanca, P. García-Martínez, and M. I. Mármol-López. Health inequities in lgbt people and nursing interventions to reduce them: A systematic review. International Journal of Environmental Research and Public Health, 18(22): 11801, 2021. Meta. Papers with code - medical, 2024. URL https://paperswithcode.com/area/medical. Accessed: 2024-04-26. M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259-265, 2023a. M. Moor, Q. Huang, S. Wu, M. 
Yasunaga, Y."},{"citing_arxiv_id":"2404.07143","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention","primary_cat":"cs.CL","submitted_at":"2024-04-10T16:18:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02954","ref_index":162,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","primary_cat":"cs.CL","submitted_at":"2024-01-05T18:59:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.08560","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MemGPT: Towards LLMs as Operating Systems","primary_cat":"cs.AI","submitted_at":"2023-10-12T17:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session 
chat.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.01852","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment","primary_cat":"cs.CV","submitted_at":"2023-10-03T07:33:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16797","ref_index":85,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution","primary_cat":"cs.CL","submitted_at":"2023-09-28T19:01:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.08089","ref_index":172,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory","primary_cat":"cs.CV","submitted_at":"2023-08-16T01:43:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DragNUWA integrates text, image, and trajectory controls into a diffusion video model using a Trajectory Sampler, Multiscale Fusion, and 
Adaptive Training to enable fine-grained open-domain video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.06435","ref_index":111,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Overview of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-12T20:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CPM-2 also proposes the INFMOE, a memory-efficient framework with a strategy to dynamically offload parameters to the CPU for inference at a 100B scale. It overlaps data movement with inference computation for lower inference time. ERNIE 3.0 [110]: ERNIE 3.0 takes inspiration from multi-task learning to build a modular architecture using Transformer-XL [111] as the backbone. The universal representation module is shared by all the tasks, which serve as the basic block for task-specific representation modules, which are all trained jointly for natural language understanding, natural language generation, and knowledge extraction. This LLM is primarily focused on the Chinese language. 
It claims to train on the"},{"citing_arxiv_id":"2303.16199","ref_index":103,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention","primary_cat":"cs.CV","submitted_at":"2023-03-28T17:59:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01068","ref_index":119,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OPT: Open Pre-trained Transformer Language Models","primary_cat":"cs.CL","submitted_at":"2022-05-02T17:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2107.06499","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Deduplicating Training Data Makes Language Models Better","primary_cat":"cs.CL","submitted_at":"2021-07-14T06:06:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test 
overlap.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2101.00027","ref_index":141,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","primary_cat":"cs.CL","submitted_at":"2020-12-31T19:00:10+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1911.05507","ref_index":111,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compressive Transformers for Long-Range Sequence Modelling","primary_cat":"cs.LG","submitted_at":"2019-11-13T14:36:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.03771","ref_index":151,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HuggingFace's Transformers: State-of-the-art Natural Language Processing","primary_cat":"cs.CL","submitted_at":"2019-10-09T03:23:22+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language 
processing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1909.11942","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","primary_cat":"cs.CL","submitted_at":"2019-09-26T07:06:13+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1909.08053","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","primary_cat":"cs.CL","submitted_at":"2019-09-17T19:42:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1909.05858","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CTRL: A Conditional Transformer Language Model for Controllable Generation","primary_cat":"cs.CL","submitted_at":"2019-09-11T17:57:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and 
content.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.09669","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EmotionX-HSU: Adopting Pre-trained BERT for Emotion Classification","primary_cat":"cs.CL","submitted_at":"2019-07-23T03:05:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Fine-tuning BERT yields micro-F1 scores of 79.1% on Friends and 86.2% on EmotionPush test sets for four-class emotion classification in the EmotionX-2019 shared task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.06607","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agglomerative Attention","primary_cat":"cs.LG","submitted_at":"2019-07-15T17:11:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents agglomerative attention, a linear-complexity attention model that achieves comparable performance to full attention on language modeling tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.05572","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R-Transformer: Recurrent Neural Network Enhanced Transformer","primary_cat":"cs.LG","submitted_at":"2019-07-12T04:01:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R-Transformer integrates RNNs with multi-head attention to model local and global sequence dependencies without position embeddings and reports large-margin gains over prior methods on diverse 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.04868","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LakhNES: Improving multi-instrumental music generation with cross-domain pre-training","primary_cat":"cs.SD","submitted_at":"2019-07-10T18:00:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pre-training a Transformer on Lakh MIDI improves quantitative and qualitative performance when generating four-instrument NES music scores from NES-MDB.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.08237","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"XLNet: Generalized Autoregressive Pretraining for Language Understanding","primary_cat":"cs.CL","submitted_at":"2019-06-19T17:35:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}