{"total":13,"items":[{"citing_arxiv_id":"2607.01918","ref_index":182,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Zeus: Towards Tuning-Free Foundation Model for Time Series Analysis","primary_cat":"cs.LG","submitted_at":"2026-07-02T09:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Zeus proposes a multi-scale Transformer with point-wise tokenization and Multi-Objective Temporal Masking to enable tuning-free performance on forecasting, interpolation, and other time series tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01775","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding","primary_cat":"cs.LG","submitted_at":"2026-07-02T06:45:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21724","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling","primary_cat":"cs.CL","submitted_at":"2026-04-23T14:27:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-GRAM applies data-aware dynamic token injection with hybrid hashing and local feature extraction to achieve up to 4.4 accuracy point gains over vanilla backbones and 3.2 over retrieval 
baselines at 0.73B-1.15B scales using 50% smaller tables.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.00816","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sundial: A Family of Highly Capable Time Series Foundation Models","primary_cat":"cs.LG","submitted_at":"2025-02-02T14:52:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sundial uses TimeFlow Loss for native pre-training of Transformers on continuous time series from TimeBench, achieving SOTA point and probabilistic forecasting with millisecond inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.15947","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoE-LLaVA: Mixture of Experts for Large Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2024-01-29T08:13:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.04799","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration","primary_cat":"cs.CL","submitted_at":"2023-11-08T16:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream 
performance than prior Cramming baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.06435","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Overview of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-07-12T20:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c), GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c), SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c). 2.5. Layer Normalization Layer normalization leads to faster convergence and is an integrated component of transformers [64]. In addition to LayerNorm [76] and RMSNorm [77], LLMs use pre-layer normalization [78], applying it before multi-head attention (MHA). Pre-norm is shown to provide training stability in LLMs. Another normalization variant, DeepNorm [79] fixes the issue with larger gradients in pre-norm. 2.6. Distributed LLM Training This section describes distributed LLM training approaches briefly. 
More details are available in [13, 37, 80, 81]."},{"citing_arxiv_id":"2208.07339","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale","primary_cat":"cs.LG","submitted_at":"2022-08-15T17:08:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2111.00396","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficiently Modeling Long Sequences with Structured State Spaces","primary_cat":"cs.LG","submitted_at":"2021-10-31T03:32:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while being faster than Transformers for generation.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"sequence models where tokens are predicted sequentially based on past context. Although RNNs were the model of choice for many years, Transformers are now the dominant model in such applications that contain data that is inherently discrete. We show that alternative models to Transformers can still be competitive in these settings. 
By simply taking a strong Transformer baseline [2] and replacing the self-attention layers, S4 substantially closes the gap to Transformers (within 0.8 ppl), setting SoTA for attention-free models by over 2 ppl. Fast autoregressive inference. A prominent limitation of autoregressive models is inference speed (e.g. generation), since they require a pass over the full context for every new sample. Several methods have been"},{"citing_arxiv_id":"2108.12409","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation","primary_cat":"cs.CL","submitted_at":"2021-08-27T17:35:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALiBi enables transformers trained on length-1024 sequences to extrapolate to length-2048 with the same perplexity as a sinusoidal model trained on 2048, while training 11% faster and using 11% less memory.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the first i keys K ∈ R^{i×d}, where d is the head dimension: softmax(q_i K^⊤) These attention scores are then multiplied by the values to return the output of the attention sublayer. When using ALiBi, we do not add position embeddings at any point in the network. The only modification we apply is after the query-key dot product, where we add a static, non-learned bias: softmax(q_i K^⊤ + m · [−(i − 1), ..., −2, −1, 0]), where scalar m is a head-specific slope fixed before training. Figure 3 offers a visualization. For our models with 8 heads, the slopes that we used are the geometric sequence: 1/2^1, 1/2^2, ..., 1/2^8. 
For models that require 16 heads, we interpolate those 8 slopes by geometrically averaging every consecutive pair, resulting in the geometric sequence that starts at 1√"},{"citing_arxiv_id":"1911.05507","ref_index":110,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compressive Transformers for Long-Range Sequence Modelling","primary_cat":"cs.LG","submitted_at":"2019-11-13T14:36:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Compressive Transformer sets new records on WikiText-103 (17.1 ppl) and Enwik8 (0.97 bpc) via memory compression and introduces the PG-19 long-range language benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1909.11942","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","primary_cat":"cs.CL","submitted_at":"2019-09-26T07:06:13+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"from unlabelled data. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1367-1377. Association for Computational Linguistics, 2016. doi: 10.18653/v1/N16-1162. URL http://aclweb.org/anthology/N16-1162. Jerry R. Hobbs. Coherence and coreference. Cognitive Science, 3(1):67-90, 1979. Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018. 
Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. First quora dataset release: Question pairs, January 2017. URL https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs. Yacine Jernite, Samuel R Bowman, and David Sontag."},{"citing_arxiv_id":"1906.08237","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"XLNet: Generalized Autoregressive Pretraining for Language Understanding","primary_cat":"cs.CL","submitted_at":"2019-06-19T17:35:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}