arXiv preprint arXiv:2203.16634

Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, Omer Levy · 2022 · arXiv 2203.16634

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 1 · support: 1 · use method: 1

representative citing papers

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

Group Representational Position Encoding

cs.LG · 2025-12-08 · unverdicted · novelty 7.0

GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Graphical einops: bridging tensor networks and computation graphs

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Introduces a graphical calculus with nested graded tubes bridging tensor networks and computation graphs for einops, turning equivariance proofs into diagrammatic derivations and enabling efficient sparse attention via mask preprocessing.

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency points for high-resolution image generation on mobile devices.

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

q-bio.NC · 2026-04-20 · unverdicted · novelty 6.0

OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

cs.LG · 2026-03-23 · unverdicted · novelty 6.0 · 2 refs

iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.

Positional Encoding via Token-Aware Phase Attention

cs.CL · 2025-09-16 · unverdicted · novelty 6.0

TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

cs.CV · 2024-10-14 · unverdicted · novelty 6.0

Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

cs.LG · 2026-06-23 · unverdicted · novelty 5.0

ATMA combines polar attention (direction + bounded-magnitude channels) with gated-delta recurrent compression to achieve length-invariant perplexity and >90% needle retrieval at 64K tokens after 2K training.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · none · ref 37

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
