
hub Canonical reference

Attention Is All You Need

Canonical reference. 78% of citing Pith papers cite this work as background.

79 Pith papers citing it
Background 78% of classified citations

hub tools

citation-role summary

background 8 · other 1

citation-polarity summary

claims ledger

  • other: expectation conditional on all workers computing the stochastic gradients at iteration k. Theorem E.1. Let Assumptions 1.1, 1.2, 1.3 and 1.6 be satisfied and suppose that C_ij satisfies Definition 1.5 with parameter ω, C_{s,ij} satisfies Definition 1.5 with parameter ω_s. Consider Heterogeneous-Time Inkheart SGD with arbitrarily m_i, b_i, ℓ_i > 0 and weights β_i (not necessarily defined as in (17); it is sufficient to assume that β_i ∈ [0,1] and Σ_{i=1}^n β_i = 1) are chosen to satisfy 8 Σ_{i=1}^n β_i^2 ω
  • background: 1 and 4.2. 3.2 Masked diffusion objective. We train f_D as a masked denoiser. We sample a mask set M ⊆ {1, …, n} and construct a corrupted input x̃ by replacing x_i with ⟨M⟩ for i ∈ M. The diffusion model predicts the masked tokens conditioned on x̃, and we optimize cross-entropy only on masked positions: L_diff(θ) = E_{x,M} [ Σ_{i∈M} CE(p_θ(· | x̃)_i, x_i) ] (1), where p_θ(· | x̃)_i is the diffusion model's predicted distribution at position i. This denoising objective is the shared training loss used across t
  • background: For notational convenience, we also introduce evaluation operators: for functions f: Z → R, g: {0,1} × X → R, and h: X → R, we use the operator notation E_z f = f(z), E_{a,x} g = g(a, x), and E_x h = h(x). We recall the expectation operators P_{Y|A,X}, P_{A|X}, and P_X introduced in (7). For f: Z → R, a ∈ {0,1}, x ∈ X, and z′ ∈ Z′, we define the shorthand notation (P_{Y|a,X} f)(z′) := (P_{Y|A,X} f)(x′, a, y′), (P_{Y|a,x} f)(z′) := (P_{Y|A,X} f)(x, a, y′) (33). Since the right-hand side of the first equality is invariant in (a′, y′)
  • background: 2 The LLM's Action-Selection Interface as a Linear Bandit. The bandit perspective above requires a contextual representation of each decision state and a scoring rule over candidate actions. We now show that the frozen LLM already provides both. At the token position immediately preceding the next action decision, the model's last-layer hidden state u_{e,t} = h_LLM(q_e, h_{e,t−1}) ∈ R^d (1) encodes the task semantics and the full trajectory so far. Intuitively, this vector is the
  • background: Every question is designed to demand reasoning over accumulated temporal evidence across the long video, rather than single-clip retrieval or surface-level pattern matching. Formally, given an observed video frame sequence V_{≤t_q} = {v_1, …, v_{t_q}} up to query time t_q and a query q, a model f_θ^{(m)} produces an answer ŷ^{(m)} = f_θ^{(m)}(q, M^{(m)}_{≤t_q}(V_{≤t_q})) (1), where m ∈ {entity, event, behavior} denotes the memory type, M^{(m)}_{≤t_q} is the corresponding structured memory built from past observations V
  • background: i = Norm(ū_i), so attention computes the unnormalized version of the exact MLE. Robust reweighting. As in isotropic RFA, we introduce robust M-estimation weights to down-weight inconsistent observations. Here, robustness is applied to directional disagreement on the hypersphere through the angular distance d_ij^2: w_ij = (1 + d_ij^2/ν)^{−κ}, κ̃_ij = w_ij κ_ij (14). Geometric filtering update. We represent the RT filter state in eigenbasis coordinates z_{s,i} = m_i u_{z,i}, where the spherical geometry is exact.

co-cited works

polarities

background 7 · unclear 2

representative citing papers

Is Dimensionality a Barrier for Retrieval Models?

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.

Language Acquisition Device in Large Language Models

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

Pre-pretraining on MP-STRUCT matches k-Shuffle Dyck baselines in efficiency while adding human-like resistance to implausible languages and challenges the need for C-RASP definability in effective PPT languages.

Online Learning-to-Defer with Varying Experts

stat.ML · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Presents the first online L2D algorithm for multiclass classification with bandit feedback and varying experts, achieving O((n+n_e)T^{2/3}) regret in general and O((n+n_e)√T) under low noise.

Sinkhorn Treatment Effects: A Causal Optimal Transport Measure

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

The Sinkhorn treatment effect is a new entropic optimal transport measure of divergence between counterfactual distributions that admits first- and second-order pathwise differentiability, debiased estimators, and asymptotically valid tests for distributional treatment effects.

How Language Models Process Negation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

LLMs process negation using both attention-based suppression and constructive representation mechanisms (construction dominant), with late-layer attention shortcuts explaining poor accuracy on negation tasks.
