{"work":{"id":"751efe07-5e91-415c-b3d1-f4734aa26960","openalex_id":null,"doi":null,"arxiv_id":null,"raw_key":"raw:a4e105760402323d556ab2be","title":"Attention is all you need.Advances in neural information processing systems, 30, 2017","authors":null,"authors_text":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin","year":2017,"venue":null,"abstract":null,"external_url":null,"cited_by_count":null,"metadata_source":"raw_reference","metadata_fetched_at":"2026-05-27T10:19:03.212133+00:00","pith_arxiv_id":null,"created_at":"2026-05-11T02:39:44.967088+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Attention is all you need.Advances in neural information processing systems, 30","render_title":"Attention is all you need.Advances in neural information processing systems, 30"},"hub":{"state":{"work_id":"751efe07-5e91-415c-b3d1-f4734aa26960","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":138,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2022-05-27T17:53:09+00:00","last_pith_cited_at":"2026-05-22T10:01:28+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-30T09:59:36.315796+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":17},{"context_role":"method","n":12},{"context_role":"baseline","n":2},{"context_role":"dataset","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":17},{"context_polarity":"use_method","n":12},{"context_polarity":"baseline","n":2},{"context_polarity":"unclear","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Attention is all you need.Advances in neural information processing systems, 30","claims":[{"claim_text":"and block-sparse FlashAttentionenable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classiﬁcation) and entirely new capabilities: the ﬁrst Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy). 1 Introduction Transformer models [82] have emerged as the most widely used architecture in applications such a","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Specific Log Tasks (Level Prediction, Defect Detection, Repair, Quality) Heng et al. [51] Level Pred FT / Zero Snippet Level - CodeLlama- 13B BERT, RoBERTa Acc, AUC OmniLLP [121] Level Pred RAG + ICLLog + Neigh- bors Level Cluster Retrieval CodeXEmbed - ARI, AUC LogUpdater [213] Log Repair Agent Defect Log Fixed Log Defect TaxonomyCodeT5+, GPT- 4o Claude3.5 BLEU, ROUGE Defects4Log [175] Detection ICL + CoTMethod + LogDefect Type Defect Patterns DeepSeek-R1 GPT-4o Precision, Recall LOGIMPROVER [8","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The emergence of language foundation models can be largely traced back to the introduction of the Transformer architecture [42], which leverages multi-head self-attention and cross-attention mechanisms for scalable sequence modeling, and adopts an encoder-decoder structure for effective sequence-to-sequence generation. Building on this architecture, BERT [43] pretrains a bidirectional Transformer encoder in a self-supervised manner using masked language modeling and next sentence prediction obje","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks × 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, Look- When is more efficient still at 6.7× faster than InternVideo2-B at equal accuracy.1 1 Introduction: Video computation takes too much time and space Transformers [1] have revolutionized video modeling [ 2, 3, 4, 5]. They split videos into several thousand or mo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"wi for each paper by log-normalizing and aggregating its GitHub stars, citation counts, influential citations, and Altmetric score. The distribution of the four log-normalized ground-truth impact metrics utilized in the dataset is shown in Figure 4. Baselines.We benchmark FAME against three distinct categories of evaluators. First, we evaluate ML models, including XGBoost [9], SVR [11, 27], Transformer [31] and TGCN [39], trained directly 5 Table 1: Prospective forecasting performance across an ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In theory, multi-party computation (MPC) protocols enable collaborative, private inference over time-series data while protecting users' privacy [3], e.g., Yao's garbled circuits [4] or additive secret sharing [5]. However, their runtime scales poorly as model size and input length increase, especially in higher latency settings. This poor scaling is particularly evident in transformer models [ 6], the premier model architecture for tasks that rely on time-series data [7, 8]. Attention layers ar","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Attention is all you need.Advances in neural information processing systems, 30 because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (16 contexts).","role_counts":[{"n":16,"context_role":"background"},{"n":11,"context_role":"method"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-22T12:53:40.425651+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"d919b3a4-50a1-409a-bb1c-f4bab6b13f20","orcid":null,"display_name":"Ashish Vaswani"},{"id":"d555cf92-dbb8-4cd6-b9cc-3a82d85183de","orcid":null,"display_name":"Noam Shazeer"},{"id":"57ff1fcd-d66b-4c9e-8bd1-0aa73a589c42","orcid":null,"display_name":"Niki Parmar"},{"id":"e63cf8df-4dd7-40a3-8738-9640bd710f58","orcid":null,"display_name":"Jakob Uszkoreit"},{"id":"e8a13cc0-055a-4a1b-b7e1-8ec3d9f5e9b6","orcid":null,"display_name":"Llion Jones"},{"id":"a216b659-3bfd-4f0a-9146-e7a32e5292c2","orcid":null,"display_name":"Aidan N Gomez"}]},"error":null,"updated_at":"2026-05-22T12:53:42.477364+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-15T08:37:42.687070+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851","work_id":"82ba805b-3e59-43c6-b37f-3aa1940eea68","shared_citers":6},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":4},{"title":"Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32","work_id":"262300a3-c1d4-4d6e-ac74-8a4100dd12c8","shared_citers":4},{"title":"Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916","work_id":"115823a2-8918-4227-8872-3d0a36ff07a9","shared_citers":4},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":3},{"title":"Finite scalar quantization: Vq-vae made simple","work_id":"34dd22bc-0de9-4e11-9a1b-1358e10fbfe1","shared_citers":3},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":3},{"title":"Learning transferable visual models from natural language supervision","work_id":"ad3e05b3-af3a-4fa2-ab30-c45f9f403277","shared_citers":3},{"title":"Lora: Low-rank adaptation of large language models.ICLR, 1(2):3","work_id":"421353f1-f10a-4559-8de6-966e7d699eaf","shared_citers":3},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":3},{"title":"Swin transformer: Hierarchical vision transformer using shifted windows","work_id":"d577574a-ec23-4a36-89fc-494c4f56328e","shared_citers":3},{"title":"Ai-researcher: Autonomous scientific innovation","work_id":"3845f0f0-08d4-4650-b390-6bfdd269f79a","shared_citers":2},{"title":"Attention residuals.arXiv preprint arXiv:2603.15031","work_id":"7356447b-f55f-41d1-b128-b3a54c4c879d","shared_citers":2},{"title":"Autogen: Enabling next-gen llm applications via multi-agent conversations","work_id":"e57ce12a-7d16-4d21-a253-28bdb8094e1a","shared_citers":2},{"title":"Bert: Pre-training of deep bidirectional transformers for language understanding","work_id":"1bdc18bb-17d9-44c5-8f2b-ca096572a66b","shared_citers":2},{"title":"Cambridge university press","work_id":"1e4dd1c5-2683-4e02-8324-f7fe359cdc17","shared_citers":2},{"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","shared_citers":2},{"title":"Classifier-free diffusion guidance","work_id":"00335b93-6180-4719-8268-4de5322f9961","shared_citers":2}],"time_series":[{"n":2,"year":2022},{"n":1,"year":2023},{"n":35,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"use_method","paper_title":"Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems","primary_cat":"cs.MA","context_text":"would differ in the network structure (i.e., G̸=G ′). Then, there exists (f, g) such that the population trajectories({x i(t+k)}, G(t+k)and({x ′ i(t+k)}, G ′(t+k)diverge for allk >0. Single-task agentic systems either treat observations as independent and identically distributed (i.i.d.) [93] or the dependencies are modeled globally through a full attention mechanism [ 95]. Neither captures the topology-constrained local observability that characterizes real social systems. In a MASS, G is an irreducible determinant of population-level outcome. Formally, agent i at time t observes only: Mi(t) ={m j(t)|j∈N(i)} Empirically, information cascade size and reach depends not on independent sharing behavior, but on the network connections a message travels, with highly connected and central agents playing","citing_arxiv_id":"2605.07069"}]},"error":null,"updated_at":"2026-05-15T08:37:46.256931+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-15T08:37:46.201570+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Attention is all you need.Advances in neural information processing systems, 30","claims":[{"claim_text":"and block-sparse FlashAttentionenable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classiﬁcation) and entirely new capabilities: the ﬁrst Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy). 1 Introduction Transformer models [82] have emerged as the most widely used architecture in applications such a","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Specific Log Tasks (Level Prediction, Defect Detection, Repair, Quality) Heng et al. [51] Level Pred FT / Zero Snippet Level - CodeLlama- 13B BERT, RoBERTa Acc, AUC OmniLLP [121] Level Pred RAG + ICLLog + Neigh- bors Level Cluster Retrieval CodeXEmbed - ARI, AUC LogUpdater [213] Log Repair Agent Defect Log Fixed Log Defect TaxonomyCodeT5+, GPT- 4o Claude3.5 BLEU, ROUGE Defects4Log [175] Detection ICL + CoTMethod + LogDefect Type Defect Patterns DeepSeek-R1 GPT-4o Precision, Recall LOGIMPROVER [8","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The emergence of language foundation models can be largely traced back to the introduction of the Transformer architecture [42], which leverages multi-head self-attention and cross-attention mechanisms for scalable sequence modeling, and adopts an encoder-decoder structure for effective sequence-to-sequence generation. Building on this architecture, BERT [43] pretrains a bidirectional Transformer encoder in a self-supervised manner using masked language modeling and next sentence prediction obje","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks × 2 settings) and roughly matches on 3. In accuracy-throughput, measuring time in practice, Look- When is more efficient still at 6.7× faster than InternVideo2-B at equal accuracy.1 1 Introduction: Video computation takes too much time and space Transformers [1] have revolutionized video modeling [ 2, 3, 4, 5]. They split videos into several thousand or mo","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"wi for each paper by log-normalizing and aggregating its GitHub stars, citation counts, influential citations, and Altmetric score. The distribution of the four log-normalized ground-truth impact metrics utilized in the dataset is shown in Figure 4. Baselines.We benchmark FAME against three distinct categories of evaluators. First, we evaluate ML models, including XGBoost [9], SVR [11, 27], Transformer [31] and TGCN [39], trained directly 5 Table 1: Prospective forecasting performance across an ","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"In theory, multi-party computation (MPC) protocols enable collaborative, private inference over time-series data while protecting users' privacy [3], e.g., Yao's garbled circuits [4] or additive secret sharing [5]. However, their runtime scales poorly as model size and input length increase, especially in higher latency settings. This poor scaling is particularly evident in transformer models [ 6], the premier model architecture for tasks that rely on time-series data [7, 8]. Attention layers ar","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Attention is all you need.Advances in neural information processing systems, 30 because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (16 contexts).","role_counts":[{"n":16,"context_role":"background"},{"n":11,"context_role":"method"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-22T12:53:42.485581+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Attention is all you need.Advances in neural information processing systems, 30","claims":[{"claim_text":"and block-sparse FlashAttentionenable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classiﬁcation) and entirely new capabilities: the ﬁrst Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy). 1 Introduction Transformer models [82] have emerged as the most widely used architecture in applications such a","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"would differ in the network structure (i.e., G̸=G ′). Then, there exists (f, g) such that the population trajectories({x i(t+k)}, G(t+k)and({x ′ i(t+k)}, G ′(t+k)diverge for allk >0. Single-task agentic systems either treat observations as independent and identically distributed (i.i.d.) [93] or the dependencies are modeled globally through a full attention mechanism [ 95]. Neither captures the topology-constrained local observability that characterizes real social systems. In a MASS, G is an ir","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Attention is all you need.Advances in neural information processing systems, 30 because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-15T08:37:46.262201+00:00"}},"summary":{"title":"Attention is all you need.Advances in neural information processing systems, 30","claims":[{"claim_text":"and block-sparse FlashAttentionenable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classiﬁcation) and entirely new capabilities: the ﬁrst Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy). 1 Introduction Transformer models [82] have emerged as the most widely used architecture in applications such a","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"would differ in the network structure (i.e., G̸=G ′). Then, there exists (f, g) such that the population trajectories({x i(t+k)}, G(t+k)and({x ′ i(t+k)}, G ′(t+k)diverge for allk >0. Single-task agentic systems either treat observations as independent and identically distributed (i.i.d.) [93] or the dependencies are modeled globally through a full attention mechanism [ 95]. Neither captures the topology-constrained local observability that characterizes real social systems. In a MASS, G is an ir","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Attention is all you need.Advances in neural information processing systems, 30 because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (1 contexts).","role_counts":[{"n":1,"context_role":"background"},{"n":1,"context_role":"method"}]},"graph":{"co_cited":[{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":7},{"title":"Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851","work_id":"82ba805b-3e59-43c6-b37f-3aa1940eea68","shared_citers":6},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":5},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":4},{"title":"Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32","work_id":"262300a3-c1d4-4d6e-ac74-8a4100dd12c8","shared_citers":4},{"title":"Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916","work_id":"115823a2-8918-4227-8872-3d0a36ff07a9","shared_citers":4},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":3},{"title":"Finite scalar quantization: Vq-vae made simple","work_id":"34dd22bc-0de9-4e11-9a1b-1358e10fbfe1","shared_citers":3},{"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","shared_citers":3},{"title":"Learning transferable visual models from natural language supervision","work_id":"ad3e05b3-af3a-4fa2-ab30-c45f9f403277","shared_citers":3},{"title":"Lora: Low-rank adaptation of large language models.ICLR, 1(2):3","work_id":"421353f1-f10a-4559-8de6-966e7d699eaf","shared_citers":3},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":3},{"title":"Swin transformer: Hierarchical vision transformer using shifted windows","work_id":"d577574a-ec23-4a36-89fc-494c4f56328e","shared_citers":3},{"title":"Ai-researcher: Autonomous scientific innovation","work_id":"3845f0f0-08d4-4650-b390-6bfdd269f79a","shared_citers":2},{"title":"Attention residuals.arXiv preprint arXiv:2603.15031","work_id":"7356447b-f55f-41d1-b128-b3a54c4c879d","shared_citers":2},{"title":"Autogen: Enabling next-gen llm applications via multi-agent conversations","work_id":"e57ce12a-7d16-4d21-a253-28bdb8094e1a","shared_citers":2},{"title":"Bert: Pre-training of deep bidirectional transformers for language understanding","work_id":"1bdc18bb-17d9-44c5-8f2b-ca096572a66b","shared_citers":2},{"title":"Cambridge university press","work_id":"1e4dd1c5-2683-4e02-8324-f7fe359cdc17","shared_citers":2},{"title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","work_id":"2661b9a6-25cc-41a1-8100-612d2b801289","shared_citers":2},{"title":"Classifier-free diffusion guidance","work_id":"00335b93-6180-4719-8268-4de5322f9961","shared_citers":2}],"time_series":[{"n":2,"year":2022},{"n":1,"year":2023},{"n":35,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"use_method","paper_title":"Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems","primary_cat":"cs.MA","context_text":"would differ in the network structure (i.e., G̸=G ′). Then, there exists (f, g) such that the population trajectories({x i(t+k)}, G(t+k)and({x ′ i(t+k)}, G ′(t+k)diverge for allk >0. Single-task agentic systems either treat observations as independent and identically distributed (i.i.d.) [93] or the dependencies are modeled globally through a full attention mechanism [ 95]. Neither captures the topology-constrained local observability that characterizes real social systems. In a MASS, G is an irreducible determinant of population-level outcome. Formally, agent i at time t observes only: Mi(t) ={m j(t)|j∈N(i)} Empirically, information cascade size and reach depends not on independent sharing behavior, but on the network connections a message travels, with highly connected and central agents playing","citing_arxiv_id":"2605.07069"}]},"authors":[{"id":"a216b659-3bfd-4f0a-9146-e7a32e5292c2","orcid":null,"display_name":"Aidan N Gomez","source":"manual","import_confidence":0.72},{"id":"d919b3a4-50a1-409a-bb1c-f4bab6b13f20","orcid":null,"display_name":"Ashish Vaswani","source":"manual","import_confidence":0.72},{"id":"e63cf8df-4dd7-40a3-8738-9640bd710f58","orcid":null,"display_name":"Jakob Uszkoreit","source":"manual","import_confidence":0.72},{"id":"e8a13cc0-055a-4a1b-b7e1-8ec3d9f5e9b6","orcid":null,"display_name":"Llion Jones","source":"manual","import_confidence":0.72},{"id":"57ff1fcd-d66b-4c9e-8bd1-0aa73a589c42","orcid":null,"display_name":"Niki Parmar","source":"manual","import_confidence":0.72},{"id":"d555cf92-dbb8-4cd6-b9cc-3a82d85183de","orcid":null,"display_name":"Noam Shazeer","source":"manual","import_confidence":0.72}]}}