Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Pith reviewed 2026-05-22 01:29 UTC · model grok-4.3
The pith
Smoothing token embeddings by semantic similarity improves diffusion text generation quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Smoothie applies progressive smoothing to token embeddings based on semantic similarity to create a diffusion process that removes information gradually while supporting natural decoding. This bridges the gap between Gaussian diffusion in continuous spaces, which keeps semantic relations but complicates token recovery, and categorical simplex diffusion, which stays discrete but ignores token similarities. On multiple sequence-to-sequence and unconditional text generation benchmarks, Smoothie yields better generation quality than existing diffusion-based models. Ablation studies confirm that the proposed diffusion space outperforms both the standard embedding space and the categorical simplex.
What carries the argument
Progressive smoothing of token embeddings based on semantic similarity, which defines the diffusion space and controls how information is removed step by step.
If this is right
- Higher generation quality than prior diffusion models on sequence-to-sequence tasks.
- Higher generation quality than prior diffusion models on unconditional text generation tasks.
- The smoothing-based diffusion space outperforms the standard embedding space.
- The smoothing-based diffusion space outperforms the categorical simplex space.
Where Pith is reading between the lines
- The same smoothing idea could be tested on other discrete sequences such as code or structured data where token similarities matter.
- The method may allow diffusion models to use fewer steps for sampling if the gradual semantic smoothing speeds convergence.
- Combining Smoothie with non-diffusion text models might reduce the usual quality gap between diffusion and autoregressive approaches.
Load-bearing premise
That progressively smoothing token embeddings based on semantic similarity enables gradual information removal while maintaining a natural decoding process without introducing artifacts that degrade final generation quality.
What would settle it
The claim would be undermined if replacing the semantic-similarity smoothing with uniform or random smoothing, in the same experimental setup, produced equal or higher quality scores on the reported generation tasks.
Original abstract
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence and unconditional generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. The code is available at https://github.com/ashaba1in/smoothie.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Smoothie, a diffusion model for text generation that performs progressive smoothing of token embeddings based on semantic similarity (via convex combinations with k-nearest neighbors in embedding space). The forward process gradually removes information while respecting semantic relations, the reverse process trains a standard denoising network to recover the original embedding, and final decoding uses nearest-neighbor lookup. Experiments on sequence-to-sequence and unconditional generation tasks report that Smoothie outperforms prior diffusion-based text models, with ablations demonstrating that the proposed smoothing space yields better results than both raw embedding space and the categorical simplex.
Significance. If the reported gains hold under the described controls, the work offers a practical bridge between continuous semantic latent spaces and discrete token constraints in diffusion models for text. The direct ablation comparisons to raw embeddings and the simplex, use of standard metrics (BLEU, perplexity, MAUVE), re-implementation of baselines under matched compute, and public code release are positive features that would strengthen the contribution if the quantitative improvements are robust.
Major comments (1)
- [§4] Experiments: the central claim of consistent outperformance rests on the ablation tables and main results; the manuscript should explicitly report effect sizes, standard deviations across runs, and statistical significance tests for the BLEU/perplexity/MAUVE gains to confirm they are not attributable to variance or implementation differences.
Minor comments (3)
- [§3.1] The exact schedule for the smoothing parameter (how k or the convex weight evolves with diffusion timestep) could be stated more explicitly, perhaps with a short pseudocode block, to aid reproducibility.
- [Figure 2] Caption: clarify whether the visualized trajectories are from the forward or reverse process and label the axes consistently with the embedding-space notation used in the text.
- [§2] Related work: the discussion of prior continuous-latent diffusion methods could include a brief comparison table of their decoding strategies versus the nearest-neighbor lookup used here.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the major comment on the experimental reporting below and will incorporate the requested statistical details in the revised manuscript.
Point-by-point responses
- Referee: [§4] Experiments: the central claim of consistent outperformance rests on the ablation tables and main results; the manuscript should explicitly report effect sizes, standard deviations across runs, and statistical significance tests for the BLEU/perplexity/MAUVE gains to confirm they are not attributable to variance or implementation differences.
  Authors: We agree that including standard deviations, effect sizes, and statistical significance tests would strengthen the presentation of our results. In the revised version, we will rerun the main experiments and key ablations with multiple random seeds (at least 3–5 runs per setting) and report means with standard deviations. We will also add effect sizes (e.g., Cohen’s d or mean differences with 95% confidence intervals) and perform paired statistical significance tests (e.g., t-tests or Wilcoxon tests) between Smoothie and the strongest baselines, marking results that reach p < 0.05. These additions will be placed in the main results tables and ablation tables in §4.
  Revision: yes
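The reporting plan in the authors' response can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the per-seed scores are made-up placeholders (not results from the paper), the function names are hypothetical, and an exact sign-flip permutation test is used in place of the t-test or Wilcoxon test named in the response, because it needs only the standard library.

```python
import itertools
import statistics


def paired_summary(a, b):
    """Mean difference, its standard deviation, and paired Cohen's d (d_z)."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff, sd_diff, mean_diff / sd_diff


def sign_flip_p_value(a, b):
    """Exact two-sided paired permutation test over all sign flips."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


# Placeholder scores for five seeds (illustrative only, NOT from the paper).
model_a = [30.0, 31.0, 32.0, 33.0, 34.0]
model_b = [29.0, 29.5, 30.0, 31.5, 32.0]

mean_diff, sd_diff, d_z = paired_summary(model_a, model_b)
p_value = sign_flip_p_value(model_a, model_b)
print(f"mean diff = {mean_diff:.2f} +/- {sd_diff:.2f}, d_z = {d_z:.2f}, p = {p_value}")
```

With n paired runs, the smallest two-sided p-value this exact test can produce is 2/2^n. For the five placeholder seeds above, where every difference is positive, that gives p = 2/32 = 0.0625, so under this particular test five seeds cannot reach the p < 0.05 threshold on their own.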
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper explicitly defines the forward diffusion process as a convex combination of token embeddings with their k-nearest neighbors in embedding space based on semantic similarity. The reverse process employs a standard denoising network trained to predict the original embedding, and final decoding uses nearest-neighbor lookup. Ablation studies directly compare performance against raw embedding space and categorical simplex without any reduction of the claimed gains to a fitted parameter or self-referential definition. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are present in the provided description of the method. The experimental results on BLEU, perplexity, and MAUVE metrics with re-implemented baselines under matched compute provide independent support rather than circular construction.
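The forward step and decoding rule described in the rationale above can be illustrated with a small sketch. It follows this page's description (a convex combination with k-nearest neighbors, then nearest-neighbor lookup) and is not the authors' implementation: the uniform neighbor weights, the mixing weight `alpha`, the function names, and the toy embedding table are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(emb, idx, k):
    """Indices of the k tokens closest to token idx (excluding itself)."""
    others = [j for j in range(len(emb)) if j != idx]
    return sorted(others, key=lambda j: sq_dist(emb[idx], emb[j]))[:k]


def smooth_embedding(emb, idx, k, alpha):
    """Convex combination: (1 - alpha) * own embedding + alpha * neighbor mean.

    alpha in [0, 1] is a hypothetical smoothing weight; alpha = 0 returns the
    original embedding, alpha = 1 returns the mean of the k nearest neighbors.
    """
    nbrs = knn(emb, idx, k)
    dim = len(emb[idx])
    nbr_mean = [sum(emb[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]
    return [(1 - alpha) * emb[idx][d] + alpha * nbr_mean[d] for d in range(dim)]


def decode(emb, vec):
    """Nearest-neighbor lookup: index of the embedding closest to vec."""
    return min(range(len(emb)), key=lambda j: sq_dist(vec, emb[j]))


# Toy embedding table: tokens 0, 1, 2 are close together, token 3 is far away.
EMB = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

lightly = smooth_embedding(EMB, 0, k=2, alpha=0.2)
heavily = smooth_embedding(EMB, 0, k=2, alpha=0.9)
print(lightly, decode(EMB, lightly))
print(heavily, decode(EMB, heavily))
```

In this toy setup the smoothed vector drifts toward the token's semantic neighbors as `alpha` grows and never toward the distant token 3, which is the property the rationale attributes to similarity-based smoothing.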
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Token embeddings encode semantic similarity that can be used to define a smoothing process.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We represent each token wyi ... with a vector of negative squared Euclidean distances ... Dt = Dt/σt² + δε ... pt = softmax(Dt) ... generalization of simplex-based diffusion"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Our choice of the Euclidean distance is based on the Euclidean semantic space hypothesis"
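The first quoted passage above (negative squared Euclidean distances, scaling by σt², additive noise δε, then softmax) can be made concrete with a short sketch. The quote is elided, so this is one reading of it under stated assumptions rather than the paper's exact formulation: the clean distance vector is divided by `sigma_t ** 2`, Gaussian noise scaled by `delta` is added, and softmax turns the result into a distribution over the vocabulary. The function names, the toy embeddings, and the parameter values are hypothetical.

```python
import math
import random


def neg_sq_dists(emb, idx):
    """Negative squared Euclidean distance from token idx to every token."""
    e = emb[idx]
    return [-sum((a - b) ** 2 for a, b in zip(e, other)) for other in emb]


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [v / z for v in exps]


def smooth_token(emb, idx, sigma_t, delta=0.0, rng=None):
    """Assumed reading of the quoted update: softmax(D / sigma_t^2 + delta * eps)."""
    rng = rng or random.Random(0)
    d0 = neg_sq_dists(emb, idx)
    d_t = [d / sigma_t ** 2 + delta * rng.gauss(0.0, 1.0) for d in d0]
    return softmax(d_t)


# Toy embedding table: tokens 0 and 1 are near each other, token 2 is far away.
EMB = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]

sharp = smooth_token(EMB, 0, sigma_t=0.1)   # small sigma: mass stays on near tokens
flat = smooth_token(EMB, 0, sigma_t=10.0)   # large sigma: close to uniform
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A small `sigma_t` keeps the probability mass on the original token and its close neighbor, while a large `sigma_t` flattens the distribution toward uniform. This is the sense in which the passage describes the approach as a generalization of simplex-based diffusion that still accounts for token similarity.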
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
  Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
Limitations (from the paper's Appendix A)
Pre-trained embeddings. Our proposed method relies on a pre-trained embedding matrix E from the BERT model. While this choice simplifies the training process and improves i...