Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

Danilo Comminiello; Eleonora Grassucci; Sergio Barbarossa

arxiv: 2306.04321 · v2 · pith:NUPZGYGInew · submitted 2023-06-07 · 💻 cs.AI · cs.MM

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

Eleonora Grassucci , Sergio Barbarossa , Danilo Comminiello This is my paper

Pith reviewed 2026-05-24 08:41 UTC · model grok-4.3

classification 💻 cs.AI cs.MM

keywords semantic communicationdiffusion modelsimage generationsemantic preservationbandwidth reductionnoisy channelsgenerative modelsAI communications

0 comments

The pith

A diffusion model can synthesize high-quality images that preserve semantic meaning from only highly compressed and noisy semantic data sent over a channel.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes sending only denoised semantic information instead of full bit sequences, then using a diffusion model to regenerate the original scene at the receiver. The model is conditioned on this partial data through spatially-adaptive normalizations so that the output remains consistent with the transmitted semantics. This setup is tested across multiple noisy-channel scenarios and shown to keep objects, locations, and depths recognizable even when conventional recovery would fail. A reader would care because the approach directly addresses bandwidth limits in visual communication while still delivering usable meaning rather than exact pixels.

Core claim

By transmitting highly compressed semantic information only and guiding a diffusion model with spatially-adaptive normalizations from the denoised received signal, complex scenes can be synthesized that preserve key semantic features without recovering the original bits or requiring extra data or post-processing; the method outperforms prior solutions by maintaining recognizable objects, locations, and depths under extreme channel noise.

What carries the argument

Diffusion-guided synthesis conditioned on denoised semantic information via spatially-adaptive normalizations, which steers generation to maintain semantic consistency while allowing bandwidth reduction.

If this is right

Bandwidth is reduced because only compressed semantic descriptors need to be sent rather than full bit streams.
Image quality and semantic fidelity remain high even when the received signal is severely degraded by noise.
Objects, spatial layout, and depth remain recognizable without any bit-level recovery or additional side information.
The same framework applies to multiple communication scenarios without task-specific retraining beyond the diffusion conditioning step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to video by conditioning successive frames on a shared semantic stream to enforce temporal consistency.
Goal-oriented communication becomes feasible if the diffusion guidance is further shaped by a downstream task loss instead of pure reconstruction.
Hybrid systems might combine this generative recovery with occasional high-fidelity patches when semantic uncertainty exceeds a threshold.

Load-bearing premise

The diffusion model can reliably produce semantically faithful complex scenes from nothing more than the received denoised semantic information and the normalizations derived from it.

What would settle it

Quantitative evaluation on a held-out test set in which channel noise is increased until the generated images lose measurable semantic fidelity, for example by dropping object recognition accuracy or depth estimation error below a chosen threshold compared with the source.

Figures

Figures reproduced from arXiv: 2306.04321 by Danilo Comminiello, Eleonora Grassucci, Sergio Barbarossa.

**Figure 2.** Figure 2: Proposed generative semantic communication framework. The sender transmits one-hot, [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Our method results for different PSNR values of the communication channel. The [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparisons among most performing models (CC-FPSE [ [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Generated samples from ablation studies with [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Encoder and decoder blocks of our U-Net-based semantic diffusion model. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Example of how the JPEG compression affects the original image and a sample of the [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Other comparisons for transmitted semantics and a fixed PSNR value of [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Generatd samples of our method from the transmitted semantics under different PSNR [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Semantic communication is expected to be one of the cores of next-generation AI-based communications. One of the possibilities offered by semantic communication is the capability to regenerate, at the destination side, images or videos semantically equivalent to the transmitted ones, without necessarily recovering the transmitted sequence of bits. The current solutions still lack the ability to build complex scenes from the received partial information. Clearly, there is an unmet need to balance the effectiveness of generation methods and the complexity of the transmitted information, possibly taking into account the goal of communication. In this paper, we aim to bridge this gap by proposing a novel generative diffusion-guided framework for semantic communication that leverages the strong abilities of diffusion models in synthesizing multimedia content while preserving semantic features. We reduce bandwidth usage by sending highly-compressed semantic information only. Then, the diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information. We prove, through an in-depth assessment of multiple scenarios, that our method outperforms existing solutions in generating high-quality images with preserved semantic information even in cases where the received content is significantly degraded. More specifically, our results show that objects, locations, and depths are still recognizable even in the presence of extremely noisy conditions of the communication channel. The code is available at https://github.com/ispamm/GESCO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies diffusion models to reconstruct images from compressed noisy semantic features in communication, but semantic fidelity claims need quantitative task-based validation beyond visuals.

read the letter

The core contribution is a framework that transmits only highly compressed semantic information, denoises it, and feeds it into a diffusion model via spatially-adaptive normalizations to synthesize the scene at the receiver. This moves beyond simple bit recovery or earlier generative semantic comm methods by targeting complex scene reconstruction under bandwidth limits and channel noise. The public code is a plus for anyone wanting to test it. The approach is plausible on paper and directly tackles the goal of semantic equivalence rather than exact bit matching. The main limitation is the strength of the evidence for preserved semantics. The abstract states that objects, locations, and depths remain recognizable even in extreme noise, yet the evaluation appears to rest on image quality metrics and qualitative examples. Those do not directly measure whether the generated outputs support downstream tasks like detection or segmentation at different SNRs. If the paper does not report such metrics or clear baseline comparisons with error bars, the robustness claim stays harder to verify than the abstract presents. Minor gaps include missing details on dataset splits and exact experimental controls, which a referee could clarify. This work sits at the overlap of generative models and AI-native wireless systems. Readers already working on semantic communication or diffusion applications in constrained channels would get the most from it. The idea is coherent enough on its own terms to merit peer review so the experimental details can be examined and the semantic preservation tested more rigorously.

Referee Report

2 major / 1 minor

Summary. The paper proposes a generative semantic communication system that transmits only highly compressed semantic features, denoises them at the receiver, and conditions a diffusion model via spatially-adaptive normalizations to synthesize images that preserve semantic content (objects, locations, depths) even under severe channel noise. It claims to outperform prior semantic-communication baselines across multiple scenarios while reducing bandwidth, with publicly released code.

Significance. If the central empirical claims hold under rigorous semantic-fidelity testing, the framework would demonstrate a practical route to goal-oriented image regeneration from minimal transmitted data, directly addressing bandwidth constraints in next-generation semantic communications. The open-source implementation is a clear strength that supports reproducibility and extension.

major comments (2)

[Abstract and Evaluation section] Abstract and Evaluation section: the claim that 'objects, locations, and depths are still recognizable' under extreme noise is load-bearing for the paper's contribution, yet the reported assessment relies on standard generative metrics or qualitative figures without downstream semantic-task metrics (e.g., object-detection mAP or segmentation IoU) computed on the synthesized outputs across SNR regimes. This leaves the semantic-preservation assertion unverified.
[Method section] Method section (diffusion conditioning): the description of how spatially-adaptive normalizations applied to the denoised semantic features alone enable reliable scene synthesis does not include an ablation or quantitative test isolating the contribution of the conditioning mechanism versus any implicit priors, which is required to substantiate the 'no additional transmitted data' bandwidth claim.

minor comments (1)

[Abstract] The abstract states 'in-depth assessment across scenarios' but omits explicit listing of baselines, datasets, and exact quantitative metrics; adding a concise table or paragraph with these details would improve clarity without altering the technical contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below, indicating the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [Abstract and Evaluation section] Abstract and Evaluation section: the claim that 'objects, locations, and depths are still recognizable' under extreme noise is load-bearing for the paper's contribution, yet the reported assessment relies on standard generative metrics or qualitative figures without downstream semantic-task metrics (e.g., object-detection mAP or segmentation IoU) computed on the synthesized outputs across SNR regimes. This leaves the semantic-preservation assertion unverified.

Authors: We agree that downstream task-specific metrics would provide stronger, more direct evidence for semantic preservation under noise. In the revised manuscript we will add object-detection mAP and segmentation IoU results computed on the synthesized images across the reported SNR regimes, using standard pre-trained models. These metrics will be presented alongside the existing generative metrics to verify the claim. revision: yes
Referee: [Method section] Method section (diffusion conditioning): the description of how spatially-adaptive normalizations applied to the denoised semantic features alone enable reliable scene synthesis does not include an ablation or quantitative test isolating the contribution of the conditioning mechanism versus any implicit priors, which is required to substantiate the 'no additional transmitted data' bandwidth claim.

Authors: We acknowledge the value of an explicit ablation to isolate the contribution of the spatially-adaptive normalization conditioning. The revised manuscript will include a quantitative ablation study comparing the full model against a variant without the conditioning (or with a simpler conditioning scheme), reporting the same generative metrics while keeping the transmitted semantic features identical. This will directly support the bandwidth claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical framework for generative semantic communication using diffusion models conditioned on compressed, denoised semantic features via spatially-adaptive normalizations. Performance claims rest on experimental evaluation across noisy channel scenarios rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text that would reduce the central results to inputs by construction. The assessment is described as in-depth but external to any internal definitional loop, making the work self-contained against the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are identifiable. The approach implicitly relies on standard assumptions of diffusion model training and semantic feature extraction but provides no details.

pith-pipeline@v0.9.0 · 5763 in / 1168 out tokens · 21990 ms · 2026-05-24T08:41:49.964486+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The diffusion model learns to synthesize semantic-consistent scenes through spatially-adaptive normalizations from such denoised semantic information.
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We train our semantic diffusion model directly with noisy semantic maps... PSNR values in {1,5,10,...}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Causal Diffusion Model for Video Reconstruction from Ultra-Low-Bitrate Representations
cs.CV 2026-02 unverdicted novelty 7.0

A causal diffusion model reconstructs videos from ultra-low-bitrate semantics and compressed frames using temporal distillation from a bidirectional teacher, outperforming prior baselines.
Intention-Aware Semantic Agent Communications for AI Glasses
eess.SP 2026-04 unverdicted novelty 5.0

An intention-aware semantic agent system for AI glasses reduces bandwidth by over 50% in simulations while preserving task performance through adaptive preprocessing guided by inferred user intentions.
Anchor-Aided Multi-User Semantic Communication with Adaptive Decoders
cs.IT 2026-04 unverdicted novelty 5.0

A multi-user semantic communication framework uses an anchor decoder symmetric to the encoder to overcome catastrophic forgetting, enabling one frozen encoder to support adaptive decoders for users with varying comput...
Anchor-Aided Multi-User Semantic Communication with Adaptive Decoders
cs.IT 2026-04 unverdicted novelty 5.0

A multi-user semantic communication framework employs an anchor decoder symmetric to the encoder to mitigate catastrophic forgetting, enabling sequential training and frozen-encoder adaptation for users with distinct ...
Lightweight Diffusion Models for Resource-Constrained Semantic Communication
eess.SP 2024-10 unverdicted novelty 5.0

Q-GESCO uses quantized diffusion models to regenerate images from semantic maps in noisy channels, matching full-precision performance with up to 75% memory and 79% FLOP reductions.
Training-Free Multi-User Generative Semantic Communications via Null-Space Diffusion Sampling
eess.SP 2024-05 unverdicted novelty 5.0

Introduces a null-space diffusion sampling method for training-free multi-user generative semantic communications in OFDMA systems.
Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications
eess.SP 2025-11 unverdicted novelty 3.0

The tutorial synthesizes diffusion model techniques for generative semantic communications to achieve high compression while preserving meaning in wireless transmission.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 6 Pith papers · 3 internal anchors

[1]

Semantic communications: Overview, open issues, and future research directions,

X. Luo, H.-H. Chen, and Q. Guo, “Semantic communications: Overview, open issues, and future research directions,” IEEE Wireless Comm., vol. 29, no. 1, pp. 210–219, 2022

work page 2022
[2]

Communication beyond transmitting bits: Semantics-guided source and channel coding,

J. Dai, P. Zhang, K. Niu, S. Wang, Z. Si, and X. Qin, “Communication beyond transmitting bits: Semantics-guided source and channel coding,” ArXiv preprint: ArXiv:2208.02481, 2021

work page arXiv 2021
[3]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020

work page 2020
[4]

Photorealistic text-to- image diffusion models with deep language understanding,

C. Saharia, C. W., S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, R. Gontijo- Lopes, B. K. Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to- image diffusion models with deep language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[5]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 10674–10685, 2021

work page 2021
[6]

Text-to-audio generation using instruction- tuned LLM and latent diffusion model,

D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction- tuned LLM and latent diffusion model,”ArXiv preprint: ArXiv:2304.13731, 2023

work page arXiv 2023
[7]

CogVideo: Large-scale pretraining for text- to-video generation via transformers,

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “CogVideo: Large-scale pretraining for text- to-video generation via transformers,” inInternational Conference on Learning Representations (ICLR), 2023

work page 2023
[8]

Semantic image synthesis via diffusion models,

W. Wang, J. Bao, W.-G. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, “Semantic image synthesis via diffusion models,”ArXiv preprint: ArXiv:2207.00050, 2022

work page arXiv 2022
[9]

Freestyle layout-to-image synthesis,

H. Xue, Z. H. Feng, Q. Sun, L. Song, and W. Zhang, “Freestyle layout-to-image synthesis,” ArXiv preprint ArXiv:2303.14412, 2023

work page arXiv 2023
[10]

6G networks: Beyond Shannon towards semantic and goal-oriented communications,

E. Calvanese Strinati and S. Barbarossa, “6G networks: Beyond Shannon towards semantic and goal-oriented communications,” Computer Networks, vol. 190, p. 107930, 2020

work page 2020
[11]

Joint task and data oriented semantic commu- nications: A deep separate source-channel coding scheme,

J. Huang, D. Li, C. H. Xiu, X. Qin, and W. Zhang, “Joint task and data oriented semantic commu- nications: A deep separate source-channel coding scheme,” ArXiv preprint: ArXiv:2302.13580 , 2023

work page arXiv 2023
[12]

Semantic communications: Principles and challenges,

Z. Qin, X. Tao, J. Lu, and G. Y . Li, “Semantic communications: Principles and challenges,” ArXiv preprint: ArXiv:2201.01389, 2021

work page arXiv 2021
[13]

Semantic-preserving image compression,

N. Patwa, N. A. Ahuja, S. Somayazulu, O. Tickoo, S. Varadarajan, and S. G. Koolagudi, “Semantic-preserving image compression,”IEEE International Conference on Image Processing (ICIP), pp. 1281–1285, 2020

work page 2020
[14]

An end-to-end deep learning image compression framework based on semantic analysis,

C. Wang, Y . Han, and W. Wang, “An end-to-end deep learning image compression framework based on semantic analysis,” Applied Sciences, 2019

work page 2019
[15]

Wireless semantic communications for video conferencing,

P. Jiang, C.-K. Wen, S. Jin, and G. Y . Li, “Wireless semantic communications for video conferencing,” IEEE Journal on Selected Areas in Communications , vol. 41, pp. 230–244, 2022

work page 2022
[16]

Perfor- mance evaluation of semantic video compression using multi-cue object detection,

N. M. AL-Shakarji, F. Bunyak, H. Aliakbarpour, G. Seetharaman, and K. Palaniappan, “Perfor- mance evaluation of semantic video compression using multi-cue object detection,” in IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pp. 1–8, 2019

work page 2019
[17]

GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,

A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” inInternational Conference on Machine Learning (ICML) , 2021

work page 2021
[18]

CoBIT: A contrastive bi-directional image-text generation model,

H. You, M. Guo, Z. Wang, K.-W. Chang, J. Baldridge, and J. Yu, “CoBIT: A contrastive bi-directional image-text generation model,” ArXiv preprint: ArXiv:2303.13455, 2023

work page arXiv 2023
[19]

Optimal transport in diffusion modeling for conversion tasks in audio domain,

V . Popov, A. Amatov, M. Kudinov, V . Gogoryan, T. Sadekova, and I. V ovk, “Optimal transport in diffusion modeling for conversion tasks in audio domain,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pp. 1–5, 2023

work page 2023
[20]

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models,

R. Huang, J.-B. Huang, D. Yang, Y . Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models,”ArXiv preprint: ArXiv:2301.12661, 2023. 10

work page arXiv 2023
[21]

Deep Audio Waveform Prior,

A. Turetzky, T. Michelson, Y . Adi, and S. Peleg, “Deep Audio Waveform Prior,” inInterspeech, pp. 2938–2942, 2022

work page 2022
[22]

Make-A-Video: Text-to-Video Generation without Text-Video Data

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman, “Make-A-Video: Text-to-video generation without text-video data,” ArXiv preprint: ArXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Seer: Language Instructed Video Prediction with Latent Diffusion Models

X. Gu, C. Wen, J. Song, and Y . Gao, “Seer: Language instructed video prediction with latent diffusion models,”ArXiv preprint: ArXiv:2303.14897, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Text2Performer: Text-driven human video generation,

Y . Jiang, S. Yang, T. K. Liang, W. Wu, C. L. Change, and Z. Liu, “Text2Performer: Text-driven human video generation,” ArXiv preprint: ArXiv:2304.08483, 2023

work page arXiv 2023
[25]

Diffusion models in vision: A survey,

F.-A. Croitoru, V . Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence , pp. 1–20, 2023

work page 2023
[26]

Diverse semantic image synthesis via probability distribution modeling,

Z. Tan, M. Chai, D. Chen, J. Liao, Q. Chu, B. Liu, G. Hua, and N. Yu, “Diverse semantic image synthesis via probability distribution modeling,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7962–7971, 2021

work page 2021
[27]

Semantic image synthesis with spatially-adaptive normalization,

T. Park, M.-Y . Liu, T.-C. Wang, and J.-Y . Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2019

work page 2019
[28]

You only need adversarial supervision for semantic image synthesis,

E. Schönfeld, V . Sushko, D. Zhang, J. Gall, B. Schiele, and A. Khoreva, “You only need adversarial supervision for semantic image synthesis,” in International Conference on Learning Representations (ICLR), 2021

work page 2021
[29]

Semantically multi-modal image synthesis,

Z. Zhu, Z.-L. Xu, A. You, and X. Bai, “Semantically multi-modal image synthesis,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 5466–5475, 2020

work page 2020
[30]

Learning to predict layout-to-image conditional convolutions for semantic image synthesis,

X. Liu, G. Yin, J. Shao, X. Wang, and H. Li, “Learning to predict layout-to-image conditional convolutions for semantic image synthesis,” in Advances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019
[31]

Generative model based highly efficient semantic communication approach for image transmission,

T. Han, J. Tang, Q. Yang, Y . Duan, Z. Zhang, and Z. Shi, “Generative model based highly efficient semantic communication approach for image transmission,” arXiv preprint: arXiv:2211.10287, 2022

work page arXiv 2022
[32]

Generative joint source-channel coding for semantic image transmission,

E. Erdemir, T.-Y . Tung, P. L. Dragotti, and D. Gunduz, “Generative joint source-channel coding for semantic image transmission,” arXiv preprint: arXiv:2211.13772, 2022

work page arXiv 2022
[33]

V AE for joint source-channel coding of distributed gaussian sources over AWGN MAC,

Y . Malur Saidutta, A. Abdi, and F. Fekri, “V AE for joint source-channel coding of distributed gaussian sources over AWGN MAC,” inIEEE Int. Workshop on Signal Processing Advances in Wireless Comm. (SPA WC), pp. 1–5, 2020

work page 2020
[34]

A variational auto-encoder approach for image transmission in wireless chan- nel,

A. H. Estiri, M. R. Sabramooz, A. Banaei, A. H. Dehghan, B. Jamialahmadi, and M. J. Siavoshani, “A variational auto-encoder approach for image transmission in wireless chan- nel,” arXiv preprint: arXiv:2010.03967, 2020

work page arXiv 2010
[35]

Generative model based highly efficient semantic communication approach for image transmission,

T. Han, J. Tang, Q. Yang, Y . Duan, Z. Zhang, and Z. Shi, “Generative model based highly efficient semantic communication approach for image transmission,”IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP) , 2022

work page 2022
[36]

Versatile diffusion: Text, images and variations all in one diffusion model,

X. Xu, Z. Wang, E. Zhang, K. Wang, and H. Shi, “Versatile diffusion: Text, images and variations all in one diffusion model,”ArXiv preprint: ArXiv:2211.08332, 2022

work page arXiv 2022
[37]

Learning task-oriented communication for edge inference: An information bottleneck approach,

J. Shao, Y . Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,” IEEE Journal on Selected Areas in Communications , vol. 40, pp. 197–211, 2021

work page 2021
[38]

U-Net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI) , 2015

work page 2015
[39]

Searching for Activation Functions

P. Ramachandran, B. Zoph, and Q. V . Le, “Swish: a self-gated activation function,” ArXiv preprint: ArXiv:1710.05941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Group normalization,

Y . Wu and K. He, “Group normalization,”International Journal of Computer Vision, vol. 128, pp. 742–755, 2018

work page 2018
[41]

Improved denoising diffusion probabilistic models,

A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” inInterna- tional Conference on Machine Learning (ICML) , pp. 8162—-8171, 2021. 11

work page 2021
[42]

Diffusion models beat gans on image synthesis,

P. Dhariwal and A. Q. Nichol, “Diffusion models beat gans on image synthesis,” inAdvances in Neural Information Processing Systems (NeurIPS) , vol. 34, 2021

work page 2021
[43]

Classifier-free diffusion guidance,

J. Ho and T. Salimans, “Classifier-free diffusion guidance,” inAdvances in Neural Information Processing Systems Workshops (NeurIPSW), 2021

work page 2021
[44]

Dilated residual networks,

F. Yu, V . Koltun, and T. A. Funkhouser, “Dilated residual networks,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 636–644, 2017

work page 2017
[45]

Per-pixel classification is not all you need for semantic segmentation,

B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” in Neural Information Processing Systems, 2021

work page 2021
[46]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision (ECCV) , pp. 213–229, 2020

work page 2020
[47]

M4Depth: Monocular depth estimation for autonomous vehicles in unseen environments,

M. Fonder, D. Ernst, and M. Van Droogenbroeck, “M4Depth: Monocular depth estimation for autonomous vehicles in unseen environments,”ArXiv preprint: ArXiv:2105.09847, 2021

work page arXiv 2021
[48]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,”IEEE/CVF International Conference on Computer Vision (ICCV) , pp. 12159–12168, 2021

work page 2021
[49]

A mathematical theory of communication,

C. E. Shannon, “A mathematical theory of communication,”The Bell system technical journal , vol. 27, no. 3, pp. 379–423, 1948

work page 1948
[50]

Recent contributions to the mathematical theory of communication,

W. Weaver, “Recent contributions to the mathematical theory of communication,” ETC: A Review of General Semantics, pp. 261–281, 1953. 12 Conv SiLU GroupNorm GroupNorm Conv SiLU FCEncoder block Conv SiLU SPADE SPADE Conv SiLU FC Decoder block Figure 6: Encoder and decoder blocks of our U-Net-based semantic diffusion model. Supplementary Material From Techn...

work page 1953
[51]

It deals with the classical Shannon’s communication theory and focuses on the proper way of transmitting bits from a sender to a receiver

The technical challenge. It deals with the classical Shannon’s communication theory and focuses on the proper way of transmitting bits from a sender to a receiver

work page
[52]

Rather than just transmitting bits, this level should account for properly transmitting the meaning of the messages the sender wants to communicate to the receiver

The semantic challenge. Rather than just transmitting bits, this level should account for properly transmitting the meaning of the messages the sender wants to communicate to the receiver

work page
[53]

This level deals with the efficiency of the transmission of previous levels

The effectiveness challenge. This level deals with the efficiency of the transmission of previous levels. With the upcoming advent of the sixth generation (6G), a radical rethinking of communication framework design has started, sliding from the first to the second level of Weaver’s theory [10, 1]. In this switch, generative learning methods are making th...

work page

[1] [1]

Semantic communications: Overview, open issues, and future research directions,

X. Luo, H.-H. Chen, and Q. Guo, “Semantic communications: Overview, open issues, and future research directions,” IEEE Wireless Comm., vol. 29, no. 1, pp. 210–219, 2022

work page 2022

[2] [2]

Communication beyond transmitting bits: Semantics-guided source and channel coding,

J. Dai, P. Zhang, K. Niu, S. Wang, Z. Si, and X. Qin, “Communication beyond transmitting bits: Semantics-guided source and channel coding,” ArXiv preprint: ArXiv:2208.02481, 2021

work page arXiv 2021

[3] [3]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020

work page 2020

[4] [4]

Photorealistic text-to- image diffusion models with deep language understanding,

C. Saharia, C. W., S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, R. Gontijo- Lopes, B. K. Ayan, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to- image diffusion models with deep language understanding,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[5] [5]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 10674–10685, 2021

work page 2021

[6] [6]

Text-to-audio generation using instruction- tuned LLM and latent diffusion model,

D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction- tuned LLM and latent diffusion model,”ArXiv preprint: ArXiv:2304.13731, 2023

work page arXiv 2023

[7] [7]

CogVideo: Large-scale pretraining for text- to-video generation via transformers,

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “CogVideo: Large-scale pretraining for text- to-video generation via transformers,” inInternational Conference on Learning Representations (ICLR), 2023

work page 2023

[8] [8]

Semantic image synthesis via diffusion models,

W. Wang, J. Bao, W.-G. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, “Semantic image synthesis via diffusion models,”ArXiv preprint: ArXiv:2207.00050, 2022

work page arXiv 2022

[9] [9]

Freestyle layout-to-image synthesis,

H. Xue, Z. H. Feng, Q. Sun, L. Song, and W. Zhang, “Freestyle layout-to-image synthesis,” ArXiv preprint ArXiv:2303.14412, 2023

work page arXiv 2023

[10] [10]

6G networks: Beyond Shannon towards semantic and goal-oriented communications,

E. Calvanese Strinati and S. Barbarossa, “6G networks: Beyond Shannon towards semantic and goal-oriented communications,” Computer Networks, vol. 190, p. 107930, 2020

work page 2020

[11] [11]

Joint task and data oriented semantic commu- nications: A deep separate source-channel coding scheme,

J. Huang, D. Li, C. H. Xiu, X. Qin, and W. Zhang, “Joint task and data oriented semantic commu- nications: A deep separate source-channel coding scheme,” ArXiv preprint: ArXiv:2302.13580 , 2023

work page arXiv 2023

[12] [12]

Semantic communications: Principles and challenges,

Z. Qin, X. Tao, J. Lu, and G. Y . Li, “Semantic communications: Principles and challenges,” ArXiv preprint: ArXiv:2201.01389, 2021

work page arXiv 2021

[13] [13]

Semantic-preserving image compression,

N. Patwa, N. A. Ahuja, S. Somayazulu, O. Tickoo, S. Varadarajan, and S. G. Koolagudi, “Semantic-preserving image compression,”IEEE International Conference on Image Processing (ICIP), pp. 1281–1285, 2020

work page 2020

[14] [14]

An end-to-end deep learning image compression framework based on semantic analysis,

C. Wang, Y . Han, and W. Wang, “An end-to-end deep learning image compression framework based on semantic analysis,” Applied Sciences, 2019

work page 2019

[15] [15]

Wireless semantic communications for video conferencing,

P. Jiang, C.-K. Wen, S. Jin, and G. Y . Li, “Wireless semantic communications for video conferencing,” IEEE Journal on Selected Areas in Communications , vol. 41, pp. 230–244, 2022

work page 2022

[16] [16]

Perfor- mance evaluation of semantic video compression using multi-cue object detection,

N. M. AL-Shakarji, F. Bunyak, H. Aliakbarpour, G. Seetharaman, and K. Palaniappan, “Perfor- mance evaluation of semantic video compression using multi-cue object detection,” in IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pp. 1–8, 2019

work page 2019

[17] [17]

GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,

A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” inInternational Conference on Machine Learning (ICML) , 2021

work page 2021

[18] [18]

CoBIT: A contrastive bi-directional image-text generation model,

H. You, M. Guo, Z. Wang, K.-W. Chang, J. Baldridge, and J. Yu, “CoBIT: A contrastive bi-directional image-text generation model,” ArXiv preprint: ArXiv:2303.13455, 2023

work page arXiv 2023

[19] [19]

Optimal transport in diffusion modeling for conversion tasks in audio domain,

V . Popov, A. Amatov, M. Kudinov, V . Gogoryan, T. Sadekova, and I. V ovk, “Optimal transport in diffusion modeling for conversion tasks in audio domain,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pp. 1–5, 2023

work page 2023

[20] [20]

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models,

R. Huang, J.-B. Huang, D. Yang, Y . Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models,”ArXiv preprint: ArXiv:2301.12661, 2023. 10

work page arXiv 2023

[21] [21]

Deep Audio Waveform Prior,

A. Turetzky, T. Michelson, Y . Adi, and S. Peleg, “Deep Audio Waveform Prior,” inInterspeech, pp. 2938–2942, 2022

work page 2022

[22] [22]

Make-A-Video: Text-to-Video Generation without Text-Video Data

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman, “Make-A-Video: Text-to-video generation without text-video data,” ArXiv preprint: ArXiv:2209.14792, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[23] [23]

Seer: Language Instructed Video Prediction with Latent Diffusion Models

X. Gu, C. Wen, J. Song, and Y . Gao, “Seer: Language instructed video prediction with latent diffusion models,”ArXiv preprint: ArXiv:2303.14897, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

Text2Performer: Text-driven human video generation,

Y . Jiang, S. Yang, T. K. Liang, W. Wu, C. L. Change, and Z. Liu, “Text2Performer: Text-driven human video generation,” ArXiv preprint: ArXiv:2304.08483, 2023

work page arXiv 2023

[25] [25]

Diffusion models in vision: A survey,

F.-A. Croitoru, V . Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence , pp. 1–20, 2023

work page 2023

[26] [26]

Diverse semantic image synthesis via probability distribution modeling,

Z. Tan, M. Chai, D. Chen, J. Liao, Q. Chu, B. Liu, G. Hua, and N. Yu, “Diverse semantic image synthesis via probability distribution modeling,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7962–7971, 2021

work page 2021

[27] [27]

Semantic image synthesis with spatially-adaptive normalization,

T. Park, M.-Y . Liu, T.-C. Wang, and J.-Y . Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 2019

work page 2019

[28] [28]

You only need adversarial supervision for semantic image synthesis,

E. Schönfeld, V . Sushko, D. Zhang, J. Gall, B. Schiele, and A. Khoreva, “You only need adversarial supervision for semantic image synthesis,” in International Conference on Learning Representations (ICLR), 2021

work page 2021

[29] [29]

Semantically multi-modal image synthesis,

Z. Zhu, Z.-L. Xu, A. You, and X. Bai, “Semantically multi-modal image synthesis,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 5466–5475, 2020

work page 2020

[30] [30]

Learning to predict layout-to-image conditional convolutions for semantic image synthesis,

X. Liu, G. Yin, J. Shao, X. Wang, and H. Li, “Learning to predict layout-to-image conditional convolutions for semantic image synthesis,” in Advances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019

[31] [31]

Generative model based highly efficient semantic communication approach for image transmission,

T. Han, J. Tang, Q. Yang, Y . Duan, Z. Zhang, and Z. Shi, “Generative model based highly efficient semantic communication approach for image transmission,” arXiv preprint: arXiv:2211.10287, 2022

work page arXiv 2022

[32] [32]

Generative joint source-channel coding for semantic image transmission,

E. Erdemir, T.-Y . Tung, P. L. Dragotti, and D. Gunduz, “Generative joint source-channel coding for semantic image transmission,” arXiv preprint: arXiv:2211.13772, 2022

work page arXiv 2022

[33] [33]

V AE for joint source-channel coding of distributed gaussian sources over AWGN MAC,

Y . Malur Saidutta, A. Abdi, and F. Fekri, “V AE for joint source-channel coding of distributed gaussian sources over AWGN MAC,” inIEEE Int. Workshop on Signal Processing Advances in Wireless Comm. (SPA WC), pp. 1–5, 2020

work page 2020

[34] [34]

A variational auto-encoder approach for image transmission in wireless chan- nel,

A. H. Estiri, M. R. Sabramooz, A. Banaei, A. H. Dehghan, B. Jamialahmadi, and M. J. Siavoshani, “A variational auto-encoder approach for image transmission in wireless chan- nel,” arXiv preprint: arXiv:2010.03967, 2020

work page arXiv 2010

[35] [35]

Generative model based highly efficient semantic communication approach for image transmission,

T. Han, J. Tang, Q. Yang, Y . Duan, Z. Zhang, and Z. Shi, “Generative model based highly efficient semantic communication approach for image transmission,”IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP) , 2022

work page 2022

[36] [36]

Versatile diffusion: Text, images and variations all in one diffusion model,

X. Xu, Z. Wang, E. Zhang, K. Wang, and H. Shi, “Versatile diffusion: Text, images and variations all in one diffusion model,”ArXiv preprint: ArXiv:2211.08332, 2022

work page arXiv 2022

[37] [37]

Learning task-oriented communication for edge inference: An information bottleneck approach,

J. Shao, Y . Mao, and J. Zhang, “Learning task-oriented communication for edge inference: An information bottleneck approach,” IEEE Journal on Selected Areas in Communications , vol. 40, pp. 197–211, 2021

work page 2021

[38] [38]

U-Net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI) , 2015

work page 2015

[39] [39]

Searching for Activation Functions

P. Ramachandran, B. Zoph, and Q. V . Le, “Swish: a self-gated activation function,” ArXiv preprint: ArXiv:1710.05941, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[40] [40]

Group normalization,

Y . Wu and K. He, “Group normalization,”International Journal of Computer Vision, vol. 128, pp. 742–755, 2018

work page 2018

[41] [41]

Improved denoising diffusion probabilistic models,

A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” inInterna- tional Conference on Machine Learning (ICML) , pp. 8162—-8171, 2021. 11

work page 2021

[42] [42]

Diffusion models beat gans on image synthesis,

P. Dhariwal and A. Q. Nichol, “Diffusion models beat gans on image synthesis,” inAdvances in Neural Information Processing Systems (NeurIPS) , vol. 34, 2021

work page 2021

[43] [43]

Classifier-free diffusion guidance,

J. Ho and T. Salimans, “Classifier-free diffusion guidance,” inAdvances in Neural Information Processing Systems Workshops (NeurIPSW), 2021

work page 2021

[44] [44]

Dilated residual networks,

F. Yu, V . Koltun, and T. A. Funkhouser, “Dilated residual networks,”IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 636–644, 2017

work page 2017

[45] [45]

Per-pixel classification is not all you need for semantic segmentation,

B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” in Neural Information Processing Systems, 2021

work page 2021

[46] [46]

End-to-end object detection with transformers,

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision (ECCV) , pp. 213–229, 2020

work page 2020

[47] [47]

M4Depth: Monocular depth estimation for autonomous vehicles in unseen environments,

M. Fonder, D. Ernst, and M. Van Droogenbroeck, “M4Depth: Monocular depth estimation for autonomous vehicles in unseen environments,”ArXiv preprint: ArXiv:2105.09847, 2021

work page arXiv 2021

[48] [48]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,”IEEE/CVF International Conference on Computer Vision (ICCV) , pp. 12159–12168, 2021

work page 2021

[49] [49]

A mathematical theory of communication,

C. E. Shannon, “A mathematical theory of communication,”The Bell system technical journal , vol. 27, no. 3, pp. 379–423, 1948

work page 1948

[50] [50]

Recent contributions to the mathematical theory of communication,

W. Weaver, “Recent contributions to the mathematical theory of communication,” ETC: A Review of General Semantics, pp. 261–281, 1953. 12 Conv SiLU GroupNorm GroupNorm Conv SiLU FCEncoder block Conv SiLU SPADE SPADE Conv SiLU FC Decoder block Figure 6: Encoder and decoder blocks of our U-Net-based semantic diffusion model. Supplementary Material From Techn...

work page 1953

[51] [51]

It deals with the classical Shannon’s communication theory and focuses on the proper way of transmitting bits from a sender to a receiver

The technical challenge. It deals with the classical Shannon’s communication theory and focuses on the proper way of transmitting bits from a sender to a receiver

work page

[52] [52]

Rather than just transmitting bits, this level should account for properly transmitting the meaning of the messages the sender wants to communicate to the receiver

The semantic challenge. Rather than just transmitting bits, this level should account for properly transmitting the meaning of the messages the sender wants to communicate to the receiver

work page

[53] [53]

This level deals with the efficiency of the transmission of previous levels

The effectiveness challenge. This level deals with the efficiency of the transmission of previous levels. With the upcoming advent of the sixth generation (6G), a radical rethinking of communication framework design has started, sliding from the first to the second level of Weaver’s theory [10, 1]. In this switch, generative learning methods are making th...

work page