GenED-SC: Generative Editing Semantic Communication with Integrated Multi-Modal LLMs
Pith reviewed 2026-06-28 16:26 UTC · model grok-4.3
The pith
A two-stage system uses JSCC to send key image regions then MLLM editing from text to restore details under noisy channels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed two-stage semantic image transmission framework integrates JSCC-based discriminative transmission that selectively prioritizes semantically important regions with MLLM-driven generative editing that refines missing details from textual descriptions, achieving state-of-the-art results in semantic preservation, perceptual quality, and visual fidelity especially in low-SNR regimes.
What carries the argument
The two-stage framework that pairs JSCC for layout-preserving transmission of key regions with MLLM generative editing driven by accompanying textual descriptions.
If this is right
- Scene layout and object integrity remain intact under bandwidth constraints.
- Perceptual quality rises by drawing on pre-trained generative knowledge without full reliance on it.
- Overall transmission quality holds across a broad range of channel conditions rather than degrading sharply in noise.
- Semantic fidelity improves relative to methods that optimize only visual metrics or only generative priors.
Where Pith is reading between the lines
- The same split between discriminative coding and text-guided refinement could apply to video sequences or audio streams.
- Hybrid systems of this type suggest generative components can serve as a post-processing layer rather than a full replacement for coding.
- Deployment would benefit from safeguards that detect when editing introduces inconsistencies before final output.
Load-bearing premise
The multimodal large language model can reliably generate missing image details from text without introducing semantic errors or hallucinations that degrade transmission quality.
What would settle it
Direct comparison of transmitted images against ground truth showing whether MLLM edits change object identities, scene semantics, or introduce inconsistent elements in low-SNR test cases.
Figures
read the original abstract
Deep learning-based joint source-channel coding has recently demonstrated strong potential for semantic communication (SemComm). However, most existing approaches focus on optimizing visual-fidelity metrics, which can lead to reduced perceptual quality. Generative model-based SemComm leverages rich prior knowledge from large-scale pre-training to enhance perceptual quality, but often at the cost of increased distortion and unreliability. This paper addresses the above issues by proposing a two-stage semantic image transmission framework, integrating a multimodal large language model (MLLM) for generative editing. In the first stage, a JSCC-based discriminative transmission selectively prioritizes semantically important regions, preserving scene layout and object integrity under limited bandwidth. In the second phase, MLLM-driven generative editing refines missing details based on the textual descriptions, enhancing semantic fidelity and perceptual quality. Extensive experiments show that the proposed framework achieves state-of-the-art performance in semantic preservation, perceptual quality, and visual fidelity across a wide range of channel conditions, especially in low-SNR regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GenED-SC, a two-stage semantic image transmission framework. Stage 1 uses JSCC-based discriminative transmission to prioritize semantically important regions and preserve scene layout/object integrity under bandwidth limits. Stage 2 applies MLLM-driven generative editing to refine missing details from textual descriptions. The paper asserts that this yields state-of-the-art performance in semantic preservation, perceptual quality, and visual fidelity across channel conditions, especially low-SNR regimes.
Significance. If the claims hold after verification, the work would offer a useful direction for semantic communications by combining the strengths of discriminative JSCC (for structural preservation) with generative MLLM editing (for perceptual enhancement), potentially improving robustness in bandwidth-constrained, noisy channels beyond current DL-based or purely generative SemComm methods.
major comments (2)
- [Abstract] Abstract: The central claim of achieving state-of-the-art performance in semantic preservation, perceptual quality, and visual fidelity (especially low-SNR) is stated without any quantitative metrics, baselines, datasets, SNR ranges, or error analysis. This renders the headline result unevaluable from the manuscript.
- [Second phase description] Description of the second phase (MLLM-driven generative editing): The framework asserts that this step 'enhances semantic fidelity' by refining details from textual descriptions, but supplies no mechanism, prompt strategy, ablation, or metric (e.g., hallucination rate via CLIP/object-detection consistency or human semantic scoring) to ensure outputs remain faithful rather than introducing plausible inventions. Because Stage 1 already discards information under bandwidth limits, this assumption is load-bearing for the semantic-preservation claim and remains unverified.
minor comments (1)
- [Abstract] The abstract would benefit from briefly naming the concrete metrics (e.g., LPIPS, CLIP similarity, semantic segmentation IoU) used to support the SOTA claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below, indicating where revisions will be made.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of achieving state-of-the-art performance in semantic preservation, perceptual quality, and visual fidelity (especially low-SNR) is stated without any quantitative metrics, baselines, datasets, SNR ranges, or error analysis. This renders the headline result unevaluable from the manuscript.
Authors: We agree that the abstract would be strengthened by including quantitative context. The revised abstract will incorporate key results including specific metric improvements (e.g., CLIP similarity, FID), datasets (COCO), baselines, and SNR ranges (0-20 dB), drawn from the experimental sections. revision: yes
-
Referee: [Second phase description] Description of the second phase (MLLM-driven generative editing): The framework asserts that this step 'enhances semantic fidelity' by refining details from textual descriptions, but supplies no mechanism, prompt strategy, ablation, or metric (e.g., hallucination rate via CLIP/object-detection consistency or human semantic scoring) to ensure outputs remain faithful rather than introducing plausible inventions. Because Stage 1 already discards information under bandwidth limits, this assumption is load-bearing for the semantic-preservation claim and remains unverified.
Authors: We recognize the importance of verifying faithfulness in the generative stage. Section 3 of the manuscript describes the MLLM prompt strategy and integration with Stage 1 outputs, while Section 4 provides ablations on the editing module's contribution. To strengthen the claim, the revision will add explicit hallucination metrics (CLIP/object-detection consistency) and human semantic scoring. revision: partial
Circularity Check
No circularity: framework proposal contains no derivations or self-referential predictions
full rationale
The paper proposes a two-stage semantic image transmission framework (JSCC discriminative transmission followed by MLLM generative editing) and reports experimental SOTA results. No equations, parameter fits, uniqueness theorems, or predictions are presented in the provided text that reduce by construction to the inputs themselves. Claims rest on external experimental validation rather than self-definition, fitted-input renaming, or load-bearing self-citations, satisfying the criteria for a self-contained non-circular description.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Deep joint source- channel coding for wireless image transmission,
E. Bourtsoulatze, D. Burth Kurka, and D. G ¨und¨uz, “Deep joint source- channel coding for wireless image transmission,”IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019
2019
-
[2]
Low-rate semantic commu- nication with codebook-based conditional generative models,
K. Ye, M. Gong, S. Wang, and D. Feng, “Low-rate semantic commu- nication with codebook-based conditional generative models,” in2025 IEEE 99th Veh Technol Conf, pp. 1–5, 2025
2025
-
[3]
High- resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR), pp. 10684–10695, June 2022
2022
-
[4]
Language-oriented communication with semantic coding and knowledge distillation for text- to-image generation,
H. Nam, J. Park, J. Choi, M. Bennis, and S.-L. Kim, “Language-oriented communication with semantic coding and knowledge distillation for text- to-image generation,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13506–13510, 2024
2024
-
[5]
Image generative semantic communication with multi-modal similarity estimation for resource-limited networks,
E. Hosonuma, T. Yamazaki, T. Miyoshi, A. Taya, Y . Nishiyama, and K. Sezaki, “Image generative semantic communication with multi-modal similarity estimation for resource-limited networks,”IEICE Transactions on Communications, vol. E108-B, no. 3, pp. 260–273, 2025
2025
-
[6]
Trustworthy image semantic communication with genai: Explainablity, controllability, and efficiency,
X. Wang, D. Ye, C. Feng, H. H. Yang, X. Chen, and T. Q. S. Quek, “Trustworthy image semantic communication with genai: Explainablity, controllability, and efficiency,”IEEE Wireless Communications, vol. 32, no. 2, pp. 68–75, 2025
2025
-
[7]
Diff- go+: An efficient diffusion goal-oriented communication system with local feedback,
A. Wijesinghe, S. Zhang, S. Wanninayaka, W. Wang, and Z. Ding, “Diff- go+: An efficient diffusion goal-oriented communication system with local feedback,”IEEE Transactions on Wireless Communications, pp. 1– 1, 2025
2025
-
[8]
Ultra lowrate image compression with semantic residual coding and compression-aware diffusion,
A. Ke, X. Zhang, T. Chen, M. Lu, C. Zhou, J. Gu, and Z. Ma, “Ultra lowrate image compression with semantic residual coding and compression-aware diffusion,” inInternational Conference on Machine Learning, 2025
2025
-
[9]
Z. Wang, E. Xie, A. Li, Z. Wang, X. Liu, and Z. Li, “Divide and conquer: Language models can plan and self-correct for compositional text-to- image generation,”arXiv preprint, 2024. arXiv:2401.15688 [cs.CV]
arXiv 2024
-
[10]
Cddm: Channel denoising diffusion models for wireless semantic communica- tions,
T. Wu, Z. Chen, D. He, L. Qian, Y . Xu, M. Tao, and W. Zhang, “Cddm: Channel denoising diffusion models for wireless semantic communica- tions,”IEEE Transactions on Wireless Communications, vol. 23, no. 9, pp. 11168–11183, 2024
2024
-
[11]
Semantics guided diffusion for deep joint source and channel coding design,
M. Zhang, G. Zhu, K. Han, R. Jin, and X. Chen, “Semantics guided diffusion for deep joint source and channel coding design,” in2024 16th International Conference on Wireless Communications and Signal Processing (WCSP), pp. 1509–1514, 2024
2024
-
[12]
Se- mantic prior aided channel-adaptive equalizing and de-noising semantic communication system with latent diffusion model,
B. Xu, S. Han, X. Xu, W. Li, R. Meng, C. Dong, and P. Zhang, “Se- mantic prior aided channel-adaptive equalizing and de-noising semantic communication system with latent diffusion model,”IEEE Transactions on Wireless Communications, vol. 24, no. 6, pp. 4614–4630, 2025
2025
-
[13]
Generative model based highly efficient semantic communication approach for image transmission,
T. Han, J. Tang, Q. Yang, Y . Duan, Z. Zhang, and Z. Shi, “Generative model based highly efficient semantic communication approach for image transmission,” inICASSP 2023 - 2023 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2023
2023
-
[14]
Robust semantic transmission of images with generative adversarial networks,
Q. He, H. Yuan, D. Feng, B. Che, Z. Chen, and X.-G. Xia, “Robust semantic transmission of images with generative adversarial networks,” inGLOBECOM 2022 - 2022 IEEE Global Communications Conference, pp. 3953–3958, 2022
2022
-
[15]
Latency-aware generative semantic communications with pre-trained diffusion models,
L. Qiao, M. B. Mashhadi, Z. Gao, C. H. Foh, P. Xiao, and M. Bennis, “Latency-aware generative semantic communications with pre-trained diffusion models,”IEEE Wireless Communications Letters, vol. 13, no. 10, pp. 2652–2656, 2024
2024
-
[16]
Sg2sc: A generative semantic communication framework for scene understanding- oriented image transmission,
M. Yang, D. Gao, F. Xie, J. Li, X. Song, and G. Shi, “Sg2sc: A generative semantic communication framework for scene understanding- oriented image transmission,” inICASSP 2024 - 2024 IEEE In- ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13486–13490, 2024
2024
-
[17]
Agent-driven generative semantic communication with cross- modality and prediction,
W. Yang, Z. Xiong, Y . Yuan, W. Jiang, T. Q. S. Quek, and M. Deb- bah, “Agent-driven generative semantic communication with cross- modality and prediction,”IEEE Transactions on Wireless Communica- tions, vol. 24, no. 3, pp. 2233–2248, 2025
2025
-
[18]
Mixture of se- mantics transmission for generative ai-enabled semantic communication systems,
J. Ni, T. Wu, Z. Chen, Y . Xu, M. Tao, and W. Zhang, “Mixture of se- mantics transmission for generative ai-enabled semantic communication systems,”IEEE Communications Letters, vol. 29, no. 12, pp. 2770–2774, 2025
2025
-
[19]
Semantic importance-aware communications using pre-trained language models,
S. Guo, Y . Wang, S. Li, and N. Saeed, “Semantic importance-aware communications using pre-trained language models,”IEEE Communi- cations Letters, vol. 27, no. 9, pp. 2328–2332, 2023
2023
-
[20]
Large- language-model-enabled text semantic communication systems,
Z. Wang, L. Zou, S. Wei, K. Li, F. Liao, H. Mi, and R. Lai, “Large- language-model-enabled text semantic communication systems,”Applied Sciences, vol. 15, p. 7227, June 2025
2025
-
[21]
On the uses of large language models to design end-to-end learning semantic communication,
Y . Wang, Z. Sun, J. Fan, and H. Ma, “On the uses of large language models to design end-to-end learning semantic communication,” in2024 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6, 2024
2024
-
[22]
Lamosc: Large language model-driven semantic communication system for visual trans- mission,
Y . Zhao, Y . Yue, S. Hou, B. Cheng, and Y . Huang, “Lamosc: Large language model-driven semantic communication system for visual trans- mission,”IEEE Transactions on Cognitive Communications and Net- working, vol. 10, no. 6, pp. 2005–2018, 2024
2005
-
[23]
Gen- erative semantic communications with foundation models: Perception- error analysis and semantic-aware power allocation,
C. Xu, M. B. Mashhadi, Y . Ma, R. Tafazolli, and J. Wang, “Gen- erative semantic communications with foundation models: Perception- error analysis and semantic-aware power allocation,”IEEE Journal on Selected Areas in Communications, vol. 43, no. 7, pp. 2493–2505, 2025
2025
-
[24]
Addressing out- of-distribution challenges in image semantic communication systems with multi-modal large language models,
F. Zhang, Y . Du, K. Chen, Y . Shao, and S. C. Liew, “Addressing out- of-distribution challenges in image semantic communication systems with multi-modal large language models,” in2024 22nd International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pp. 7–14, 2024
2024
-
[25]
Multimodal large language models driven privacy-preserving wireless semantic communication in 6g,
D. Cao, J. Wu, and A. K. Bashir, “Multimodal large language models driven privacy-preserving wireless semantic communication in 6g,” in 2024 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 171–176, 2024
2024
-
[26]
Vi- sual language model-based cross-modal semantic communication sys- tems,
F. Jiang, C. Tang, L. Dong, K. Wang, K. Yang, and C. Pan, “Vi- sual language model-based cross-modal semantic communication sys- tems,”IEEE Transactions on Wireless Communications, vol. 24, no. 5, pp. 3937–3948, 2025
2025
-
[27]
Cross-modal generative semantic communications for mobile aigc: Joint semantic encoding and prompt engineering,
Y . Liu, H. Du, D. Niyato, J. Kang, Z. Xiong, S. Mao, P. Zhang, and X. Shen, “Cross-modal generative semantic communications for mobile aigc: Joint semantic encoding and prompt engineering,”IEEE Transactions on Mobile Computing, vol. 23, no. 12, pp. 14871–14888, 2024
2024
-
[28]
Leveraging stable diffusion with context-aware prompts for semantic communica- tion,
L. V . Nguyen, T. T. Nguyen, O. A. Dobre, and T. Q. Duong, “Leveraging stable diffusion with context-aware prompts for semantic communica- tion,” in2024 IEEE 21st International Conference on Mobile Ad-Hoc and Smart Systems (MASS), pp. 610–615, 2024
2024
-
[29]
Sequential semantic generative communication for progressive text-to-image generation,
H. Nam, J. Park, J. Choi, and S.-L. Kim, “Sequential semantic generative communication for progressive text-to-image generation,” in2023 20th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), pp. 91–94, 2023
2023
-
[30]
Learning transferable visual models from natural language supervi- sion,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inProceedings of the 38th International Conference on Machine Learning(M. Meila and T. Zhang, eds.), vol. 139 ofProceedings of Machine Learnin...
2021
-
[31]
Cbam: Convolutional block attention module,
S. Woo, J. Park, J.-Y . Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” inProceedings of the European Conference on Computer Vision (ECCV), September 2018
2018
-
[32]
H. Choi and D. Seo, “Feature importance-aware deep joint source- channel coding for computationally efficient and adjustable image trans- mission,”arXiv preprint arXiv:2504.04758, 2025
Pith/arXiv arXiv 2025
-
[33]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
Pith/arXiv arXiv 2023
-
[34]
Denseclip: Language-guided dense prediction with context-aware prompting,
Y . Rao, W. Zhao, G. Chen, Y . Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “Denseclip: Language-guided dense prediction with context-aware prompting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18082–18091, June 2022
2022
-
[35]
Swinjscc: Taming swin transformer for deep joint source-channel coding,
K. Yang, S. Wang, J. Dai, X. Qin, K. Niu, and P. Zhang, “Swinjscc: Taming swin transformer for deep joint source-channel coding,”IEEE Transactions on Cognitive Communications and Networking, vol. 11, no. 1, pp. 90–104, 2025
2025
-
[36]
No-reference image quality assessment in the spatial domain,
A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,”IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012
2012
-
[37]
Large ai model-enabled generative semantic communications for image transmission,
Q. Ma, W. Ni, and Z. Qin, “Large ai model-enabled generative semantic communications for image transmission,” inGLOBECOM 2025 - 2025 IEEE Global Communications Conference, pp. 6364–6369, 2025
2025
-
[38]
Text-guided token communication for wireless image transmission,
B. Liu, L. Qiao, Y . Wang, Z. Gao, Y . Ma, K. Ying, and T. Qin, “Text-guided token communication for wireless image transmission,” in 2025 IEEE/CIC International Conference on Communications in China (ICCC), pp. 1–6, 2025
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.