Net-Ev² proposes a two-stage generative simulator with structure-guided masked pre-training and topology-aware diffusion using graph U-Net down/upsampling to model network event evolution from text inputs, plus a new 6.5M multimodal benchmark and JL-MMD metric.
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Canonical reference. 71% of citing Pith papers cite this work as background.
representative citing papers
A per-timestep conditioned diffusion transformer generates realistic fMRI dynamics for unseen cognitive tasks by injecting compositional language and optional spatial priors in-context.
An entropy-based log-barrier on DiT outputs acts as an online curriculum in supervised diffusion fine-tuning, producing higher thematic development and textural diversity than standard training on MusicCaps.
HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.
MixtureTT performs direct per-stem timbre transfer on polyphonic mixtures via a shared diffusion transformer, outperforming single-stem baselines on SATB choral data while eliminating cascaded separation errors.
LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
FoleyDesigner generates spatio-temporally aligned stereo Foley audio for film clips via multi-agent analysis, diffusion models on video cues, and LLM mixing, supported by the new FilmStereo dataset.
JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
Extends vLLM with delay-pattern de-interleaving, multi-stream sampling, and co-scheduled CFG to achieve 80% of non-CFG throughput for unified audio tasks while open-sourcing the pipeline.
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
Caption poisoning attacks can steer retrieval-augmented text-to-music generation toward attacker-chosen targets by injecting crafted captions into the knowledge database.
PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when applied to Stable Audio VAE with F5-TTS.
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
Diffusion models overfit denoising loss at intermediate noise but generalize in inference as model error smooths the flow field and sampling paths avoid memorized noisy training data.
DGSNA dynamically generates scene-specific noise via prompt-driven language models and text-to-audio diffusion, then mixes it with speech to improve recognition and keyword spotting robustness by up to 11.32%.
ALM2Vec learns unified audio embeddings from large audio-language models for text-audio retrieval, instruction-aware retrieval, and other tasks across domains.
EigeNet applies a cross-view alternate-attention transformer with geometry modulation for few-shot novel-view RIR prediction, reporting SOTA results on simulated and real data.
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
Evaluates 42 variants of foundation models across three formalized paradigms for missing modality reconstruction, identifies shortfalls in semantic extraction and validation, and introduces an agentic framework that reduces FID by at least 14% for images and MER by at least 10% for text.
MAVIN proposes boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and the MAVINSet dataset as the first framework for multi-shot audio-visual generation with narrative control, claiming SOTA results.
citing papers explorer
- AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
- How Far Are We from Generating Missing Modalities with Foundation Models?
- Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models