f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

ICLR 2023


Apple  

Abstract


TLDR: We propose f-DM, an end-to-end, non-cascaded diffusion model that allows progressive signal transformations along the diffusion process.

Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains. Standard DMs can be viewed as an instantiation of hierarchical variational autoencoders (VAEs) where the latent variables are inferred from input-centered Gaussian distributions with fixed scales and variances. Unlike VAEs, this formulation prevents DMs from changing the latent spaces and learning abstract representations. In this work, we propose f-DM, a generalized family of DMs which allows progressive signal transformation. More precisely, we extend DMs to incorporate a set of (hand-designed or learned) transformations, where the transformed input is the mean of each diffusion step. We propose a generalized formulation and derive the corresponding denoising objective with a modified sampling algorithm. As a demonstration, we apply f-DM to image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations based on the encoders of pretrained VAEs. In addition, we identify the importance of adjusting the noise levels whenever the signal is sub-sampled and propose a simple rescaling recipe. f-DM produces high-quality samples on standard image generation benchmarks such as FFHQ, AFHQ, LSUN, and ImageNet, with better efficiency and semantic interpretation.



Overview of Our Method


Our method is motivated by comparing standard DMs with hierarchical VAEs; we propose f-DM, which allows progressive signal transformation during the diffusion process. We show the architecture and the diffusion algorithm below. For more technical details, please refer to the paper.
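To make the forward process concrete, here is a minimal sketch of the stage-wise corruption for the downsampling variant (f-DM-DS). This is our simplified reading rather than the released implementation: we assume the T steps are split evenly across stages, a linear in-stage interpolation weight, nearest-neighbor upsampling, and hypothetical schedule arrays alphas/sigmas.

  import torch
  import torch.nn.functional as F

  def fdm_forward(x0, t, T, num_stages, alphas, sigmas):
      # Progressively transformed signals: signals[k] applies k rounds of
      # 2x average pooling (the f-DM-DS choice of transformation f).
      signals = [x0]
      for _ in range(num_stages):
          signals.append(F.avg_pool2d(signals[-1], 2))

      steps_per_stage = T // num_stages
      k = min(t // steps_per_stage, num_stages - 1)    # current stage index
      w = (t - k * steps_per_stage) / steps_per_stage  # in-stage progress in [0, 1]

      # Within a stage, the clean signal interpolates between the current
      # stage signal and the (upsampled) next one; the gap delta_t is the
      # signal degradation that the f-DM network learns to predict.
      nxt = F.interpolate(signals[k + 1], scale_factor=2, mode="nearest")
      x_t = (1.0 - w) * signals[k] + w * nxt
      delta_t = x_t - nxt

      z_t = alphas[t] * x_t + sigmas[t] * torch.randn_like(x_t)
      return z_t, x_t, delta_t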


Generation of Various Models


We show video comparisons of the reverse diffusion process on AFHQ 256×256 for f-DMs and a standard DM (DDPM). For the standard DM, we visualize the denoised xt and the next noised input zs at each diffusion step. For f-DMs, we additionally visualize the predicted δt (signal degradation). For the latent-space models, we plot the first three channels of the VQVAE/VQGAN latent variables. All samples are generated via 250 DDPM steps. Note that low-resolution intermediates (f-DM-DS, f-DM-VQVAE) are resized for ease of visualization. A hedged sketch of the reverse loop these videos visualize follows the list below.

Standard DM (xt, zs)
f-DM-DS (xt, δt, zs)

f-DM-Blur-G (xt, δt, zs)
f-DM-Blur-U (xt, δt, zs)

f-DM-VQVAE (xt, δt, zs)
f-DM-VQGAN (xt, δt, zs)
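Below is a hedged sketch of the reverse loop these videos visualize. The interface model(z, t) returning the denoised xt and the degradation δt, the in-stage weight array w, and the helpers boundaries/to_finer are our assumptions for illustration, not the paper's exact API; the update from xt to the clean signal at s = t − 1 follows from the interpolation identity in the forward sketch above.

  import torch

  @torch.no_grad()
  def fdm_reverse(model, z, alphas, sigmas, w, boundaries, to_finer):
      T = len(alphas) - 1
      x_t = z
      for t in range(T, 0, -1):
          x_t, delta_t = model(z, t)  # denoised signal + predicted degradation
          s = t - 1
          # From x_t = (1 - w_t) * cur + w_t * nxt and delta_t = x_t - nxt,
          # the clean signal one step back is
          #   x_s = x_t + (w_t - w_s) / (1 - w_t) * delta_t
          # (the epsilon guards the stage boundary where w_t -> 1).
          x_s = x_t + (w[t] - w[s]) / (1.0 - w[t] + 1e-8) * delta_t
          eps = torch.randn_like(x_s) if s > 0 else torch.zeros_like(x_s)
          z = alphas[s] * x_s + sigmas[s] * eps  # next noised input z_s
          if s in boundaries:  # stage switch: restore the finer resolution
              z = to_finer(z)
      return x_t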




Generation on Various Datasets


We show additional generation results on LSUN Church/Bed 256×256, class-conditional ImageNet 256×256, and FFHQ 1024×1024. For f-DM-VQVAE, we plot the first three channels of the latent variables. All faces presented are synthesized by the models and are not real identities.

LSUN Bed 256×256 with f-DM-DS (xt, δt, zs)

LSUN Church 256×256 with f-DM-Blur-U (xt, δt, zs)

Class-conditional ImageNet 256×256 with f-DM-VQVAE (xt, δt, zs)
macaw (class=88)
daisy (class=985)


seashore (class=978)
pizza (class=963)


FFHQ 1024×1024 with f-DM-DS (xt, δt, zs)


Conditional Generation


We show video comparisons on AFHQ 256×256 from f-DM-DS and f-DM-Blur-G, taking low-resolution (16×16) and Gaussian-blurred images as inputs, respectively. All samples are generated through DDIM (η=0) with 250 steps, together with gradient-based initialization (30 gradient steps before diffusion starts); a sketch of this initialization follows the list below. From left to right, we show the input (xT), the denoised output from the optimized initialization (xTpred), the denoised output (xt), the delta (δt), the next noised input (zs), and the target real image (x0).

Super-resolution (xT, xTpred, xt, δt, zs, x0)

De-blurring (xT, xTpred, xt, δt, zs, x0)
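The gradient-based initialization can be sketched as follows. Treating the conditioning image y as xT, reusing the assumed model(z, t) interface from above, and the Adam settings are all our choices for illustration, not the paper's exact procedure.

  import torch

  def optimize_init(model, y, alpha_T, sigma_T, T=250, steps=30, lr=0.1):
      # Optimize the initial noise for 30 gradient steps so that the
      # model's denoised prediction at t = T stays consistent with the
      # conditioning input y, before the usual DDIM sampling starts.
      eps = torch.randn_like(y, requires_grad=True)
      opt = torch.optim.Adam([eps], lr=lr)
      for _ in range(steps):
          z_T = alpha_T * y + sigma_T * eps
          x_pred, _ = model(z_T, T)          # the xTpred column in the videos
          loss = (x_pred - y).pow(2).mean()  # consistency with the input
          opt.zero_grad()
          loss.backward()
          opt.step()
      return (alpha_T * y + sigma_T * eps).detach()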


Latent Space Manipulation


We show video comparisons on AFHQ 256×256 from f-DM-DS with fully random or partially fixed noise as input; a sketch of how such partially fixed noise can be assembled follows the list below. We also include the averaged images for reference. All samples are generated through DDIM (η=0) with 250 steps. This demonstrates that f-DM-DS learns multi-scale representations in its latent space.

Random
Fixed up to 16x16
Fixed up to 32x32
Fixed up to 64x64
Fixed up to 128x128
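A minimal sketch of how such partially fixed noise can be assembled for f-DM-DS, whose latents live at multiple resolutions; the names stage_shapes/fix_up_to are ours, not the paper's.

  import torch

  def make_partial_noise(stage_shapes, fixed, fix_up_to):
      # Stages whose resolution is at most fix_up_to reuse the shared
      # reference noise; finer stages draw fresh Gaussian noise, so only
      # the fine-scale details vary across samples.
      noises = []
      for shape, ref in zip(stage_shapes, fixed):
          if shape[-1] <= fix_up_to:
              noises.append(ref)                 # coarse stage: keep fixed noise
          else:
              noises.append(torch.randn(shape))  # fine stage: resample
      return noises

  # Example: stage resolutions 256 down to 16; fix everything up to 32x32.
  shapes = [(1, 3, r, r) for r in (256, 128, 64, 32, 16)]
  fixed = [torch.randn(s) for s in shapes]
  noises = make_partial_noise(shapes, fixed, fix_up_to=32)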



We show video comparisons on AFHQ 256×256 from f-DM-DS, f-DM-VQVAE, and the baseline (standard DDPM). All samples are generated through DDIM (η=0) with 250 steps, where we interpolate the initial DDIM noises with spherical linear interpolation (slerp; see the sketch after the list below). This demonstrates that f-DMs learn a semantically smoother latent space than the baseline.

Baseline (DDPM)
f-DM-DS
f-DM-VQVAE
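Slerp itself is the standard spherical interpolation between Gaussian noises; a self-contained version is below. For f-DM, it would be applied to each stage's initial noise separately.

  import torch

  def slerp(z0, z1, lam):
      # Interpolate along the great circle between z0 and z1 with
      # weight lam in [0, 1]; falls back to lerp when nearly parallel.
      a, b = z0.flatten(), z1.flatten()
      omega = torch.acos(torch.clamp(
          torch.dot(a / a.norm(), b / b.norm()), -1.0, 1.0))
      so = torch.sin(omega)
      if so.abs() < 1e-8:
          return (1.0 - lam) * z0 + lam * z1
      return (torch.sin((1.0 - lam) * omega) / so) * z0 + \
             (torch.sin(lam * omega) / so) * z1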



Ablation Studies


We show video comparisons on FFHQ 256×256 from f-DM-DS trained with (right) and without (left) the interpolation formulation. There are no delta predictions (δ) when the model is trained without interpolation. At each step, we apply the same noise inputs in both cases for a fair comparison.



We show video comparisons on FFHQ 256×256 from f-DM-DS with (right) and without (left) the proposed rescaling method. Without rescaling, the intermediate images (zs) are much noisier. At each step, we apply the same noise inputs in both cases for a fair comparison.
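The need for rescaling can be checked directly: average-pooling i.i.d. Gaussian noise shrinks its standard deviation by the pooling factor, so whenever the signal is sub-sampled the noise level falls out of sync with the schedule unless it is adjusted. A tiny self-contained check:

  import torch
  import torch.nn.functional as F

  noise = torch.randn(1, 1, 256, 256)
  pooled = F.avg_pool2d(noise, 2)        # 2x2 average pooling
  print(noise.std().item())              # ~1.0
  print(pooled.std().item())             # ~0.5: std shrinks by the factor 2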





Citation



  @article{gu2022fdm,
    title={f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation},
    author={Gu, Jiatao and Zhai, Shuangfei and Zhang, Yizhe and Bautista, Miguel Angel and Susskind, Josh},
    journal={arXiv preprint arXiv:2210.04955},
    year={2022}
  }