f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation

ICLR 2023


Apple  

Abstract


TLDR: We propose f-DM, an end-to-end, non-cascaded diffusion model that allows progressive signal transformations along the diffusion process.

Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains. Standard DMs can be viewed as an instantiation of hierarchical variational autoencoders (VAEs) where the latent variables are inferred from input-centered Gaussian distributions with fixed scales and variances. Unlike VAEs, this formulation prevents DMs from changing the latent spaces and learning abstract representations. In this work, we propose f-DM, a generalized family of DMs which allows progressive signal transformation. More precisely, we extend DMs to incorporate a set of (hand-designed or learned) transformations, where the transformed input is the mean of each diffusion step. We propose a generalized formulation and derive the corresponding denoising objective with a modified sampling algorithm. As a demonstration, we apply f-DM to image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations based on the encoders of pretrained VAEs. In addition, we identify the importance of adjusting the noise levels whenever the signal is sub-sampled and propose a simple rescaling recipe. f-DM produces high-quality samples on standard image generation benchmarks such as FFHQ, AFHQ, LSUN, and ImageNet, with better efficiency and semantic interpretation.



Overview of Our Method


Our method is motivated by comparing standard DMs with hierarchical VAEs; we propose f-DM, which allows progressive signal transformation during the diffusion process. We show the architecture and the diffusion algorithm below. For more technical details, please refer to the paper.
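To make the forward process concrete, here is a minimal sketch of the stage-wise corruption for the downsampling variant (f-DM-DS). This is our simplified reading rather than the released implementation: we assume the T steps are split evenly across stages, a linear in-stage interpolation weight, nearest-neighbor upsampling, and hypothetical schedule arrays alphas/sigmas.

  import torch
  import torch.nn.functional as F

  def fdm_forward(x0, t, T, num_stages, alphas, sigmas):
      # Progressively transformed signals: signals[k] applies k rounds of
      # 2x average pooling (the f-DM-DS choice of transformation f).
      signals = [x0]
      for _ in range(num_stages):
          signals.append(F.avg_pool2d(signals[-1], 2))

      steps_per_stage = T // num_stages
      k = min(t // steps_per_stage, num_stages - 1)    # current stage index
      w = (t - k * steps_per_stage) / steps_per_stage  # in-stage progress in [0, 1]

      # Within a stage, the clean signal interpolates between the current
      # stage signal and the (upsampled) next one; the gap delta_t is the
      # signal degradation that the f-DM network learns to predict.
      nxt = F.interpolate(signals[k + 1], scale_factor=2, mode="nearest")
      x_t = (1.0 - w) * signals[k] + w * nxt
      delta_t = x_t - nxt

      z_t = alphas[t] * x_t + sigmas[t] * torch.randn_like(x_t)
      return z_t, x_t, delta_t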


Generation of Various Models


We show video comparisons of the reverse diffusion process on AFHQ 256×256 for f-DMs and a standard DM (DDPM). For the standard DM, we visualize the denoised xt and the next noised input zs at each diffusion step. For f-DMs, we additionally visualize the predicted δt (signal degradation). For the latent-space models, we plot the first three channels of the VQVAE/VQGAN latent variables. All samples are generated via 250 DDPM steps. Note that low-resolution intermediates (f-DM-DS, f-DM-VQVAE) are resized for ease of visualization. A hedged sketch of the reverse loop these videos visualize follows the list below.

Standard DM (xt, zs)
f-DM-DS (xt, δt, zs)

f-DM-Blur-G (xt, δt, zs)
f-DM-Blur-U (xt, δt, zs)

f-DM-VQVAE (xt, δt, zs)
f-DM-VQGAN (xt, δt, zs)
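Below is a hedged sketch of the reverse loop these videos visualize. The interface model(z, t) returning the denoised xt and the degradation δt, the in-stage weight array w, and the helpers boundaries/to_finer are our assumptions for illustration, not the paper's exact API; the update from xt to the clean signal at s = t − 1 follows from the interpolation identity in the forward sketch above.

  import torch

  @torch.no_grad()
  def fdm_reverse(model, z, alphas, sigmas, w, boundaries, to_finer):
      T = len(alphas) - 1
      x_t = z
      for t in range(T, 0, -1):
          x_t, delta_t = model(z, t)  # denoised signal + predicted degradation
          s = t - 1
          # From x_t = (1 - w_t) * cur + w_t * nxt and delta_t = x_t - nxt,
          # the clean signal one step back is
          #   x_s = x_t + (w_t - w_s) / (1 - w_t) * delta_t
          # (the epsilon guards the stage boundary where w_t -> 1).
          x_s = x_t + (w[t] - w[s]) / (1.0 - w[t] + 1e-8) * delta_t
          eps = torch.randn_like(x_s) if s > 0 else torch.zeros_like(x_s)
          z = alphas[s] * x_s + sigmas[s] * eps  # next noised input z_s
          if s in boundaries:  # stage switch: restore the finer resolution
              z = to_finer(z)
      return x_t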




Generation on Various Datasets


We show additional generation results on LSUN Church/Bed 256×256, class-conditional ImageNet 256×256, and FFHQ 1024×1024. For f-DM-VQVAE, we plot the first three channels of the latent variables. All faces presented are synthesized by the models and are not real identities.

LSUN Bed 256×256 with f-DM-DS (xt, δt, zs)

LSUN Church 256×256 with f-DM-Blur-U (xt, δt, zs)

Class-conditional ImageNet 256×256 with f-DM-VQVAE (xt, δt, zs)
macaw (class=88)
daisy (class=985)


seashore (class=978)
pizza (class=963)


FFHQ 1024×1024 with f-DM-DS (xt, δt, zs)


Conditional Generation


We show video comparisons on AFHQ 256×256 from f-DM-DS and f-DM-Blur-G, taking low-resolution (16×16) and Gaussian-blurred images as inputs, respectively. All samples are generated through DDIM (η=0) with 250 steps, together with gradient-based initialization (30 gradient steps before diffusion starts); a sketch of this initialization follows the list below. From left to right, we show the input (xT), the denoised output from the optimized initialization (xTpred), the denoised output (xt), the delta (δt), the next noised input (zs), and the target real image (x0).

Super-resolution (xT, xTpred, xt, δt, zs, x0)

De-blurring (xT, xTpred, xt, δt, zs, x0)
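The gradient-based initialization can be sketched as follows. Treating the conditioning image y as xT, reusing the assumed model(z, t) interface from above, and the Adam settings are all our choices for illustration, not the paper's exact procedure.

  import torch

  def optimize_init(model, y, alpha_T, sigma_T, T=250, steps=30, lr=0.1):
      # Optimize the initial noise for 30 gradient steps so that the
      # model's denoised prediction at t = T stays consistent with the
      # conditioning input y, before the usual DDIM sampling starts.
      eps = torch.randn_like(y, requires_grad=True)
      opt = torch.optim.Adam([eps], lr=lr)
      for _ in range(steps):
          z_T = alpha_T * y + sigma_T * eps
          x_pred, _ = model(z_T, T)          # the xTpred column in the videos
          loss = (x_pred - y).pow(2).mean()  # consistency with the input
          opt.zero_grad()
          loss.backward()
          opt.step()
      return (alpha_T * y + sigma_T * eps).detach()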


Latent Space Manipulation


We show video comparisons on AFHQ 256×256 from f-DM-DS with fully random or partially fixed noise as input; a sketch of how such partially fixed noise can be assembled follows the list below. We also include the averaged images for reference. All samples are generated through DDIM (η=0) with 250 steps. This demonstrates that f-DM-DS learns multi-scale representations in its latent space.

Random
Fixed up to 16x16
Fixed up to 32x32
Fixed up to 64x64
Fixed up to 128x128
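A minimal sketch of how such partially fixed noise can be assembled for f-DM-DS, whose latents live at multiple resolutions; the names stage_shapes/fix_up_to are ours, not the paper's.

  import torch

  def make_partial_noise(stage_shapes, fixed, fix_up_to):
      # Stages whose resolution is at most fix_up_to reuse the shared
      # reference noise; finer stages draw fresh Gaussian noise, so only
      # the fine-scale details vary across samples.
      noises = []
      for shape, ref in zip(stage_shapes, fixed):
          if shape[-1] <= fix_up_to:
              noises.append(ref)                 # coarse stage: keep fixed noise
          else:
              noises.append(torch.randn(shape))  # fine stage: resample
      return noises

  # Example: stage resolutions 256 down to 16; fix everything up to 32x32.
  shapes = [(1, 3, r, r) for r in (256, 128, 64, 32, 16)]
  fixed = [torch.randn(s) for s in shapes]
  noises = make_partial_noise(shapes, fixed, fix_up_to=32)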



We show video comparisons on AFHQ 256×256 from f-DM-DS, f-DM-VQVAE, and the baseline (standard DDPM). All samples are generated through DDIM (η=0) with 250 steps, where we interpolate the initial DDIM noises with spherical linear interpolation (slerp; see the sketch after the list below). This demonstrates that f-DMs learn a semantically smoother latent space than the baseline.

Baseline (DDPM)
f-DM-DS
f-DM-VQVAE
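Slerp itself is the standard spherical interpolation between Gaussian noises; a self-contained version is below. For f-DM, it would be applied to each stage's initial noise separately.

  import torch

  def slerp(z0, z1, lam):
      # Interpolate along the great circle between z0 and z1 with
      # weight lam in [0, 1]; falls back to lerp when nearly parallel.
      a, b = z0.flatten(), z1.flatten()
      omega = torch.acos(torch.clamp(
          torch.dot(a / a.norm(), b / b.norm()), -1.0, 1.0))
      so = torch.sin(omega)
      if so.abs() < 1e-8:
          return (1.0 - lam) * z0 + lam * z1
      return (torch.sin((1.0 - lam) * omega) / so) * z0 + \
             (torch.sin(lam * omega) / so) * z1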



Ablation Studies


We show video comparisons on FFHQ 256×256 from f-DM-DS trained with (right) and without (left) the interpolation formulation. There are no delta predictions (δ) when the model is trained without interpolation. At each step, we apply the same noise inputs in both cases for a fair comparison.



We show video comparisons on FFHQ 256×256 from f-DM-DS with (right) and without (left) the proposed rescaling method. Without rescaling, the intermediate images (zs) are much noisier. At each step, we apply the same noise inputs in both cases for a fair comparison.
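The need for rescaling can be checked directly: average-pooling i.i.d. Gaussian noise shrinks its standard deviation by the pooling factor, so whenever the signal is sub-sampled the noise level falls out of sync with the schedule unless it is adjusted. A tiny self-contained check:

  import torch
  import torch.nn.functional as F

  noise = torch.randn(1, 1, 256, 256)
  pooled = F.avg_pool2d(noise, 2)        # 2x2 average pooling
  print(noise.std().item())              # ~1.0
  print(pooled.std().item())             # ~0.5: std shrinks by the factor 2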





Citation



  @article{gu2022fdm,
    title={f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation},
    author={Gu, Jiatao and Zhai, Shuangfei and Zhang, Yizhe and Bautista, Miguel Angel and Susskind, Josh},
    journal={arXiv preprint arXiv:2210.04955},
    year={2022}
  }