
BOOT🥾: Data-free Distillation of Denoising Diffusion Models with Bootstrapping

arXiv 2023


Random 256×256 samples from our single-step student models distilled from DeepFloyd IF, with prompts from DiffusionDB.

Abstract


Diffusion models have demonstrated excellent potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Existing distillation methods either require significant amounts of offline computation to generate synthetic training data or need to perform expensive online learning with the help of real data. In this work, we present BOOT, a novel technique that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion teacher at any given time step. Such a model can be efficiently trained by bootstrapping from two consecutively sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for conventional methods because their training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmarks, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that BOOT is able to handle highly complex distributions, shedding light on efficient generative modeling.



Idea: Predicting the Signal-ODE Trajectory


We propose BOOT, a data-free knowledge distillation method for denoising diffusion models based on bootstrapping. Unlike previous works, BOOT predicts every point x_t along the diffusion trajectory from the same noise input ϵ and a time indicator t. Since the student always takes pure Gaussian noise as input, there is no need to sample from real data. To avoid predicting noisy images directly, we learn the student from a novel Signal-ODE that operates in the low-frequency signal space. Below is an illustration of a standard diffusion model next to our distilled student, which can predict all timesteps in parallel.
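
To make the idea concrete, here is a minimal PyTorch sketch of a time-conditioned student. The architecture and names are our own illustration, not the paper's code: the point is only that a network taking (ϵ, t) as input can be queried at any trajectory point from one noise sample, even for all timesteps in parallel.

    import torch
    import torch.nn as nn

    class TimeConditionedStudent(nn.Module):
        """Toy time-conditioned student G(eps, t): given the SAME Gaussian
        noise eps, it predicts the signal-space trajectory point y_t for
        ANY timestep t (a sketch, not the paper's architecture)."""

        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim + 1, 256), nn.SiLU(),
                nn.Linear(256, 256), nn.SiLU(),
                nn.Linear(256, dim),
            )

        def forward(self, eps, t):
            # eps: (B, dim) pure noise; t: (B, 1) timestep in [0, 1].
            return self.net(torch.cat([eps, t], dim=-1))

    student = TimeConditionedStudent(dim=64)
    eps = torch.randn(1, 64)                             # one shared noise input
    ts = torch.linspace(1.0, 0.0, steps=5).unsqueeze(1)  # five query timesteps
    # Because the student is conditioned on t, the whole trajectory can be
    # evaluated in parallel by batching the same noise with different times.
    y_all = student(eps.expand(len(ts), -1), ts)         # shape: (5, 64)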


Training with Bootstrapping


Learning from the same noise input also enables bootstrapping, which avoids costly evaluation of the diffusion teacher at training time. The training pipeline of BOOT is shown below. Here, s and t are two consecutive timesteps with s < t. Starting from a noise map ϵ, the BOOT objective minimizes the difference between the student's output at timestep s and the output of stacking the same student with one teacher step at the earlier (noisier) timestep t. The whole process is data-free.
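
The following sketch illustrates this bootstrapped objective under stated assumptions: `teacher_step` stands in for a frozen-teacher update of the Signal-ODE, the stop-gradient on the target side follows common practice for bootstrapped targets, and all hyperparameters are placeholders. It reuses `TimeConditionedStudent` from the sketch above.

    import torch
    import torch.nn.functional as F

    def boot_loss(student, teacher_step, eps, t, s):
        """One BOOT-style training step (a sketch; `student` and
        `teacher_step` are assumed callables, not the released code).
        teacher_step(y, t, s) advances the Signal-ODE with the frozen
        teacher from the noisier timestep t to the consecutive s < t."""
        with torch.no_grad():
            # Bootstrap target: the student's own prediction at t, pushed
            # one teacher step towards s. No real data is ever touched.
            target = teacher_step(student(eps, t), t, s)
        # Train the student's direct prediction at s to match that target.
        return F.mse_loss(student(eps, s), target)

    # Minimal data-free loop: every "training example" is fresh noise.
    dim, batch, dt = 64, 8, 0.05
    student = TimeConditionedStudent(dim)            # from the sketch above
    teacher_step = lambda y, t, s: y                 # placeholder teacher
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(10):
        eps = torch.randn(batch, dim)
        t = torch.rand(batch, 1) * (1.0 - dt) + dt   # noisier of the pair
        s = t - dt                                   # consecutive step, s < t
        loss = boot_loss(student, teacher_step, eps, t, s)
        opt.zero_grad(); loss.backward(); opt.step()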


Examples of Controllable Generation


All images below are generated from the same noise with various prompts in a single step. The student model is distilled from DeepFloyd-IF at 64×64 resolution.


A ⟨…⟩ wearing ⟨…⟩ in the style of ⟨…⟩.

All images below are generated from the same noise with various prompts in a single step. The student model is distilled from Stable Diffusion at 512×512 resolution.


⟨…⟩, a portrait of ⟨…⟩, light and shadow contrast, high detail.
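
For clarity, the sketch below mimics this protocol with toy stand-ins (none of these names are the released models): the noise map is sampled once and reused, and only the prompt embedding changes between single-step generations.

    import torch

    def encode_prompt(prompt):
        # Stand-in text encoder: a deterministic pseudo-embedding per prompt.
        g = torch.Generator().manual_seed(sum(ord(c) for c in prompt))
        return torch.randn(768, generator=g)

    def student_single_step(eps, text_emb):
        # Stand-in one-step student: noise + prompt embedding -> image tensor.
        return torch.tanh(eps + text_emb.mean())

    eps = torch.randn(1, 3, 64, 64)   # the SAME noise map for every prompt
    prompts = [
        "a portrait, watercolor, light and shadow contrast, high detail",
        "a portrait, oil painting, light and shadow contrast, high detail",
    ]
    images = [student_single_step(eps, encode_prompt(p)) for p in prompts]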



Interactive Demo Video: Comparison with the Diffusion Model





Interactive Demo Video: Additional Results of BOOT






Citation


    @article{boot,
        title={BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping},
        author={Gu, Jiatao and Zhai, Shuangfei and Zhang, Yizhe and Liu, Lingjie and Susskind, Josh},
        journal={arXiv preprint arXiv:2306.05544},
        year={2023}
    }