Diffusion models have demonstrated excellent potential for generating diverse images. However, their performance often suffers from slow generation due to iterative denoising. Existing distillation methods either require significant amounts of offline computation for generating synthetic training data or need to perform expensive online learning with the help of real data. In this work, we present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm. The core idea is to learn a time-conditioned model that predicts the output of a pre-trained diffusion model teacher given any time step. Such a model can be efficiently trained based on bootstrapping from two consecutive sampled steps. Furthermore, our method can be easily adapted to large-scale text-to-image diffusion models, which are challenging for conventional methods because their training sets are often large and difficult to access. We demonstrate the effectiveness of our approach on several benchmarks, achieving comparable generation quality while being orders of magnitude faster than the diffusion teacher. The text-to-image results show that BOOT is able to handle highly complex distributions, shedding light on efficient generative modeling.
We propose BOOT, a data-free knowledge distillation method for denoising diffusion models based on bootstrapping. Unlike previous work, BOOT predicts all possible x_t along the diffusion trajectory given the same noise point ϵ and a time indicator t. Since our model always reads pure Gaussian noise, there is no need to sample from real data. To avoid predicting noisy images directly, we learn the student model from a novel Signal-ODE that operates in the low-frequency signal space. The illustration below compares a standard diffusion model with our distilled student model, which can predict all timesteps in parallel.
Learning from the same noise input also enables bootstrapping, which avoids costly evaluation of the diffusion model during training. Details of the training pipeline of BOOT are shown below. s and t are two consecutive timesteps where s < t. Given a noise map ϵ, the BOOT objective minimizes the difference between the output of the student model at timestep s and the output obtained by stacking the same student model and the teacher model at timestep t, which comes earlier in the sampling order. The whole process is data-free.
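The bootstrapping idea can be illustrated with a deliberately tiny, scalar sketch. Everything in it is an assumption made for illustration and not the BOOT implementation: the "dataset" is a single point `MU`, the teacher is the exact denoiser for that dataset applied through one deterministic DDIM-style step, the noise schedule is a cosine schedule, and the student is a lookup table `g(eps, t) = a[t] * eps + b[t]` that predicts x_t directly instead of working in the signal space of the Signal-ODE. What it does share with the description above is the structure of the loss: the student output at s is regressed onto the teacher applied to the student's own output at t, and only noise is ever sampled.

```python
import math
import random

T = 8      # number of discrete timesteps (toy value, an assumption)
MU = 2.0   # toy "dataset": every real sample equals MU (an assumption)

def alpha(t):
    """Signal coefficient of a cosine schedule: alpha(0) = 1, alpha(T) ~ 0."""
    return math.cos(0.5 * math.pi * t / T)

def sigma(t):
    """Noise coefficient of a cosine schedule: sigma(0) = 0, sigma(T) = 1."""
    return math.sin(0.5 * math.pi * t / T)

def teacher_step(x_t, t, s):
    """One deterministic DDIM-style step from t to s with an exact toy teacher."""
    x0_hat = MU  # optimal denoiser output when the data is a point mass at MU
    eps_hat = (x_t - alpha(t) * x0_hat) / sigma(t)
    return alpha(s) * x0_hat + sigma(s) * eps_hat

# Tabular student g(eps, t) = a[t] * eps + b[t].
# The entry at t = T is pinned to the identity, so g(eps, T) = eps.
a = [0.0] * T + [1.0]
b = [0.0] * (T + 1)

def student(eps, t):
    return a[t] * eps + b[t]

def train(steps=20000, lr=0.02, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        eps = rng.gauss(0.0, 1.0)   # only noise is sampled, never data
        t = rng.randint(1, T)       # timestep t in 1..T (inclusive)
        s = t - 1                   # consecutive timestep with s < t
        # Bootstrapped target: teacher applied to the student's own output at t.
        # It is treated as a constant; only the parameters at s are updated.
        target = teacher_step(student(eps, t), t, s)
        err = student(eps, s) - target
        a[s] -= lr * 2.0 * err * eps   # gradient of err**2 w.r.t. a[s]
        b[s] -= lr * 2.0 * err         # gradient of err**2 w.r.t. b[s]

train()
```

In this toy setting the exact trajectory is known in closed form, `alpha(t) * MU + sigma(t) * eps`, so the trained table can be checked against it at every timestep. After training, `student(eps, 0)` produces a sample with a single evaluation, and the teacher is never unrolled over multiple steps during training.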
All images below were generated in a single step from the same noise with various prompts. The student model is distilled from DeepFloyd-IF 64x64.
All images below were generated in a single step from the same noise with various prompts. The student model is distilled from StableDiffusion 512x512.
@article{boot,
title={BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping},
author={Gu, Jiatao and Zhai, Shuangfei and Zhang, Yizhe and Liu, Lingjie and Susskind, Josh},
journal={arXiv preprint arXiv:2306.05544},
year={2023}
}