Gradients for Time-Scheduled Conditional Variables in Neural Differential Equations

A short derivation of the continuous adjoint equations for time-scheduled conditional variables.

$$ \newcommand{\R}{\mathbb{R}} \newcommand{\X}{\mathcal{X}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\B}{\mathcal{B}} \newcommand{\T}{\mathbb{T}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathcal{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\pr}{\mathbb{P}} \newcommand{\bfx}{\mathbf{x}} \newcommand{\bfy}{\mathbf{y}} \newcommand{\bfz}{\mathbf{z}} \newcommand{\bfa}{\mathbf{a}} \newcommand{\bfw}{\mathbf{w}} \newcommand{\bfA}{\mathbf{A}} \newcommand{\bfV}{\mathbf{V}} \newcommand{\bsf}{\boldsymbol{f}} \newcommand{\bsg}{\boldsymbol{g}} \newcommand{\bseps}{\boldsymbol{\epsilon}} \newcommand{\rmd}{\mathrm{d}} \DeclareMathOperator{\var}{Var} \DeclareMathOperator{\ex}{\mathbb{E}} \DeclareMathOperator{\argmax}{arg\,max} \DeclareMathOperator{\argmin}{arg\,min} \newtheorem{proposition}{Proposition} $$

Introduction

The advent of large-scale diffusion models conditioned on text embeddings has allowed for creative control over the generative process. A recent and powerful technique is prompt scheduling: instead of passing a fixed prompt to the diffusion model, the prompt can be changed depending on the timestep. This concept was initially proposed by Doggettx in this Reddit post, and the code changes to the Stable Diffusion repository can be seen here.

Examples of the prompt scheduling technique proposed by Doggettx.
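
In code, such a schedule can be as simple as the following minimal sketch. The names `make_z_schedule`, `emb_a`, `emb_b`, and `switch_t` are illustrative stand-ins, not taken from Doggettx's implementation.

```python
# A minimal sketch of a prompt schedule: condition on one text embedding
# early in sampling and switch to another partway through.
def make_z_schedule(emb_a, emb_b, switch_t=0.5):
    def z_schedule(t):
        # sampling runs from t = T (noise) down to t = 0 (image), so
        # emb_a conditions the early, coarse-structure steps
        return emb_a if t > switch_t else emb_b
    return z_schedule
```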

More generally, we can view this as having the conditional information (in this case, text embeddings) scheduled w.r.t. time. Formally, assume we have a U-Net trained on the noise-prediction task \(\bseps_\theta(\bfx_t, \bfz, t)\) conditioned on a time-scheduled text embedding \(\bfz(t)\). The sampling procedure amounts to solving the probability flow ODE from time \(T\) to time \(0\): \(\begin{equation} \frac{\rmd \bfx_t}{\rmd t} = f(t)\bfx_t + \frac{g^2(t)}{2\sigma_t}\bseps_\theta(\bfx_t, \bfz(t), t), \end{equation}\) where \(f, g\) define the drift and diffusion coefficients of a Variance Preserving (VP) type SDE.
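
To make the sampling procedure concrete, here is a minimal forward Euler discretization of this ODE. It is a sketch under stated assumptions: the callables `f(t)`, `g(t)`, `sigma(t)`, and `eps_theta(x, z, t)` stand in for the VP SDE coefficients and the trained U-Net, and `z_schedule` is a schedule such as the one above.

```python
import numpy as np

# Euler integration of the probability flow ODE from t = T down to t = 0,
# with the conditioning embedding looked up from the schedule at each step.
def sample_prob_flow_ode(x_T, f, g, sigma, eps_theta, z_schedule,
                         T=1.0, n_steps=50):
    ts = np.linspace(T, 0.0, n_steps + 1)
    x = x_T
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t                  # negative step size
        z = z_schedule(t)                # time-scheduled conditioning
        dx_dt = f(t) * x + g(t) ** 2 / (2.0 * sigma(t)) * eps_theta(x, z, t)
        x = x + dt * dx_dt
    return x
```

In practice one would use a higher-order solver, but the structure, a single query of \(\bfz(t)\) per step, is the same.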

Training-free guidance

A closely related area of active research has been the development of techniques which search for the optimal generation parameters.

More specifically, they attempt to solve the following optimization problem: \(\begin{equation} \label{eq:problem_stmt_ode} \argmin_{\bfx_T, \bfz, \theta}\quad \mathcal{L}\bigg(\bfx_T + \int_T^0 f(t)\bfx_t + \frac{g^2(t)}{2\sigma_t}\bseps_\theta(\bfx_t, \bfz, t)\;\rmd t\bigg), \end{equation}\) where \(\mathcal L\) is a real-valued loss function on the output \(\bfx_0\).

Several recent works this year explore solving the continuous adjoint equations to find the gradients: \(\begin{equation} \frac{\partial \mathcal L}{\partial \bfx_t}, \qquad \frac{\partial \mathcal L}{\partial \bfz}, \qquad \frac{\partial \mathcal L}{\partial \theta}. \end{equation}\) These gradients can then be used in combination with gradient descent algorithms to solve the optimization problem, as sketched below. However, what if \(\bfz\) is scheduled and not constant w.r.t. time?
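
Before turning to the scheduled case, here is a minimal sketch of how such gradients drive the outer optimization. The routine `loss_and_grad` is hypothetical: it stands for one forward ODE solve followed by one backward adjoint solve, returning the loss and \(\partial \mathcal L / \partial \bfx_T\).

```python
# Plain gradient descent on the initial noise x_T using adjoint gradients.
def optimize_initial_noise(x_T, loss_and_grad, lr=0.1, n_iters=20):
    for _ in range(n_iters):
        loss, grad_x_T = loss_and_grad(x_T)  # forward + backward solves
        x_T = x_T - lr * grad_x_T
    return x_T
```

The same loop applies to \(\bfz\) or \(\theta\) once their gradients are available.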

Problem statement. Given \(\begin{equation} \bfx_0 = \bfx_T + \int_T^0 f(t)\bfx_t + \frac{g^2(t)}{2\sigma_t}\bseps_\theta(\bfx_t, \bfz(t), t)\;\rmd t, \end{equation}\) and \(\mathcal L (\bfx_0)\), find: \(\begin{equation} \frac {\partial \mathcal L}{\partial \bfz(t)}, \qquad t \in [0,T]. \end{equation}\)

In an earlier blog post we showed how to find \(\partial \mathcal L / \partial \bfz\) by solving the continuous adjoint equations. How do the continuous adjoint equations change when we replace \(\bfz\) with a time-scheduled \(\bfz(t)\) in the sampling equation? What we will now show is that

We can simply replace \(\bfz\) with \(\bfz(t)\) in the continuous adjoint equations.

While this result is intuitive, it does require some technical details to show.

Gradients of time-scheduled conditional variables

It is well known that diffusion models are a special type of neural differential equation, either a neural ODE or a neural SDE. As such, we will show that this result holds more generally for neural ODEs.

Theorem (Continuous adjoint equations for time-scheduled conditional variables). Suppose there exists a function \(\bfz: [0,T] \to \R^z\) which can be defined as a càdlàg (French: continue à droite, limite à gauche) piecewise function, where \(\bfz\) is continuous on each subinterval of the partition \(\Pi = \{0 = t_0 < t_1 < \cdots < t_n = T\}\) of \([0, T]\) and whose right derivative exists for all \(t \in [0,T]\). Let \(\bsf_\theta: \R^d \times \R^z \times [0,T] \to \R^d\) be continuous in \(t\), uniformly Lipschitz in \(\bfy\), and continuously differentiable in \(\bfy\) and \(\bfz\). Let \(\bfy: [0, T] \to \R^d\) be the unique solution of the ODE \(\begin{equation} \frac{\rmd \bfy}{\rmd t}(t) = \bsf_\theta(\bfy(t), \bfz(t), t), \end{equation}\) with initial condition \(\bfy(0) = \bfy_0\). Then \(\partial \mathcal L / \partial \bfz(t) := \bfa_\bfz(t)\) and there exists a unique solution \(\bfa_\bfz: [0, T] \to \R^z\) to the following initial value problem: \(\begin{equation} \bfa_\bfz(T) = \mathbf 0, \qquad \frac{\rmd \bfa_\bfz}{\rmd t}(t) = - \bfa_\bfy(t)^\top \frac{\partial \bsf_\theta(\bfy(t), \bfz(t), t)}{\partial \bfz(t)}. \end{equation}\)
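
The backward solve the theorem describes can be sketched in a few lines. This is a minimal Euler discretization under illustrative assumptions: `df_dy` and `df_dz` are hypothetical callables for the Jacobians of \(\bsf_\theta\), `traj` is the stored forward trajectory, and the loss is taken on the terminal state so that \(\bfa_\bfy(T) = \partial \mathcal L / \partial \bfy(T)\).

```python
import numpy as np

# Integrate the adjoint states a_y and a_z backward from t = T to t = 0.
def solve_adjoints(ts, traj, z_schedule, dL_dyT, df_dy, df_dz):
    a_y = dL_dyT                               # a_y(T) = dL/dy(T)
    a_z = np.zeros_like(z_schedule(ts[-1]))    # a_z(T) = 0
    a_z_path = [a_z]
    for k in range(len(ts) - 1, 0, -1):        # t_n = T down to t_1
        t, dt = ts[k], ts[k] - ts[k - 1]
        y, z = traj[k], z_schedule(t)
        # da/dt = -a_y^T (dF/d.), so an explicit Euler step taken backward
        # in time adds +dt a_y^T J; update a_z before a_y so both use a_y
        # evaluated at time t
        a_z = a_z + dt * (a_y @ df_dz(y, z, t))
        a_y = a_y + dt * (a_y @ df_dy(y, z, t))
        a_z_path.append(a_z)
    return a_y, a_z_path[::-1]                 # a_y(0) and a_z on the grid
```

A real implementation would use vector-Jacobian products from an autodiff framework rather than materializing the Jacobians.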

Why càdlàg?

In practice \(\bfz(t)\) is often given by a discrete set \(\{\bfz_k\}_{k=1}^n\), where \(n\) corresponds to the number of discretization steps the numerical ODE solver takes. While the proof is easier for a continuously differentiable function \(\bfz(t)\), we opt for this construction for the sake of generality. We choose a càdlàg piecewise function, a relatively mild assumption, to ensure that we can define the augmented state on each continuous interval of the piecewise function in terms of the right derivative.
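
As a concrete example, the most common schedule in practice is the piecewise-constant step function \(\begin{equation} \bfz(t) = \bfz_k \quad \text{for } t \in [t_{k-1}, t_k), \qquad 1 \leq k \leq n, \end{equation}\) which is càdlàg (right-continuous, with left limits at the jump points \(t_k\)) and whose right derivative is \(\mathbf 0\) wherever it is defined, so the augmented state constructed in the proof below takes a particularly simple form.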

In the remainder of this blog post we will provide the proof of this result. Our proof technique is an extension of the one used by Patrick Kidger (Appendix C.3.1) to prove the existence of the solution to the continuous adjoint equations for neural ODEs.

Proof. Recall that \(\bfz(t)\) is a piecewise function of time with partition \(\Pi\) of the time domain. Without loss of generality we consider some time interval \(\pi = [t_{m-1}, t_m]\) for some \(1 \leq m \leq n\). Consider the augmented state defined on the interval \(\pi\): \(\begin{equation} \frac{\rmd}{\rmd t} \begin{bmatrix} \bfy\\ \bfz \end{bmatrix}(t) = \bsf_{\text{aug}} = \begin{bmatrix} \bsf_\theta(\bfy(t), \bfz(t), t)\\ \overrightarrow\partial\bfz(t) \end{bmatrix}, \end{equation}\) where \(\overrightarrow\partial\bfz: [0,T] \to \R^z\) denotes the right derivative of \(\bfz\) at time \(t\). Let \(\bfa_\text{aug}\) denote the augmented adjoint state \(\begin{equation} \label{eq:app:adjoint_aug} \bfa_\text{aug}(t) := \begin{bmatrix} \bfa_\bfy\\\bfa_\bfz \end{bmatrix}(t). \end{equation}\) Then the Jacobian of \(\bsf_\text{aug}\) is \(\begin{equation} \label{eq:app:jacobian_aug} \frac{\partial \bsf_\text{aug}}{\partial [\bfy, \bfz]} = \begin{bmatrix} \frac{\partial \bsf_\theta(\bfy, \bfz, t)}{\partial \bfy} & \frac{\partial \bsf_\theta(\bfy, \bfz, t)}{\partial \bfz}\\ \mathbf 0 & \mathbf 0\\ \end{bmatrix}. \end{equation}\) The bottom row vanishes because, as the state \(\bfz(t)\) evolves with \(\overrightarrow\partial\bfz(t)\) on the interval \([t_{m-1}, t_m]\) in the forward direction, \(\overrightarrow\partial\bfz(t)\) depends only on time, and so its derivative w.r.t. \(\bfy\) and \(\bfz\) is \(\mathbf 0\). Remark that, as the bottom row of the Jacobian of \(\bsf_\text{aug}\) is identically \(\mathbf 0\) and \(\bsf_\theta\) is continuous in \(t\), we can consider the evolution of \(\bfa_\text{aug}\) over the whole interval \([0,T]\) rather than just a single subinterval of the partition. The evolution of the augmented adjoint state on \([0,T]\) is then given by \(\begin{equation} \frac{\rmd \bfa_\text{aug}}{\rmd t}(t) = -\begin{bmatrix} \bfa_\bfy & \bfa_\bfz \end{bmatrix}(t) \frac{\partial \bsf_\text{aug}}{\partial [\bfy, \bfz]}(t). \end{equation}\) Expanding this product block by block, the first block recovers the standard adjoint equation \(\frac{\rmd \bfa_\bfy}{\rmd t}(t) = -\bfa_\bfy(t)^\top \frac{\partial \bsf_\theta}{\partial \bfy}\), while the second block involves only \(\bfa_\bfy\), since the entries below \(\frac{\partial \bsf_\theta}{\partial \bfz}\) are \(\mathbf 0\). Therefore, \(\bfa_\bfz(t)\) is a solution to the initial value problem: \(\begin{equation} \bfa_\bfz(T) = \mathbf 0, \qquad \frac{\rmd \bfa_\bfz}{\rmd t}(t) = -\bfa_\bfy(t)^\top \frac{\partial \bsf_\theta(\bfy(t), \bfz(t), t)}{\partial \bfz(t)}. \end{equation}\)

Next we show that there exists a unique solution to the initial value problem. As \(\bfy\) and \(\bfz\) are continuous on each subinterval \([t_{m-1}, t_m]\) and \(\bsf_\theta\) is continuously differentiable in \(\bfy\) and \(\bfz\), it follows that \(t \mapsto \frac{\partial \bsf_\theta}{\partial [\bfy, \bfz]}(\bfy(t), \bfz(t), t)\) is a continuous function on the compact set \([t_{m-1}, t_m]\), and as such it is bounded there by some \(L_m > 0\); taking \(L = \max_{1 \leq m \leq n} L_m\) over the finitely many subintervals of \(\Pi\) yields a bound valid on all of \([0,T]\). Hence, for \(\bfa_\bfy \in \R^d\) the map \((t, \bfa_\bfy) \mapsto -\bfa_\bfy \frac{\partial \bsf_\theta}{\partial [\bfy, \bfz]}(\bfy(t), \bfz(t), t)\) is Lipschitz in \(\bfa_\bfy\) with Lipschitz constant \(L\), and this constant is independent of \(t\). Therefore, by the Picard–Lindelöf theorem the solution \(\bfa_\text{aug}(t)\), and in particular \(\bfa_\bfz(t)\), exists and is unique.
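
To close, here is a toy numerical sanity check of the result under purely illustrative choices: a one-dimensional \(\bsf_\theta(\bfy, \bfz, t) = -\bfy + \bfz\), a piecewise-constant (càdlàg) schedule with one \(\bfz_k\) per solver step, the terminal loss \(\mathcal L(\bfy(T)) = \lVert \bfy(T) \rVert^2\), and forward Euler throughout. The adjoint gradient w.r.t. a single \(\bfz_k\) should match a finite-difference estimate.

```python
import numpy as np

# Forward Euler solve of dy/dt = -y + z(t), storing the trajectory.
def solve_forward(y0, z_vals, ts):
    y, traj = y0, [y0]
    for k in range(len(ts) - 1):
        y = y + (ts[k + 1] - ts[k]) * (-y + z_vals[k])
        traj.append(y)
    return traj

ts = np.linspace(0.0, 1.0, 101)
y0 = np.array([1.0])
z_vals = np.ones((100, 1))            # one z per solver step
loss = lambda y: float(y @ y)         # L(y(T)) = ||y(T)||^2

traj = solve_forward(y0, z_vals, ts)
a_y = 2.0 * traj[-1]                  # a_y(T) = dL/dy(T)
grad_z = np.zeros_like(z_vals)
for k in range(len(ts) - 2, -1, -1):  # backward adjoint sweep
    dt = ts[k + 1] - ts[k]
    grad_z[k] = dt * a_y              # a_y^T df/dz accumulated on step k
    a_y = a_y - dt * a_y              # backward step of da_y/dt = a_y (df/dy = -1)

eps, k = 1e-5, 50                     # finite-difference check at one step
z_pert = z_vals.copy(); z_pert[k] += eps
fd = (loss(solve_forward(y0, z_pert, ts)[-1]) - loss(traj[-1])) / eps
print(grad_z[k].item(), fd)           # the two estimates should agree closely
```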