Guided generation and inverse problems require gradients of a loss with respect to inputs of a diffusion sampler. Storing every U‑Net activation along the trajectory is prohibitive in memory; the continuous adjoint method avoids this by solving a backward differential equation, but the adjoint equation is itself stiff. In this work we propose AdjointDEIS, a family of exponential‑integrator solvers for the continuous adjoint equations of diffusion ODEs and SDEs. AdjointDEIS attains $\mathcal{O}(1)$ memory in solver length and $k$-th order convergence; en route we prove that the continuous adjoint of a diffusion SDE with state‑independent diffusion coefficient reduces to a deterministic ODE.
A diffusion model learns to invert a noising process by training a denoiser $\boldsymbol\epsilon_\theta$. Sampling at inference time amounts to numerically solving the probability‑flow ODE — or its stochastic counterpart — from $t=T$ down to $t=0$ with a fast solver such as DPM‑Solver or DEIS. Each solver step calls the U‑Net once; modern samplers produce strong image quality in $N \approx 10$ to $50$ steps.
For many downstream tasks — guided generation, latent optimization, and inverse‑problem solvers — the quantity of interest is not the sample itself but the gradient of some loss $\mathcal{L}(\boldsymbol x_0)$ with respect to an upstream variable: the initial latent $\boldsymbol x_T$, the conditioning vector $\boldsymbol z$, or the model parameters $\theta$. Discretize‑then‑optimize (DTO) is the direct approach: unroll the $N$‑step trajectory as a computation graph, store every U‑Net activation, and backpropagate. Whilst correct, this incurs $\mathcal{O}(N)$ memory in the number of solver steps, which is prohibitive on large image models.
We want $\nabla_{\!\boldsymbol x_T,\boldsymbol z,\theta}\,\mathcal{L}\!\big(\boldsymbol x_0(\boldsymbol x_T,\boldsymbol z,\theta)\big)$, where $\boldsymbol x_0$ is produced by an $N$‑step diffusion sampler — computed cheaply enough to sit inside an optimization loop, and accurately enough to descend the loss.
An alternative is the continuous adjoint method — Pontryagin's adjoint construction, brought into deep learning by Chen et al.'s Neural ODEs. Rather than storing the forward trajectory, one solves a second ODE backwards in time whose state is the gradient. The memory cost is $\mathcal{O}(1)$ in $N$.
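To make the construction concrete, the following minimal sketch applies it to a scalar toy ODE rather than to a diffusion sampler. The vector field `f`, the loss $\tfrac{1}{2}x(t_1)^2$, the forward time direction $0 \to t_1$, and the explicit Euler discretization are all assumptions chosen for illustration; none of them is the solver proposed in this work. The backward loop carries only the current state and the current adjoint, which is where the $\mathcal{O}(1)$ memory cost comes from.

```python
import math

def f(x):
    # Toy vector field standing in for the sampler drift (illustrative assumption).
    return math.sin(x)

def df_dx(x):
    # Derivative of the toy vector field with respect to the state.
    return math.cos(x)

def solve_forward(x0, n_steps, t1=1.0):
    # Explicit Euler from t=0 to t=t1; only the final state is kept.
    h = t1 / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * f(x)
    return x

def adjoint_gradient(x0, n_steps, t1=1.0):
    # Gradient of L = 0.5 * x(t1)**2 with respect to x0 via the continuous adjoint.
    h = t1 / n_steps
    x = solve_forward(x0, n_steps, t1)
    a = x  # terminal condition a(t1) = dL/dx(t1)
    for _ in range(n_steps):
        # Step t -> t - h for da/dt = -a * df/dx and dx/dt = f(x).
        a_prev = a + h * a * df_dx(x)
        x = x - h * f(x)
        a = a_prev
    return a

def finite_difference_gradient(x0, n_steps, eps=1e-6):
    # Central difference through the discretized forward solve, for reference.
    loss = lambda z: 0.5 * solve_forward(z, n_steps) ** 2
    return (loss(x0 + eps) - loss(x0 - eps)) / (2.0 * eps)

if __name__ == "__main__":
    print(adjoint_gradient(0.5, 2000), finite_difference_gradient(0.5, 2000))
```

Under these assumptions the adjoint estimate should agree with the finite-difference gradient of the discretized solve up to the first-order error of the Euler scheme. The gap between the two is the kind of discretization mismatch that separates the continuous adjoint from discretize-then-optimize.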
AdjointDPM applied this construction to deterministic diffusion samplers. Two questions were left open, and are precisely what this paper addresses:
Lifting the Neural‑ODE adjoint construction from ODEs to SDEs would, in general, yield a backward equation that is itself stochastic, inheriting the variance and discretization difficulties of the forward process. We show that for the SDEs that arise in diffusion models, this does not happen: the backward equation is deterministic.
The forward diffusion SDE in Itô form is
$$d\boldsymbol x_t = \boldsymbol f(\boldsymbol x_t, t)\,dt + g(t)\,d\boldsymbol w_t,$$
with drift $\boldsymbol f$ and a state‑independent diffusion coefficient $g(t)$. This is the regime of essentially every modern diffusion model, including the VP and VE SDEs of Song et al. and EDM. Because $g$ does not depend on $\boldsymbol x$, the Itô–Stratonovich correction, which is proportional to $g\,\partial_{\boldsymbol x} g$, vanishes identically, so the Stratonovich form of the equation coincides with the Itô form.
No stochastic term appears. The same statement holds for the parameter and conditioning adjoints.
The argument is short. In Stratonovich form the chain rule reduces to ordinary calculus, and since $g(t)$ does not depend on $\boldsymbol x$ we have $\partial_{\boldsymbol x}\big(g(t)\,d\boldsymbol w_t\big)\equiv 0$. The stochastic term contributes nothing to the adjoint dynamics. The forward trajectory $\boldsymbol x_t$ appearing on the right‑hand side is recovered by integrating the diffusion SDE backwards from $\boldsymbol x_0$, or in practice by storing $\boldsymbol x_T$ together with a small number of waypoints.
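Written out in generic notation, the argument can be sketched as follows. The symbol $\boldsymbol a_{\boldsymbol x}(t) := \partial\mathcal{L}/\partial\boldsymbol x_t$ for the adjoint state and the column-vector convention are assumptions of this sketch and may differ from the notation of the theorem; the general form of the Stratonovich adjoint is the standard one for SDEs with a state-dependent diffusion term.

$$
\begin{aligned}
d\boldsymbol x_t &= \boldsymbol f(\boldsymbol x_t, t)\,dt + g(t)\circ d\boldsymbol w_t, \\
d\boldsymbol a_{\boldsymbol x}(t) &= -\left(\frac{\partial \boldsymbol f(\boldsymbol x_t, t)}{\partial \boldsymbol x_t}\right)^{\!\top}\boldsymbol a_{\boldsymbol x}(t)\,dt \;-\; \underbrace{\left(\frac{\partial g(t)}{\partial \boldsymbol x_t}\right)^{\!\top}}_{=\,0}\boldsymbol a_{\boldsymbol x}(t)\circ d\boldsymbol w_t \\
&= -\left(\frac{\partial \boldsymbol f(\boldsymbol x_t, t)}{\partial \boldsymbol x_t}\right)^{\!\top}\boldsymbol a_{\boldsymbol x}(t)\,dt .
\end{aligned}
$$

The only place the Brownian motion could enter the adjoint dynamics is through the state derivative of the diffusion coefficient, and that derivative is zero here. Randomness still enters indirectly, through the trajectory $\boldsymbol x_t$ at which the Jacobian is evaluated.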
The adjoint of a diffusion SDE is a deterministic backward ODE. It can therefore be discretized with the same exponential‑integrator schemes that one uses on the forward probability‑flow ODE; no Brownian‑tree storage, no variance reduction. Up to a scale factor, the adjoint trajectory is the same object regardless of whether the forward sampler is ODE‑based or SDE‑based.
Theorem 3.1 generalizes AdjointDPM, which treated only the deterministic ODE sampler, to the full stochastic case — without any new variance‑reduction machinery. It also clarifies a folk observation in the guidance literature: gradients estimated through stochastic samplers and through their probability‑flow ODE counterparts behave similarly at the population level. Theorem 3.1 is the structural reason.
The backward ODE is stiff for the same reason the forward probability‑flow ODE is: the dominant component of the drift is a fast linear term whose stiffness scales as $1/\sigma(t)$ near $t=0$. Generic Runge‑Kutta integrators applied to the adjoint pay for this in step count. DPM‑Solver and DEIS address the stiffness on the forward pass by integrating the linear part exactly and approximating only the nonlinear remainder; we apply the same construction to the backward pass.
Write the adjoint dynamics as a stiff linear part plus a smooth nonlinear residual,
where $A(t)$ collects the diagonal stiff terms induced by the noise schedule $(\alpha_t,\sigma_t)$. The exact variation‑of‑constants integral over a backward step $[t_{n+1}, t_n]$ is
Under the standard log‑SNR change of variables $\lambda(t) = \log(\alpha_t/\sigma_t)$, the integrating‑factor term has a closed form and the remaining integral has a smooth integrand expressed in the noise prediction $\boldsymbol\epsilon_\theta$. AdjointDEIS‑$k$ truncates the Taylor expansion of that integrand to order $k$. We instantiate two members of the family:
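The benefit of integrating the linear part exactly can be seen on a scalar toy problem. The sketch below is a generic first-order exponential integrator (exponential Euler), not AdjointDEIS itself: the constant stiffness `K`, the residual $\cos t$, and the step size are illustrative assumptions, and the residual stands in for the network-dependent term.

```python
import math

K = 50.0  # stiffness of the linear part (illustrative value)

def residual(t):
    # Smooth non-stiff remainder; stands in for the network-dependent term.
    return math.cos(t)

def exact(t, y0):
    # Closed-form solution of y' = -K*y + cos(t) with y(0) = y0.
    c = K / (K * K + 1.0)
    return (y0 - c) * math.exp(-K * t) + (K * math.cos(t) + math.sin(t)) / (K * K + 1.0)

def explicit_euler(y0, h, n):
    # Generic explicit scheme: treats the stiff linear term like any other term.
    y, t = y0, 0.0
    for _ in range(n):
        y = y + h * (-K * y + residual(t))
        t += h
    return y

def exponential_euler(y0, h, n):
    # Integrate the linear part exactly; freeze the residual over each step.
    decay = math.exp(-K * h)
    weight = (1.0 - decay) / K  # integral of exp(-K*(h - s)) over s in [0, h]
    y, t = y0, 0.0
    for _ in range(n):
        y = decay * y + weight * residual(t)
        t += h
    return y

if __name__ == "__main__":
    ref = exact(1.0, 1.0)
    print("exact            ", ref)
    print("explicit Euler   ", explicit_euler(1.0, 0.05, 20))
    print("exponential Euler", exponential_euler(1.0, 0.05, 20))
```

With $Kh = 2.5$ the explicit scheme is outside its stability region and diverges, while the exponential scheme stays close to the exact solution with the same 20 steps. Higher-order members of an exponential-integrator family replace the frozen residual with a higher-order approximation of the integrand.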
AdjointDEIS‑1. Single‑step, first‑order. The adjoint analogue of DDIM. One U‑Net call per step.
AdjointDEIS‑2M. Multistep, second‑order. The adjoint analogue of DPM‑Solver‑2M; reuses noise predictions from prior steps and attains second‑order convergence.
The convergence guarantees of AdjointDEIS‑$k$ follow from those of the corresponding forward exponential integrators; the proofs in the paper transfer with little change once the dynamics are written in the appropriate coordinates.
A face morphing attack takes two bona fide identities $a$ and $b$ and produces a single image that a face recognition (FR) system accepts as matching both. We use it as the benchmark application: the task is well‑instrumented, the metric MMPMR (Mated Morph Presentation Match Rate) is unforgiving, and the prior representation‑based attacks (DiM, Fast‑DiM, Morph‑PIPE) are well tuned.
Let $v_a, v_b, v_{ab}$ denote the embeddings of the two bona fide images and the morph under a frozen FR network. The morph $x_{ab}$ should be close to both identities in embedding space; we use a sum‑of‑distances loss together with a symmetry term that penalizes an unbalanced morph:
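As a minimal sketch, a loss of this shape can be written as below. The choice of cosine distance, the unit weight on the symmetry term, and the function names are assumptions made for illustration; the text above fixes only the structure (two distances plus a penalty on their imbalance).

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two embedding vectors (assumed distance).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def morph_identity_loss(v_a, v_b, v_ab):
    # Sum of distances to both identities plus a symmetry penalty.
    d_a = cosine_distance(v_a, v_ab)
    d_b = cosine_distance(v_b, v_ab)
    return d_a + d_b + abs(d_a - d_b)

if __name__ == "__main__":
    v_a, v_b = [1.0, 0.0], [0.0, 1.0]
    print(morph_identity_loss(v_a, v_b, [1.0, 1.0]))  # balanced morph
    print(morph_identity_loss(v_a, v_b, [1.0, 0.0]))  # morph collapsed onto identity a
```

In this toy case the balanced embedding scores lower than the one that collapses onto a single identity, which is the behavior the symmetry term is meant to enforce.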
The optimization variable is the initial latent $x_T$ — we fine‑tune the noise from which the diffusion sampler starts, not the model. Settings: 50 optimization steps, learning rate $0.01$, $N=20$ forward sampling steps, $M=20$ adjoint backward steps. We evaluate on SYN‑MAD 2022 against three modern FR backbones: AdaFace, ArcFace, and ElasticFace.
A morphing attack is multi-objective: the attack must fool every recognizer whilst keeping NFE low. The radar below plots seven attacks against three FR systems (MMPMR at $\mathrm{FMR}=0.1\%$ against AdaFace, ArcFace, ElasticFace) and a compute-efficiency axis ($1-\mathrm{NFE}/2350$). The MMPMR axes are scaled from 65% to 100% to make the separation between attacks legible.
| Attack | NFE ↓ | AdaFace MMPMR (%) ↑ | ArcFace MMPMR (%) ↑ | ElasticFace MMPMR (%) ↑ |
|---|---|---|---|---|
| Webmorph (heuristic) | — | 97.96 | 96.93 | 98.36 |
| MIPGAN-II (GAN) | 50 | 70.55 | 72.19 | 65.24 |
| DiM-A | 350 | 92.23 | 90.18 | 93.05 |
| Fast-DiM | 300 | 92.02 | 90.18 | 93.05 |
| Morph-PIPE | 2350 | 95.91 | 92.84 | 95.50 |
| AdjointDEIS-1 (SDE) (ours) | 2250 | 98.57 | 97.96 | 97.75 |
| AdjointDEIS-1 (ODE) (ours) | 2250 | 99.80 | 98.77 | 99.39 |
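The compute-efficiency axis of the radar plot, $1-\mathrm{NFE}/2350$, can be recomputed from the NFE column. The short sketch below does so; the dictionary layout and function name are illustrative, and Webmorph is omitted because the table reports no NFE for it.

```python
MAX_NFE = 2350  # largest NFE in the table (Morph-PIPE), used as the normalizer

NFE = {
    "MIPGAN-II": 50,
    "DiM-A": 350,
    "Fast-DiM": 300,
    "Morph-PIPE": 2350,
    "AdjointDEIS-1": 2250,
}

def compute_efficiency(nfe, max_nfe=MAX_NFE):
    # Maps NFE to [0, 1]; higher means cheaper.
    return 1.0 - nfe / max_nfe

if __name__ == "__main__":
    for name, nfe in NFE.items():
        print(f"{name:14s} {compute_efficiency(nfe):.3f}")
```

On this axis Morph-PIPE sits at zero by construction and AdjointDEIS-1 only slightly above it, so the gains of AdjointDEIS in the table are in match rate rather than in compute.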
The theoretical results above stand without modification. Our recommendation for the practical computation of these gradients, however, has shifted since publication, and we record the revision here.
Following Kidger's analysis (§5.1.2.3) and our own follow‑up work (Appendix E), we now generally discourage use of the continuous adjoint method for backpropagating through diffusion models. The reason is accuracy: the errors from the backward latent recovery and from the discretization of the adjoint sweep compound, so the resulting gradient estimates are less accurate than those obtained by DTO with recursive gradient checkpointing or with an algebraically reversible solver.
For most modern use cases — particularly when GPU memory is sufficient — discretize‑then‑optimize with recursive gradient checkpointing (for example via Diffrax) is the more reliable approach.
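The idea behind checkpointed discretize-then-optimize can be sketched without any library. The code below uses a single level of checkpoints on a scalar toy recurrence, which is a simplification of the recursive scheme named above; it does not use Diffrax, and the step function, the loss $\tfrac{1}{2}x_N^2$, and the checkpoint spacing `every` are illustrative assumptions.

```python
import math

def step(x, h):
    # One explicit Euler step of a toy sampler.
    return x + h * math.sin(x)

def step_grad(x, h):
    # Derivative of one step with respect to its input state.
    return 1.0 + h * math.cos(x)

def grad_full_storage(x0, n, h):
    # Plain backpropagation: stores all n intermediate states.
    xs = [x0]
    for _ in range(n):
        xs.append(step(xs[-1], h))
    g = xs[-1]  # dL/dx_N for L = 0.5 * x_N**2
    for x in reversed(xs[:-1]):
        g *= step_grad(x, h)
    return g

def grad_checkpointed(x0, n, h, every):
    # Forward pass: keep only every `every`-th state.
    checkpoints = {0: x0}
    x = x0
    for i in range(1, n + 1):
        x = step(x, h)
        if i % every == 0 and i < n:
            checkpoints[i] = x
    g = x  # dL/dx_N
    # Backward pass: recompute each segment from its checkpoint, then backpropagate it.
    end = n
    for start in sorted(checkpoints, reverse=True):
        segment = [checkpoints[start]]
        for _ in range(end - start - 1):
            segment.append(step(segment[-1], h))
        for x_seg in reversed(segment):
            g *= step_grad(x_seg, h)
        end = start
    return g

if __name__ == "__main__":
    print(grad_full_storage(0.5, 20, 0.05), grad_checkpointed(0.5, 20, 0.05, 5))
```

Because the recomputed states are the same states the forward pass produced, the checkpointed gradient matches plain backpropagation, with no additional discretization error. The price is one extra forward evaluation per step and storage for the checkpoints plus one segment.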
A second alternative is to discretize the forward pass with an algebraically reversible solver, in which case the backward trajectory is reconstructed exactly from the closed‑form inverse of each step and no autodiff tape need be stored. Asynchronous leapfrog and reversible Heun, the McCallum–Foster construction, and our subsequent work on Rex — which extends the McCallum–Foster construction to exponential integrators and to diffusion SDEs — all provide $\mathcal{O}(1)$ memory in solver length without incurring the discretization error of a continuous adjoint sweep.
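Algebraic reversibility can be illustrated with the reversible Heun scheme for a scalar ODE. This is a minimal sketch of the deterministic form of that scheme as it is usually stated, not of the McCallum–Foster construction or of Rex; the vector field `f`, the step size, and the initialization of the auxiliary state `y_hat` to the initial condition are assumptions for illustration.

```python
import math

def f(t, y):
    # Toy vector field (illustrative assumption); t is unused here.
    return math.sin(y) - 0.5 * y

def forward_step(t, h, y, y_hat):
    # Reversible Heun: advance the auxiliary state, then the main state.
    f0 = f(t, y_hat)
    y_hat_next = 2.0 * y - y_hat + h * f0
    y_next = y + 0.5 * h * (f0 + f(t + h, y_hat_next))
    return y_next, y_hat_next

def reverse_step(t_next, h, y_next, y_hat_next):
    # Closed-form inverse of forward_step; no stored trajectory is needed.
    f1 = f(t_next, y_hat_next)
    y_hat = 2.0 * y_next - y_hat_next - h * f1
    y = y_next - 0.5 * h * (f1 + f(t_next - h, y_hat))
    return y, y_hat

def round_trip(y0, h, n):
    # Integrate forward n steps, then reconstruct the initial state backwards.
    y, y_hat, t = y0, y0, 0.0
    for _ in range(n):
        y, y_hat = forward_step(t, h, y, y_hat)
        t += h
    for _ in range(n):
        y, y_hat = reverse_step(t, h, y, y_hat)
        t -= h
    return y

if __name__ == "__main__":
    print(round_trip(1.0, 0.05, 20))
```

The reconstructed initial state agrees with the original up to floating-point roundoff, whereas re-integrating a non-reversible scheme backwards would add a fresh truncation error at every step.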
The continuous adjoint remains useful when memory is the binding constraint, when the integrand is well‑conditioned, when the chosen solver is of sufficiently high order that the backward discretization error is negligible, and when a reversible solver is unavailable for the forward sampler in use. AdjointDEIS‑2M is the appropriate choice in that regime; AdjointDEIS‑1 is adequate for prototyping but will leave accuracy on the table.
Gradients of a loss with respect to inputs of a diffusion sampler are required by guided generation, latent optimization, and inverse‑problem solvers, but discretize‑then‑optimize incurs $\mathcal{O}(N)$ memory in the number of solver steps. The continuous adjoint method addresses the memory cost by solving a backward differential equation; until this work the construction was confined to deterministic samplers, and the backward equation was discretized with off‑the‑shelf solvers that inherited the stiffness of the forward process. We prove that the continuous adjoint of a diffusion SDE with state‑independent diffusion coefficient reduces to a deterministic ODE, which removes the obstruction to applying the continuous adjoint method to stochastic samplers. On the basis of this structural identity we construct AdjointDEIS, a family of exponential‑integrator solvers for the continuous adjoint equations of diffusion ODEs and SDEs, with $\mathcal{O}(1)$ memory in solver length and $k$‑th order convergence. As an application, AdjointDEIS‑guided DiM with the ODE sampler is the first representation‑based morphing attack to outperform the heuristic Webmorph baseline on all three modern FR systems we tested; the SDE variant exceeds Webmorph on AdaFace and ArcFace but not on ElasticFace.