IEEE TBIOM 2024 IJCB 2024 (Oral) IEEE S&P 2024

DiM: Diffusion Morphs
for Face Morphing Attacks

Face morphing attacks seek to deceive a face recognition system by presenting a single image which carries the biometric qualities of two different identities, posing a significant threat to biometric systems. We present Diffusion Morphs (DiM), a morphing attack which uses a diffusion-based architecture to improve the visual fidelity of the morphed image and its ability to represent characteristics from both identities. We measure the visual fidelity of the proposed attack via Fréchet Inception Distance (FID) and benchmark the vulnerability of state-of-the-art FR systems against DiM and existing landmark- and GAN-based attacks; additionally, we introduce a novel metric to measure the relative strength between different morphing attacks. Our follow-up work, Fast‑DiM, examines the ODE solvers driving the probability flow ODE and proposes a faster DiM pipeline which reduces network function evaluations by up to 85% in encoding with only 1.6% reduction in Mated Morph Presentation Match Rate (MMPMR), and halves sampling cost with at most 0.23% MMPMR loss.

Clarkson University
Identity a — bona fide face from FRLL dataset
$\boldsymbol x_0^{(a)}$ identity $a$ · FRLL 012
DiM morph of identities a and b γ0.50 paper frame
$\boldsymbol x_0^{(ab)} = \Phi\!\bigl(\,\mathrm{slerp}(\boldsymbol x_T^{(a)},\boldsymbol x_T^{(b)};\gamma),\;\mathrm{lerp}(\boldsymbol z_a,\boldsymbol z_b;\gamma)\bigr)$ DiM morph · $\gamma$ sweeps the latent path
Identity b — bona fide face from FRLL dataset
$\boldsymbol x_0^{(b)}$ identity $b$ · FRLL 101
drag to scrub · ← → keys
γ=0
0.25
0.5
0.75
1
Figure 1. A morphed face is a single image which registers as two different identities to a face recognition system. Here, $\boldsymbol x_0^{(a)}$ and $\boldsymbol x_0^{(b)}$ are two FRLL bona fide images. The center panel cross‑fades through the latent traversal $\boldsymbol z_{ab}=\mathrm{lerp}(\boldsymbol z_a,\boldsymbol z_b;\gamma)$, $\boldsymbol x_T^{(ab)}=\mathrm{slerp}(\boldsymbol x_T^{(a)},\boldsymbol x_T^{(b)};\gamma)$ as $\gamma$ sweeps $0\!\to\!1$, anchored at $\gamma=0$ on $\boldsymbol x_0^{(a)}$, at $\gamma{=}0.5$ on the actual DiM‑C morph (the central frame from Figure 1 of the TBIOM paper), and at $\gamma=1$ on $\boldsymbol x_0^{(b)}$. N.B., the intermediate frames are visualization‑only cross‑dissolves between the three real anchors; the diffusion autoencoder produces the genuine $\gamma=0.5$ morph at the center.
01 — The Problem

A single face that registers as two people

A face morphing attack produces a single image that a face recognition (FR) system accepts as a match for two different identities — breaking the bedrock biometric assumption that one face corresponds to one person. The threat is concrete: in many countries a passport photo is submitted by the applicant and enrolled directly, and a morph slipped through the application step quietly authorizes two different travelers to use that document.

Landmark‑based morphs. The original recipe aligns facial keypoints, warps the two faces to a shared geometry, and averages them pixel‑wise. Such morphs fool FR systems but produce visible artefactsGhosting around the hair and ears, doubled eyebrows, blurred pupils. Humans and trained detectors easily spot them. which are easily distinguishable from real faces.

GAN‑based morphs. StyleGAN2 and MIPGAN‑II invert each face into a learned latent code, average these codes, and decode the average back into image space. The output is more realistic; however, two well‑known caveats apply: the inversion is lossyBackground and hair on a GAN morph are often a different person's — the encoder cannot losslessly project an arbitrary face into the learned manifold., and there is a fundamental editability–fidelity tension in GAN inversions. In particular, the morph either resembles the bona fides or fools the recognizer — rarely both.

The three constraints

A useful morph must satisfy three things at once: (a) the FR system matches it to both source identities, (b) a human or learned detector cannot easily flag it, and (c) it does not collapse onto a low-fidelity GAN manifold. Every prior method gives up at least one.

Recall that diffusion models dominate GANs on photorealism and, moreover, admit a clean encoding through the probability‑flow ODE: any image is deterministically pushed to noise and back without a learned encoder bottleneck.A consequence of the bijective nature of ODEs. N.B., numerical schemes do not preserve this exactly unless they are algebraically reversible. We turn that property into a face morphing attack — DiM — which beats every prior method on visual fidelity and attack strength simultaneously.

02 — The Method

DiM: diffusion morphs

A diffusion autoencoder represents a face by two complementary latents: a semantic code $\boldsymbol z = E(\boldsymbol x_0)$, which is compact, learned, and captures identity; and a stochastic code $\boldsymbol x_T$, which is high‑dimensional and captures the rest — texture, hair strands, lighting. The reverse process $\boldsymbol x_T \!\to\! \boldsymbol x_0$ is conditioned on $\boldsymbol z$ and is fully deterministic: solving the probability‑flow ODE forward returns the noise code, and solving it backward returns the image.

For morphing, we encode both bona fides this way and produce a morphed pair $(\boldsymbol z_{ab}, \boldsymbol x_T^{ab})$ by interpolation; then we decode that interpolated pair exactly the same way we would decode any single face.No retraining, no inversion loss, no special "morph head" — the same network that synthesizes single faces produces the morphs. The result is a real‑looking image which lives on the data manifold, rather than outside of it.

The DiM architecture · one cycle = one morph

Two bona fides encode in parallel to twin latents $(\boldsymbol z_i, \boldsymbol x_T^{(i)})$. The semantic codes meet at a linear interpolation; the stochastic codes, on the other hand, meet at a spherical one. A conditional reverse PF‑ODE then decodes the interpolated pair into the morph.

0.0s
Figure 2. The DiM pipeline. Two bona fide images are encoded in parallel through a shared diffusion autoencoder — each producing a semantic code $\boldsymbol z_i$ (compact, identity‑bearing) and a stochastic code $\boldsymbol x_T^{(i)}$ (the noise to which the probability‑flow ODE maps the image). The two semantic codes meet at a linear interpolation; the two stochastic codes, on the other hand, meet at a spherical one. The conditional reverse PF‑ODE then decodes the interpolated pair into the morphed image $\boldsymbol x_0^{(ab)}$ — using exactly the same network which synthesizes single faces. Adapted from Blasingame & Liu, IEEE TBIOM 2024 (Fig. 1).

Notably, the choice of slerp over lerp on the noise code is not cosmetic. Lerp would shrink the norm of the interpolated noise away from $\sqrt{d}$, putting it off the noise manifold the model was trained on and visibly degrading the output. Slerp, on the other hand, keeps the magnitude on‑manifold.

Why this works: the probability-flow ODE is reversible

Recall that a diffusion model defines a forward stochastic process which gradually adds Gaussian noise to a sample. showed that it admits a deterministic counterpart — the probability‑flow ODE — whose trajectories preserve the same time‑marginal distributions whilst evolving a single state, rather than a random one. Symbolically:

Probability‑flow ODE (Song et al., 2021) $$\mathrm{d}\boldsymbol x \;=\; \Big[\boldsymbol f(\boldsymbol x, t) \;-\; \tfrac12 g(t)^2 \nabla_{\boldsymbol x}\!\log p_t(\boldsymbol x)\Big]\,\mathrm{d}t.$$

Because the probability‑flow ODE is an ODE rather than an SDE, the map $\boldsymbol x_0 \!\to\! \boldsymbol x_T$ is bijective: any image is deterministically encoded to noise and decoded back.Up to a discretization error; cf.. There is no learned encoder, no inversion loss, no manifold clipping — we recover a faithful encoding by running the same network in reverse. This is the structural advantage over GAN morphs, in which inversion is lossy by construction.

Pre‑morphing helps the stochastic code

The semantic code $\boldsymbol z$ is well‑behaved under linear interpolation, as it lives in a Euclidean embedding learned for precisely that. The stochastic code $\boldsymbol x_T$, however, encodes high‑frequency detail which depends sensitively on the input geometry. In particular, if the two source faces are misaligned, then $\boldsymbol x_T^{(a)}$ and $\boldsymbol x_T^{(b)}$ point in very different directions, and even slerp cannot paper over the mismatch.

To address this issue, DiM‑C pre‑morphs the two source images with a pixel‑wise average before encoding. This pulls $\boldsymbol x_T^{(a)}$ and $\boldsymbol x_T^{(b)}$ closer together; consequently, the interpolated noise is better behaved — fewer artefacts, lower FID.DiM‑A (no pre‑morph) remains the strongest attack; DiM‑C, on the other hand, produces the most natural‑looking morphs. The two variants sit at different points on an attack‑strength / visual‑fidelity trade‑off.

03 — Visual fidelity & attack strength

DiM is the most photorealistic and the most effective attack

Two metrics govern face morphing. FID measures how close the morph distribution is to the real face distribution — lower is more photorealistic. MMPMR (Mated Morph Presentation Match Rate) measures the fraction of morphs the FR system accepts as both source identities — higher is a stronger attack. Most prior methods trade one against the other; DiM, notably, does not.

FID — lower is more photorealistic

AttackFRLLFRGCFERET
StyleGAN245.1986.4141.91
FaceMorpher (landmark)91.9788.1479.58
OpenCV (landmark)85.71100.0291.94
MIPGAN-II66.41115.9670.88
DiM (ours)42.6364.1650.45

Table 1. DiM attains the lowest FID on FRLL and FRGC, and is competitive with StyleGAN2 on FERET. Landmark methods lag far behind on every dataset; notably, MIPGAN‑II's FID is much higher than DiM's despite being the strongest GAN attack on FR systems — a hint at the editability–fidelity trade‑off.

MMPMR — higher is a stronger attack

The chart below reports MMPMR at $\mathrm{FMR}=0.1\%$ for five attacks across three FR systems and three datasets; toggle the FR system and dataset to walk the matrix. DiM reaches 88.09% against ArcFace on FRLL — the next‑best attack on that cell is MIPGAN‑II at 56.52%, and the strongest landmark attack tops out at 47.70%. Notably, the most accurate FR system is also the most vulnerable.

FR system
Dataset
MMPMR (%) mated morph presentation match rate — higher = stronger attack

Figure 3. MMPMR @ $\mathrm{FMR}=0.1\%$ across five attacks. DiM (deep blue) leads or ties the strongest GAN attack in nearly every cell; in particular, the margin grows on the modern margin‑based recognizer (ArcFace). Numbers from DiM Tables III & IV (Blasingame & Liu, TBIOM 2024).

Two FRLL identity pairs through six morphing pipelines. Landmark and GAN methods leave characteristic artefacts; DiM, notably, does not.

Identity A
FaceMorpher
landmark
OpenCV
landmark
DiM
ours
StyleGAN2
GAN
MIPGAN-II
GAN
Identity B
Pair 1 — FRLL subjects 043 & 114
Bona fide identity A (subject 043)Bona fide
FaceMorpher morph of subjects 043 and 114FaceMorpher
OpenCV morph of subjects 043 and 114OpenCV
DiM morph of subjects 043 and 114DiM
StyleGAN2 morph of subjects 043 and 114StyleGAN2
MIPGAN-II morph of subjects 043 and 114MIPGAN-II
Bona fide identity B (subject 114)Bona fide
Pair 2 — FRLL subjects 128 & 105
Bona fide identity A (subject 128)Bona fide
FaceMorpher morph of subjects 128 and 105FaceMorpher
OpenCV morph of subjects 128 and 105OpenCV
DiM morph of subjects 128 and 105DiM
StyleGAN2 morph of subjects 128 and 105StyleGAN2
MIPGAN-II morph of subjects 128 and 105MIPGAN-II
Bona fide identity B (subject 105)Bona fide
Figure 4. Two FRLL identity pairs through six morphing pipelines. Landmark methods (FaceMorpher, OpenCV) leave ghosting and pixel-average smearing around eyes and jawlines. GAN methods (StyleGAN2, MIPGAN-II) produce sharper faces but introduce identity-bleed artefacts in skin and hair. DiM's morphs sit in-distribution with the diffusion autoencoder's training data, and the same smoothness that drives the MMPMR numbers leaves no obvious tell. Images from Blasingame & Liu, IEEE TBIOM 2024 (Fig. 6, 7).

Aggregating across the full 3×3 matrix of FR systems and datasets, DiM's geometric‑mean MMPMR (19.80) beats MIPGAN‑II (14.16) by 40% and StyleGAN2 (2.15) by an order of magnitude. In particular, the gap is widest on ArcFace — the modern margin‑based recognizer which landmark‑era attacks were never tuned against. The same tight angular margin which makes ArcFace confident on bona fides is what renders it vulnerable here: the average of two identity embeddings lands inside both decision boundaries, and a high‑fidelity diffusion decode keeps the morph close to both source embeddings in feature space.

Five axes, one polygon — the attack‑profile radar

A morphing attack is multi‑objective: it must fool every recognizer, look photoreal, and resist a learned detector. The radar below collapses these axes into one shape per attack. We observe that DiM's polygon is the largest on FRLL and remains the largest or near‑largest as the dataset changes — toggle between the three to see which axes each attack wins.

Dataset
ArcFace MMPMR VGGFace2 MMPMR FaceNet MMPMR Photorealism 1 − FID/120 Evasion RSM tournament
StyleGAN2 FaceMorpher OpenCV MIPGAN-II DiM (ours)
Figure 5. Five axes, normalized to $[0,1]$. Larger polygon = stronger attack on more dimensions. MMPMR axes are actual/100 (DiM TBIOM 2024 Tables III & IV, ProdAvg @ FMR=0.1%); photorealism is $1-\mathrm{FID}/120$ from Table II; evasion is an RSM tournament summary — DiM has positive RSM against every other attack on every dataset. The mauve fill marks DiM's region.
04 — Detection

Detectors do not generalize to diffusion morphs

A morph attack detector (MAD) flags morphed images at presentation time. The deployment question, however, is not whether a detector trained on attack $\alpha$ can catch attack $\alpha$ — that is the easy case — but rather whether a detector trained on known attacks can catch an attack it has never seen. We test exactly this via leave‑one‑out cross‑validation on a SE‑ResNeXt101 detector.

DiM is the hardest unseen attack to detect

When the detector is trained on every attack except DiM and tested on DiM, performance collapses — far more than for any other held‑out attack. In particular, landmark attacks (FaceMorpher, OpenCV) transfer to one another easily because they share a pipeline; moreover, MIPGAN‑II's hold‑out is largely covered by StyleGAN2. DiM, however, occupies a different region of attack space.

Leave-one-out detector ablation · lower is harder to detect

SE-ResNeXt101 MAD trained on four attacks, tested on the fifth. Each dot is one dataset.

FRLL FRGC FERET
undetectable 0 25 50 75 100 Detector accuracy on held-out attack (%) FaceMorpher landmark 76.4 MIPGAN-II GAN OpenCV landmark StyleGAN2 GAN 87.9 DiM diffusion · ours 13.96 72.73 75.89
Figure 6. Each row holds out one attack from training and tests the SE‑ResNeXt101 MAD against it. Every other attack stays ≥87% even as a novel attack, as the four siblings cover its artefact distribution; in particular, DiM collapses to 13.96% on FRLL — the detector trained on landmark and GAN morphs has effectively no signal on diffusion morphs. Data from DiM Table V (Blasingame & Liu, IEEE TBIOM 2024).

A new metric: relative strength

Pairwise comparison of morph attacks is awkward, as each attack produces a different number of morphs and FR vulnerabilities differ across systems. To address this, we propose a cleaner alternative — the relative strength metric (RSM) — defined for two attacks $\alpha,\beta$ as

Relative strength (Eq. 12) $$\Delta(\alpha\,\Vert\,\beta) \;=\; \log\!\frac{T(\alpha,\beta)}{T(\beta,\alpha)}, \quad T(\alpha,\beta) = \mathbb P\!\big[f^\alpha(\boldsymbol X^\beta)=1 \,\big|\, f^\alpha(\boldsymbol X^\alpha)=1\big].$$

Here, $T(\alpha,\beta)$ is the probability that a detector trained on $\alpha$ catches a $\beta$‑morph, conditioned on it correctly catching its own training attack.The conditioning on $f^\alpha(\boldsymbol X^\alpha)=1$ is what makes RSM a fair pairwise comparison: it normalizes away the fact that some attacks are easier to detect on their own training distribution. A positive RSM means $\alpha$ is the stronger attack — harder to catch, and easier to use as training data. We observe that DiM has positive RSM against every other attack across all three datasets. It is, simultaneously, the hardest attack to detect and the one which trains the most generalizable detector.

Deployment implication

A morph detector trained only on landmark and GAN attacks fails on DiM — not by a small margin, but catastrophically. The converse, notably, also holds: a detector trained with DiM catches the older attacks well. As such, any deployed MAD whose training set predates diffusion morphs should be retrained.

05 — Fast‑DiM

The same attack at less than half the compute

DiM costs 350 NFEs per morph — 250 forward to encode, 100 reverse to sample. In the follow‑up paper we ask which of those NFEs are doing real work. Recall that DiM uses a first‑order Euler‑style solver everywhere, including in places where higher‑order solvers from the modern PF‑ODE literature were available. Notably, swapping in DPM‑Solver++ 2M for sampling and a principled DDIM forward solver for encoding cuts total NFEs from 350 to 150, with at most 1.6% loss in attack strength.

Two orthogonal NFE reductions

(1) Faster sampling. DiM uses 100‑step DDIM; replacing it with the second‑order multistep DPM‑Solver++ 2M drops the sampler to 50 steps with at most 0.23% MMPMR loss. (2) Faster encoding. The original DiffAE "stochastic encoder" relies on a non‑standard substitution which does not correspond to a clean discretization of the forward PF‑ODE. We derive principled DDIM and DPM++ 2M solvers for the forward ODE direction; both reach the same reconstruction quality in far fewer steps.

VariantSampling solverEncoding solver$N$$N_F$Total NFE
DiM (original)DDIMDiffAE100250350
Fast‑DiMDPM‑Solver++ 2MDiffAE50250300
Fast‑DiM‑odeDPM‑Solver++ 2MDDIM (forward)50100150

Table 2. The full Fast‑DiM‑ode pipeline is 57% cheaper than DiM — matching MIPGAN‑II's NFE budget — whilst remaining within 1.6% MMPMR of DiM on every modern FR system tested.

Same attack strength — on AdaFace, ArcFace, ElasticFace

AttackNFEAdaFaceArcFaceElasticFace
FaceMorpher89.7887.7389.57
Webmorph97.9696.9398.36
OpenCV94.4892.4394.27
MIPGAN‑II15070.5572.1965.24
DiM‑A (original)35092.2390.1893.05
Fast‑DiM30092.0290.1893.05
Fast‑DiM‑ode15091.8288.7591.21

Table 3. SYN‑MAD 2022 (FRLL pairs), MMPMR @ $\mathrm{FMR}=0.1\%$. At NFE=150, Fast‑DiM‑ode matches MIPGAN‑II's compute budget whilst being ~20 percentage points stronger on every modern FR system tested. Webmorph leads in absolute MMPMR; however, it is a landmark attack with the same artefact statistics which make landmark morphs easy to flag. Notably, the DiM family is the only one in this table which pairs high MMPMR with low detectability (Section 04).

Concretely, at the same NFE budget as MIPGAN‑II, Fast‑DiM‑ode fools ArcFace 88.75% of the time against MIPGAN‑II's 72.19%. The compute objection to diffusion morphs no longer holds.

The compute–quality frontier, on one chart

Plotting attack strength on three modern recognizers together with compute efficiency on a single radar makes the trade‑off explicit. In particular, Fast‑DiM‑ode and MIPGAN‑II share the compute axis at NFE=150; however, Fast‑DiM‑ode dominates on every attack‑strength axis. Original DiM‑A's polygon collapses to zero on the compute axis — the very cost the Fast‑DiM family removes.

AdaFace MMPMR attack on AdaFace ArcFace MMPMR ElasticFace MMPMR attack on ElasticFace Compute 1 − NFE/350
MIPGAN-II (NFE=150) DiM-A (NFE=350) Fast-DiM (NFE=300) Fast-DiM-ode (NFE=150)
Figure 7. SYN-MAD 2022 (FRLL pairs), MMPMR @ FMR=0.1%, normalized to $[0,1]$. Compute axis is $1-\mathrm{NFE}/350$; original DiM‑A sits at the origin of that axis. Fast‑DiM‑ode (mauve fill) matches MIPGAN-II's compute budget while staying on DiM‑A's attack-strength polygon.
06 — Looking forward

A new threat model for face morphing

Before DiM, every morphing attack lived on one of two axes: landmark‑based methods reached high FR vulnerability but left visible ghosting; GAN‑based methods produced cleaner images but lost identity through inversion. Morph attack detectors generally detected these ghosting artefacts in landmark-based morphs. Diffusion morphs exploit this vulnerability, by keeping the strength of landmark attacks and the photorealism of GAN attacks. Thus they occupy a region of attack space that detectors trained on the prior two families do not reach.

Key contributions

References