Why deepfake detectors fail on new generators

The most-cited deepfake-detection result of the last six years is arguably this one: with careful preprocessing and augmentation, a standard image classifier trained on outputs from a single CNN generator (ProGAN) generalized surprisingly well to ten unseen architectures, including StyleGAN2.^[1] The 2020 paper by Wang, Wang, Zhang, Owens, and Efros — “CNN-Generated Images Are Surprisingly Easy to Spot... for Now” — is the source of the cautious optimism that runs through much of the popular coverage of deepfake detection.

The phrase to dwell on is “for now.” Five years later, generators are diffusion-dominant. Detectors trained on GAN-family outputs frequently fail on diffusion-family outputs and vice versa, and accuracy drops further on compressed, recompressed, or screen-recorded media. This is the cross-generator generalization gap, and it is the discipline's central open problem.
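
One way to put a number on that gap is sketched below, under stated assumptions. A detector trained on one generator scores held-out real images and fakes from several generators. You compute ROC AUC per generator, then subtract the mean held-out AUC from the in-distribution AUC. `generalization_gap` is a hypothetical helper written for this sketch, and scores are assumed to mean "higher is more likely fake". The toy scores are invented for illustration and do not come from any published evaluation.

```python
from itertools import product

def auc(real_scores, fake_scores):
    """ROC AUC as the probability that a random fake outscores a random real
    image, counting ties as half (the Mann-Whitney formulation). O(n*m)."""
    wins = 0.0
    for r, f in product(real_scores, fake_scores):
        if f > r:
            wins += 1.0
        elif f == r:
            wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))

def generalization_gap(real_scores, fakes_by_generator, train_generator):
    """Hypothetical helper: in-distribution AUC minus the mean AUC over all
    generators the detector was not trained on."""
    in_dist = auc(real_scores, fakes_by_generator[train_generator])
    held_out = [
        auc(real_scores, scores)
        for name, scores in fakes_by_generator.items()
        if name != train_generator
    ]
    return in_dist - sum(held_out) / len(held_out)

# Invented toy scores: clean separation on the training family,
# barely better than chance on a diffusion-family generator it never saw.
real = [0.1, 0.2, 0.3, 0.4]
fakes = {
    "gan_train": [0.7, 0.8, 0.9, 0.95],
    "diffusion_unseen": [0.1, 0.2, 0.35, 0.5],
}
print(auc(real, fakes["gan_train"]))                 # 1.0
print(auc(real, fakes["diffusion_unseen"]))          # 0.5625
print(generalization_gap(real, fakes, "gan_train"))  # 0.4375
```

A headline in-distribution AUC of 1.0 says nothing about the second number. That second number is the one a deployment against new generators actually faces.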

Why it happens

Frank et al. (ICML 2020) gave the field its cleanest mechanistic account: GAN-generated images carry severe artifacts in the frequency domain, caused by upsampling operations common across architectures.^[2] A detector that learns those artifacts is implicitly learning the upsampling fingerprint of one family of decoders. Diffusion models do not share that fingerprint exactly; their decoders leak differently. So what the detector has overfit to transfers within a generator family, but not across families.
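
The following toy 1D illustration shows why upsampling can leave a spectral fingerprint. It assumes the zero-insertion step that a stride-2 transposed convolution performs before applying its learned kernel. Inserting zeros makes the upper half of the spectrum an exact copy of the lower half. A learned filter applied afterward can attenuate that copy, and any residue is the kind of upsampling artifact Frank et al. describe. This is not their method, which analyzes 2D DCT spectra of images. `replica_score` is a hypothetical measure written only for this sketch.

```python
import cmath
import random

def dft(x):
    """Naive O(n^2) discrete Fourier transform; fine for short toy signals."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def zero_insert_upsample(x, factor=2):
    """Insert (factor - 1) zeros after every sample, as a strided transposed
    convolution does before applying its kernel."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0.0] * (factor - 1))
    return out

def replica_score(signal):
    """Hypothetical measure: mean |difference| between the lower and upper
    halves of the magnitude spectrum. Near zero means the upper half repeats
    the lower half, i.e. a spectral replica."""
    mags = [abs(c) for c in dft(signal)]
    half = len(mags) // 2
    return sum(abs(mags[k] - mags[k + half]) for k in range(half)) / half

random.seed(0)
native = [random.gauss(0.0, 1.0) for _ in range(32)]  # 32 "native" samples
upsampled = zero_insert_upsample(native[:16])          # 16 samples -> 32
print(replica_score(native))     # clearly nonzero
print(replica_score(upsampled))  # ~0, up to floating-point error
```

A detector keyed to this kind of replica is keyed to a decoder design. A generator whose decoder upsamples differently simply does not produce it.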

It gets worse on the audio side. ASVspoof 2021, the canonical anti-spoofing benchmark, reports that countermeasures developed for its new deepfake-speech track “lack generalization across different source datasets,” a failure that appears on top of the codec distortions the track was specifically designed to test.^[3] It is the audio version of the same gap.
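
One common way to pursue codec robustness is training-time augmentation. The following is a minimal, dependency-free sketch. The ASVspoof deepfake track involves real lossy media codecs. Here, G.711-style μ-law companding with 8-bit quantization stands in for them purely as an illustration. `mu_law_round_trip`, `codec_augment`, and the probability `p` are hypothetical names and values, not challenge settings.

```python
import math
import random

MU = 255  # mu-law parameter used by G.711

def mu_law_round_trip(samples, levels=256):
    """Stand-in codec: compand floats in [-1, 1] with mu-law, quantize to
    `levels` steps, then expand back. The result carries quantization noise
    that grows with amplitude, loosely like a lossy telephony codec."""
    step = 2.0 / (levels - 1)
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))
        # Compand: y = sign(x) * ln(1 + mu|x|) / ln(1 + mu), y in [-1, 1]
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        # Quantize y onto evenly spaced levels in [-1, 1]
        yq = round((y + 1.0) / step) * step - 1.0
        # Expand: x = sign(y) * ((1 + mu)^|y| - 1) / mu
        out.append(math.copysign(((1 + MU) ** abs(yq) - 1) / MU, yq))
    return out

def codec_augment(samples, p=0.5, rng=random):
    """Hypothetical augmentation: with probability p return a codec-degraded
    copy, otherwise an unchanged copy."""
    return mu_law_round_trip(samples) if rng.random() < p else list(samples)
```

Augmentation like this targets the channel, not the synthesizer. It can make a countermeasure tolerate compression, but it does nothing for an unseen synthesis system, which is the source-dataset failure quoted above.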

What about watermarking?

If detection always trails generation, the natural response is to sign the output instead. Two strong proposals exist. The first is Fernandez et al.'s Stable Signature (ICCV 2023), which fine-tunes the latent decoder of a latent-diffusion model so that every image it generates carries a recoverable signature.^[4] The second is Google DeepMind's SynthID-Image, which has been used to watermark over ten billion images and video frames at internet scale.^[5]
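
Schemes that embed a k-bit key usually decide "watermarked or not" with a statistical test on the extracted bits. Here is a minimal sketch of that decision. It assumes an extractor, not shown, that returns k bits per image. Under the null hypothesis that an image is unwatermarked, each extracted bit matches the key with probability 1/2. The number of matches is therefore Binomial(k, 1/2), and its upper tail is the false-positive rate. As we understand it, Stable Signature uses a test of this kind with a 48-bit key. The function names and the `max_fpr` threshold here are illustrative, not its API.

```python
from math import comb

def match_count(extracted, key):
    """Number of positions where the extracted bits agree with the key."""
    if len(extracted) != len(key):
        raise ValueError("bit strings must have equal length")
    return sum(e == k for e, k in zip(extracted, key))

def false_positive_p(matches, k):
    """P(M >= matches) for M ~ Binomial(k, 1/2): the chance that an
    unwatermarked image agrees with the key on at least this many bits."""
    return sum(comb(k, i) for i in range(matches, k + 1)) / 2 ** k

def is_watermarked(extracted, key, max_fpr=1e-6):
    """Flag the image only if its match count is this unlikely by chance."""
    return false_positive_p(match_count(extracted, key), len(key)) <= max_fpr
```

The threshold is what makes this a cooperative signal. An image whose bits were scrubbed toward chance (around 24 of 48 matches) is simply "not flagged". That is different from "shown to be real".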

Both work — for cooperating actors. Saberi et al. (arXiv 2023) showed a fundamental trade-off between watermark evasion error and spoofing error for low-perturbation methods, and demonstrated that high-perturbation methods are vulnerable to model-substitution attacks.^[6] Watermarking is a useful production signal; it is not a closed defense against motivated adversaries who don't sign their work.

What the field actually does about it

Three threads worth following:

Train across families. Augment training data with samples from multiple generators, including diffusion-family outputs even when the deployment target is GAN-family (and vice versa). This is the operationally cheapest mitigation.
Lean on biosignal and semantic features that are not architecture-specific. PPG-derived heart-rate signals (Ciftci et al., IEEE TPAMI 2020) and inter-eye specular-highlight symmetry (Wang, Tondi, Barni, Frontiers 2022) are robust to a generator change in a way that frequency residuals are not.^[7][8]
Stack provenance underneath detection. Cooperating actors sign with C2PA (see /provenance); uncooperating actors get the detection stack. The two play different positions.
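
The biosignal thread can be made concrete with the simplest remote-PPG estimate. Average the green channel over a skin region in each frame, then find the dominant frequency in a plausible heart-rate band. This is a minimal sketch that assumes the per-frame green means are already extracted. It is not FakeCatcher, which builds much richer spatial and temporal PPG features. The band limits (0.7 to 4 Hz, i.e. 42 to 240 bpm) and the name `dominant_bpm` are our assumptions.

```python
import cmath
import math

def dominant_bpm(green_means, fps, low_hz=0.7, high_hz=4.0):
    """Estimate heart rate (beats per minute) as the strongest DFT bin inside
    [low_hz, high_hz] of a per-frame mean-green-intensity trace."""
    n = len(green_means)
    mean = sum(green_means) / n
    x = [v - mean for v in green_means]  # remove the DC offset
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):       # positive frequencies only
        freq = k * fps / n               # bin k -> Hz
        if not (low_hz <= freq <= high_hz):
            continue
        coef = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        power = abs(coef) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        raise ValueError("trace too short to resolve the heart-rate band")
    return best_k * fps / n * 60.0

# Synthetic 10 s trace at 30 fps with a 1.2 Hz pulse riding on a constant level.
trace = [100 + 0.5 * math.sin(2 * math.pi * 1.2 * t / 30) for t in range(300)]
print(dominant_bpm(trace, fps=30))  # 72 bpm, up to floating-point rounding
```

A single peak is a weak test on its own. FakeCatcher's premise is that synthesis disrupts the consistency of these signals across the face. That is why it compares many regions rather than trusting one estimate.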

The honest summary is that no single approach has closed the gap. The 2020 cautious optimism was correct for the GAN era. We are not in the GAN era anymore. Read every published in-distribution AUC with this in mind — and read /research-lab for the longer version.