How DIWA-Net Detects Deepfakes
Modern deepfakes manipulate what you see and what you hear. DIWA-Net examines both signals together, so a forgery has to fool two independent detectors at once.
Single-modality detectors fail
A video-only model can be fooled by a clean face swap; an audio-only model can be fooled by a convincing voice clone. Looking at one stream in isolation leaves a blind spot the other could have caught.
Five stages, two frozen experts
Adapt a little, gain a lot
We adapt only 3.5% of the parameters yet reach 99.5% in-domain AUC on MAVOS-DD, better than full fine-tuning.
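To make the "adapted fraction" concrete, here is a minimal sketch of how such a percentage can be computed: trainable parameters divided by all parameters, with the frozen experts counted in the denominator. The module names and parameter counts below are hypothetical, chosen only so the result lands near 3.5%; they are not DIWA-Net's actual sizes.

```python
def trainable_fraction(modules):
    """Return trainable parameters / total parameters.

    `modules` maps a module name to (parameter_count, is_trainable).
    """
    total = sum(count for count, _ in modules.values())
    trainable = sum(count for count, is_trainable in modules.values() if is_trainable)
    if total == 0:
        raise ValueError("no parameters to count")
    return trainable / total


# Hypothetical counts for illustration only (not DIWA-Net's real sizes).
modules = {
    "video_expert": (86_000_000, False),        # frozen
    "audio_expert": (94_000_000, False),        # frozen
    "adapters_and_fusion": (6_500_000, True),   # the only part that is trained
}

fraction = trainable_fraction(modules)
print(f"trainable share: {fraction:.1%}")
```

In a deep-learning framework the same ratio would typically be read off the model itself by summing the sizes of parameters that have gradients enabled, rather than from a hand-written table.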
8 languages · 7 generation methods
6 training languages plus 2 open-set hold-outs (German, Hindi), evaluated on MAVOS-DD including unseen manipulation methods.
Research evaluation notice
DIWA-Net is a research system. Its strongest results, like those of other published detectors (TALL-Swin, MRDF, AVFF, and related multimodal baselines), come from rigorous evaluation on a fixed benchmark, not from ad-hoc uploads from the open web.
On the MAVOS-DD multilingual audio–video benchmark (60,364 real and synthetic videos; DIWA-Net trained on a 12,000-video subset), our checkpoint achieves:
- 99.5% AUC · In-Domain
- 95.9% AUC · Open-Set Full
- 97.0% AUC · OS-Method
These figures apply to MAVOS-DD evaluation only. Videos you upload may differ from the benchmark in language, compression, lighting, manipulation tool, or data distribution, and can be misclassified. Treat every verdict as a research signal, not a forensic or legal guarantee. Always corroborate important claims through multiple sources.
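For readers unfamiliar with the metric, AUC is the probability that a randomly chosen fake video receives a higher "fake" score than a randomly chosen real one, with ties counted as half. The sketch below computes it by direct pairwise comparison on a toy set of labels and scores; the data is made up for illustration, and this is not the evaluation code behind the figures above.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC.

    labels: 1 for fake, 0 for real.
    scores: higher means "more likely fake".
    """
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    if not fake or not real:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for f in fake:
        for r in real:
            if f > r:
                wins += 1.0   # fake correctly ranked above real
            elif f == r:
                wins += 0.5   # tie counts as half
    return wins / (len(fake) * len(real))


# Toy data: three fakes, three reals; one fake (0.4) scores below one real (0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Because AUC depends only on how fakes and reals are ranked against each other within the evaluated set, a high value on one benchmark says nothing about how scores will rank on videos drawn from a different distribution.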