FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
Gaojian Wang1,2 Feng Lin1,2 Tong Wu1,2 Zhenguang Liu1,2 Zhongjie Ba1,2 Kui Ren1,2
1State Key Laboratory of Blockchain and Data Security, Zhejiang University
2Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
This is the implementation of FSFM-3C, a self-supervised pre-training framework to learn a transferable facial representation that boosts face security tasks.
- 2024-12: The demo of visualizing different facial masking strategies that are introduced in FSFM-3C for MIM is available at
- 2024-12: The online detectors (based on simply fine-tuned models of the paper implementation) are available at
- 2024-12: The pre-trained/fine-tuned models and pre-training/fine-tuning logs of the paper implementation are available at
- 2024-12: All code, including data preprocessing, pre-training, fine-tuning, and testing, is released on this page
- 2024-12: Our paper is available at
- 🔧 Installation
- ⏳ FSFM Pre-training
- ⚡ Fine-tuning FSFM Pre-trained ViTs for Downstream Tasks
- Citation
Clone this repository, create a conda environment, and activate it with the following commands:
git clone https://github.com/wolo-wolo/FSFM.git
cd FSFM/
conda create -n fsfm3c python=3.9
conda activate fsfm3c
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117 # run this first. (our exp implementation)
pip install -r requirements.txt
The implementation of pre-training FSFM-3C ViT models from unlabeled facial images.
⬇️ Dataset Preparation
💡 FSFM can be readily pre-trained on various facial (images or videos) datasets and their combinations without annotations, learning a general facial representation that transcends specific domains or tasks. Thus, it can benefit from the larger scale and greater diversity of unlabeled faces widely available in the open world.
For paper implementation, we have pre-trained our model on the following datasets. Download these datasets optionally and refer to Folder Structure.
- VGGFace2 for main experiments (raw data: images)
- FaceForensics++ for our ablation studies (raw data: videos)
- YoutubeFace for data scaling testing (raw data: frames)
⬇️ Toolkit Preparation
We use DLIB for face detection and the FACER toolkit for face parsing. Download the FACER toolkits in advance.
cd /datasets/pretrain/preprocess/tools
git clone https://github.com/FacePerceiver/facer
📁 Folder Structure
You can organize the Folder structure in
/datasets/pretrain/preprocess/config/default.py
The following is the default Folder Structure. The paths in each directory are described in the comments.
datasets/
├── data/
│ ├── VGG-Face2/ # VGGFace2
│ │ ├── train/ # download data
│ │ ├── test/ # download data
│ │ └── facial_images/ # facial images (train + test) (automatically created by pretrain/preprocess/dataset_preprocess.py)
│ │
│ ├── FaceForensics/ # FF++
│ │ ├── dataset/ # download splits
│ │ │ └── splits/
│ │ │ ├── train.json
│ │ │ ├── val.json
│ │ │ └── test.json
│ │ ├── original_sequences/ # download data
│ │ │ └── youtube/ # real faces (we use c23 version) for pre-training
│ │ │ └── c23/
│ │ ├── manipulated_sequences/ # download data, fake faces for deepfake detection, not used in pre-training
│ │ └── facial_images_split/ # facial images (automatically created by pretrain/preprocess/dataset_preprocess.py)
│ │
│ └── YoutubeFace/ # YoutubeFace
│ ├── frame_images_DB/ # download data
│ └── facial_images/ # facial images (automatically created by pretrain/preprocess/dataset_preprocess.py)
│
├── pretrain/preprocess/
│ ├── config/
│ │ ├── __init__.py
│ │ └── default.py # define folder structure
│ ├── tools/
│ │ ├── facer/ # download FACER toolkit to here
│ │ └── util.py # Frame and Face Extraction functions
│ ├── dataset_preprocess.py # for face extraction from images or videos
│ └── face_parse.py # for face parsing to make pre-training data
│
└── pretrain_datasets/ # final pre-training data (automatically created by face_parse.py)
├── FaceForensics_youtube/ # FF++_o data for pre-training
├── YoutubeFace/ # YoutubeFace (YTF) data for pre-training
└── VGGFace2/ # VGGFace2 (VF2) data for pre-training
🗂️ Make Pre-training Dataset
1) 🦱 Face Extraction
We use DLIB for face detection with a 30% additional cropping size. Run /datasets/pretrain/preprocess/dataset_preprocess.py to extract faces from images or videos:
cd /datasets/pretrain/preprocess
python dataset_preprocess.py --dataset [VF2, FF++_o, YTF]
The facial images from each dataset:
- VF2 : ~3M facial images, VGGFace2, including the full train and test subsets
- YTF : ~600K facial images, YouTubeFace, including 3,425 videos from YouTube, already broken into frames
- FF++_o : ~100K facial images for 128_frames per video, ~430K for all_frames per video, from the original YouTube subset of FaceForensics++ (FF++) c23 (HQ) version,
includes 720 training and 140 validation videos
(the ~100K set is used for some of our ablations due to limited computational resources)
You can specify FF_compression and FF_num_frames in /datasets/pretrain/preprocess/config/default.py, as an example of preprocessing a facial video dataset.
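As a rough sanity check on the FF++_o figure above, the frame budget can be worked out from the video counts. This sketch assumes exactly 128 frames are sampled from every video and that each frame yields one face, so it gives an upper bound rather than the exact count produced by dataset_preprocess.py.

```python
# Upper bound on FF++_o facial images with the 128_frames setting.
# Assumption: every video contributes exactly 128 frames with one detected face each.
train_videos = 720
val_videos = 140
frames_per_video = 128

max_faces = (train_videos + val_videos) * frames_per_video
print(max_faces)  # 110080, i.e. roughly 100K facial images
```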
2) 🧑 Face Parsing
We use the FACER toolkit for face parsing.
Cropped faces are resized to 224×224, and parsing maps are saved as .npy files, enabling efficient facial masking during pre-training.
Run /datasets/pretrain/preprocess/face_parse.py for processing:
python face_parse.py --dataset [FF++_o, YTF, VF2]
# or CUDA_VISIBLE_DEVICES=0 python face_parse.py --dataset [FF++_o, YTF, VF2]
The resulting /datasets/pretrain_datasets/ folder structure should be:
pretrain_datasets/
└── specific_dataset
├── images (3*224*224 .png)
├── parsing_maps (1*224*224 .npy)
└── vis_parsing_maps (optional for visualization)
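The layout above can be consumed by pairing each image with its parsing map. The following is a minimal sketch, not the repository's data loader: it assumes both folders are flat and that a parsing map shares the file stem of its image (e.g. images/0001.png and parsing_maps/0001.npy), which the README does not state. The helper name pair_images_with_maps is hypothetical.

```python
from pathlib import Path


def pair_images_with_maps(dataset_dir):
    """Return (image_path, parsing_map_path) pairs for one pre-training dataset.

    Assumption (hypothetical): maps and images share a file stem,
    e.g. images/0001.png <-> parsing_maps/0001.npy.
    """
    root = Path(dataset_dir)
    pairs = []
    for image_path in sorted((root / "images").glob("*.png")):
        map_path = root / "parsing_maps" / (image_path.stem + ".npy")
        # Keep only images whose parsing map exists on disk.
        if map_path.is_file():
            pairs.append((image_path, map_path))
    return pairs
```

Images without a matching map are skipped silently here; a stricter variant could raise instead, so that incomplete face parsing is caught before pre-training starts.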
🔄 Pre-training from Scratch
cd ./fsfm-3c/pretrain/ and run the script main_pretrain.py to pre-train the model.
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=4 main_pretrain.py \
--batch_size 256 \
--accum_iter 4 \
--epochs 400 \
--model fsfm_vit_base_patch16 \
--input_size 224 \
--mask_ratio 0.75 \
--norm_pix_loss \
--weight_sfr 0.007 \
--weight_cl 0.1 \
--cl_loss SimSiam \
--weight_decay 0.05 \
--blr 1.5e-4 \
--warmup_epochs 40 \
--pretrain_data_path ../../datasets/pretrain_datasets/'{VGG-Face2, YoutubeFace, FaceForensics_youtube/128_frames/c23}' \
--output_dir 'path to save pretrained model ckpt and logs' # default to: /fsfm-3c/pretrain/checkpoint/$USR/experiments_pretrain/$PID$
- We use --accum_iter to maintain the effective batch size, which is 256 batch_size (per gpu) * 1 nodes * 4 (gpus per node) * 4 accum_iter = 4096.
- blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective batch size / 256.
- Here we use --norm_pix_loss as the target for better representation learning. To train a baseline model (e.g., for visualization), use pixel-based construction and turn off --norm_pix_loss.
- In --output_dir, we save the weights of the online network and target network separately to checkpoint-$epoch$.pth (for downstream tasks) and checkpoint-te-$epoch$.pth (for resuming pre-training), and also save the weights with the minimum pre-training loss to checkpoint-min_pretrain_loss.pth and checkpoint-te-min_pretrain_loss.pth, respectively.
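The batch-size and learning-rate arithmetic in the notes above can be checked with a few lines of Python. This is only a worked example of the stated linear scaling rule; the helper names are hypothetical and are not part of main_pretrain.py.

```python
def effective_batch_size(batch_size, nodes, gpus_per_node, accum_iter=1):
    """Per-GPU batch size scaled by nodes, GPUs per node, and gradient accumulation."""
    return batch_size * nodes * gpus_per_node * accum_iter


def actual_lr(blr, eff_batch_size):
    """Linear scaling rule: lr = blr * effective batch size / 256."""
    return blr * eff_batch_size / 256


# Values from the pre-training command above.
eff = effective_batch_size(batch_size=256, nodes=1, gpus_per_node=4, accum_iter=4)
lr = actual_lr(blr=1.5e-4, eff_batch_size=eff)
print(eff, lr)  # 4096 and approximately 2.4e-3
```

If fewer GPUs are available, raising --accum_iter by the same factor keeps the effective batch size, and therefore the actual lr, unchanged.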
🚀 Model and Data Scaling
- Model Scaling. To pre-train ViT-Small, ViT-Base, ViT-Large, or ViT-Huge, set --model to one of:
--model [fsfm_vit_small_patch16, fsfm_vit_base_patch16, fsfm_vit_large_patch16, fsfm_vit_huge_patch14 (with --patch_size 14)]
- Data Scaling.
- FSFM can be readily pre-trained on various facial image/video datasets (real faces only); you can follow ⚫Pre-training Data for preparation.
- To pre-train the model on arbitrary combinations of datasets, just repeat --pretrain_data_path, like:
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=4 main_pretrain.py \
-- (Omit other params...)
--pretrain_data_path ../../datasets/pretrain_datasets/VGG-Face2 \
--pretrain_data_path ../../datasets/pretrain_datasets/YoutubeFace \
--pretrain_data_path ../../datasets/pretrain_datasets/FaceForensics_youtube/128_frames/c23
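For readers wondering how a flag can be passed several times, the sketch below shows one common way to declare a repeatable argument with argparse. It is an assumption about how such a flag could be parsed, not a copy of the parser in main_pretrain.py, which may differ.

```python
import argparse

parser = argparse.ArgumentParser()
# action="append" collects every occurrence of the flag into a single list.
parser.add_argument("--pretrain_data_path", action="append", default=None)

args = parser.parse_args([
    "--pretrain_data_path", "../../datasets/pretrain_datasets/VGG-Face2",
    "--pretrain_data_path", "../../datasets/pretrain_datasets/YoutubeFace",
])
print(args.pretrain_data_path)  # a list with both paths, in command-line order
```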
💾 Pre-training/Resume from Checkpoint
- To continue pre-training from pre-trained/model checkpoints:
CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=4 main_pretrain.py \
-- (Omit other params...)
--resume 'path_to_model_ckpt/checkpoint-$epoch$.pth' \
--resume_target_network 'path_to_model_ckpt/checkpoint-te-$epoch$.pth'
📥 Download Manually
We provide the model weights on Hugging Face and will continuously update them; they can be downloaded from the following links (placed by default in ./fsfm-3c/pretrain/checkpoint/pretrained_models/):
Backbone | Pre-trained data | Epochs | Online Network 🤗 | Target Network 🤗 | Logs 🤗 | Normalize 🤗 |
---|---|---|---|---|---|---|
ViT-B/16 | VGG-Face2(~) | 400 | checkpoint-400.pth | checkpoint-te-400.pth | log.txt&log_detail.txt | pretrain_ds_mean_std.txt |
coming soon |
- For Downstream Tasks: load the ViT weights from the Online Network and apply normalization from Normalize (instead of ImageNet's mean&std).
- Resuming Weights for Continued Pre-training: additionally, download the Target Network and refer to Pre-training/Resume from Checkpoint
- 💡 Further Improvements: you can pre-train for more epochs, adopt larger ViTs, and use more faces. Due to computational limitations, we will continue to update models.
💻 Download Script
The models can be downloaded from huggingface_hub by running python /fsfm-3c/pretrain/download_pretrained_weitghts.py:
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="Wolowolo/fsfm-3c", filename="pretrained_models/VF2_ViT-B/checkpoint-400.pth", local_dir="./pretrain/checkpoint/VF2_ViT-B", local_dir_use_symlinks=False)
hf_hub_download(repo_id="Wolowolo/fsfm-3c", filename="pretrained_models/VF2_ViT-B/checkpoint-te-400.pth", local_dir="./pretrain/checkpoint/VF2_ViT-B", local_dir_use_symlinks=False)
hf_hub_download(repo_id="Wolowolo/fsfm-3c", filename="pretrained_models/VF2_ViT-B/pretrain_ds_mean_std.txt", local_dir="./pretrain/checkpoint/VF2_ViT-B", local_dir_use_symlinks=False)
The implementation of fine-tuning pre-trained model on various downstream face security-related tasks.
To evaluate the generalizability of our method across diverse deepfake detection scenarios, we follow the challenging cross-dataset setup.
⬇️ Dataset Preparation
For paper implementation, we fine-tune one detector on the FaceForensics++ (FF++, c23/HQ version) dataset and test it on unseen datasets: CelebDF-v2 (CDFv2), Deepfake Detection Challenge (DFDC), Deepfake Detection Challenge preview (DFDCp), and Wild Deepfake(WDF). Download these datasets and refer to DfD Folder Structure.
📁 DfD Folder Structure
You can organize the Folder structure in
/datasets/finetune/preprocess/config/default.py
The following is the default Folder Structure for deepfake detection. The paths in each directory are described in the comments.
datasets/
├── data/
│ ├── Celeb-DF-v2/ # Celeb-DF (v2)
│ │ ├── Celeb-real/ # download data
│ │ ├── YouTube-real/ # download data
│ │ ├── Celeb-synthesis/ # download data
│ │ └── List_of_testing_videos.txt # download data
│ │
│ ├── DFDC/ # DeepFake Detection Challenge (Full)
│ │ └── test/ # download data
│ │ ├── ori_videos/
│ │ ├── labels.csv
│ │ └── metadata.json
│ │
│ ├── DFDCP/ # DeepFake Detection Challenge (Preview)
│ │ ├── original_videos/ # download data
│ │ ├── method_A/ # download data
│ │ ├── method_B/ # download data
│ │ └── dataset.json # download data
│ │
│ ├── deepfake_in_the_wild/ # WildDeepfake (WDF)
│ │ ├── real_test/ # download data
│ │ └── fake_test/ # download data
│ │
│ └── FaceForensics/ # FF++
│ ├── dataset/ # download splits
│ │ └── splits/
│ │ ├── train.json
│ │ ├── val.json
│ │ └── test.json
│ ├── original_sequences/ # download data
│ │ ├── youtube/ # videos of real faces in FF++
│ │ │ └── c23/
│ │ └── actors/
│ │ └── raw/ # videos of real faces in DFD (DeepFakeDetection) datasets
│ ├── manipulated_sequences/ # download data, videos of fake faces in FF++
│ │ ├── DeepFakes/
│ │ │ └── c23/
│ │ ├── Face2Face/
│ │ │ └── c23/
│ │ ├── FaceSwap/
│ │ │ └── c23/
│ │ ├── NeuralTextures/
│ │ │ └── c23/
│ │ └── DeepFakeDetection/ # videos of fake faces in DFD (DeepFakeDetection) datasets
│ │ │ └── raw/
│ └── facial_images_split/ # facial images (automatically created by finetune/preprocess/dataset_preprocess.py)
│
├── finetune/preprocess/
│ ├── config/
│ │ ├── __init__.py
│ │ └── default.py # define folder structure
│ ├── tools/
│ │ └── util.py # Frame and Face Extraction functions
│ └── dataset_preprocess.py # to construct fine-tuning data (including train/val/test/) for DfD and DiFF tasks
│
└── finetune_datasets/ # final fine-tuning data (automatically created by dataset_preprocess.py)
└── deepfakes_detection/ # data for DfD fine-tuning (automatically created by finetune/preprocess/dataset_preprocess.py)
├── Celeb-DF-v2/
├── deepfake_in_the_wild/
├── DFDC/
├── DFDCP/
└── FaceForensics/
🗂️ Make Fine-tuning&Testing Dataset
We use DLIB for face detection with a 30% additional cropping size. Run /datasets/finetune/preprocess/dataset_preprocess.py to make the train/val/test datasets for our downstream deepfake detection task.
cd /datasets/finetune/preprocess
python dataset_preprocess.py --dataset FF++_all # extracting faces from videos and making FF++ train/val/ sets
# This would yield DS_FF++_all_cls/ dataset for our DfD model fine-tuning, placed in following folder:
# finetune_datasets/
# └── deepfakes_detection/
# └── FaceForensics
# └── $num$_frames/ # default $num$ is 32
# └── DS_FF++_all_cls/
# └── $compression$/ # default $compression$ is c23 (there are three version in FF++/DFD: raw/c40/c23)
# ├── train/
# ├── val/
# └── test/
python dataset_preprocess.py --dataset [CelebDFv2, DFDC, DFDC_P, WildDeepfake, CelebDF, DFD] # extracting faces and making test sets
# This would yield testing set for our cross-dataset DfD evaluation. placed in following folder:
# finetune_datasets/
# └── deepfakes_detection/
# ├── [Celeb-DF-v2/DFDC/DFDCP] # only facial images of test set
# │ └── $num$_frames/ # default $num$ is 32
# │ └── test/
# │
# └── deepfake_in_the_wild # already provides facial images, use its test set directly
# └── test/
# construct the FF++_DeepFakes(c23) subset for our other <unseen DiFF (Diffusion face forgery detection) task> or the optional <cross-manipulation exps in FF++>.
python dataset_preprocess.py --dataset FF++_each # extracting faces from videos and making FF++ train/val/ sets for four manipulations
# This would yield DS_FF++_each_cls/ dataset (we only use its DeepFakes subset for our DiFF task), placed in the following folder:
# finetune_datasets/
# └── deepfakes_detection/
# └── FaceForensics
# └── $num$_frames/ # default $num$ is 32
# └── DS_FF++_each_cls/
# └── $compression$/ # default $compression$ is c23 (there are three version in FF++/DFD: raw/c40/c23)
# ├── DeepFakes/ ── train/val/test/
# ├── Face2Face/ ── train/val/test/
# ├── FaceSwap/ ── train/val/test/
# └── NeuralTextures/ ── train/val/test/
- Pre-processing settings (number of extracted frames, compression version, etc.) are specified in /datasets/finetune/preprocess/config/default.py.
- You can include other datasets by following /datasets/finetune/preprocess/dataset_preprocess.py and datasets/finetune/preprocess/config/default.py
⚡ Fine-tuning
cd ./fsfm-3c/finuetune/cross_dataset_DfD/ and run the script main_finetune_DfD.py to fine-tune the model:
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=2 main_finetune_DfD.py \
--accum_iter 1 \
--apply_simple_augment \
--batch_size 32 \
--nb_classes 2 \
--model vit_base_patch16 \
--epochs 10 \
--blr 2.5e-4 \
--layer_decay 0.65 \
--weight_decay 0.05 \
--drop_path 0.1 \
--reprob 0.25 \
--mixup 0.8 \
--cutmix 1.0 \
--dist_eval \
--finetune 'path to pre-trained model ckpt $model pre-trained on VF2$' \
--finetune_data_path 'data path for fine-tuning $path to FF++_c23$' \
--output_dir 'path to save finetuned model ckpt and logs' # default to ./checkpoint/$USR/experiments_finetune/$PID$
📜Paper Implementation: the $🖲️script$ for fine-tuning, fine-tuned checkpoint, and logs are available at 🤗here
🧩 Most settings adhere to the MAE finetuning recipe. Except for adapting from ImageNet to the DfD task, we did not make much effort to adjust the hyper-parameters.
- --finetune: ckpt of (FSFM) pre-trained ViT models. Get our pre-trained checkpoints from Pre-trained Checkpoints or download 🤗here
- Here the effective batch size is 32 batch_size (per gpu) * 1 nodes * 2 (gpus per node) = 64.
- blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective batch size / 256.
- The DfD fine-tuning hyper-parameters slightly differ from the default MAE baseline for ImageNet classification.
- Fine-tuning/Training time is ~1h for 10 epochs on 2 A6000 GPUs. (~6250MiB Memory-Usage per GPU)
✨ Fine-tuning with different dataset structure
- --finetune_data_path folder structure should be:
--finetune_data_path/
├── train/
│   ├── class-1/ (e.g., real)
│   └── class-2/ (e.g., fake)
└── val/
    ├── class-1/ (e.g., real)
    └── class-2/ (e.g., fake)
To fine-tune/train the model on arbitrary combinations of various datasets, just repeat --finetune_data_path, like:
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=2 main_finetune_DfD.py \
-- (Omit other params...)
--finetune_data_path ../../datasets/finetune_datasets/finetune_data_path_1 \
--finetune_data_path ../../datasets/finetune_datasets/finetune_data_path_2 \
--finetune_data_path ../../datasets/finetune_datasets/finetune_data_path_3
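Before launching a run on a custom dataset, it can help to confirm that the folder matches the train/val layout described above. The check below is a small stdlib-only sketch; the function name check_finetune_layout is hypothetical, and the requirement of at least two class folders per split is an assumption drawn from --nb_classes 2.

```python
from pathlib import Path


def check_finetune_layout(data_path, splits=("train", "val")):
    """Return {split: sorted class-folder names}; raise if the layout is incomplete."""
    root = Path(data_path)
    classes_per_split = {}
    for split in splits:
        split_dir = root / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split folder: {split_dir}")
        # Each sub-folder of a split is treated as one class (e.g. real, fake).
        classes = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
        if len(classes) < 2:
            raise ValueError(f"{split_dir} needs at least two class folders, found {classes}")
        classes_per_split[split] = classes
    return classes_per_split
```

Comparing the class lists of train and val in the returned dict is a quick way to spot a misspelled class folder.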
- To create the dataloader from label-split files (e.g., train.txt, val.txt), replace --finetune_data_path with the following args:
CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=2 main_finetune_DfD.py \
--(Omit other params...)
--finetune_data_path [] \ # do not provide this arg !!!
--train_split {train.txt} \ # path to the train label file
--val_split {val.txt} \ # path to the val label file
--dataset_abs_path None or abs_path \ # see below
--delimiter_in_spilt ' ' \ # see below
- where --train_split/--val_split provide image_path label pairs.
- --dataset_abs_path: If the --train_split/--val_split files provide relative image paths, this is the prefix used to form the full path; if the splits already provide absolute paths, set it to None.
- --delimiter_in_spilt: The delimiter used to split the image_path and label in the --train_split/--val_split files. Set ' ' for image_path label; set ',' for image_path,label; set ', ' for image_path, label.
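To make the split-file options concrete, here is a minimal sketch of how such a label file could be read. It is not the repository's loader: the function name read_split is hypothetical, and it assumes one image_path/label pair per line with integer labels.

```python
from pathlib import Path


def read_split(split_file, delimiter=" ", dataset_abs_path=None):
    """Parse a label-split file into a list of (image_path, label) tuples."""
    samples = []
    with open(split_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the last delimiter so the image path itself may contain it.
            image_path, label = line.rsplit(delimiter, 1)
            if dataset_abs_path is not None:
                # Relative paths are prefixed with the dataset root.
                image_path = str(Path(dataset_abs_path) / image_path)
            samples.append((image_path, int(label)))
    return samples
```

Passing dataset_abs_path=None leaves the paths exactly as written in the file, matching the case where the splits already contain absolute paths.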
📊 Cross-Datasets Evaluation
cd ./fsfm-3c/finuetune/cross_dataset_DfD/ and run the script main_test_DfD.py to calculate testing results:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=1 main_test_DfD.py \
--eval \
--apply_simple_augment \
--model vit_base_patch16 \
--nb_classes 2 \
--batch_size 320 \
--resume 'path to fine-tuned model ckpt $model fine-tuned on FF++_c23$' \
--output_dir 'path to save test results' # default to ./checkpoint/$USR/experiments_test/from_{FT_folder_name}/$PID$
📜Paper Implementation: the $🖲️script$ and 🖲️test_results for testing cross-dataset DfD.
- The paths to all test sets are set in main_test_DfD.py; modify them freely and follow the folder structure (provide the parent path of the test sub-folder to the dict variable $cross_dataset_test_path$).
- To create a test dataloader from the labels file, append the following args:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=1 main_test_DfD.py \
--(Omit other params...)
--test_split 'test.txt' \
--dataset_abs_path 'absolute path to test data'
To further investigate the adaptability of our method against emerging unknown facial forgeries beyond conventional DeepFakes, we adopt a challenging cross-distribution testing.
⬇️ Dataset Preparation
We train only on the FF++_DeepFakes(c23) subset and test on the DiFF datasets. The DiFF contains high-quality face images synthesized by SOTA diffusion models across four sub-sets: T2I (Text-to-Image), I2I (Image-to-Image), FS (Face Swapping), and FE (Face Editing). This evaluation is more challenging than typical DfD, as both the unseen manipulations and generative models are significantly different. Download these datasets and refer to DiFF Folder Structure.
- FaceForensics++ (Deepfakes and original subsets for training/fine-tuning)
- DiFF (val/test data of all four subsets)
📁 DiFF Folder Structure
The following is the default Folder Structure for unseen DiFF detection. The paths in each directory are described in the comments.
datasets/
├── data/
│ ├── DiFF/
│ │ ├── DiFF_real/ # download data
│ │ │ ├── train/
│ │ │ ├── val/
│ │ │ └── test/
│ │ ├── val/ # download data (fake)
│ │ │ ├── FE
│ │ │ ├── FS
│ │ │ ├── I2I
│ │ │ └── T2I
│ │ └── test/ # download data (fake)
│ │ ├── FE
│ │ ├── FS
│ │ ├── I2I
│ │ └── T2I
│ │
│ └── FaceForensics/ # FF++
│ ├── dataset/ # download splits
│ │ └── splits/
│ │ ├── train.json
│ │ ├── val.json
│ │ └── test.json
│ ├── original_sequences/ # download data
│ │ └── youtube/ # videos of real faces in FF++
│ │ └── c23/
│ ├── manipulated_sequences/ # download data, videos of fake faces in FF++
│ │ └── DeepFakes/
│ │ └── c23/
│ └── facial_images_split/ # facial images (automatically created by finetune/preprocess/dataset_preprocess.py)
│
├── finetune/preprocess/
│ ├── config/
│ │ ├── __init__.py
│ │ └── default.py # define folder structure
│ ├── tools/
│ │ └── util.py # Frame and Face Extraction functions
│ └── dataset_preprocess.py # to construct fine-tuning data (including train/val/test/) for DfD and DiFF tasks
│
└── finetune_datasets/ # final fine-tuning data
├── deepfakes_detection/
│ └── FaceForensics/ # training data for DiFF (automatically created by finetune/preprocess/dataset_preprocess.py)
│
└── diffusion_facial_forgery_detection
└── DiFF/ # val/testing data for DiFF (automatically created by finetune/preprocess/dataset_preprocess.py)
You can organize the Folder structure in
/datasets/finetune/preprocess/config/default.py
🗂️ Make Fine-tuning&Testing Dataset
We use DLIB for face detection with a 30% additional cropping size. Run /datasets/finetune/preprocess/dataset_preprocess.py to make the train/val/test datasets for our downstream unseen diffusion facial forgery detection task.
cd /datasets/finetune/preprocess
python dataset_preprocess.py --dataset FF++_each # extracting faces from videos and making FF++ train/val/ sets for four manipulations
# This would yield DS_FF++_each_cls/ dataset (we only use its DeepFakes subset for DiFF task), placed in the following folder:
# finetune_datasets/
# └── deepfakes_detection/
# └── FaceForensics
# └── $num$_frames/ # default $num$ is 32
# └── DS_FF++_each_cls/
# └── $compression$/ # default $compression$ is c23 (there are three version in FF++/DFD: raw/c40/c23)
# ├── DeepFakes/
# │ ├── train/ # for fine-tuning/training
# │ ├── val/
# │ └── test/
# ├── Face2Face/
# ├── FaceSwap/
# └── NeuralTextures/
python dataset_preprocess.py --dataset DiFF # extracting faces from DiFF val/test/ sets for four subsets
# This would yield four val/test subsets for our unseen DiFF evaluations, placed in the following folder:
# finetune_datasets/
# └── diffusion_facial_forgery_detection/
# └── DiFF
# ├── val_subsets/
# │ ├── FE/ ── val/ ── [DiFF_real/, fake/]
# │ ├── FS/ ── val/ ── [DiFF_real/, fake/]
# │ ├── I2I/ ── val/ ── [DiFF_real/, fake/]
# │ └── T2I/ ── val/ ── [DiFF_real/, fake/]
# │
# └── test_subsets/
# ├── FE/ ── test/ ── [DiFF_real/, fake/]
# ├── FS/ ── test/ ── [DiFF_real/, fake/]
# ├── I2I/ ── test/ ── [DiFF_real/, fake/]
# └── T2I/ ── test/ ── [DiFF_real/, fake/]
⚡ Fine-tuning
cd ./fsfm-3c/finuetune/cross_dataset_unseen_DiFF/ and run the script main_finetune_DiFF.py to fine-tune the model:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=1 main_finetune_DiFF.py \
--accum_iter 1 \
--normalize_from_IMN \
--apply_simple_augment \
--batch_size 256 \
--nb_classes 2 \
--model vit_base_patch16 \
--epochs 50 \
--blr 5e-4 \
--layer_decay 0.65 \
--weight_decay 0.05 \
--drop_path 0.1 \
--reprob 0.25 \
--mixup 0.8 \
--cutmix 1.0 \
--dist_eval \
--finetune 'path to pre-trained model ckpt $model pre-trained on FF++_o_c23$' \
--data_path 'data path for fine-tuning $path to FF++_DF_c23$' \
--val_data_path 'data path for fine-tuning $path to DiFF_val_subsets$' \
--output_dir 'path to save finetuned model ckpt and logs' # default to ./checkpoint/$USR/experiments_finetune/$PID$
📜Paper Implementation: the $🖲️script$ for fine-tuning, fine-tuned checkpoints, and logs are available at 🤗here.
🧩 Most code in /fsfm-3c/finuetune/cross_dataset_unseen_DiFF/ is inherited from cross_dataset_DfD/ and tailored for this specific DiFF evaluation.
✨ We recommend building on cross_dataset_DfD/ to extend your work.
- --finetune: ckpt of (FSFM) pre-trained ViT models. Get our pre-trained checkpoints from 🤗here, which were only pre-trained on FF++_o (c23, all_frames from train/val split), following our statement.
📊 Cross-Datasets Evaluation
cd ./fsfm-3c/finuetune/cross_dataset_unseen_DiFF/ and run the script main_test_DiFF.py to calculate testing results:
CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=2 python -m torch.distributed.launch --nproc_per_node=1 main_test_DiFF.py \
--normalize_from_IMN \
--apply_simple_augment \
--eval \
--model vit_base_patch16 \
--nb_classes 2 \
--batch_size 320 \
--resume 'path to fine-tuned model ckpt $model fine-tuned on FF++_DF_c23$' \
--output_dir 'path to save test results' # default to ./checkpoint/$USR/experiments_test/from_{FT_folder_name}/$PID$
📜Paper Implementation: the $🖲️script$ and 🖲️test_results for testing cross-dataset DiFF.
To evaluate the transferability of our method for face anti-spoofing (FAS) under significant domain shifts, we apply the leave-one-out (LOO) cross-domain evaluation on the widely used benchmark.
⬇️ Dataset Preparation
For the downstream 0-shot cross-domain FAS task, we directly follow Protocol 1 (MCIO) in few_shot_fas to prepare and preprocess data.
- Put the prepared datasets data/ into our default Folder Structure, as follows:
datasets/
└── finetune_datasets/ # final fine-tuning data
    └── face_anti_spoofing/
        └── data/ # the prepared datasets from few_shot_fas
            ├── MCIO/ # we use this set (Protocol 1)
            │   ├── frame/
            │   │   ├── casia/
            │   │   │   ├── train/ ── [real/, fake/]
            │   │   │   └── test/ ── [real/, fake/]
            │   │   ├── celeb/
            │   │   │   ├── train/ ── [real/, fake/]
            │   │   │   └── test/ ── [real/, fake/]
            │   │   ├── msu/
            │   │   │   ├── train/ ── [real/, fake/]
            │   │   │   └── test/ ── [real/, fake/]
            │   │   ├── oulu/
            │   │   │   ├── train/ ── [real/, fake/]
            │   │   │   └── test/ ── [real/, fake/]
            │   │   └── replay/
            │   │       ├── train/ ── [real/, fake/]
            │   │       └── test/ ── [real/, fake/]
            │   └── txt/
            │       └── [casia_fake_shot.txt, casia_fake_test.txt, ...]
            │
            └── WCS/
⚡ Fine-tuning and Evaluation
cd ./fsfm-3c/finuetune/cross_domain_FAS/ and run the script train_vit.py to fine-tune and evaluate the model:
python train_vit.py \
--pt_model 'path to pre-trained model ckpt $model pre-trained on VF2$' \
--op_dir 'path to save finetuned model ckpt and logs' \
--report_logger_path 'path to save performance.csv of evaluation' \
--config M # choose from [M, C, I, O] for Protocol 1
📜Paper Implementation: the $🖲️script$ for fine-tuning, fine-tuned checkpoint, logs, and evaluations are available at 🤗here
🧩 Code for this downstream task is built on the few_shot_fas, you could try more experiments on other protocols or scenarios freely.
- --pt_model: ckpt of (FSFM) pre-trained ViT models. Get our pre-trained checkpoints from Pre-trained Checkpoints or download 🤗here
- The data and label paths are specified in the sample_frames function of cross_domain_FAS/utils/utils.py
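The leave-one-out protocol can be illustrated by enumerating the four MCIO splits. This is a sketch under two assumptions that the README does not confirm: that each --config letter names the held-out target domain, and that the letters map to the frame/ folders as M=msu, C=casia, I=replay, O=oulu. The names DOMAINS and leave_one_out are hypothetical.

```python
# Assumed mapping from Protocol 1 letters to the folders under MCIO/frame/.
DOMAINS = {"M": "msu", "C": "casia", "I": "replay", "O": "oulu"}


def leave_one_out(target):
    """Return (source_domains, target_domain) for one leave-one-out split."""
    sources = [name for key, name in DOMAINS.items() if key != target]
    return sources, DOMAINS[target]


for key in DOMAINS:
    sources, held_out = leave_one_out(key)
    print(f"--config {key}: train on {sources}, test on {held_out}")
```

Running all four configurations gives one result per held-out domain, which is what the LOO evaluation reports.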
If our research helps your work, please consider giving us a star ⭐ or citing us:
@article{wang2024fsfm,
title={FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning},
author={Wang, Gaojian and Lin, Feng and Wu, Tong and Liu, Zhenguang and Ba, Zhongjie and Ren, Kui},
journal={arXiv preprint arXiv:2412.12032},
year={2024}
}