
FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning

Gaojian Wang 1,2, Feng Lin 1,2, Tong Wu 1,2, Zhenguang Liu 1,2, Zhongjie Ba 1,2, Kui Ren 1,2

1 State Key Laboratory of Blockchain and Data Security, Zhejiang University
2 Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security

This is the implementation of FSFM-3C, a self-supervised pre-training framework to learn a transferable facial representation that boosts face security tasks.

Release🎉

  • 2024-12: A demo visualizing the different facial masking strategies introduced in FSFM-3C for MIM is available at
  • 2024-12: The online detectors (based on simply fine-tuned models from the paper implementation) are available at
  • 2024-12: The pre-trained/fine-tuned models and pre-training/fine-tuning logs of the paper implementation are available at
  • 2024-12: All code, including data preprocessing, pre-training, fine-tuning, and testing, is released on this page
  • 2024-12: Our paper is available at

🔧 Installation

Clone this repository, create a conda environment, and activate it via the following commands:

git clone https://github.com/wolo-wolo/FSFM.git
cd FSFM/
conda create -n fsfm3c python=3.9
conda activate fsfm3c
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117 # run this first (the versions used in our experiments)
pip install -r requirements.txt
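
Optionally, a quick sanity check (not part of the repo) to confirm the expected PyTorch/CUDA build is active before pre-processing or pre-training:

# Optional sanity check: confirm the expected PyTorch/CUDA build is active.
import torch
import torchvision

print("torch:", torch.__version__)              # expect 1.13.1+cu117
print("torchvision:", torchvision.__version__)  # expect 0.14.1+cu117
print("CUDA available:", torch.cuda.is_available())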

🚀 FSFM Pre-training

The implementation of pre-training FSFM-3C ViT models from unlabeled facial images.

⚫ Pre-training Data

⬇️ Dataset Preparation

💡 FSFM can be readily pre-trained on various facial (images or videos) datasets and their combinations without annotations, learning a general facial representation that transcends specific domains or tasks. Thus, it can benefit from the larger scale and greater diversity of unlabeled faces widely available in the open world.

For the paper implementation, we pre-trained our model on the following datasets: VGGFace2, YouTubeFace, and the original (real) subset of FaceForensics++. Download these datasets as needed and refer to Folder Structure.

⬇️ Toolkit Preparation

We use DLIB for face detection and the FACER toolkit for face parsing. Download the FACER toolkit in advance:

cd /datasets/pretrain/preprocess/tools
git clone https://github.com/FacePerceiver/facer
📁 Folder Structure

You can configure the folder structure in /datasets/pretrain/preprocess/config/default.py

The following is the default folder structure. The paths in each directory are described in the comments.

datasets/
├── data/
│   ├── VGG-Face2/    # VGGFace2
│   │   ├── train/    # download data
│   │   ├── test/    # download data
│   │   └── facial_images/    # facial images (train + test) (automatically created by pretrain/preprocess/dataset_preprocess.py)
│   │
│   ├── FaceForensics/    # FF++
│   │   ├── dataset/    # download splits
│   │   │   └── splits/
│   │   │       ├── train.json
│   │   │       ├── val.json
│   │   │       └── test.json
│   │   ├── original_sequences/    # download data 
│   │   │   └── youtube/    # real faces (we use c23 version) for pre-training
│   │   │       └── c23/
│   │   ├── manipulated_sequences/    # download data, fake faces for deepfake detection, not used in pre-training
│   │   └── facial_images_split/    # facial images (automatically created by pretrain/preprocess/dataset_preprocess.py)
│   │
│   └── YoutubeFace/    # YoutubeFace
│       ├── frame_images_DB/    # download data 
│       └── facial_images/    # facial images (automatically created by pretrain/preprocess/dataset_preprocess.py)
│
├── pretrain/preprocess/
│   ├── config/
│   │   ├── __init__.py
│   │   └── default.py    # define folder structure
│   ├── tools/
│   │   ├── facer/    # download FACER toolkit to here
│   │   └── util.py    # Frame and Face Extraction functions
│   ├── dataset_preprocess.py    # for face extraction from images or videos
│   └── face_parse.py    # for face parsing to make pre-training data
│
└── pretrain_datasets/    # final pre-training data (automatically created by face_parse.py)
    ├── FaceForensics_youtube/    # FF++_o data for pre-training
    ├── YoutubeFace/    # YoutubeFace (YTF) data for pre-training
    └── VGGFace2/    # VGGFace2 (VF2) data for pre-training
🗂️ Make Pre-training Dataset
1) 🦱 Face Extraction

We use DLIB for face detection with a 30% additional cropping margin (a cropping sketch follows below). Run /datasets/pretrain/preprocess/dataset_preprocess.py to extract faces from images or videos:

cd /datasets/pretrain/preprocess/
python dataset_preprocess.py --dataset [VF2, FF++_o, YTF]

The facial images extracted from each dataset:

  • VF2: ~3M facial images from VGGFace2, including the full train and test subsets
  • YTF: ~600K facial images from YouTubeFace, which comprises 3,425 YouTube videos already provided as frames
  • FF++_o: ~100K facial images for 128_frames per video (~430K for all_frames per video), from the original YouTube subset of the FaceForensics++ (FF++) c23 (HQ) version, which includes 720 training and 140 validation videos (the ~100K set serves for some of our ablations due to limited computational resources)

    You can specify FF_compression and FF_num_frames in /datasets/pretrain/preprocess/config/default.py, as an example of preprocessing a facial video dataset.
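
For reference, a minimal sketch (not the repo's code; the actual face-extraction functions live in preprocess/tools/util.py) of a DLIB detection cropped with a 30% enlarged box; the helper name and the even split of the margin are assumptions:

# Hypothetical sketch: detect a face with DLIB and crop it with a 30% enlarged box.
import dlib
import numpy as np
from PIL import Image

detector = dlib.get_frontal_face_detector()

def crop_face(img_path, margin=0.3):
    img = np.array(Image.open(img_path).convert("RGB"))
    rects = detector(img, 1)                               # upsample once to catch small faces
    if not rects:
        return None
    r = max(rects, key=lambda b: b.width() * b.height())   # keep the largest detected face
    dw, dh = int(r.width() * margin / 2), int(r.height() * margin / 2)
    h, w = img.shape[:2]
    x0, y0 = max(r.left() - dw, 0), max(r.top() - dh, 0)
    x1, y1 = min(r.right() + dw, w), min(r.bottom() + dh, h)
    return Image.fromarray(img[y0:y1, x0:x1])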

2) 🧑 Face Parsing

We use the FACER toolkit for face parsing. Cropped faces are resized to 224×224, and parsing maps are saved as .npy files, enabling efficient facial masking during pre-training. Run /datasets/pretrain/preprocess/face_parse.py for processing:

python face_parse.py --dataset [FF++_o, YTF, VF2] 
# or CUDA_VISIBLE_DEVICES=0 python face_parse.py --dataset [FF++_o, YTF, VF2]

The resulting /datasets/pretrain_datasets/ folder structure should finally be:

pretrain_datasets/                           
└── specific_dataset
   ├── images (3*224*224 .png)
   ├── parsing_maps (1*224*224 .npy)
   └── vis_parsing_maps (optional for visualization)
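
Each pre-training sample therefore pairs a 224×224 RGB face crop with a single-channel parsing map. A minimal sketch for inspecting one pair (file paths are placeholders):

# Minimal sketch for inspecting one pre-training sample (paths are placeholders).
import numpy as np
from PIL import Image

img = Image.open("pretrain_datasets/VGGFace2/images/sample.png")         # 3 x 224 x 224 RGB face crop
parsing = np.load("pretrain_datasets/VGGFace2/parsing_maps/sample.npy")  # 1 x 224 x 224 integer region labels

print(img.size)                   # (224, 224)
print(parsing.shape, parsing.dtype)
print(np.unique(parsing))         # facial region ids used to build masks during pre-training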

⚫ Pre-training Model

🔄 Pre-training from Scratch

cd ./fsfm-3c/pretrain/ and run the script main_pretrain.py to pre-train the model.

CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=4 main_pretrain.py \
    --batch_size 256 \
    --accum_iter 4 \
    --epochs 400 \
    --model fsfm_vit_base_patch16 \
    --input_size 224 \
    --mask_ratio 0.75 \
    --norm_pix_loss \
    --weight_sfr 0.007 \
    --weight_cl 0.1 \
    --cl_loss SimSiam \
    --weight_decay 0.05 \
    --blr 1.5e-4 \
    --warmup_epochs 40 \
    --pretrain_data_path ../../datasets/pretrain_datasets/'{VGG-Face2, YoutubeFace, FaceForensics_youtube/128_frames/c23}'  \
    --output_dir 'path to save pretrained model ckpt and logs' # default to: /fsfm-3c/pretrain/checkpoint/$USR/experiments_pretrain/$PID$
  • We use --accum_iter to maintain the effective batch size: 256 (batch_size per gpu) * 1 (node) * 4 (gpus per node) * 4 (accum_iter) = 4096.
  • blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective batch size / 256 (worked example below).
  • Here we use --norm_pix_loss as the target for better representation learning. To train a baseline model (e.g., for visualization), use pixel-based reconstruction by turning off --norm_pix_loss.
  • In --output_dir, we save the weights of the online network and the target network separately to checkpoint-$epoch$.pth (for downstream tasks) and checkpoint-te-$epoch$.pth (for resuming pre-training), and also save the weights with the minimum pre-training loss to checkpoint-min_pretrain_loss.pth and checkpoint-te-min_pretrain_loss.pth, respectively.
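
As a quick worked example of the scaling rule above, with the settings from this command:

# Effective batch size and actual learning rate for the command above.
batch_size, nodes, gpus, accum_iter, blr = 256, 1, 4, 4, 1.5e-4
eff_batch = batch_size * nodes * gpus * accum_iter  # 4096
lr = blr * eff_batch / 256                          # 2.4e-3
print(eff_batch, lr)
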
🚀 Model and Data Scaling
  • Model Scaling. To pre-train ViT-Small, ViT-Base, ViT-Large, or ViT-Huge, set --model to one of:

    --model [fsfm_vit_small_patch16, fsfm_vit_base_patch16, fsfm_vit_large_patch16, fsfm_vit_huge_patch14 (with --patch_size 14)]
    
  • Data Scaling.

    • FSFM can be readily pre-trained on various facial image/video datasets (only real faces are required); follow ⚫ Pre-training Data for preparation.
    • To pre-train the model on arbitrary combinations of datasets, pass --pretrain_data_path multiple times, e.g.:
      CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=4 main_pretrain.py \
          -- (Omit other params...)
          --pretrain_data_path ../../datasets/pretrain_datasets/VGG-Face2 \
          --pretrain_data_path ../../datasets/pretrain_datasets/YoutubeFace \
          --pretrain_data_path ../../datasets/pretrain_datasets/FaceForensics_youtube/128_frames/c23
💾 Pre-training/Resume from Checkpoint
  • To continue pre-training from our pre-trained checkpoints or your own saved checkpoints:

    CUDA_VISIBLE_DEVICES=0,1,2,3 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=4 main_pretrain.py \
        -- (Omit other params...)
        --resume 'path_to_model_ckpt/checkpoint-$epoch$.pth' \
        --resume_target_network 'path_to_model_ckpt/checkpoint-te-$epoch$.pth'

🤗 Pre-trained Checkpoints

📥 Download Manually

We provide the model weights on Hugging Face and will continue to update them. They can be downloaded from the following links (placed by default in ./fsfm-3c/pretrain/checkpoint/pretrained_models/):

Backbone   Pre-trained data   Epochs   Online Network 🤗      Target Network 🤗         Logs 🤗                    Normalize 🤗
ViT-B/16   VGG-Face2 (~)      400      checkpoint-400.pth     checkpoint-te-400.pth     log.txt & log_detail.txt   pretrain_ds_mean_std.txt
(more models coming soon)
  • For Downstream Tasks: load the ViT weights from the Online Network and apply the normalization from Normalize (instead of ImageNet's mean & std); see the sketch after this list.
  • Resuming Weights for Continued Pre-training: additionally download the Target Network and refer to Pre-training/Resume from Checkpoint.
  • 💡 Further Improvements: you can pre-train for more epochs, adopt larger ViTs, and use more faces. Due to computational limitations, we will continue to update the models.
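
A minimal sketch (not the repo's loader) of how the downloaded online-network checkpoint and dataset statistics could be consumed downstream; the 'model' checkpoint key and the mean/std values below are assumptions, so take the real values from pretrain_ds_mean_std.txt:

# Hypothetical sketch: load the online-network ViT weights and build a transform
# that uses the pre-training dataset statistics instead of ImageNet's mean/std.
import torch
from torchvision import transforms

ckpt = torch.load("path/to/VF2_ViT-B/checkpoint-400.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # MAE-style checkpoints typically store weights under "model"
# your_vit_b16.load_state_dict(state_dict, strict=False)  # load into a ViT-B/16 backbone

mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]  # placeholders: read the real values from pretrain_ds_mean_std.txt
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std),
])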
💻 Download Script

The models can also be downloaded via huggingface_hub by running python /fsfm-3c/pretrain/download_pretrained_weitghts.py:

from huggingface_hub import hf_hub_download

# Download the online network, target network, and normalization stats of the ViT-B/16 model pre-trained on VGG-Face2.
hf_hub_download(repo_id="Wolowolo/fsfm-3c", filename="pretrained_models/VF2_ViT-B/checkpoint-400.pth", local_dir="./pretrain/checkpoint/VF2_ViT-B", local_dir_use_symlinks=False)
hf_hub_download(repo_id="Wolowolo/fsfm-3c", filename="pretrained_models/VF2_ViT-B/checkpoint-te-400.pth", local_dir="./pretrain/checkpoint/VF2_ViT-B", local_dir_use_symlinks=False)
hf_hub_download(repo_id="Wolowolo/fsfm-3c", filename="pretrained_models/VF2_ViT-B/pretrain_ds_mean_std.txt", local_dir="./pretrain/checkpoint/VF2_ViT-B", local_dir_use_symlinks=False)

⚡ Fine-tuning FSFM Pre-trained ViTs for Downstream Tasks

The implementation of fine-tuning the pre-trained model on various downstream face security-related tasks.

⚫ Cross-Dataset Deepfake Detection (DfD)

To evaluate the generalizability of our method across diverse deepfake detection scenarios, we follow the challenging cross-dataset setup.

⬇️ Dataset Preparation

For the paper implementation, we fine-tune one detector on the FaceForensics++ (FF++, c23/HQ version) dataset and test it on unseen datasets: CelebDF-v2 (CDFv2), Deepfake Detection Challenge (DFDC), Deepfake Detection Challenge preview (DFDCP), and WildDeepfake (WDF). Download these datasets and refer to DfD Folder Structure.

📁 DfD Folder Structure

You can configure the folder structure in /datasets/finetune/preprocess/config/default.py

The following is the default folder structure for deepfake detection. The paths in each directory are described in the comments.

datasets/
├── data/
│   ├── Celeb-DF-v2/   # Celeb-DF (v2)
│   │   ├── Celeb-real/    # download data
│   │   ├── YouTube-real/    # download data
│   │   ├── Celeb-synthesis/    # download data 
│   │   └── List_of_testing_videos.txt    # download data
│   │
│   ├── DFDC/   # DeepFake Detection Challenge (Full)
│   │   └── test/   # download data
│   │       ├── ori_videos/   
│   │       ├── labels.csv
│   │       └── metadata.json
│   │
│   ├── DFDCP/   # DeepFake Detection Challenge (Preview)
│   │   ├── original_videos/    # download data
│   │   ├── method_A/    # download data
│   │   ├── method_B/    # download data
│   │   └── dataset.json   # download data
│   │
│   ├── deepfake_in_the_wild/   # WildDeepfake (WDF)
│   │   ├── real_test/    # download data
│   │   └── fake_test/    # download data
│   │
│   └── FaceForensics/    # FF++
│       ├── dataset/    # download splits
│       │   └── splits/
│       │       ├── train.json
│       │       ├── val.json
│       │       └── test.json
│       ├── original_sequences/    # download data 
│       │   ├── youtube/    # videos of real faces in FF++
│       │   │   └── c23/
│       │   └── actors/
│       │       └── raw/    # videos of real faces in DFD (DeepFakeDetection) datasets
│       ├── manipulated_sequences/    # download data, videos of fake faces in FF++
│       │   ├── DeepFakes/
│       │   │   └── c23/
│       │   ├── Face2Face/
│       │   │   └── c23/
│       │   ├── FaceSwap/
│       │   │   └── c23/
│       │   ├── NeuralTextures/
│       │   │   └── c23/
│       │   └── DeepFakeDetection/    # videos of fake faces in DFD (DeepFakeDetection) datasets
│       │       └── raw/
│       └── facial_images_split/    # facial images (automatic creating by finetune/preprocess/dataset_preprocess.py)
│    
├── finetune/preprocess/
│   ├── config/
│   │   ├── __init__.py
│   │   └── default.py    # define folder structure
│   ├── tools/
│   │   └── util.py     # Frame and Face Extraction functions
│   └── dataset_preprocess.py    # to construct fine-tuning data (including train/val/test/) for DfD and DiFF tasks
│ 
└── finetune_datasets/    # final fine-tuning data (automatically created by dataset_preprocess.py)
    └── deepfakes_detection/  # data for DfD fine-tuning (automatically created by finetune/preprocess/dataset_preprocess.py)
        ├── Celeb-DF-v2/           
        ├── deepfake_in_the_wild/
        ├── DFDC/
        ├── DFDCP/
        └── FaceForensics/ 
🗂️ Make Fine-tuning & Testing Datasets

We use DLIB for face detection with a 30% additional cropping margin. Run /datasets/finetune/preprocess/dataset_preprocess.py to make train/val/test datasets for the downstream deepfake detection task.

cd /datasets/finetune/preprocess

python dataset_preprocess.py --dataset FF++_all    # extracting faces from videos and making FF++ train/val/test sets
# This would yield the DS_FF++_all_cls/ dataset for our DfD model fine-tuning, placed in the following folder:
# finetune_datasets/
# └── deepfakes_detection/                           
#     └── FaceForensics
#         └── $num$_frames/    # default $num$ is 32
#             └── DS_FF++_all_cls/
#                 └── $compression$/    # default $compression$ is c23 (there are three versions in FF++/DFD: raw/c40/c23)
#                     ├── train/
#                     ├── val/
#                     └── test/ 

python dataset_preprocess.py --dataset [CelebDFv2, DFDC, DFDC_P, WildDeepfake, CelebDF, DFD]   # extracting faces and making test sets
# This would yield the testing sets for our cross-dataset DfD evaluation, placed in the following folders:
# finetune_datasets/
# └── deepfakes_detection/                           
#     ├── [Celeb-DF-v2/DFDC/DFDCP]    # only facial images of test set 
#     │   └── $num$_frames/    # default $num$ is 32
#     │       └── test/
#
#     └── deepfake_in_the_wild    # already provides facial images, use its test set directly
#         └── test/

# Construct the FF++_DeepFakes (c23) subset for the unseen DiFF (diffusion facial forgery detection) task, or for optional cross-manipulation experiments within FF++.
python dataset_preprocess.py --dataset FF++_each    # extracting faces from videos and making FF++ train/val/test sets for the four manipulation types
# This would yield DS_FF++_each_cls/ dataset (we only use its DeepFakes subset for our DiFF task), placed in the following folder:
# finetune_datasets/
# └── deepfakes_detection/                           
#     └── FaceForensics
#         └── $num$_frames/    # default $num$ is 32
#             └── DS_FF++_each_cls/
#                 └── $compression$/    # default $compression$ is c23 (there are three versions in FF++/DFD: raw/c40/c23)
#                     ├── DeepFakes/  ── train/val/test/ 
#                     ├── Face2Face/ ── train/val/test/ 
#                     ├── FaceSwap/ ── train/val/test/ 
#                     └── NeuralTextures/ ── train/val/test/ 
  • Pre-processing settings (number of extracted frames, compression version, etc.) are specified in /datasets/finetune/preprocess/config/default.py.
  • You can include other datasets by following /datasets/finetune/preprocess/dataset_preprocess.py and /datasets/finetune/preprocess/config/default.py.
⚡ Fine-tuning

cd ./fsfm-3c/finuetune/cross_dataset_DfD/ and run the script main_finetune_DfD.py to fine-tune the model:

CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=2 main_finetune_DfD.py \
    --accum_iter 1 \
    --apply_simple_augment \
    --batch_size 32 \
    --nb_classes 2 \
    --model vit_base_patch16 \
    --epochs 10 \
    --blr 2.5e-4 \
    --layer_decay 0.65 \
    --weight_decay 0.05 \
    --drop_path 0.1 \
    --reprob 0.25 \
    --mixup 0.8 \
    --cutmix 1.0 \
    --dist_eval \
    --finetune 'path to pre-trained model ckpt $model pre-trained on VF2$' \
    --finetune_data_path 'data path for fine-tuning $path to FF++_c23$' \
    --output_dir 'path to save finetuned model ckpt and logs'  # default to ./checkpoint/$USR/experiments_finetune/$PID$

📜Paper Implementation: the $🖲️script$ for fine-tuning, fine-tuned checkpoint, and logs are available at 🤗here
🧩 Most settings adhere to the MAE fine-tuning recipe; beyond adapting it from ImageNet to the DfD task, we did not put much effort into tuning the hyper-parameters.

  • --finetune: ckpt of (FSFM) pre-trained ViT models. Get our pre-trained checkpoints from Pre-trained Checkpoints or download 🤗here
  • Here the effective batch size is 32 batch_size (per gpu) * 1 nodes * 2 (gpus per node) = 64.
  • blr is the base learning rate. The actual lr is computed by the linear scaling rule: lr = blr * effective batch size / 256.
  • The DfD fine-tuning hyper-parameters slightly differ from the default MAE baseline for ImageNet classification.
  • Fine-tuning/training time is ~1 h for 10 epochs on 2 A6000 GPUs (~6250 MiB memory usage per GPU).
✨ Fine-tuning with different dataset structure
  • --finetune_data_path folder structure should be:

    --finetune_data_path/
     ├── train/
     │   ├── class-1/ (e.g., real)
     │   └── class-2/ (e.g., fake)
     └── val/
         ├── class-1/ (e.g., real)
         └── class-2/ (e.g., fake)

    To fine-tune/train the model on arbitrary combinations of datasets, pass --finetune_data_path multiple times, e.g.:

        CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=2 main_finetune_DfD.py \
        -- (Omit other params...)
        --finetune_data_path ../../datasets/finetune_datasets/finetune_data_path_1 \
        --finetune_data_path ../../datasets/finetune_datasets/finetune_data_path_2 \
        --finetune_data_path ../../datasets/finetune_datasets/finetune_data_path_3
  • To create the dataloader from label-split files (e.g., train.txt, val.txt), replace --finetune_data_path with the following args:

     CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=2 main_finetune_DfD.py \
         --(Omit other params...)
         --finetune_data_path [] \    # do not provide this arg !!!
         --train_split {train.txt} \    # path to the train label file
         --val_split {val.txt} \    # path to the val label file
         --dataset_abs_path None or abs_path \    # see below
         --delimiter_in_spilt ' ' \    # see below
    • --train_split/--val_split provide image_path label pairs.
    • --dataset_abs_path : if --train_split/--val_split provide relative image paths, this is the prefix used to form the full path; if the splits already provide absolute paths, set it to None.
    • --delimiter_in_spilt : the delimiter used to split the image_path and label in --train_split/--val_split; set ' ' for image_path label, ',' for image_path,label, and ', ' for image_path, label (a minimal parsing sketch follows below).
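
For reference, a minimal sketch (illustrative only, not the repo's dataloader) of how such a split file maps to (image_path, label) pairs under these args:

# Hypothetical sketch of reading a label-split file such as train.txt / val.txt,
# where each line is "image_path<delimiter>label", e.g. "real/0001.png 0".
import os

def read_split(split_file, dataset_abs_path=None, delimiter=" "):
    samples = []
    with open(split_file) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            path, label = line.rsplit(delimiter, 1)
            if dataset_abs_path:  # prefix when the split stores relative paths
                path = os.path.join(dataset_abs_path, path)
            samples.append((path, int(label)))
    return samples

# e.g. read_split("train.txt", dataset_abs_path="/abs/path/to/data", delimiter=" ")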
📊 Cross-Datasets Evaluation

cd ./fsfm-3c/finuetune/cross_dataset_DfD/ and run the script main_test_DfD.py to calculate testing results:

CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=1 main_test_DfD.py \
    --eval \
    --apply_simple_augment \
    --model vit_base_patch16 \
    --nb_classes 2 \
    --batch_size 320 \
    --resume 'path to fine-tuned model ckpt $model fine-tuned on FF++_c23$' \
    --output_dir 'path to save test results' # default to ./checkpoint/$USR/experiments_test/from_{FT_folder_name}/$PID$

📜Paper Implementation: the $🖲️script$ and 🖲️test_results for testing cross-dataset DfD.

  • The paths to all test sets are specified in main_test_DfD.py; modify them freely, following the folder structure above (provide the parent path of each test sub-folder in the dict variable cross_dataset_test_path; a hypothetical sketch of this dict follows the next code block).

  • To create a test dataloader from the labels file, append the following args:

    CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=1 main_test_DfD.py \
          --(Omit other params...)
          --test_split 'test.txt' \
          --dataset_abs_path 'absolute path to test data'
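
As mentioned above, the test-path dict in main_test_DfD.py might look like the following hypothetical sketch, which mirrors the default folder structure (verify the actual variable and paths in the script):

# Hypothetical sketch: parent directories of each test/ sub-folder, keyed by dataset name.
cross_dataset_test_path = {
    "CelebDF-v2":   "../../datasets/finetune_datasets/deepfakes_detection/Celeb-DF-v2/32_frames",
    "DFDC":         "../../datasets/finetune_datasets/deepfakes_detection/DFDC/32_frames",
    "DFDCP":        "../../datasets/finetune_datasets/deepfakes_detection/DFDCP/32_frames",
    "WildDeepfake": "../../datasets/finetune_datasets/deepfakes_detection/deepfake_in_the_wild",
}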

⚫ Unseen Diffusion Facial Forgery Detection (DiFF)

To further investigate the adaptability of our method to emerging unknown facial forgeries beyond conventional DeepFakes, we adopt a challenging cross-distribution testing protocol.

⬇️ Dataset Preparation

We train only on the FF++_DeepFakes (c23) subset and test on the DiFF dataset. DiFF contains high-quality face images synthesized by SOTA diffusion models across four subsets: T2I (Text-to-Image), I2I (Image-to-Image), FS (Face Swapping), and FE (Face Editing). This evaluation is more challenging than typical DfD, as both the unseen manipulations and the generative models differ significantly. Download these datasets and refer to DiFF Folder Structure.

  • FaceForensics++ (Deepfakes and original subsets for training/fine-tuning)
  • DiFF (val/test data of all four subsets)
📁 DiFF Folder Structure

The following is the default folder structure for unseen DiFF detection. The paths in each directory are described in the comments.

datasets/
├── data/
│   ├── DiFF/
│   │   ├── DiFF_real/    # download data
│   │   │   ├── train/
│   │   │   ├── val/
│   │   │   └── test/
│   │   ├── val/    # download data (fake)
│   │   │   ├── FE
│   │   │   ├── FS
│   │   │   ├── I2I
│   │   │   └── T2I
│   │   └── test/    # download data (fake)
│   │       ├── FE
│   │       ├── FS
│   │       ├── I2I
│   │       └── T2I
│   │
│   └── FaceForensics/    # FF++
│       ├── dataset/    # download splits
│       │   └── splits/
│       │       ├── train.json
│       │       ├── val.json
│       │       └── test.json
│       ├── original_sequences/    # download data 
│       │   └── youtube/    # videos of real faces in FF++ 
│       │      └── c23/
│       ├── manipulated_sequences/    # download data, videos of fake faces in FF++
│       │   └── DeepFakes/
│       │       └── c23/
│       └── facial_images_split/    # facial images (automatically created by finetune/preprocess/dataset_preprocess.py)
│    
├── finetune/preprocess/
│   ├── config/
│   │   ├── __init__.py
│   │   └── default.py    # define folder structure
│   ├── tools/
│   │   └── util.py     # Frame and Face Extraction functions
│   └── dataset_preprocess.py    # to construct fine-tuning data (including train/val/test/) for DfD and DiFF tasks
│ 
└── finetune_datasets/    # final fine-tuning data 
    ├── deepfakes_detection/  
    │   └── FaceForensics/   # training data for DiFF (automatically created by finetune/preprocess/dataset_preprocess.py)
    │                
    └── diffusion_facial_forgery_detection
        └── DiFF/    # val/testing data for DiFF (automatically created by finetune/preprocess/dataset_preprocess.py)

You can configure the folder structure in /datasets/finetune/preprocess/config/default.py

🗂️ Make Fine-tuning & Testing Datasets

We use DLIB for face detection with a 30% additional cropping margin. Run /datasets/finetune/preprocess/dataset_preprocess.py to make train/val/test datasets for the downstream unseen diffusion facial forgery detection task.

cd /datasets/finetune/preprocess
python dataset_preprocess.py --dataset FF++_each    # extracting faces from videos and making FF++ train/val/test sets for the four manipulation types
# This would yield the DS_FF++_each_cls/ dataset (we only use its DeepFakes subset for the DiFF task), placed in the following folder:
# finetune_datasets/
# └── deepfakes_detection/                           
#     └── FaceForensics
#         └── $num$_frames/    # default $num$ is 32
#             └── DS_FF++_each_cls/
#                 └── $compression$/    # default $compression$ is c23 (there are three versions in FF++/DFD: raw/c40/c23)
#                     ├── DeepFakes/    
#                     │   ├── train/    # for fine-tuning/training 
#                     │   ├── val/
#                     │   └── test/
#                     ├── Face2Face/ 
#                     ├── FaceSwap/
#                     └── NeuralTextures/

python dataset_preprocess.py --dataset DiFF    # extracting faces from DiFF val/test/ sets for four subsets
# This would yield four val/test subsets for our unseen DiFF evaluations, placed in the following folder:
# finetune_datasets/
# └── diffusion_facial_forgery_detection/                           
#     └── DiFF
#         ├── val_subsets/ 
#         │   ├── FE/  ── val/ ── [DiFF_real/, fake/]
#         │   ├── FS/  ── val/ ── [DiFF_real/, fake/]
#         │   ├── I2I/ ── val/ ── [DiFF_real/, fake/]
#         │   └── T2I/ ── val/ ── [DiFF_real/, fake/]
#
#         └── test_subsets/
#             ├── FE/  ── test/ ── [DiFF_real/, fake/]
#             ├── FS/  ── test/ ── [DiFF_real/, fake/]
#             ├── I2I/ ── test/ ── [DiFF_real/, fake/]
#             └── T2I/ ── test/ ── [DiFF_real/, fake/]
⚡ Fine-tuning

cd ./fsfm-3c/finuetune/cross_dataset_unseen_DiFF/ and run the script main_finetune_DiFF.py to fine-tune the model:

CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python -m torch.distributed.launch --node_rank=0 --nproc_per_node=1 main_finetune_DiFF.py \
    --accum_iter 1 \
    --normalize_from_IMN \
    --apply_simple_augment \
    --batch_size 256 \
    --nb_classes 2 \
    --model vit_base_patch16 \
    --epochs 50 \
    --blr 5e-4 \
    --layer_decay 0.65 \
    --weight_decay 0.05 \
    --drop_path 0.1 \
    --reprob 0.25 \
    --mixup 0.8 \
    --cutmix 1.0 \
    --dist_eval \
    --finetune 'path to pre-trained model ckpt $model pre-trained on FF++_o_c23$' \
    --data_path 'data path for fine-tuning $path to FF++_DF_c23$' \
    --val_data_path 'data path for fine-tuning $path to DiFF_val_subsets$' \
    --output_dir 'path to save finetuned model ckpt and logs' # default to ./checkpoint/$USR/experiments_finetune/$PID$

📜Paper Implementation: the $🖲️script$ for fine-tuning, fine-tuned checkpoints, and logs are available at 🤗here.
🧩 Most code in /fsfm-3c/finuetune/cross_dataset_unseen_DiFF/ is inherited from cross_dataset_DfD/ and tailored for this specific DiFF evaluation.
✨ We recommend building on cross_dataset_DfD/ to extend your own work.

  • --finetune: ckpt of the (FSFM) pre-trained ViT model. Get our pre-trained checkpoint from 🤗here; for this task it was pre-trained only on FF++_o (c23, all_frames from the train/val splits), following the statement in our paper.
📊 Cross-Datasets Evaluation

cd ./fsfm-3c/finuetune/cross_dataset_unseen_DiFF/ and run the script main_test_DiFF.py to calculate testing results:

CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=2 python -m torch.distributed.launch --nproc_per_node=1 main_test_DiFF.py \
    --normalize_from_IMN \
    --apply_simple_augment \
    --eval \
    --model vit_base_patch16 \
    --nb_classes 2 \
    --batch_size 320 \
    --resume 'path to fine-tuned model ckpt $model fine-tuned on FF++_DF_c23$' \
    --output_dir 'path to save test results' # default to ./checkpoint/$USR/experiments_test/from_{FT_folder_name}/$PID$

📜Paper Implementation: the $🖲️script$ and 🖲️test_results for testing cross-dataset DiFF.

⚫ Cross-Domain Face Anti-Spoofing (FAS)

To evaluate the transferability of our method for FAS under significant domain shifts, we apply the leave-one-out (LOO) cross-domain evaluation on the widely used benchmark.

⬇️ Dataset Preparation

For the downstream 0-shot cross-domain FAS task, we directly follow Protocol 1 (MCIO) in few_shot_fas to prepare and preprocess the data.

  • Put the prepared data/ directory into our default folder structure, as follows:
    datasets/
    └── finetune_datasets/    # final fine-tuning data 
        └── face_anti_spoofing/  
           └── data/     # the prepared datasets from few_shot_fas 
                ├── MCIO/   # we use this set(Protocol 1)
                │   ├── frame/
                │   │   ├── casia/
                │   │   │    ├── train/ ── [real/, fake/]
                │   │   │    └── test/  ── [real/, fake/]
                │   │   ├── celeb/
                │   │   │    ├── train/ ── [real/, fake/]
                │   │   │    └── test/  ── [real/, fake/]
                │   │   ├── msu/
                │   │   │    ├── train/ ── [real/, fake/]
                │   │   │    └── test/  ── [real/, fake/]
                │   │   ├── oulu/
                │   │   │    ├── train/ ── [real/, fake/]
                │   │   │    └── test/  ── [real/, fake/]
                │   │   └── replay/
                │   │        ├── train/ ── [real/, fake/]
                │   │        └── test/  ── [real/, fake/]
                │   └── txt/
                │       └── [casia_fake_shot.txt, casia_fake_test.txt, ...]
                │
                └── WCS/
⚡ Fine-tuning and Evaluation

cd ./fsfm-3c/finuetune/cross_domain_FAS/ and run the script train_vit.py to fine-tune and evaluate the model:

python train_vit.py \
    --pt_model 'path to pre-trained model ckpt $model pre-trained on VF2$' \
    --op_dir 'path to save finetuned model ckpt and logs'  \
    --report_logger_path 'path to save performance.csv of evaluation' \
    --config M  # choose from [M, C, I, O] for Protocol 1 

📜Paper Implementation: the $🖲️script$ for fine-tuning, fine-tuned checkpoint, logs, and evaluations are available at 🤗here
🧩 Code for this downstream task is built on few_shot_fas; you can freely try more experiments with other protocols or scenarios.

  • --pt_model: ckpt of the (FSFM) pre-trained ViT model. Get our pre-trained checkpoints from Pre-trained Checkpoints or download 🤗here
  • The data and label paths are specified in the sample_frames function of cross_domain_FAS/utils/utils.py

Citation

If our research helps your work, please consider giving us a star ⭐ or citing us:

@article{wang2024fsfm,
  title={FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning},
  author={Wang, Gaojian and Lin, Feng and Wu, Tong and Liu, Zhenguang and Ba, Zhongjie and Ren, Kui},
  journal={arXiv preprint arXiv:2412.12032},
  year={2024}
}
