This is the experiment code for our NDSS 2024 paper "TextGuard: Provable Defense against Backdoor Attacks on Text Classification".
torch
transformers==4.21.2
fastNLP==0.6.0
openbackdoor (commit id: d600dbec32b97a246b77c4c4d700ab2e01200151)
First follow the OpenBackdoor repo's instructions to download the datasets, then create a soft link to them in this repo:
ln -s ../OpenBackdoor/datasets/ .
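If you prefer to set up the link from Python, the following is a minimal sketch of the same step. The helper name link_datasets is hypothetical and not part of this repo, and it links to an absolute path rather than the relative one used by the ln command above.

```python
import os


def link_datasets(source_dir: str, link_path: str = "datasets") -> str:
    """Create link_path as a symlink to source_dir, unless link_path already exists.

    Hypothetical helper for illustration; returns the resolved path of the link.
    """
    if not os.path.isdir(source_dir):
        raise FileNotFoundError(f"dataset folder not found: {source_dir}")
    # Leave an existing link or folder untouched instead of overwriting it.
    if os.path.islink(link_path) or os.path.exists(link_path):
        return os.path.realpath(link_path)
    os.symlink(os.path.abspath(source_dir), link_path, target_is_directory=True)
    return os.path.realpath(link_path)
```

Calling link_datasets("../OpenBackdoor/datasets") from the root of this repo should leave a ./datasets entry pointing at the OpenBackdoor datasets folder.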
Our generated backdoor data can also be found here. Download it and unzip it into the ./poison/ folder.
Our training code is train_cls.py. We first describe some key args:
--setting: backdoor attack setting; one of mix, clean or dirty.
--attack: the backdoor attack method, or noise for the certified evaluation (--attack=noise).
--poison_rate: poisoning rate p.
--group: number of groups.
--hash: the hash function to use. A value starting with ki (e.g. --hash=ki) enables the empirical defense technique "Potential trigger word identification" from the paper. When this technique is not used, the value can be md5, sha1 or sha256.
--ki_t: the parameter K used in the empirical defense technique Potential trigger word identification.
--sort: used only in the certified evaluation, not in the empirical evaluation.
--not_split: enables the empirical defense technique "Semantic preserving" from the paper.
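To give a rough idea of how --group and --hash can interact, here is a minimal sketch of hash-based word grouping. It is not taken from train_cls.py: the function names are hypothetical, and it assumes a word's group is its hash digest modulo the number of groups, which may differ from the actual implementation.

```python
import hashlib

SUPPORTED_HASHES = {"md5", "sha1", "sha256"}


def word_to_group(word: str, num_groups: int, hash_name: str = "md5") -> int:
    """Map a word to a group index in [0, num_groups).

    Assumption: group = int(hex digest of the word) mod num_groups.
    """
    if hash_name not in SUPPORTED_HASHES:
        raise ValueError(f"unsupported hash: {hash_name}")
    digest = hashlib.new(hash_name, word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_groups


def split_into_groups(words, num_groups: int, hash_name: str = "md5"):
    """Distribute words over num_groups lists; each word lands in exactly one list."""
    groups = [[] for _ in range(num_groups)]
    for word in words:
        groups[word_to_group(word, num_groups, hash_name)].append(word)
    return groups


if __name__ == "__main__":
    sentence = "this movie was surprisingly good".split()
    for index, group in enumerate(split_into_groups(sentence, num_groups=3)):
        print(index, group)
```

Because the mapping is deterministic, the same word always falls into the same group, no matter which sentence it appears in.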
We use the parameter --attack noise to denote the certified evaluation setting.
Here are example commands that calculate certified accuracy using 3 groups under the mixed-label attack setting (p=0.1):
python train_cls.py --save_folder <exp_name> --attack noise --group 3 --target_word empty --setting mix --poison_rate 0.1 --sort --tokenize nltk
python train_cls.py --save_folder <exp_name> --attack noise --group 3 --target_word empty --data hsol --setting mix --poison_rate 0.1 --sort --tokenize nltk
python train_cls.py --save_folder <exp_name> --attack noise --group 3 --target_word empty --data agnews --num_class 4 --batchsize 32 --setting mix --poison_rate 0.1 --sort --tokenize nltk
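The three commands above differ only in their dataset-specific flags, so they can be generated from a small table. The sketch below does this; the helper name and the "default" key are hypothetical, and only flags that appear in the commands above are used.

```python
import shlex

# Extra flags per dataset, copied from the example commands above.
# "default" is a hypothetical key for the first command, which passes no --data flag.
DATASET_FLAGS = {
    "default": [],
    "hsol": ["--data", "hsol"],
    "agnews": ["--data", "agnews", "--num_class", "4", "--batchsize", "32"],
}


def build_certified_command(exp_name, dataset="default", group=3,
                            poison_rate=0.1, setting="mix"):
    """Return the argv list for a certified-evaluation run (--attack noise)."""
    return [
        "python", "train_cls.py",
        "--save_folder", exp_name,
        "--attack", "noise",
        "--group", str(group),
        "--target_word", "empty",
        *DATASET_FLAGS[dataset],
        "--setting", setting,
        "--poison_rate", str(poison_rate),
        "--sort",
        "--tokenize", "nltk",
    ]


if __name__ == "__main__":
    for name in DATASET_FLAGS:
        # Print each command; pass the list to subprocess.run(..., check=True) to execute it.
        print(shlex.join(build_certified_command(f"certified_{name}", name)))
```

Returning a list rather than a single string lets the command be passed to subprocess.run without shell quoting issues.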
When the parameter --attack is set to badnets, addsent, synbkd or stylebkd, we evaluate our methods under the empirical attack setting.
Here are example commands for empirical evaluations under the mixed-label BadWord attack setting (p=0.1):
python train_cls.py --save_folder <exp_name> --attack badnets --group 9 --setting mix --poison_rate 0.1 --tokenize nltk --not_split --hash ki --target_word empty --ki_t 20
python train_cls.py --save_folder <exp_name> --attack badnets --group 7 --setting mix --poison_rate 0.1 --tokenize nltk --not_split --hash ki --target_word empty --data hsol --ki_t 20
python train_cls.py --save_folder <exp_name> --attack badnets --group 9 --setting mix --poison_rate 0.1 --tokenize nltk --not_split --hash ki --target_word empty --data agnews --num_class 4 --batchsize 32