Feature purification: How adversarial training can perform robust deep learning

Why can deep learning models, trained on many machine learning tasks, obtain nearly perfect predictions on unseen data sampled from the same distribution, yet be extremely vulnerable to small perturbations of the input? How can adversarial training improve the robustness of neural networks against such perturbations? In this work, we develop a new principle called "feature purification". Mathematically, the principle states the following: for a multi-layer ReLU neural network, let w_i^0 be the weight of the i-th neuron at initialization, w_i its weight after clean training (starting from w_i^0), and w_i' its weight after adversarial training (starting from w_i); then, for most neurons, there are scaling factors \beta_i and vectors v_i such that w_i = \beta_i w_i' + v_i with ||\beta_i w_i'|| > ||v_i||, while both w_i and w_i' have nearly zero correlation with w_i^0. Conceptually, the principle says that while both clean training and adversarial training discover features fundamentally different from their initial values, clean training already discovers a large portion of the "robust features" obtained by adversarial training. Thus, instead of needing to learn new "robust features" or completely remove "non-robust features", adversarial training can robustify the neural network simply by purifying the clean-trained features. We present both experiments demonstrating the principle and a mathematical model in which we formally prove it. In particular, we show that for certain binary classification tasks, when a two-layer ReLU network is trained using SGD: (1) both clean training and adversarial training learn features with nearly zero correlation with the (random) initialization; (2) after clean training, the network provably achieves > 99% clean accuracy but at most 1% robust accuracy; (3) after adversarial training with algorithms such as PGD, the network provably achieves > 99% robust accuracy, using polynomially many samples and polynomial running time, simply by "purifying" the clean-trained features. Our lower bound, showing that the clean-trained network has at most 1% robust accuracy, holds even when infinitely many clean training examples are used and a weight-decay regularizer is added during training. Our work also sheds light on why "adversarial examples are not bugs": they arise not because neural networks overfit the data set due to the high complexity of the models and insufficiently many training samples, but rather as an intrinsic property of gradient-descent-type training algorithms.
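
As an informal illustration of the decomposition above, the sketch below clean-trains a small two-layer ReLU network on toy data, continues with PGD adversarial training from the clean-trained weights, and then measures per neuron the projection coefficient \beta_i, the residual v_i, and the correlation with the initialization w_i^0. This is only a sketch: the data distribution, network width, attack radius, training schedule, and helper functions (make_net, pgd_attack, train) are placeholders invented for the example, not the construction or parameters analyzed in this work.

    # Informal sketch (PyTorch): all data, dimensions, and hyperparameters below
    # are illustrative placeholders, not the construction analyzed in the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    d, m, n = 100, 200, 5000                 # input dim, hidden width, samples (assumed)
    eps, pgd_steps, pgd_lr = 0.1, 5, 0.05    # l_inf budget and PGD schedule (assumed)

    # Toy binary classification data with +/-1 labels (placeholder distribution).
    X = torch.randn(n, d)
    y = torch.sign(X[:, 0] + 0.5 * X[:, 1]).float()

    def make_net():
        return nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))

    def pgd_attack(net, x, y, eps, steps, lr):
        # l_inf PGD: ascend the loss within an eps-ball around x.
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(steps):
            loss = F.soft_margin_loss(net(x + delta).squeeze(-1), y)
            loss.backward()
            with torch.no_grad():
                delta += lr * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
        return (x + delta).detach()

    def train(net, adversarial=False, epochs=20):
        opt = torch.optim.SGD(net.parameters(), lr=0.05)
        for _ in range(epochs):
            x = pgd_attack(net, X, y, eps, pgd_steps, pgd_lr) if adversarial else X
            opt.zero_grad()
            F.soft_margin_loss(net(x).squeeze(-1), y).backward()
            opt.step()

    net = make_net()
    W0 = net[0].weight.detach().clone()        # w_i^0: first-layer weights at initialization
    train(net, adversarial=False)              # clean training, starting from w_i^0
    W_clean = net[0].weight.detach().clone()   # w_i: weights after clean training
    train(net, adversarial=True)               # adversarial training, starting from w_i
    W_robust = net[0].weight.detach().clone()  # w_i': weights after adversarial training

    # Decomposition w_i = beta_i * w_i' + v_i, with beta_i chosen to minimize ||v_i||
    # (orthogonal projection of w_i onto the direction of w_i').
    beta = (W_clean * W_robust).sum(dim=1) / W_robust.norm(dim=1).pow(2)
    V = W_clean - beta[:, None] * W_robust
    print("median ||beta_i w_i'|| / ||v_i||:",
          (beta.abs() * W_robust.norm(dim=1) / V.norm(dim=1)).median().item())
    print("median |cos(w_i,  w_i^0)|:",
          F.cosine_similarity(W_clean, W0, dim=1).abs().median().item())
    print("median |cos(w_i', w_i^0)|:",
          F.cosine_similarity(W_robust, W0, dim=1).abs().median().item())

Under the principle, one would expect the reported ratio to be well above 1 for most neurons, while both cosine similarities with the initialization stay close to zero.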

Date

Speakers

Yuanzhi Li

Affiliation

Carnegie Mellon University