Conceptually, the principle says that while both clean training and adversarial training discover features fundamentally different from their initial values, clean training already discovers a large portion of the "robust features" obtained by adversarial training. Thus, instead of needing to learn new "robust features" or completely remove "non-robust features", adversarial training can robustify the neural network simply by purifying the clean-trained features.
In this work, we present both experiments demonstrating the principle and a mathematical model formally proving it. In particular, we show that for certain binary classification tasks, when we train a two-layer ReLU network using SGD: (1) both clean training and adversarial training learn features with nearly zero correlation with the (random) initialization; (2) after clean training, the network provably achieves > 99% clean accuracy but low robust accuracy; and (3) adversarial training can provably robustify the network, using polynomially many samples and polynomial running time, simply by "purifying" the clean-trained features. Our lower bound that the clean-trained network has
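The clean-versus-adversarial training setup described above can be illustrated with a small numerical sketch. The following is a hypothetical toy example, not the paper's actual data model or proof construction: a two-layer ReLU network trained by full-batch gradient descent on a synthetic linearly separable binary task, where "adversarial training" uses an FGSM-style signed-gradient perturbation of the inputs. All names and hyperparameters (`w_star`, `eps`, the hinge loss, the network width) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable binary task: labels are the sign of a
# linear "teacher" w_star. This is a stand-in for the paper's data model,
# which is not specified in this excerpt.
d, n, m = 20, 1000, 50
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)

def forward(W, a, X):
    """Two-layer ReLU network: f(x) = sum_j a_j * relu(w_j . x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def loss_grads(W, a, X, y):
    """Gradients of the averaged hinge loss max(0, 1 - y * f(x))."""
    h = np.maximum(X @ W.T, 0.0)          # hidden ReLU activations, (n, m)
    f = h @ a                             # network outputs, (n,)
    gf = -(y * ((1.0 - y * f) > 0)) / len(y)   # dL/df per example
    gh = gf[:, None] * a[None, :] * (h > 0)    # back-prop through a and ReLU
    return gh.T @ X, h.T @ gf             # (dL/dW, dL/da)

def fgsm(W, a, X, y, eps):
    """One signed-gradient step on the inputs (FGSM-style perturbation)."""
    h = np.maximum(X @ W.T, 0.0)
    f = h @ a
    gf = -(y * ((1.0 - y * f) > 0))
    gX = (gf[:, None] * a[None, :] * (h > 0)) @ W   # dL/dX, shape (n, d)
    return X + eps * np.sign(gX)

def train(X, y, adversarial, eps=0.2, steps=500, lr=0.5):
    """Train from random init, optionally on FGSM-perturbed inputs."""
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / np.sqrt(m)
    for _ in range(steps):
        Xb = fgsm(W, a, X, y, eps) if adversarial else X
        gW, ga = loss_grads(W, a, Xb, y)
        W -= lr * gW
        a -= lr * ga
    return W, a

def accuracy(W, a, X, y):
    return float(np.mean(np.sign(forward(W, a, X)) == y))

eps = 0.2
W_c, a_c = train(X, y, adversarial=False)          # clean training
W_r, a_r = train(X, y, adversarial=True, eps=eps)  # adversarial training
clean_acc = accuracy(W_c, a_c, X, y)
robust_acc_clean_net = accuracy(W_c, a_c, fgsm(W_c, a_c, X, y, eps), y)
robust_acc_adv_net = accuracy(W_r, a_r, fgsm(W_r, a_r, X, y, eps), y)
print(clean_acc, robust_acc_clean_net, robust_acc_adv_net)
```

On toy data like this, the clean-trained network typically reaches high clean accuracy while remaining more vulnerable to the input perturbation than the adversarially trained one; the sketch only illustrates the experimental contrast, not the paper's quantitative guarantees.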
Our work also sheds light on why "adversarial examples are not bugs": they arise not because neural networks overfit the data set due to high model complexity and insufficiently many training samples, but because they are an intrinsic property of gradient-descent-type training algorithms.