
The Foundation series has three parts, collecting reading notes on three papers: "Evasion attacks against machine learning at test time", "Intriguing properties of neural networks", and "Explaining and harnessing adversarial examples". This post covers "Explaining and harnessing adversarial examples".


Several machine learning models, including neural networks, consistently misclassify adversarial examples—inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks’ vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.

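In short, the paper analyzes the "blind spots" of neural networks reported in "Intriguing properties of neural networks" and argues that the phenomenon is caused not by complex nonlinearity or overfitting, but by the "linear nature" of neural networks.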

Related Work

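a) Box-constrained L-BFGS can reliably find adversarial examples;
b) On some datasets, the difference between an adversarial example and the original sample cannot be distinguished by the human eye;
c) Classifiers with different architectures, or trained on different samples, often misclassify the same adversarial example;
d) Shallow softmax regression models are also vulnerable to adversarial examples;
e) Training on adversarial examples can improve generalization, but repeatedly solving a constrained optimization problem in the inner loop is computationally expensive.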

Linear Explanation of Adversarial Examples

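For an adversarial input $\tilde{x}=x+\eta$, the dot product with a weight vector $W$ is

$$W^T\tilde{x}=W^Tx+W^T\eta.$$

Here $\eta$ is the perturbation added to the original sample, subject to $\Vert\eta\Vert_\infty<\epsilon$. If $W$ is an $n$-dimensional vector whose elements have average magnitude $m$, the activation can grow by as much as $\epsilon mn$. So when $\eta$ agrees in sign with $W$ in every coordinate, even a perturbation that is tiny in each coordinate can change the network's output substantially once $n$ is large.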
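The growth with dimension can be checked numerically. The sketch below is illustrative only: the weights are random draws, and `activation_shift` is a hypothetical helper name, not something from the paper. It applies a perturbation of magnitude $\epsilon$ in every coordinate, aligned in sign with the weights, and prints the resulting shift next to $\epsilon mn$.

```python
import random

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def activation_shift(w, eps):
    """Change in the dot product w.x when x is perturbed by eta = eps * sign(w)."""
    eta = [eps * sign(wi) for wi in w]
    return sum(wi * ei for wi, ei in zip(w, eta))

random.seed(0)
eps = 0.01
shifts = {}
for n in (10, 1000, 100000):
    w = [random.gauss(0.0, 1.0) for _ in range(n)]   # assumed random weights
    m = sum(abs(wi) for wi in w) / n                 # mean magnitude per dimension
    shifts[n] = activation_shift(w, eps)
    # No coordinate moves by more than eps, yet the shift scales like eps * m * n.
    print(n, round(shifts[n], 4), round(eps * m * n, 4))
```

The per-coordinate change stays fixed at $\epsilon$ while the shift in the activation grows linearly with $n$, which is the point of the linear explanation.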

Fast Gradient Sign Method (FGSM)

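$$\eta=\epsilon\,\mathrm{sign}(\nabla_xJ(\theta,x,y)).$$

Here $\theta$ denotes the model parameters, $x$ the input, $y$ the label, and $J(\theta,x,y)$ the training cost. Note: in a neural network the gradient can be computed by backpropagation, so this method generates adversarial examples quickly.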
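A minimal sketch of the FGSM step, under loud assumptions: the tiny two-unit tanh network and its weights (`W1`, `w2`) are made up for illustration, and a central-difference gradient (`numerical_grad`, a hypothetical helper) stands in for backpropagation so that the example needs nothing beyond the standard library.

```python
import math

def numerical_grad(loss, x, h=1e-6):
    """Central-difference estimate of d loss / d x (a stand-in for backprop)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((loss(up) - loss(down)) / (2 * h))
    return grad

def fgsm(loss, x, eps):
    """Return x + eps * sign(grad_x loss(x))."""
    g = numerical_grad(loss, x)
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]

# Hypothetical toy model: 3 inputs -> 2 tanh units -> 1 logit.
W1 = [[0.5, -1.0, 0.3], [-0.7, 0.2, 0.9]]
w2 = [1.2, -0.8]

def loss(x, y=1):
    """Softplus loss log(1 + exp(-y * z)) for a label y in {-1, +1}."""
    hidden = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W1]
    z = sum(a * b for a, b in zip(w2, hidden))
    return math.log1p(math.exp(-y * z))

x = [0.2, -0.4, 0.1]
x_adv = fgsm(loss, x, eps=0.1)
print(loss(x), loss(x_adv))   # the loss at x_adv is higher than at x
```

A single gradient evaluation is all the method needs, which is why it is so much cheaper than an iterative search such as L-BFGS.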


Adversarial Training of Linear Models

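Take logistic regression as an example, with labels $y\in\{-1,1\}$ and $P(y=1)=\sigma(w^Tx+b)$. Training minimizes

$$\mathbb{E}_{x,y\sim p_{\mathrm{data}}}\,\zeta(-y(w^Tx+b)),$$

where $\zeta(z)=\log(1+\exp(z))$ is the softplus function.
Since the sign of this objective's gradient with respect to $x$ is $-y\,\mathrm{sign}(w)$ (that is, $-\mathrm{sign}(w)$ for $y=1$) and $w^T\mathrm{sign}(w)=\Vert w\Vert_1$, substituting the FGSM perturbation gives:

$$\mathbb{E}_{x,y\sim p_{\mathrm{data}}}\,\zeta(\epsilon\Vert w\Vert_1-y(w^Tx+b)).$$

Note: this can be viewed as subtracting an L1 penalty from the activation before it enters the loss. It is not the same as L1 weight decay, where the penalty is added to the training cost.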
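The FGSM substitution for logistic regression can be checked numerically. The sketch below assumes labels in $\{-1,+1\}$ and uses made-up parameters; `fgsm_input` and `closed_form` are hypothetical helper names. It evaluates the loss at the explicitly perturbed input and compares it with $\zeta(\epsilon\Vert w\Vert_1-y(w^Tx+b))$, which needs no perturbed input at all.

```python
import math

def softplus(z):
    """zeta(z) = log(1 + exp(z))."""
    return math.log1p(math.exp(z))

def loss(w, b, x, y):
    """Logistic-regression loss zeta(-y (w.x + b)) for a label y in {-1, +1}."""
    return softplus(-y * (sum(wi * xi for wi, xi in zip(w, x)) + b))

def fgsm_input(w, x, y, eps):
    """x + eps * sign(grad_x loss); for this model the gradient sign is -y * sign(w)."""
    return [xi - eps * y * ((wi > 0) - (wi < 0)) for wi, xi in zip(w, x)]

def closed_form(w, b, x, y, eps):
    """zeta(eps * ||w||_1 - y (w.x + b)): the loss at the FGSM input, written in closed form."""
    l1 = sum(abs(wi) for wi in w)
    return softplus(eps * l1 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))

w, b, eps = [0.8, -1.5, 0.3], 0.1, 0.25   # hypothetical parameters
x = [1.0, 0.5, -2.0]
for y in (1, -1):
    direct = loss(w, b, fgsm_input(w, x, y, eps), y)
    print(y, direct, closed_form(w, b, x, y, eps))   # the two values agree
```

The $\epsilon\Vert w\Vert_1$ term sits inside the softplus, so it fades once the model classifies a point with a large margin; an L1 weight-decay term added to the cost would not.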

Adversarial Training of Deep Networks

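First, one point should be made clear: adversarial training only makes sense when the model has the capacity to learn to resist adversarial examples (this sounds trivial, but it is where the Universal Approximation Theorem comes in).

The training objective mixes the cost on the clean input with the cost on its FGSM counterpart:

$$\tilde{J}(\theta,x,y)=\alpha J(\theta,x,y)+(1-\alpha)J(\theta,x+\epsilon\,\mathrm{sign}(\nabla_xJ(\theta,x,y)),y)$$

Note 1: the formula as printed in the paper seems to be missing a $y$ inside the final parentheses.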
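A minimal sketch of training on the mixed objective $\tilde{J}$, with several assumptions stated up front: the model is plain logistic regression rather than a deep network, the data are two synthetic Gaussian blobs, the values `eps=0.25`, `alpha=0.5`, `lr=0.1` and `steps=200` are arbitrary, and each adversarial example is regenerated from the current parameters and then treated as a fixed input when taking the gradient.

```python
import math
import random

def sgn(v):
    return (v > 0) - (v < 0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(w, b, x, y):
    """J(theta, x, y) = log(1 + exp(-y (w.x + b))) with y in {-1, +1}."""
    return math.log1p(math.exp(-y * (dot(w, x) + b)))

def dloss_dz(w, b, x, y):
    """Derivative of J with respect to the pre-activation z = w.x + b."""
    return -y / (1.0 + math.exp(y * (dot(w, x) + b)))

def fgsm(w, b, x, y, eps):
    """x + eps * sign(grad_x J), where grad_x J = dJ/dz * w for this model."""
    s = dloss_dz(w, b, x, y)
    return [xi + eps * sgn(s * wi) for xi, wi in zip(x, w)]

def mixed_loss(w, b, data, eps, alpha):
    """Mean of alpha * J(x) + (1 - alpha) * J(x_adv) over the data set."""
    total = 0.0
    for x, y in data:
        x_adv = fgsm(w, b, x, y, eps)
        total += alpha * loss(w, b, x, y) + (1 - alpha) * loss(w, b, x_adv, y)
    return total / len(data)

def train(data, eps=0.25, alpha=0.5, lr=0.1, steps=200):
    """Full-batch gradient descent on the mixed objective."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, y in data:
            # Clean input and its FGSM counterpart, weighted by alpha and 1 - alpha.
            for weight, inp in ((alpha, x), (1 - alpha, fgsm(w, b, x, y, eps))):
                s = dloss_dz(w, b, inp, y)
                gw = [g + weight * s * v for g, v in zip(gw, inp)]
                gb += weight * s
        w = [wi - lr * g / len(data) for wi, g in zip(w, gw)]
        b -= lr * gb / len(data)
    return w, b

random.seed(0)
# Synthetic data: class +1 centered at (1, 1), class -1 centered at (-1, -1).
data = [([random.gauss(y, 0.5), random.gauss(y, 0.5)], y)
        for y in (1, -1) for _ in range(50)]
w, b = train(data)
print(mixed_loss([0.0, 0.0], 0.0, data, 0.25, 0.5), mixed_loss(w, b, data, 0.25, 0.5))
```

For a deep network the same loop applies with the gradients supplied by backpropagation; regenerating the adversarial examples at every step is what keeps them matched to the current model.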

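Adversarial training can also be interpreted in two ways:
a) Adding noise drawn from the uniform distribution $U(-\epsilon,\epsilon)$ to the inputs and minimizing an upper bound on the expected cost over those noisy samples;
b) Active learning, except that the "expert" is no longer a human but a "heuristic labeler" that copies the labels of nearby existing samples onto the new samples.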
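The upper-bound reading in a) can be illustrated on a linear model, where the FGSM point is the exact worst case inside the $\epsilon$-box. The sketch below uses made-up parameters and hypothetical helper names (`noisy_loss`, `worst_case_loss`); it estimates the expected loss under $U(-\epsilon,\epsilon)$ noise by Monte Carlo and prints it next to the worst-case loss.

```python
import math
import random

def loss(w, b, x, y):
    """Logistic-regression loss log(1 + exp(-y (w.x + b))), y in {-1, +1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return math.log1p(math.exp(-y * z))

def worst_case_loss(w, b, x, y, eps):
    """Loss at the FGSM point x - eps * y * sign(w), the maximum over the eps-box for a linear model."""
    x_adv = [xi - eps * y * ((wi > 0) - (wi < 0)) for wi, xi in zip(w, x)]
    return loss(w, b, x_adv, y)

def noisy_loss(w, b, x, y, eps, samples=2000, seed=0):
    """Monte Carlo estimate of E[loss(x + u)] with u ~ U(-eps, eps) per coordinate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        total += loss(w, b, [xi + rng.uniform(-eps, eps) for xi in x], y)
    return total / samples

w, b = [0.8, -1.5, 0.3], 0.1              # hypothetical parameters
x, y, eps = [1.0, 0.5, -2.0], 1, 0.25
clean = loss(w, b, x, y)
noisy = noisy_loss(w, b, x, y, eps)
worst = worst_case_loss(w, b, x, y, eps)
print(clean, noisy, worst)                # noisy never exceeds worst
```

Every noisy sample lies inside the same $\epsilon$-box, so its loss cannot exceed the worst-case value; minimizing the worst case therefore also pushes down the expected loss under uniform noise.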


Model Capacity

