Pytorch_Part6_Regularization

VisualPytorch is now available at the following domain, served from two servers:
http://nag.visualpytorch.top/static/ (corresponding to 114.115.148.27)
http://visualpytorch.top/static/ (corresponding to 39.97.209.22)

1、Regularization: weight_decay

1. Regularization: a strategy for reducing variance

Generalization error can be decomposed into the sum of bias, variance and noise, i.e. Error = Bias + Variance + Noise.
Bias measures how far the learner's expected prediction deviates from the true result, i.e. it characterizes the fitting ability of the learning algorithm itself.
Variance measures how the performance changes when a training set of the same size is varied, i.e. it characterizes the effect of data perturbation.
Noise expresses the lower bound on the expected generalization error that any learning algorithm can achieve on the current task.

To be precise, variance refers to the variance of the model's predictions across different datasets, not to the gap between training loss and test loss.

2. Loss function: measures the discrepancy between the model output and the ground-truth labels

Loss Function: \(Loss = f(\hat y , y)\)
Cost Function: \(Cost = \frac{1}{N}\sum_i f(\hat y_i, y_i)\)
Objective Function: \(Obj = Cost + Regularization\)

L1 Regularization: \(\sum_i |w_i|\). Because the optimum frequently lies on a coordinate axis (a vertex of the constraint region), L1 tends to yield sparse parameters.

L2 Regularization: \(\sum_i w_i^2\). The update becomes \(w_{i+1} = w_i - \frac{\partial Obj}{\partial w_i} = w_i - (\frac{\partial Loss}{\partial w_i}+\lambda w_i) = w_i(1-\lambda) - \frac{\partial Loss}{\partial w_i}\) (learning rate omitted for brevity), which is why L2 regularization is commonly called weight decay.
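
In PyTorch, L2 is handled by the optimizer's weight_decay argument (it adds \(\lambda w\) to the gradient), while an L1 penalty has to be added to the loss by hand. A minimal sketch, with illustrative values for the coefficients:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
# weight_decay=1e-2 is the L2 coefficient; it is applied inside the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-2)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = criterion(model(x), y)

# L1: add the penalty to the loss manually before backward()
l1_lambda = 1e-3                                             # illustrative value
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(loss + l1_lambda * l1_penalty).backward()
optimizer.step()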

3. Example: a simple three-layer MLP

Build two models side by side, one with weight_decay and one without: as training proceeds, the loss of the model without the regularization term drives toward 0, i.e. it overfits the training data.

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

# hyperparameters (illustrative values)
n_hidden = 200
lr_init = 0.01
max_iter = 2000

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):

    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())

    return train_x, train_y, test_x, test_y


train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))


# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)
# the second optimizer includes weight_decay

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================

writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    ...

Inspecting the parameters in TensorBoard, it is clear that the weights of the model trained with L2 regularization are much more concentrated around zero:
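
The logging code is elided in the loop above ("..."). One way to record the weight distributions with the SummaryWriter already created is add_histogram inside the training loop; the tag names and logging interval below are illustrative:

if (epoch + 1) % 200 == 0:
    for name, param in net_normal.named_parameters():
        writer.add_histogram(name + '_normal', param, epoch)
    for name, param in net_weight_decay.named_parameters():
        writer.add_histogram(name + '_weight_decay', param, epoch)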

2、Regularization: Dropout

1. Randomness: dropout probability

Deactivation: weight = 0

This means that every neuron in the layer is dropped with probability prob, not that exactly a fraction prob of the neurons is dropped.
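
A tiny sketch of this behaviour with nn.Dropout (the values are illustrative); note that PyTorch also rescales the surviving activations, which is discussed below:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.3)
x = torch.ones(10)
drop.train()
print(drop(x))   # each element is zeroed independently with probability 0.3;
                 # how many are zeroed varies from call to call, and the
                 # survivors are scaled by 1/(1 - 0.3)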

2. Dropout brings the following three effects:

  • Reduced reliance on any particular feature
  • Weight values become more evenly distributed
  • The scale of the data (activation magnitude) decreases

Suppose prob = 0.3 and dropout is disabled at test time. To compensate for this change of scale, the retained activations are divided by \((1-p)\) during training (this is what nn.Dropout does):
Test: \(100 = \sum_{100} W_x\)
Train: \(70 = \sum_{70} W_x \Longrightarrow 100 = \sum_{70} W_x/(1-p)\)

Therefore, passing the same input through the network layer in the two modes gives approximately the same result:
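
The Net class and the input x are not shown in the original snippet; a minimal definition consistent with the printed outputs (10000 inputs equal to 1 feeding a single linear unit whose weights are set to 1) might look like this:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1, bias=False),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.linears(x)

input_num = 10000
x = torch.ones((input_num,), dtype=torch.float32)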

net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)

net.train()  # training mode: dropout is applied
y = net(x)
print("output in training mode", y)

net.eval()   # evaluation mode: dropout is disabled
y = net(x)
print("output in eval mode", y)

output in training mode tensor([9942.], grad_fn=<ReluBackward1>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward1>)

3. Again taking the regression task as an example:

Build two models side by side, one with dropout and one without: as training proceeds, the model with dropout produces a visibly smoother fit.

class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(

            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)

Looking at the weight distributions of the linear layers, the parameters of the model with dropout are clearly more concentrated, with a higher peak.

3、Batch Normalization

1. Batch Normalization: normalization over a batch

Batch: a batch of data, usually a mini-batch
Normalization: zero mean, unit variance

Advantages:

  1. Allows a larger learning rate, speeding up convergence
  2. Removes the need for carefully designed weight initialization
  3. Allows dropping dropout, or using a smaller dropout rate
  4. Allows dropping L2, or using a smaller weight decay
  5. Removes the need for LRN (local response normalization)
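
For reference, the BN transform from the paper cited below, computed over a mini-batch \(\mathcal{B}=\{x_1,\dots,x_m\}\):

\(\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^m x_i, \quad \sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_{\mathcal{B}})^2, \quad \hat x_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \quad y_i = \gamma \hat x_i + \beta\)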

Note that \(\gamma, \beta\) are learnable parameters; if a layer does not actually need BN, it can learn \(\gamma = \sigma_{\mathcal{B}}, \beta = \mu_{\mathcal{B}}\), which recovers the identity transform.

For details, see the reading notes and implementation of "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

2. Internal Covariate Shift (ICS)

BN prevents the vanishing or exploding gradients caused by uneven data scales/distributions across layers, which would otherwise make training difficult.

The other normalization methods covered in Section 4 are likewise designed to avoid ICS.

3. _BatchNorm

In PyTorch, nn.BatchNorm1d, nn.BatchNorm2d and nn.BatchNorm3d all inherit from _BatchNorm and share the following parameters:

__init__(self, num_features,  	# number of features per sample (the key argument)
                eps=1e-5, 		# term added to the denominator for numerical stability
                momentum=0.1, 	# factor of the exponential weighted average for running mean/var
                affine=True,	# whether to apply the learnable affine transform
                track_running_stats=True)	# whether to track running statistics for use at evaluation time

Main attributes of a BatchNorm layer:

  • running_mean: running estimate of the mean
  • running_var: running estimate of the variance
  • weight: gamma of the affine transform
  • bias: beta of the affine transform

Training: the mean and variance are updated with an exponential weighted average:

running_mean = (1 - momentum) * running_mean + momentum * mean_t

running_var = (1 - momentum) * running_var + momentum * var_t

Evaluation: the stored running statistics are used directly.
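
A short sketch that checks the update rule above after a single forward pass in training mode (default momentum=0.1):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3)   # running_mean starts at 0, running_var at 1
x = torch.randn(16, 3)

bn.train()
bn(x)
# running_mean = (1 - 0.1) * 0 + 0.1 * batch mean
print(torch.allclose(bn.running_mean, 0.1 * x.mean(dim=0)))   # True

bn.eval()                             # from now on the running statistics are used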

In the 1D, 2D and 3D cases, num_features refers to the number of features, feature maps (channels) and feature volumes respectively. BN computes one mean and one variance per feature, taken over all samples in the batch; in the course's illustration the three cases have 5, 3 and 3 features, so the corresponding \(\gamma, \beta\) have 5, 3 and 3 entries.
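
A small sketch of the expected input shapes (the concrete sizes here are illustrative):

import torch
import torch.nn as nn

# 1D: input (batch, num_features, length)
bn1d = nn.BatchNorm1d(num_features=5)
y1 = bn1d(torch.randn(3, 5, 7))

# 2D: input (batch, num_features, height, width)
bn2d = nn.BatchNorm2d(num_features=3)
y2 = bn2d(torch.randn(4, 3, 8, 8))

# gamma/beta have one entry per feature (channel)
print(bn2d.weight.shape, bn2d.bias.shape)   # torch.Size([3]) torch.Size([3])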

4. Again using the RMB binary classification example:

We add BN layers after the convolutional and linear layers of the original network; note that 2D BN follows the convolutional layers, while 1D BN follows the linear layer.

import torch.nn as nn
import torch.nn.functional as F


class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)

        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = out.view(out.size(0), -1)

        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 1)
                m.bias.data.zero_()

  1. Using net = LeNet(classes=2) without initialization:

  2. Using the carefully designed initialization net.initialize_weights():

  3. Using net = LeNet_bn(classes=2), the result is as follows: even where the loss is unstable, its maximum never exceeds 1.5 as it does in the first two cases.

4、Normalization_layers

1. Layer Normalization

Motivation: BN is not suitable for variable-length networks such as RNNs
Idea: compute the mean and variance per layer (over the features of each sample)
Notes:

  1. There is no longer a running_mean or running_var
  2. gamma and beta are per-element

nn.LayerNorm(normalized_shape, # shape of the features in this layer
            eps=1e-05, 
            elementwise_affine=True	# whether to apply the learnable affine transform
            )

Note that normalized_shape can be any trailing dimensions of the input. For example, for a batch input of shape [8, 6, 3, 4] it can be [6, 3, 4], [3, 4] or [4], but not [6, 3].
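
A quick sketch of the trailing-dimension rule, using the shapes from the example above:

import torch
import torch.nn as nn

x = torch.randn(8, 6, 3, 4)

ln = nn.LayerNorm([3, 4])     # normalizes over the last two dims; gamma/beta have shape [3, 4]
y = ln(x)

# nn.LayerNorm([6, 3])(x)     # would raise an error: [6, 3] is not a trailing shape of x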

2. Instance Normalization

Motivation: BN is unsuitable for image generation (Image Generation), where the samples within a batch differ from one another (e.g. in style), so normalizing across the batch makes no sense
Idea: compute the mean and variance per instance, i.e. per channel of each sample

nn.InstanceNorm2d(num_features, 
                eps=1e-05, 
                momentum=0.1, 
                affine=False, 
                track_running_stats=False)
# 1d and 3d variants exist as well

Image style transfer is one application where BN cannot be used: the input images all differ, so the mean and variance can only be computed per channel of each individual sample.
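
A minimal sketch of the per-channel, per-sample statistics (the shapes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(2, 3, 4, 4)               # (batch, channels, H, W)
inorm = nn.InstanceNorm2d(num_features=3)
y = inorm(x)

# each (sample, channel) slice is normalized on its own
print(y[0, 0].mean().item(), y[0, 0].std(unbiased=False).item())   # ~0, ~1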

3. Group Normalization

Motivation: with small batches, the statistics estimated by BN are unreliable
Idea: when there are not enough samples, group channels together to compute the statistics

nn.GroupNorm(num_groups, 	# number of groups; must divide num_channels
            num_channels, 
            eps=1e-05, 
            affine=True)

Notes:

  1. There is no longer a running_mean or running_var
  2. gamma and beta are per-channel

Typical use case: large models, i.e. tasks that only allow a small batch size

When num_groups = 1, GroupNorm is equivalent to LN
When num_groups = num_channels, GroupNorm is equivalent to IN
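
A small sketch of these two limiting cases (affine disabled so the outputs can be compared directly):

import torch
import torch.nn as nn

x = torch.randn(2, 4, 3, 3)

gn_in = nn.GroupNorm(num_groups=4, num_channels=4, affine=False)   # one group per channel ~ IN
inorm = nn.InstanceNorm2d(num_features=4, affine=False)
print(torch.allclose(gn_in(x), inorm(x), atol=1e-5))               # True

gn_ln = nn.GroupNorm(num_groups=1, num_channels=4, affine=False)   # a single group ~ LN over (C, H, W)
ln = nn.LayerNorm(x.shape[1:], elementwise_affine=False)
print(torch.allclose(gn_ln(x), ln(x), atol=1e-5))                  # True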

4. Summary