In my earlier post [深度概念]·Attention机制概念学习笔记, I covered the concepts and technical details of the attention mechanism. This post complements it: we implement Self-Attention text classification in Keras so you can build a more concrete understanding of how attention works.

For comparison, see [TensorFlow深度学习深刻]实战三·分别使用DNN,CNN与RNN(LSTM)作文本情感分析, which solves the same task with different networks and highlights how they differ from and relate to each other.
With the general idea of the model in place, let's look in detail at what the Self-Attention structure actually is. Its basic structure is as follows.
In self-attention, the three matrices Q (Query), K (Key), and V (Value) all come from the same input. We first compute the dot product of Q with K and, to keep the result from growing too large, divide it by a scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of a single query/key vector. A softmax then normalizes the result into a probability distribution, which is multiplied by V to obtain the weighted-sum representation. The operation can be written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
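To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention (my own illustration, with hypothetical names and toy shapes, not the Keras code used later):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # pairwise dot products, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # probability-weighted sum of values
```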
This may still feel abstract, so let's walk through a concrete example (the figures come from https://jalammar.github.io/illustrated-transformer/, which explains this extremely clearly and is highly recommended). Suppose we want to translate the phrase "Thinking Machines", where the input embedding vector of "Thinking" is $x_1$ and that of "Machines" is $x_2$.
When we process the word "Thinking", we need to compute the attention score between it and every word in the sentence. This is like using the current word as a search query and matching it against the key of every word in the sentence (including the word itself) to see how related they are. Let $q_1$ be the query vector of "Thinking", and let $k_1$ and $k_2$ be the key vectors of "Thinking" and "Machines" respectively. To compute the attention score for "Thinking" we take the dot product $q_1 \cdot k_1$; likewise, for the attention score of "Machines" we compute $q_1 \cdot k_2$. Having obtained the two dot products as shown in the figure above, we apply the scaling and softmax normalization, as shown in the figure below.
Naturally, the current word usually has the largest attention score with itself, while the other words receive scores in proportion to how relevant they are to it. We then multiply these attention scores by the value vectors to obtain the weighted vectors.
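Plugging in the illustrative numbers used in the linked post (dot products of 112 and 96 with $d_k = 64$; the snippet itself is my own sketch) shows how sharply the softmax concentrates on the word itself:

```python
import numpy as np

scores = np.array([112.0, 96.0])  # q1·k1 and q1·k2, example values from the linked post
scaled = scores / np.sqrt(64)     # divide by sqrt(d_k) -> [14, 12]
weights = np.exp(scaled) / np.exp(scaled).sum()
print(weights.round(2))           # [0.88 0.12]: "Thinking" attends mostly to itself
```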
If we stack all the input vectors into a matrix $X$, then all the query, key, and value vectors can also be merged into matrix form:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

where $W^Q$, $W^K$, and $W^V$ are parameters learned during training. The whole operation then simplifies to the matrix form

$$Z = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
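The following NumPy sketch (my own, with hypothetical toy sizes) makes the matrix form explicit: one matrix multiply per projection produces the queries, keys, and values of every word at once:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3                        # toy sizes, purely illustrative
X = rng.standard_normal((2, d_model))      # rows: embeddings x1 ("Thinking"), x2 ("Machines")
W_Q = rng.standard_normal((d_model, d_k))  # learned projections (random stand-ins here)
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # row i holds the q/k/v vector of word i
scores = Q @ K.T / np.sqrt(d_k)            # every attention score in one matrix product
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
Z = weights @ V
print(Z.shape)                             # (2, 3): one output vector per word
```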
I build the Self_Attention model with Keras. Because the layer carries quite a few internal parameters, it is implemented as a custom layer; for background on writing custom Keras layers, see the Keras documentation page 编写你自己的 Keras 层 (Writing your own Keras layers).
A custom Keras layer needs to implement the following three methods (note that input_shape includes the batch_size dimension):

- build(input_shape): this is where you define your weights. The method must set self.built = True, which can be done by calling super([Layer], self).build().
- call(x): this is where the layer's logic lives. Unless you want the layer to support masking, you only need to care about the first argument passed to call: the input tensor.
- compute_output_shape(input_shape): if your layer changes the shape of its input, define the shape-transformation logic here so Keras can infer output shapes automatically.

A minimal toy example of this three-method pattern follows, and then the Self_Attention implementation itself.
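The toy layer below is my own illustration (the name FeatureScale and its behavior are hypothetical); it simply learns one multiplicative scale per input feature, but it exercises all three methods:

```python
from keras.engine.topology import Layer

class FeatureScale(Layer):

    def build(self, input_shape):
        # define the weights: one trainable scale per input feature
        self.scale = self.add_weight(name='scale',
                                     shape=(input_shape[-1],),
                                     initializer='ones',
                                     trainable=True)
        super(FeatureScale, self).build(input_shape)  # sets self.built = True

    def call(self, x):
        # the layer's forward logic: elementwise scaling of the input tensor
        return x * self.scale

    def compute_output_shape(self, input_shape):
        # the output shape equals the input shape
        return input_shape
```

With that pattern in mind, here is the Self_Attention layer: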
```python
from keras import backend as K
from keras.engine.topology import Layer

class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # create one trainable weight tensor holding W^Q, W^K and W^V
        # inputs.shape = (batch_size, time_steps, seq_len)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3, input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(Self_Attention, self).build(input_shape)  # must be called at the end

    def call(self, x):
        WQ = K.dot(x, self.kernel[0])  # queries: (batch, time_steps, output_dim)
        WK = K.dot(x, self.kernel[1])  # keys
        WV = K.dot(x, self.kernel[2])  # values

        print("WQ.shape", WQ.shape)
        print("K.permute_dimensions(WK, [0, 2, 1]).shape",
              K.permute_dimensions(WK, [0, 2, 1]).shape)

        QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
        QK = QK / (64 ** 0.5)  # scale the scores, then softmax-normalize
        QK = K.softmax(QK)
        print("QK.shape", QK.shape)

        V = K.batch_dot(QK, WV)  # weighted sum of the value vectors
        return V

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)
```
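As a quick smoke test (my own addition, not part of the original post), you can wrap the layer in a tiny model and confirm that the output shape matches compute_output_shape:

```python
import numpy as np
from keras.layers import Input
from keras.models import Model

inp = Input(shape=(10, 16))                    # (time_steps, embedding_dim), toy sizes
out = Self_Attention(16)(inp)
m = Model(inputs=inp, outputs=out)
print(m.output_shape)                          # (None, 10, 16)
print(m.predict(np.zeros((2, 10, 16))).shape)  # (2, 10, 16)
```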
You can read the code against the conceptual walkthrough above. Merging all the input vectors into matrix form, the query, key, and value vectors are likewise produced in matrix form, $Q = XW^Q$, $K = XW^K$, $V = XW^V$. This corresponds to:
```python
WQ = K.dot(x, self.kernel[0])
WK = K.dot(x, self.kernel[1])
WV = K.dot(x, self.kernel[2])
```
Here $W^Q$, $W^K$, and $W^V$ are parameters learned during training, and the whole operation reduces to the matrix form $\mathrm{softmax}\!\left(QK^T / \sqrt{d_k}\right)V$. That corresponds to the lines below (why batch_dot? because input_shape includes the batch_size dimension, so the multiplication has to be carried out per sample):
```python
QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
QK = QK / (64 ** 0.5)
QK = K.softmax(QK)
print("QK.shape", QK.shape)
V = K.batch_dot(QK, WV)
```
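To see what K.batch_dot does here, the NumPy equivalent (my own sketch with hypothetical toy shapes) multiplies each sample's matrices pairwise along the batch axis:

```python
import numpy as np

WQ = np.random.rand(2, 5, 4)  # (batch, time_steps, dim), toy sizes
WK = np.random.rand(2, 5, 4)
QK_batched = np.matmul(WQ, WK.transpose(0, 2, 1))        # what K.batch_dot computes here
QK_loop = np.stack([WQ[i] @ WK[i].T for i in range(2)])  # the same thing, sample by sample
print(np.allclose(QK_batched, QK_loop))                  # True
```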
Here QK = QK / (64 ** 0.5) divides by a normalization factor; (64 ** 0.5) is simply a value I chose myself, and other articles may scale differently.
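One hedged alternative (my suggestion, not part of the original code) is to tie the scale to the actual projection width, so the layer stays consistent with the $\sqrt{d_k}$ term if output_dim ever changes:

```python
# inside call(): scale by sqrt(d_k) with d_k = self.output_dim, instead of a hard-coded 64
QK = QK / (self.output_dim ** 0.5)
```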
The complete project code is below; it uses the IMDB movie-review dataset that ships with Keras.
```python
#%%
from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd
from keras import backend as K
from keras.engine.topology import Layer


class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # create one trainable weight tensor holding W^Q, W^K and W^V
        # inputs.shape = (batch_size, time_steps, seq_len)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3, input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(Self_Attention, self).build(input_shape)  # must be called at the end

    def call(self, x):
        WQ = K.dot(x, self.kernel[0])
        WK = K.dot(x, self.kernel[1])
        WV = K.dot(x, self.kernel[2])

        print("WQ.shape", WQ.shape)
        print("K.permute_dimensions(WK, [0, 2, 1]).shape",
              K.permute_dimensions(WK, [0, 2, 1]).shape)

        QK = K.batch_dot(WQ, K.permute_dimensions(WK, [0, 2, 1]))
        QK = QK / (64 ** 0.5)
        QK = K.softmax(QK)
        print("QK.shape", QK.shape)

        V = K.batch_dot(QK, WV)
        return V

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], self.output_dim)


max_features = 20000

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# one-hot encode the labels
y_train, y_test = pd.get_dummies(y_train), pd.get_dummies(y_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

#%% pad the sequences to a fixed length
maxlen = 64
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

#%%
batch_size = 32
from keras.models import Model
from keras.optimizers import SGD, Adam
from keras.layers import *

S_inputs = Input(shape=(64,), dtype='int32')
embeddings = Embedding(max_features, 128)(S_inputs)
O_seq = Self_Attention(128)(embeddings)
O_seq = GlobalAveragePooling1D()(O_seq)
O_seq = Dropout(0.5)(O_seq)
outputs = Dense(2, activation='softmax')(O_seq)
model = Model(inputs=S_inputs, outputs=outputs)
print(model.summary())

# try using different optimizers and different optimizer configs
opt = Adam(lr=0.0002, decay=0.00001)
loss = 'categorical_crossentropy'
model.compile(loss=loss, optimizer=opt, metrics=['accuracy'])

#%%
print('Train...')
h = model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=5,
              validation_data=(x_test, y_test))

plt.plot(h.history["loss"], label="train_loss")
plt.plot(h.history["val_loss"], label="val_loss")
plt.plot(h.history["acc"], label="train_acc")
plt.plot(h.history["val_acc"], label="val_acc")
plt.legend()
plt.show()
# model.save("imdb.h5")
```
```
(TF_GPU) D:\Files\DATAs\prjs\python\tf_keras\transfromerdemo>C:/Files/APPs/RuanJian/Miniconda3/envs/TF_GPU/python.exe d:/Files/DATAs/prjs/python/tf_keras/transfromerdemo/train.1.py
Using TensorFlow backend.
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 64)
x_test shape: (25000, 64)
WQ.shape (?, 64, 128)
K.permute_dimensions(WK, [0, 2, 1]).shape (?, 128, 64)
QK.shape (?, 64, 64)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64)                0
_________________________________________________________________
embedding_1 (Embedding)      (None, 64, 128)           2560000
_________________________________________________________________
self__attention_1 (Self_Atte (None, 64, 128)           49152
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 2,609,410
Trainable params: 2,609,410
Non-trainable params: 0
_________________________________________________________________
None
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 17s 693us/step - loss: 0.5244 - acc: 0.7514 - val_loss: 0.3834 - val_acc: 0.8278
Epoch 2/5
25000/25000 [==============================] - 15s 615us/step - loss: 0.3257 - acc: 0.8593 - val_loss: 0.3689 - val_acc: 0.8368
Epoch 3/5
25000/25000 [==============================] - 15s 614us/step - loss: 0.2602 - acc: 0.8942 - val_loss: 0.3909 - val_acc: 0.8303
Epoch 4/5
25000/25000 [==============================] - 15s 618us/step - loss: 0.2078 - acc: 0.9179 - val_loss: 0.4482 - val_acc: 0.8215
Epoch 5/5
25000/25000 [==============================] - 15s 619us/step - loss: 0.1639 - acc: 0.9368 - val_loss: 0.5313 - val_acc: 0.8106
```