Speeding Up Deep Learning Inference via Unstructured Sparsity

Serving large neural networks can be expensive. It certainly doesn’t help that a neural network’s size appears to correlate with how useful it is (case in point: GPT-3).

A natural impulse is to make neural networks smaller by cheating. In Song Han’s seminal Deep Compression paper back in 2016 (which has been cited around 4,000 times according to Google), he proposed two ways to cut corners: quantization and pruning. Quantization aims to reduce the number of bits in the weights and activations. Pruning aims to preserve useful weights and set the rest to zero. Both have since become vibrant fields of research. Note that quantization and pruning are largely orthogonal methods: they can be, and are meant to be, combined.
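
To make the two ideas concrete, here is a minimal NumPy sketch of magnitude pruning followed by simple 8-bit quantization (the layer shape, the 90% threshold, and the per-tensor quantization scheme are illustrative choices, not the exact recipe from the paper):

```python
# A minimal sketch of magnitude pruning followed by simple 8-bit quantization.
# The layer shape, the 90% threshold, and the per-tensor scheme are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)   # a dense weight matrix

# Pruning: keep the largest-magnitude 10% of weights, zero out the rest.
threshold = np.quantile(np.abs(W), 0.90)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# Quantization: map the surviving weights to 8-bit integers with a single scale.
scale = np.abs(W_pruned).max() / 127.0
W_q = np.round(W_pruned / scale).astype(np.int8)

print("sparsity:", 1.0 - np.count_nonzero(W_q) / W_q.size)   # roughly 0.9
```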

We are mostly going to talk about pruning here. In the four years since 2016, much of the pruning research has focused on structured pruning. Why structured pruning? After setting useless weights to zero in pruning, we are left with a sparse weight matrix. It didn’t take people long to realize that the current commodity hardware/software stack does not like unstructured (aka random) sparsity patterns, where the nonzeros can be anywhere in the matrix. Instead, current hardware/software likes structured computation.
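
To see what unstructured sparsity looks like in practice, here is a toy sketch (shapes and sparsity level are made up) of a randomly pruned weight matrix and the compressed (CSR) form it is typically stored in, with nonzeros scattered at arbitrary positions:

```python
# A toy look at an unstructured (random) sparsity pattern and its compressed
# (CSR) storage. Shapes and sparsity level are made up for illustration.
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)
mask = rng.random((8, 8)) > 0.9            # ~90% of entries pruned, anywhere in the matrix
W_sparse = csr_matrix(W * mask)            # nonzeros sit at arbitrary (row, col) positions

print(W_sparse.nnz, "nonzeros out of", W.size)
print(W_sparse.indptr)                     # irregular row lengths...
print(W_sparse.indices)                    # ...and scattered column indices
```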

The quintessential structured computation is a dense matrix multiply, which has been optimized to hell and back by a bunch of really smart people. On Intel CPUs, one can expect around 90% of the hardware’s peak FLOPs with Intel MKL. On Nvidia GPUs, one can also expect close to roofline performance. Multiplying unstructured sparse matrices is the polar opposite: typically, existing state-of-the-art implementations achieve (much) less than 10% of the hardware’s peak FLOPs.
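
If you want a rough feel for this gap on your own machine, the sketch below (assuming NumPy is linked against a fast BLAS such as MKL and SciPy is installed; matrix sizes are illustrative) times a dense matmul against a 90%-sparse CSR matmul and reports effective FLOP rates:

```python
# A rough way to see the utilization gap yourself. Assumes NumPy is linked against
# a fast BLAS (e.g. MKL) and that SciPy is installed; sizes are illustrative.
import time
import numpy as np
from scipy.sparse import random as sparse_random

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
S = sparse_random(n, n, density=0.1, format="csr", dtype=np.float32)   # 90% sparse

def gflops(flops, fn):
    fn()                                   # warm-up run
    t0 = time.perf_counter(); fn(); t1 = time.perf_counter()
    return flops / (t1 - t0) / 1e9

print("dense :", gflops(2 * n**3, lambda: A @ B), "GFLOP/s")
print("sparse:", gflops(2 * S.nnz * n, lambda: S @ B), "GFLOP/s")      # usually far lower
```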

Structured pruning thus aims to strike a middle ground by introducing some structure into the sparsity pattern, most often in the form of blocks. Block sparsity turns out to be very efficient: OpenAI’s block-sparse GPU kernels can achieve almost linear speedup with the sparsity ratio and use the hardware almost as efficiently as dense matrix multiplication.
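
As a toy illustration of what block sparsity means (this is not OpenAI’s kernel, just a sketch with a made-up 32x32 block size): blocks are scored and pruned as a whole, so the surviving nonzeros come in dense tiles that hardware can chew through efficiently.

```python
# A toy sketch of block pruning: score 32x32 blocks by their Frobenius norm and keep
# the top ~10%, so the surviving nonzeros come in dense tiles. Block size is made up.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
b = 32
blocks = W.reshape(256 // b, b, 256 // b, b)
scores = np.linalg.norm(blocks, axis=(1, 3))              # one score per block
keep = scores >= np.quantile(scores, 0.90)                # keep the top ~10% of blocks
mask = np.repeat(np.repeat(keep, b, axis=0), b, axis=1)   # expand to a per-element mask
W_block_sparse = W * mask                                 # nonzeros form dense 32x32 tiles
```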

Unfortunately, it is widely observed that structured pruning causes rather severe accuracy degradation compared to unstructured pruning. This is rather intuitive, as one is imposing constraints on which weights can survive the pruning process. Unfortunately, instead of acting as some form of regularization, this almost always degrades accuracy. Recent studies have shown that simple magnitude-based unstructured pruning, similar to what Song Han used four years ago, still achieves a state-of-the-art sparsity-ratio/accuracy-loss tradeoff consistently across different tasks.

Wouldn’t it be great if we could just speed up unstructured sparse matrix multiplications?

In my recent paper on arXiv (https://arxiv.org/abs/2008.11849), I present a code generator called SparseRT to tackle this challenge. It turns out that clever compilation strategies can, in effect, do a lot of the work of the sparse matrix multiplication at compile time, speeding up execution at runtime. In particular, if one knows the sparsity pattern of the sparse matrix at compile time, then one can optimize specifically for this sparsity pattern, in effect tailoring the program to this particular sparse matrix.
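
Here is a deliberately tiny Python sketch of the general idea, not SparseRT’s actual code generator (SparseRT emits GPU code and applies many more optimizations): because the sparse weights are known ahead of time, we can emit a kernel with the nonzero positions and values baked straight into the code, leaving no index lookups or branches at runtime.

```python
# A deliberately tiny illustration of compile-time specialization -- not SparseRT's
# actual code generator, which emits GPU code and applies many more optimizations.
# Because the sparse weights are known ahead of time, we can emit a kernel with the
# nonzero positions and values baked directly into the code.
import numpy as np

def specialize_spmv(W_sparse):
    lines = ["def spmv(x, y):"]
    for i, row in enumerate(W_sparse):
        terms = [f"{float(v)!r} * x[{j}]" for j, v in enumerate(row) if v != 0.0]
        if terms:
            lines.append(f"    y[{i}] = " + " + ".join(terms))
    namespace = {}
    exec("\n".join(lines), namespace)      # "compile" the specialized kernel
    return namespace["spmv"]

W = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 0.0],
              [1.5, 0.0, -3.0]])
spmv = specialize_spmv(W)                  # no index lookups or branches left at runtime
x, y = np.array([1.0, 2.0, 3.0]), np.zeros(3)
spmv(x, y)
print(y)                                   # equals W @ x
```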

I show that at 90% sparsity, many deep learning operations, such as 1x1 and 3x3 convolutions, can be accelerated by 2 to 3x on GPUs. Spending effort at compile time is acceptable for deep learning inference, where people typically do not care about how long it takes to compile a model. The speedups on 20 deep learning problems at 90% sparsity are presented in the following figure.

[Figure: measured speedups at 90% sparsity across 20 deep learning problems]

It’s important to note that SparseRT comes with a bunch of disclaimers:

Disclaimer 1: it only works for inference (right now). People have tried to use unstructured sparsity in training scenarios as well. However, in that case, the sparsity structure of the weight matrix tends to change on the fly. In addition, SparseRT currently doesn’t support many of the sparse gradient computations you’d need for deep learning (which typically take the form of sampled dense-dense matrix multiplication rather than sparse matrix multiplication).
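
For readers unfamiliar with the distinction, here is a small NumPy sketch (shapes are illustrative): inference needs a sparse-times-dense product (SpMM), while the weight gradient of a pruned layer only needs the dense-dense product at the positions that survived pruning, i.e. a sampled dense-dense matrix multiplication (SDDMM).

```python
# A small sketch of the distinction (shapes are illustrative). Inference needs a
# sparse-times-dense product (SpMM); the weight gradient of a pruned layer only needs
# the dense-dense product at positions that survived pruning, i.e. an SDDMM.
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((64, 64)) > 0.9          # a fixed 90%-sparse weight pattern
W = rng.standard_normal((64, 64)) * mask   # the pruned weight matrix
X = rng.standard_normal((64, 32))          # activations
dY = rng.standard_normal((64, 32))         # gradient flowing back from the next layer

Y = W @ X                                  # SpMM: the inference-time operation
dW = (dY @ X.T) * mask                     # SDDMM: dense-dense product, sampled only
                                           # where weights survived pruning
```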

Disclaimer 2: it can’t use Tensor Cores. The paper presents results in fp32, which some would consider an obsolete data format for deep learning inference. SparseRT does support fp16 on GPUs via vector instructions, which typically provides a speedup over SparseRT running in fp32. However, it’s very difficult to reconcile unstructured sparsity with dense matrix multiply accelerators like Tensor Cores. That said, it’s not completely hopeless, and Tensor Core support should be available in the future.

I should also mention other great efforts being made in this direction, particularly the one made by a team at Google/Stanford: https://arxiv.org/abs/2006.10901.

Cool. How much end-to-end speedup can SparseRT achieve?

The paper itself focuses on single operations (sparse 1x1 and 3x3 convolutions). End-to-end results are understandably more interesting to most practitioners. Those are tricky because end-to-end inference depends on more than fast 1x1 and 3x3 convolutions. For example, on MobileNets, once the 1x1 convolutions become three times as fast, the depthwise convolutions become the bottleneck. On BERT, once the matrix multiplications involving the weight matrices become fast, the (dense) self-attention unit becomes the bottleneck. Of course, one also needs to fuse element-wise layers, etc. These tricky details are probably the reason why Nvidia and Intel have whole teams of engineers working on TensorRT and OpenVINO.
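
A quick back-of-the-envelope calculation shows why end-to-end numbers lag the per-operation ones (the 70/30 split below is an assumed figure for illustration, not a measured profile):

```python
# A back-of-the-envelope Amdahl's-law estimate. The 70/30 time split is an assumed
# figure for illustration, not a measured MobileNet profile.
frac_1x1, speedup_1x1 = 0.70, 3.0
end_to_end = 1.0 / (frac_1x1 / speedup_1x1 + (1.0 - frac_1x1))
print(round(end_to_end, 2))   # ~1.88x, not 3x: the depthwise convolutions now dominate
```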

Of course, all of those optimizations are compatible with SparseRT’s approach, and it can achieve pretty good end-to-end speedups. I will present the detailed results in more blog posts and follow-up papers. Just to give a sense of the speedups: SparseRT can achieve a 2x end-to-end speedup over TensorRT on a Jetson Nano, in half precision, on MobileNets at 90% sparsity, which typically means a 1.5-2% accuracy loss. This allows a Jetson Nano to have the same inference throughput as a Jetson TX2, at a third (!) of the price.

Until next time…

Original article: https://towardsdatascience.com/speeding-up-deep-learning-inference-via-unstructured-sparsity-c87e3137cebc