3LC

papers deep learning compression

发布时间 : 2021-03-05 11:54

阅读 :

References & Notations
Abstract
Introduction
Distributed ML Background
Design
Evaluation
Questions

References & Notations

item	illustration
title	arXiv’18 3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning
Authors	Hyeontaek Lim, David G.Andersen, Michael Kaminsky

Abstract

减少通信量的措施容易犯的错误：

牺牲了最终的模型精度
用其他ML技术弥补损失（例如DGC的4种compensation），限制了通用性
导致高计算开销，只适用于计算能力>>通信能力的环境

3LC尝试平衡以下4点：

减少通信 traffic reduction
精度 accuracy
计算开销 computation overhead
通用性 generality

3LC结合了3个技术：

(lossy) 3-value quantization with sparsity multiplication
(lossless) quartic encoding
(lossless) zero-run encoding

3LC的结果：

压缩率： 39-107x
精度：几乎不变
不需要修改现有算法
ResNet-110的完整训练时间减少列16-23x

Introduction

其他略过，看3LC结合的3个技术。

3-value quantization with sparsity multiplication

有损压缩成{-1,0,1}
通过一个参数控制压缩程度
用error accumulation来控制误差

作者认为他对训练模型的精度影响很小，所以不需要修改ML算法来补偿潜在的精度损失？？？我直接疑惑，很多压缩算法都是这样的。

Quartic encoding

正常应该用2bit来表示3-value。
Quartic encoding将一组3-value一起压缩，可以节省大约20%空间。此外编码后的数据可以被进一步压缩。

Zero-run encoding

利用0多的冗余性，对Quartic encoding得到的数据做进一步的压缩。可以得到2x或更高的压缩率。

Distributed ML Background

略过

Design

详细说明见论文

3-value Quantization with Sparsity Multiplication

原梯度tensor T_in —(compress)—> 同样形状，由{-1,0,1}组成的tensor; 缩放因子M

压缩：

$M= max(|T_{in}|)·s \\ T_{quantized} = round\left(T_{in}\over M\right)$

解压缩：

$T_{out} = M·T_{quantized}$

$s$取值范围：1~2。默认设置$s$为1，$s$越大，$T_{quantized}$中的0越少，则在最后Zero-run encoding阶段压缩率越高。$s$越大，误差越大。

Alternative quantization techniques

Stochastic quantization

Stochastic quantization是一个好方法，TernGrad将用过。但是3LC中没用的原因是：

用round做压缩引入的bias误差可以通过error accumulation修正
在evaluation中，error correction比stochastic quantization的精度高。采用stochastic quantization方法需要更多的bit来表示，或者用额外的技术来补足精度。
stochastic quantization + error accumulation 在实验中无法收敛。

Squared quantization error minimization

略过

Alternative sparsification techniques

略过

Convergence

略过

Quartic Encoding

将5个3-value压缩成1byte。
5个3-value的最多信息量： $log_23^5 = 7.92 < 8$

压缩流程：

+1: {-1,0,1} —> {0,1,2}
type cast: 3-value —> uint8
flatten: matrix —> vector
pad 0: len(T) mod 5 == 0
$a=p_0·81+p_1·27+p_2·9+p_3·3+p_4$

Zero-run Encoding

利用了两个特点：

Quartic Encoding中，5个3-value占用一个字节，一个字节最多能表示256中状态，而原数据最多只有$3^5=243$种状态。
Quartic Encoding后有很多0（稀疏性）

现在有5个0，先通过Quartic Encoding压缩，得到121，流程如下：

+1： 00000->11111
type cast, flatten, pad0
81+27+9+3+1 = 121

若检测到压缩后的数据中有连续k(2<=k<=14)个121，则将这些连续的121替换成单字节，其值为243+k-2。

Evaluation

Traffic Reduction: 每个阶段节省了多少
Training time: 完整训练时间节省了多少
Accuracy
Convergence speed
Computation overhead

乱的很，在图表中没找到压缩耗时的结果图。

Questions

这些压缩算法能否通过自己之前设计的common operators(reduce, map, filter, random, sort, concat, extract)来实现？
- 3-value Quantization with Sparsity Multiplication
  压缩: reduce, map, concat
  解压: extract, map
- Quartic Encoding
  压缩: map, concat
  解压: map, extract
- Zero-run Encoding
  无法利用先进硬件的向量化技术。相对的，在作者阐述Quartic Encoding时，特别指出了可以利用向量化技术。
  明确一点：原作者此处肯定用的scan的方法做的。
  压缩：map+filter+concat（效率偏低）
  解压：分段解压+拼接？可能需要16倍于原大小的空间。如何拼接？

转载请注明来源，欢迎对文章中的引用来源进行考证，欢迎指出任何有错误或不够清晰的表达。