Efficient Methods and Hardware for Deep Learning
Song Han Stanford University May 25, 2017
Intro
Song Han — PhD Candidate, Stanford
Bill Dally — Chief Scientist, NVIDIA; Professor, Stanford
2
Deep Learning is Changing Our Lives
Self-Driving Cars, Machine Translation, AlphaGo, Smart Robots
(Images are licensed under CC-BY 2.0 or in the public domain)
3
Models are Getting Larger

Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP, ~16% error → ResNet (Microsoft, 2015), 152 layers, 22.6 GFLOP, ~3.5% error — roughly 16x more training ops.
Speech recognition: Deep Speech 1 (Baidu, 2014), 80 GFLOP, 7,000 hrs of data, ~8% error → Deep Speech 2 (2015), 465 GFLOP, 12,000 hrs of data, ~5% error — roughly 10x more training ops.

Dally, NIPS'2016 workshop on Efficient Methods for Deep Neural Networks
4
The First Challenge: Model Size
Hard to distribute large models through over-the-air updates.
(App icon and phone images are in the public domain or licensed under CC-BY 2.0)
5
The Second Challenge: Speed

Network      Error rate   Training time
ResNet-18    10.76%       2.5 days
ResNet-50    7.02%        5 days
ResNet-101   6.21%        1 week
ResNet-152   6.16%        1.5 weeks

Such long training times limit ML researchers' productivity.
Training time benchmarked with fb.resnet.torch using four M40 GPUs.
6
The Third Challenge: Energy Efficiency
AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game.
On mobile: drains the battery. In the data center: increases TCO.
(Images are in the public domain or licensed under CC-BY 2.0)
7
Where is the Energy Consumed?
Larger model => more memory references => more energy

Relative Energy Cost (45nm CMOS)

Operation              Energy [pJ]   Relative Cost
32 bit int ADD         0.1           1
32 bit float ADD       0.9           9
32 bit Register File   1             10
32 bit int MULT        3.1           31
32 bit float MULT      3.7           37
32 bit SRAM Cache      5             50
32 bit DRAM Memory     640           6400

Energy table for the 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
9
Where is the Energy Consumed?
Larger model => more memory references => more energy
(Same 45nm CMOS energy table as above: memory access is 3 orders of magnitude more energy expensive than simple arithmetic.)

How to make deep learning more efficient?
10
Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
11
Application as a Black Box
Algorithm
Spec 2006 This image is in the public domain
Hardware
CPU
This image is in the public domain
12
Open the Box before Hardware Design
?
Algorithm
This image is in the public domain
Hardware
?PU
This image is in the public domain
Breaks the boundary between algorithm and hardware
13
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Hardware 101: the Family

General Purpose*: CPU (latency oriented), GPU (throughput oriented)
Specialized HW:   FPGA (programmable logic), ASIC (fixed logic)
* including GPGPU
Hardware 101: Number Representation

Floating point: (-1)^S x (1.M) x 2^(E - bias)
FP32:  1 sign bit, 8 exponent bits, 23 mantissa bits
FP16:  1 sign bit, 5 exponent bits, 10 mantissa bits
Int32: 1 sign bit, 31 magnitude bits
Int16: 1 sign bit, 15 magnitude bits
Int8:  1 sign bit, 7 magnitude bits
Fixed point: sign, integer bits, fraction bits (programmable radix point)

Format   Range              Accuracy
FP32     10^-38 – 10^38     .000006%
FP16     6x10^-5 – 6x10^4   .05%
Int32    0 – 2x10^9         1/2
Int16    0 – 6x10^4         1/2
Int8     0 – 127            1/2
Dally, High Performance Hardware for Machine Learning, NIPS’2015
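A tiny sketch (not from the slides) of the fixed-point idea above; the function names and the choice of 4 fractional bits are illustrative assumptions.

```python
# Quantize FP32 values to signed 8-bit fixed point with a chosen radix point.
import numpy as np

def to_fixed(x, frac_bits=4):
    """Round to signed int8 fixed point with `frac_bits` fractional bits."""
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)

def from_fixed(q, frac_bits=4):
    """Recover an approximate float from the fixed-point codes."""
    return q.astype(np.float32) / (1 << frac_bits)

w = np.array([2.09, -0.98, 1.48, 0.09], dtype=np.float32)
q = to_fixed(w)
print(q)              # int8 codes
print(from_fixed(q))  # [ 2.0625 -1.      1.5     0.0625] -- limited precision
```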
Hardware 101: Number Representation — Energy and Area per Operation

Operation             Energy (pJ)   Area (µm²)
8b Add                0.03          36
16b Add               0.05          67
32b Add               0.1           137
16b FP Add            0.4           1360
32b FP Add            0.9           4184
8b Mult               0.2           282
32b Mult              3.1           3495
16b FP Mult           1.1           1640
32b FP Mult           3.7           7700
32b SRAM Read (8KB)   5             N/A
32b DRAM Read         640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare Library.
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Part 1: Algorithms for Efficient Inference
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation
Pruning Neural Networks
[Lecun et al. NIPS’89] [Han et al. NIPS’15]
Pruning
Trained Quantization
Huffman Coding
24
Pruning Neural Networks
[Han et al. NIPS’15]
Pruning analogy: -0.01x² + x + 1 ≈ x + 1 (drop the negligible term)
AlexNet: 60 million → 6 million connections
Pruning
Trained Quantization
10x fewer connections
Huffman Coding
25
[Han et al. NIPS’15]
Retrain to Recover Accuracy
[Plot: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40%–100%), comparing Pruning alone with Pruning+Retraining. Pruning alone degrades accuracy sharply beyond ~80% sparsity; pruning followed by retraining keeps the accuracy loss near zero.]
Pruning
Trained Quantization
Huffman Coding
28
[Han et al. NIPS’15]
Iteratively Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away (40%–100%), comparing Pruning, Pruning+Retraining, and Iterative Pruning and Retraining. Iterative pruning and retraining pushes the point of no accuracy loss to roughly 90% of parameters pruned.]
Pruning
Trained Quantization
Huffman Coding
29
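The pruning flow on these slides can be sketched in a few lines. This is an illustrative NumPy version, not the authors' code; the `retrain` hook is a placeholder for a framework training loop.

```python
# Iterative magnitude pruning: zero out the smallest weights, optionally retrain,
# then raise the sparsity target and repeat.
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights so `sparsity` fraction become zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, final_sparsity=0.9, steps=5, retrain=None):
    """Gradually raise sparsity; retraining between steps recovers accuracy."""
    for s in np.linspace(final_sparsity / steps, final_sparsity, steps):
        weights, mask = prune_by_magnitude(weights, s)
        if retrain is not None:
            weights = retrain(weights, mask)   # placeholder retraining hook
    return weights, mask

w = np.random.randn(1000).astype(np.float32)
w_pruned, mask = iterative_prune(w, final_sparsity=0.9, steps=5)
print(f"sparsity: {1 - mask.mean():.2f}")      # ~0.90
```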
Pruning RNN and LSTM
[Han et al. NIPS’15]
Example application: image captioning (CNN + RNN/LSTM).
References: Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Neural Image Caption Generator, Vinyals et al.; Convolutional Networks for Visual Recognition and Description, Donahue et al.; Visual Representation for Image Caption Generation, Chen and Zitnick.
Figure copyright IEEE, 2015; reproduced for educational purposes.
Pruning
Trained Quantization
Huffman Coding
30
Pruning RNN and LSTM
[Han et al. NIPS'15]
• Original: a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original: a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area
• Original: a man is riding a surfboard on a wave
  Pruned 90%: a man in a wetsuit is riding a wave on a beach
• Original: a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Pruning
Trained Quantization
Huffman Coding
31
Pruning Happens in the Human Brain
Newborn: 50 trillion synapses → 1 year old: 1000 trillion synapses → Adolescent: 500 trillion synapses
Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
Pruning
Trained Quantization
Huffman Coding
32
[Han et al. NIPS’15]
Pruning Changes Weight Distribution
Before Pruning
After Pruning
After Retraining
Conv5 layer of AlexNet; representative of other network layers as well.
Pruning
Trained Quantization
Huffman Coding
33
Part 1: Algorithms for Efficient Inference
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation
[Han et al. ICLR’16]
Trained Quantization
Example: the weights 2.09, 2.12, 1.92, 1.87 are all represented by the single shared value 2.0
Pruning
Trained Quantization
Huffman Coding
35
[Han et al. ICLR’16]
Trained Quantization
Pruning: fewer weights → Quantization: fewer bits per weight → Huffman Encoding

Train Connectivity → Prune Connections → Train Weights  (9x-13x reduction, same accuracy)
Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book  (27x-31x reduction, same accuracy)
Encode Weights / Encode Index

Example: 2.09, 2.12, 1.92, 1.87 → shared value 2.0; 32-bit weights → 4-bit indices gives an 8x smaller memory footprint.
The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate to between 27x and 31x. Huffman coding gives more compression: between 35x and 49x.
Pruning
Trained Quantization
Huffman Coding
36
[Han et al. ICLR’16]
Trained Quantization: Weight Sharing

weights (32-bit float):          cluster index (2-bit uint):    centroids:     fine-tuned centroids:
 2.09  -0.98   1.48   0.09         3  0  2  1                   3:  2.00        1.96
 0.05  -0.14  -1.08   2.12         1  1  0  3                   2:  1.50        1.48
-0.91   1.92   0.00  -1.03         0  3  1  0                   1:  0.00       -0.04
 1.87   0.00   1.53   1.49         3  1  2  2                   0: -1.00       -0.97

During fine-tuning, the weight gradients are grouped by cluster index, reduced (summed) within each cluster, multiplied by the learning rate, and subtracted from the shared centroids.

Figure 3: Weight sharing by scalar quantization (top) and centroid fine-tuning (bottom).
Pruning
Trained Quantization
Huffman Coding
37
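A minimal sketch of the weight-sharing step above, using a tiny hand-rolled 1-D k-means on the same 4x4 example matrix. It is illustrative only, not the paper's implementation; the gradient values are stand-ins.

```python
# k-means weight sharing: cluster weights into a small codebook, store 2-bit indices,
# then fine-tune the shared centroids with grouped gradients.
import numpy as np

def kmeans_1d(values, k=4, iters=50):
    """Cluster scalar weights into k centroids (the codebook)."""
    centroids = np.linspace(values.min(), values.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = values[idx == c].mean()
    return centroids, idx

weights = np.array([[ 2.09, -0.98,  1.48,  0.09],
                    [ 0.05, -0.14, -1.08,  2.12],
                    [-0.91,  1.92,  0.00, -1.03],
                    [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

codebook, idx = kmeans_1d(weights.ravel(), k=4)     # 4 clusters -> 2-bit indices
shared = codebook[idx].reshape(weights.shape)       # quantized weight matrix

# Centroid fine-tuning: group gradients by cluster index, sum within each cluster,
# and take a gradient step on the shared centroid.
grad = np.random.randn(*weights.shape).astype(np.float32) * 0.05   # stand-in gradient
lr = 0.01
for c in range(len(codebook)):
    codebook[c] -= lr * grad.ravel()[idx == c].sum()
```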
Before Trained Quantization: Continuous Weight [histogram: count vs. weight value]
After Trained Quantization: Discrete Weight [histogram: weights collapse onto the shared centroid values]
After Trained Quantization: Discrete Weight after Training [histogram: the centroids shift slightly as they are fine-tuned]
[Han et al. ICLR'16]
Pruning
Trained Quantization
Huffman Coding
[Han et al. ICLR’16]
How Many Bits do We Need?
Pruning
Trained Quantization
Huffman Coding
48
[Han et al. ICLR’16]
Pruning + Trained Quantization Work Together
AlexNet on ImageNet
Pruning
Trained Quantization
Huffman Coding
50
[Han et al. ICLR’16]
Huffman Coding
Encode Weights → Encode Index
Same accuracy, 35x-49x total reduction
• Frequent weights: use fewer bits to represent
• In-frequent weights: use more bits to represent
Pruning
Trained Quantization
Huffman Coding
51
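An illustrative sketch of the Huffman step: code the quantized cluster indices so that frequent indices get shorter bitstrings. The index histogram below is made up for the example.

```python
# Huffman-code quantized weight indices; frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return {symbol: bitstring} built from symbol frequencies."""
    freq = Counter(symbols)
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2   # skewed index histogram (made up)
code = huffman_code(indices)
huff_bits = sum(len(code[i]) for i in indices)
print(code)                                          # e.g. {0: '0', 1: '10', 2: '110', 3: '111'}
print(huff_bits, "bits vs", 2 * len(indices), "bits fixed-length")   # 28 vs 32
```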
[Han et al. ICLR’16]
Summary of Deep Compression
Pruning: fewer weights → Quantization: fewer bits per weight → Huffman Encoding

original network → Train Connectivity → Prune Connections → Train Weights  (9x-13x reduction, same accuracy)
→ Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book  (27x-31x reduction, same accuracy)
→ Encode Weights / Encode Index  (35x-49x reduction, same accuracy)

Figure 1: The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate to between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already includes the meta-data for the sparse representation. The compression scheme doesn't incur any accuracy loss.
Pruning / Trained Quantization / Huffman Coding
52
[Han et al. ICLR’16]
Results: Compression Ratio

Network     Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
LeNet-300   1070 KB         27 KB             40x                 98.36%              98.42%
LeNet-5     1720 KB         44 KB             39x                 99.20%              99.26%
AlexNet     240 MB          6.9 MB            35x                 80.27%              80.30%
VGGNet      550 MB          11.3 MB           49x                 88.68%              89.09%
GoogleNet   28 MB           2.8 MB            10x                 88.90%              88.92%
ResNet-18   44.6 MB         4.0 MB            11x                 89.24%              89.28%
Can we make compact models to begin with? Compression
Acceleration
Regularization
53
SqueezeNet: the Fire Module

Input (64 channels)
→ 1x1 Conv "Squeeze" (16 channels)
→ 1x1 Conv "Expand" (64 channels) and 3x3 Conv "Expand" (64 channels)
→ Output: Concat/Eltwise (128 channels)
Vanilla Fire module
Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"

Google TPU
• 25X as many MACs vs GPU, >100X as many MACs vs CPU
• 4 MiB of on-chip Accumulator memory
• 24 MiB of on-chip Unified Buffer (activation memory) — 3.5X as much on-chip memory vs GPU
• Two 2133MHz DDR3 DRAM channels
• 8 GiB of off-chip weight DRAM memory
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
79
Google TPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 80
Inference Datacenter Workload
Layers per network, weights, and TPU operational intensity:

Name    LOC    FC   Conv  Vector  Pool  Total  Nonlinear fn    Weights  TPU Ops/Weight Byte  TPU Batch Size  % Deployed
MLP0    0.1k   5    -     -       -     5      ReLU            20M      200                  200             61% (MLPs)
MLP1    1k     4    -     -       -     4      ReLU            5M       168                  168
LSTM0   1k     24   -     34      -     58     sigmoid, tanh   52M      64                   64              29% (LSTMs)
LSTM1   1.5k   37   -     19      -     56     sigmoid, tanh   34M      96                   96
CNN0    1k     -    16    -       -     16     ReLU            8M       2888                 8               5% (CNNs)
CNN1    1k     4    72    -       13    89     ReLU            100M     1750                 32
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 81
Roofline Model: Identify Performance Bottleneck
Williams, Waterman, and Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", Communications of the ACM 52.4 (2009): 65-76. David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 82
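A minimal sketch of the roofline calculation the slides refer to. The peak-throughput and memory-bandwidth numbers below are illustrative, TPU-like placeholders, not measured values.

```python
# Roofline: attainable throughput = min(peak compute, operational intensity x memory bandwidth).
def attainable_teraops(ops_per_byte, peak_teraops=92.0, mem_bw_gb_s=34.0):
    """Attainable TeraOps/s for a workload with the given operational intensity."""
    bandwidth_bound = ops_per_byte * mem_bw_gb_s / 1000.0   # GB/s * ops/byte -> GigaOps/s -> Tera
    return min(peak_teraops, bandwidth_bound)

for intensity in [64, 200, 1750, 2888]:   # ops per weight byte (LSTM-, MLP-, CNN-like workloads)
    print(intensity, "ops/byte ->", round(attainable_teraops(intensity), 1), "TeraOps/s")
```

Low-intensity workloads (small batch, few ops per byte) land on the sloped, bandwidth-bound part of the roofline, which is why the low-latency, small-batch inference workloads sit far below peak.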
TPU Roofline
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 83
Log Rooflines for CPU, GPU, TPU
Star = TPU Triangle = GPU Circle = CPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 84
Linear Rooflines for CPU, GPU, TPU
Star = TPU Triangle = GPU Circle = CPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 85
Why so far below the rooflines? Low latency requirement => can't batch more => low ops/byte
How to solve this? Smaller memory footprint => need to compress the model
Challenge: hardware that can run inference directly on the compressed model
86
EIE: the First DNN Accelerator for
Sparse, Compressed Model
Compression
Acceleration
Regularization
[Han et al. ISCA’16]
87
EIE: the First DNN Accelerator for Sparse, Compressed Models
[Han et al. ISCA'16]
Rules of thumb: 0 * A = 0 (skip zero weights), W * 0 = 0 (skip zero activations); 2.09, 1.92 => 2 (weight sharing)

Sparse Weight:     90% static sparsity  → 10x less computation, 5x less memory footprint
Sparse Activation: 70% dynamic sparsity → 3x less computation
Weight Sharing:    4-bit weights        → 8x less memory footprint

Compression
Acceleration
Regularization
88
EIE: Reduce Memory Access by Compression

Logically: b = ReLU(W a), with a sparse activation vector a = (0, a1, 0, a3) and a sparse 8x4 weight matrix W whose rows are interleaved across four processing elements PE0–PE3 (row i belongs to PE i mod 4).

Physically, each PE stores only its non-zero weights in a CSC-like format. For PE0 the stored arrays are:
  Virtual Weight:  W0,0  W0,1  W4,2  W0,3  W4,3
  Relative Index:  0  1  2  0  ...
  Column Pointer:  0  1  2  3  ...
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016, Hotchips 2016
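A software sketch (not EIE's hardware) of the same idea: store only the non-zero weights per column, broadcast only the non-zero activations, and multiply-accumulate on non-zeros only.

```python
# Sparse matrix x sparse activation vector in a CSC-like format, exploiting
# both rules of thumb: skip zero weights and skip zero activations.
import numpy as np

W = np.array([[1.0, 2.0, 0.0, 3.0],
              [0.0, 0.0, 4.0, 0.0],
              [0.0, 5.0, 0.0, 6.0],
              [0.0, 0.0, 0.0, 0.0]], dtype=np.float32)
a = np.array([0.0, 0.5, 0.0, 2.0], dtype=np.float32)   # sparse activations

# Build CSC storage: per column, the non-zero values and their row indices.
col_ptr, values, rows = [0], [], []
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    rows.extend(nz)
    values.extend(W[nz, j])
    col_ptr.append(len(values))

b = np.zeros(W.shape[0], dtype=np.float32)
for j in np.nonzero(a)[0]:                 # skip zero activations entirely
    for k in range(col_ptr[j], col_ptr[j + 1]):
        b[rows[k]] += values[k] * a[j]     # multiply-accumulate on non-zeros only

b = np.maximum(b, 0.0)                     # ReLU
assert np.allclose(b, np.maximum(W @ a, 0.0))
```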
[Han et al. ISCA’16]
Dataflow: b = ReLU(W a) with a = (0, a1, 0, a3). Only the non-zero activations a1 and a3 are broadcast to the PEs; each PE multiplies them only against its stored non-zero weights in the corresponding columns and accumulates into its slice of b.
rule of thumb:
0*A=0 W*0=0 Compression
Acceleration
Regularization
90
EIE Architecture
[Han et al. ISCA'16]

Compressed DNN model → per-PE decode pipeline:
• Weight decode: a 4-bit encoded (virtual) weight is expanded to a 16-bit real weight via a codebook lookup.
• Index decode: a 4-bit relative index is accumulated into a 16-bit absolute index (address accumulation).
The decoded weight and address feed the ALU, which accumulates results into the prediction memory.
An efficient inference engine that works directly on the compressed deep neural network model. For embedded and mobile applications, the resource demands of uncompressed models are prohibitive.
Shared weights: 2.09, 1.92 => 2; rules of thumb: 0 * A = 0, W * 0 = 0.
Compression
Acceleration
Regularization
100
[Han et al. ISCA’16]
Micro Architecture for each PE

Activation queue (act value + act index) → pointer read (even/odd pointer SRAM banks, column start/end address) → sparse matrix access (sparse matrix SRAM, encoded weight + relative index) → weight decoder and address accumulator (absolute address) → arithmetic unit → act read/write (source/destination activation registers, act SRAM) → ReLU and leading-non-zero detect.

Compression
Acceleration
Regularization
101
Speedup on EIE
[Han et al. ISCA'16]

[Bar chart: speedup over the CPU Dense baseline (1x) for CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, and EIE on nine compressed layers: Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM. Geometric mean: EIE is 189x faster than the CPU dense baseline and 13x faster than the GPU.]
Caption: Speedups of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model. There is no batching in all cases.
Layout of one PE in EIE under the TSMC 45nm process.
Table II: Implementation results of one PE in EIE, broken down by component type and by module; total power is 9.157 mW and total area is 638,024 µm² per PE, and the critical path of EIE is 1.15 ns.
Table III: Benchmarks from state-of-the-art DNN models — compressed AlexNet (Alex-6/7/8) and compressed VGG-16 (VGG-6/7/8) for large-scale image classification and object detection, and compressed NeuralTalk (NT-We, NT-Wd, NT-LSTM) with RNN and LSTM for automatic image captioning. Remaining weight density is 4%–25%, activation density 18%–100%, and FLOPs are 1%–11% of the dense layers.
Evaluation methodology (from the paper): a cycle-accurate C++ simulator plus Verilog RTL, synthesized with Synopsys Design Compiler, placed and routed with IC Compiler, SRAM modeled with CACTI, and power estimated with Prime-Time PX. Baselines: Intel Core i7-5930k CPU (MKL CBLAS GEMV for dense, MKL SPBLAS CSRMV for sparse; power via pcm-power), NVIDIA GeForce GTX Titan X GPU (cuBLAS GEMV for dense, CSR-format sparse kernels; power via nvidia-smi), and an NVIDIA mobile GPU. The uncompressed models come from the Caffe and NeuralTalk model zoos; the compressed models are produced as in the Deep Compression paper.

Compression
Acceleration
Regularization
102
Energy Efficiency on EIE
[Han et al. ISCA'16]

[Bar chart: energy efficiency relative to the CPU Dense baseline (1x) for CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, and EIE on the same nine layers. EIE's per-layer energy efficiency over the CPU dense baseline ranges from roughly 8,000x to 120,000x, with a geometric mean of about 24,000x; it is about 3,400x more energy efficient than the GPU.]
Caption: Energy efficiency of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model. There is no batching in all cases.

Compression
Acceleration
Regularization
103
[Han et al. ISCA'16]
Comparison: Throughput
EIE throughput (layers/s, log scale) compared across platforms:
Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE (ASIC, 45nm, 64 PEs), EIE (ASIC, 28nm, 256 PEs).
Compression
Acceleration
Regularization
104
[Han et al. ISCA'16]
Comparison: Energy Efficiency
EIE energy efficiency (layers/J, log scale) compared across the same platforms:
Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE (ASIC, 45nm, 64 PEs), EIE (ASIC, 28nm, 256 PEs).
Compression
Acceleration
Regularization
105
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Part 3: Efficient Training — Algorithms
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
Moore's law made CPUs 300x faster than in 1990.
But it's over…
C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
Data Parallel – Run multiple inputs in parallel
Dally, High Performance Hardware for Machine Learning, NIPS’2015
Data Parallel – Run multiple inputs in parallel
• Doesn't affect latency for one input
• Requires P-fold larger batch size
• For training, requires coordinated weight update
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Parameter Update: one method to achieve scale is parallelization.
Parameter Server: model workers compute updates ∆p on their data shards and send them to the parameter server, which applies p' = p + ∆p and sends the updated parameters p' back to the workers.
Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
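An illustrative sketch of the synchronous parameter-server update p' = p + ∆p with a stand-in linear model; this is NumPy pseudocode, not Dean et al.'s system.

```python
# Data-parallel update: workers compute gradients on their shards,
# the server aggregates and applies p' = p + delta_p.
import numpy as np

def worker_gradient(params, shard_x, shard_y):
    """Gradient of a linear least-squares loss on one data shard (stand-in model)."""
    pred = shard_x @ params
    return shard_x.T @ (pred - shard_y) / len(shard_y)

rng = np.random.default_rng(0)
params = np.zeros(4)
x, y = rng.normal(size=(64, 4)), rng.normal(size=64)
shards = np.array_split(np.arange(64), 4)          # 4 model workers, 4 data shards

lr = 0.1
for step in range(100):
    grads = [worker_gradient(params, x[s], y[s]) for s in shards]
    delta_p = -lr * np.mean(grads, axis=0)          # server aggregates the workers' updates
    params = params + delta_p                       # p' = p + delta_p
```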
Model Parallel
Split up the Model – i.e. the network
Dally, High Performance Hardware for Machine Learning, NIPS’2015
Model-Parallel Convolution – by output region (x,y)

6D Loop:
Forall output map j
  For each input map k
    For each pixel x,y
      For each kernel element u,v
        B[x][y][j] += A[x-u][y-v][k] * K[u][v][k][j]

Each processor computes a different (x,y) region of the output maps B from the input maps A and the kernels K.
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Model-Parallel Convolution – by output map j (filter)

Same 6D loop; here the parallel (Forall) dimension is the output map j, so each processor owns a subset of the filters K[·][·][·][j] and produces the corresponding output maps B[·][·][j]. A code sketch follows the citation below.
Dally, High Performance Hardware for Machine Learning, NIPS’2015
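A NumPy sketch of the 6D loop above with the Forall-over-output-maps dimension split across two simulated processors (written as cross-correlation, A[x+u][y+v], so the indices stay in range); illustrative only.

```python
# Model-parallel convolution by output map j: each "processor" computes its own filters.
import numpy as np

def conv_partition(A, K, j_start, j_end):
    """Output maps j_start..j_end-1 of B[x][y][j] += A[x+u][y+v][k] * K[u][v][k][j]."""
    X, Y, C = A.shape
    U, V, _, J = K.shape
    B = np.zeros((X - U + 1, Y - V + 1, j_end - j_start), dtype=A.dtype)
    for j in range(j_start, j_end):          # the parallel (Forall) dimension
        for k in range(C):
            for x in range(B.shape[0]):
                for y in range(B.shape[1]):
                    for u in range(U):
                        for v in range(V):
                            B[x, y, j - j_start] += A[x + u, y + v, k] * K[u, v, k, j]
    return B

A = np.random.randn(8, 8, 3).astype(np.float32)        # input maps
K = np.random.randn(3, 3, 3, 4).astype(np.float32)     # kernels: U x V x in x out
halves = [conv_partition(A, K, 0, 2), conv_partition(A, K, 2, 4)]   # two "processors"
B = np.concatenate(halves, axis=2)                      # gather the output maps
```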
Model Parallel Fully-Connected Layer (M x V)

b_i = sum_j W_ij * a_j — the output activations b are a matrix-vector product of the weight matrix W with the input activations a.
Split W by rows (output activations): each processor holds a horizontal slice of W and computes the corresponding slice of b.
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Hyper-Parameter Parallel
Try many alternative networks in parallel
Dally, High Performance Hardware for Machine Learning, NIPS’2015
Summary of Parallelism
• Lots of parallelism in DNNs
  • 16M independent multiplies in one FC layer
  • Limited by overhead to exploit a fraction of this
• Data parallel
  • Run multiple training examples in parallel
  • Limited by batch size
• Model parallel
  • Split model over multiple processors
  • By layer
  • Conv layers by map region
  • Fully-connected layers by output activation
• Easy to get 16-64 GPUs training one model in parallel
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Part 3: Efficient Training — Algorithms
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
Mixed Precision
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Mixed Precision Training (Volta training method)

W (F16) → FWD with F16 activations → BWD-A producing F16 activation gradients → BWD-W producing F16 weight gradients
Master-W (F32) → F32 weight update using the F16 weight gradients → updated Master-W (F32), rounded to F16 for the next iteration
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017
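An illustrative NumPy sketch of the flow above: FP16 compute, an FP32 master copy of the weights for the update, and loss scaling so that small FP16 gradients don't flush to zero. Hyperparameters are arbitrary and this is not NVIDIA's exact recipe.

```python
# Mixed precision: FP16 forward/backward, FP32 master weights, loss scaling.
import numpy as np

master_w = np.random.randn(256, 256).astype(np.float32) * 0.01   # FP32 master weights
loss_scale = 1024.0
lr = 0.01

for step in range(10):
    w16 = master_w.astype(np.float16)                        # FP16 copy used for compute
    x = np.random.randn(32, 256).astype(np.float16)
    act = x @ w16                                            # FWD in FP16
    grad_act = (act / act.size).astype(np.float16)           # stand-in upstream gradient
    grad_w16 = x.T @ (grad_act * np.float16(loss_scale))     # BWD-W in FP16, loss-scaled
    grad_w32 = grad_w16.astype(np.float32) / loss_scale      # unscale in FP32
    master_w -= lr * grad_w32                                # weight update on the FP32 master copy
```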
Inception V1 INCEPTION V1
12
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017
ResNet
RESNET50
13
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017
AlexNet results
Mode                                   Top1 accuracy, %   Top5 accuracy, %
FP32                                   58.62              81.25
Mixed precision training               58.12              80.71
FP16 training (no loss scaling)        54.89              78.12
FP16 training, loss scale = 1000       57.76              80.76
Without scaling the loss function, pure FP16 training loses accuracy; scaling the loss recovers it.
Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=1024, no augmentation, 1 crop, 1 model.

Inception-V3 results
Mode                                                  Top1 accuracy, %   Top5 accuracy, %
FP32                                                  71.75              90.52
Mixed precision training                              71.17              90.10
FP16 training, loss scale = 1                         71.17              90.33
FP16 training, loss scale = 1, FP16 master weights    70.53              90.14

ResNet-50 results
Mode                                   Top1 accuracy, %   Top5 accuracy, %
FP32                                   73.85              91.44
Mixed precision training               73.6               91.11
FP16 training                          71.36              90.84
FP16 training, loss scale = 100        74.13              91.51
Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=512, no augmentation, 1 crop, 1 model.

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, "Training with mixed precision", NVIDIA GTC 2017
Part 3: Efficient Training Algorithm
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
Model Distillation

Teacher model 1 (GoogleNet), Teacher model 2 (VGGNet), Teacher model 3 (ResNet)
→ each teacher's knowledge is transferred into a single student model
The student model has a much smaller model size.

Softened outputs reveal the dark knowledge
Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
Softened outputs reveal the dark knowledge
• Method: Divide the logits by a "temperature" before the softmax to get a much softer distribution
• Result: Starting from a trained model that classifies 58.9% of the test frames correctly, the new (distilled) model converges to 57.0% correct even when it is trained on only 3% of the data.
Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
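A minimal sketch of the temperature-softened distillation loss described above (the T² factor and the α weighting follow Hinton et al.; the logits and labels here are random placeholders).

```python
# Distillation: student matches the teacher's temperature-softened outputs,
# combined with the usual hard-label loss.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft_targets = softmax(teacher_logits, T)                                  # "dark knowledge"
    soft_loss = -(soft_targets * np.log(softmax(student_logits, T))).sum(axis=-1).mean()
    hard_loss = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * (T * T) * soft_loss + (1 - alpha) * hard_loss               # T^2 rescales the soft term

teacher = np.random.randn(8, 10)
student = np.random.randn(8, 10)
labels = np.random.randint(0, 10, size=8)
print(distillation_loss(student, teacher, labels))
```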
Part 3: Efficient Training Algorithm
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
DSD: Dense-Sparse-Dense Training
[Han et al. ICLR 2017]

Dense → (pruning, sparsity constraint) → Sparse → (re-dense, increase model capacity) → Dense

Figure 1: Dense-Sparse-Dense training flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.
Algorithm 1 (workflow of DSD training): initialize W(0) ~ N(0, Σ); output W(t).

DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).

Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017
DSD: Intuition
learn the trunk first
then learn the leaves
Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
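An illustrative sketch of the DSD schedule; `train_step` is a stand-in for one epoch of SGD in a real framework, and the point is only the mask logic of the three phases.

```python
# Dense -> Sparse -> Dense: train dense, prune and train under a sparsity mask,
# then remove the mask (re-dense) and train again from the sparse solution.
import numpy as np

def train_step(w, mask=None):
    grad = np.random.randn(*w.shape).astype(w.dtype) * 0.01   # stand-in gradient
    w = w - 0.1 * grad
    if mask is not None:
        w = w * mask          # keep pruned weights at zero during the sparse phase
    return w

w = np.random.randn(1024).astype(np.float32)
for _ in range(10):                     # 1) initial dense training
    w = train_step(w)

thresh = np.quantile(np.abs(w), 0.5)    # 2) prune 50% of weights, train under the sparsity constraint
mask = (np.abs(w) > thresh).astype(w.dtype)
w = w * mask
for _ in range(10):
    w = train_step(w, mask)

for _ in range(10):                     # 3) re-dense: restore pruned weights (from zero) and train all
    w = train_step(w)
```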
[Han et al. ICLR 2017]
DSD is General Purpose: Vision, Speech, Natural Language

Network        Domain    Dataset     Type   Baseline   DSD     Abs. Imp.   Rel. Imp.
GoogleNet      Vision    ImageNet    CNN    31.1%      30.0%   1.1%        3.6%
VGG-16         Vision    ImageNet    CNN    31.5%      27.2%   4.3%        13.7%
ResNet-18      Vision    ImageNet    CNN    30.4%      29.3%   1.1%        3.7%
ResNet-50      Vision    ImageNet    CNN    24.0%      23.2%   0.9%        3.5%
NeuralTalk     Caption   Flickr-8K   LSTM   16.8       18.5    1.7         10.1%
DeepSpeech     Speech    WSJ'93      RNN    33.6%      31.6%   2.0%        5.8%
DeepSpeech-2   Speech    WSJ'93      RNN    14.5%      13.4%   1.1%        7.4%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD
The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from the Caffe Model Zoo; ResNet-18 and ResNet-50 are from fb.resnet.torch.
Compression
Acceleration
Regularization
135
https://songhan.github.io/DSD
DSD on Caption Generation
Baseline: a boy in a red shirt is climbing a rock wall. Sparse: a young girl is jumping off a tree. DSD: a young girl in a pink shirt is swinging on a swing.
Baseline: a basketball player in a red uniform is playing with a ball. Sparse: a basketball player in a blue uniform is jumping over the goal. DSD: a basketball player in a white uniform is trying to make a shot.
Baseline: two dogs are playing together in a field.
Baseline: a man and a woman are sitting on a bench.
Sparse: two dogs are playing in a field.
Sparse: a man is sitting on a bench with his hands in the air. DSD: a man is sitting on a bench with his arms folded.
DSD: two dogs are playing in the grass.
Baseline: a person in a red jacket is riding a bike through the woods. Sparse: a car drives through a mud puddle. DSD: a car drives through a forest.
Figure 3: Visualization of how DSD training improves the performance of image captioning. Baseline model: Andrej Karpathy, NeuralTalk model zoo.
The good performance of DSD training generalizes beyond these examples; more image caption results generated by DSD training are provided in the supplementary material.
Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017
137
Supplementary Material: More Examples of DSD Training Improving the NeuralTalk Auto-Caption System (Images from the Flickr-8K Test Set)
DSD on Caption Generation
Baseline: a boy is swimming in a pool. Sparse: a small black dog is jumping into a pool. DSD: a black and white dog is swimming in a pool.
Baseline: a group of people sit on a bench in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are standing in a fountain.
Baseline: a group of people are standing in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are walking in a park.
Baseline: two girls in bathing suits are playing in the water. Sparse: two children are playing in the sand. DSD: two children are playing in the sand.
Baseline: a man in a red shirt and jeans is riding a bicycle down a street. Sparse: a man in a red shirt and a woman in a wheelchair. DSD: a man and a woman are riding on a street.
Baseline: a man in a black jacket and a black jacket is smiling. Sparse: a man and a woman are standing in front of a mountain. DSD: a man in a black jacket is standing next to a man in a black shirt.
Baseline: a group of football players in red uniforms. Sparse: a group of football players in a field. DSD: a group of football players in red and white uniforms.
Baseline: a dog runs through the grass. Sparse: a dog runs through the grass. DSD: a white and brown dog is running through the grass.
Baseline model: Andrej Karpathy, Neural Talk model zoo.
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
CPUs for Training
CPUs Are Targeting Deep Learning
Intel Knights Landing (2016)
• 7 TFLOPS FP32
• 16GB MCDRAM – 400 GB/s
• 245W TDP
• 29 GFLOPS/W (FP32)
• 14nm process
Knights Mill: next-gen Xeon Phi "optimized for deep learning". Intel announced new vector instructions for deep learning (AVX512-4VNNIW and AVX512-4FMAPS), October 2016.
Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial. Image Source: Intel; Data Source: Next Platform.
2
GPUs for Training
GPUs Are Targeting Deep Learning
Nvidia PASCAL GP100 (2016)
• 10/20 TFLOPS FP32/FP16
• 16GB HBM – 750 GB/s
• 300W TDP
• 67 GFLOPS/W (FP16)
• 16nm process
• 160GB/s NVLink
Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Data Source: NVIDIA
Source: Nvidia
3
GPUs for Training
Nvidia Volta GV100 (2017)
• 15 FP32 TFLOPS
• 120 Tensor TFLOPS
• 16GB HBM2 @ 900GB/s
• 300W TDP
• 12nm process
• 21B transistors
• die size: 815 mm²
• 300GB/s NVLink
Data Source: NVIDIA
What's new in Volta: Tensor Core
4x4x4 matrix multiply-accumulate: a new instruction that performs a 4x4x4 mixed-precision FMA matrix operation per clock, giving a 12x increase in throughput for the Volta V100 compared to the Pascal P100.
8
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
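A software emulation of the Tensor Core operation described above, D = A×B + C on 4x4 tiles with FP16 inputs and FP32 accumulation; the hardware instruction itself is not exposed like this.

```python
# Mixed-precision tile: FP16 inputs, products accumulated in FP32.
import numpy as np

A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # FP32 accumulation of FP16 operands
```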
Pascal v.s. Volta
Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations. https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal v.s. Volta
Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100.
Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the ResNet-50 deep neural network 3.7x faster than Tesla P100. https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
The GV100 SM is partitioned into four processing blocks, each with: • • • • • • • •
8 FP64 Cores 16 FP32 Cores 16 INT32 Cores two of the new mixed-precision Tensor Cores for deep learning a new L0 instruction cache one warp scheduler one dispatch unit a 64 KB Register File.
https://devblogs.nvidia.com/parallelforall/ cuda-9-features-revealed/
Tesla Product              Tesla K40         Tesla M40         Tesla P100        Tesla V100
GPU                        GK110 (Kepler)    GM200 (Maxwell)   GP100 (Pascal)    GV100 (Volta)
GPU Boost Clock            810/875 MHz       1114 MHz          1480 MHz          1455 MHz
Peak FP32 TFLOP/s          5.04              6.8               10.6              15
Peak Tensor Core TFLOP/s   -                 -                 -                 120
Memory Interface           384-bit GDDR5     384-bit GDDR5     4096-bit HBM2     4096-bit HBM2
Memory Size                Up to 12 GB       Up to 24 GB       16 GB             16 GB
TDP                        235 Watts         250 Watts         300 Watts         300 Watts
Transistors                7.1 billion       8 billion         15.3 billion      21.1 billion
GPU Die Size               551 mm²           601 mm²           610 mm²           815 mm²
Manufacturing Process      28 nm             28 nm             16 nm FinFET+     12 nm FFN
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
GPU / TPU
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Train and run machine learning models on the new Google Cloud TPUs.

Google Cloud TPU
Cloud TPU delivers up to 180 teraflops to train and run machine learning models. (Source: Google Blog)
149
Google Cloud TPU
A "TPU pod" built with 64 second-generation TPUs delivers up to 11.5 petaflops of machine learning acceleration. "One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod." — Google Blog
Introducing Cloud TPUs
150
Wrap-Up

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Future
Smart
Low Latency
Privacy
Mobility
Energy-Efficient 152
Outlook: the Focus for Computation
PC Era: Computing → Mobile-First Era: Mobile Computing → AI-First Era: Brain-Inspired Cognitive Computing
Sundar Pichai, Google I/O, 2016
153
Thank you! stanford.edu/~songhan