Efficient Methods and Hardware for Deep Learning
Song Han Stanford University May 25, 2017
Intro
Song Han — PhD Candidate, Stanford
Bill Dally — Chief Scientist, NVIDIA; Professor, Stanford
2
Deep Learning is Changing Our Lives
Self-Driving Cars, Machine Translation, AlphaGo, Smart Robots
(Images are licensed under CC-BY 2.0 or in the public domain)
3
Models are Getting Larger

Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP, ~16% error → ResNet (Microsoft, 2015), 152 layers, 22.6 GFLOP, ~3.5% error — roughly 16x more training ops.
Speech recognition: Deep Speech 1 (Baidu, 2014), 80 GFLOP, 7,000 hrs of data, ~8% error → Deep Speech 2 (2015), 465 GFLOP, 12,000 hrs of data, ~5% error — roughly 10x more training ops.

Dally, NIPS'2016 workshop on Efficient Methods for Deep Neural Networks
4
The First Challenge: Model Size
Hard to distribute large models through over-the-air updates.
(App icon and phone images are in the public domain or licensed under CC-BY 2.0)
5
The Second Challenge: Speed

Network      Error rate   Training time
ResNet-18    10.76%       2.5 days
ResNet-50    7.02%        5 days
ResNet-101   6.21%        1 week
ResNet-152   6.16%        1.5 weeks

Such long training times limit ML researchers' productivity.
Training time benchmarked with fb.resnet.torch using four M40 GPUs.
6
The Third Challenge: Energy Efficiency
AlphaGo: 1920 CPUs and 280 GPUs, $3000 electric bill per game.
On mobile: drains the battery. In the data center: increases TCO.
(Images are in the public domain or licensed under CC-BY 2.0)
7
Where is the Energy Consumed?
Larger model => more memory references => more energy

Relative Energy Cost (45nm CMOS)

Operation              Energy [pJ]   Relative Cost
32 bit int ADD         0.1           1
32 bit float ADD       0.9           9
32 bit Register File   1             10
32 bit int MULT        3.1           31
32 bit float MULT      3.7           37
32 bit SRAM Cache      5             50
32 bit DRAM Memory     640           6400

Energy table for the 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
9
Where is the Energy Consumed?
Larger model => more memory references => more energy
(Same 45nm CMOS energy table as above: memory access is 3 orders of magnitude more energy expensive than simple arithmetic.)

How to make deep learning more efficient?
10
Improve the Efficiency of Deep Learning by Algorithm-Hardware Co-Design
11
Application as a Black Box
Algorithm
Spec 2006 This image is in the public domain
Hardware
CPU
This image is in the public domain
12
Open the Box before Hardware Design
?
Algorithm
This image is in the public domain
Hardware
?PU
This image is in the public domain
Breaks the boundary between algorithm and hardware
13
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Hardware 101: the Family

General Purpose*: CPU (latency oriented), GPU (throughput oriented)
Specialized HW:   FPGA (programmable logic), ASIC (fixed logic)
* including GPGPU
Hardware 101: Number Representation

Floating point: (-1)^S x (1.M) x 2^(E - bias)
FP32:  1 sign bit, 8 exponent bits, 23 mantissa bits
FP16:  1 sign bit, 5 exponent bits, 10 mantissa bits
Int32: 1 sign bit, 31 magnitude bits
Int16: 1 sign bit, 15 magnitude bits
Int8:  1 sign bit, 7 magnitude bits
Fixed point: sign, integer bits, fraction bits (programmable radix point)

Format   Range              Accuracy
FP32     10^-38 – 10^38     .000006%
FP16     6x10^-5 – 6x10^4   .05%
Int32    0 – 2x10^9         1/2
Int16    0 – 6x10^4         1/2
Int8     0 – 127            1/2
Dally, High Performance Hardware for Machine Learning, NIPS’2015
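A tiny sketch (not from the slides) of the fixed-point idea above; the function names and the choice of 4 fractional bits are illustrative assumptions.

```python
# Quantize FP32 values to signed 8-bit fixed point with a chosen radix point.
import numpy as np

def to_fixed(x, frac_bits=4):
    """Round to signed int8 fixed point with `frac_bits` fractional bits."""
    scaled = np.round(x * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)

def from_fixed(q, frac_bits=4):
    """Recover an approximate float from the fixed-point codes."""
    return q.astype(np.float32) / (1 << frac_bits)

w = np.array([2.09, -0.98, 1.48, 0.09], dtype=np.float32)
q = to_fixed(w)
print(q)              # int8 codes
print(from_fixed(q))  # [ 2.0625 -1.      1.5     0.0625] -- limited precision
```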
Hardware 101: Number Representation — Energy and Area per Operation

Operation             Energy (pJ)   Area (µm²)
8b Add                0.03          36
16b Add               0.05          67
32b Add               0.1           137
16b FP Add            0.4           1360
32b FP Add            0.9           4184
8b Mult               0.2           282
32b Mult              3.1           3495
16b FP Mult           1.1           1640
32b FP Mult           3.7           7700
32b SRAM Read (8KB)   5             N/A
32b DRAM Read         640           N/A

Energy numbers are from Mark Horowitz, "Computing's Energy Problem (and what we can do about it)", ISSCC 2014. Area numbers are from synthesized results using Design Compiler under the TSMC 45nm tech node; FP units used the DesignWare Library.
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Part 1: Algorithms for Efficient Inference
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation
Pruning Neural Networks
[Lecun et al. NIPS’89] [Han et al. NIPS’15]
Pruning
Trained Quantization
Huffman Coding
24
Pruning Neural Networks
[Han et al. NIPS’15]
Pruning analogy: -0.01x² + x + 1 ≈ x + 1 (drop the negligible term)
AlexNet: 60 million → 6 million connections
Pruning
Trained Quantization
10x fewer connections
Huffman Coding
25
[Han et al. NIPS’15]
Retrain to Recover Accuracy
[Plot: accuracy loss (+0.5% to -4.5%) vs. parameters pruned away (40%–100%), comparing Pruning alone with Pruning+Retraining. Pruning alone degrades accuracy sharply beyond ~80% sparsity; pruning followed by retraining keeps the accuracy loss near zero.]
Pruning
Trained Quantization
Huffman Coding
28
[Han et al. NIPS’15]
Iteratively Retrain to Recover Accuracy
[Plot: accuracy loss vs. parameters pruned away (40%–100%), comparing Pruning, Pruning+Retraining, and Iterative Pruning and Retraining. Iterative pruning and retraining pushes the point of no accuracy loss to roughly 90% of parameters pruned.]
Pruning
Trained Quantization
Huffman Coding
29
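The pruning flow on these slides can be sketched in a few lines. This is an illustrative NumPy version, not the authors' code; the `retrain` hook is a placeholder for a framework training loop.

```python
# Iterative magnitude pruning: zero out the smallest weights, optionally retrain,
# then raise the sparsity target and repeat.
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude weights so `sparsity` fraction become zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def iterative_prune(weights, final_sparsity=0.9, steps=5, retrain=None):
    """Gradually raise sparsity; retraining between steps recovers accuracy."""
    for s in np.linspace(final_sparsity / steps, final_sparsity, steps):
        weights, mask = prune_by_magnitude(weights, s)
        if retrain is not None:
            weights = retrain(weights, mask)   # placeholder retraining hook
    return weights, mask

w = np.random.randn(1000).astype(np.float32)
w_pruned, mask = iterative_prune(w, final_sparsity=0.9, steps=5)
print(f"sparsity: {1 - mask.mean():.2f}")      # ~0.90
```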
Pruning RNN and LSTM
[Han et al. NIPS’15]
Example application: image captioning (CNN + RNN/LSTM).
References: Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Neural Image Caption Generator, Vinyals et al.; Convolutional Networks for Visual Recognition and Description, Donahue et al.; Visual Representation for Image Caption Generation, Chen and Zitnick.
Figure copyright IEEE, 2015; reproduced for educational purposes.
Pruning
Trained Quantization
Huffman Coding
30
Pruning RNN and LSTM
[Han et al. NIPS'15]
• Original: a basketball player in a white uniform is playing with a ball
  Pruned 90%: a basketball player in a white uniform is playing with a basketball
• Original: a brown dog is running through a grassy field
  Pruned 90%: a brown dog is running through a grassy area
• Original: a man is riding a surfboard on a wave
  Pruned 90%: a man in a wetsuit is riding a wave on a beach
• Original: a soccer player in red is running in the field
  Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Pruning
Trained Quantization
Huffman Coding
31
Pruning Happens in the Human Brain
Newborn: 50 trillion synapses → 1 year old: 1000 trillion synapses → Adolescent: 500 trillion synapses
Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172–172, 2013.
Pruning
Trained Quantization
Huffman Coding
32
[Han et al. NIPS’15]
Pruning Changes Weight Distribution
Before Pruning
After Pruning
After Retraining
Conv5 layer of AlexNet; representative of other network layers as well.
Pruning
Trained Quantization
Huffman Coding
33
Part 1: Algorithms for Efficient Inference
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation
[Han et al. ICLR’16]
Trained Quantization
Example: the weights 2.09, 2.12, 1.92, 1.87 are all represented by the single shared value 2.0
Pruning
Trained Quantization
Huffman Coding
35
[Han et al. ICLR’16]
Trained Quantization
Pruning: fewer weights → Quantization: fewer bits per weight → Huffman Encoding

Train Connectivity → Prune Connections → Train Weights  (9x-13x reduction, same accuracy)
Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book  (27x-31x reduction, same accuracy)
Encode Weights / Encode Index

Example: 2.09, 2.12, 1.92, 1.87 → shared value 2.0; 32-bit weights → 4-bit indices gives an 8x smaller memory footprint.
The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate to between 27x and 31x. Huffman coding gives more compression: between 35x and 49x.
Pruning
Trained Quantization
Huffman Coding
36
[Han et al. ICLR’16]
Trained Quantization: Weight Sharing

weights (32-bit float):          cluster index (2-bit uint):    centroids:     fine-tuned centroids:
 2.09  -0.98   1.48   0.09         3  0  2  1                   3:  2.00        1.96
 0.05  -0.14  -1.08   2.12         1  1  0  3                   2:  1.50        1.48
-0.91   1.92   0.00  -1.03         0  3  1  0                   1:  0.00       -0.04
 1.87   0.00   1.53   1.49         3  1  2  2                   0: -1.00       -0.97

During fine-tuning, the weight gradients are grouped by cluster index, reduced (summed) within each cluster, multiplied by the learning rate, and subtracted from the shared centroids.

Figure 3: Weight sharing by scalar quantization (top) and centroid fine-tuning (bottom).
Pruning
Trained Quantization
Huffman Coding
37
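A minimal sketch of the weight-sharing step above, using a tiny hand-rolled 1-D k-means on the same 4x4 example matrix. It is illustrative only, not the paper's implementation; the gradient values are stand-ins.

```python
# k-means weight sharing: cluster weights into a small codebook, store 2-bit indices,
# then fine-tune the shared centroids with grouped gradients.
import numpy as np

def kmeans_1d(values, k=4, iters=50):
    """Cluster scalar weights into k centroids (the codebook)."""
    centroids = np.linspace(values.min(), values.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = values[idx == c].mean()
    return centroids, idx

weights = np.array([[ 2.09, -0.98,  1.48,  0.09],
                    [ 0.05, -0.14, -1.08,  2.12],
                    [-0.91,  1.92,  0.00, -1.03],
                    [ 1.87,  0.00,  1.53,  1.49]], dtype=np.float32)

codebook, idx = kmeans_1d(weights.ravel(), k=4)     # 4 clusters -> 2-bit indices
shared = codebook[idx].reshape(weights.shape)       # quantized weight matrix

# Centroid fine-tuning: group gradients by cluster index, sum within each cluster,
# and take a gradient step on the shared centroid.
grad = np.random.randn(*weights.shape).astype(np.float32) * 0.05   # stand-in gradient
lr = 0.01
for c in range(len(codebook)):
    codebook[c] -= lr * grad.ravel()[idx == c].sum()
```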
Before Trained Quantization: Continuous Weight [histogram: count vs. weight value]
After Trained Quantization: Discrete Weight [histogram: weights collapse onto the shared centroid values]
After Trained Quantization: Discrete Weight after Training [histogram: the centroids shift slightly as they are fine-tuned]
[Han et al. ICLR'16]
Pruning
Trained Quantization
Huffman Coding
[Han et al. ICLR’16]
How Many Bits do We Need?
Pruning
Trained Quantization
Huffman Coding
48
[Han et al. ICLR’16]
Pruning + Trained Quantization Work Together
AlexNet on ImageNet
Pruning
Trained Quantization
Huffman Coding
50
[Han et al. ICLR’16]
Huffman Coding
Encode Weights → Encode Index
Same accuracy, 35x-49x total reduction
• Frequent weights: use fewer bits to represent
• In-frequent weights: use more bits to represent
Pruning
Trained Quantization
Huffman Coding
51
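An illustrative sketch of the Huffman step: code the quantized cluster indices so that frequent indices get shorter bitstrings. The index histogram below is made up for the example.

```python
# Huffman-code quantized weight indices; frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return {symbol: bitstring} built from symbol frequencies."""
    freq = Counter(symbols)
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], next_id, merged])
        next_id += 1
    return heap[0][2]

indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2   # skewed index histogram (made up)
code = huffman_code(indices)
huff_bits = sum(len(code[i]) for i in indices)
print(code)                                          # e.g. {0: '0', 1: '10', 2: '110', 3: '111'}
print(huff_bits, "bits vs", 2 * len(indices), "bits fixed-length")   # 28 vs 32
```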
[Han et al. ICLR’16]
Summary of Deep Compression
Pruning: fewer weights → Quantization: fewer bits per weight → Huffman Encoding

original network → Train Connectivity → Prune Connections → Train Weights  (9x-13x reduction, same accuracy)
→ Cluster the Weights → Generate Code Book → Quantize the Weights with Code Book → Retrain Code Book  (27x-31x reduction, same accuracy)
→ Encode Weights / Encode Index  (35x-49x reduction, same accuracy)

Figure 1: The three-stage compression pipeline: pruning, quantization and Huffman coding. Pruning reduces the number of weights by 10x, while quantization further improves the compression rate to between 27x and 31x. Huffman coding gives more compression: between 35x and 49x. The compression rate already includes the meta-data for the sparse representation. The compression scheme doesn't incur any accuracy loss.
Pruning / Trained Quantization / Huffman Coding
52
[Han et al. ICLR’16]
Results: Compression Ratio

Network     Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
LeNet-300   1070 KB         27 KB             40x                 98.36%              98.42%
LeNet-5     1720 KB         44 KB             39x                 99.20%              99.26%
AlexNet     240 MB          6.9 MB            35x                 80.27%              80.30%
VGGNet      550 MB          11.3 MB           49x                 88.68%              89.09%
GoogleNet   28 MB           2.8 MB            10x                 88.90%              88.92%
ResNet-18   44.6 MB         4.0 MB            11x                 89.24%              89.28%
Can we make compact models to begin with? Compression
Acceleration
Regularization
53
SqueezeNet: the Fire Module

Input (64 channels)
→ 1x1 Conv "Squeeze" (16 channels)
→ 1x1 Conv "Expand" (64 channels) and 3x3 Conv "Expand" (64 channels)
→ Output: Concat/Eltwise (128 channels)
Vanilla Fire module
Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size"

Google TPU
• 25X as many MACs vs GPU, >100X as many MACs vs CPU
• 4 MiB of on-chip Accumulator memory
• 24 MiB of on-chip Unified Buffer (activation memory) — 3.5X as much on-chip memory vs GPU
• Two 2133MHz DDR3 DRAM channels
• 8 GiB of off-chip weight DRAM memory
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit
79
Google TPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 80
Inference Datacenter Workload
Layers per network, weights, and TPU operational intensity:

Name    LOC    FC   Conv  Vector  Pool  Total  Nonlinear fn    Weights  TPU Ops/Weight Byte  TPU Batch Size  % Deployed
MLP0    0.1k   5    -     -       -     5      ReLU            20M      200                  200             61% (MLPs)
MLP1    1k     4    -     -       -     4      ReLU            5M       168                  168
LSTM0   1k     24   -     34      -     58     sigmoid, tanh   52M      64                   64              29% (LSTMs)
LSTM1   1.5k   37   -     19      -     56     sigmoid, tanh   34M      96                   96
CNN0    1k     -    16    -       -     16     ReLU            8M       2888                 8               5% (CNNs)
CNN1    1k     4    72    -       13    89     ReLU            100M     1750                 32
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 81
Roofline Model: Identify Performance Bottleneck
Williams, Waterman, and Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", Communications of the ACM 52.4 (2009): 65-76. David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 82
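A minimal sketch of the roofline calculation the slides refer to. The peak-throughput and memory-bandwidth numbers below are illustrative, TPU-like placeholders, not measured values.

```python
# Roofline: attainable throughput = min(peak compute, operational intensity x memory bandwidth).
def attainable_teraops(ops_per_byte, peak_teraops=92.0, mem_bw_gb_s=34.0):
    """Attainable TeraOps/s for a workload with the given operational intensity."""
    bandwidth_bound = ops_per_byte * mem_bw_gb_s / 1000.0   # GB/s * ops/byte -> GigaOps/s -> Tera
    return min(peak_teraops, bandwidth_bound)

for intensity in [64, 200, 1750, 2888]:   # ops per weight byte (LSTM-, MLP-, CNN-like workloads)
    print(intensity, "ops/byte ->", round(attainable_teraops(intensity), 1), "TeraOps/s")
```

Low-intensity workloads (small batch, few ops per byte) land on the sloped, bandwidth-bound part of the roofline, which is why the low-latency, small-batch inference workloads sit far below peak.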
TPU Roofline
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 83
Log Rooflines for CPU, GPU, TPU
Star = TPU Triangle = GPU Circle = CPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 84
Linear Rooflines for CPU, GPU, TPU
Star = TPU Triangle = GPU Circle = CPU
David Patterson and the Google TPU Team, In-Data Center Performance Analysis of a Tensor Processing Unit 85
Why so far below the rooflines? Low latency requirement => can't batch more => low ops/byte
How to solve this? Smaller memory footprint => need to compress the model
Challenge: hardware that can run inference directly on the compressed model
86
EIE: the First DNN Accelerator for
Sparse, Compressed Model
Compression
Acceleration
Regularization
[Han et al. ISCA’16]
87
EIE: the First DNN Accelerator for Sparse, Compressed Models
[Han et al. ISCA'16]
Rules of thumb: 0 * A = 0 (skip zero weights), W * 0 = 0 (skip zero activations); 2.09, 1.92 => 2 (weight sharing)

Sparse Weight:     90% static sparsity  → 10x less computation, 5x less memory footprint
Sparse Activation: 70% dynamic sparsity → 3x less computation
Weight Sharing:    4-bit weights        → 8x less memory footprint

Compression
Acceleration
Regularization
88
EIE: Reduce Memory Access by Compression

Logically: b = ReLU(W a), with a sparse activation vector a = (0, a1, 0, a3) and a sparse 8x4 weight matrix W whose rows are interleaved across four processing elements PE0–PE3 (row i belongs to PE i mod 4).

Physically, each PE stores only its non-zero weights in a CSC-like format. For PE0 the stored arrays are:
  Virtual Weight:  W0,0  W0,1  W4,2  W0,3  W4,3
  Relative Index:  0  1  2  0  ...
  Column Pointer:  0  1  2  3  ...
Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016, Hotchips 2016
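A software sketch (not EIE's hardware) of the same idea: store only the non-zero weights per column, broadcast only the non-zero activations, and multiply-accumulate on non-zeros only.

```python
# Sparse matrix x sparse activation vector in a CSC-like format, exploiting
# both rules of thumb: skip zero weights and skip zero activations.
import numpy as np

W = np.array([[1.0, 2.0, 0.0, 3.0],
              [0.0, 0.0, 4.0, 0.0],
              [0.0, 5.0, 0.0, 6.0],
              [0.0, 0.0, 0.0, 0.0]], dtype=np.float32)
a = np.array([0.0, 0.5, 0.0, 2.0], dtype=np.float32)   # sparse activations

# Build CSC storage: per column, the non-zero values and their row indices.
col_ptr, values, rows = [0], [], []
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    rows.extend(nz)
    values.extend(W[nz, j])
    col_ptr.append(len(values))

b = np.zeros(W.shape[0], dtype=np.float32)
for j in np.nonzero(a)[0]:                 # skip zero activations entirely
    for k in range(col_ptr[j], col_ptr[j + 1]):
        b[rows[k]] += values[k] * a[j]     # multiply-accumulate on non-zeros only

b = np.maximum(b, 0.0)                     # ReLU
assert np.allclose(b, np.maximum(W @ a, 0.0))
```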
[Han et al. ISCA’16]
Dataflow: b = ReLU(W a) with a = (0, a1, 0, a3). Only the non-zero activations a1 and a3 are broadcast to the PEs; each PE multiplies them only against its stored non-zero weights in the corresponding columns and accumulates into its slice of b.
rule of thumb:
0*A=0 W*0=0 Compression
Acceleration
Regularization
90
EIE Architecture
[Han et al. ISCA'16]

Compressed DNN model → per-PE decode pipeline:
• Weight decode: a 4-bit encoded (virtual) weight is expanded to a 16-bit real weight via a codebook lookup.
• Index decode: a 4-bit relative index is accumulated into a 16-bit absolute index (address accumulation).
The decoded weight and address feed the ALU, which accumulates results into the prediction memory.
An efficient inference engine that works directly on the compressed deep neural network model. For embedded and mobile applications, the resource demands of uncompressed models are prohibitive.
Shared weights: 2.09, 1.92 => 2; rules of thumb: 0 * A = 0, W * 0 = 0.
Compression
Acceleration
Regularization
100
[Han et al. ISCA’16]
Micro Architecture for each PE

Activation queue (act value + act index) → pointer read (even/odd pointer SRAM banks, column start/end address) → sparse matrix access (sparse matrix SRAM, encoded weight + relative index) → weight decoder and address accumulator (absolute address) → arithmetic unit → act read/write (source/destination activation registers, act SRAM) → ReLU and leading-non-zero detect.

Compression
Acceleration
Regularization
101
Speedup on EIE
[Han et al. ISCA'16]

[Bar chart: speedup over the CPU Dense baseline (1x) for CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, and EIE on nine compressed layers: Alex-6/7/8, VGG-6/7/8, NT-We, NT-Wd, NT-LSTM. Geometric mean: EIE is 189x faster than the CPU dense baseline and 13x faster than the GPU.]
Caption: Speedups of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model. There is no batching in all cases.
Layout of one PE in EIE under the TSMC 45nm process.
Table II: Implementation results of one PE in EIE, broken down by component type and by module; total power is 9.157 mW and total area is 638,024 µm² per PE, and the critical path of EIE is 1.15 ns.
Table III: Benchmarks from state-of-the-art DNN models — compressed AlexNet (Alex-6/7/8) and compressed VGG-16 (VGG-6/7/8) for large-scale image classification and object detection, and compressed NeuralTalk (NT-We, NT-Wd, NT-LSTM) with RNN and LSTM for automatic image captioning. Remaining weight density is 4%–25%, activation density 18%–100%, and FLOPs are 1%–11% of the dense layers.
Evaluation methodology (from the paper): a cycle-accurate C++ simulator plus Verilog RTL, synthesized with Synopsys Design Compiler, placed and routed with IC Compiler, SRAM modeled with CACTI, and power estimated with Prime-Time PX. Baselines: Intel Core i7-5930k CPU (MKL CBLAS GEMV for dense, MKL SPBLAS CSRMV for sparse; power via pcm-power), NVIDIA GeForce GTX Titan X GPU (cuBLAS GEMV for dense, CSR-format sparse kernels; power via nvidia-smi), and an NVIDIA mobile GPU. The uncompressed models come from the Caffe and NeuralTalk model zoos; the compressed models are produced as in the Deep Compression paper.

Compression
Acceleration
Regularization
102
Energy Efficiency on EIE
[Han et al. ISCA'16]

[Bar chart: energy efficiency relative to the CPU Dense baseline (1x) for CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed, and EIE on the same nine layers. EIE's per-layer energy efficiency over the CPU dense baseline ranges from roughly 8,000x to 120,000x, with a geometric mean of about 24,000x; it is about 3,400x more energy efficient than the GPU.]
Caption: Energy efficiency of GPU, mobile GPU and EIE compared with CPU running the uncompressed DNN model. There is no batching in all cases.

Compression
Acceleration
Regularization
103
[Han et al. ISCA'16]
Comparison: Throughput
EIE throughput (layers/s, log scale) compared across platforms:
Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE (ASIC, 45nm, 64 PEs), EIE (ASIC, 28nm, 256 PEs).
Compression
Acceleration
Regularization
104
[Han et al. ISCA'16]
Comparison: Energy Efficiency
EIE energy efficiency (layers/J, log scale) compared across the same platforms:
Core-i7 5930k (CPU, 22nm), TitanX (GPU, 28nm), Tegra K1 (mGPU, 28nm), A-Eye (FPGA, 28nm), DaDianNao (ASIC, 28nm), TrueNorth (ASIC, 28nm), EIE (ASIC, 45nm, 64 PEs), EIE (ASIC, 28nm, 256 PEs).
Compression
Acceleration
Regularization
105
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Part 3: Efficient Training — Algorithms
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
Moore's law made CPUs 300x faster than in 1990.
But it's over…
C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
Data Parallel – Run multiple inputs in parallel
Dally, High Performance Hardware for Machine Learning, NIPS’2015
Data Parallel – Run multiple inputs in parallel
• Doesn't affect latency for one input
• Requires P-fold larger batch size
• For training, requires coordinated weight update
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Parameter Update: one method to achieve scale is parallelization.
Parameter Server: model workers compute updates ∆p on their data shards and send them to the parameter server, which applies p' = p + ∆p and sends the updated parameters p' back to the workers.
Large Scale Distributed Deep Networks, Jeff Dean et al., NIPS 2012
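An illustrative sketch of the synchronous parameter-server update p' = p + ∆p with a stand-in linear model; this is NumPy pseudocode, not Dean et al.'s system.

```python
# Data-parallel update: workers compute gradients on their shards,
# the server aggregates and applies p' = p + delta_p.
import numpy as np

def worker_gradient(params, shard_x, shard_y):
    """Gradient of a linear least-squares loss on one data shard (stand-in model)."""
    pred = shard_x @ params
    return shard_x.T @ (pred - shard_y) / len(shard_y)

rng = np.random.default_rng(0)
params = np.zeros(4)
x, y = rng.normal(size=(64, 4)), rng.normal(size=64)
shards = np.array_split(np.arange(64), 4)          # 4 model workers, 4 data shards

lr = 0.1
for step in range(100):
    grads = [worker_gradient(params, x[s], y[s]) for s in shards]
    delta_p = -lr * np.mean(grads, axis=0)          # server aggregates the workers' updates
    params = params + delta_p                       # p' = p + delta_p
```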
Model Parallel
Split up the Model – i.e. the network
Dally, High Performance Hardware for Machine Learning, NIPS’2015
Model-Parallel Convolution – by output region (x,y)

6D Loop:
Forall output map j
  For each input map k
    For each pixel x,y
      For each kernel element u,v
        B[x][y][j] += A[x-u][y-v][k] * K[u][v][k][j]

Each processor computes a different (x,y) region of the output maps B from the input maps A and the kernels K.
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Model-Parallel Convolution – by output map j (filter)

Same 6D loop; here the parallel (Forall) dimension is the output map j, so each processor owns a subset of the filters K[·][·][·][j] and produces the corresponding output maps B[·][·][j]. A code sketch follows the citation below.
Dally, High Performance Hardware for Machine Learning, NIPS’2015
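A NumPy sketch of the 6D loop above with the Forall-over-output-maps dimension split across two simulated processors (written as cross-correlation, A[x+u][y+v], so the indices stay in range); illustrative only.

```python
# Model-parallel convolution by output map j: each "processor" computes its own filters.
import numpy as np

def conv_partition(A, K, j_start, j_end):
    """Output maps j_start..j_end-1 of B[x][y][j] += A[x+u][y+v][k] * K[u][v][k][j]."""
    X, Y, C = A.shape
    U, V, _, J = K.shape
    B = np.zeros((X - U + 1, Y - V + 1, j_end - j_start), dtype=A.dtype)
    for j in range(j_start, j_end):          # the parallel (Forall) dimension
        for k in range(C):
            for x in range(B.shape[0]):
                for y in range(B.shape[1]):
                    for u in range(U):
                        for v in range(V):
                            B[x, y, j - j_start] += A[x + u, y + v, k] * K[u, v, k, j]
    return B

A = np.random.randn(8, 8, 3).astype(np.float32)        # input maps
K = np.random.randn(3, 3, 3, 4).astype(np.float32)     # kernels: U x V x in x out
halves = [conv_partition(A, K, 0, 2), conv_partition(A, K, 2, 4)]   # two "processors"
B = np.concatenate(halves, axis=2)                      # gather the output maps
```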
Model Parallel Fully-Connected Layer (M x V)

b_i = sum_j W_ij * a_j — the output activations b are a matrix-vector product of the weight matrix W with the input activations a.
Split W by rows (output activations): each processor holds a horizontal slice of W and computes the corresponding slice of b.
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Hyper-Parameter Parallel
Try many alternative networks in parallel
Dally, High Performance Hardware for Machine Learning, NIPS’2015
Summary of Parallelism
• Lots of parallelism in DNNs
  • 16M independent multiplies in one FC layer
  • Limited by overhead to exploit a fraction of this
• Data parallel
  • Run multiple training examples in parallel
  • Limited by batch size
• Model parallel
  • Split model over multiple processors
  • By layer
  • Conv layers by map region
  • Fully-connected layers by output activation
• Easy to get 16-64 GPUs training one model in parallel
Dally, High Performance Hardware for Machine Learning, NIPS'2015
Part 3: Efficient Training — Algorithms
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
Mixed Precision
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Mixed Precision Training (Volta training method)

W (F16) → FWD with F16 activations → BWD-A producing F16 activation gradients → BWD-W producing F16 weight gradients
Master-W (F32) → F32 weight update using the F16 weight gradients → updated Master-W (F32), rounded to F16 for the next iteration
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017
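An illustrative NumPy sketch of the flow above: FP16 compute, an FP32 master copy of the weights for the update, and loss scaling so that small FP16 gradients don't flush to zero. Hyperparameters are arbitrary and this is not NVIDIA's exact recipe.

```python
# Mixed precision: FP16 forward/backward, FP32 master weights, loss scaling.
import numpy as np

master_w = np.random.randn(256, 256).astype(np.float32) * 0.01   # FP32 master weights
loss_scale = 1024.0
lr = 0.01

for step in range(10):
    w16 = master_w.astype(np.float16)                        # FP16 copy used for compute
    x = np.random.randn(32, 256).astype(np.float16)
    act = x @ w16                                            # FWD in FP16
    grad_act = (act / act.size).astype(np.float16)           # stand-in upstream gradient
    grad_w16 = x.T @ (grad_act * np.float16(loss_scale))     # BWD-W in FP16, loss-scaled
    grad_w32 = grad_w16.astype(np.float32) / loss_scale      # unscale in FP32
    master_w -= lr * grad_w32                                # weight update on the FP32 master copy
```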
Inception V1 INCEPTION V1
12
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017
ResNet
RESNET50
13
Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, “Training with mixed precision”, NVIDIA GTC 2017
AlexNet results
Mode                                   Top1 accuracy, %   Top5 accuracy, %
FP32                                   58.62              81.25
Mixed precision training               58.12              80.71
FP16 training (no loss scaling)        54.89              78.12
FP16 training, loss scale = 1000       57.76              80.76
Without scaling the loss function, pure FP16 training loses accuracy; scaling the loss recovers it.
Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=1024, no augmentation, 1 crop, 1 model.

Inception-V3 results
Mode                                                  Top1 accuracy, %   Top5 accuracy, %
FP32                                                  71.75              90.52
Mixed precision training                              71.17              90.10
FP16 training, loss scale = 1                         71.17              90.33
FP16 training, loss scale = 1, FP16 master weights    70.53              90.14

ResNet-50 results
Mode                                   Top1 accuracy, %   Top5 accuracy, %
FP32                                   73.85              91.44
Mixed precision training               73.6               91.11
FP16 training                          71.36              90.84
FP16 training, loss scale = 100        74.13              91.51
Nvcaffe-0.16, DGX-1, SGD with momentum, 100 epochs, batch=512, no augmentation, 1 crop, 1 model.

Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius, "Training with mixed precision", NVIDIA GTC 2017
Part 3: Efficient Training Algorithm
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
Model Distillation

Teacher model 1 (GoogleNet), Teacher model 2 (VGGNet), Teacher model 3 (ResNet)
→ each teacher's knowledge is transferred into a single student model
The student model has a much smaller model size.

Softened outputs reveal the dark knowledge
Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
Softened outputs reveal the dark knowledge
• Method: Divide the logits by a "temperature" before the softmax to get a much softer distribution
• Result: Starting from a trained model that classifies 58.9% of the test frames correctly, the new (distilled) model converges to 57.0% correct even when it is trained on only 3% of the data.
Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
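A minimal sketch of the temperature-softened distillation loss described above (the T² factor and the α weighting follow Hinton et al.; the logits and labels here are random placeholders).

```python
# Distillation: student matches the teacher's temperature-softened outputs,
# combined with the usual hard-label loss.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft_targets = softmax(teacher_logits, T)                                  # "dark knowledge"
    soft_loss = -(soft_targets * np.log(softmax(student_logits, T))).sum(axis=-1).mean()
    hard_loss = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * (T * T) * soft_loss + (1 - alpha) * hard_loss               # T^2 rescales the soft term

teacher = np.random.randn(8, 10)
student = np.random.randn(8, 10)
labels = np.random.randint(0, 10, size=8)
print(distillation_loss(student, teacher, labels))
```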
Part 3: Efficient Training Algorithm
• 1. Parallelization • 2. Mixed Precision with FP16 and FP32 • 3. Model Distillation • 4. DSD: Dense-Sparse-Dense Training
DSD: Dense-Sparse-Dense Training
[Han et al. ICLR 2017]

Dense → (pruning, sparsity constraint) → Sparse → (re-dense, increase model capacity) → Dense

Figure 1: Dense-Sparse-Dense training flow. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.
Algorithm 1 (workflow of DSD training): initialize W(0) ~ N(0, Σ); output W(t).

DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).

Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017
DSD: Intuition
learn the trunk first
then learn the leaves
Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017
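An illustrative sketch of the DSD schedule; `train_step` is a stand-in for one epoch of SGD in a real framework, and the point is only the mask logic of the three phases.

```python
# Dense -> Sparse -> Dense: train dense, prune and train under a sparsity mask,
# then remove the mask (re-dense) and train again from the sparse solution.
import numpy as np

def train_step(w, mask=None):
    grad = np.random.randn(*w.shape).astype(w.dtype) * 0.01   # stand-in gradient
    w = w - 0.1 * grad
    if mask is not None:
        w = w * mask          # keep pruned weights at zero during the sparse phase
    return w

w = np.random.randn(1024).astype(np.float32)
for _ in range(10):                     # 1) initial dense training
    w = train_step(w)

thresh = np.quantile(np.abs(w), 0.5)    # 2) prune 50% of weights, train under the sparsity constraint
mask = (np.abs(w) > thresh).astype(w.dtype)
w = w * mask
for _ in range(10):
    w = train_step(w, mask)

for _ in range(10):                     # 3) re-dense: restore pruned weights (from zero) and train all
    w = train_step(w)
```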
[Han et al. ICLR 2017]
DSD is General Purpose: Vision, Speech, Natural Language

Network        Domain    Dataset     Type   Baseline   DSD     Abs. Imp.   Rel. Imp.
GoogleNet      Vision    ImageNet    CNN    31.1%      30.0%   1.1%        3.6%
VGG-16         Vision    ImageNet    CNN    31.5%      27.2%   4.3%        13.7%
ResNet-18      Vision    ImageNet    CNN    30.4%      29.3%   1.1%        3.7%
ResNet-50      Vision    ImageNet    CNN    24.0%      23.2%   0.9%        3.5%
NeuralTalk     Caption   Flickr-8K   LSTM   16.8       18.5    1.7         10.1%
DeepSpeech     Speech    WSJ'93      RNN    33.6%      31.6%   2.0%        5.8%
DeepSpeech-2   Speech    WSJ'93      RNN    14.5%      13.4%   1.1%        7.4%

Open Sourced DSD Model Zoo: https://songhan.github.io/DSD
The baseline results of AlexNet, VGG16, GoogleNet, SqueezeNet are from the Caffe Model Zoo; ResNet-18 and ResNet-50 are from fb.resnet.torch.
Compression
Acceleration
Regularization
135
https://songhan.github.io/DSD
DSD on Caption Generation
Baseline: a boy in a red shirt is climbing a rock wall. Sparse: a young girl is jumping off a tree. DSD: a young girl in a pink shirt is swinging on a swing.
Baseline: a basketball player in a red uniform is playing with a ball. Sparse: a basketball player in a blue uniform is jumping over the goal. DSD: a basketball player in a white uniform is trying to make a shot.
Baseline: two dogs are playing together in a field.
Baseline: a man and a woman are sitting on a bench.
Sparse: two dogs are playing in a field.
Sparse: a man is sitting on a bench with his hands in the air. DSD: a man is sitting on a bench with his arms folded.
DSD: two dogs are playing in the grass.
Baseline: a person in a red jacket is riding a bike through the woods. Sparse: a car drives through a mud puddle. DSD: a car drives through a forest.
Figure 3: Visualization of how DSD training improves the performance of image captioning. Baseline model: Andrej Karpathy, NeuralTalk model zoo.
The good performance of DSD training generalizes beyond these examples; more image caption results generated by DSD training are provided in the supplementary material.
Han et al. "DSD: Dense-Sparse-Dense Training for Deep Neural Networks", ICLR 2017
137
Supplementary Material: More Examples of DSD Training Improving the NeuralTalk Auto-Caption System (Images from the Flickr-8K Test Set)
DSD on Caption Generation
Baseline: a boy is swimming in a pool. Sparse: a small black dog is jumping into a pool. DSD: a black and white dog is swimming in a pool.
Baseline: a group of people sit on a bench in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are standing in a fountain.
Baseline: a group of people are standing in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are walking in a park.
Baseline: two girls in bathing suits are playing in the water. Sparse: two children are playing in the sand. DSD: two children are playing in the sand.
Baseline: a man in a red shirt and jeans is riding a bicycle down a street. Sparse: a man in a red shirt and a woman in a wheelchair. DSD: a man and a woman are riding on a street.
Baseline: a man in a black jacket and a black jacket is smiling. Sparse: a man and a woman are standing in front of a mountain. DSD: a man in a black jacket is standing next to a man in a black shirt.
Baseline: a group of football players in red uniforms. Sparse: a group of football players in a field. DSD: a group of football players in red and white uniforms.
Baseline: a dog runs through the grass. Sparse: a dog runs through the grass. DSD: a white and brown dog is running through the grass.
Baseline model: Andrej Karpathy, Neural Talk model zoo.
Agenda

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
CPUs for Training
CPUs Are Targeting Deep Learning
Intel Knights Landing (2016)
• 7 TFLOPS FP32
• 16GB MCDRAM – 400 GB/s
• 245W TDP
• 29 GFLOPS/W (FP32)
• 14nm process
Knights Mill: next-gen Xeon Phi "optimized for deep learning". Intel announced new vector instructions for deep learning (AVX512-4VNNIW and AVX512-4FMAPS), October 2016.
Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial. Image Source: Intel; Data Source: Next Platform.
2
GPUs for Training
GPUs Are Targeting Deep Learning
Nvidia PASCAL GP100 (2016)
• 10/20 TFLOPS FP32/FP16
• 16GB HBM – 750 GB/s
• 300W TDP
• 67 GFLOPS/W (FP16)
• 16nm process
• 160GB/s NVLink
Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Data Source: NVIDIA
Source: Nvidia
3
GPUs for Training
Nvidia Volta GV100 (2017)
• 15 FP32 TFLOPS
• 120 Tensor TFLOPS
• 16GB HBM2 @ 900GB/s
• 300W TDP
• 12nm process
• 21B transistors
• die size: 815 mm²
• 300GB/s NVLink
Data Source: NVIDIA
What's new in Volta: Tensor Core
4x4x4 matrix multiply-accumulate: a new instruction that performs a 4x4x4 mixed-precision FMA matrix operation per clock, giving a 12x increase in throughput for the Volta V100 compared to the Pascal P100.
8
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
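A software emulation of the Tensor Core operation described above, D = A×B + C on 4x4 tiles with FP16 inputs and FP32 accumulation; the hardware instruction itself is not exposed like this.

```python
# Mixed-precision tile: FP16 inputs, products accumulated in FP32.
import numpy as np

A = np.random.randn(4, 4).astype(np.float16)
B = np.random.randn(4, 4).astype(np.float16)
C = np.random.randn(4, 4).astype(np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # FP32 accumulation of FP16 operands
```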
Pascal v.s. Volta
Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations. https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
Pascal v.s. Volta
Left: Tesla V100 trains the ResNet-50 deep neural network 2.4x faster than Tesla P100.
Right: Given a target latency per image of 7ms, Tesla V100 is able to perform inference using the ResNet-50 deep neural network 3.7x faster than Tesla P100. https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
The GV100 SM is partitioned into four processing blocks, each with: • • • • • • • •
8 FP64 Cores 16 FP32 Cores 16 INT32 Cores two of the new mixed-precision Tensor Cores for deep learning a new L0 instruction cache one warp scheduler one dispatch unit a 64 KB Register File.
https://devblogs.nvidia.com/parallelforall/ cuda-9-features-revealed/
Tesla Product              Tesla K40         Tesla M40         Tesla P100        Tesla V100
GPU                        GK110 (Kepler)    GM200 (Maxwell)   GP100 (Pascal)    GV100 (Volta)
GPU Boost Clock            810/875 MHz       1114 MHz          1480 MHz          1455 MHz
Peak FP32 TFLOP/s          5.04              6.8               10.6              15
Peak Tensor Core TFLOP/s   -                 -                 -                 120
Memory Interface           384-bit GDDR5     384-bit GDDR5     4096-bit HBM2     4096-bit HBM2
Memory Size                Up to 12 GB       Up to 24 GB       16 GB             16 GB
TDP                        235 Watts         250 Watts         300 Watts         300 Watts
Transistors                7.1 billion       8 billion         15.3 billion      21.1 billion
GPU Die Size               551 mm²           601 mm²           610 mm²           815 mm²
Manufacturing Process      28 nm             28 nm             16 nm FinFET+     12 nm FFN
https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/
GPU / TPU
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Train and run machine learning models on the new Google Cloud TPUs.

Google Cloud TPU
Cloud TPU delivers up to 180 teraflops to train and run machine learning models. (Source: Google Blog)
149
Google Cloud TPU
A "TPU pod" built with 64 second-generation TPUs delivers up to 11.5 petaflops of machine learning acceleration. "One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod." — Google Blog
Introducing Cloud TPUs
150
Wrap-Up

             Inference                              Training
Algorithm    Algorithms for Efficient Inference     Algorithms for Efficient Training
Hardware     Hardware for Efficient Inference       Hardware for Efficient Training
Future
Smart
Low Latency
Privacy
Mobility
Energy-Efficient 152
Outlook: the Focus for Computation
PC Era: Computing → Mobile-First Era: Mobile Computing → AI-First Era: Brain-Inspired Cognitive Computing
Sundar Pichai, Google I/O, 2016
153
Thank you! stanford.edu/~songhan