Neural Network Assisted Tile Size Selection

Mohammed Rahman, Louis-Noël Pouchet and P. Sadayappan
Dept. of Computer Science and Engineering, Ohio State University

iWAPT 2010 Workshop, Berkeley, USA, June 22, 2010
Introduction: Overview

Situation:
- New advances in parametric tiling mean much more user code to be tuned
- The problem of tile size selection is complex and unsolved!

Our approach:
- Use machine learning to build a predictor of tile size performance, for a specific program
- Rely on the shape of the performance distribution to extract promising subspaces for empirical search
- Outcome: traversing < 2% of the space achieves 90+% of the maximal speedup
Problem Statement: Tiling

- Tiling partitions the computation into blocks
- Note we consider only rectangular tiling here
- Such a partitioning must respect the data dependences of the program for the tiling to be legal
Problem Statement: Parametric Tiling

Automatic parametric tiling [ICS’09, CGO’10]:
- Produces code where the tile dimensions are parameters
- Seamlessly finds/applies all the transformations required to make the code tilable
- Actual tile sizes are given at run time
- Very useful for tile size selection (no need to recompile)
- Recent progress has generalized the approach:
  - Operates on arbitrary affine-control loops (imperfectly nested)
  - Produces good-quality code
  - Can even expose pipeline parallelism if needed
  - Software (from OSU): Pluto, PrimeTile/DynTile/PTile
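As a toy illustration of the idea (not the actual C code produced by PrimeTile/DynTile/PTile), here is a 2D loop nest tiled rectangularly, with the tile sizes `Ti` and `Tj` taken as run-time parameters; the function name and the summation kernel are invented for this sketch:

```python
# Illustrative sketch only: a 2D loop nest tiled into Ti x Tj rectangular
# blocks. Because Ti and Tj are run-time parameters, trying a new tile size
# requires no recompilation.

def tiled_sum(A, Ti, Tj):
    """Sum a 2D list A, traversing it by Ti x Tj rectangular tiles."""
    n, m = len(A), len(A[0])
    total = 0.0
    for ti in range(0, n, Ti):                      # loop over tile origins
        for tj in range(0, m, Tj):
            for i in range(ti, min(ti + Ti, n)):    # intra-tile loops
                for j in range(tj, min(tj + Tj, m)):
                    total += A[i][j]
    return total

A = [[1.0] * 8 for _ in range(8)]
# Any (Ti, Tj) gives the same result; only locality (hence speed) changes.
print(tiled_sum(A, 4, 2))  # → 64.0
```

The `min(...)` bounds handle partial tiles at the edges, so any tile size is legal for this dependence-free kernel.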
Problem Statement: Tile Size Selection

Problem: how to select the tile size that gives the best performance?

Performance is determined by many interacting factors:
- data reuse within the execution of a tile;
- data reuse between tiles;
- the layout in memory of the data used in a tile;
- the relative penalty of misses at each level of the memory hierarchy, which is machine-dependent;
- the cache replacement policy;
- the interaction with other units, such as prefetching;
- the interaction with vectorization, to enable a profitable steady state for the vectorized loop(s);
- ...
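To see why reasoning about only the first of these factors falls short, consider the simplest static check: keeping one tile's data footprint within cache. The sketch below assumes a gemm-like kernel, 8-byte elements and a 32 KB cache (illustrative values, not taken from the talk), and deliberately ignores replacement policy, prefetching and vectorization:

```python
# A deliberately naive static model: does one tile's data fit in cache?
# The kernel shape, element size and cache size are assumptions for
# illustration; real performance also depends on replacement policy,
# prefetching, vectorization, etc.

def tile_footprint_bytes(Ti, Tj, Tk, elem_size=8):
    # For a matrix-product-like kernel C[i][j] += A[i][k] * B[k][j],
    # one Ti x Tj x Tk tile touches roughly these three array slices:
    return elem_size * (Ti * Tk + Tk * Tj + Ti * Tj)

def fits_in_cache(Ti, Tj, Tk, cache_bytes=32 * 1024):  # e.g. a 32 KB L1
    return tile_footprint_bytes(Ti, Tj, Tk) <= cache_bytes

print(fits_in_cache(30, 30, 30))     # → True  (footprint 21600 B)
print(fits_in_cache(128, 128, 128))  # → False (footprint 393216 B)
```

Many tile sizes pass this test yet perform very differently, which is what motivates learning the distribution instead.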
Problem Statement: Performance Distribution

[Figure: performance distributions of fdtd-2d and dsyr2k; execution time in seconds plotted against tile size configurations (Ti:Tj:Tk)]

- Search space: 10648 possible tile sizes, each dimension drawn from
  {1, 2, 4, 6, 8, 10, 12, 16, 30, 32, 40, 48, 64, 100, 128, 150, 200, 256, 300, 400, 500, 600}
- Machine: Core i7 (1 thread)
- 2 "standard" distribution shapes
Performance Prediction: Objectives

Goal: correlate execution time with tile sizes
- (Static) performance models do exist...
- ... but fail to capture the interplay between all hardware components
- They are usually better suited to well-understood problems (e.g., uniform reuse + square tiles)
- Another view: pruning the space of poorly performing tile sizes

Our approach:
- Build a neural network to model the performance distribution
- Focus directly on the execution time
- The ANN is dedicated to a specific program + dataset size
Performance Prediction: Neural Network

Layout:
- Fully connected, multi-layer perceptron (MLP)
- Input layer: the tile sizes (Ti, Tj, Tk)
- Output layer: predicted execution time
- One hidden layer of 30 neurons
- Implemented with the Stuttgart Neural Network Simulator (SNNS) library

Training:
- Select 5% (530 tuples) from the search space of 10648
- Run the program on the machine using the tile sizes specified by each tuple
- Train with resilient back-propagation (rprop), using the actual execution time of each tuple as the target
- Standard 10% cross-validation procedure
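The talk's predictor is built with SNNS and trained with rprop; the following is only a minimal numpy sketch of the same model shape (3 inputs, one hidden layer of 30 tanh units, 1 output), trained on synthetic data with plain batch gradient descent instead of rprop:

```python
import numpy as np

# Minimal stand-in for the paper's setup: an MLP mapping normalized
# (Ti, Tj, Tk) to a predicted execution time. The training data is
# synthetic and the optimizer is plain gradient descent, not rprop;
# both are simplifications for illustration.

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))          # fake (Ti, Tj, Tk) samples
y = (X ** 2).sum(axis=1, keepdims=True)           # fake "execution times"

H = 30                                            # hidden neurons, as in the talk
W1 = rng.normal(0, 0.5, (3, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                      # hidden layer
    return h, h @ W2 + b2                         # predicted time

def mse(pred, y):
    return float(((pred - y) ** 2).mean())

lr = 0.05
_, pred = forward(X)
loss_before = mse(pred, y)
for _ in range(2000):
    h, pred = forward(X)
    g = 2 * (pred - y) / len(X)                   # dLoss/dpred
    gW2 = h.T @ g; gb2 = g.sum(axis=0)
    gh = (g @ W2.T) * (1 - h ** 2)                # backprop through tanh
    gW1 = X.T @ gh; gb1 = gh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
_, pred = forward(X)
loss_after = mse(pred, y)
print(loss_before, "->", loss_after)              # loss shrinks with training
```

In the real setting, each training target is a measured execution time, and a held-out 10% of the tuples validates the fit.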
Performance Prediction: Predicted versus Actual Performance [1/2]

[Figure: predicted versus actual execution time (ExTime, in seconds) across tile size configurations (Ti:Tj:Tk) for fdtd-2d and dsyr2k]
Performance Prediction: Predicted versus Actual Performance [2/2]

[Figure: predicted versus actual execution time (ExTime, in seconds) across tile size configurations (Ti:Tj:Tk) for lu and dgemm]
Performance Prediction: Discussion

- For trmm, lu, 2d-jacobi, syr2k and doitgen, more than 90% of the search space is predicted with less than 10% deviation from the actual execution time
- In total, 80% or more of the space can be predicted with less than 10% deviation
- Deviation is usually smaller for the best tile sizes
→ These ANNs are able to model the performance distribution

Openings:
- A program classifier w.r.t. performance distribution shape
- Training: avoid fitting the training points too closely?
Tile Size Selection: Selecting the Best Tile Size

The performance distribution can drive the empirical search to focus on promising subspaces.

Tile size selection:
- A purely random approach has huge variability on some distribution shapes
- Exhaustive search is likely not needed
- Need for an intermediate solution:
  - Low number of empirical runs
  - Good convergence, low variability
  - General enough to work on arbitrary user codes
Tile Size Selection: Overview of the Algorithm

1. Generate a parametrically tiled code
2. Randomly select x% of the tile size space, and run these sizes on the machine
3. Train an ANN using this data
4. Use the ANN to predict the performance of the entire space
5. Collect the y tile sizes that are predicted best and have not already been run
6. Run those y tile sizes on the machine, and output the best one found
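Steps 2-6 above can be sketched as follows; `measure` and `predict` are stand-ins invented here (a fake timing function and a 1-nearest-neighbor regressor) for the real empirical runs and the trained ANN, and step 1 is assumed already done:

```python
import random

# Sketch of the search algorithm, with two stand-ins so it runs on its own:
# measure() fakes an empirical timed run, and predict() replaces the trained
# ANN with a trivial 1-nearest-neighbor regressor over the sampled points.

random.seed(0)
SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]
space = [(i, j, k) for i in SIZES for j in SIZES for k in SIZES]

def measure(t):                        # stand-in for running the tiled code
    i, j, k = t
    return abs(i - 32) + abs(j - 16) + abs(k - 64)   # fake "seconds"

def predict(t, samples):               # stand-in for the trained ANN
    nearest = min(samples, key=lambda s: sum((a - b) ** 2
                                             for a, b in zip(s, t)))
    return samples[nearest]

# Step 2: randomly sample x% of the space and run those sizes.
x, y = 0.02, 50
sampled = random.sample(space, int(x * len(space)))
samples = {t: measure(t) for t in sampled}           # "empirical runs"

# Steps 3-4: fit the predictor on the samples, score the whole space.
scored = sorted((t for t in space if t not in samples),
                key=lambda t: predict(t, samples))

# Steps 5-6: empirically run the y best-predicted sizes, keep the winner.
best = min(scored[:y], key=measure)
print(best, measure(best))
```

Only `len(samples) + y` empirical runs are needed in total, which is the point of the algorithm.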
Tile Size Selection: Experimental Setup

- Studied various kernels (perfectly/imperfectly nested, BLAS & stencils)
- Focused only on single-threaded execution, on an Intel Core i7
- Comparison: simple random search (R) versus ANN-assisted search (ANN)
- Each experiment is repeated 100 times, for various sampling rates
Tile Size Selection: Experimental Results (y = 50)

Fraction of the maximal speedup achieved (best/average/worst over 100 repetitions), random search (R) versus ANN-assisted search (ANN), per sampling rate:

1%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     100%      98.71%     95.35%    100%      98.89%       97.26%
gemm        99.86%    96.29%     69.64%    99.86%    96.35%       82.93%
syr2k       98.15%    94.80%     89.81%    100%      96.01%       89.79%
lu          99.89%    92.19%     40.63%    100%      92.62%       79.68%
2d-jacobi   99.91%    94.10%     17.69%    99.91%    98.51%       94.23%
fdtd-2d     97.75%    84.15%     31.02%    100%      84.50%       66.53%

2%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     99.97%    98.71%     86.49%    100%      98.89%       97.26%
gemm        99.86%    96.42%     67.89%    99.86%    96.76%       89.83%
syr2k       98.71%    94.80%     88.20%    100%      96.69%       89.65%
lu          99.89%    92.87%     45.29%    100%      95.34%       85.80%
2d-jacobi   100%      97.60%     55.98%    100%      98.55%       94.17%
fdtd-2d     100%      84.10%     27.30%    100%      88.61%       60.65%

3%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     99.97%    98.77%     94.89%    99.97%    98.93%       97.64%
gemm        99.86%    96.47%     63.58%    99.86%    97.14%       91.01%
syr2k       98.71%    94.80%     87.99%    100%      97.17%       92.27%
lu          99.89%    94.27%     61.24%    100%      95.34%       85.80%
2d-jacobi   100%      98.39%     84.54%    100%      98.74%       94.50%
fdtd-2d     100%      85.47%     47.99%    100%      91.45%       63.34%

4%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     99.97%    98.80%     96.86%    100%      98.99%       98.28%
gemm        99.86%    96.65%     69.73%    99.86%    97.67%       93.65%
syr2k       98.71%    94.93%     88.57%    100%      97.20%       92.66%
lu          99.89%    92.19%     52.03%    100%      95.79%       85.80%
2d-jacobi   100%      98.41%     82.47%    100%      98.90%       94.50%
fdtd-2d     100%      85.55%     43.74%    100%      93.55%       79.26%
Tile Size Selection: Some Related Work

Epshteyn et al. [LCPC’05]:
- Search-oriented contribution
- Uses regression curves to approximate the performance distribution
- Uses active learning to select good candidates for empirical evaluation
- Good results for BLAS kernels

Yuki et al. [CGO’10]:
- Aims at selecting/combining different static models
- Uses program features to characterize accesses, and trains an ANN
- Results demonstrated for matrix-like kernels
Tile Size Selection: Conclusions and Future Work

ANNs are a candidate approach to connect tile sizes with performance:
- Good prediction quality
- Deviation usually smaller for the good points
- Combined search heuristic proposed:
  - Strong variability improvement over the naive random approach
  - 90+% efficiency using < 2% of the space, likely improvable further

Future work:
- Generalization!
  - Categorize benchmarks w.r.t. the performance distribution shape
  - Dataset size
- Do not try to fit the random samples during training
- Reduce the training time: ANN configuration
Acknowledgements

This work was funded in part by the U.S. National Science Foundation through award 0926688 and the Defense Advanced Research Projects Agency through AFRL Contract FA8650-09-C-7915. The opinions and findings in this document do not necessarily reflect the views of either the United States Government or the Ohio State University.