Neural Network Assisted Tile Size Selection

Mohammed Rahman, Louis-Noël Pouchet and P. Sadayappan
Dept. of Computer Science and Engineering, Ohio State University

iWAPT 2010 Workshop, Berkeley, USA, June 22, 2010
Introduction: Overview

Situation:
- New advances in parametric tiling mean much more user code to be tuned
- The problem of tile size selection is complex and unsolved!

Our approach:
- Use machine learning to build a predictor of tile size performance, for a specific program
- Rely on the shape of the performance distribution to extract promising subspaces for empirical search
- Outcome: traversing < 2% of the space achieves 90+% of the maximal speedup
Problem Statement: Tiling

- Tiling partitions the computation into blocks
- Note we consider only rectangular tiling here
- Such a partitioning must respect the data dependences of the program for the tiling to be legal
Problem Statement: Parametric Tiling

Automatic parametric tiling [ICS’09, CGO’10]:
- Produces code where the tile dimensions are parameters
- Seamlessly finds/applies all the transformations required to make the code tilable
- Actual tile sizes are given at run time
- Very useful for tile size selection (no need to recompile)
- Recent progress has generalized the approach:
  - Operates on arbitrary affine-control loops (imperfectly nested)
  - Produces good-quality code
  - Can even expose pipeline parallelism if needed
  - Software (from OSU): Pluto, PrimeTile/DynTile/PTile
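As a toy illustration of the idea (not the actual C code produced by PrimeTile/DynTile/PTile), here is a 2D loop nest tiled rectangularly, with the tile sizes `Ti` and `Tj` taken as run-time parameters; the function name and the summation kernel are invented for this sketch:

```python
# Illustrative sketch only: a 2D loop nest tiled into Ti x Tj rectangular
# blocks. Because Ti and Tj are run-time parameters, trying a new tile size
# requires no recompilation.

def tiled_sum(A, Ti, Tj):
    """Sum a 2D list A, traversing it by Ti x Tj rectangular tiles."""
    n, m = len(A), len(A[0])
    total = 0.0
    for ti in range(0, n, Ti):                      # loop over tile origins
        for tj in range(0, m, Tj):
            for i in range(ti, min(ti + Ti, n)):    # intra-tile loops
                for j in range(tj, min(tj + Tj, m)):
                    total += A[i][j]
    return total

A = [[1.0] * 8 for _ in range(8)]
# Any (Ti, Tj) gives the same result; only locality (hence speed) changes.
print(tiled_sum(A, 4, 2))  # → 64.0
```

The `min(...)` bounds handle partial tiles at the edges, so any tile size is legal for this dependence-free kernel.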
Problem Statement: Tile Size Selection

Problem: how to select the tile size that gives the best performance?

Performance is determined by many interacting factors:
- data reuse within the execution of a tile;
- data reuse between tiles;
- the layout in memory of the data used in a tile;
- the relative penalty of misses at each level of the memory hierarchy, which is machine-dependent;
- the cache replacement policy;
- the interaction with other units, such as prefetching;
- the interaction with vectorization, to enable a profitable steady state for the vectorized loop(s);
- ...
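To see why reasoning about only the first of these factors falls short, consider the simplest static check: keeping one tile's data footprint within cache. The sketch below assumes a gemm-like kernel, 8-byte elements and a 32 KB cache (illustrative values, not taken from the talk), and deliberately ignores replacement policy, prefetching and vectorization:

```python
# A deliberately naive static model: does one tile's data fit in cache?
# The kernel shape, element size and cache size are assumptions for
# illustration; real performance also depends on replacement policy,
# prefetching, vectorization, etc.

def tile_footprint_bytes(Ti, Tj, Tk, elem_size=8):
    # For a matrix-product-like kernel C[i][j] += A[i][k] * B[k][j],
    # one Ti x Tj x Tk tile touches roughly these three array slices:
    return elem_size * (Ti * Tk + Tk * Tj + Ti * Tj)

def fits_in_cache(Ti, Tj, Tk, cache_bytes=32 * 1024):  # e.g. a 32 KB L1
    return tile_footprint_bytes(Ti, Tj, Tk) <= cache_bytes

print(fits_in_cache(30, 30, 30))     # → True  (footprint 21600 B)
print(fits_in_cache(128, 128, 128))  # → False (footprint 393216 B)
```

Many tile sizes pass this test yet perform very differently, which is what motivates learning the distribution instead.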
Problem Statement: Performance Distribution

[Figure: performance distributions of fdtd-2d and dsyr2k; execution time in seconds plotted against tile size configurations (Ti:Tj:Tk)]

- Search space: 10648 possible tile sizes, each dimension drawn from
  {1, 2, 4, 6, 8, 10, 12, 16, 30, 32, 40, 48, 64, 100, 128, 150, 200, 256, 300, 400, 500, 600}
- Machine: Core i7 (1 thread)
- 2 "standard" distribution shapes
Performance Prediction: Objectives

Goal: correlate execution time with tile sizes
- (Static) performance models do exist...
- ... but fail to capture the interplay between all hardware components
- They are usually better suited to well-understood problems (e.g., uniform reuse + square tiles)
- Another view: pruning the space of poorly performing tile sizes

Our approach:
- Build a neural network to model the performance distribution
- Focus directly on the execution time
- The ANN is dedicated to a specific program + dataset size
Performance Prediction: Neural Network

Layout:
- Fully connected, multi-layer perceptron (MLP)
- Input layer: the tile sizes (Ti, Tj, Tk)
- Output layer: predicted execution time
- One hidden layer of 30 neurons
- Implemented with the Stuttgart Neural Network Simulator (SNNS) library

Training:
- Select 5% (530 tuples) from the search space of 10648
- Run the program on the machine using the tile sizes specified by each tuple
- Train with resilient back-propagation (rprop), using the actual execution time of each tuple as the target
- Standard 10% cross-validation procedure
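The talk's predictor is built with SNNS and trained with rprop; the following is only a minimal numpy sketch of the same model shape (3 inputs, one hidden layer of 30 tanh units, 1 output), trained on synthetic data with plain batch gradient descent instead of rprop:

```python
import numpy as np

# Minimal stand-in for the paper's setup: an MLP mapping normalized
# (Ti, Tj, Tk) to a predicted execution time. The training data is
# synthetic and the optimizer is plain gradient descent, not rprop;
# both are simplifications for illustration.

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))          # fake (Ti, Tj, Tk) samples
y = (X ** 2).sum(axis=1, keepdims=True)           # fake "execution times"

H = 30                                            # hidden neurons, as in the talk
W1 = rng.normal(0, 0.5, (3, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                      # hidden layer
    return h, h @ W2 + b2                         # predicted time

def mse(pred, y):
    return float(((pred - y) ** 2).mean())

lr = 0.05
_, pred = forward(X)
loss_before = mse(pred, y)
for _ in range(2000):
    h, pred = forward(X)
    g = 2 * (pred - y) / len(X)                   # dLoss/dpred
    gW2 = h.T @ g; gb2 = g.sum(axis=0)
    gh = (g @ W2.T) * (1 - h ** 2)                # backprop through tanh
    gW1 = X.T @ gh; gb1 = gh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
_, pred = forward(X)
loss_after = mse(pred, y)
print(loss_before, "->", loss_after)              # loss shrinks with training
```

In the real setting, each training target is a measured execution time, and a held-out 10% of the tuples validates the fit.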
Performance Prediction: Predicted versus Actual Performance [1/2]

[Figure: predicted versus actual execution time (ExTime, in seconds) across tile size configurations (Ti:Tj:Tk) for fdtd-2d and dsyr2k]
Performance Prediction: Predicted versus Actual Performance [2/2]

[Figure: predicted versus actual execution time (ExTime, in seconds) across tile size configurations (Ti:Tj:Tk) for lu and dgemm]
Performance Prediction: Discussion

- For trmm, lu, 2d-jacobi, syr2k and doitgen, more than 90% of the search space is predicted with less than 10% deviation from the actual execution time
- In total, 80% or more of the space can be predicted with less than 10% deviation
- Deviation is usually smaller for the best tile sizes
→ These ANNs are able to model the performance distribution

Openings:
- A program classifier w.r.t. performance distribution shape
- Training: avoid fitting the training points too closely?
Tile Size Selection: Selecting the Best Tile Size

The performance distribution can drive the empirical search to focus on promising subspaces.

Tile size selection:
- A purely random approach has huge variability on some distribution shapes
- Exhaustive search is likely not needed
- Need for an intermediate solution:
  - Low number of empirical runs
  - Good convergence, low variability
  - General enough to work on arbitrary user codes
Tile Size Selection: Overview of the Algorithm

1. Generate a parametrically tiled code
2. Randomly select x% of the tile size space, and run these sizes on the machine
3. Train an ANN using this data
4. Use the ANN to predict the performance of the entire space
5. Collect the y tile sizes that are predicted best and have not already been run
6. Run those y tile sizes on the machine, and output the best one found
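Steps 2-6 above can be sketched as follows; `measure` and `predict` are stand-ins invented here (a fake timing function and a 1-nearest-neighbor regressor) for the real empirical runs and the trained ANN, and step 1 is assumed already done:

```python
import random

# Sketch of the search algorithm, with two stand-ins so it runs on its own:
# measure() fakes an empirical timed run, and predict() replaces the trained
# ANN with a trivial 1-nearest-neighbor regressor over the sampled points.

random.seed(0)
SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256]
space = [(i, j, k) for i in SIZES for j in SIZES for k in SIZES]

def measure(t):                        # stand-in for running the tiled code
    i, j, k = t
    return abs(i - 32) + abs(j - 16) + abs(k - 64)   # fake "seconds"

def predict(t, samples):               # stand-in for the trained ANN
    nearest = min(samples, key=lambda s: sum((a - b) ** 2
                                             for a, b in zip(s, t)))
    return samples[nearest]

# Step 2: randomly sample x% of the space and run those sizes.
x, y = 0.02, 50
sampled = random.sample(space, int(x * len(space)))
samples = {t: measure(t) for t in sampled}           # "empirical runs"

# Steps 3-4: fit the predictor on the samples, score the whole space.
scored = sorted((t for t in space if t not in samples),
                key=lambda t: predict(t, samples))

# Steps 5-6: empirically run the y best-predicted sizes, keep the winner.
best = min(scored[:y], key=measure)
print(best, measure(best))
```

Only `len(samples) + y` empirical runs are needed in total, which is the point of the algorithm.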
Tile Size Selection: Experimental Setup

- Studied various kernels (perfectly/imperfectly nested, BLAS & stencils)
- Focused only on single-threaded execution, on an Intel Core i7
- Comparison: simple random search (R) versus ANN-assisted search (ANN)
- Each experiment is repeated 100 times, for various sampling rates
Tile Size Selection: Experimental Results (y = 50)

Fraction of the maximal speedup achieved (best/average/worst over 100 repetitions), random search (R) versus ANN-assisted search (ANN), per sampling rate:

1%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     100%      98.71%     95.35%    100%      98.89%       97.26%
gemm        99.86%    96.29%     69.64%    99.86%    96.35%       82.93%
syr2k       98.15%    94.80%     89.81%    100%      96.01%       89.79%
lu          99.89%    92.19%     40.63%    100%      92.62%       79.68%
2d-jacobi   99.91%    94.10%     17.69%    99.91%    98.51%       94.23%
fdtd-2d     97.75%    84.15%     31.02%    100%      84.50%       66.53%

2%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     99.97%    98.71%     86.49%    100%      98.89%       97.26%
gemm        99.86%    96.42%     67.89%    99.86%    96.76%       89.83%
syr2k       98.71%    94.80%     88.20%    100%      96.69%       89.65%
lu          99.89%    92.87%     45.29%    100%      95.34%       85.80%
2d-jacobi   100%      97.60%     55.98%    100%      98.55%       94.17%
fdtd-2d     100%      84.10%     27.30%    100%      88.61%       60.65%

3%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     99.97%    98.77%     94.89%    99.97%    98.93%       97.64%
gemm        99.86%    96.47%     63.58%    99.86%    97.14%       91.01%
syr2k       98.71%    94.80%     87.99%    100%      97.17%       92.27%
lu          99.89%    94.27%     61.24%    100%      95.34%       85.80%
2d-jacobi   100%      98.39%     84.54%    100%      98.74%       94.50%
fdtd-2d     100%      85.47%     47.99%    100%      91.45%       63.34%

4%          R-best    R-average  R-worst   ANN-best  ANN-average  ANN-worst
doitgen     99.97%    98.80%     96.86%    100%      98.99%       98.28%
gemm        99.86%    96.65%     69.73%    99.86%    97.67%       93.65%
syr2k       98.71%    94.93%     88.57%    100%      97.20%       92.66%
lu          99.89%    92.19%     52.03%    100%      95.79%       85.80%
2d-jacobi   100%      98.41%     82.47%    100%      98.90%       94.50%
fdtd-2d     100%      85.55%     43.74%    100%      93.55%       79.26%
Tile Size Selection: Some Related Work

Epshteyn et al. [LCPC’05]:
- Search-oriented contribution
- Uses regression curves to approximate the performance distribution
- Uses active learning to select good candidates for empirical evaluation
- Good results for BLAS kernels

Yuki et al. [CGO’10]:
- Aims at selecting/combining different static models
- Uses program features to characterize accesses, and trains an ANN
- Results demonstrated for matrix-like kernels
Tile Size Selection: Conclusions and Future Work

ANNs are a candidate approach to connect tile sizes with performance:
- Good prediction quality
- Deviation usually smaller for the good points
- Combined search heuristic proposed:
  - Strong variability improvement over the naive random approach
  - 90+% efficiency using < 2% of the space, likely improvable further

Future work:
- Generalization!
  - Categorize benchmarks w.r.t. the performance distribution shape
  - Dataset size
- Do not try to fit the random samples during training
- Reduce the training time: ANN configuration
Acknowledgements

This work was funded in part by the U.S. National Science Foundation through award 0926688 and the Defense Advanced Research Projects Agency through AFRL Contract FA8650-09-C-7915. The opinions and findings in this document do not necessarily reflect the views of either the United States Government or the Ohio State University.