Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures Tom Henretty1 Kevin Stock1 Louis-Noël Pouchet1 Franz Franchetti2 J. Ramanujam3 P. Sadayappan1 1 2 3
The Ohio State University Carnegie Mellon University Louisiana State University
March 29, 2011 ETAPS CC’11
Saarbrucken, Germany
Outline:
CC’11
Outline 1
Introduction
2
Vectorization of Stencils
3
Stream Alignment Conflict
4
Data Layout Transformation
5
Compiler Framework
6
Experimental Results
7
Conclusion
OSU / CMU / LSU
2
Introduction:
CC’11
Short-Vector SIMD I
Perform identical computation on small chunks of data I I I
I
Low latency, multiple SIMD units per CPU I
I
Maximal Speedup equals the vector size
Ubiquitous feature on modern processors I I I I
OSU / CMU / LSU
Operations are independent Vector size: from 2 to 64 Packing operations to form a vector (shuffle, extract, ...)
x86 – SSE, AVX Power – VMX / VSX ARM – NEON Cell SPU
3
Introduction:
CC’11
A Brief on Stencil Computations
I
Typically: iterative update of a structured (fixed) grid
I
Compute a point from neighbor points values I
I
Numerous application domains use stencils I I I
I
OSU / CMU / LSU
Same grid / multiple grids
Finite difference methods for solving PDEs Image processing Computational electromagnetics, CFD, numerical relativity, ...
Domain-Specific Languages for Stencils (Fenics, RNPL, ...)
4
Vectorization of Stencils:
CC’11
Stencil Example (a) 5 point stencil C code for (t = 0; t < TMAX; ++t) for (i = 1; i < N-1; ++i) for (j = 1; j < M-1; ++j) a[i][j] = b[i+1][j] + b[i][j-1] + b[i ][j] + b[i][j+1] + b[i-1][j];
M
j
i
N
(b) Arrays a, b, and stencil detail OSU / CMU / LSU
5
Vectorization of Stencils:
CC’11
Vectorization of Stencil Computation
I
Two “main” types of stencils I I
Jacobi-like: the output does not depend on the input Seidel-like: in-place update
I
Loop transformations expose tiling possibilities, and at least one inner-most parallel loop
I
Auto-vectorization successful (ICC, GCC)...
I
...But SIMD speedup is far from optimal!
OSU / CMU / LSU
6
Vectorization of Stencils:
CC’11
Performance Consideration
for (t = 0; t < for (i = 0; i for (j = 1; S1: C[i][j] = for (i = 0; i for (j = 1; S2: A[i][j] = } Performance:
T; ++t) { < N; ++i) j < N+1; ++j) A[i][j] + A[i][j-1]; < N; ++i) j < N+1; ++j) C[i][j] + C[i][j-1]; AMD Phenom Core2 Core i7
(a) Stencil code
1.2 GFlop/s 3.5 GFlop/s 4.1 GFlop/s
for (t = 0; t < for (i = 0; i for (j = 0; S3: C[i][j] = for (i = 0; i for (j = 0; S4: A[i][j] = } Performance:
T; ++t) { < N; ++i) j < N; ++j) A[i][j] + B[i][j]; < N; ++i) j < N; ++j) C[i][j] + B[i][j]; AMD Phenom Core2 Core i7
1.9 GFlop/s 6.0 GFlop/s 6.7 GFlop/s
(b) Non-Stencil code
Stencil code (a) has much lower performance than the non-stencil code (b) despite accessing 50% fewer data elements
OSU / CMU / LSU
7
Stream Alignment Conflict:
CC’11
Stream Alignment Conflict
for (i = 0; i < H; i++) for (j = 0; j < W - 1; j++) A[i][j] = B[i][j] + B[i][j+1];
VECTOR REGISTERS xmm1
I
J
K
L
xmm2
M
N
O
P
xmm3
J
K
L
M
MEMORY CONTENTS A
A
B
C
D
E
F
G
H ... ... ... ...
B
I
J
K
L
M
N
O
P ... ... ... ...
I
Load and shuffle: I Load [I,J,K,L] and [M,N,O,P] I Shuffle to create [J,K,L,M]
I
Multiple unaligned loads I Load [I,J,K,L] and [J,K,L,M] I
OSU / CMU / LSU
x86 ASSEMBLY movaps B(...), %xmm1 movaps 16+B(...),%xmm2 movaps %xmm2, %xmm3 palignr $4, %xmm1, %xmm3 ;; Register state here addps %xmm1, %xmm3 movaps %xmm3, A(...)
Not possible on architectures with alignment constraints
8
Data Layout Transformation:
CC’11
Overview of the Solution
I
Stream Alignment Conflict: adjacent elements in memory maps to adjacent vector slots
I
Key idea: break this property, to have both operands in identical vector slot
I
Achieved through Data Layout Transformation I I I
OSU / CMU / LSU
No shuffle needed No extra unaligned load But not trivial to achieve!
9
Data Layout Transformation:
CC’11
Data Layout Transformation Example 0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
S
(a) Original Layout V N V
V
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
N V
(b) Dimension Lifted
A
G
M
B
H
N
T
C
I
O
U
D
J
P
V
E
K
Q
W
F
L
R
X
(c) Transposed
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
0
1
2
3
A
G
M
S
B
H
N
T
C
I
O
U
D
J
P
V
E
K
Q
W
F
L
R
X
(d) Transformed Layout
for (i = 1; i < 24; ++i) B[i] = (A[i-1] + A[i] + A[i+1]) / 3; OSU / CMU / LSU
10
Data Layout Transformation:
CC’11
Handling Boundaries Compute Boundaries of Array Z F
L
R
A
G
M
S
B
H
N
T
Shuffle Opposite Boundaries of Array Y F
G’ M’ S’
E
K
Q
W
F
L
R
X
G
M
S
L
R
X
F
L
R
A
G
M
G
M
S
S
Compute Steady State of Array Z
Original Array Y A
G
M
B
H
N
S T
C
I
O
U
D
J
P
V
E
K
Q
W
F
L
R
X
F’ L’ R’
OSU / CMU / LSU
11
Data Layout Transformation:
CC’11
Higher-dimensional Stencils (a) Original Layout
0 1 2
n0 w0 c0 e0 s0
n1 w1 c1 e1 s1
n2 w2 c2 e2 s2
n3 w3 c3 e3 s3
(b) Transformed Layout
n0 n1 n2 n3
0
OSU / CMU / LSU
w0 w1 w2 w3 c0 c1 c2 c3
s0 s1 s2 s3
1
2
e0 e1 e2 e3
12
Compiler Framework:
CC’11
Overview of Code Generation Algorithm
1
Detect arrays/statements that suffers from SAC
2
Perform Dimension-Lift-and-Transpose of those arrays
3
Generate Vector code for the inner-loop I I I
OSU / CMU / LSU
Ghost cell copy-in and copy-out code Boundary code Steady state code
13
Compiler Framework:
CC’11
Detection of Stream Alignment Conflict
I
Standard compiler framework operating on array subscript functions
I
Main idea: detect cross-iteration reuse
I
Robust to stream offset via iteration shifting I I
I
Requires the window of the stencil to be constant I
OSU / CMU / LSU
Minimize the reuse distance Some alignment conflicts are artificial and fixed with stream realignment
The window size is used to compute the amount of ghost cells
14
Experimental Results:
CC’11
Experimental Setup
I
Experiments run on 3 architectures (x86): I I I
I
Data is L1-resident I
I
OSU / CMU / LSU
Intel Core2 Quad (Kentsfield): SAC resolved with low-performance shuffles AMD Phenom (K10): SAC resolved with average-performance shuffles Intel Core i7 (Nehalem): SAC resolved with fast redundant loads
assume tiling was performed beforehand if necessary
Tested compiler: Intel ICC 11.1
15
Experimental Results:
CC’11
Three Code Variants Evaluated
1
Ref: reference code I I
2
DLT: basic layout transformed I I
3
Straightforward C implementation with DLT arrays Always auto-vectorized by the compiler
DLTi: intrinsics + layout transformed I
OSU / CMU / LSU
Straightforward C implementation Always auto-vectorized by the compiler
C implementation with DLT arrays and SSE vector intrinsics
16
Experimental Results:
CC’11
Single Precision Results Single Precision DLT Results L1 Cache Resident 16
14
12
Gflop/s
10
8 Ref. 6
DLT DLTi
4
J-‐1D
OSU / CMU / LSU
J-‐2D-‐5pt
FDTD-‐2D
Core i7
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
J-‐2D-‐9pt J-‐3D Hea?tut-‐3D Benchmark / Microarchitecture
Phenom
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Phenom
Phenom
0
Core2Quad
2
Rician-‐2D
17
Experimental Results:
CC’11
Double Precision Results Double Precision DLT Results L1 Cache Resident 8 7 6
Gflop/s
5 4 Ref.
3
DLT DLTi
2 1
J-1D
J-2D-5pt
J-2D-9pt
J-3D
FDTD-2D
Core i7
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Heatttut-3D
Phenom
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Phenom
Core2Quad
Core i7
Phenom
Phenom
Core2Quad
0
Rician-2D
Benchmark / Microarchitecture
OSU / CMU / LSU
18
Experimental Results:
CC’11
Summary of Experiments
I
Performance improvement matches the shuffle/unaligned load costs
I
Tested higher-dimensional stencils show less improvement: I I
I
OSU / CMU / LSU
more intra-stencil dependences higher cache pressure
Manual check of the ASM showed no shuffle, no redundant load instructions
19
Conclusion:
CC’11
Conclusion
I
Stream Alignment Conflict is the performance bottleneck for auto-vectorized stencils
I
Impact varies with micro-architecture characteristics, but is always significant
I
A data layout transformation can solve this problem
I
Strong performance improvement observed I
OSU / CMU / LSU
Manual vectorization still beats automatic vectorization
20