cc slides - PDF Free Download

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures Tom Henretty1 Kevin Stock1 Louis-Noël Pouchet1 Franz Franchetti2 J. Ramanujam3 P. Sadayappan1 1 2 3

The Ohio State University Carnegie Mellon University Louisiana State University

March 29, 2011 ETAPS CC’11

Saarbrucken, Germany

Outline:

CC’11

Outline 1

Introduction

2

Vectorization of Stencils

3

Stream Alignment Conflict

4

Data Layout Transformation

5

Compiler Framework

6

Experimental Results

7

Conclusion

OSU / CMU / LSU

2

Introduction:

CC’11

Short-Vector SIMD I

Perform identical computation on small chunks of data I I I

I

Low latency, multiple SIMD units per CPU I

I

Maximal Speedup equals the vector size

Ubiquitous feature on modern processors I I I I

OSU / CMU / LSU

Operations are independent Vector size: from 2 to 64 Packing operations to form a vector (shuffle, extract, ...)

x86 – SSE, AVX Power – VMX / VSX ARM – NEON Cell SPU

3

Introduction:

CC’11

A Brief on Stencil Computations

I

Typically: iterative update of a structured (fixed) grid

I

Compute a point from neighbor points values I

I

Numerous application domains use stencils I I I

I

OSU / CMU / LSU

Same grid / multiple grids

Finite difference methods for solving PDEs Image processing Computational electromagnetics, CFD, numerical relativity, ...

Domain-Specific Languages for Stencils (Fenics, RNPL, ...)

4

Vectorization of Stencils:

CC’11

Stencil Example (a) 5 point stencil C code for (t = 0; t < TMAX; ++t) for (i = 1; i < N-1; ++i) for (j = 1; j < M-1; ++j) a[i][j] = b[i+1][j] + b[i][j-1] + b[i ][j] + b[i][j+1] + b[i-1][j];

M

j

i

N

(b) Arrays a, b, and stencil detail OSU / CMU / LSU

5

Vectorization of Stencils:

CC’11

Vectorization of Stencil Computation

I

Two “main” types of stencils I I

Jacobi-like: the output does not depend on the input Seidel-like: in-place update

I

Loop transformations expose tiling possibilities, and at least one inner-most parallel loop

I

Auto-vectorization successful (ICC, GCC)...

I

...But SIMD speedup is far from optimal!

OSU / CMU / LSU

6

Vectorization of Stencils:

CC’11

Performance Consideration

for (t = 0; t < for (i = 0; i for (j = 1; S1: C[i][j] = for (i = 0; i for (j = 1; S2: A[i][j] = } Performance:

T; ++t) { < N; ++i) j < N+1; ++j) A[i][j] + A[i][j-1]; < N; ++i) j < N+1; ++j) C[i][j] + C[i][j-1]; AMD Phenom Core2 Core i7

(a) Stencil code

1.2 GFlop/s 3.5 GFlop/s 4.1 GFlop/s

for (t = 0; t < for (i = 0; i for (j = 0; S3: C[i][j] = for (i = 0; i for (j = 0; S4: A[i][j] = } Performance:

T; ++t) { < N; ++i) j < N; ++j) A[i][j] + B[i][j]; < N; ++i) j < N; ++j) C[i][j] + B[i][j]; AMD Phenom Core2 Core i7

1.9 GFlop/s 6.0 GFlop/s 6.7 GFlop/s

(b) Non-Stencil code

Stencil code (a) has much lower performance than the non-stencil code (b) despite accessing 50% fewer data elements

OSU / CMU / LSU

7

Stream Alignment Conflict:

CC’11

Stream Alignment Conflict

for (i = 0; i < H; i++) for (j = 0; j < W - 1; j++) A[i][j] = B[i][j] + B[i][j+1];

VECTOR REGISTERS xmm1

I

J

K

L

xmm2

M

N

O

P

xmm3

J

K

L

M

MEMORY CONTENTS A

A

B

C

D

E

F

G

H ... ... ... ...

B

I

J

K

L

M

N

O

P ... ... ... ...

I

Load and shuffle: I Load [I,J,K,L] and [M,N,O,P] I Shuffle to create [J,K,L,M]

I

Multiple unaligned loads I Load [I,J,K,L] and [J,K,L,M] I

OSU / CMU / LSU

x86 ASSEMBLY movaps B(...), %xmm1 movaps 16+B(...),%xmm2 movaps %xmm2, %xmm3 palignr $4, %xmm1, %xmm3 ;; Register state here addps %xmm1, %xmm3 movaps %xmm3, A(...)

Not possible on architectures with alignment constraints

8

Data Layout Transformation:

CC’11

Overview of the Solution

I

Stream Alignment Conflict: adjacent elements in memory maps to adjacent vector slots

I

Key idea: break this property, to have both operands in identical vector slot

I

Achieved through Data Layout Transformation I I I

OSU / CMU / LSU

No shuffle needed No extra unaligned load But not trivial to achieve!

9

Data Layout Transformation:

CC’11

Data Layout Transformation Example 0

1

2

3

0

1

2

3

0

1

2

3

0

1

2

3

0

1

2

3

0

1

2

3

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

S

T

U

V

W

X

S

(a) Original Layout V N V

V

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

S

T

U

V

W

X

N V

(b) Dimension Lifted

A

G

M

B

H

N

T

C

I

O

U

D

J

P

V

E

K

Q

W

F

L

R

X

(c) Transposed

0

1

2

3

0

1

2

3

0

1

2

3

0

1

2

3

0

1

2

3

0

1

2

3

A

G

M

S

B

H

N

T

C

I

O

U

D

J

P

V

E

K

Q

W

F

L

R

X

(d) Transformed Layout

for (i = 1; i < 24; ++i) B[i] = (A[i-1] + A[i] + A[i+1]) / 3; OSU / CMU / LSU

10

Data Layout Transformation:

CC’11

Handling Boundaries Compute Boundaries of Array Z F

L

R

A

G

M

S

B

H

N

T

Shuffle Opposite Boundaries of Array Y F

G’ M’ S’

E

K

Q

W

F

L

R

X

G

M

S

L

R

X

F

L

R

A

G

M

G

M

S

S

Compute Steady State of Array Z

Original Array Y A

G

M

B

H

N

S T

C

I

O

U

D

J

P

V

E

K

Q

W

F

L

R

X

F’ L’ R’

OSU / CMU / LSU

11

Data Layout Transformation:

CC’11

Higher-dimensional Stencils (a) Original Layout

0 1 2

n0 w0 c0 e0 s0

n1 w1 c1 e1 s1

n2 w2 c2 e2 s2

n3 w3 c3 e3 s3

(b) Transformed Layout

n0 n1 n2 n3

0

OSU / CMU / LSU

w0 w1 w2 w3 c0 c1 c2 c3

s0 s1 s2 s3

1

2

e0 e1 e2 e3

12

Compiler Framework:

CC’11

Overview of Code Generation Algorithm

1

Detect arrays/statements that suffers from SAC

2

Perform Dimension-Lift-and-Transpose of those arrays

3

Generate Vector code for the inner-loop I I I

OSU / CMU / LSU

Ghost cell copy-in and copy-out code Boundary code Steady state code

13

Compiler Framework:

CC’11

Detection of Stream Alignment Conflict

I

Standard compiler framework operating on array subscript functions

I

Main idea: detect cross-iteration reuse

I

Robust to stream offset via iteration shifting I I

I

Requires the window of the stencil to be constant I

OSU / CMU / LSU

Minimize the reuse distance Some alignment conflicts are artificial and fixed with stream realignment

The window size is used to compute the amount of ghost cells

14

Experimental Results:

CC’11

Experimental Setup

I

Experiments run on 3 architectures (x86): I I I

I

Data is L1-resident I

I

OSU / CMU / LSU

Intel Core2 Quad (Kentsfield): SAC resolved with low-performance shuffles AMD Phenom (K10): SAC resolved with average-performance shuffles Intel Core i7 (Nehalem): SAC resolved with fast redundant loads

assume tiling was performed beforehand if necessary

Tested compiler: Intel ICC 11.1

15

Experimental Results:

CC’11

Three Code Variants Evaluated

1

Ref: reference code I I

2

DLT: basic layout transformed I I

3

Straightforward C implementation with DLT arrays Always auto-vectorized by the compiler

DLTi: intrinsics + layout transformed I

OSU / CMU / LSU

Straightforward C implementation Always auto-vectorized by the compiler

C implementation with DLT arrays and SSE vector intrinsics

16

Experimental Results:

CC’11

Single Precision Results Single Precision DLT Results L1 Cache Resident 16

14

12

Gﬂop/s

10

8 Ref. 6

DLT DLTi

4

J-‐1D

OSU / CMU / LSU

J-‐2D-‐5pt

FDTD-‐2D

Core i7

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

J-‐2D-‐9pt J-‐3D Hea?tut-‐3D Benchmark / Microarchitecture

Phenom

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Phenom

Phenom

0

Core2Quad

2

Rician-‐2D

17

Experimental Results:

CC’11

Double Precision Results Double Precision DLT Results L1 Cache Resident 8 7 6

Gflop/s

5 4 Ref.

3

DLT DLTi

2 1

J-1D

J-2D-5pt

J-2D-9pt

J-3D

FDTD-2D

Core i7

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Heatttut-3D

Phenom

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Phenom

Core2Quad

Core i7

Phenom

Phenom

Core2Quad

0

Rician-2D

Benchmark / Microarchitecture

OSU / CMU / LSU

18

Experimental Results:

CC’11

Summary of Experiments

I

Performance improvement matches the shuffle/unaligned load costs

I

Tested higher-dimensional stencils show less improvement: I I

I

OSU / CMU / LSU

more intra-stencil dependences higher cache pressure

Manual check of the ASM showed no shuffle, no redundant load instructions

19

Conclusion:

CC’11

Conclusion

I

Stream Alignment Conflict is the performance bottleneck for auto-vectorized stencils

I

Impact varies with micro-architecture characteristics, but is always significant

I

A data layout transformation can solve this problem

I

Strong performance improvement observed I

OSU / CMU / LSU

Manual vectorization still beats automatic vectorization

20