Approximation Algorithms for Dependency-Aware Rule-Caching in Software-Defined Networks
Jie Wu, Yang Chen, and Huanyang Zheng
Center for Networked Computing, Temple University, USA

1. Introduction of Rule Caching

- Rule caching
  - Install packet-processing rules in switches
- Switch types
  - Hardware: Ternary Content Addressable Memory (TCAM)
  - Software: software-based switches

  Switch type | Pros             | Cons
  Hardware    | Fast (>400 Gbps) | Small capacity (2K~10K rules)
  Software    | Large capacity   | Slow (40 Gbps)

General Rule Matching Problem

- Rule table

  Rule | Code | Priority | Weight
  R1   | 000  | 6        | 10
  R2   | 00*  | 5        | 60
  R3   | 0**  | 4        | 30
  R4   | 11*  | 3        | 5
  R5   | 1*0  | 2        | 10
  R6   | 10*  | 1        | 120

- Rule dependency graph
  - Use of the wildcard * to reduce the number of rules
  - Directed acyclic graph: a rule and all its descendants (to be placed in the cache together)
- Maximize traffic hit by placing no more than k rules
  - NP-hard [1]

[1] CacheFlow: Dependency-aware rule-caching for software-defined networks (SOSR’16)
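
To make the descendant relationship concrete, below is a minimal Python sketch (an illustration, not the authors' implementation) that derives, for the table above, which higher-priority rules must be cached together with a given rule. The 3-bit ternary codes and the simple overlap test are simplifying assumptions; real OpenFlow rules match on many header fields.

```python
# Minimal sketch: for each rule, find the higher-priority rules whose match
# patterns overlap it. Those rules must be cached together with it, otherwise
# packets intended for them would wrongly hit the cached rule in TCAM.

def overlaps(code_a, code_b):
    """Two ternary codes overlap if every bit position is compatible."""
    return all(a == b or a == '*' or b == '*' for a, b in zip(code_a, code_b))

def co_cache_sets(rules):
    """rules: list of (name, code, priority, weight); a larger priority wins.
    Returns {rule: higher-priority overlapping rules that must be co-cached}."""
    return {name_i: {name_j for name_j, code_j, prio_j, _ in rules
                     if prio_j > prio_i and overlaps(code_i, code_j)}
            for name_i, code_i, prio_i, _ in rules}

rules = [("R1", "000", 6, 10), ("R2", "00*", 5, 60), ("R3", "0**", 4, 30),
         ("R4", "11*", 3, 5),  ("R5", "1*0", 2, 10), ("R6", "10*", 1, 120)]
# e.g. R3 -> {R1, R2}; dependencies also chain transitively (R6 -> R5 -> R4)
print(co_cache_sets(rules))
```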

Efficient Rule Caching

- Assumptions
  - All rules form a forest of trees
  - Prefix coding to reduce the number of rules (optimal coding [2] is hard)
- Constraints
  - Descendant constraint
  - Limited number of cached rules, k
- Objective
  - Maximize the number of rule hits

(Figure: packets are matched first against the TCAM forwarding table; a hit is forwarded by the hardware switch, a miss falls back to the software switch forwarding table. Example rules: 1000 -> Port 1, 10** -> Port 2, 111* -> Port 3.)

[2] Explicit path control in commodity data centers: Design and applications (ToN’16)

A Motivating Example

(Figure: TCAM forwarding tables (rule, action) chosen to maximize traffic hit for cache sizes k = 2 and k = 3.)

2. Solutions: Greedy Solution One (Branch)

- Definitions
  - Branch (which includes fork): a rule and all its descendants
  - Max branch: a branch that meets either of two conditions:
    (1) its size is exactly k, or
    (2) its size is less than k and it is not a branch of another branch with a size of k or less
  - Maintaining max branches will include all cacheable branches

  Definition         | Explanation
  Unit cost C        | Each rule has a unit cost
  Weight W           | Rule hits
  Unit benefit ΔW/ΔC | Ratio of rule weight to rule cost

(Fig. 2: a branch (a) and a segment (b) within a max branch, in topological sorting order from left to right.)

Greedy Solution One

- Steps
  - Select the branch with the maximum unit benefit (ΔW/ΔC)
  - Update the unit benefit values of the other branches
  - Use a heap to maintain the maximum unit benefit for each max branch
- Time complexity: O(n + k log n + k^2)
  - n: number of rules
  - k: cache size
- Approximation ratio: 2
  - Compare the first i items vs. the (i+1)-th item
- Example: the optimal unit benefit is (43+13+7)/3 = 21 for k = 3 and (43+13+7+12+20)/5 = 19 for k = 5
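
The following is a minimal sketch of the heap-based greedy step described above, under assumed data structures (a children map and per-rule weights). It treats the branch rooted at every rule as a candidate and lazily re-validates stale heap entries, so it simplifies the paper's max-branch bookkeeping rather than reproducing it.

```python
import heapq

# Minimal sketch: rules form a forest; a branch is a rule plus all of its
# descendants. Greedily cache the branch with the largest unit benefit
# (residual weight / residual rule count) until k rules are cached.

def branch_rules(node, children):
    """All rules in the branch rooted at `node`."""
    out, stack = [], [node]
    while stack:
        r = stack.pop()
        out.append(r)
        stack.extend(children.get(r, []))
    return out

def greedy_branch_cache(roots, children, weight, k):
    cached = set()

    def benefit(r):
        rules = [x for x in branch_rules(r, children) if x not in cached]
        return (sum(weight[x] for x in rules) / len(rules), rules) if rules else (0.0, [])

    heap = [(-benefit(r)[0], r)
            for root in roots for r in branch_rules(root, children)]
    heapq.heapify(heap)
    while heap and len(cached) < k:
        neg_b, r = heapq.heappop(heap)
        b, rules = benefit(r)
        if not rules or len(cached) + len(rules) > k:
            continue                                   # empty or does not fit
        if b + 1e-12 < -neg_b:
            heapq.heappush(heap, (-b, r))              # stale entry: re-insert
            continue
        cached.update(rules)
    return cached

children = {"R3": ["R2"], "R2": ["R1"], "R6": ["R5"], "R5": ["R4"]}
weight = {"R1": 10, "R2": 60, "R3": 30, "R4": 5, "R5": 10, "R6": 120}
print(greedy_branch_cache(["R3", "R6"], children, weight, k=3))  # {R4, R5, R6}
```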

2. Solutions (cont’d): Greedy Solution Two (Segment)

- Definitions
  - Segment: cut off a branch
  - Deny rule: a dummy rule that forwards the cut-off traffic to the software switch
    - Used to cut off low-weight sub-branches
  - Unit benefit: ΔW/(ΔC+1), where the +1 accounts for the deny rule's TCAM slot
- We only consider segments without a fork
  - A fork is converted into multiple segments
  - Avoids a non-polynomial number of choices

Greedy Solution Two

- Steps
  - Select the segment with the maximum unit benefit
  - Update the unit benefit values of the other segments
  - Use two levels of heaps to maintain the segments: a global heap (g-heap) constructed from the maxima of the local heaps (l-heaps)
- Time complexity: O(kn)
- Example (k = 2): the optimal unit benefit without deny rules is (20+7)/2 = 13.5; with a deny rule it is 43/(1+1) = 21.5
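
A small sketch of the segment pay-off under the same assumed representation: keeping only the top of a branch costs one extra TCAM slot for the deny rule, so the unit benefit becomes W/(C+1).

```python
# Minimal sketch: a segment keeps only the top part of a branch and adds one
# deny rule that punts the cut-off descendants to the software switch.

def segment_unit_benefit(kept_weights):
    """kept_weights: cache-hit weights of the rules kept in the segment."""
    return sum(kept_weights) / (len(kept_weights) + 1)  # +1 slot for deny rule

# Hypothetical branch {R6: 120, R5: 10, R4: 5}: keeping only R6 as a segment
# yields 120 / (1 + 1) = 60, versus 135 / 3 = 45 for caching the whole branch.
print(segment_unit_benefit([120]))
```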

2. Solutions (cont’d): Combined Greedy Solution

- Insight
  - Combine the two greedy solutions
  - Use branches and segments with the same criterion
    - Maximum unit benefit
    - Each maintains its own heap
- Time complexity: O(kn)
- Approximation ratio: 24/5
- Example (k = 3): the optimal unit benefit with deny rules is (43+20)/(2+1) = 21
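
A minimal sketch of the combined criterion with hypothetical (weight, rule count) candidates: each round the best branch and the best segment are compared by unit benefit, with the segment charged one extra slot for its deny rule, and the larger one is cached.

```python
# Minimal sketch: compare the best branch and the best segment candidates by
# unit benefit; the segment pays one extra TCAM slot for its deny rule.

def pick_candidate(best_branch, best_segment):
    """Each candidate is (total_weight, num_rules) or None."""
    options = []
    if best_branch is not None:
        w, c = best_branch
        options.append((w / c, "branch", best_branch))
    if best_segment is not None:
        w, c = best_segment
        options.append((w / (c + 1), "segment", best_segment))
    return max(options) if options else None

print(pick_candidate((135, 3), (120, 1)))  # segment wins: 60.0 > 45.0
```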

2. Solutions (cont’d): Dynamic Programming (DP) Solution

- Formulation
  - T[R, d]: the subtree induced by rule R, its first d children (in depth-first order), and all descendants of these d children
  - O[R, d, m]: the optimal cache-hits achieved by caching m rules within T[R, d]
  - R_id: the d-th child of R; d(R): the number of children of R; W_i: the cache-hit weight of R_i
  - R_0: a dummy tree root; the objective is O[R_0, d(R_0), k]
- Initialization
  - O[R_i, 0, m] = W_i if m >= 1 and i != 0, and 0 otherwise
- Recurrence
  - O[R_i, d, m] = max( O[R_i, d-1, m],
                        max over 0 <= m' <= m of O[R_id, d(R_id), m'] + O[R_i, d-1, m-m'] )
  - The first term ignores the descendants under the d-th child; the second term caches m' rules among them; R_i itself is cached only if all the rules in T[R_i, d] are cached (descendant constraint)
- Time complexity: O(k^2 n)
  - The recurrence takes O(k) per entry and is computed for each rule R_i and each m (0 <= m <= k)
  - Larger than the greedy solutions, but optimal
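
Below is a compact sketch of a tree DP in the spirit of the recurrence above (assumed children/weight maps over a forest of roots, not the authors' code): f[v][m] is the best cache-hit weight using at most m slots inside v's subtree, children are merged one at a time as in the d-indexed recurrence, and caching v itself forces its whole subtree into the cache, which enforces the descendant constraint. The per-node knapsack merge keeps the running time within the O(k^2 n) bound quoted above.

```python
import sys
sys.setrecursionlimit(10000)

# Minimal sketch of the tree DP. f[v][m]: best cache-hit weight using at most
# m TCAM slots inside v's subtree, under the descendant constraint.

def dp_cache(children, weight, roots, k):
    def solve(v):
        best = [0] * (k + 1)               # case: v itself is not cached
        size, total_w = 1, weight[v]
        for c in children.get(v, []):      # merge children one at a time
            f_c, sz_c, w_c = solve(c)
            size, total_w = size + sz_c, total_w + w_c
            merged = best[:]               # m' = 0: take nothing from child c
            for m in range(k + 1):
                for mc in range(1, min(m, sz_c) + 1):
                    merged[m] = max(merged[m], best[m - mc] + f_c[mc])
            best = merged
        for m in range(size, k + 1):       # case: cache v => cache its subtree
            best[m] = max(best[m], total_w)
        return best, size, total_w

    answer = [0] * (k + 1)                 # knapsack over the forest's roots
    for r in roots:
        f_r, sz_r, _ = solve(r)
        merged = answer[:]
        for m in range(k + 1):
            for mr in range(1, min(m, sz_r) + 1):
                merged[m] = max(merged[m], answer[m - mr] + f_r[mr])
        answer = merged
    return answer[k]

children = {"R3": ["R2"], "R2": ["R1"], "R6": ["R5"], "R5": ["R4"]}
weight = {"R1": 10, "R2": 60, "R3": 30, "R4": 5, "R5": 10, "R6": 120}
print(dp_cache(children, weight, ["R3", "R6"], k=3))  # 135 (cache R6, R5, R4)
```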

7. Simulation

- Comparison algorithms [1]
  - Dependent: Branch without using a heap
  - Cover: Segment without using heaps
- Our algorithms
  - Branch
  - Segment
  - Combined
  - DP (optimal)
- Note: forwarding rules may be updated as often as every 15 minutes, so execution time matters in addition to the cache-hit (miss) ratio

[1] CacheFlow: Dependency-aware rule-caching for software-defined networks (SOSR’16)

Settings

- Data sets
  - CAIDA packet trace: 12,000 forwarding rules
  - Stanford Backbone packet trace: 180,000 forwarding rules
- Metrics
  - Execution time
  - Cache-hit ratio vs. TCAM size
  - Cache-hit ratio vs. number of packets
- Variables
  - TCAM cache size: k = 63~2000

Simulation Results

(Fig. 5: CAIDA packet trace. (a) Algorithm execution time vs. number of forwarding rules (2,000~12,000). (b) Cache-hit ratio vs. TCAM cache size (63~2,000, log scale). (c) Cache-hit ratio vs. number of packets (log scale). Algorithms: Branch, Segment, Combined, DP, Cover, Dependent.)

- DP has a much larger execution time than the others (up to ~6 min)
- Branch is faster than Dependent because of using heaps
- Our four algorithms achieve at least a 79.8% hit ratio with a cache size of 2,000, which is just 1.1% of the total rule table
- DP achieves the best cache-hit ratio

Simulation Results (cont’d)

(Fig. 6: Stanford backbone router trace. (a) Algorithm execution time vs. number of forwarding rules (5,630~180,000, log scale). (b) Cache-hit ratio vs. TCAM cache size (63~2,000, log scale). (c) Cache-hit ratio vs. number of packets (log scale).)

- More rules result in a much larger execution time (DP takes up to ~70 min)
- Our three greedy algorithms achieve better hit ratios than on CAIDA with the same TCAM size because of the rules' deeper dependencies
- For 30 million packets with a cache size of 2,000, DP's cache-hit ratio reaches 90.2%, Combined 89.4%, Segment 83.7%, and Branch 81.9%

8. Conclusion

- Hardware and software switches
- Caching technology
  - Wildcard (*) rule matching
  - Rule dependency constraints
  - Deny rule
  - Limited number of rules in TCAM
- Objective
  - Maximize the cache-hit ratio
- Solutions
  - Three greedy algorithms with approximation ratios
  - Optimal DP solution

Q&A