TU Berlin / TLabs 20Nov ’12


Page 1

TU Berlin / T-Labs

20 Nov ’12

Page 2: We Live in a Connected World

20 Nov 2012

Network

UCL - Doctoral School day in Cloud Computing

Page 3: When Users Notice the Network

Like electricity, we assume it is magically always there

Page 4: Network Failure Example 1: Software Bugs in Inter-Domain Routers

Router type A, Router type B

0-length AS4_PATH attribute!

Protocol-compliant but confusing message

On 19th August 2009, CNCI (AS9354), a small ISP in Japan, advertised a handful of BGP updates containing an empty AS4_PATH attribute

Reset session!

Page 5: …what could possibly go wrong?

[Figure: BGP Update Rate Percentage Increase. Source: Renesys]

10x increase → routing instabilities

Page 6: What Went Wrong: (Cisco) Session Reset Flood

[Figure labels: Unaffected router, Affected router]

Unreachable!

Repeated service disruptions

Innocuous software fault caused Internet-wide outage

Page 7: Network Failure Example 2: Planned Network Maintenance

• Amazon EC2 disruption on 21st April 2011
  – Incorrectly executed network change during a planned network capacity upgrade

Misconfiguration caused catastrophic outage

Page 8: Software- and config-related issues

• Affect even well-tested, standard Internet technology
• With more software in networks, need ways to deal with reliability issues

Page 9: Why is network reliability so difficult to achieve?

Page 10: Networks are Hard to Manage

• New control requirements led to great complexity
  – Network virtualization, VM migration, perf. isolation, …
• Kept working by “Masters of Complexity”
• When things don’t work?
  – Only limited tools: ping, traceroute, tcpdump, SNMP, NetFlow

Page 11: Software-Defined Networking (SDN)

[Figure labels: Control (×4), Third-party control program]

Page 12: SDN Promises

• Advantages over status quo of management
• Reduce complexity
• New functionality through programmability

SDN is great, but …

Page 13: … at the risk of bugs

Page 14: Software Faults

• Will make communication unreliable

• Major hurdle for success of SDN

We need effective ways to test SDN networks

Page 15: Roadmap

➢ Intro
➢ OpenFlow background
➢ NICE [NSDI’12]: systematically testing OpenFlow Apps
➢ SOFT [CoNEXT’12]: automating interop testing of OpenFlow Agents
➢ Conclusions

Page 16: Quick OpenFlow 101

[Figure labels: Host A, Host B, Switch 1, Switch 2, Packet, Controller, OpenFlow program]

Flow Table: Rule 1, Rule 2, …, Rule N

Match         Actions          Counters
Dst: Host B   Fwd: Switch 2    pkts / bytes

Default: forward to controller

Execute packet_in event handler

Install rule; forward packet

System is distributed and asynchronous → can misbehave under corner cases
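The lookup-then-ask-the-controller flow on this slide can be sketched in a few lines of Python. This is a toy model, not the OpenFlow or NOX API: the classes `Rule`, `Switch` and `Controller`, the `next_hop` map and the packet dictionary are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    match_dst: str        # Match: destination host
    action: str           # Action: where to forward
    pkt_count: int = 0    # Counters
    byte_count: int = 0


@dataclass
class Switch:
    name: str
    flow_table: list = field(default_factory=list)

    def receive(self, pkt, controller):
        for rule in self.flow_table:
            if rule.match_dst == pkt["dst"]:
                rule.pkt_count += 1
                rule.byte_count += pkt["size"]
                return rule.action
        # Default: no rule matches, so forward the packet to the controller.
        return controller.packet_in(self, pkt)


class Controller:
    """Hypothetical controller with a static view of where hosts live."""

    def __init__(self, next_hop):
        self.next_hop = next_hop  # {(switch name, dst host): action}

    def packet_in(self, switch, pkt):
        action = self.next_hop[(switch.name, pkt["dst"])]
        # Install rule; forward packet.
        switch.flow_table.append(Rule(match_dst=pkt["dst"], action=action))
        return action


ctrl = Controller({("Switch 1", "Host B"): "Fwd: Switch 2"})
s1 = Switch("Switch 1")
pkt = {"dst": "Host B", "size": 64}
first = s1.receive(pkt, ctrl)    # table miss, handled by the controller
second = s1.receive(pkt, ctrl)   # hit on the rule installed by the controller
```

In this sketch everything happens synchronously; the corner cases the slide warns about appear once rule installation and packet forwarding are delivered as separate, reorderable messages.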

Page 17: Bugs in OpenFlow Apps

[Figure labels: Host A, Host B, Switch 1, Switch 2, Packet, Controller, OpenFlow program]

Install rule

Install rule

Delayed!

Drop packet

Inconsistent distributed state!

Goal: systematically test possible behaviors to detect bugs

Page 18: Roadmap

➢ Intro
➢ OpenFlow background
➢ NICE [NSDI’12]: systematically testing OpenFlow Apps
➢ SOFT [CoNEXT’12]: automating interop testing of OpenFlow Agents
➢ Conclusions

Page 19: Systematically Testing OpenFlow Apps

State-space exploration via Model Checking (MC)

Target system: unmodified OpenFlow program

Complex environment

Environment model: Switch 1, Switch 2, Host A, Host B

• Carefully-crafted streams of packets

• Many orderings of packet arrivals and events

Page 20: Scalability Challenges

Data-plane driven: huge space of possible packets

Complex network behavior: huge space of possible event orderings

Enumerating all inputs and event orderings is intractable

Equivalence classes of packets

Domain-specific search strategies

Page 21: NICE (No bugs In Controller Execution)

Input: unmodified OpenFlow program, network topology, correctness properties (e.g., no loops)

NICE: state-space search

Output: traces of property violations

NICE found 11 bugs in 3 real OpenFlow Apps

Page 22: Model Checking

State-Space Model

[Figure: state-space graph with State 0 through State 9]

Page 23: System State

State:
• Controller (global variables)
• Environment:
  – Switches (flow table): simplified switch model
  – End-hosts (network stack): simple clients/servers
  – Communication channels (in-flight pkts)
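A model checker has to recognize states it has already visited, so the four components listed on this slide are usually captured in a hashable snapshot. The following is a minimal sketch under that assumption; `SystemState`, `freeze` and `snapshot` are hypothetical names and not taken from the NICE code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    controller: tuple   # controller global variables
    switches: tuple     # flow table of each switch
    hosts: tuple        # network-stack state of each end-host
    channels: tuple     # in-flight packets on each communication channel


def freeze(mapping):
    """Turn a dict of lists/scalars into a sorted, hashable tuple."""
    return tuple(
        sorted((key, tuple(val) if isinstance(val, list) else val)
               for key, val in mapping.items())
    )


def snapshot(ctrl_vars, flow_tables, host_state, in_flight):
    return SystemState(freeze(ctrl_vars), freeze(flow_tables),
                       freeze(host_state), freeze(in_flight))


a = snapshot({"mactable": []}, {"s1": []}, {"A": "idle"}, {"A->s1": ["pkt A"]})
b = snapshot({"mactable": []}, {"s1": []}, {"A": "idle"}, {"A->s1": ["pkt A"]})
c = snapshot({"mactable": []}, {"s1": []}, {"A": "idle"}, {"A->s1": []})
visited = {a}
```

Two snapshots built from equal components compare equal and hash identically, which is what lets a search skip states it has already expanded.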

Page 24: Transition System

[Figure: state-space graph with State 0 through State 9; one transition is labeled "ctrl packet_in(pkt B)"]

Run actual packet_in handler

Data-dependent transitions!
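State-space exploration over such a transition system can be sketched as a breadth-first search that applies every enabled transition and checks a property in each state. This is a generic explicit-state search written for illustration, not the NICE implementation; the toy model below it (a rule install racing with a forwarded packet) and all names are assumptions.

```python
from collections import deque


def explore(initial, enabled, apply, check):
    """Breadth-first search over all reachable states.

    enabled(state)  -> list of transitions available in that state
    apply(state, t) -> successor state (must be hashable)
    check(state)    -> True if the correctness property holds
    Returns the transition trace to the first violating state, or None.
    """
    queue = deque([(initial, ())])
    visited = {initial}
    while queue:
        state, trace = queue.popleft()
        if not check(state):
            return list(trace)
        for t in enabled(state):
            nxt = apply(state, t)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, trace + (t,)))
    return None


# Toy model: state = (rule installed at Switch 2?, packet location).
def enabled(state):
    rule_at_s2, pkt = state
    transitions = []
    if not rule_at_s2:
        transitions.append("install_rule_s2")
    if pkt == "s1":
        transitions.append("forward_pkt")
    return transitions


def apply(state, t):
    rule_at_s2, pkt = state
    if t == "install_rule_s2":
        return (True, pkt)
    # forward_pkt: Switch 2 drops the packet if its rule has not arrived yet.
    return (rule_at_s2, "host_b" if rule_at_s2 else "dropped")


trace = explore((False, "s1"), enabled, apply, lambda s: s[1] != "dropped")
```

Because the search tries both orderings of the two events, the ordering in which the packet overtakes the rule installation is found as readily as the common one.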

Page 25: Combating Huge Space of Packets

Packet arrival handler (input: pkt):
• is dst broadcast?
  – yes: flood packet
  – no: dst in mactable?
    · no: flood packet
    · yes: install rule and forward packet

Equivalence classes of packets:
1. Broadcast destination
2. Unknown unicast destination
3. Known unicast destination

Code itself reveals equivalence classes of packets
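Written out as code, the handler's two branches make the three equivalence classes explicit. This is a minimal sketch of a MAC-learning style handler; the function names, the packet dictionary and the MAC addresses are illustrative assumptions.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def packet_arrival(pkt, mactable):
    """Return the action taken for pkt; mactable maps MAC address -> port."""
    if pkt["dst"] == BROADCAST:
        return "flood"
    if pkt["dst"] not in mactable:
        return "flood"
    return "install rule and forward"


def equivalence_class(pkt, mactable):
    """Number the classes as on the slide."""
    if pkt["dst"] == BROADCAST:
        return 1   # broadcast destination
    if pkt["dst"] not in mactable:
        return 2   # unknown unicast destination
    return 3       # known unicast destination


mactable = {"00:00:00:00:00:0a": 1}
packets = [
    {"dst": BROADCAST},
    {"dst": "00:00:00:00:00:0b"},
    {"dst": "00:00:00:00:00:0a"},
]
classes = [equivalence_class(p, mactable) for p in packets]
actions = [packet_arrival(p, mactable) for p in packets]
```

Any two packets in the same class drive the handler down the same path, so one representative per class is enough to exercise it.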

Page 26: Code Analysis: Symbolic Execution (SE)

Packet arrival handler, run on a symbolic packet λ:
• is λ.dst broadcast?
  – yes: flood packet
    Path constraint: λ.dst ∈ {Broadcast}
  – no (λ.dst ∉ {Broadcast}): λ.dst in mactable?
    · no: flood packet
      Path constraint: λ.dst ∉ {Broadcast} ∧ λ.dst ∉ mactable
    · yes: install rule and forward packet
      Path constraint: λ.dst ∉ {Broadcast} ∧ λ.dst ∈ mactable
      Infeasible from initial state

1 path = 1 equivalence class of packets = 1 packet to inject
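The idea of "one path, one packet to inject" can be illustrated without a constraint solver. The sketch below is a brute-force stand-in for symbolic execution: it runs the branch predicates over a small candidate domain and keeps one witness per distinct path constraint. The names `PREDICATES` and `feasible_paths` and the candidate addresses are assumptions for illustration.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

# Branch predicates of the handler, in the order the handler evaluates them.
PREDICATES = [
    ("dst is broadcast", lambda dst, mactable: dst == BROADCAST),
    ("dst in mactable", lambda dst, mactable: dst in mactable),
]


def feasible_paths(mactable, candidates):
    """Map each feasible path constraint to one concrete witness packet."""
    paths = {}
    for dst in candidates:
        constraint = []
        for name, pred in PREDICATES:
            outcome = pred(dst, mactable)
            constraint.append((name, outcome))
            if name == "dst is broadcast" and outcome:
                break  # the handler floods and returns; no further branch
        # Keep only the first witness found for each path.
        paths.setdefault(tuple(constraint), dst)
    return paths


candidates = [BROADCAST, "00:00:00:00:00:0a", "00:00:00:00:00:0b"]
initial = feasible_paths({}, candidates)
later = feasible_paths({"00:00:00:00:00:0a": 1}, candidates)
known_unicast = (("dst is broadcast", False), ("dst in mactable", True))
```

With an empty mactable only two paths have a witness, which matches the slide's note that the known-unicast path is infeasible from the initial state; once the controller has learned an address, the third path becomes reachable.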

Page 27: Combining SE with Model Checking

[Figure: state-space graph with State 0 through State 4 and transitions labeled "host send(pkt A)", "host discover_packets" and "host send(pkt B)"; annotations "Controller state 1" and "Controller state changes"]

discover_packets transition: symbolic execution of packet_in handler

New packets enable new transitions: host / send(pkt B), host / send(pkt C)

Page 28: Combating Huge Space of Orderings

OpenFlow-specific search strategies for up to 20x state-space reduction:

MC + SE

FLOW-IR

NO-DELAY

UNUSUAL

Page 29: Specifying App Correctness

• Library of common properties
  – No forwarding loops
  – No black holes
  – Direct paths (no unnecessary flooding)
  – Etc…

• Correctness is app-specific in nature
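As an illustration of what one of these library properties might check, here is a minimal sketch of a "no forwarding loops" test over the hops a single packet has taken. Representing a hop as a (switch, input port) pair is an assumption made for this example, not necessarily how NICE encodes it.

```python
def has_forwarding_loop(hops):
    """hops: sequence of (switch, in_port) pairs visited by one packet."""
    seen = set()
    for hop in hops:
        if hop in seen:
            return True   # the packet re-entered a switch on the same port
        seen.add(hop)
    return False


direct = [("s1", 1), ("s2", 1)]
looping = [("s1", 1), ("s2", 1), ("s1", 1)]
```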

Page 30: API to Define App-Specific Properties

[Figure: transition from State 0 to State 1 labeled "ctrl packet_in(pkt A)"]

    def init():
        init local vars
        register("packet_in")

    def on_packet_in():
        check system-wide state

Register callbacks to observe transitions

Execute after transitions
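The pseudocode above can be turned into a small runnable sketch. Everything here is hypothetical: `PropertyRegistry` stands in for whatever the checker exposes, and `SameReplicaProperty` is an example property in the spirit of "all packets of the same request go to the same server replica".

```python
class PropertyRegistry:
    """Hypothetical stand-in for the checker side of the API."""

    def __init__(self):
        self.callbacks = {}

    def register(self, event, callback):
        self.callbacks.setdefault(event, []).append(callback)

    def fire(self, event, state):
        """Called by the checker after it executes a transition."""
        for callback in self.callbacks.get(event, []):
            callback(state)


class SameReplicaProperty:
    """All packets of one request must go to the same replica."""

    def __init__(self, registry):
        self.replica_of = {}    # init local vars
        self.violations = []
        registry.register("packet_in", self.on_packet_in)

    def on_packet_in(self, state):
        # Check system-wide state after the transition.
        request, replica = state["request"], state["replica"]
        first = self.replica_of.setdefault(request, replica)
        if first != replica:
            self.violations.append((request, first, replica))


registry = PropertyRegistry()
prop = SameReplicaProperty(registry)
registry.fire("packet_in", {"request": "r1", "replica": "web1"})
registry.fire("packet_in", {"request": "r1", "replica": "web2"})
```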

Page 31: Prototype Implementation

• Built a NICE prototype in Python
• Target the Python API of NOX

[Figure labels: Unmodified OpenFlow program, Stub NOX API, NICE, Controller state & transitions]

Page 32: Experiences

• Tested 3 unmodified NOX OpenFlow Apps
  – MAC-learning switch
  – LB: Web server load balancer [Wang et al., HotICE’11]
  – TE: Energy-aware traffic engineering [CoNEXT’11]

• Setup
  – Iterated with 1, 2 or 3-switch topologies; 1, 2, … pkts
  – App-specific properties
    • LB: All packets of same request go to same server replica
    • TE: Use appropriate path based on network load

Page 33: Results

• NICE found 11 property violations → bugs
  – Few secs to find 1st violation of each bug (max 30m)
  – Few simple mistakes (not freeing buffered packets)
  – 3 insidious bugs due to network race conditions

Page 34: Take Aways

• Why were mistakes easy to make?
  – Centralized programming model is only an abstraction

• Why could the programmer not detect them?
  – Bugs don’t always manifest
  – TCP masks transient packet loss
  – Platform lacks runtime checks

• Why did NICE easily find them?
  – Makes corner cases as likely as normal cases

Page 35: Roadmap

➢ Intro
➢ OpenFlow background
➢ NICE [NSDI’12]: systematically testing OpenFlow Apps
➢ SOFT [CoNEXT’12]: automating interop testing of OpenFlow Agents
➢ Conclusions

Page 36: Interoperability at Deployment Time

OpenFlow program

NICE: systematic testing

Mininet: testbed testing

Page 37: Interoperability at Deployment Time

Release

OpenFlow program

OpenFlow messages

One OpenFlow API specification… Are OF switches interoperable?

Interop is critical for the success of SDN

Page 38: Interop: How Hard Can It Be?

OF Switch:
• ASIC switch chip
• OS
• OpenFlow Agent

Page 39: Definition of Interoperability*

“Being able to accomplish end-user applications using different types of systems, whose interfaces are completely understood, in a manner that requires the user to have little or no knowledge of the unique characteristics of those systems”

* NB: Many other definitions exist

Page 40: Interop: How Hard Can It Be?

OF Switch

Inputs:
• Packets (“Forwarding” interface)
• OpenFlow messages (OpenFlow interface)

Hardware correctness is formally verified

• ASIC switch chip
• OS
• OpenFlow Agent: likely source of OpenFlow interop issues

Flow Table, Hardware Abstraction Layer

Page 41: OpenFlow Software Agent

Specifications
• Rapid flux (3 revisions in ~1 year)
• Ambiguities (FlowMod is 2.5 pages long)

Specifications → Implementation
• Implementation freedom
• Vendors may not follow the specs

Testing, testing and testing…

Switch software is not provably correct ☹


Page 43: Interop’12 Testing Event

Excerpt from the Open Networking Foundation White Paper (page 8):

… issues will be disclosed in this document. Many of the products tested are not commercially available yet.

The original proposal was to focus on test cases that applied to service provider, data center and enterprise use-cases. Vendors of controllers and applications running on controllers really determine what can be tested. For this event we were able to test:

• Topology discovery (LLDP method)
• Layer 2 Ethernet/VLAN path (circuit) provisioning (primary and backup)
• Layer 3 (IP) learning (shortest path primary and backup path)
• Layer 3 (IP) load balancing
• Enabling multi-controller connectivity using FlowVisor to slice the network

Each one of these applications requires the switches to support the OpenFlow v1.0 protocol.

Testing at the Interoperability Event

• Gather various vendors in Vegas
• Hook up switches and controllers
• Create and run test cases
• See what breaks and …

• Very high manual effort
• Test cases are not exhaustive
• It is not a one time thing

What happens in Vegas, stays in Vegas

Page 44: Automating Interop Testing

Insight: systematically crosscheck OF implementations

Page 45: The 10,000 foot view

Test inputs

Input-driven execution: OF Agent 1, OF Agent 2

Observable behaviors

Inconsistency!

Page 46: Challenges

• Manage test inputs and coverage efficiently
  – Or manage “path explosion”

• Capture behaviors

• Avoid simultaneous access to all code

Page 47: SOFT (Systematic OpenFlow Testing)

Test inputs

Input-driven execution: OF Agent 1, OF Agent 2

Observable behaviors

Determine mapping inputs → behaviors through symbolic execution

Identify inconsistencies

• Automated solution to interop testing
• Systematic code coverage
• No simultaneous access to all agents

Page 48: Structured Inputs

[Figure: two input messages, FLOW MOD (1.0, N1) and STAT REQ (1.0, N2), with their remaining fields marked "*"; labels C1 and C2]

Further reductions
• Some inputs are independent
• Many inputs are entirely concrete
• Small number of messages
• Concrete values at cost of completeness

Page 49: Capturing Behaviors

Externally observable outputs
• OpenFlow reply messages
• Data plane packets
• Normalize harmless nondeterminism (e.g., Buffer IDs)

Internal state changes affect successive inputs
• Use concrete probe packets
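One way to normalize harmless nondeterminism such as buffer IDs is to rename each ID by its order of first appearance before comparing traces. This is a minimal sketch under that assumption; the message dictionaries and field names are illustrative and not SOFT's actual trace format.

```python
def normalize(trace):
    """Rename buffer IDs to their order of first appearance."""
    renamed = {}
    out = []
    for msg in trace:
        msg = dict(msg)  # copy so the original trace stays untouched
        if "buffer_id" in msg:
            msg["buffer_id"] = renamed.setdefault(msg["buffer_id"], len(renamed))
        out.append(msg)
    return out


agent1_trace = [{"type": "PACKET_IN", "buffer_id": 256},
                {"type": "ERROR", "code": "BAD_PORT"}]
agent2_trace = [{"type": "PACKET_IN", "buffer_id": 7},
                {"type": "ERROR", "code": "BAD_PORT"}]
same_behavior = normalize(agent1_trace) == normalize(agent2_trace)
```

After normalization the two traces compare equal even though the agents picked different buffer IDs, so only genuine behavioral differences remain.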

Page 50: Example

Agent 1:

    if ( p == OFPP_CTRL )
        send_to_ctrl( )
    else if ( p < 25 )
        send_to_port( p )
    else
        error( BAD_PORT )

Behaviors over p: FWD for 1 to 24; ERR from 25 up to 65535, except CTRL at p == OFPP_CTRL

Agent 2:

    if ( p < 25 )
        send_to_port( p )
    else
        error( BAD_PORT )

Behaviors over p: FWD for 1 to 24; ERR for 25 to 65535

Page 51: N-version Comparison

Agent 1: FWD for p in 1 to 24; ERR from 25 up to 65535, except CTRL at p == OFPP_CTRL

Agent 2: FWD for p in 1 to 24; ERR for p in 25 to 65535

Page 52: N-version Comparison

p:         1 … 24   25 …   OFPP_CTRL   … 65535
Agent 1:   FWD      ERR    CTRL        ERR
Agent 2:   FWD      ERR    ERR         ERR
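The comparison of the two agents can be reproduced with a short script. This sketch enumerates all port values by brute force instead of using symbolic execution, which is feasible only because the input is a single 16-bit field. `OFPP_CTRL = 0xFFFD` assumes the slide's `OFPP_CTRL` abbreviates `OFPP_CONTROLLER` from the OpenFlow 1.0 header; the function names are illustrative.

```python
OFPP_CTRL = 0xFFFD  # assumed: OFPP_CONTROLLER in OpenFlow 1.0


def agent1(p):
    if p == OFPP_CTRL:
        return "CTRL"
    elif p < 25:
        return "FWD"
    return "ERR"


def agent2(p):
    return "FWD" if p < 25 else "ERR"


def partition(agent, ports):
    """Group consecutive ports with equal behavior into (first, last, behavior)."""
    ranges = []
    for p in ports:
        behavior = agent(p)
        if ranges and ranges[-1][2] == behavior and ranges[-1][1] == p - 1:
            ranges[-1] = (ranges[-1][0], p, behavior)
        else:
            ranges.append((p, p, behavior))
    return ranges


ports = range(1, 65536)
ranges1 = partition(agent1, ports)
ranges2 = partition(agent2, ports)
inconsistencies = [p for p in ports if agent1(p) != agent2(p)]
```

The only input on which the agents disagree is the controller port: Agent 1 sends the packet to the controller, Agent 2 reports an error.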

Page 53: Results

➢ Compared
  ➢ OpenFlow 1.0 Switch Reference Implementation
  ➢ Open vSwitch 1.0.0

➢ Input sequences containing 1 to 4 messages

Page 54: Results

• Found 7 classes of inconsistencies
• Mostly related to message validation
• Result of underspecification
  ➢ No expected behavior in the specification
  ➢ Inconsistent interpretation of the specification

Page 55: Results - Example

FlowMod message
1. Modify VLAN to value greater than 2^12
2. Forward packet

Reference Implementation
1. Trim VLAN value to 12 bits
2. Install the rule

Open vSwitch
1. Silently ignore the message

Network in 2 different states
Which one is assumed by the controller?
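The divergence can be made concrete with a small model of the two behaviors described on this slide. The functions below only mimic that description and are not code from either implementation; the rule dictionaries and the example value 5000 are assumptions.

```python
VLAN_BITS = 12
VLAN_MASK = (1 << VLAN_BITS) - 1   # 0xFFF


def reference_impl(flow_table, vlan):
    """Trims the VLAN value to 12 bits, then installs the rule."""
    flow_table.append({"set_vlan": vlan & VLAN_MASK, "then": "forward"})


def open_vswitch(flow_table, vlan):
    """Silently ignores a FlowMod whose VLAN value does not fit in 12 bits."""
    if vlan > VLAN_MASK:
        return
    flow_table.append({"set_vlan": vlan, "then": "forward"})


ref_table, ovs_table = [], []
reference_impl(ref_table, 5000)
open_vswitch(ovs_table, 5000)
```

After the same FlowMod, one table holds a rule that rewrites the VLAN to 904 (5000 masked to 12 bits) and the other holds no rule at all, while the controller receives no error from either switch.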

Page 56: Conclusions

NICE automates the testing of OpenFlow Apps

SOFT automates interop testing of OpenFlow Agents

SDN: a new role for software tool chains to make networks more dependable.

NICE and SOFT are a step in this direction!

http://code.google.com/p/nice-of/

Page 57: Thanks

Peter Perešíni (EPFL)

Maciej Kuźniar (EPFL)

Daniele Venzano (EPFL)

Dejan Kostić (EPFL → IMDEA Networks)

Jennifer Rexford (Princeton)

Page 58: Thank you! Questions?