
Graph mining for log data

David Andrzejewski - @davidandrzej
Data Sciences, Sumo Logic
Strata, Hardcore Data Science Track
February 18, 2015

This talk: Graph Mining + Log Data

YES: logs, graph mining, application examples
NO: tools, scaling

Graphs!

Nodes
Edges
  –  undirected
  –  directed
Components
Paths/reachability
Subgraphs
Degree
Labels

[Figure: example graph with node degree labels 1, 3, 2, 2]
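The vocabulary above (nodes, edges, degree, reachability, components) can be made concrete with a minimal sketch. The graph below is hypothetical, and the helper names `build_adjacency`, `degree`, and `reachable` are illustrative only, not taken from the talk.

```python
from collections import deque

# Hypothetical undirected graph: two components, {a, b, c, d} and {e, f}.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d"), ("e", "f")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbors (undirected)."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Degree = number of edges touching the node."""
    return len(adj[node])


def reachable(adj, start):
    """Return every node reachable from start, using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen


adj = build_adjacency(edges)
degrees = {n: degree(adj, n) for n in sorted(adj)}
component_of_a = reachable(adj, "a")
```

In an undirected graph, the set of nodes reachable from a node is exactly its connected component, which is why `reachable` doubles as a component finder here.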

Graph    Nodes      Edges
Social   People     Friendship
Web      Pages      Links
System   Services   API Calls

[Figure labels: Documents, Politics]
[Figure labels: API, Auth, User, Org]

Anatomy of a log message: Five W’s

When? Timestamp with time zone
Where? Host, module, code location
Who? Authentication context
What? Log level and key-value pairs

Context:  Sumo  Logic  


“Turning Machine Data Into IT and Business Insights”  

Interactions / connections in log data

Human – Machine
  –  behavior analysis
    •  business intelligence
    •  security
Machine – Machine
  –  API calls
    •  ops / troubleshooting
Human – Human
  –  not usually logged...yet

Use case: troubleshooting

User action webID=7F92
Initiating requestID=082A for webID=7F92 …
… orderID=34C8 received for requestID=082A …
Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …
ERROR accountID=1234 not found!
PROCESSING FAILED: webID=79F92

Connected components

Parse fields $\ell_{i1}, \ell_{i2}, \ldots$ from each log event $\ell_i$
Build graph
  –  nodes = each log event $\ell_i$
  –  edges = do a pair of logs match on any field? $e_{ij} = \bigvee_k \{\ell_{ik} = \ell_{jk}\}$
Calculate undirected connected components ($O(n)$)
Output: partition over $\ell_i$

Retrieving userID=11D2 for requestID=082A …
… accountID=1234 access, userID=11D2 …
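The connected-components recipe can be sketched in a few lines of Python. This is only an illustration under stated assumptions: fields are taken to be `nameID=value` pairs extracted with a regular expression, the grouping uses a union-find structure, and the names `parse_fields` and `connected_components` are hypothetical. Instead of comparing every pair of logs, each log is linked to the first earlier log that carried the same field value, which yields the same components.

```python
import re
from collections import defaultdict

# Assumption: every field of interest looks like "<name>ID=<value>".
FIELD_RE = re.compile(r"(\w+ID)=(\w+)")


def parse_fields(line):
    """Return the set of (field name, value) pairs found in one log line."""
    return set(FIELD_RE.findall(line))


def connected_components(lines):
    """Partition log line indices; lines sharing any field value are linked."""
    parent = list(range(len(lines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # (field name, value) -> index of first line carrying it
    for i, line in enumerate(lines):
        for field in parse_fields(line):
            if field in first_seen:
                union(first_seen[field], i)
            else:
                first_seen[field] = i

    groups = defaultdict(list)
    for i in range(len(lines)):
        groups[find(i)].append(i)
    return sorted(groups.values())


logs = [
    "User action webID=7F92",
    "Initiating requestID=082A for webID=7F92",
    "orderID=34C8 received for requestID=082A",
    "Retrieving userID=11D2 for requestID=082A",
    "accountID=1234 access, userID=11D2",
    "ERROR accountID=1234 not found!",
    "User action webID=0000",  # hypothetical unrelated event
]
components = connected_components(logs)
```

With these sample lines, the first six events chain together through shared `webID`, `requestID`, `userID`, and `accountID` values, while the last event forms its own component.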

Distributed systems tracing infrastructure

Dapper (Google)
Zipkin (Twitter)
X-Trace (UC-Berkeley)
inCapacity (LinkedIn)
Erlang / Akka
Commercial products

Use case: online shopping

User interactions
  –  state transition graph
  –  internal call cascades
Goals: identify unusual...
  –  ... user behavior
  –  ... service behavior

[Figure: state transition graph with states Login, Browse, Add to cart, Check out]

Use case: online shopping

identify visits (eg, connected components)
“featurize”
statistical modeling / machine learning

Visit   Login   Browse   Cart   Checkout
37CF    1       7        1      0
5450    0       3        2      1
A84B    2       1        1347   0
...     ...     ...      ...    ...
FF71    2       13       2      0
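A "node-wise" featurization like the table above can be sketched as a per-visit count of how often each state occurs. The event sequences below are hypothetical, chosen only so that their counts match the first two table rows, and `nodewise_features` is an illustrative name.

```python
from collections import Counter

# Column order of the fixed-length feature vector.
STATES = ["Login", "Browse", "Cart", "Checkout"]


def nodewise_features(events):
    """Count how many times each state appears in one visit."""
    counts = Counter(events)
    return [counts[state] for state in STATES]


# Hypothetical event sequences, one per visit ID.
visits = {
    "37CF": ["Login"] + ["Browse"] * 7 + ["Cart"],
    "5450": ["Browse", "Cart", "Browse", "Cart", "Browse", "Checkout"],
}
features = {visit_id: nodewise_features(ev) for visit_id, ev in visits.items()}
```

Every visit maps to a vector of the same length regardless of how many events it contains, which is what lets standard statistical or machine learning tools consume it.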

Use case: online shopping

Alternative featurization
  –  previous: “node-wise”
  –  alternative: “edge-wise”

Visit   Login > Browse   Browse > Cart   Cart > Browse   Browse > Checkout   Login > Checkout   ...
37CF    1                7               1               0                   0                  ...
5450    1                3               2               1                   0                  ...
A84B    0                0               0               0                   799                ...
...     ...              ...             ...             ...                 ...                ...
FF71    1                13              2               0                   ...                ...
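The "edge-wise" alternative counts transitions between consecutive states rather than the states themselves. A minimal sketch, assuming each visit is an ordered list of state names; the sequences and the name `edgewise_features` are hypothetical.

```python
from collections import Counter

# Column order of the feature vector: (from state, to state).
TRANSITIONS = [
    ("Login", "Browse"),
    ("Browse", "Cart"),
    ("Cart", "Browse"),
    ("Browse", "Checkout"),
    ("Login", "Checkout"),
]


def edgewise_features(events):
    """Count each listed transition between consecutive events in a visit."""
    counts = Counter(zip(events, events[1:]))
    return [counts[t] for t in TRANSITIONS]


# Hypothetical visits: an ordinary shopper and a repeated login/checkout loop.
shopper = ["Login", "Browse", "Cart", "Browse", "Cart", "Browse", "Checkout"]
looper = ["Login", "Checkout"] * 3

shopper_features = edgewise_features(shopper)
looper_features = edgewise_features(looper)
```

Transitions that are not listed in `TRANSITIONS` (for example Checkout > Login) are simply ignored here; a fuller version might add a column for every observed pair.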

ML / stats detour: fixed-length feature vectors

Fisher Iris dataset (1936)

Sepal length   Sepal width   Petal length   Petal width   Species
5.0            3.5           1.6            0.6           I. setosa
5.9            3.2           4.8            1.8           I. versicolor
6.1            2.6           5.6            1.4           I. virginica
...            ...           ...            ...           ...

Photo: Danielle Langlois

Always Be Featurizing

Target entity   Features                                Applications
Node            properties, connectivity, neighbors     compromised machine
Edge            properties, nodes, node features        high latency, rare connect
Graph           nodes / edges, connectivity, subgraph   failed session, misbehavior

Use case: unusual remote access detection

Remote access (eg, SSH) graphs
Are our observations “typical”?
  –  machine-edge: connect from host X to host Y?
  –  graph: maximum depth / path length?
  –  user-edge: that user A connects to host X?
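One simple way to ask whether a machine-edge is "typical" is to compare it against the empirical frequency of edges in past observations. The sketch below assumes a history of `(source host, destination host)` pairs; the host names, the 0.1 threshold, and the function names are all hypothetical. The same code would apply to user-edges by using `(user, host)` pairs instead.

```python
from collections import Counter


def edge_frequencies(history):
    """Empirical probability of each (source, destination) edge in the history."""
    counts = Counter(history)
    total = sum(counts.values())
    return {edge: n / total for edge, n in counts.items()}


def is_unusual(edge, freqs, threshold=0.1):
    """Flag edges that were never seen or seen less often than the threshold."""
    return freqs.get(edge, 0.0) < threshold


# Hypothetical history of 20 remote access connections.
history = (
    [("bastion", "web1")] * 12
    + [("bastion", "web2")] * 7
    + [("web1", "db1")]
)
freqs = edge_frequencies(history)

common = is_unusual(("bastion", "web1"), freqs)
rare = is_unusual(("web1", "db1"), freqs)
unseen = is_unusual(("web2", "db1"), freqs)
```

A raw frequency threshold is only a starting point; it treats a never-seen edge and a rarely-seen edge alike, which may or may not be what a given deployment wants.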

Use case: understanding internal API calls

GOAL: understand usage of (expensive!) internal service
  –  each observation is an invoking call graph
How are different invocations...
  –  ...the same?
  –  ...different?

Frequent substructure mining

given a collection of graphs
return sub-graphs which occur in $\geq T$ graphs
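Full frequent-subgraph miners are involved, so the following is a deliberately simplified stand-in rather than the method used in the talk. It assumes node labels (service names) are unique within each graph, so a candidate substructure is just a set of labeled edges and "occurs in" reduces to a subset check. Only single edges and connected pairs of edges are considered. The call graphs and the name `frequent_substructures` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_substructures(graphs, min_support):
    """Return edge sets (size 1 or 2, connected) found in >= min_support graphs."""
    support = Counter()
    for edge_list in graphs:
        edges = set(edge_list)
        # Size-1 candidates: every edge of this graph.
        candidates = {frozenset([e]) for e in edges}
        # Size-2 candidates: pairs of edges that share a node (connected).
        for e1, e2 in combinations(sorted(edges), 2):
            if set(e1) & set(e2):
                candidates.add(frozenset([e1, e2]))
        # Each graph contributes at most 1 to a candidate's support.
        support.update(candidates)
    return {sub: n for sub, n in support.items() if n >= min_support}


# Hypothetical call graphs as lists of (caller, callee) edges.
graphs = [
    [("api", "auth"), ("api", "db")],
    [("api", "auth"), ("api", "cache")],
    [("api", "auth"), ("api", "cache"), ("api", "db")],
]
in_all = frequent_substructures(graphs, min_support=3)
in_most = frequent_substructures(graphs, min_support=2)
```

Here the `api -> auth` edge appears in every graph, while substructures involving `cache` or `db` appear in only some, so they separate one kind of invocation from another.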

Use case: understanding internal API calls

Frequent subgraphs presence/absence as feature
  –  very common: “infrastructural” stuff
  –  somewhat common: different usage modes

Request      Auth   Cache   Shadow path
Standard     1      0       0
Optimized    1      1       0
“Shadowed”   1      0       1

Feature-based graph mining strategy

Step                              Example
1. Determine your goal            ID unusual access
2. Build graph representation     Remote access graph
3. Frame question graphically     High out-degree?
4. “Featurize” graph element(s)   Node → Out-degree
5. Apply modeling to features     Fit parametric model

Domain knowledge, Graph mining, Stats / ML / data mining

Node   Out Degree
A      2
B      0
C      76
...    ...
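Steps 4 and 5 can be sketched end to end: compute each node's out-degree, fit a simple parametric model to baseline out-degrees, and flag nodes whose out-degree is improbable under that model. The slides say only "fit parametric model"; the Poisson distribution, the baseline values, the 0.001 cutoff, and the function names below are all assumptions made for illustration.

```python
import math
from collections import Counter


def out_degrees(edges):
    """Out-degree of every node that appears in a directed edge list."""
    deg = Counter({node: 0 for edge in edges for node in edge})
    for src, _dst in edges:
        deg[src] += 1
    return dict(deg)


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed term by term in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, k + 500):
        total += math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
    return min(total, 1.0)


# Hypothetical baseline out-degrees from a period considered normal.
baseline = [2, 0, 1, 3, 2, 1, 0, 2]
lam = sum(baseline) / len(baseline)  # Poisson maximum-likelihood estimate

# Hypothetical remote access graph: A connects twice, C connects 76 times.
edges = [("A", "B"), ("A", "C")] + [("C", f"host{i}") for i in range(76)]
degrees = out_degrees(edges)

flagged = sorted(n for n, d in degrees.items() if poisson_tail(d, lam) < 1e-3)
```

Fitting on a separate baseline matters: if the extreme node were included in the fit, it would inflate the estimated rate and make itself look less unusual.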

Acknowledgements, etc

Team: Jack Cheng, Martin Castellanos, Leo Gau, Yuchen Zhao, Ariel Smoliar

We’re selling!

We’re recruiting!

Use case: understanding internal API calls

Alternate approach: spectral clustering
Services architecture graph

Use case: customer behavior modeling

IDEA: treat visits as graphs
  –  features: node, edge, graph!
  –  labels: did they signup / convert / etc?
