pcap graphs for cybersecurity and system tuning
© Cloudera, Inc. All rights reserved.
Mirko Kämpf Solutions Architect
Network Traffic Analysis of Hadoop Clusters
Understand the common usage patterns and identify typical / atypical workloads
Marton Balassi Solutions Architect
Outline
• Motivation
• PCAP data capture
• Data Analysis with CDH
• Data Analysis with Gephi
• Summary
Understand the network load of a Hadoop cluster
• Network communication is often the limiting factor in distributed computing
• Storing files on DFS, heartbeats, data processing all have a footprint
• The current standard visual tools aggregate data on the host level
• Intrusion detection is critical in enterprise systems (Apache Spot)
PCAP Data Capture
• Packet capture, the standard API for capturing network traffic
• Implementations: Libpcap for UNIX, WinPcap for Windows
• Multiple analysis tools: tcpdump, nmap, Wireshark, Snort, amongst others
Our approach:
• Used pcapy, the Python pcap extension, for capturing
• The capturing is initiated on the individual machines
• The captured data is written to the local fs in Avro format while the capturing is active
• Focus on network structure, packet data is ignored
Avro schema for PCAP data
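The idea of keeping only the network structure and ignoring the packet data can be sketched as follows. This is a minimal illustration using only the Python standard library: it takes the raw bytes of one Ethernet frame (as a capture library would hand them over) and reduces them to a flat record. The function name `parse_endpoints` and the record fields are hypothetical; they are not the schema shown on the slide.

```python
import socket
import struct

ETH_HEADER_LEN = 14      # destination MAC, source MAC, EtherType
ETHERTYPE_IPV4 = 0x0800
PROTO_TCP, PROTO_UDP = 6, 17


def parse_endpoints(frame: bytes, ts: float):
    """Reduce one raw Ethernet frame to its endpoints (hypothetical record layout).

    Returns None for anything that is not IPv4 carrying TCP or UDP.
    The payload is never inspected; only addresses, ports and length are kept.
    """
    if len(frame) < ETH_HEADER_LEN + 20:
        return None
    (eth_type,) = struct.unpack("!H", frame[12:14])
    if eth_type != ETHERTYPE_IPV4:
        return None
    ip = ETH_HEADER_LEN
    ihl = (frame[ip] & 0x0F) * 4          # IPv4 header length in bytes
    proto = frame[ip + 9]
    if proto not in (PROTO_TCP, PROTO_UDP):
        return None
    src_ip = socket.inet_ntoa(frame[ip + 12:ip + 16])
    dst_ip = socket.inet_ntoa(frame[ip + 16:ip + 20])
    l4 = ip + ihl                         # start of the TCP/UDP header
    if len(frame) < l4 + 4:
        return None
    src_port, dst_port = struct.unpack("!HH", frame[l4:l4 + 4])
    return {
        "ts": ts,
        "src_ip": src_ip,
        "src_port": src_port,
        "dst_ip": dst_ip,
        "dst_port": dst_port,
        "length": len(frame),
    }
```

Records of this shape are small enough to be appended to a local file while the capture is still running, which is what makes the later graph analysis cheap.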
Apache Spot
• Formerly known as ONI
• Initiative of Cloudera, Intel, and partners
• Focus on Cybersecurity for the Hadoop domain
• Common data formats for advanced analytics
• Reliable and robust data (ingestion) pipelines
• Repeatable and reliable analysis and modeling procedures
• Apache Spot uses a topic-model (LDA) approach to classify traffic
• We focus on clustering and visualization of typical workloads in this talk instead.
Our Activities
We implemented multiple ‘typical workloads’ and observed their behavior.
• Create reference data sets (PCAP data):
  • Scenario A: TeraSort (Big-Batch-Workload)
  • Scenario B: HDFS PUT, GET; HUE (Interactive Workload)
  • Scenario C: Idle cluster (Vacation time)
  • Scenario D: Kafka => Spark => HDFS (Realistic production Workload)
  • Scenario E: Twitter => Spark => HDFS (Realistic production Workload)
How it Works ...
• We collect raw data in Avro format, using the Snaffer (pcapy) script.
• We transform the events to networks, using Hive (SQL API on Hadoop).
• We analyze and visualize the networks using Gephi (open graph viz platform).
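The step from events to networks is essentially a grouped count. The sketch below shows that aggregation in plain Python rather than Hive, and writes the result as an edge list with the `Source`, `Target`, `Weight` columns that Gephi's spreadsheet import understands. The event field names and both function names are assumptions made for this illustration.

```python
import csv
import io
from collections import Counter


def build_edges(events):
    """Count packets per (source host, target host) pair.

    Roughly what a query of the form
      SELECT src_ip, dst_ip, COUNT(*) FROM <events table> GROUP BY src_ip, dst_ip
    would return; the table layout is an assumption.
    """
    weights = Counter()
    for ev in events:
        weights[(ev["src_ip"], ev["dst_ip"])] += 1
    return weights


def to_gephi_csv(weights):
    """Serialize weighted edges as CSV text with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (src, dst), weight in sorted(weights.items()):
        writer.writerow([src, dst, weight])
    return buf.getvalue()
```

The edge weight here is a packet count; summing a byte-length field instead would give a traffic-volume network from the same events.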
Initial Results: TeraSort
Initial Results: Twitter Collect
Let’s have a look inside ...
• Use a higher resolution: include ports in addition to hosts
• Use time-dependent analysis: track time stamps per packet
• Combine time series analysis and graph analysis: use Gephi and Apache Spark
???
All ports on all hosts, used during an experiment …
Hosts & Ports in a 5 Node Hadoop Cluster
Static network:
• 1,535 nodes
• 2,997 edges
Network clusters represent communication ports on individual hosts (bigger nodes in the center of the star) forming a Hadoop cluster.
This static view shows all potential communication endpoints; there is no activity yet.
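One way to obtain such a star-shaped static view is to create a node per host, a node per observed `host:port` endpoint, and a link from each endpoint to its host. The sketch below does that under the assumption that each event carries `src_ip`, `src_port`, `dst_ip` and `dst_port` fields; it is an illustration and is not claimed to reproduce the node and edge counts on the slide.

```python
def static_network(events):
    """Build the set of host and endpoint nodes plus endpoint-to-host links.

    Every endpoint ever seen becomes a node, regardless of how much traffic
    it carried, so the result describes potential endpoints only.
    """
    nodes = set()
    edges = set()
    for ev in events:
        for ip, port in ((ev["src_ip"], ev["src_port"]),
                         (ev["dst_ip"], ev["dst_port"])):
            endpoint = f"{ip}:{port}"
            nodes.add(ip)                # host node, the center of a star
            nodes.add(endpoint)          # port node on that host
            edges.add((endpoint, ip))    # port <=> host link
    return nodes, edges
```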
Weighted Communication Links during TeraSort
Communication Network
• 1,535 nodes
• 34,351 edges
Communication links represent real communication between ports on individual hosts in a Hadoop cluster.
This dynamic view shows all real communication endpoints and allows a topological analysis.
PageRank & Eigenvector Centrality
Topological Properties
Node sizes represent PageRank of a node based on Communication links.
Node colors still reflect the host on which the communication endpoints are active.
Node sizes represent Eigenvector Centrality based on Communication links.
Node colors still reflect the host on which the communication endpoints are active.
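For readers unfamiliar with the first of these two measures, here is a small hand-rolled PageRank by power iteration over weighted communication links. It is a sketch for illustration only, not the implementation used to produce the slides; the damping factor of 0.85 and the iteration count are conventional defaults chosen here, and edge weights are assumed to be positive.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    edges maps (source, target) -> positive weight, e.g. a packet count.
    Returns a dict node -> rank; the ranks sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), weight in edges.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank
```

An endpoint that many others talk to, such as a master service port, collects rank from all of them and therefore shows up as a large node.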
Which are the most central nodes?
NameNode
ResourceManager(internal) t.
Interesting ports:
23000, 23001
7051, 8022
Most active Server:
172.28.209.73
Time Evolution of Dynamic Communication Processes
Host centric: Host = Server
Cluster centric: Cluster = Functional Layer
Re-organization: Segregation by Components
• Communication components are distributed across servers.
• Server-centric analysis doesn’t help.
• Communication layers can be interdependent.
• Dependencies are not visible in the event data set.
• Our approach:
  (1) Reconstruct the communication structure.
  (2) Segregate the communication activity by component / subsystem.
  (3) Finally, reconstruct the functional network of interacting components.
• This allows a dependency analysis for components, and hopefully also system tuning.
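Steps (2) and (3) can be sketched as a relabeling of endpoints followed by the same grouped count as before. The port-to-component table below is a small illustrative assumption based on common Hadoop 2 default ports; a real cluster would need a table derived from its own configuration, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative assumption: a few common default ports. Anything not listed
# (for example ephemeral client ports) falls into the bucket "other".
PORT_TO_COMPONENT = {
    8020: "NameNode",
    8032: "ResourceManager",
    50010: "DataNode",
}


def component_of(port):
    """Map a port number to a component label, ignoring which host it is on."""
    return PORT_TO_COMPONENT.get(port, "other")


def component_network(events):
    """Aggregate packet events into weighted links between components."""
    weights = Counter()
    for ev in events:
        src = component_of(ev["src_port"])
        dst = component_of(ev["dst_port"])
        weights[(src, dst)] += 1
    return weights
```

Because the host is dropped from the key, traffic to the same service on different servers collapses into one link, which is the point of the component-centric view.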
!!! WARNING !!!
Absolute values can be misleading.
Component Centric View
• port <=> host links removed
• Temporal networks lead to dynamic clusters
Central vs. External Components
Impact of the Selected Layout Algorithm
Two Experiments: TeraSort & Twitter Collect
[Figure; y-axis: Number of packets]
5 Selected Channels during TeraSort
[Figure; y-axis: Number of packets. Channels: NameNode, NodeManager, YARN App Containers (three channels). Labels: Job A, Job B, Job C]
Replication factor: Job A: 3, Job B: 1, Job C: 5
5 Selected Channels during Twitter Collect
[Figure; y-axis: Number of packets. Labels: Active, Idle]
Observations
• Both experiments show fundamental differences:
  • Only one active component vs. multiple competing communication channels.
• Common observations:
  • Background activity of an idle cluster shows periodic spikes (no surprise).
  • Different fluctuation levels on different channels.
What’s next?
More Experiments & Data Collection:
• Ideal scenarios
• Realistic workloads
Helpful Visualization:
• Provide a real-time view of ongoing network activity using the Gephi streaming plugin (as shown in the Twitter Streaming demo).
Better Analysis:
• Classify the components automatically …
  • Requires studying activity time series, e.g., using neural networks or non-linear statistics.
• Understand the component structure and behavior over time …
  • Allows us to find anomalies in the component structure and behavioral patterns.
Big Thanks To
Clouderans supporting the project ...
Alexander Bartfeld
Anton Vukovic
Rafael Arana
Zoltan Kiss
Nehme Tohme