adrien raulot, shahrukh zaidi - de laat · adrien raulot, shahrukh zaidi (uva) netflow information...
TRANSCRIPT
Large-scale NetFlow Information Management
Adrien Raulot, Shahrukh Zaidi
University of Amsterdam
Supervisor: Wim Biemolt (SURFnet)
February 5, 2018
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 1 / 24
What is NetFlow?
Traffic monitoring technology originaly developed by Cisco.
Flow : “a set of IP packets passing an observation point in thenetwork during a certain time interval. All packets belonging to aparticular flow have a set of common properties.”[4]
Important differences with regular packet capture methods:
NetFlow considered to be less privacy sensitiveNetFlow requires less computational resources for analysis
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 2 / 24
What is NetFlow?
Figure 1: Schematic overview of the NetFlow export process.[2]
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 3 / 24
NetFlow Analysis
Three main application areas[3]:
Flow analysis and reporting
Threat detection
Performance monitoring
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 4 / 24
NetFlow analysis techniques
NfDump:
Figure 2: Schematic overview of the NfDump tool set.[1]
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 5 / 24
Netflow Analysis techniques
Limitations of this setup[5]:
Inefficient file-based store: NfDump typically stores NetFlow data inseparate files for every 5 minutes time frame
Very slow processing speed: each file is read line by line from thebeginning. Therefore, analysis of large amounts of NetFlow data takesa lot of time.
Limited analysis methods: as network situations are becoming moreand more complex, new analysis approaches are required that allow forNetFlow data analysis.
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 6 / 24
Research question
Which data analysis technique could be used in order to analyse thecurrent SURFnet NetFlow data in a more time-efficient manner?
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 7 / 24
What is Apache Hadoop?
Framework for large datasets processing
Distributed, local computation & storage
Hadoop Distributed File System (HDFS)
YARN (Yet Another Resource Negotiator)
Batch, interactive & real-time jobs
Designed to be scalable
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 8 / 24
What is Apache Hadoop?
Figure 3: Schematic overview of Hadoop 2.0.
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 9 / 24
What is Apache Spark?
Hadoop-related project, but not only
Powerful computing engine for Big Data processing
In-memory
Built-in modules for streaming, SQL, machine learning, etc.
Binding for Java, Scala, Python and R
Ease of use
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 10 / 24
What is Apache Parquet?
Data-store for Hadoop
Column-oriented
Fast access to data
Figure 4: Schematic overview of a row vs column-oriented database.
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 11 / 24
Choice for analysis technique (summary)
Figure 5: Apache Parquetlogo.
Figure 6: Apache Hadooplogo.
Figure 7: Apache Sparklogo.
To-Do list:
1 Store NetFlow data into Parquet files on HDFS
2 Load Parquet files using PySpark (Python API)
3 Query the data using Spark SQL
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 12 / 24
Experiments: test environment
Hadoop cluster specifications:
∼ 100 nodes
∼ 600 cores
∼ 4TB of memory
∼ 2PB of storage
Apache Hadoop 2.7.2
Apache Spark 2.1.1
NfDump server specifications:
1x Dell PowerEdge R230
Intel Xeon CPU E3-1240L v5 @2.10GHz
4 cores
16GB of RAM
∼ 200GB of SSD storage
NfDump v1.6.12
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 13 / 24
Experiments: implementation
1 Convert NetFlow binary data to CSV
nfdump -r nfcapd.201801011245 -o csv
2 Write two Spark jobs in Python:Converter: Converts CSV data to Parquet formatQuerier: Loads Parquet data & executes queries
3 Write SQL query
query = ’SELECT ts, sa, da FROM nf_data’
4 Using the Querier, execute and cache the results
5 Proceed with next operations on the cached results
print results.count()
print results.show()
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 14 / 24
Experiments: test queries
Retrieve all flows containing a specific IP address
Retrieve all flows with a byte count larger than 100MBs
List the top 10 of Telnet connections with only the SYN flag set inthe IP header ordered by the number of bits per second
List the top 10 of IP addresses receiving the largest amount of traffic
Retrieve all flows with only the SYN flag set in the IP header
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 15 / 24
Results: retrieve all flows containing a specific IP address
5min 30min 1hr 3.5hrs 7hrs0
2
4
6
8
Execu
tion t
ime in m
inute
s
0:080:33
1:05
3:33
6:42NfDump
Hadoop+Spark
5min 30min 1hr 3.5hrs 7hrsTime frame
0
1
2
3
4
5
Execu
tion t
ime in m
inute
s
3:002:37
2:583:22
3:46
Figure 8: Execution time of retrieving all flows containing a specific IP address.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 16 / 24
Results: retrieve all flows with byte count >100MB
5min 30min 1hr 3.5hrs 7hrs0
2
4
6
8
Execu
tion t
ime in m
inute
s
0:08 0:281:06
3:39
6:52NfDump
Hadoop+Spark
5min 30min 1hr 3.5hrs 7hrsTime frame
0
1
2
3
4
5
Execu
tion t
ime in m
inute
s
3:15 3:07 3:092:50 2:53
Figure 9: Execution time of retrieving all flows with byte count larger than100MB.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 17 / 24
Results: list top 10 Telnet connections with only SYN flag set ordered by bps
5min 30min 1hr 3.5hrs 7hrs0
1
2
3
4
Execu
tion t
ime in m
inute
s
0:09
0:49
2:09
NfDump
Hadoop+Spark
5min 30min 1hr 3.5hrs 7hrsTime frame
0
1
2
3
4
Execu
tion t
ime in m
inute
s
3:22
2:29
3:09 3:15 3:15
Figure 10: Execution time of retrieving the top 10 of Telnet connections with onlythe SYN flag set ordered by the number of bits per second.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 18 / 24
Results: List top 10 IPs receiving most traffic
5min 30min 1hr 3.5hrs 7hrs0
5
10
15
20
25
Execu
tion t
ime in m
inute
s
0:191:25
4:04
11:12
23:22NfDump
Hadoop+Spark
5min 30min 1hr 3.5hrs 7hrsTime frame
0
1
2
3
4
5
Execu
tion t
ime in m
inute
s
2:39 2:38 2:42
3:374:03
Figure 11: Execution time of Retrieving the top 10 IP addresses receiving thelargest amount of traffic.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 19 / 24
Results: Retrieve all flows with only SYN flag set
5min 30min 1hr 3.5hrs 7hrs0
20
40
60
80
100
Execu
tion t
ime in m
inute
s
1:025:22
11:53
41:44
89:23NfDump
Hadoop+Spark
5min 30min 1hr 3.5hrs 7hrsTime frame
0
2
4
6
Execu
tion t
ime in m
inute
s
3:14 3:28 3:21
5:06
5:52
Figure 12: Execution time of retrieving all flows with only the SYN flag set in theIP header.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 20 / 24
Discussion
Execution time of NfDump increases linearly with longer time frames.
Hadoop scales very well:
Execution time of Spark with Hadoop does not increase significantlywhen dealing with larger amounts of data.
NfDump struggles with executing more complex queries, whereas thisis no problem for Spark and Hadoop.
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24
Conclusion and future work
Combination of Hadoop and Apache Spark is a viable option foranalyzing large-scale NetFlow data.
Tuning and optimization to the Spark implementation and Hadoopcluster may lead to even better performance.
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 22 / 24
Questions?
Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 23 / 24
References
NfDump.http://nfdump.sourceforge.net/.
I. Cisco.NetFlow. Introduction to Cisco IOS NetFlow C a technical overview,2007.
R. Hofstede, P. Celeda, B. Trammell, I. Drago, R. Sadre, A. Sperotto,and A. Pras.Flow monitoring explained: From packet capture to data analysis withnetflow and ipfix.IEEE Communications Surveys & Tutorials, 16(4):2037–2064, 2014.
G. Sadasivan.Architecture for ip flow information export.Architecture, 2009.
Z. Tian.Management of large scale NetFlow data by distributed systems.Master’s thesis, NTNU, 2016.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 24 / 24