adrien raulot, shahrukh zaidi - de laat · adrien raulot, shahrukh zaidi (uva) netflow information...

Large-scale NetFlow Information Management

Adrien Raulot, Shahrukh Zaidi

University of Amsterdam

Supervisor: Wim Biemolt (SURFnet)

February 5, 2018

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 1 / 24

What is NetFlow?

Traffic monitoring technology originaly developed by Cisco.

Flow : “a set of IP packets passing an observation point in thenetwork during a certain time interval. All packets belonging to aparticular flow have a set of common properties.”[4]

Important differences with regular packet capture methods:

NetFlow considered to be less privacy sensitiveNetFlow requires less computational resources for analysis


What is NetFlow?

Figure 1: Schematic overview of the NetFlow export process.[2]


NetFlow Analysis

Three main application areas[3]:

Flow analysis and reporting

Threat detection

Performance monitoring


NetFlow analysis techniques

NfDump:

Figure 2: Schematic overview of the NfDump tool set.[1]


Netflow Analysis techniques

Limitations of this setup[5]:

Inefficient file-based store: NfDump typically stores NetFlow data inseparate files for every 5 minutes time frame

Very slow processing speed: each file is read line by line from thebeginning. Therefore, analysis of large amounts of NetFlow data takesa lot of time.

Limited analysis methods: as network situations are becoming moreand more complex, new analysis approaches are required that allow forNetFlow data analysis.


Research question

Which data analysis technique could be used in order to analyse thecurrent SURFnet NetFlow data in a more time-efficient manner?


What is Apache Hadoop?

Framework for large datasets processing

Distributed, local computation & storage

Hadoop Distributed File System (HDFS)

YARN (Yet Another Resource Negotiator)

Batch, interactive & real-time jobs

Designed to be scalable


What is Apache Hadoop?

Figure 3: Schematic overview of Hadoop 2.0.


What is Apache Spark?

Hadoop-related project, but not only

Powerful computing engine for Big Data processing

In-memory

Built-in modules for streaming, SQL, machine learning, etc.

Binding for Java, Scala, Python and R

Ease of use


What is Apache Parquet?

Data-store for Hadoop

Column-oriented

Fast access to data

Figure 4: Schematic overview of a row vs column-oriented database.


Choice for analysis technique (summary)

Figure 5: Apache Parquetlogo.

Figure 6: Apache Hadooplogo.

Figure 7: Apache Sparklogo.

To-Do list:

1 Store NetFlow data into Parquet files on HDFS

2 Load Parquet files using PySpark (Python API)

3 Query the data using Spark SQL


Experiments: test environment

Hadoop cluster specifications:

∼ 100 nodes

∼ 600 cores

∼ 4TB of memory

∼ 2PB of storage

Apache Hadoop 2.7.2

Apache Spark 2.1.1

NfDump server specifications:

1x Dell PowerEdge R230

Intel Xeon CPU E3-1240L v5 @2.10GHz

4 cores

16GB of RAM

∼ 200GB of SSD storage

NfDump v1.6.12


Experiments: implementation

1 Convert NetFlow binary data to CSV

nfdump -r nfcapd.201801011245 -o csv

2 Write two Spark jobs in Python:Converter: Converts CSV data to Parquet formatQuerier: Loads Parquet data & executes queries

3 Write SQL query

query = ’SELECT ts, sa, da FROM nf_data’

4 Using the Querier, execute and cache the results

5 Proceed with next operations on the cached results

print results.count()

print results.show()


Experiments: test queries

Retrieve all flows containing a specific IP address

Retrieve all flows with a byte count larger than 100MBs

List the top 10 of Telnet connections with only the SYN flag set inthe IP header ordered by the number of bits per second

List the top 10 of IP addresses receiving the largest amount of traffic

Retrieve all flows with only the SYN flag set in the IP header


Results: retrieve all flows containing a specific IP address

5min 30min 1hr 3.5hrs 7hrs0

2

4

6

8

Execu

tion t

ime in m

inute

s

0:080:33

1:05

3:33

6:42NfDump

Hadoop+Spark

5min 30min 1hr 3.5hrs 7hrsTime frame

0

1

2

3

4

5

Execu

tion t

ime in m

inute

s

3:002:37

2:583:22

3:46

Figure 8: Execution time of retrieving all flows containing a specific IP address.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 16 / 24

Results: retrieve all flows with byte count >100MB


2

4

6

8

Execu

tion t

ime in m

inute

s

0:08 0:281:06

3:39

6:52NfDump

Hadoop+Spark


0

1

2

3

4

5

Execu

tion t

ime in m

inute

s

3:15 3:07 3:092:50 2:53

Figure 9: Execution time of retrieving all flows with byte count larger than100MB.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 17 / 24

Results: list top 10 Telnet connections with only SYN flag set ordered by bps


1

2

3

4

Execu

tion t

ime in m

inute

s

0:09

0:49

2:09

NfDump

Hadoop+Spark


0

1

2

3

4

Execu

tion t

ime in m

inute

s

3:22

2:29

3:09 3:15 3:15

Figure 10: Execution time of retrieving the top 10 of Telnet connections with onlythe SYN flag set ordered by the number of bits per second.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 18 / 24

Results: List top 10 IPs receiving most traffic


5

10

15

20

25

Execu

tion t

ime in m

inute

s

0:191:25

4:04

11:12

23:22NfDump

Hadoop+Spark


0

1

2

3

4

5

Execu

tion t

ime in m

inute

s

2:39 2:38 2:42

3:374:03

Figure 11: Execution time of Retrieving the top 10 IP addresses receiving thelargest amount of traffic.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 19 / 24

Results: Retrieve all flows with only SYN flag set


20

40

60

80

100

Execu

tion t

ime in m

inute

s

1:025:22

11:53

41:44

89:23NfDump

Hadoop+Spark


0

2

4

6

Execu

tion t

ime in m

inute

s

3:14 3:28 3:21

5:06

5:52

Figure 12: Execution time of retrieving all flows with only the SYN flag set in theIP header.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 20 / 24

Discussion

Execution time of NfDump increases linearly with longer time frames.

Hadoop scales very well:

Execution time of Spark with Hadoop does not increase significantlywhen dealing with larger amounts of data.

NfDump struggles with executing more complex queries, whereas thisis no problem for Spark and Hadoop.


Conclusion and future work

Combination of Hadoop and Apache Spark is a viable option foranalyzing large-scale NetFlow data.

Tuning and optimization to the Spark implementation and Hadoopcluster may lead to even better performance.


Questions?


References

NfDump.http://nfdump.sourceforge.net/.

I. Cisco.NetFlow. Introduction to Cisco IOS NetFlow C a technical overview,2007.

R. Hofstede, P. Celeda, B. Trammell, I. Drago, R. Sadre, A. Sperotto,and A. Pras.Flow monitoring explained: From packet capture to data analysis withnetflow and ipfix.IEEE Communications Surveys & Tutorials, 16(4):2037–2064, 2014.

G. Sadasivan.Architecture for ip flow information export.Architecture, 2009.

Z. Tian.Management of large scale NetFlow data by distributed systems.Master’s thesis, NTNU, 2016.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 24 / 24

http://nfdump.sourceforge.net/

adrien raulot, shahrukh zaidi - de laat · adrien raulot, shahrukh zaidi (uva) netflow information...

Documents