adrien raulot, shahrukh zaidi - de laat · adrien raulot, shahrukh zaidi (uva) netflow information...

24
Large-scale NetFlow Information Management Adrien Raulot, Shahrukh Zaidi University of Amsterdam Supervisor: Wim Biemolt (SURFnet) February 5, 2018 Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 1 / 24

Upload: others

Post on 27-Feb-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Large-scale NetFlow Information Management

Adrien Raulot, Shahrukh Zaidi

University of Amsterdam

Supervisor: Wim Biemolt (SURFnet)

February 5, 2018

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 1 / 24

Page 2: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

What is NetFlow?

Traffic monitoring technology originaly developed by Cisco.

Flow : “a set of IP packets passing an observation point in thenetwork during a certain time interval. All packets belonging to aparticular flow have a set of common properties.”[4]

Important differences with regular packet capture methods:

NetFlow considered to be less privacy sensitiveNetFlow requires less computational resources for analysis

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 2 / 24

Page 3: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

What is NetFlow?

Figure 1: Schematic overview of the NetFlow export process.[2]

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 3 / 24

Page 4: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

NetFlow Analysis

Three main application areas[3]:

Flow analysis and reporting

Threat detection

Performance monitoring

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 4 / 24

Page 5: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

NetFlow analysis techniques

NfDump:

Figure 2: Schematic overview of the NfDump tool set.[1]

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 5 / 24

Page 6: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Netflow Analysis techniques

Limitations of this setup[5]:

Inefficient file-based store: NfDump typically stores NetFlow data inseparate files for every 5 minutes time frame

Very slow processing speed: each file is read line by line from thebeginning. Therefore, analysis of large amounts of NetFlow data takesa lot of time.

Limited analysis methods: as network situations are becoming moreand more complex, new analysis approaches are required that allow forNetFlow data analysis.

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 6 / 24

Page 7: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Research question

Which data analysis technique could be used in order to analyse thecurrent SURFnet NetFlow data in a more time-efficient manner?

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 7 / 24

Page 8: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

What is Apache Hadoop?

Framework for large datasets processing

Distributed, local computation & storage

Hadoop Distributed File System (HDFS)

YARN (Yet Another Resource Negotiator)

Batch, interactive & real-time jobs

Designed to be scalable

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 8 / 24

Page 9: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

What is Apache Hadoop?

Figure 3: Schematic overview of Hadoop 2.0.

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 9 / 24

Page 10: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

What is Apache Spark?

Hadoop-related project, but not only

Powerful computing engine for Big Data processing

In-memory

Built-in modules for streaming, SQL, machine learning, etc.

Binding for Java, Scala, Python and R

Ease of use

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 10 / 24

Page 11: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

What is Apache Parquet?

Data-store for Hadoop

Column-oriented

Fast access to data

Figure 4: Schematic overview of a row vs column-oriented database.

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 11 / 24

Page 12: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Choice for analysis technique (summary)

Figure 5: Apache Parquetlogo.

Figure 6: Apache Hadooplogo.

Figure 7: Apache Sparklogo.

To-Do list:

1 Store NetFlow data into Parquet files on HDFS

2 Load Parquet files using PySpark (Python API)

3 Query the data using Spark SQL

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 12 / 24

Page 13: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Experiments: test environment

Hadoop cluster specifications:

∼ 100 nodes

∼ 600 cores

∼ 4TB of memory

∼ 2PB of storage

Apache Hadoop 2.7.2

Apache Spark 2.1.1

NfDump server specifications:

1x Dell PowerEdge R230

Intel Xeon CPU E3-1240L v5 @2.10GHz

4 cores

16GB of RAM

∼ 200GB of SSD storage

NfDump v1.6.12

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 13 / 24

Page 14: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Experiments: implementation

1 Convert NetFlow binary data to CSV

nfdump -r nfcapd.201801011245 -o csv

2 Write two Spark jobs in Python:Converter: Converts CSV data to Parquet formatQuerier: Loads Parquet data & executes queries

3 Write SQL query

query = ’SELECT ts, sa, da FROM nf_data’

4 Using the Querier, execute and cache the results

5 Proceed with next operations on the cached results

print results.count()

print results.show()

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 14 / 24

Page 15: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Experiments: test queries

Retrieve all flows containing a specific IP address

Retrieve all flows with a byte count larger than 100MBs

List the top 10 of Telnet connections with only the SYN flag set inthe IP header ordered by the number of bits per second

List the top 10 of IP addresses receiving the largest amount of traffic

Retrieve all flows with only the SYN flag set in the IP header

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 15 / 24

Page 16: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Results: retrieve all flows containing a specific IP address

5min 30min 1hr 3.5hrs 7hrs0

2

4

6

8

Execu

tion t

ime in m

inute

s

0:080:33

1:05

3:33

6:42NfDump

Hadoop+Spark

5min 30min 1hr 3.5hrs 7hrsTime frame

0

1

2

3

4

5

Execu

tion t

ime in m

inute

s

3:002:37

2:583:22

3:46

Figure 8: Execution time of retrieving all flows containing a specific IP address.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 16 / 24

Page 17: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Results: retrieve all flows with byte count >100MB

5min 30min 1hr 3.5hrs 7hrs0

2

4

6

8

Execu

tion t

ime in m

inute

s

0:08 0:281:06

3:39

6:52NfDump

Hadoop+Spark

5min 30min 1hr 3.5hrs 7hrsTime frame

0

1

2

3

4

5

Execu

tion t

ime in m

inute

s

3:15 3:07 3:092:50 2:53

Figure 9: Execution time of retrieving all flows with byte count larger than100MB.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 17 / 24

Page 18: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Results: list top 10 Telnet connections with only SYN flag set ordered by bps

5min 30min 1hr 3.5hrs 7hrs0

1

2

3

4

Execu

tion t

ime in m

inute

s

0:09

0:49

2:09

NfDump

Hadoop+Spark

5min 30min 1hr 3.5hrs 7hrsTime frame

0

1

2

3

4

Execu

tion t

ime in m

inute

s

3:22

2:29

3:09 3:15 3:15

Figure 10: Execution time of retrieving the top 10 of Telnet connections with onlythe SYN flag set ordered by the number of bits per second.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 18 / 24

Page 19: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Results: List top 10 IPs receiving most traffic

5min 30min 1hr 3.5hrs 7hrs0

5

10

15

20

25

Execu

tion t

ime in m

inute

s

0:191:25

4:04

11:12

23:22NfDump

Hadoop+Spark

5min 30min 1hr 3.5hrs 7hrsTime frame

0

1

2

3

4

5

Execu

tion t

ime in m

inute

s

2:39 2:38 2:42

3:374:03

Figure 11: Execution time of Retrieving the top 10 IP addresses receiving thelargest amount of traffic.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 19 / 24

Page 20: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Results: Retrieve all flows with only SYN flag set

5min 30min 1hr 3.5hrs 7hrs0

20

40

60

80

100

Execu

tion t

ime in m

inute

s

1:025:22

11:53

41:44

89:23NfDump

Hadoop+Spark

5min 30min 1hr 3.5hrs 7hrsTime frame

0

2

4

6

Execu

tion t

ime in m

inute

s

3:14 3:28 3:21

5:06

5:52

Figure 12: Execution time of retrieving all flows with only the SYN flag set in theIP header.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 20 / 24

Page 21: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Discussion

Execution time of NfDump increases linearly with longer time frames.

Hadoop scales very well:

Execution time of Spark with Hadoop does not increase significantlywhen dealing with larger amounts of data.

NfDump struggles with executing more complex queries, whereas thisis no problem for Spark and Hadoop.

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24

Page 22: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Conclusion and future work

Combination of Hadoop and Apache Spark is a viable option foranalyzing large-scale NetFlow data.

Tuning and optimization to the Spark implementation and Hadoopcluster may lead to even better performance.

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 22 / 24

Page 23: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

Questions?

Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 23 / 24

Page 24: Adrien Raulot, Shahrukh Zaidi - de Laat · Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 21 / 24. Conclusion and future work Combination of Hadoop

References

NfDump.http://nfdump.sourceforge.net/.

I. Cisco.NetFlow. Introduction to Cisco IOS NetFlow C a technical overview,2007.

R. Hofstede, P. Celeda, B. Trammell, I. Drago, R. Sadre, A. Sperotto,and A. Pras.Flow monitoring explained: From packet capture to data analysis withnetflow and ipfix.IEEE Communications Surveys & Tutorials, 16(4):2037–2064, 2014.

G. Sadasivan.Architecture for ip flow information export.Architecture, 2009.

Z. Tian.Management of large scale NetFlow data by distributed systems.Master’s thesis, NTNU, 2016.Adrien Raulot, Shahrukh Zaidi (UvA) NetFlow Information Management February 5, 2018 24 / 24