network address translation (nat) behaviour: final report · 2014-05-26 · ip addresses, source...
TRANSCRIPT
Network Address Translation (NAT) Behaviour: Final report Prepared by: A. Nur Zincir-Heywood Yasemin Gokcen Vahid Aghaevi Faculty of Computer Science Dalhousie University Scientific Authority: Rodney Howes DRDC Centre for Security Science The scientific or technical validity of this Contract Report is entirely the responsibility of the Contractor and the contents do not necessarily have the approval or endorsement of the Department of National Defence of Canada.
Defence R&D Canada – Centre for Security Science DRDC-RDDC-2014-C May 2014
IMPORTANT INFORMATIVE STATEMENTS CSSP-2012-CD-1029 Network Address Translation (NAT) Behaviour was supported by the Canadian Safety and Security Program (CSSP) which is led by Defence Research and Development Canada’s Centre for Security Science, in partnership with Public Safety Canada. Partners in the project include Public Safety Canada and Dalhousie University. CSSP is a federally-funded program to strengthen Canada’s ability to anticipate, prevent/mitigate, prepare for, respond to, and recover from natural disasters, serious accidents, crime and terrorism through the convergence of science and technology with policy, operations and intelligence
© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2014
© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2014
Network Address Translation (NAT) Behaviour: Final report
Contract Number: 78820-12-0024 Project Leader: A. Nur Zincir-Heywood
Graduate Students: Yasemin Gokcen Vahid Aghaevi
Date:March 28th, 2013
Network Information Management and Security Group
Faculty of Computer Science Dalhousie University
6050 University Avenue
Halifax, NS B3H 1W5
2
Contents Table of Figures and Tables ............................................................................................................ 3
Acronyms ........................................................................................................................................ 4
Abstract ........................................................................................................................................... 6
1. Introduction ................................................................................................................................. 7
2. Literature Review...................................................................................................................... 11
3. NAT Overview.......................................................................................................................... 15
3.1 Translation of the Endpoint ................................................................................................. 17
3.2 Visibility of NAT Operations .............................................................................................. 24
4. Data Mining Techniques Employed ......................................................................................... 26
4.1 C4.5 ..................................................................................................................................... 26
4.2 Naive Bayes......................................................................................................................... 28
4.3 Support Vector Machine (SVM) ......................................................................................... 30
5. Methodology and Evaluation .................................................................................................... 33
5.1 Data Sets Employed ............................................................................................................ 33
5.2 Passive Fingerprinting Approach ........................................................................................ 34
5.2.1 Packet Header Based Features - Time to Live (TTL) and Arrival Time ...................... 34
5.2.2Packet Payload Based Features - Http User Agent String ............................................. 36
5.3 Data Mining Based Approach ............................................................................................. 40
5.3.1 No Payload, IP Addresses and Port Numbers– Flow Based Features .......................... 40
5.4 Ground Truth – Labeling of the Data Sets ...................................................................... 41
6. Empirical Evaluation ................................................................................................................ 44
6.1 Evaluation Results Based On The First Labeling Scheme .................................................. 45
6.2 Evaluation Results Based On The Second Labeling Scheme ............................................. 47
8. Conclusion and Future Work .................................................................................................... 50
References: .................................................................................................................................... 52
3
Table of Figures and Tables Figure 1 NAT trace collection scenario ........................................................................................ 16 Figure 2: Home-Side trace: Consider the Source IP/Port and Destination IP/Port for the “HTTP GET” and the “200 OK HTTP” messages .................................................................................... 17 Figure 3 Home-Side trace: Consider the Source IP/Port and Destination IP/port for the three-way SYN/ACK handshake ................................................................................................................... 19 Figure 4 ISP-Side trace: Consider the Source IP/Port and Destination IP/port for the “HTTP GET” and the “200 OK HTTP” messages .................................................................................... 20 Figure 5 Home-Side trace: Consider the “Time To Live” and “Checksum” fields for “HTTP GET” message .............................................................................................................................. 21 Figure 6 ISP-Side trace: Consider the “Time To Live” and “Checksum” fields for “HTTP GET” message ......................................................................................................................................... 22 Figure 7 ISP-Side trace: Consider the Source IP/Port and Destination IP/port for the three-way SYN/ACK handshake ................................................................................................................... 24 Figure 8: Construction of a classification tree .............................................................................. 28 Figure 9: An example of Naive Bayes .......................................................................................... 29 Figure 10: Maximum margin hyperplanes for a SVM trained with samples from two classes.... 32 Figure 11: Propagation Behavior of TTL between IP Header and MPLS Labels ........................ 35
Table 1 Packet Header based features employed, * Normalized by log 37 Table 2: Packet Header based features employed, * Normalized by log...................................... 39 Table 3: Packet Header based features employed, * Normalized by log...................................... 39 Table 4: Packet Header based features employed, * Normalized by log...................................... 39 Table 5: Flow Based Features Employed ..................................................................................... 43 Table 6: Results for the flow-based feature set using the passive fingerprinting approach – Labeling Scheme 1 ........................................................................................................................ 45 Table 7: Training Results for the flow-based feature set using the data mining approach – Labeling Scheme 1 ........................................................................................................................ 46 Table 8: Test Results on the unseen dataset for the flow-based feature set using the data mining approach – Labeling Scheme 1 .................................................................................................... 46 Table 9: Results for the flow-based feature set using the passive fingerprinting approach –Labeling Scheme 2 ........................................................................................................................ 48 Table 10: Training Results for the flow-based feature set using the data mining approach – Labeling Scheme 2 ........................................................................................................................ 49 Table 11: Test Results on the unseen dataset for the flow-based feature set using the data mining approach – Labeling Scheme 2 ..................................................................................................... 49
4
Acronyms ACK Acknowledge
Dal Dalhousie University
DNS Domain Name Server
DPI Deep Packet Inspection
DR Detection Rate
DSL Digital Subscriber Line
eAddr External Address
ePort External Port
FN False Negative
FP False Positive
FPR False Positive Rate
FTP File Transfer Protocol
HTTP Hyper Text Transfer Protocol
HTTPS Hyper Text Transfer Protocol Secure
iAddr Internal Address
ID3 Iterative Dichotomiser 3
IP Internet Protocol
iPort Internal Port
ISP Internet Service Provider
LAN Local Area Network
NAPT Network Address and Port Translation
NAT Network Address Translation
OS Operating System
PC Personal Computer
QoS Quality of Service
RFC 1918 A standard; Address Allocation for
Private Internets
RFC 2663
A standard; IP Network Address
Translator (NAT) Terminology and
Considerations
5
ROC Receiver Operating Characteristic
SRM Structural Risk Minimization
SVM Support Vector Machine
SYN Synchronize
TCP Transmission Control Protocol
TN True Negative
TP True Positive
TTL Time to Live
UDP User Datagram Protocol
WiFi Wireless Fidelity
6
Abstract
Network Address Translation (NAT) is the mechanism, which is used to modify a packet's IP address information while it is in transit across a network routing device. Because NAT can hide a computer’s or even a network's IP address, identifying the presence of NAT in network traffic is an important task for network management and security. The aim of this work is to identify the presence of NAT in the network traffic by utilizing different approaches and evaluate the performance of these approaches under different network environments represented by the availability of different data fields. To this end, passive fingerprinting and data mining based approaches are used and evaluated under different test conditions. In these experiments, not only packet header and flow based features are employed without using source and destination IP addresses, source and destination port numbers and payload information, but also payload information is analyzed to understand how much performance gain is reached if it is available. Last but not least; experiments are also performed to identify NAT devices in encrypted as well as non-encrypted traffic.
7
1. Introduction
Usage of Network Address Translation (NAT) devices is very common in any area where
interconnection devices such as computers, laptops and mobiles connect to the Internet. While
NAT devices are generally used in local area networks (LAN), which include small groups of
computers, they can also be used just for one computer. In home networks most Internet Service
Providers (ISP) give WiFi-enabled NAT home gateways to their users. Thus, when users can
connect their devices to the Internet, the private IP addresses are hidden on the Internet by
encapsulating private IP addresses with a public IP address. NAT gateways modify IP address
information in IP packet headers during transition. Basically, NAT allows a single device, such
as a router, to act as agent between the Internet and a private network. This means that only a
single unique IP address is required to represent an entire group of computers to anything outside
their private network.
NATs are used for many reasons such as shortage of IPv4 addresses. Since an address is
4 bytes, the total number of available addresses is 2 to the power of 32, i.e. 4,294,967,296. This
represents the number of computers that can be directly connected to the Internet. In practice, the
real limit is much smaller for several reasons. Each physical network has to have a unique
Network Number comprising some of the bits of the IP address. The rest of the bits are used as a
Host Number to uniquely identify each computer on that network. The number of unique
Network Numbers that can be assigned on the Internet is therefore much smaller than 4 billion,
and it is very unlikely that all of the possible Host Numbers in each Network Number are fully
assigned. NAT usage provides one single public IP address for a group of computers and
therefore helps to solve some of the addressing related problems.
8
To be represented with a public IP address on the Internet is more advantageous for users.
Since their private IP addresses are not seen on the Internet, it is easier for them to keep their
systems secure. For home users, personal information, such as emails, financial details such as
credit cards or cheque numbers can be stolen. For business users, it is more dangerous. There is
essential company information such as marketing strategies. If these kinds of essential
information are stolen or accessed in any way, this may cause major privacy and security
problems. For these reasons, companies can use firewall technologies to keep their systems safe.
Firewalls are placed between the user and the Internet and verify all traffic before allowing it to
pass through so no unauthorized user would be allowed to access the company's file or email
server. The problem with firewall solutions is that they are expensive and difficult to set up and
maintain for home and small business users.
In this case, NAT becomes a viable alternative. A NAT device is also placed between the
user and the Internet and it automatically protects the systems without any special set-up because
it only allows connections that are originated on the inside network. For instance, an internal
client can connect to an outside FTP server, but an outside client will not be able to connect to an
internal FTP server because it would have to originate the connection, and NAT will not allow
that. It is still possible to make some internal servers available to the outside world by opening
inbound ports, which are well known Transmission Control Protocol (TCP) ports (e.g. 21 for
FTP) to specific internal addresses, thus making services such as FTP or Web available in a
controlled way.
Moreover, it is easier to manage the large networks for network administrators if there
are NATs. That is because a NAT divides large networks into smaller ones. There are groups of
computers behind NAT devices. While there are many computers, they are represented as one IP
9
address to the outside. Therefore, if any change happens within these groups, such as adding or
removing IP addresses, this does not affect the outside network.
All these advantages promote NAT usage. However, because of these reasons, NAT
technology also becomes useful for attackers and users who want to hide their real identities.
Hence, NAT usage increases both in legitimate environments and in illegitimate environments.
Thus, it becomes important to detect the number of devices behind a NAT gateway in order to
understand the anomalies in a given systems traffic and usage. Furthermore, it is not possible to
get information such as private IP addresses, which belong to the devices behind a NAT gateway
just by visualizing the traffic. In other words, identifying the NAT machines and the devices
behind such machine becomes a very challenging problem.
A NAT gateway has at least two interfaces and it has two different IP addresses for these
two interfaces; namely internal and external IP addresses. The internal IP address is for
communicating with hosts behind the NAT, while the external one is for communicating with the
outside network. Therefore, if anyone from outside of this network wants to analyze the network
traffic, the only information he/she will see will be the traffic between the NAT and the outside
networks rather than the hosts behind the NAT on the internal network. On the internal network,
there might be many devices, which are connected to the Internet behind the NAT and these
devices might have different services installed. However, these will all be opaque when standard
techniques (such as IP address or port number analysis or deep packet inspection) are used to
visualize or analyze such traffic (data).Thus, it is necessary both to understand whether there is a
NAT gateway and to determine the number of hosts behind it to support any quality of service or
security related analysis of a network/system traffic.
10
Therefore in this research, we study and evaluate different approaches and evaluate them
on different types of data sets to understand their pros and cons. To this end, one approach we
investigated was based on the analysis of packet level traces and HTTP user agent information as
studied in [2]. Indeed, such an approach becomes useful in the presence of HTTP user agent
information. So in the cases where such information is not available or not accessible, we also
analyzed the flow level traffic using data mining techniques. We investigated all these
approaches under both encrypted and non-encrypted traffic conditions.
11
2. Literature Review Even though not many, there are some works in the literature where the focus was on the
identification of NAT gateways and the number of end users behind such gateways. Different
algorithms were proposed, but generally, researchers used passive operating system
fingerprinting by analyzing certain parameters within the TCP protocol and evaluated
performances in their experimental system with synthetic NAT data sets.
Bellovin [8] claimed that consecutive packets carry sequential IP ID fields, which are
included in the IP header and generally are used as counters. Therefore, he claimed that it was
possible to count the string values of those IP ID fields to find the number of hosts. However, he
noted that there might have been some complications such as packets with zero IP IDs and
packets using byte-swapped counters. Also, he did not take the recent version of OpenBSD and
FreeBSD into consideration, because they use pseudo-random number generator for the IP ID
fields. He proposed an algorithm and tested with a synthetic NAT data.
Miller [6] and Phaal [5] both used passive OS fingerprinting. Phaal [5] took the
advantage of IP Time to live (TTL), while Miller [6] analyzed TCP dump packets to check
certain fields in the TCP/IP header.
Beverly [7] proposed a classifier to infer the operating system passively and find the
number of hosts behind a NAT. He used TTL, Do not fragment (DF), Window size and SYN
size parameters. He also took the advantage of Bellovin's IP ID approach. According to the
results of his evaluations, Bellovin's method performed better on counting hosts behind NAT
devices.
On the other hand, Murakami et al. [3] focused on the Medium Access Control (MAC)
address of a device and proposed a NAT router, which relays the MAC address of PCs based on
FreeBSD. NAT does not have information about the data link layer because it translates IP
12
addresses in the network layer. So they used two functions; obtaining source MAC address and
overwriting an Ethernet header. They added another mechanism by using pcap both to obtain
MAC addresses and to overwrite ethernet headers. According to their evaluation process, their
MAC address relaying NAT router confirmed that a LAN could identify PCs that are behind a
NAT from outside. However, it requires the use of their specific relaying system.
Ishikava et al. [1] proposed an alternate method to identify PCs behind a NAT router with
proxy authentication on a proxy server. Their target application was WWW. They used the
“realm” attribute in the authentication header for identifying client PCs using their MAC
addresses. In this case, the “realm” attribute is shown to the user as a prompt message. Therefore
their proposed system requires Java Runtime Environment (JRE) on each client PC. They
assumed that a web browser always adds the authentication header to its request message when
authentication has succeeded.
Rui et al. [4] proposed an algorithm based on the Support Vector Machine (SVM)
learning algorithm just to detect the presence of a NAT. Their traces were limited by eight
features and activity values (activeness of a host). They labeled their traffic data as ordinary
hosts and hosts behind a NAT. Then, they applied binary classification.
Maier et al. [2] focused on detecting DSL lines that use NAT to connect to the Internet.
They first aimed to find whether there was a NAT device, and then they aimed to find the
number of users behind that NAT. Additionally, they tried to find how many of those users
connected to the Internet at the same time. Their approach is based on IP TTL and HTTP user
agent strings. They extract the operating system and the browser family and versions from HTTP
user agent strings. Indeed, this necessitates deep packet inspection (DPI) into the payload of a
packet. They analyzed the user agent strings just from typical browsers and they ignored the ones
13
which came from mobile devices and gaming consoles. They employed two approaches to
predict the minimum number of users behind a NAT. In the first approach, they counted different
<TTL, OS> combinations as distinct hosts. In the second approach, for each <TTL, OS>
combination, they counted the number of different browser versions of the same browser family
as distinct hosts. They did not consider each different browser family as a distinct host. They
found 10% of DSL lines have more than one user active at the same time, and that 20% of the
lines have multiple hosts that are active within one hour of each other.
14
In summary, in our research, we reengineered the approach in [2], which seems to be the
best technique in the literature notwithstanding its requirement for payload information, to
understand its pros and cons better. In addition, for the cases where payload information not
available or opaque (such as encrypted traffic), we use flow based attributes via a classification
system in order to study whether one can learn general enough patterns to represent the behavior
of NAT devices in a given network/system under analysis. In this second approach we employed,
we are in some way similar to the work in [4]. However, we employ flow features rather than
packet features and we aim to understand NAT behavior to identify its presence in a given traffic
trace and the number of hosts behind such a NAT device.
15
3. NAT Overview
In computer networking, network address translation (NAT) is the process of modifying
IP address information in IP packet headers while in transit across a traffic routing device. It is
common to hide an entire IP address space, usually consisting of private IP addresses, behind a
single IP address in another (usually public) address space. To avoid ambiguity in the handling
of returned packets, a one-to-many NAT must alter higher-level information such as TCP/UDP
ports in outgoing communications and must maintain a translation table so that return packets
can be correctly translated back. RFC 2663 uses the term NAPT (network address and port
translation) for this type of NAT. Since this is the most common type of NAT, it is often referred
to simply as NAT.
The majority of NATs map multiple private hosts to one publicly exposed IP address. In
a typical configuration, a local network uses one of the designated "private" IP address subnets
(RFC 1918) [24]. A router on that network has a private address in that address space. The router
is also connected to the Internet with a "public" address assigned by an ISP. As traffic passes
from the local network to the Internet, the source address in each packet is translated on the fly
from a private address to the public address. The router tracks basic data about each active
connection (particularly the destination address and port). When a reply returns to the router, it
uses the connection tracking data that it stored during the outbound phase to determine the
private address on the internal network to which to forward the packet.
All Internet packets have a source IP address and a destination IP address. Typically
packets passing from the private network to the public network will have their source addresses
modified while packets passing from the public network back to the private network will have
their destination addresses modified. To avoid ambiguity in how to translate returned packets,
16
further modifications to the packets are required. The vast bulk of Internet traffic is TCP and
UDP packets, and for these protocols the port numbers are changed so that the combination of IP
and port information on the returned packet can be unambiguously mapped to the corresponding
private address and port information. Once an internal address (iAddr:iPort) is mapped to an
external address (eAddr:ePort), any packets from iAddr:iPort will be sent through eAddr:ePort.
Any external host can send packets to iAddr:iPort by sending packets to eAddr:ePort.
For the purpose of this research, first of all, we observed the behavior of the NAT
protocol in practice. In this case, we captured packets at both the input and the output sides of an
NAT device. To this end, we sent and captured packets from a client PC (at a home network) to
the web server at our faculty, namely www.dal.ca. Within the home network, the home network
router provides a NAT service. Figure 1 shows our Wireshark trace-collection scenario. We have
collected a Wireshark trace on the client PC in our home network. We call it the Home_Side
trace. Because we are also interested in the packets being sent by the NAT router into the ISP,
we have collected a second trace file at a PC on the ISP network, as shown in Figure 1. Client-to-
server packets captured by Wireshark at this point will have undergone NAT translation. The
Wireshark trace captured on the ISP side of the home router is called the ISP-Side trace.
Figure 1 NAT trace collection scenario
17
3.1 Translation of the Endpoint
NAT usage provides that all communication that are sent to external hosts actually
contain the external IP address and port information of the NAT device instead of the internal
host(s) IPs or port numbers.
Figure 2 shows that the HTTP GET sent from the client to the faculty server (whose IP
address is 129.173.21.171) at time 0.004374.
Figure 2: Home-Side trace: Consider the Source IP/Port and Destination IP/Port for the “HTTP GET” and the “200 OK HTTP” messages
The following are the source and destination IP addresses and TCP source and destination ports
on the IP datagram carrying this HTTP GET request.
Source IP: 192.168.137.2, Source Port: 1268
Destination IP: 129.173.21.171, Destination Port: 80
18
At time 0.013089, the corresponding 200 OK HTTP message received from the faculty server.
The following are the source and destination IP addresses and TCP source and destination ports
on the IP datagram carrying this HTTP 200 OK message.
Source IP: 129.173.21.171, Source Port: 80
Destination IP: 192.168.137.2, Destination Port: 1268
Recall that before a GET command can be sent to an HTTP server, TCP must first set up
a connection using the three-way SYN/ACK handshake. Considering Figure 3, you can find the
following information for SYN and ACK messages:
SYN Time: 0.002799
SYN Source IP: 192.168.137.2, Source Port: 1268
SYN Destination IP: 129.173.21.171, Destination Port: 80
ACK Time: 0.004032
ACK Source IP: 129.173.21.171, Source Port: 80
ACK Destination IP: 192.168.137.2, Destination Port: 1268
19
Figure 3 Home-Side trace: Consider the Source IP/Port and Destination IP/port for the three-way SYN/ACK handshake
In the following, we will focus on the two HTTP messages (GET and 200 OK) and the
TCP SYN and ACK segments identified above in the ISP-Side trace captured on the ISP
network. Because these captured frames have already been forwarded through the NAT router,
we are going to show that some of the IP addresses and port numbers have been changed as a
result of the NAT translation.
Note that the time stamps in ISP-Side and Home-Side traces are not synchronized since
the packet captures at the two locations shown in Figure 1 were not started simultaneously.
(Indeed, you can find that the timestamps of a packet captured at the ISP network is actually
bigger than the timestamp of the packet captured at the client PC).
Consider Figure 4, the NAT ISP-Side trace. The HTTP GET message has been sent from
the client to the faculty server at time 0.495146. The source and the destination IP addresses and
TCP source and destination ports on the IP datagram carrying this HTTP GET are as the
following:
Source IP: 129.173.67.98, Source Port: 61947
20
Destination IP: 129.173.21.171, Destination Port: 80
As you can see, in comparison to Figure 2, the destination IP and port have not been
changed, but the source IP and port have been translated by the NAT router.
Figure 4 ISP-Side trace: Consider the Source IP/Port and Destination IP/port for the “HTTP GET” and the “200 OK HTTP” messages
Our observations show that when a computer on the private (internal) network sends a
packet to the external network, the NAT device replaces the internal IP address in the source
field of the packet header (sender's address) with the external IP address of the NAT device.
NAT may then assign the connection a port number from a pool of available ports, inserting this
port number in the source port field (much like the post office box number), and forwards the
packet to the external network. The NAT device then makes an entry in a translation table
containing the internal IP address, original source port, and the translated source port.
Subsequent packets from the same connection are translated to the same port number. The
computer receiving a packet that has undergone the NAT device translation establishes a
21
connection to the port and IP address specified in the altered packet, oblivious to the fact that the
supplied address is being translated (analogous to using a post office box number).
A packet coming from the external network is mapped to a corresponding internal IP
address and the port number from the translation table, replacing the external IP address and the
port number in the incoming packet header. The packet is then forwarded over the inside
network. Otherwise, if the destination port number of the incoming packet is not found in the
translation table, the packet is dropped or rejected because the NAT device does not know where
to send it.
Most importantly, in addition to the IP address and the Port fields, two more fields in the
IP datagram have also been changed, Time to Live (TTL) and Checksum. Consider Figure 5 and
Figure 6 to compare the TTL and Checksum fields of the HTTP GET message in the Home-Side
and the ISP-Side traces.
Figure 5 Home-Side trace: Consider the “Time To Live” and “Checksum” fields for “HTTP GET” message
22
Figure 6 ISP-Side trace: Consider the “Time To Live” and “Checksum” fields for “HTTP GET” message
The TTL value can be thought of as an upper bound on the time that an IP datagram can
exist in an Internet system. The TTL field is set by the sender of the datagram, and reduced by
every router on the route to its destination. The purpose of the TTL field is to avoid a situation in
which an undeliverable datagram keeps circulating on an Internet system, and such a system
eventually becoming swamped by such “immortals". Under IPv4, every host that passes the
datagram must reduce the TTL by one unit. In practice, the TTL field is reduced by one at every
hop (router/gateway), including NAT devices (most of the time). In this case, we have also
observed that the TTL has been decreased by one in figure 6. It should be noted here that it is
possible to configure the NAT device, as well as every other network device, to not to decrease
the TTL value.
Regarding the checksum value, the major transport layer protocols, TCP and UDP, have a
checksum that covers all the data they carry, as well as the TCP/UDP header, plus a "pseudo-
header" that contains the source and destination IP addresses of the packet carrying the
23
TCP/UDP header. For an originating NAT to pass TCP or UDP successfully, it must re-compute
the TCP/UDP header checksum based on the translated IP addresses, not the original ones, and
put that checksum into the TCP/UDP header of the first packet of the fragmented set of packets.
The receiving NAT must re-compute the IP checksum on every packet it passes to the
destination host, and also recognize and re-compute the TCP/UDP header using the retranslated
addresses. In the ISP-Side trace file, Figure 7, you can find the following information for the
three-way SYN/ACK handshake:
SYN Time: 0.493603
SYN Source IP: 129.173.67.98, Source Port: 61947
SYN Destination IP: 129.173.21.171, Destination Port: 80
ACK Time: 0.494453
ACK Source IP: 129.173.21.171, Source Port: 80
ACK Destination IP: 129.173.67.98, Destination Port: 61947
Comparing figure 7 and figure 3, you can see that the source IP and the port in the ACK message
(direction from the home side to the ISP side) and destination IP and the port in the SYN
message (direction from the ISP side to the home side) have been translated by the NAT router.
24
Figure 7 ISP-Side trace: Consider the Source IP/Port and Destination IP/port for the three-way SYN/ACK handshake
3.2 Visibility of NAT Operations
Typically the internal host is aware of the true IP address and the TCP/UDP port of the
external host. The NAT device may function as the default gateway for the internal host.
However, the external host is only aware of the public IP address for the NAT device and the
particular port being used to communicate on behalf of a specific internal host.
As discussed above, NAT only translates the IP addresses and ports of its internal hosts,
possibly decreasing the TTL value by one, and re-computing the checksum. This results in
translating the IP addresses and ports of its internal hosts to hide the true endpoint of an internal
host on a private network. Because the internal addresses are all disguised behind one publicly
accessible address, it is impossible for the external hosts to differentiate between the traffic
originated from a network behind a NAT and one that did not. As a result, these networks are
25
ideal for attackers to hide their identities. If an attacker hides his/her identity behind a NAT
device, it is very difficult to find the exact attacker node.
Thus, a mechanism of identifying the NAT traffic is needed, as the attackers behind the
NAT devices can easily violate the network security. This is the main purpose of this study.
However achieving this purpose is not possible by using the typical network traffic analysis
techniques, because NAT only translates the IP addresses and ports of its internal hosts, possibly
decreasing the TTL value, and re-computing the checksum. In short, we use a combination of
techniques including machine learning (data mining) algorithms to achieve this goal.
26
4. Data Mining Techniques Employed
In this research, we have employed a data mining (machine learning) based approach to
find whether a NAT box exists in a network or not. To this end, we have employed a decision
tree system, e.g. C4.5, a function system, namely Support Vector Machine (SVM), and a
probabilistic system, namely Naive Bayes. The following will summarize all these techniques.
4.1 C4.5
C4.5 is a decision tree based classification algorithm developed by Ross Quinlan that is
an extension of the basic ID3 algorithm [9]. C4.5 is designed to address issues that are not dealt
with in ID3 such as choosing the appropriate attribute (based on information gain), trying to
reduce error pruning, and handling varieties of attributes types (continuous, number, string).
A decision tree is a hierarchical data structure for implementing a divide-and-conquer
strategy. C4.5 is an efficient non-parametric method that can be used to support both
classification and regression. In non-parametric models, C4.5 constructs decision trees from a set
of training data applying the concept of information entropy. The training data is a set, S, such
that each input of the set is an instance of already classified samples. Each sample (record) in the
set is represented as a vector and each input in the vector is represented as an attribute (feature)
of the sample. The training data is added to a vector where each input in the vector represents a
class that each sample belongs to. C4.5 can split the data into smaller subsets using the fact that
each attribute of the data can be used to make a decision. Therefore, the attribute with highest
information gain is used to make the decision of the split. As a result, the input space is divided
into local regions defined by a distance metric. In a decision tree, the local region is identified in
a sequence of recursive splits in small number of steps. A decision tree is composed of internal
27
decision nodes and terminal leaves. Each node, m, implements a test function fm(x) with discrete
outcomes labeling the branches. This process starts at the root and is repeated until a leaf node is
hit. The value of a leaf node constitutes the output. In the case of a decision tree for
classification, the goodness of a split is quantified by an impurity measure. A split is pure if, for
all branches, for all instances, choosing a branch belongs to the same class after the split. One
possible function to measure impurity is entropy, Eq. (1) [10].
= log (1)
If the split is not pure, then the instances should be split to decrease impurity, and there
are multiple possible attributes on which a split can be done. Indeed, this is locally optimal;
hence, there is no guarantee of finding the smallest decision tree. In this case, the total impurity
after the split can be measured by Eq. (2) [10]. In other words, when a tree is constructed, at each
step the split that results in the largest decrease in impurity is chosen. This is the difference
between the impurity of data reaching node m, Eq. (1), and the total entropy of data reaching its
branches after the split, Eq. (2). Figure 8 presents the construction of a classification tree. A more
detailed explanation of C4.5 algorithm can be found in [10].
= (2)
28
Figure 8: Construction of a classification tree
4.2 Naive Bayes
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem (from Bayesian statistics) with strong (naive) independence assumptions. In simple
terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a
class is unrelated to the presence (or absence) of any other feature. Depending on the precise
nature of the probability model, Naive Bayes classifiers can be trained efficiently in a supervised
learning approach. In many practical applications, parameter estimation for Naive Bayes models
uses the method of maximum likelihood [11]. A simple Naive Bayes probabilistic model can be
expressed as Eq. (3) in the following:
( | , , … , ) = ( ) ( | ), (3)
29
where ( | , , … , ) is the probabilistic model over dependent class variable C with a small
number of outcomes or classes, conditional on several feature variables F1 through Fn; Z is a
scaling factor dependent only on , , … , , i.e., a constant if the value of the feature variables
are known. A Naive Bayes classifier combines the probabilistic model with a decision rule that
aims to maximize a posterior, thus the classifier can be defined using Eq. (4) as follows:
( , , … . , ) = ( = ) ( = | = ) (4)
An advantage of the naive Bayes classifier is that it only requires a small amount of training data
to estimate the parameters (means and variances of the variables) necessary for classification.
Because independent variables are assumed, only the variances of the variables for each class
need to be determined and not the entire covariance matrix.
Figure 9: An example of Naive Bayes
Figure 9 shows an example of Naive Bayes. Naive Bayes is the simplest form of
Bayesian network. All attributes are independent given the value of class variable. This is called
conditional independence. The conditional independence assumption is not often true in the real
world problems.
The authors of [12] aimed to show that the Naive Bayes classifier might have been
successful to choose who would reply to the mailing for the 1998 KDD Data cup. They
evaluated their tests and explained time and space complexities of Naive Bayes by drawing
graphs. Time complexity for learning a Naive Bayes classifier is O(Np), where N is the number
30
of training examples and p is the number of features. Space complexity for Naive Bayes
algorithm is O(pqr), where p is the number of features, q is values for each feature, and r is
alternative values for the class.
4.3 Support Vector Machine (SVM)
The original SVM algorithm was invented by Vladimir N. Vapnik and the current
standard incarnation (soft margin) was proposed by Vapnik and Corinna Cortes in 1995 [15].
Classifying data is a common task in machine learning. Suppose some given data points each
belong to one of two classes, and the goal is to decide which class a new data point will be in. In
the case of support vector machines, a data point is viewed as a p-dimensional vector (a list
of p numbers), and we want to know whether we can separate such points with a (p 1)
dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might
classify the data. One reasonable choice as the best hyperplane is the one that represents the
largest separation, or margin, between the two classes. So we choose the hyperplane so that the
distance from it to the nearest data point on each side is maximized. If such a hyperplane exists,
it is known as the maximum-margin hyperplane and the linear classifier it defines is known as
a maximum margin classifier; or equivalently, the perceptron of optimal stability [13]
We consider data points of the form as follows:
{( , ), ( , ), ( , ), ( , ), … , ( , )}, (5)
Where = 1/-1, a constant denoting the class to which that point belongs and n is the
number of samples. Each is p-dimensional real vector. The scaling is important to guard
against variable (attribute) with larger variance. We can view this training data, by means of
dividing hyperplane, which takes
31
w.x + b = 0, (6)
where b is scalar and w is p=dimensional vector. The vector w points perpendicular to the
separating hyperplane. Adding the offset parameter b allows to increase the margin. Absent of b,
the hyperplane is forced to pass through the origin, restricting the solution. As we are interesting
in the maximum margin, we are interested SVM and the parallel hyperplanes. Parallel
hyperplanes can be described by the Eq. (7) in [13]:
w.x + b = 1
w.x + b = -1 (7)
A special property of SVM is that SVM simultaneously minimizes the empirical
classification error and maximizes the geometric margin. So SVM can be considered as a
Maximum Margin Classifier. SVM is based on the Structural Risk Minimization (SRM). SVM
maps an input vector to a higher dimensional space where a maximally separating hyperplane is
constructed. Two parallel hyperplanes are constructed on each side of the hyperplane that
separate the data. The separating hyperplane is the hyperplane that maximize the distance
between the two parallel hyperplanes. An assumption is made that the larger the margin or
distance between these parallel hyperplanes the better the generalization error of the classifier
will be [14].
32
Figure 10: Maximum margin hyperplanes for a SVM trained with samples from two classes
33
5. Methodology and Evaluation
As discussed earlier, in this work, we considered and evaluated two different approaches,
namely a data mining approach and a passive operating system fingerprinting by analyzing
specific parameters within the TCP protocol. For the data mining approach, we employed the
classification models, C4.5, Naïve Bayes and SVM learning techniques introduced in section 4.
As for the passive fingerprinting approach, we re-engineer and employ the algorithm introduced
by Maier et al. [2]. For both approaches we used exactly the same data sets including both the
encrypted and the non-encrypted traffic. In doing so, we aim to understand the behavior of NAT
both with and without the payload information. Moreover, for the first approach, we analyzed the
used of both the packet only and the flow only features to detect the presence of NAT devices.
The following describes the data sets and experiments performed in this work.
5.1 Data Sets Employed
In this research, we employed traffic data sets from our faculty’s web server in the form
of TCP dump files, which did not have any payload, as well as the corresponding web server
access log files, which have the payload. The data was collected over a week in November 2012
and data sets have both encrypted and non-encrypted traffic.
In total, there are 533647 flows in our dataset. All these flows were matched with the web
access log data files which are encrypted (HTTPS) and non-encrypted (HTTP). We divided
(detailed in section 5.4) these flows into two categories: (i) NAT flows; and (ii) OTHER flows.
However, some flows did not have valid user agent strings. This means that some user agent
strings did not include OS information and some of them did not include browser related
information. We consider the flows that have these types of user agents in the OTHER category.
34
In the whole dataset, 95 different OSs and 105 different browser families with their versions
exist.
Furthermore we know that in our own faculty we have a NAT device for some of the
labs. Indeed, computers from these labs (behind the NAT device) access the web server (where
we collected the data sets). Thus, the choice of these data sets enables us to derive some ground
truth information about the presence of NAT devices in the data sets, too.
5.2 Passive Fingerprinting Approach
In this case, we first introduced the features used in the passive fingerprinting approach
that we adopted from Maier et al. [2] and re-engineered to use in our work. In applying this
approach certain features are used to identify a NAT device. Some of these features require the
payload to be known otherwise they cannot be extracted from the data set. Other features do not
have that requirement. We detail these features below.
5.2.1 Packet Header Based Features - Time to Live (TTL) and Arrival Time
It is known that networking stacks of operating systems use well-defined initial IP TTL
values (ttlinit) in outgoing packets. For instance, Windows uses 128, MacOS uses 64 and Debian
based systems use 64, too. The TTL field of the IP header is defined to be a timer limiting the
lifetime of a datagram. It is an 8-bit field and the units are in seconds. Each router (or other
modules) that handles a packet must decrement the TTL by at least one, even if the elapsed time
was much less than a second. Since this is very often the case, in effect, the TTL values serves as
a hop count limit on how far a datagram can propagate through the Internet as it is shown in
Figure 11 [18].When a router forwards a packet, it must reduce the TTL by at least one. If it
holds a packet for more than one second, it may decrement the TTL by one for each second.
35
Therefore, it is expected that if there is a NAT box routing in the network, it will decrement the
TTL values for each packet that passes through them. However, it is possible for a NAT box not
to decrement TTL values. Moreover, users could reconfigure their systems to use a different
TTL. In this case, we cannot detect accurately whether there is a NAT box or not by solely
relying on TTL counts.
However, assuming that TTL values are not modified or hidden, these TTL values in the
packets enable us to infer the presence of NAT. If the TTL is ttlinit-1, this means that the sending
host is directly connected to the Internet, so the monitoring point is one hop away from the host.
If the TTL is ttlinit-2 then there is a routing device such as a NAT gateway.
A NAT gateway can be a dedicated gateway such as a home router or it can be a regular
desktop or notebook. A dedicated NAT gateway will often directly interact with the Internet
services, e.g., by serving as DNS resolver for the local network or for synchronizing its time with
NTP servers. Moreover they generally do not use web (HTTP). It should be noted here that in
our datasets we cannot see any DNS records originated by the known NAT devices.
Figure 11: Propagation Behavior of TTL between IP Header and MPLS Labels
36
In addition to the TTL feature, arrival time feature is also used. The "Arrival Time"
feature reflects the timestamp recorded by the station that captures the traffic when the packet
arrives. The accuracy of this field is only as accurate as the time on the receiving station. Packet
captures from Windows systems are only represented with accuracy in seconds. In this work, we
use arrival times to match packets with access log data and flow based data. We will explain this
process in more detail in section 5.3.
In short, TTL and arrival time features embody the two features of the passive
fingerprinting approach that do not require any payload information. In other words, no deep
packet inspection is necessary to extract these features from the traffic.
5.2.2Packet Payload Based Features - Http User Agent String
The user agent string identifies the browser that the user uses to access the web. When a
user visits a webpage, his/her browser sends the user-agent string to the server hosting the site
that is visited. This string indicates, which browser the user is using, its version number, and
details about the user’s system, such as the operating system and its version.
We parsed the HTTP log files (access log files of the web server) and analyzed the user
agent strings. We extracted the OS and browser information from these strings to estimate a
lower bound for the number of hosts behind a NAT gateway. Maier et al. [2] limited their
analysis to user agent strings from typical browsers such as Firefox, Internet Explorer, Safari and
Opera.
However, we did not limit ourselves with the typical browsers, because in our data sets,
we also observed many user agent strings from Android based devices, iPhones and iPads.
Therefore, we took them into consideration in our study, too. We will discuss this in more detail
in the following sub-sections.
37
5.2.2.1User Agent String - OS Only
As we mentioned in section 5.2, it is possible to detect whether there is a NAT gateway
or not by analyzing the TTL values (assuming TTL values are not modified). For instance, if the
TTL value of a packet is 125, which belongs to a Windows system with an initial TTL of 128,
we can say that this computer is behind a routing device such as a NAT gateway.
Furthermore, in this case, in order to predict (calculate) the number of users behind a
NAT, we need to utilize the OS information, which belongs to the hosts. Thus, we can use a
heuristic as: Different <TTL, OS>combinations represent distinct hosts. Then, we can calculate
the number of different combinations and use this number to predict the number of users behind
a NAT. An example for this is shown in Table 1. In this example, this approach predicts a NAT
device’s presence with three users behind it. However, there might be many users using the same
combination, too. In this approach, they will be counted as one.
Table 1 Packet Header based features employed, * Normalized by log
From Packet Header From HTTP User Agent
TTL Proto OS Family Version
54 80/HTTP Intel Mac OS X 10.5 Firefox 3.05
54 80/HTTP Windows NT 5.1 Internet Explorer 6.0
54 80/HTTP Windows NT 5.1 Firefox 2.0
54 80/HTTP Windows NT 5.1 Firefox 3.0.3
54 80/HTTP Windows NT 6.0 Internet Explorer 7.0
54 80/HTTP Windows NT 6.0 Firefox 3.0.5
5.2.2.2 User Agent String - OS and Browser Version
In this case, in addition to the OS information, we can also extract and count the number
of different browser versions to predict the number of users behind a NAT device. This time, we
38
use the following heuristic: it is assumed that if the OSs is the same and the browsers families are
the same but the browser versions are different, then we can count these as two distinct hosts.
The rationale for this is that it is not possible to install different versions of a browser family on
the same computer. For example, in Table 1, TTL values are the same for each packet, while
there are different OSs such as Windows NT 5.1,Windows NT 6.0 and Intel Mac OS X 10.5. The
packets with Windows NT 5.1 OSs employ browsers from the same family, Firefox. Their TTLs,
OSs and browser families are the same. According to the previous approach, we could count
these two packets as belonging to one host. However, if we take into consideration the version of
the browsers, then the conclusion could be different. In this example, they use different Firefox
versions, so that means using our heuristic above, we can conclude that these packets belong to
two distinct hosts.
In summary in table 1, if we take all combinations into consideration there are three
Windows NT 5.1 OSs with the same TTL value but two different browser families (types) and
different versions. Two Windows NT 6.0 with the same TTL value have different browser
families. This can mean two different users or one user with two different browsers. Finally, the
packet with Intel Mac OS X 10.5 OS might be different host (user) or might be the same host
(user) with dual OSs. In this case, we assume that two different versions of Windows OSs (NT
5.1 and NT 6.0) would not be installed on the same machine and the two different versions of the
same browser would not be installed on the same machine either. Therefore, we count three hosts
by analyzing these packets using our heuristics presented above. However, if TTL values were
different, then we could say that those packets belong to different hosts. For instance, in Table 2
there are three hosts and in table 3, we count two distinct hosts.
39
Table 2: Packet Header based features employed, * Normalized by log
From Packet Header From HTTP User Agent
TTL Proto OS Family Version
56 80/HTTP Intel Mac OS X 10_8_2 Safari 6.0.2
119 80/HTTP Windows NT 6.2 Google Chrome 22.0
120 80/HTTP Windows NT 6.2 Google Chrome 22.0
Table 3: Packet Header based features employed, * Normalized by log
From Packet Header From HTTP User Agent
TTL Proto OS Family Version
56 80/HTTP Intel Mac OS X 10_7_5 Safari 6.0.1
119 80/HTTP Windows NT 6.1 Firefox 11.0
119 80/HTTP Windows NT 6.1 Internet
Explorer 9.0
Table 4: Packet Header based features employed, * Normalized by log
From Packet Header From HTTP User Agent
TTL Proto OS Family Version
48 80/HTTP Android 2.2.x Froyo Safari 6.0.1
48 80/HTTP Android 2.0/1 Eclair Firefox 11.0
48 80/HTTP Windows NT 6.1 Internet
Explorer 7.0
In this work, we also use the packets that belong to the Android and other mobile devices,
while mobiles were not taken into consideration in previous works. In our data sets, we observed
lots of records, which belonged to the mobile devices (in our network traces). There are many
numbers of Android devices, iPhone devices and iPad devices. When we analyze their user agent
40
strings, we can also see the device models; such as Nessus One and Samsung SGH-1896.As we
show in Table 4, in this case, we count three different mobile hosts (users).
5.3 Data Mining Based Approach
In this case, we will introduce the features applied in the data mining based approach. To this
end, we have introduced the learning techniques that will be studied in section 4. In this
approach, we only employ features (attributes) that are based on flow statistics. However, we do
not use payload information, IP addresses and/or port numbers as features to this approach. The
reasons behind this are as the following: One may not have access to payload information or the
payload may be encrypted and therefore opaque. Moreover, IP addresses can be spoofed and
ports numbers can be dynamically assigned. In short, such features can be biased. In some ways,
one can say that NATs and proxies are already doing this for free. Thus, our aim here is to
generate fingerprints (in other words signatures) automatically without using any biased features.
We detail the features of this approach below.
5.3.1 No Payload, IP Addresses and Port Numbers– Flow Based Features
As a different approach, we converted our packet-based dataset to a flow-based dataset.
To this end, NetMate [16] was employed to generate flows and compute features as shown in
Table 5 below. Flows are bidirectional and the first packet seen by the tool determines the
forward direction. We consider only UDP and TCP flows. UDP flows are terminated by a flow
timeout, whereas TCP flows are terminated upon proper connection teardown or by a flow
timeout, whichever occurs first. The flow timeout value employed in this work is 600 seconds as
recommended by the IETF [17].
41
5.4 Ground Truth – Labeling of the Data Sets
For evaluating the performance of each approach employed, we need to know actually
how many NAT devices exists in our data sets so that we could measure the detection rate for
each approach. We also need to know this information to be able to prepare training data sets for
the data mining approach. In this research, we do not know every NAT device (from the Internet)
that accesses the web server where we captured traffic. We only know for sure the ones (NAT
devices) that are in our own faculty and access the web server. So we decided to label the data
we captured using two different methods in order to see their effect on the performance: (i)
Assuming the only NAT devices are the ones we know; and (ii) Assuming the number of NATs
is more than the ones in (i) so predicting the number using the heuristics presented in the passive
fingerprinting approach. These are detailed below.
5.3.2.1 Labeling Scheme-1 – NAT Devices of the Faculty Only
As discussed earlier, we use datasets, which were captured on our faculty’s web server
over a week in November 2012. In this case, NetMate as an open source tool is used to convert
the packet-based data set into a flow-based data set.
As our first ground truth, the IP Address of the faculty’s NAT device is 129.173.13.94.
Therefore, the flows and packets with this IP address were all labeled as NAT. The remainder
was labeled as OTHER (meaning non-NAT). Hereafter, we will refer to those flows as NAT
flows and OTHER flows. As a result of this labeling process, there are 12168 NAT flows among
all the flows (out of a total 533647 flows) in the data set. It means that about 2.3% of our data set
is NAT traffic based on this labeling scheme. Please note that for the data mining approach
where we evaluated different learning techniques (C4.5, Naïve Bayes and SVM) to automatically
42
generate fingerprints, we removed the IP addresses and the port numbers (flows do not have
payloads) from the flows.
5.3.2.2 Labeling Scheme-2 – Potential NATs and Encrypted / Unencrypted Traffic
In this case, even though we do not have any ground truth information regarding NAT
devices other than the ones in our faculty, by using the fingerprints introduced in the first
approach discussed earlier, we can potentially label more traffic as NAT, if they match to anyone
of those fingerprints.
Among our data sets we have not only TCP dump traffic collected on the web server but
its corresponding web server access log files, too. To this end, we have two log files :(i) HTTP
web access logs;(ii) HTTPS web access logs. The first one represents the non-encrypted and the
second one represents the encrypted traffic.
However, to be able to use this information, first we need to match the traffic logs (packet
or flow) to the records in the access log files. Then we can label the flows without payload
information accurately in terms of encrypted NAT or non-encrypted NAT traffic. To achieve this
we checked the “arrival time” feature. Since we are dealing with the flows, we did not have
arrival time feature for each flow, but we have it in packet-based access log files. Therefore we
had to add time attribute to each flow in NetMate. After doing that modification, the time
attributes of the flows were compared to the time attributes of the packets in HTTP and HTTPS
files.
As a result, we have 517 Encrypted-NAT flows among 12168 NAT flows. We
can say Unencrypted-NAT for the rest 11651 NAT flows. Therefore, there are 533647 flows;
517 of them are Encrypted-NAT flows, 11651 of them are Unencrypted-NAT flows, 521479 of
them non-NAT flows, which are labeled as OTHER.
43
Table 5: Flow Based Features Employed
44
6. Empirical Evaluation
In traffic classification, two metrics are used in order to quantify the performance of the
classifier: Detection Rate (DR) and False Positive Rate (FP). We adopted these metrics to
evaluate the performance of the different approaches. Hereafter, DR reflects the number of NAT
flows correctly classified. It is calculated using DR = TP / (TP+FN); whereas the FP rate reflects
the number of non-NAT flows, referred as OTHER flows, incorrectly classified as NAT using
FPR = FP/ (FP+TN). Naturally, a high DR rate and a low FP rate are the desirable outcomes. In
addition to these rates, False Negative (FN) implies that NAT traffic is classified as OTHER. We
also measured the time it takes to train/test a data mining algorithm as well as its complexity.
For the complexity analysis, in our work, we consider the complexity of C4.5 algorithm
as the number of all rules in the tree that is generated by the algorithm at the end of the training
phase. The theoretical time complexity for learning a N N is
the number of training examples and p is the number of features. For linear SVMs, at the training
phase, one must estimate the vector w and bias b by solving a quadratic problem and at the test
phase, prediction is linear in the number of features and constant in the size of the training data.
For kernel SVMs, at the training phase, one must select the support vectors and at the test phase,
complexity is linear on the number of the support vectors (which can be lower bounded by
training set size * training set error rate) and linear on the number of features (since most kernels
only compute a dot product; this will vary for graph kernels, string kernels, etc.): In our work, we
considered SVM’s complexity as the number of kernel evaluations. Training time affects kernel
cache size [22]. As for the time analysis, during the training phase, time value shows the
time taken to run the algorithm on the training set. During testing phase, time value shows the
time taken to test the learned model on the given test data set.
45
6.1 Evaluation Results Based On The First Labeling Scheme
When labeling scheme-1 is used, there are 12168 NAT flows (just the ones we know for
sure) among the 533647 (total number of) flows in our data sets. This is approximately 2% of our
data set. On this data set, we evaluated both the passive fingerprinting approach and the data
mining approach (automatic fingerprinting). For the passive fingerprinting approach, our results
are shown in Table 6.
Table 6: Results for the flow-based feature set using the passive fingerprinting approach – Labeling Scheme 1
DR FPR
TTL NAT 0.939 0.19
OTHER 0.395 0.608
TTL + OS NAT 0.939 0.060
OTHER 0.597 0.4023
TTL + OS + Browser Version
NAT 0.947 0.052
OTHER 0.378 0.621
We also did trace route and IP lookup for the IP addresses that we labeled as NAT
according to this approach. Thus, we can see where those IP addresses belong originally.
For the data mining approach, we generated a training dataset by randomly sampling
(uniform probability) flows from the two classes: NAT and OTHER. Our first training dataset
consists of balanced number of flows from both classes. In other words, the number of NAT
flows (instances) is the same as the number of OTHER flows (instances). Then the test is
employed to the flows which are not used in the training phase. In this case, 7200 flows were
chosen randomly for the training data set, where each class had 3600 flows. The rest of the data
46
set was used for testing. Results of these training and testing experiments are shown in tables 7
and 8, respectively.
Table 7: Training Results for the flow-based feature set using the data mining approach – Labeling Scheme 1
DR FPR Complexity Time (sec)
C4.5 NAT 0.990 0.015 122 0.01
OTHER 0.985 0.010
SVM NAT 0.912 0.079 5074542 0.01
OTHER 0.921 0.088
NAIVE
BAYES
NAT 0.833 0.136 O(7200*41) 0.06
OTHER 0.864 0.167
Table 8: Test Results on the unseen dataset for the flow-based feature set using the data mining approach – Labeling Scheme 1
DR FPR Complexity Time (sec)
C4.5 NAT 0.910 0.001 864 4.63
OTHER 0.999 0.090
SVM NAT 0.000 (crashed) 0.000 (crashed) -1869722033
(crashed) 998.4 (crashed) OTHER 1.000 (crashed) 1.000 (crashed)
NAIVE BAYES
NAT 0.826 0.146 O(526447*41) 8.05
OTHER 0.854 0.174
Table 7 shows that the C4.5 based classifier gives the best performance among all the
machine learning algorithms evaluated on this data set. In this case, the C4.5 classifier correctly
identifies 99% of the flows that are originating from “behind a NAT device” with only around
2% false alarm rate. When this trained model of the C4.5 classifier (trained on just 7200 flows) is
tested on the whole data set (approximately half a million flows) the detection rate drops to 91%
bit with almost no false alarm rates, Table 8. Given that even the passive fingerprinting
47
approach, which employs payload information reaches at most to 94% as detection rate with XX
false alarms, the data mining approach employed using no payload information via C4.5
classifier is very promising to identify whether the traffic is originating behind a NAT device or
not on our data sets.
Moreover, decision tree algorithms such as C4.5 are very successful to find the most
informative features (attributes) among the features given to them during the training phase. The
first node in the resulting decision tree shows which feature has the most predictive power. Even
better, WEKA's Explorer utility offers purpose build options to find the most useful attributes in
a dataset. These options can be found under the Select Attributes tab. In our work, we use
statistical flow features, which are shown in Table 5 for all the data mining algorithms. Among
these, C4.5 selected the following as the most predictive top 5 features: min_fpktl, min_fiat,
mean_fpktl, max_fpktl and duration. From these features, it seems like the main predictions can
be made based on the forward packet length’s mean, minimum and maximum values as well as
the duration of the flow.
6.2 Evaluation Results Based On The Second Labeling Scheme
When labeling scheme-2 is used, there are 75297 NAT flows (the ones we know for sure
plus the potential ones) among the 533647 (total number of) flows in our data sets. This is
approximately 14% of our data set. We can label them as encrypted NATs and non-encrypted
NATs. After labeling, there are 74677 non-encrypted NAT flows while there are 620 encrypted
NAT flows among 75297 NAT flows in total. From this data set, we have formed a new training
set for the data mining approaches. In this case, our training data set consists of 400 encrypted
NAT flows, 400 non-encrypted NAT flows and 400 OTHER flows – all randomly chosen.
48
Again, we evaluated both the passive fingerprinting approach and the data mining approach on
the unseen flows. For the Passive fingerprinting approach, our results are shown in Table 9.
Table 9: Results for the flow-based feature set using the passive fingerprinting approach –Labeling Scheme 2
DR FPR
TTL Encrypted-NAT 1.0 0.9509
Unencrypted-NAT 0.616 0.383
OTHER 0.11 0.883
TTL + OS Encrypted-NAT 1.0 0.9536
Unencrypted-NAT 0.68 0.311
OTHER 0.115 0.884
TTL + OS + Browser Version Encrypted-NAT 1.0 0.9505
Unencrypted-NAT 0.57 0.424
OTHER 0.104 0.895
These results show that passive fingerprinting approach cannot perform well when the
traffic is encrypted. It has very high, 95%, false alarm rates. On other hand, it has around 57% -
68% detection rate for non-encrypted NAT flows depending on the fingerprints used. As for the
data mining approach under this labeling scheme, results of the training and testing are given in
Tables 10 and 11, respectively.
49
Table 10: Training Results for the flow-based feature set using the data mining approach – Labeling Scheme 2
DR FPR Complexity Time (sec)
C4.5 Encrypted-NAT 0.988 0.079
93 0.03 Unencrypted-NAT 0.803 0.043
OTHER 0.898 0.035
SVM Encrypted-NAT 0.985 0.244
83789 0.01 Unencrypted-NAT 0.528 0.160
OTHER 0.498 0.091
NAIVE BAYES
Encrypted-NAT 0.878 0.171
O(41*1200) 0.01 Unencrypted-NAT 0.693 0.205
OTHER 0.698 0.151
Table 11: Test Results on the unseen dataset for the flow-based feature set using the data mining approach – Labeling Scheme 2
DR FPR Complexity Time (sec)
C4.5 Encrypted-NAT 0.677 0.015
326 4.62 Unencrypted-NAT 0.793 0.197
OTHER 0.791 0.170
SVM Encrypted-NAT 0.000 0.000
238687 4.73 Unencrypted-NAT 0.777 0.417
OTHER 0.583 0.223
NAIVE BAYES
Encrypted-NAT 0.841 0.109
O(41*4620) 7.04 Unencrypted-NAT 0.686 0.325
OTHER 0.580 0.115
According to the test results, C4.5 classifier seems to have the best performance to detect
both encrypted and non-encrypted NAT flows during training experiments. However, even
though its performance is better than the passive fingerprinting approach during test, Naïve
50
Bayes based classifier achieves much better detection rate (84% vs. 68%) albeit with false alarm
rate at 11%. In our opinion this requires further experimentation and analysis. One of the reasons
behind this might be the nature of the training data (very different than the test data) or it might
be that a different set of features might be more indicative of the behavior for encrypted NAT
flows.
When we analyzed the solution of the C4.5 classifier after the training for this set of
experiments, we find that it uses min_fpktl, bpsh_cnt, mean_biat, mean_fiat, total_fhlen,
min_biat features as the most predictive features. Some of these overlap with the experiments
that we conducted under the first labeling scheme, however 4 of the 6 features are different than
the ones we’ve seen before, hence the need for a potentially new set of features might be the way
to improve the detection rate under this labeling scheme.
8. Conclusion and Future Work
In this work, we have evaluated two different approaches, namely (i) passive
fingerprinting and (ii) data mining (automatic fingerprinting), to identify and analyze NAT
behavior. For the passive fingerprinting approach, we have used three different heuristics using
information extracted from packet headers, specifically TTL and arrival time attributes, as well
as payload, specifically Web user string agent information. Indeed, payload information requires
the presence of web traffic for this approach to be applied. On the other hand, data mining
approach consists of evaluation of three machine learning algorithms, namely C4.5, SVM and
Naive Bayes. Since our aim is identifying the traffic that is originating from behind a NAT
device, we have employed and evaluated all techniques on the same data sets to measure their
performances. To this end, we have used two different labeling schemes, where the distinction
51
was drawn between flows we labeled as NAT traffic and OTHER, and between ones were
encrypted and which ones were not.
All in all, data mining approach using the C4.5 decision tree classifier seem to be the
most reliable performer. Using this approach we can achieve high detection rates with low false
alarms even on encrypted traffic that is originating from a network behind a NAT device.
Moreover, we can achieve this without using IP addresses, port numbers or payload information.
This is not the case for the passive fingerprinting approach, where the complexity is much less,
but certain type of payload information is required to apply it. Furthermore, the solution trees of
the trained models on our data sets can be used as the automatically generated “fingerprints” to
identify NAT devices.
As the next steps, we want to experiment with different set of features for the data mining
model to see if we can decrease the false alarm rate of encrypted NAT traffic. We also wish to
conduct tests on different data sets for generalization purposes. Last but not the least, once the
best model is chosen after these next set of experiments, we want to see if we can access a given
machine identified by this method behind a NAT device. Another interesting direction would be
to compare the traffic behind a NAT to the traffic behind a proxy device. Even though such two
devices have similar purposes, they do not work exactly the same way so the effects of their
different behavior would worth comparing the features and fingerprints found under such
different conditions.
52
References:
[1] Ishikawa, Y.; Yamai, N.; Okayama, K.; Nakamura, M.; , "An Identification Method of PCs behind
NAT Router with Proxy Authentication on HTTP Communication," Applications and the Internet
(SAINT), 2011 IEEE/IPSJ 11th International Symposium on , vol., no., pp.445-450, 18-21 July 2011.
[2] G. Maier, F. Schneider, and A. Feldmann. NAT Usage in Residential Broadband Networks.
Proceedings of the 12th International Conference on Passive and Active Network Measurement (PAM
2011), Atlanta, Georgia, March 2011.
[3] Murakami, R.; Yamai, N.; Okayama, K.; , "A MAC-address Relaying NAT Router for PC
Identification from Outside of a LAN," Applications and the Internet (SAINT), 2010 10th IEEE/IPSJ
International Symposium on, vol., no., pp.237-240, 19-23 July 2010.
[4] Li Rui; Zhu Hongliang; Xin Yang; Yang Yixian; Wang Cong; , "Remote NAT Detect Algorithm
Based on Support Vector Machine," Information Engineering and Computer Science, 2009. ICIECS
2009. International Conference on, vol., no., pp.1-4, 19-20 Dec. 2009.
[5] PHAAL, P. Detecting NAT devices using sFlow. http://www.sflow.org/detectNAT/
(last modified: 2009).
[6] MILLER, T. Passive OS fingerprinting: Details and techniques. http://www.ouah.org/
incosfingerp.htm (last modified: 2005).
[7] R. BEVERLY. A robust classifier for passive TCP/IP fingerprinting. In Proc. Conference on
Passive and Active Measurement (PAM) (2004).
[8] S. M. Bellovin, “A technique for counting natted hosts”, In IMW ’02: Proceedings of the 2nd ACM
SIGCOMM Workshop on Internet measurement, pp. 267–272, New York, NY, USA, 2002, ACM Press.
[9] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[10] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
53
[11] A. A. Ghorbani, W.Lu, M. Tavallaee. Network Intrusion Detection and Prevention Concepts and
Techniques. Springer Science+Business Media, 2010.
[12] C. Fleizach and S. Fukushima. A Naive Bayes Classifier on 1998 KDD Cup.2006
[13] C.J.C Burges A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and
Knowledge Discovery 2, 121-167, 1998
[14] D.K. Srivastava and L. Bhambhu. Data Classification Using Support Vector Machine. Journal of
Theoretical and Applied Information Technology.2005-2009.
[15] N. Ye.The Handbook of Data Mining. Lawrence Erlbaum Associates, Inc. 2003.
[16] NetMate.http://www.ip-measurement.org/tools/netmate/
[17] 60. IETF. http://www3.ietf.org/proceedings/97apr/97apr-final/xrtftr70.htm.
[18] L. D. Ghein. MPLS Fundementals: Forwarding Labeled Packets. Cisco Press. 2007.
[19] R. Alshammari and A. N. Zincir-Heywood. Can encrypted traffic be identified without port numbers,
IP addresses and payload inspection? Computer Networks. Vol. 55, pp 1326-1350,April 2011.
[20] R. Alshammari, A. N. Zincir-Heywood, Machine learning based encrypted traffic classification:
Identifying ssh and skype, IEEE Symposium on Computational Intelligence for Security and Defense
Applications, pp. 1–8, 2009.
[21] R. Alshammari, N. Zincir-Heywood, Generalization of signatures for ssh encrypted traffic
identification, IEEE Symposium Computational Intelligence in Cyber Security, pp. 167–174, 2009.
[22] A. Borders, S. Ertekin, J. Weston, L. Bottou. Fast kernel Classifiers with Online and Active
Learning.Journal of Machine Learning Research. 2005.
[23] Receiver Operating Characteristic.http://en.wikipedia.org/wiki/Receiver_operating_characteristic
[24] S. Malik. Network Security Principles and Practices. Cisco Press. 2003.