FireSense: Firewall-Based Occupancy Sensing
Richard Yu
University of California, Los Angeles
Table of Contents
1. Introduction
2. Background / Related Work
3. Tools
   3.1. PFSense
   3.2. TShark
   3.3. SVM
   3.4. HMSVM
4. Implementation
   4.1. The Sensor
        TShark
        Data Aggregator
   4.2. The Data Server
   4.3. Determining Occupancy
        Peer Connections
        Router-Level Statistics
   4.4. Gathering the Ground Truth
5. Evaluation
   5.1. Binary Results
   5.2. Estimating Actual Occupancy
   5.3. Network Traffic vs. Occupancy
6. Conclusion
7. Acknowledgments
8. References
9. Appendix
Abstract
As the field of occupancy detection grows, more and more sensors are being developed or repurposed into occupancy detectors. Each new sensor brings with it additional maintenance and power costs. FireSense is a firewall-based sensor that repurposes existing infrastructure to add sensing capability with little overhead. The sensor monitors network traffic across the firewall's coverage space and reports network traffic statistics and events. Here, we show that this information can be used for occupancy detection, though that is by no means the limit of FireSense's abilities.
1. INTRODUCTION
Detecting occupancy has been a popular area of research in recent years for its potential in appli-
cations including security and energy management. If a room or building is uninhabited, the lights
can be safely turned off, the HVAC system can be relieved, and even computing power can be
transferred to remote processes. Over time, the decreasing energy usage can add up to a large
amount of saved expenses. Using sensors can offload the responsibility of turning off lights and
appliances from humans onto automated systems.
FireSense is a novel sensor system that monitors internet traffic to determine occupancy. The sen-
sor is a software package built to work on a PFSense firewall that monitors incoming and outgoing
traffic. The advantages of the FireSense system are its low cost, unique area of sensing coverage,
and broad set of information. Often, sensors are single-function hardware that incur a separate
energy overhead just for monitoring occupancy. Because FireSense runs on a firewall, the in-
creased energy cost is minimal since the hardware can be used for other purposes and may already
be in place.
Further, many sensors have gaps in their ability to sense user presence. Motion detectors, the most popular occupancy indicator, often fail to notice people working relatively motionless at their computers. This leads to situations where people find themselves at their desks as the lights turn off, forcing them to wave their arms to re-trigger the sensor. FireSense covers exactly this case by detecting when a person is actively using their computer.
Data from FireSense can be combined with other sensors to create a more accurate and precise
view of an area’s level of occupancy and the behavior of the occupants.
Lastly, the FireSense sensor can provide much more detailed and broad information about network
usage than just the occupancy. A wide variety of individual and aggregate statistics and events can
be captured by the sensor and processed later on. Currently, occupancy is calculated on a backend
system using data captured from the sensor. An occupancy decision is made based on the results
of the Hidden Markov Support Vector Machine (HM-SVM) outlined by Altun et al. (2003).
2. BACKGROUND / RELATED WORK
Much work has been done to classify internet usage and connection information based on TCP
communications. Often, these methods would involve collecting packet information and commu-
nication statistics and feeding them into a machine learning algorithm. Popular inputs include TCP
port number, packet length statistics, flow duration, and packet inter-arrival time. (Nguyen &
Armitage, 2008) Moore and Zuev touched on the importance of using only packet arrival statistics
and packet header information to maintain user privacy. Their goal was to create a system that
could classify packets between bulk (ftp), databases, interactive (e.g. SSH, telnet), mail, services
(e.g. DNS, NTP), web access (www), peer-to-peer, attacks (worms and viruses), games, and mul-
timedia. (Moore & Zuev, 2005)
FireSense takes after IP classification technologies in two ways: 1) FireSense uses only packet header information, including TCP ports and packet lengths; and 2) FireSense follows the process of gathering network data and feeding it to a machine learning algorithm to reach a classification decision. However, FireSense also differs from traffic classification. Instead of stream-level information, much of FireSense's data consists of router-level traffic aggregates. Such coarse information suffices because FireSense's decisions are not as fine-grained. Moreover, given the volume of the traffic and the limitations of the PFSense hardware used in the experiment, gathering finely detailed information could bog down processing too much for the sensor to run in real time.
IP Traffic classification technology and FireSense may be combined in the future. Being able to
classify flows into communication types can greatly benefit occupancy decisions. Instant messag-
ing and browser-based applications are more likely to indicate occupancy than bulk and peer-to-
peer traffic. In the meantime, though, simpler methods must be used to classify traffic, including
port numbers and header fields.
3. TOOLS
3.1. PFSense
Firewalls are often used in workspaces today to monitor and control the flow of traffic and to
separate networks. Because networks are often associated with physical spaces (e.g. a lab or office
space), monitoring traffic through a firewall can often be analogous to monitoring traffic for a
space. Throughout this paper, we assume the space of interest is covered uniquely by a single firewall through which all user-generated traffic passes.
PFSense is a free and open source implementation of a firewall and router built atop FreeBSD. It
utilizes a web interface that allows administrators to create and manage user accounts, IP traffic
rules, forwarding, and traffic routing. Settings can also be managed by using a terminal directly on
the PFSense box. Using the latter method, the shell command line terminal can be accessed and
used as a normal FreeBSD operating system. (PFSense)
For testing, we used an ALIX.2 series dedicated PFSense board and connected to the terminal via
the serial connection. Files were downloaded onto the firewall using the web interface and the
sensor programs were run through the serial connection.
3.2. TShark
TShark is a terminal program used to capture, monitor, and extract information from packets on a computer's network interfaces. The program is the basis for the better-known Wireshark, which is a graphical user interface (GUI) for TShark's functions. Both programs are controlled by a set of input and output filters that let the user control which packets should be captured or ignored and what data should be gathered and reported for captured packets. To process packets, TShark first captures them and stores them in a file. From the file, TShark can process and extract the packet information for formatted output. (Wireshark)

Table 1 Captured ports used in FireSense. Compiled from (Well known TCP and UDP ports used by Apple software products, 2012), (Red Hat Enterprise Linux 4: Security Guide Appendix C. Common Ports, 2012), and (Service Name and Transport Protocol Port Number Registry, 2012).

PORT NAME DESCRIPTION
21 ftp FTP Control (Command)
22 ssh Secure Shell (SSH) service
23 telnet Telnet service
24 - Private mail system
80 http HyperText Transfer Protocol (HTTP)
110 pop3 Post Office Protocol v3
113 auth Authentication and Ident protocols
143 imap Internet Message Access Protocol (IMAP)
194 irc Internet Relay Chat (IRC)
220 imap3 Internet Message Access Protocol version 3
443 http_secure Secure Hypertext Transfer Protocol (HTTPS)
993 imaps Internet Message Access Protocol over Secure Sockets Layer (IMAPS)
994 ircs Internet Relay Chat over Secure Sockets Layer (IRCS)
995 pop3s Post Office Protocol version 3 over Secure Sockets Layer (POP3S)
1723 pptp Microsoft Point-to-Point Tunneling Protocol
2195 - Apple Push Notification service
2196 - Apple Push Notification - Feedback
3031 - Remote AppleEvents
3283 NetAssistant Apple Remote Desktop
3389 RDP Windows Remote Desktop Protocol
3724 blizwow World of Warcraft online gaming (MMORPG)
3784 - VoIP port used by Ventrilo
5190 - ICQ and AOL Instant Messenger
5222 xmpp Extensible Messaging and Presence Protocol
5900 VNC Apple Remote Desktop
5988 wbem-http Apple Remote Desktop
6665 irc Internet Relay Chat (IRC)
6666 irc Internet Relay Chat (IRC)
6667 irc Internet Relay Chat (IRC)
6668 irc Internet Relay Chat (IRC)
6669 irc Internet Relay Chat (IRC)
8080 http-alt HTTP Alternative
For FireSense, the TShark invocation is set to create a circular buffer of ten files, each 512 KB in size. Using a circular buffer keeps TShark's data space bounded so it does not eventually run the firewall out of memory. However, since the circular buffer continually overwrites itself, if the packet capture rate exceeds the packet processing rate, the input and output iterators will eventually collide and cause TShark to crash. Although rarely a problem on a small (personal) scale, in a lab setting this can bring the sensor down after only a few minutes.
FireSense mitigates the risks and effects of TShark crashes in two ways: 1) the range of captured packets is reduced through the capture filter; and 2) FireSense can detect a crash and will restart itself. Captured packets are limited to TCP connections using a limited set of ports. The ports currently identified as important for FireSense are listed in Table 1.
3.3. SVM
Support Vector Machine (SVM) algorithms are a classification technology used in many applica-
tions. SVM algorithms take in N-dimensional data points and plot them on an N-dimensional
graph. Data points are then compared against an N-1-dimensional plane and classified as greater
or less than the plane. Once determined, the plane is used as a decision boundary to separate two
classifications of data points. In FireSense, the classifications used are "occupied" and "non-occupied". Planes can take on many shapes, defined by the SVM's kernel; popular kernels include linear, polynomial, and radial basis function (RBF). FireSense uses a linear kernel, declaring occupancy when a data point lies on the occupied side of the plane.
To create the decision boundary, data points with known classifications are input into the SVM
classifier. The classifier iterates through several boundary possibilities until it finally settles on
one with a small enough overall classification error. Once trained, an SVM classifier can be used
to classify new data points. The accuracy of the SVM is often gauged on how well it can classify
a data set not used in the training process.
The advantage of SVM is its ability to relate data channels to each other. Instead of considering
each channel individually, all dimensions are combined into a single formula that can have positive
or negative relationships between different data channels. FireSense uses SVM to correlate and
generate decision boundaries on a wide set of data channels.
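As an illustration of the training process, here is a minimal linear SVM trained by sub-gradient descent (a Pegasos-style sketch, not the solver FireSense actually uses), on invented two-channel traffic features where +1 stands for "occupied" and -1 for "non-occupied":

```python
def train_linear_svm(points, labels, lam=0.01, epochs=2000):
    """Minimal linear SVM via hinge-loss sub-gradient descent.

    points: list of feature vectors; labels: +1 (occupied) / -1 (non-occupied).
    Returns (weights, bias) defining the separating hyperplane.
    """
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    t = 0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decaying learning rate
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # margin violated: regularize and step toward the point
                w = [(1 - eta * lam) * wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
            else:
                # margin satisfied: regularization shrink only
                w = [(1 - eta * lam) * wi for wi in w]
    return w, b

def classify(w, b, x):
    """Side of the hyperplane: +1 occupied, -1 non-occupied."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Training on points with known classifications and then scoring a held-out set, as described above, is exactly how the boundary's accuracy would be gauged.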
A limitation of SVM is that it is time-independent, while an occupancy inference can greatly benefit from neighboring data. If nobody is in the area but network traffic suddenly surges, there is a high probability that someone has just entered and started using a computer. If network access is currently low but has been high in the recent past, there may simply be a lull in usage while the space remains occupied. To account for time-based traffic and its effect on state changes, Hidden Markov Models were added to SVM to create a better model for occupancy.
3.4. HMSVM
Hidden Markov SVM is a system proposed by Y. Altun et al. in 2003. (Altun, Tsochantaridis, &
Hofmann, 2003) The system uses a Hidden Markov Model (HMM) to create transition labels based
on past and future behavior and classifications. These transition labels are then added to the data
as separate data channels before being fed into the SVM classifier. The result is a time-aware,
stateful data classifier. FireSense uses SVMhmm, an implementation of the Altun paper created at Cornell University on top of SVMstruct. (Joachims, 2012)
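SVMhmm consumes SVMlight-style training files in which each line is one token of a sequence and a `qid` groups the tokens of a sequence in order. A serializer for one time-ordered sequence of binned feature vectors might look like the following; the label values (1 for non-occupied, 2 for occupied) and the zero-skipping are illustrative conventions, not FireSense's actual choices.

```python
def to_svmhmm_lines(sequence_id, labels, feature_rows):
    """Serialize one sequence into SVM^hmm training lines:
    "label qid:SEQ featnum:value ..." with 1-based feature numbering."""
    lines = []
    for label, row in zip(labels, feature_rows):
        # sparse encoding: omit zero-valued features
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0)
        lines.append(f"{label} qid:{sequence_id} {feats}".strip())
    return lines
```

Each day's time bins would form one sequence, so the HMM transition structure is learned within days rather than across them.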
4. IMPLEMENTATION
The FireSense system shown in Figure 1 consists of three major components: the sensor, the
server, and the processor. The sensor monitors IP traffic, aggregates the data, and reports it to the
server. The server stores information for retrieval from the processor. The processor downloads
information from the server and applies the HMSVM classification to determine occupancy. When
taken off the test bed, the processor can be moved onto the server.
4.1. The Sensor
FireSense’s Sensor module sits on the PFSense firewall. Its goal is to listen to network traffic,
aggregate the data, and send it to the data server. Figure 2 shows the block diagram for the Sensor
module. TShark is responsible for collecting packets and reporting them to the data aggregator.
The data aggregator is a multithreaded Python program that stores data metrics and events and reports them to the server. Reports normally occur at regular intervals, but stream events are sent to the data server as they happen. Streams occur between two IP addresses and exist as long as the tracked TCP connection exists between the peers. For the rest of this paper, stream events will refer to a stream's opening and closing.
Figure 1 Block diagram showing the three major components of FireSense. 1) The Sensor is hosted on the PFSense firewall box. 2) The Server is hosted on a virtual machine. 3) The Processor is hosted on a backend PC.

TShark
Certain settings must be passed into TShark for it to correctly connect to the rest of FireSense. The first setting is to reduce the scope of captured packets to those with ports listed in Table 1. This serves the dual purposes of increasing the sensor's maximum bandwidth and limiting the listening
ports, making it easy to filter the data and network metrics to only what is applicable to the appli-
cation. The second setting is to create a ring buffer of files in which TShark stores incoming captured packets. Using a ring buffer limits the space that captured packets take up so they do not fill the system's entire memory. The downside of ring buffers is that they are susceptible to overflow, causing TShark to crash when the input and output positions collide. Setting the listening interface
to the LAN port ensures that all traffic going through is either to, from, or between local users.
Lastly, the capture output filters must be set correctly
to communicate with the data aggregator program. The
output features are listed in Table 2.
Data Aggregator
The data aggregator consists of three main sections: the initial data parser and filter, the data managers, and the data reporters.
4.1.2.1. Filter
The filter contains the following two rules for acceptable packets:
1) Only TCP and UDP packets are accepted.
2) Accepted packets have one peer inside the local network and one peer outside the local network.
TSHARK ID DESCRIPTION
ip.proto IP Protocol
ip.src IP Source Address
ip.dst IP Destination Address
ip.len IP Packet Length
udp.srcport UDP Source Port
udp.dstport UDP Destination Port
tcp.srcport TCP Source Port
tcp.dstport TCP Destination Port
tcp.flags.ack TCP ACK Flag
tcp.flags.fin TCP FIN Flag
tcp.flags.reset TCP RESET Flag
http.user_agent HTTP User Agent
Table 2 TShark capture output filter items.
Figure 2 Sensor block diagram. Rounded squares represent sequences of events initiated by input from TShark. Diamonds represent events that can cause data to be sent to the Data Server.
Although the data aggregator is designed to accept and report on both TCP and UDP packets, in
practice, only TCP packets are sent to the data aggregator. Allowing UDP packets would overtask
the system and cause collisions in the TShark ring buffer.
4.1.2.2. Data Managers
The data managers are the computational meat of the sensor. There are three types of data manag-
ers: router, peer, and stream. Their functions are described below and summarized in Algorithm
1. The relationship between managers is described in Figure 3.
There is only one router manager, and it directly accepts all data that has passed through the filtering stage. Data coming into the router manager is used to update router-level metrics before being passed to the appropriate peer manager (based on the source and destination IP addresses).
There is one peer manager for each IP address in the local network. Peer managers are created when a new packet is captured whose source or destination IP is a local address and does not already correspond to an existing manager. Peer managers are torn down when there are no more active TCP connections to the peer or when the peer has not sent nor received a message within a timeout period. Packets passed to the manager are used to update peer-level metrics before being passed to a stream manager (TCP packets) or discarded (UDP packets).
Receive Data
  If data fails filter
    Discard data
  Else
    Pass data to RouterManager
      Update Router Metrics
      If New Local IP
        Create new PeerManager
      Pass Data to PeerManager
        Update Peer Metrics
        If New IP/Port Pair
          Create new StreamManager
          Report Stream Open
        Pass Data to StreamManager
          Update Stream Metrics
          If TCP.RESET or (TCP.FIN and TCP.ACK)
            Report Stream Close
            Tear down StreamManager
Algorithm 1 Sensor data capture behavior
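The manager hierarchy of Algorithm 1 can be condensed into a runnable sketch. Class and method names here are invented for illustration; the real metric set, timeouts, and reporting threads are omitted.

```python
class StreamManager:
    def __init__(self):
        self.packets = 0

class PeerManager:
    def __init__(self):
        self.streams = {}  # (foreign_ip, port) -> StreamManager
        self.packets = 0

class RouterManager:
    """Condensed sketch of Algorithm 1 (reporting stubbed as an event list)."""
    def __init__(self):
        self.peers = {}    # local ip -> PeerManager
        self.packets = 0
        self.events = []

    def handle(self, local_ip, foreign_ip, port, flags=()):
        self.packets += 1                           # router-level metrics
        peer = self.peers.setdefault(local_ip, PeerManager())
        peer.packets += 1                           # peer-level metrics
        key = (foreign_ip, port)
        if key not in peer.streams:                 # new IP/port pair
            peer.streams[key] = StreamManager()
            self.events.append(("open", local_ip, key))
        peer.streams[key].packets += 1              # stream-level metrics
        # tear down on RESET, or on FIN together with ACK
        if "RST" in flags or ("FIN" in flags and "ACK" in flags):
            del peer.streams[key]
            self.events.append(("close", local_ip, key))
```

The dict-per-level layout mirrors Figure 3: one router manager owning peer managers, each owning its active stream managers.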
Figure 3 Data Manager Layout
FireSense refers to ongoing TCP connections between two peers as streams. An example stream
is an ftp connection between peers A and B. The stream begins when the file transfer is set up and
ends when the transfer is torn down. Each peer manager contains a stream manager for each active
stream. New stream managers are created when a peer manager processes a packet whose foreign
IP/port pair is unknown. When a new stream is created, a stream open event is generated that immediately
reports the stream creation to the data server. Stream managers are torn down when the connection
is destroyed or when no packet has been sent or received within a timeout period. Destroyed
streams are detected when a packet contains either the TCP RESET flag or both of the TCP FIN
and ACK flags. When a stream manager is destroyed, it generates a stream close event that is
immediately reported to the data server.
4.1.2.3. Reporters
There are three reporters in the sensor: router, peer, and stream. Each reporter runs in a separate thread, making a total of four threads in the data aggregator: the input handler and three reporters.
Threads must obtain a lock on the Router Manager before writing to or reading any manager to
prevent data overwrites and collisions. Each reporter is responsible for managing timeouts and
periodic reports. Each reporter has a unique, configurable update interval. When an update interval elapses, the reporter searches for managers that have timed out and removes them before reporting metrics for each manager of the reporter's type. Managers have their metrics
cleared after a report. Metrics are in terms of the reporting interval.
Periodic and event reports are sent to the data server through a TCP connection in JSON format.
The JSONs are labeled by the manager/reporter type and the message event trigger (open, close,
or periodic). The exact JSON formats can be found in the appendix. Peer IPs are anonymized by
performing an exclusive or between the last byte of the IP address and a mask byte randomly
generated when the data aggregator program begins. This protects the privacy of users by making
it harder to correlate activity with a specific user IP.
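The anonymization step can be sketched as follows; the helper name is invented, and in the real aggregator the mask byte is drawn once at program start rather than per call.

```python
import os

MASK = os.urandom(1)[0]  # random mask byte chosen once at startup

def anonymize(ip, mask=None):
    """XOR the last byte of a dotted-quad IPv4 address with the session mask."""
    m = MASK if mask is None else mask
    parts = ip.split(".")
    parts[-1] = str(int(parts[-1]) ^ m)
    return ".".join(parts)
```

Because XOR is its own inverse, applying the same mask twice recovers the original address; anonymity therefore rests on the mask never leaving the sensor.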
Figure 4 Data server file structure. Peer.txt, stream.txt, event.txt, and router.txt contain lists of JSON objects received from the sensor.
4.2. The Data Server
The FireSense data server is responsible for receiving and storing information for processing. It
listens for sensor reports on a configurable port and logs the data to file. Files are organized first
by date, then by anonymized local peer IP (if applicable) as shown in Figure 4. Files router.txt,
peer.txt, stream.txt, and event.txt contain lists of the received JSON objects from the sensor.
4.3. Determining Occupancy
Occupancy is detected by extracting features from the sensor reports, time-binning the features,
and running the classification algorithm on the resulting matrix. In general, features can be divided
into two subsets: peer connections and router-level statistics with the addition of time of day as a
feature. Features are summarized and binned into time-periods of size T. Typically, a time period
T = 1 minute works fairly well. A full list of features can be found in the appendix.
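The time-binning step might look like the following; the bin width and the (timestamp, value) sample layout are assumptions for illustration.

```python
from collections import defaultdict

def bin_samples(samples, bin_seconds=60):
    """Group (timestamp, value) samples into fixed-width time bins.
    Returns {bin_start_timestamp: [values, ...]} for later summarizing."""
    bins = defaultdict(list)
    for ts, value in samples:
        bins[ts - ts % bin_seconds].append(value)
    return dict(bins)
```

Each bin's value list is then reduced to scalar features (sums, maxima, and so on) before being handed to the classifier.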
Peer Connections
4.3.1.1. Defining Peer Connections
Peer connections are defined as TCP connections between a peer and an external/foreign IP where
the port on either the external or the local side is one of the tracked ports. Table 3 lists examples
of what does and does not qualify as peer connections. Peer connections can be identified by the qualifying end (external or local) and the tracked port number. If both sides use a tracked port number,
two connections are said to exist. In practice, the “no connection” scenario does not exist, since
the ports used in processing are the same as the capture-filter ports shown in Table 1. Peer
connections are created based on the open/close events received from the sensor.
Once detected, connections are split into three useful categories: foreign connections, remote connections, and web connections. Any connection that does not fall into these categories is not used for occupancy processing. Foreign connections are all connections that carry the "foreign" qualifier. Remote connections are local connections under a remote port listed in Table 4. Web connections are foreign connections whose "user agent" field is filled in. User agent fields are
Table 3 Example connections. Tracked port numbers can be found in Table 1. For any TCP connection, zero, one, or two peer connections can be formed.
LOCAL PORT EXTERNAL PORT CONNECTION
5600 80 Foreign 80
7200 5600 None
5222 8095 Local 5222
80 443 Local 80 / Foreign 443
Table 4 List of remote ports.
PORT NAME DESCRIPTION
21 ftp FTP Control (Command)
22 ssh Secure Shell (SSH) service
23 telnet Telnet service
3283 NetAssistant Apple Remote Desktop
3389 RDP Windows Remote Desktop Protocol
5900 VNC Apple Remote Desktop
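The qualification rule behind Table 3 can be sketched as a small helper; the function name is invented and the port set below is only a subset of Table 1.

```python
TRACKED_PORTS = {21, 22, 23, 80, 110, 143, 443, 1723, 5222, 5900, 8080}

def peer_connections(local_port, foreign_port):
    """Return the peer connections formed by one TCP connection:
    zero, one, or two, each qualified by which end uses a tracked port."""
    conns = []
    if local_port in TRACKED_PORTS:
        conns.append(("local", local_port))
    if foreign_port in TRACKED_PORTS:
        conns.append(("foreign", foreign_port))
    return conns
```

Running the helper over the rows of Table 3 reproduces the listed qualifications, including the two-connection case where both ends use tracked ports.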
generally only filled in by web browsers, although automatic browser traffic (such as updates) can
also fill in the field. This field can be retrieved from HTTP packets by TShark. The remaining unutilized connections are local connections that are not under a remote port.
Foreign port and web connections are useful as inputs to the SVM algorithm. In general, when monitored ports are more active or web usage is high, the likelihood of a person being at the computer also increases. However, this assumption can be misleading at times. If a person is remotely accessing a computer in the monitored space, they will likely generate increased traffic and, subsequently, more port events. To deal with this issue, we mask foreign port connections with remote connections: all connections are considered inactive while a remote connection exists.
Virtual private networks (VPNs) are another major concern for false positives in occupancy de-
tection. During the experiment, the firewall included an active PPTP connection but it did not
prove to be an issue. PFSense creates new interfaces that handle PPTP traffic. This keeps VPN
traffic separate from LAN traffic and, in turn, invisible to FireSense.
4.3.1.2. Using Peer Connections
Now that peer connections have been defined, it is important to know how they are linked to SVM
inputs, and ultimately, to the detection of occupancy. FireSense uses the number of currently active
connections on each port as the SVM feature. There are N+1 peer connection features fed to the
SVM, where N is the number of tracked ports and the extra feature summarizes web connections.
Each feature represents the current number of connections to that port P, currently active within
the monitored space.
Foreign_P = Σ_Peers ( Σ Port-P Connections )
However, as mentioned at the beginning of Section 4.3, our features are time-binned. Peer con-
nection features must represent the status of the time-bin, so a new formula must be created based
on the instantaneous formula. The new feature we suggest for port P is the total number of unique
connections to P over the last period T. This includes connections that conform to any of the fol-
lowing criteria: 1) the connection began in period T or 2) the connection began before period T
but was still active in period T. This rule is summarized in the following equation and illustrated in Figure 5.
Foreign_P = Σ_Peers ( Σ Initiated P Connections + Σ Continuing P Connections )
Figure 5 Illustration of connections counted in period T. Colored lines represent connection durations. Blue connec-
tions would be counted in period T. Red connections would not.
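The two criteria (began in T, or began earlier and still active in T) amount to an interval-overlap test, which might be written as follows; the half-open boundary convention is an assumption.

```python
def connections_in_bin(intervals, bin_start, bin_len=60):
    """Count connections active at any point during [bin_start, bin_start+bin_len):
    those that opened inside the bin plus those opened earlier but still open."""
    bin_end = bin_start + bin_len
    return sum(1 for open_t, close_t in intervals
               if open_t < bin_end and close_t > bin_start)
```

Each (open, close) pair would come from the stream open/close events reported by the sensor, and the count is computed per tracked port to fill the Foreign_P features.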
In total, thirty-two of the SVM features are attributed to foreign port and web connections. Most
of these features are sparse, and some are never populated in the sample set, but all are included
in the system for the sake of completeness and robustness.
Router-Level Statistics
Router-level statistics are taken from the periodic router updates from the sensor. These statistics
include information on active peers, TCP packet counts, and TCP packet sizes. The router update
period of the sensor during the experiment was arbitrarily set to five seconds, much shorter than
the average time bin used for occupancy evaluation. Although information within a time bin can
be easily summarized by summing or averaging values, more information can be gleaned by taking
and comparing varying statistical quantities. To investigate these options, a range of statistical
functions were identified for use as SVM features. Table 5 details the router statistics (horizontal)
and the summarizing functions considered as features, along with the combinations ultimately chosen for use. These values most closely resembled occupancy levels
throughout the day.
The chosen features, shown in Figure 7, are predominantly focused on outward traffic. Most user-generated traffic is in request form, and outward traffic is ideal for tracking requests. Additionally, since inbound traffic is generated from external sources, it can be less predictable and more prone
to false spikes. The same argument applies for total packet count and sizes, since they are heavily
affected by inbound traffic.
We again stress that the router-level features are actually the results of statistics applied to periodically reported values from the sensor. Consider the case where five peers actively send or receive packets within the time period T. For a router reporting period R, there are T/R samples within the time bin. If, during any one of those samples, all peers send or receive packets, the feature value will be five. However, if each peer is active in a different sample, the result will be one. Note that there
is no feature that will give the number of unique peers active within the time bin. Table 6 lists the
router-level features and their equations.
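A sketch of reducing one bin's router samples to the six chosen features follows, using Python's statistics.quantiles for the IQR. The feature names follow the series labels in Figure 7; the input layout (parallel per-sample lists) is an assumption.

```python
from statistics import quantiles

def router_features(out_counts, peers_per_sample, largest_out):
    """Summarize one time bin's per-R router samples into the six features."""
    q1, _, q3 = quantiles(out_counts, n=4)  # quartiles of the bin's samples
    return {
        "Peers_Max":       max(peers_per_sample),
        "TCP_Out_Max":     max(out_counts),
        "TCP_Out_Min":     min(out_counts),
        "TCP_Out_Sum":     sum(out_counts),
        "TCP_Out_IQR":     q3 - q1,
        "TCP_Out_Max_Min": min(largest_out),  # min over samples of largest packet
    }
```

Note how Peers_Max takes the maximum over samples, matching the caveat above: it cannot recover the number of unique peers active across the whole bin.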
Router statistics (available): Active Peers; Packet Count; Packet Count (Out); Packet Count (In); Largest Packet; Largest Packet (Out); Largest Packet (In).
Summarizing functions (available): Max; Min; Sum; Mean; Standard Deviation; Median; Interquartile Range (IQR); Range.
Chosen combinations: Max(Active Peers); Max(Packet Count (Out)); Min(Packet Count (Out)); Min(Largest Packet (Out)); Sum(Packet Count (Out)); IQR(Packet Count (Out)).
Table 5 Available vs. chosen router statistics. Functions are applied to router statistics; the six chosen combinations are used as SVM features.
FEATURE EQUATION
Maximum Active Peers  Max_T( Σ Unique Peers_R )
Maximum Packet Count (Out)  Max_T( Count_R(Packets) )
Minimum Packet Count (Out)  Min_T( Count_R(Packets) )
Sum of Packet Count (Out)  Σ_T Count_R(Packets)
Interquartile Range of Packet Count (Out)  IQR_T( Count_R(Packets) )
Minimum Largest Packet (Out)  Min_T( Max_R(Packet Size) )
(Subscript T denotes a function over all samples in the time bin; subscript R denotes a value from one reporting period.)
Table 6 Router-level features and their equations.

Figure 7 Occupancy plotted over the router-level statistic features for one day (series: TCP_Out_Max_Min, Peers_Max, TCP_Out_Max, TCP_Out_Min, TCP_Out_IQR, TCP_Out_Sum, Occupancy). Both occupancy and the statistics are plotted as fractions of their maximum values in the day.
Figure 6 Camera server file system layout
4.4. Gathering the Ground Truth
Ground truth occupancy for the experiment was gathered by monitoring entry points into the lab.
Pictures are taken by cameras mounted near the doors and stored in the directory structure shown
in Figure 6. The pictures can be examined later to establish entry and exit times.
To save researchers from examining an entire day's data, camera functionality is linked to sensors
on the doors that produce open and close events. When an event occurs, a message is sent to the
camera server, which is housed on the same virtual machine as the FireSense server but uses a
different port. The message contains a door identifier and either "Open" or "Closed".
When the server receives an open message, it begins recording pictures of the door. Pictures are
recorded every half-second until the corresponding close message is received. All pictures taken
during one open-to-close period are stored in a folder labeled by the open time.
The camera server is implemented as a multithreaded program, containing one thread per door, an
additional thread to manage the server interface, and finally, the main thread. The door threads
have two states: waiting for an open door and processing an open door. Each door thread has an
associated open-door Boolean value, which can only be set or unset by the server interface
thread. Algorithm 2 shows the functionality of the door and interface threads. Figure 8 shows a
functional example of the detection process.
Algorithm 2 Camera server algorithms. A) Door thread process. B) Server interface process.
Loop Forever
    If door open event
        Get time
        Create time-stamped folder
        While door is open
            Record picture
            Wait 0.5 seconds
(a)

Loop Forever
    Wait for message
    Get door, open/close from message
    If open
        Set door indicator to open
    If close
        Set door indicator to close
(b)
Figure 8 Camera server and door sensor example. Sensors monitor the doors and send open and close events to the
server, which then begins or ends the capture process. The boxed texts "L Open" and "R Closed" are messages
relayed to the server by the sensors.
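A minimal threaded sketch of Algorithm 2 in Python (the capture callback, the "door,State" message format, and the timing constants are illustrative assumptions, not the actual camera-server code):

```python
import threading
import time

class DoorThread(threading.Thread):
    """Door thread (Algorithm 2a): waits for its door to open, then records
    a picture every `interval` seconds until the door closes."""
    def __init__(self, door_id, capture, interval=0.5):
        super().__init__(daemon=True)
        self.door_id = door_id
        self.capture = capture            # callable returning one picture
        self.interval = interval
        self.is_open = threading.Event()  # set/cleared only by the interface
        self.pictures = []

    def run(self):
        while True:
            if self.is_open.wait(timeout=0.05):      # door open event
                folder = time.strftime("%H-%M-%S")   # time-stamped folder
                while self.is_open.is_set():         # while door is open
                    self.pictures.append((folder, self.capture(self.door_id)))
                    time.sleep(self.interval)

def handle_message(doors, message):
    """Server interface thread (Algorithm 2b): messages look like 'L,Open'."""
    door_id, state = message.split(",")
    if state == "Open":
        doors[door_id].is_open.set()
    elif state == "Closed":
        doors[door_id].is_open.clear()
```

All pictures from one open-to-close period share the same time-stamped folder label, matching the storage layout described above.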
5. EVALUATION
5.1. Binary Results
The FireSense system was tested over a two-and-a-half-week period, during which eleven weekdays
were evaluated. Five days were used as the training set for the SVM classifiers. Time bins were
set to one minute. Once the data was processed into binned features, it was fed into both
Matlab's svmtrain function and the SVMhmm tool with binary occupancy ground truth (occupied
vs. non-occupied). This way, we could compare a standard SVM classifier with the combined
SVM and HMM classifier. The results are summarized in Table 7. We calculate accuracy as the
percentage of correctly classified time bins throughout the day. Figure 9 shows the calculated
versus actual binary occupancy values for one day. Results from the SVMhmm tool tend to be more
stable than their SVM counterparts due to the HMM's stateful nature.

[Figure 9: one-day occupancy values over 24 hours, showing ground truth, the SVM HMM result, and the Matlab SVM result.]

Table 7 (a, b, c) Occupancy accuracies of Matlab svmtrain/svmclassify versus SVMhmm. A) Overall accuracy on the
entire data set. B) Accuracy over the training data set. C) Accuracy over the remainder of the set.

(a)
DATE        MATLAB SVM   SVM HMM
3/18/2013   82.18%       87.51%
3/20/2013   86.44%       96.40%
3/22/2013   73.59%       85.76%
4/8/2013    84.04%       92.58%
4/9/2013    90.43%       91.93%
4/10/2013   92.36%       92.83%
4/11/2013   94.41%       96.24%
4/12/2013   91.52%       94.90%
4/15/2013   95.26%       95.78%
4/16/2013   92.03%       94.55%
4/17/2013   96.16%       98.17%
Average     88.95%       93.33%

(b)
DATE        MATLAB SVM   SVM HMM
4/8/2013    84.04%       92.58%
4/9/2013    90.43%       91.93%
4/10/2013   92.36%       92.83%
4/11/2013   94.41%       96.24%
4/12/2013   91.52%       94.90%
Average     90.55%       93.70%

(c)
DATE        MATLAB SVM   SVM HMM
3/18/2013   82.18%       87.51%
3/20/2013   86.44%       96.40%
3/22/2013   73.59%       85.76%
4/15/2013   95.26%       95.78%
4/16/2013   92.03%       94.55%
4/17/2013   96.16%       98.17%
Average     87.61%       93.03%
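Accuracy throughout this section is the fraction of correctly classified one-minute bins in a day; a minimal sketch:

```python
def bin_accuracy(predicted, truth):
    """Fraction of time bins whose predicted occupancy matches ground truth."""
    assert len(predicted) == len(truth)
    return sum(p == t for p, t in zip(predicted, truth)) / len(predicted)
```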
Unfortunately, SVMhmm uses an algorithm involving past and future data, resulting in classifiers
whose practical operation differs from their ideal operation. To explore this problem, we simulated
running the classification in delayed real time. The system could rely only on data from the
beginning of the day up to the current time plus the delay. For a five-minute delay, the occupancy
at 11:35 AM is calculated with data from 12:00 AM to 11:40 AM; for a ten-minute delay, data
ranges from 12:00 AM to 11:45 AM. The results are shown in Table 8 and illustrated in Figure 10.
DATE IDEAL 1M DELAY 5M DELAY 10M DELAY 20M DELAY
3/18/2013 86.34% 83.45% 85.52% 87.31% 89.01%
3/20/2013 95.62% 92.83% 93.93% 94.54% 96.40%
3/22/2013 85.95% 75.87% 80.06% 82.51% 84.28%
4/8/2013 91.34% 85.15% 87.93% 89.80% 91.73%
4/9/2013 92.03% 89.87% 90.82% 91.14% 91.36%
4/10/2013 92.92% 92.15% 93.82% 93.58% 92.97%
4/11/2013 96.29% 96.50% 96.49% 96.34% 96.39%
4/12/2013 94.97% 91.59% 93.01% 94.07% 94.76%
4/15/2013 95.69% 95.55% 95.75% 95.88% 95.85%
4/16/2013 94.62% 95.95% 97.27% 97.82% 97.73%
4/17/2013 98.19% 98.34% 98.19% 97.96% 98.17%
Average 93.09% 90.66% 92.07% 92.81% 93.51%
(a)
DATE IDEAL 1M DELAY 5M DELAY 10M DELAY 20M DELAY
4/8/2013 91.34% 85.15% 87.93% 89.80% 91.73%
4/9/2013 92.03% 89.87% 90.82% 91.14% 91.36%
4/10/2013 92.92% 92.15% 93.82% 93.58% 92.97%
4/11/2013 96.29% 96.50% 96.49% 96.34% 96.39%
4/12/2013 94.97% 91.59% 93.01% 94.07% 94.76%
Average 93.51% 91.05% 92.41% 92.99% 93.44%
(b)
DATE IDEAL 1M DELAY 5M DELAY 10M DELAY 20M DELAY
3/18/2013 86.34% 83.45% 85.52% 87.31% 89.01%
3/20/2013 95.62% 92.83% 93.93% 94.54% 96.40%
3/22/2013 85.95% 75.87% 80.06% 82.51% 84.28%
4/15/2013 95.69% 95.55% 95.75% 95.88% 95.85%
4/16/2013 94.62% 95.95% 97.27% 97.82% 97.73%
4/17/2013 98.19% 98.34% 98.19% 97.96% 98.17%
Average 92.73% 90.33% 91.78% 92.67% 93.57%
(c)
Table 8 SVMhmm accuracies for real-time processing with delays over a) the total data set; b) the training set; and c)
the non-training set.
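The delayed-window construction above can be sketched as follows (times in minutes since midnight; the function name is mine):

```python
def available_window(bin_minute, delay_minutes):
    """Data usable when classifying a given minute bin in delayed real time:
    from midnight (minute 0) through the bin's time plus the delay."""
    return (0, bin_minute + delay_minutes)
```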
Figure 10 Occupancy calculations by SVMhmm with varying real-time delays from 1 minute to 20 minutes.
5.2. Estimating Actual Occupancy
In addition to binary occupancy decisions, FireSense information can also be used to hint at the
actual occupancy level of the space. To do this, we attempted to track the number of active
devices in the lab. Peer connection features were calculated for each individual peer. A peer was
considered "active" when, during any minute, the number of active connections on any port
exceeded 30% of the daily maximum for that port. For example, suppose the daily maximum of port
80 connections is 30 and the daily maximum of port 443 connections is 20. If, within a time bin,
port 443 has more than 0.3 * 20 = 6 connections or port 80 has more than 0.3 * 30 = 9
connections, the peer is considered active. Furthermore, peer activity is held for five minutes
after the detected activity drops off, to account for a person being in the lab but not
generating active connections. The
results are detailed in Figure 11 for one day. In this figure, we compare activity calculations with
and without port 80 included. While port 80 tends to be very active during occupied periods, it
is also active during unoccupied ones. This default HTTP port is used by many programs that do
not involve user interaction, including background Dropbox synchronization and program updates.
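A sketch of the activity rule just described (Python; the data layout and names are mine): a peer counts as active in a minute bin if any port's connection count exceeds 30% of that port's daily maximum, and activity is held for five minutes afterwards.

```python
def active_minutes(conns_by_port, threshold=0.3, hold=5):
    """conns_by_port: {port: list of connection counts, one per minute bin}.
    Returns one boolean per minute bin: was the peer considered active?"""
    n = len(next(iter(conns_by_port.values())))
    daily_max = {port: max(counts) for port, counts in conns_by_port.items()}
    # Raw rule: any port exceeds threshold * its daily maximum in this bin.
    raw = [any(counts[i] > threshold * daily_max[port]
               for port, counts in conns_by_port.items())
           for i in range(n)]
    # Hold activity for `hold` bins after the last raw-active bin.
    held, last_active = [], None
    for i, active in enumerate(raw):
        if active:
            last_active = i
        held.append(last_active is not None and i - last_active <= hold)
    return held
```

With daily maxima of 30 (port 80) and 20 (port 443), a bin needs more than 9 or 6 connections respectively, matching the example above.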
Although these results show some false positives, the general trends of occupancy and active
devices have many similarities. The results show promise for tracking occupant activity inside
the space once occupancy has been established. Additionally, while this calculation alone is
clearly not sufficient for determining the occupancy level of the space, it could be used in
conjunction with other sensor data to calculate a good estimate of occupancy.
5.3. Network Traffic vs. Occupancy
Although network traffic can be a great identifier of occupancy, there are often cases when people
are in the lab but not using the network. They could be engaging in other activities or the network
[Four-panel chart omitted; each panel spans 24 hours with a 0-12 y-axis.]
Figure 11 Active peer calculations vs. occupancy. Occupancy is plotted in blue; calculated values are plotted in
orange. A) Occupancy vs. active peers. B) Occupancy vs. active peers with a running-average filter of size 10. C)
Occupancy vs. active peers without port 80. D) Occupancy vs. active peers without port 80 with a running-average
filter of size 10.
could be down. One test day, 3/18/13, summarized in Figure 12, shows this behavior. It is the
reason why occupancy results for that day are notably less accurate than the others, losing as
much as 7% accuracy versus the average in some cases.
6. CONCLUSION
We have shown that FireSense data can be used to generate occupancy decisions in a majority of
situations, with average accuracies over 90%. Additionally, we saw that the calculated number of
active peers correlates with the actual level of occupancy, though not strongly enough to produce
a reasonable estimate on its own. Lastly, we saw that occupants do not always generate network
traffic, leading to prediction errors.
As an occupancy sensor, FireSense works fairly well, but it does have flaws and blind spots.
However, if occupancy information is otherwise known or determined by a combination of
heterogeneous sensors, FireSense's strengths and weaknesses can both be used to gauge user
activity. When occupancy is high and network traffic is low, occupants may not be on their
computers, the network may be down, or there may be collaborative activity. As these areas of
research advance, systems like FireSense, which can be attached to existing infrastructure and
maintained at minimal cost, can become cheap and useful tools in an ever-expanding repertoire.
7. ACKNOWLEDGMENTS
This project was supported by: Haksoo Choi, who helped set up the virtual machine for the camera
and data servers, set up access to the PFSense firewall, and helped with understanding the PFSense
infrastructure; Kevin Ting, who set up and managed the door sensors and messages coming into
the camera server; and Professor Mani Srivastava, for guidance and ideas throughout this en-
deavor.
Figure 12 Chart showing the discrepancy between occupancy and network traffic, specifically around 16:00-17:00.
Features shown: Peers_Max, TCP_Out_Max_Min, TCP_Out_Max, TCP_Out_Min, TCP_Out_IQR, TCP_Out_Sum, each plotted as a
fraction of its daily maximum over 24 hours. The thick black line indicates occupancy.
8. REFERENCES
(2012, December). Retrieved from PFSense: http://pfsense.org
Altun, Y., Tsochantaridis, I., & Hofmann, T. (2003). Hidden Markov Support Vector Machines.
International Conference on Machine Learning (ICML).
Joachims, T. (2012, December). SVM hmm: Sequence Tagging with Structural Support Vector
Machines. Retrieved from Cornell Department of Computer Science:
http://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html
Moore, A. W., & Zuev, D. (2005). Internet Traffic Classification Using Bayesian Analysis
Techniques. SIGMETRICS'05. Banff, Alberta, Canada: ACM.
Nguyen, T. T., & Armitage, G. (2008). A Survey of Techniques for Internet Traffic Classification
using Machine Learning. IEEE COMMUNICATIONS SURVEYS & TUTORIALS, VOL. 10,
NO. 4, 56-76.
Red Hat Enterprise Linux 4: Security Guide Appendix C. Common Ports. (2012, December).
Retrieved from CentOS: http://www.centos.org/docs/4/html/rhel-sg-en-4/ch-ports.html
Service Name and Transport Protocol Port Number Registry. (2012, December). Retrieved from
IANA: http://www.iana.org/assignments/service-names-port-numbers/service-names-
port-numbers.xml
Well known TCP and UDP ports used by Apple software products. (2012, December). Retrieved
from Apple Support: http://support.apple.com/kb/ts1629
Wireshark. (2012, December). Retrieved from http://www.wireshark.org/
9. APPENDIX
Figure 13 List of features used in SVM classifiers for occupancy.
Time of Day Foreign 143 Foreign 2195 Foreign 5222 Foreign 8080
Foreign 21 Foreign 194 Foreign 2196 Foreign 5900 Web Accesses
Foreign 22 Foreign 220 Foreign 3031 Foreign 5988 Max Peers
Foreign 23 Foreign 443 Foreign 3283 Foreign 6665 Max TCP Out Count
Foreign 24 Foreign 993 Foreign 3389 Foreign 6666 Min TCP Out Count
Foreign 80 Foreign 994 Foreign 3724 Foreign 6667 IQR TCP Out Count
Foreign 110 Foreign 995 Foreign 3784 Foreign 6668 Sum TCP Out Count
Foreign 113 Foreign 1723 Foreign 5190 Foreign 6669 Min TCP Out Max Size
Router JSON:
    Type: 'Router'
    Event: 'Periodic'
    Time
    Peers (anonymized)
    PeerCount
    ActivePeerCount
    TcpStats
        Counts: Total, Incoming, Outgoing
        MaxSizes: Incoming, Outgoing
    UdpStats
        Counts: Total, Incoming, Outgoing
        MaxSizes: Incoming, Outgoing

Peer JSON:
    Type: 'Peer'
    Event: 'Periodic'
    Time
    Local IP (Anonymized)
    TcpStats
        Counts: Total, Incoming, Outgoing
        MaxSizes: Incoming, Outgoing
    UdpStats
        Counts: Incoming, Outgoing
        MaxSizes: Incoming, Outgoing

Stream JSON:
    Type: 'Stream'
    Event: 'Open'/'Close'/'Periodic'
    Time
    UserAgent
    Local IP (Anonymized)
    Foreign IP
    LocalPorts
    ForeignPorts
    Counts: Total, Incoming, Outgoing
    MaxSizes: Incoming, Outgoing
Figure 14 JSON formats for sensor communication to the server. Values with single quotes are static values that are
used to identify the JSON type. Event JSONs carry the same information as stream JSONs, but have either “Open”
or “Close” in the event slot.
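For illustration, an assumed concrete instance of the Router JSON (field names follow Figure 14; the exact key spellings, value types, and sample values are guesses, not the sensor's actual wire format):

```json
{
  "Type": "Router",
  "Event": "Periodic",
  "Time": "2013-04-08T11:35:00",
  "Peers": ["peer-01", "peer-02"],
  "PeerCount": 2,
  "ActivePeerCount": 1,
  "TcpStats": {
    "Counts":   {"Total": 120, "Incoming": 70, "Outgoing": 50},
    "MaxSizes": {"Incoming": 1500, "Outgoing": 1460}
  },
  "UdpStats": {
    "Counts":   {"Total": 14, "Incoming": 8, "Outgoing": 6},
    "MaxSizes": {"Incoming": 512, "Outgoing": 340}
  }
}
```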