
December 12, 2007

Ph. D. student: Inja Youn, George Mason University

Geolocation Tool – A Transient Geographic Mapping System for Intrusion Detection

Background

As computers and the Internet have entered nearly every aspect of daily life, attacks on networks, servers, and workstations have become a serious threat to businesses, the military, and the average person. The volume of Internet traffic data increased exponentially with the advent of broadband. A 2003 survey, conducted with the agreement of the George Mason University CIO, showed 1.2 to 3.0 gigabytes per hour of header-only Internet traffic data at the university level. As George Mason University traffic keeps increasing over the years, it becomes clear that traditional tools for data exploration, visualization, and anomaly detection, which are based on fixed-size (albeit very large) data sets, are becoming less effective, or even infeasible, when dealing with massive data streams. Static visualization creates a significant chance that an attack will be discovered only “after the fact”, when the consequences are much harder to mitigate than with real-time intrusion detection.

Methodology

The proposed Geo-Location tool is designed to identify the geographical location of IP traffic. The current version of the tool displays the traffic data, per packet or per session, as a colored point on a map of the Earth. For now, different colors correspond to different ports, allowing the user to concentrate on the traffic of her choice, such as SSH (port 22) or Telnet (port 23). The tool uses evolutionary data graphics to accommodate massive data streams. The points fade linearly with the passing of time, so old information is downplayed and eventually discarded as new traffic arrives. This is in contrast with traditional dynamic graphic display and exploration, where the animation runs over a fixed data set. The Geo-Location tool is a Windows application written in C++. A cross-platform version written with wxDev-C++ and wxWidgets is underway. The complex visualization tasks are handled using a modern video card and OpenGL.

The map drawing uses data provided by Dr. Wong of the Department of Earth Systems and GeoInformation Sciences, George Mason University. Each country is represented as a list of polygons, as can be seen in Figure 1. Overlapping polygons represent enclaves. For example, the Republic of San Marino is an enclave of Italy.


Figure 1. World map data plot, as provided by the Department of Earth Systems

The map is triangulated using the “Triangle” tool of Dr. Shewchuk of Carnegie Mellon University. Because the size of the data files matters more than avoiding thin, sharp-angled triangles, the constrained Delaunay triangulation is a good fit: it does not add intermediate points to the triangulation, so the resulting triangulation files are relatively small. The result of the triangulation can be seen in Figure 2.

The triangulation is used both for point location (given a point, find the triangle, and hence the country, in which it lies) and for drawing with OpenGL. OpenGL can only draw convex polygons; since most country contours are not convex, a tessellation is needed to draw the map efficiently.


Figure 2. World map data plot, after performing Delaunay triangulation

To efficiently find the triangle in which a given point is located (used to highlight the country under the cursor), the locations of the grid points with coordinates (-180°, -90°), (-180°, -89°) … (0°, 0°), (0°, 1°), (0°, 2°) … (1°, 0°) … (180°, 90°) are pre-calculated using a simple algorithm. The first step determines whether two points P1 and P2 are on the same side of the line AB, using the sign of the two-dimensional cross product AB × AP (equivalently, the inner product of AP with a normal to AB):

    Function SameSide(P1, P2, A, B)
        if AB × AP1 and AB × AP2 have the same sign then return true
        else return false

The second step tests whether a point P is inside the triangle ABC, that is, whether P lies on the same side of each edge as the opposite vertex:

    Function Inside(P, A, B, C)
        if SameSide(P, A, B, C) and SameSide(P, B, C, A) and SameSide(P, C, A, B) then return true
        else return false
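The two-step test can be sketched in C++ as follows. This is a minimal illustration rather than the tool's actual code: the `Vec2` type and the `Cross` helper are names introduced here, and the side test uses the sign of the two-dimensional cross product, with points lying exactly on an edge counted as inside.

```cpp
// A point in map coordinates; the name Vec2 is illustrative.
struct Vec2 {
    double x;
    double y;
};

// z-component of the 2D cross product (b - a) x (p - a).
// Positive when p lies to the left of the directed line a -> b,
// negative when it lies to the right, zero when p is on the line.
double Cross(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True when p1 and p2 are on the same side of the line through a and b
// (a point exactly on the line is treated as being on either side).
bool SameSide(const Vec2& p1, const Vec2& p2, const Vec2& a, const Vec2& b) {
    return Cross(a, b, p1) * Cross(a, b, p2) >= 0.0;
}

// True when p is inside, or on the boundary of, triangle abc:
// p must be on the same side of each edge as the opposite vertex.
bool Inside(const Vec2& p, const Vec2& a, const Vec2& b, const Vec2& c) {
    return SameSide(p, a, b, c) && SameSide(p, b, c, a) && SameSide(p, c, a, b);
}
```

Because each edge is compared against the opposite vertex, the test does not depend on whether the triangle's vertices are listed clockwise or counter-clockwise.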


Finally, each point of the pre-calculated grid is tested against each triangle of the constrained Delaunay triangulation in a sequential manner.

Using this pre-calculated grid, the point location algorithm starts from the closest pre-calculated point and goes from triangle to triangle, until the destination point is located, as illustrated in Figure 3.
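A triangle-to-triangle walk of this kind might look like the sketch below. It assumes details the text does not specify: triangles store their vertices counter-clockwise together with the index of the neighbour across each edge, the starting triangle has already been read from the pre-calculated grid, and a step budget guards against endless walks. The names `Pt`, `Tri`, `Orient`, and `Locate` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Pt {
    double x;
    double y;
};

struct Tri {
    std::array<int, 3> v;    // vertex indices, counter-clockwise
    std::array<int, 3> nbr;  // nbr[i]: triangle across edge (v[i], v[i+1]), or -1
};

// Positive when p is to the left of the directed line a -> b.
double Orient(const Pt& a, const Pt& b, const Pt& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Walks from triangle `start` towards q, crossing any edge that has q
// strictly on its outer side. Returns the index of the triangle that
// contains q, or -1 if the walk leaves the mesh or exceeds its budget.
int Locate(const std::vector<Pt>& pts, const std::vector<Tri>& tris,
           int start, const Pt& q) {
    int t = start;
    for (std::size_t step = 0; step <= tris.size(); ++step) {
        const Tri& tri = tris[t];
        bool moved = false;
        for (int i = 0; i < 3; ++i) {
            const Pt& a = pts[tri.v[i]];
            const Pt& b = pts[tri.v[(i + 1) % 3]];
            if (Orient(a, b, q) < 0.0) {        // q lies beyond edge i
                if (tri.nbr[i] < 0) return -1;  // no neighbour: outside the mesh
                t = tri.nbr[i];
                moved = true;
                break;
            }
        }
        if (!moved) return t;  // q is inside or on the boundary of triangle t
    }
    return -1;  // safety budget exhausted
}
```

Starting from a triangle that is already close to the query point is what keeps the number of visited triangles small.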

Figure 3. The point location algorithm

The analysis in [5] gives an O(n^(1/3)) average time for a walk that starts from a randomly located index point on the map. The grid indexing greatly reduces this time: the algorithm rarely needs to process more than 10 triangles to locate the triangle to which a given point belongs.

The Mercator projection has the advantage of being conformal (it preserves angles), which makes it very good for zooming and panning. However, it introduces considerable area distortion: Greenland appears almost the size of Africa, even though in reality Greenland is roughly fourteen times smaller than Africa. Also, the map cannot represent latitudes near the poles, where the projection goes to infinity. This is not an issue for the IP Geo-Location tool.

The Mercator projection formula is the following:

\[
x = \lambda, \qquad y = \ln\left(\tan\left(\frac{\pi}{4} + \frac{\varphi}{2}\right)\right)
\]

The Inverse Mercator formula is:

\[
\lambda = x, \qquad \varphi = 2\tan^{-1}\left(e^{y}\right) - \frac{\pi}{2}
\]
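As a quick check that the forward and inverse Mercator formulas agree, the pair can be written as two small C++ functions. This is only a sketch: angles are assumed to be in radians, and the type and function names are illustrative, not taken from the tool.

```cpp
#include <cassert>
#include <cmath>

const double kPi = 3.14159265358979323846;

struct XY {
    double x;
    double y;
};

struct LonLat {
    double lon;  // longitude (lambda), radians
    double lat;  // latitude (phi), radians
};

// Forward Mercator: x = lambda, y = ln(tan(pi/4 + phi/2)).
XY MercatorForward(const LonLat& p) {
    assert(std::fabs(p.lat) < kPi / 2);  // the poles map to infinity
    return {p.lon, std::log(std::tan(kPi / 4 + p.lat / 2))};
}

// Inverse Mercator: lambda = x, phi = 2 * atan(e^y) - pi/2.
LonLat MercatorInverse(const XY& p) {
    return {p.x, 2 * std::atan(std::exp(p.y)) - kPi / 2};
}
```

Applying `MercatorInverse` to the result of `MercatorForward` should return the original longitude and latitude up to floating-point rounding.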

A snapshot of Mercator projection displayed by the Geo-Location tool is available in Figure 4.


Figure 4. Mercator Projection snapshot

The Plate Carrée (equirectangular) projection is one of the oldest known projections. It is not conformal and it introduces distortion, but it is numerically the simplest. The Plate Carrée formula is:

\[
x = c_1 \lambda, \qquad y = c_2 \varphi
\]

A Plate Carrée snapshot, as displayed by the Geo-Location tool, can be viewed in Figure 5.


Figure 5. Plate Carrée Projection snapshot

The Winkel Tripel projection is a compromise projection with very low overall distortion for whole-Earth maps. The Winkel Tripel formula is:

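\[
x = \frac{1}{2}\left(\lambda\cos\varphi_1 + \frac{2\cos\varphi\,\sin(\lambda/2)}{\operatorname{sinc}\alpha}\right), \qquad
y = \frac{1}{2}\left(\varphi + \frac{\sin\varphi}{\operatorname{sinc}\alpha}\right)
\]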

Where λ is the longitude, φ is the latitude, and

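\[
\alpha = \cos^{-1}\left(\cos\varphi\,\cos(\lambda/2)\right), \qquad
\varphi_1 = \cos^{-1}\left(\frac{2}{\pi}\right), \qquad
\operatorname{sinc}\alpha = \frac{\sin\alpha}{\alpha}
\]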

The direct Winkel Tripel formula is complicated, and its inverse has no closed form; it can only be computed numerically. The Geo-Location tool uses a numerical algorithm based on the Newton-Raphson method, provided in [4]. A snapshot of the Winkel Tripel projection is provided in Figure 6.
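To illustrate how a numerical inverse can work, the sketch below implements the forward Winkel Tripel formula and inverts it with a generic two-variable Newton-Raphson iteration. It is not the program from the cited reference: the Jacobian is approximated by finite differences, the initial guess simply reuses (x, y) as (longitude, latitude), and the step size, tolerance, and all names are assumptions made for this example. Angles are in radians.

```cpp
#include <cmath>

const double kPi = 3.14159265358979323846;
const double kCosPhi1 = 2.0 / kPi;  // cos(phi_1), where phi_1 = acos(2/pi)

struct XY {
    double x;
    double y;
};

struct LonLat {
    double lon;  // longitude (lambda), radians
    double lat;  // latitude (phi), radians
};

// Forward Winkel Tripel projection.
XY WinkelForward(const LonLat& p) {
    const double alpha = std::acos(std::cos(p.lat) * std::cos(p.lon / 2));
    // sinc(alpha) = sin(alpha) / alpha, with the limit value 1 at alpha = 0.
    const double sinc = (alpha < 1e-12) ? 1.0 : std::sin(alpha) / alpha;
    return {0.5 * (p.lon * kCosPhi1 + 2 * std::cos(p.lat) * std::sin(p.lon / 2) / sinc),
            0.5 * (p.lat + std::sin(p.lat) / sinc)};
}

// Inverse by Newton-Raphson on F(lon, lat) - target = 0.
LonLat WinkelInverse(const XY& target, int max_iter = 20, double tol = 1e-12) {
    LonLat g{target.x, target.y};  // rough initial guess
    const double h = 1e-6;         // finite-difference step
    for (int i = 0; i < max_iter; ++i) {
        const XY f = WinkelForward(g);
        const double rx = f.x - target.x;
        const double ry = f.y - target.y;
        if (std::fabs(rx) < tol && std::fabs(ry) < tol) break;
        // Jacobian [[a, b], [c, d]] from forward differences.
        const XY fl = WinkelForward({g.lon + h, g.lat});
        const XY fp = WinkelForward({g.lon, g.lat + h});
        const double a = (fl.x - f.x) / h, b = (fp.x - f.x) / h;
        const double c = (fl.y - f.y) / h, d = (fp.y - f.y) / h;
        const double det = a * d - b * c;
        if (std::fabs(det) < 1e-15) break;  // singular Jacobian: give up
        // Newton step: subtract J^{-1} * residual.
        g.lon -= (d * rx - b * ry) / det;
        g.lat -= (-c * rx + a * ry) / det;
    }
    return g;
}
```

An analytic Jacobian would avoid the two extra forward evaluations per iteration; finite differences are used here only to keep the example short.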


Figure 6. Winkel Tripel projection snapshot

The Geo-Location tool also has zooming capabilities, as seen in Figure 7.

The GeoBytes company generously provided the Geo-Location database free of charge. It has been adapted so that it is loaded entirely into memory and heavily indexed, which makes the tool scalable. Currently, the Geo-Location tool handles IPv4 addresses only.

Below is an example of the output provided by the module built on the Geo-Location database:

    Ip Address ... 129.174.87.0
    City ......... Fairfax
    Region ....... Virginia
    Country ...... United States
    Latitude ..... 38.803600
    Longitude .... -77.321200
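One plausible way to index such a database in memory is a vector of address ranges sorted by starting address and searched with a binary search. The sketch below is an assumption about the layout, not a description of the GeoBytes files or of the tool's module: the `GeoRecord` fields, the `Ip` helper, and the range-based organisation are all hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory record: one contiguous IPv4 range per entry.
struct GeoRecord {
    std::uint32_t first_ip;  // first address of the range, host byte order
    std::uint32_t last_ip;   // last address of the range, inclusive
    std::string city;
    double latitude;
    double longitude;
};

// Packs a dotted quad a.b.c.d into a 32-bit integer.
constexpr std::uint32_t Ip(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (a << 24) | (b << 16) | (c << 8) | d;
}

// Looks up `ip` in records sorted by first_ip with non-overlapping ranges.
// Returns nullptr when no range contains the address. O(log n) per query.
const GeoRecord* Lookup(const std::vector<GeoRecord>& records, std::uint32_t ip) {
    // First record whose range starts after ip; the candidate is the one before it.
    auto it = std::upper_bound(
        records.begin(), records.end(), ip,
        [](std::uint32_t value, const GeoRecord& r) { return value < r.first_ip; });
    if (it == records.begin()) return nullptr;
    --it;
    return (ip <= it->last_ip) ? &*it : nullptr;
}
```

A sorted vector keeps the whole index in one contiguous block of memory, which tends to suit a read-only table that is loaded once at start-up.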


Figure 7. Geo-Location tool zooming feature

Packet capture is handled with the help of the libpcap library (on Linux/Unix/Mac) and the WinPcap library (on Windows). WinPcap includes a driver that extends the operating system to provide low-level network access. Both libraries can either read packets offline from a .pcap file (as currently implemented in the tool) or capture (sniff) packets in real time (implementation in progress).

As illustrated in Figure 8, a pcap packet consists of a pcap header, an IP header, a protocol header (TCP/UDP/ICMP/…), and the packet content.


Figure 8. PCAP packet structure

The pcap header is detailed in Figure 9. Only the timestamp (marked in red) is of interest for the Geo-Location tool.

Figure 9. PCAP header structure. The extracted information is marked in red.

The IP header structure is detailed in Figure 10. The Geo-Location tool extracts the protocol used and the source and destination IP addresses (marked in red). The version and header length (IHL) fields are used only to filter IPv4 packets and to position the pointer at the start of the protocol header.


Figure 10. IP header structure. The extracted information is marked in red.

Finally, the tool extracts the source and destination ports from TCP and UDP packets (marked in red in Figure 11 and Figure 12, respectively).

Figure 11. TCP header structure. The extracted information is marked in red


Figure 12. UDP packet structure. The extracted information is marked in red.

A summary of the extracted information can be seen in Figure 13.
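The header fields described above can be read from a raw byte buffer as in the following sketch. It follows the standard IPv4, TCP, and UDP header layouts (protocol at byte 9, addresses at bytes 12 and 16, ports in the first four bytes of the transport header) and assumes that `data` already points at the first byte of the IP header, with any link-layer header skipped by the caller. The `PacketInfo` structure and the function names are illustrative, not the tool's own.

```cpp
#include <cstddef>
#include <cstdint>

// Fields the text lists as extracted from each packet.
struct PacketInfo {
    std::uint8_t protocol;    // 1 = ICMP, 6 = TCP, 17 = UDP
    std::uint32_t src_ip;     // host byte order
    std::uint32_t dst_ip;     // host byte order
    std::uint16_t src_port;   // 0 when the protocol has no ports
    std::uint16_t dst_port;   // 0 when the protocol has no ports
};

// Reads big-endian (network order) integers from a byte buffer.
std::uint16_t ReadBE16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

std::uint32_t ReadBE32(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Parses an IPv4 header and, for TCP/UDP, the two port fields.
// Returns false for non-IPv4 or truncated packets.
bool ParseIPv4(const std::uint8_t* data, std::size_t len, PacketInfo* out) {
    if (len < 20) return false;                      // minimum IPv4 header size
    const unsigned version = data[0] >> 4;
    const std::size_t ihl = (data[0] & 0x0F) * 4u;   // header length in bytes
    if (version != 4 || ihl < 20 || len < ihl) return false;
    out->protocol = data[9];
    out->src_ip = ReadBE32(data + 12);
    out->dst_ip = ReadBE32(data + 16);
    out->src_port = 0;
    out->dst_port = 0;
    // TCP and UDP both start with source port, then destination port.
    if ((out->protocol == 6 || out->protocol == 17) && len >= ihl + 4) {
        out->src_port = ReadBE16(data + ihl);
        out->dst_port = ReadBE16(data + ihl + 2);
    }
    return true;
}
```

The IHL field matters because IP options can lengthen the header beyond 20 bytes, which shifts the position of the port fields.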

Figure 13. Summary of extracted information from IP packets

The Geo-Location tool uses an STL red-black tree (std::map) to keep the packet information up to date, indexed by IP address. This gives O(log n) insertion, update, and deletion, as well as linear-time iteration through the IP addresses.
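A minimal sketch of such an index is shown below. The per-address fields in `HostState` and the two helper functions are assumptions made for illustration; the text only states that a `std::map` keyed by IP address is used.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical per-address state kept by the display.
struct HostState {
    double last_seen;         // timestamp of the most recent packet, seconds
    std::uint64_t packets;    // number of packets seen from this address
    std::uint16_t last_port;  // port of the most recent packet
};

// Ordered map keyed by the IPv4 address (host byte order).
using HostTable = std::map<std::uint32_t, HostState>;

// Inserts or updates the entry for `ip` in O(log n).
void RecordPacket(HostTable& table, std::uint32_t ip, double timestamp,
                  std::uint16_t port) {
    HostState& s = table[ip];  // value-initialised (all zero) when new
    s.last_seen = timestamp;
    s.last_port = port;
    ++s.packets;
}

// Removes entries whose last packet is older than max_age seconds.
// One linear pass over the table; returns the number of entries removed.
std::size_t Expire(HostTable& table, double now, double max_age) {
    std::size_t removed = 0;
    for (auto it = table.begin(); it != table.end();) {
        if (now - it->second.last_seen > max_age) {
            it = table.erase(it);  // erase returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Iterating a `std::map` visits the keys in ascending order, so a drawing pass over the table sees the addresses sorted numerically.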


Figure 14. Internal Geo-Location structure for managing the IP address information

The final result of the Geo-Location tool is presented in Figure 15 and Figure 16.

Figure 15. Snapshot of incoming GMU traffic from Europe


Figure 16. Snapshot of the incoming GMU traffic from the whole world (Mercator projection)

For each packet, the Geo-Location tool displays a point at the latitude and longitude found by querying the Geo-Location database. The point has a default fading time of 1 second. Each point fades gradually, making use of the OpenGL transparency and blending capabilities.
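The linear fade can be expressed as an opacity that drops from 1 to 0 over the fading time. The function below is a sketch with a hypothetical name and signature; in an OpenGL renderer the returned value would typically be used as the alpha component of the point's color, with blending enabled through `glEnable(GL_BLEND)` and `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)`.

```cpp
#include <algorithm>
#include <cassert>

// Opacity of a point that arrived at time `arrival`, evaluated at time `now`
// (both in seconds). Falls linearly from 1 at arrival to 0 once fade_seconds
// have elapsed, and stays at 0 afterwards.
double FadeAlpha(double now, double arrival, double fade_seconds = 1.0) {
    assert(fade_seconds > 0.0);
    const double age = now - arrival;
    return std::max(0.0, std::min(1.0, 1.0 - age / fade_seconds));
}
```

A point whose opacity has reached zero no longer contributes to the picture and can be dropped from the display list.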

Research Directions

An important future feature of the Geo-Location tool is the detection and geolocation of IP addresses behind middleboxes, such as proxies and network address translators (NATs). The Geolocation Tool will implement different techniques for detecting and locating such hidden IP addresses. For Web traffic, users behind proxies and NATs can be detected using JavaScript, cookies, or Java applets. By transmitting the computer’s local IP address back to the server and by using cookie IDs, these methods will be able to identify the computers behind the middleboxes. An effective method of geolocating IP addresses in difficult cases (such as VPNs) would be to embed JavaScript within PDF documents. When the document is opened, the embedded JavaScript code is able to “phone” a distributed server network. By measuring the “network distance” and correlating it with existing measurements, it will be possible to make a good assessment of the location of the remote computer. Methods for identifying and geolocating the “hidden” users are being actively explored.


The filters used to detect attacks are a subject of future research, with several interesting directions to explore. One possibility is to filter only the low-frequency packets, as an attacker may send probing packets infrequently to avoid attracting attention. Another possibility is to detect attacks using a Markov model. The tool will allow pluggable filters.

A related but slightly different problem concerns streams of documents and images. As text and imagery are continuously streamed in through web crawling or email sniffing, it is important to geographically classify and cluster the documents, so that they can later be queried for specific keywords and locations. In text mining, the tool will geographically locate the documents using their IP addresses as well as internal addresses found by parsing the documents. First, it will attach metadata to the documents and images, connecting them in a labeled sparse graph in real time. Second, the proposed tool will cluster the streaming documents using an adaptive Gaussian mixture and the Expectation Maximization algorithm on the keyword terms. A similarity metric for image and video data (similar to what Google's YouTube is attempting for identifying copyrighted videos) will be a subject of further research, to be used for relating images and videos to one another. Finally, the text mining part will provide a query interface similar to the Google search engine. The user will be able to get answers to queries like “Show me the geographical places related to this image” or “Show me documents related to the keywords ‘Fairfax’ and ‘Fall festival’”.

Validation

The performance of the Geolocation Tool will be tested on GMU traffic data. The tool should scale successfully to accommodate traffic at this level. The GMU Center for Secure Information Systems generously offered to provide the necessary hardware resources for this research project. The results will be cross-validated between different methods. The text and image mining can be validated against existing indexing engines, either open source or commercial products such as Google.

References

[1] Dodge M. and Kitchin R. (2001), Atlas of Cyberspace, Harlow, England: Addison-Wesley.

[2] Geobytes IP Locator Files. http://www.geobytes.com/

[3] Casado M. and Freedman M. J. (2007), “Peering through the Shroud: The Effect of Edge Opacity on IP-based Client Identification”, Proc. 4th USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI '07), Cambridge, MA.

[4] Ipbuker C. and Bildirici I. O. (2005), “Computer Program for the Inverse Transformation of the Winkel Projection”, Journal of Surveying Engineering, Volume 131, Issue 4, pp. 125-129, November 2005.

[5] Mücke E. P., Saias I., and Zhu B. (1996), “Fast Randomized Point Location Without Preprocessing in Two- and Three-dimensional Delaunay Triangulations”, Proceedings of the Twelfth Annual Symposium on Computational Geometry, Association for Computing Machinery.

[6] Priebe C. E. (1994), “Adaptive Mixtures”, Journal of the American Statistical Association, 89, 796-806.

[7] Shewchuk J. R. (2002), “Delaunay Refinement Algorithms for Triangular Mesh Generation”, Computational Geometry: Theory and Applications 22(1-3):21-74.

[8] “Triangle, A Two-Dimensional Quality Mesh Generator and Delaunay Triangulator”. http://www.cs.cmu.edu/~quake/triangle.html

[9] “WinPcap: The Windows Packet Capture Library”. http://www.winpcap.org/