web mining shah mohammad nur alam sawn 03/03/2014
![Page 1: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/1.jpg)
Web Mining
Shah Mohammad Nur Alam Sawn
03/03/2014
![Page 2: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/2.jpg)
What is Web Mining?
Discovering desired and useful information from the World Wide Web
![Page 3: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/3.jpg)
Exploiting Geographical Location Information of Web Pages
Orkut Buyukkokten, Junghoo Cho, Hector Garcia-Molina, Luis Gravano
Department of Computer Science, Stanford University, Stanford, CA 94305.
Department of Computer Science, Columbia University, New York, NY 10027.
(December 27,2008)
![Page 4: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/4.jpg)
“Proof of Concept” using mapping databases
Ways of exploiting geographical information from the Internet:
Improve search engines, for example by not showing results that are geographically irrelevant to the query.
Identify the “globality” of resources: for example, by using hyperlinks and location information about web sites, it can be estimated how global a web entity is.
![Page 5: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/5.jpg)
Problems in exploiting geographical location information of entities
How to compute geographical information?
How to exploit this information?
![Page 6: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/6.jpg)
Computing geographical information
Information Extraction: automatically analyze web pages to extract geographic entities such as area codes or zip codes.
Network IP Address Analysis: focus on the location of the sites that host the web pages.
![Page 7: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/7.jpg)
Exploiting the Information using databases
Site Mapper (http://www.internic.net/)
This database has the phone numbers of the network administrators of all Class A and B domains. From it, the authors extracted the area code of each domain administrator and built a Site-Mapper table with area code information for IP addresses belonging to Class A and Class B addresses.
![Page 8: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/8.jpg)
Area Mapper (http://www.zipinfo.com/)
It maps cities and townships to a given area code. In some cases, an entire state (e.g., Montana) corresponds to one area code. In other cases, a big city has multiple area codes (e.g., Los Angeles). Scripts were written to convert this data into a table that maintains, for each area code, the corresponding set of cities/counties.
![Page 9: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/9.jpg)
Zip-Code Mapper (http://www.zipinfo.com/)
This maps each zip code to a range of longitudes and latitudes.
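The three mapping tables can be pictured as simple lookups. The following is a minimal sketch, not the authors' implementation: the table contents, the function names `network_prefix`, `locate_ip`, and `zip_center`, and the bounding-box numbers are all illustrative assumptions. It only shows how an IP address could be reduced to its Class A or Class B network, looked up in a Site-Mapper-style table, and then joined with an Area-Mapper-style table.

```python
from typing import List, Optional, Tuple

# Illustrative tables only; real entries would come from the databases above.
SITE_MAPPER = {"171.64": "650"}                      # network prefix -> area code
AREA_MAPPER = {"650": ["Palo Alto", "Stanford"]}     # area code -> cities
ZIP_MAPPER = {"94305": ((37.41, 37.44), (-122.19, -122.15))}  # zip -> (lat range, lon range)


def network_prefix(ip: str) -> Optional[str]:
    """Return the classful network part of an IPv4 address (Class A or B only)."""
    octets = [int(part) for part in ip.split(".")]
    if octets[0] < 128:          # Class A: network is the first octet
        return str(octets[0])
    if octets[0] < 192:          # Class B: network is the first two octets
        return f"{octets[0]}.{octets[1]}"
    return None                  # other classes are not covered by the table


def locate_ip(ip: str) -> Optional[Tuple[str, List[str]]]:
    """Map an IP address to (area code, candidate cities), or None if unknown."""
    prefix = network_prefix(ip)
    area_code = SITE_MAPPER.get(prefix) if prefix else None
    if area_code is None:
        return None
    return area_code, AREA_MAPPER.get(area_code, [])


def zip_center(zip_code: str) -> Optional[Tuple[float, float]]:
    """Return the midpoint (lat, lon) of the range stored for a zip code."""
    box = ZIP_MAPPER.get(zip_code)
    if box is None:
        return None
    (lat_lo, lat_hi), (lon_lo, lon_hi) = box
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
```

For example, `locate_ip("171.64.75.1")` looks up the Class B network `171.64` and returns the area code together with the cities listed for it in the sample table.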
![Page 10: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/10.jpg)
[Figure: Graphical Interface of Proof of Concept Prototype. Labels shown: Input, Output of search, URL, IP, Area Code, City, Zip code, Map, States, Cities, Refresh, Zoom.]
![Page 11: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/11.jpg)
Geospatial Data Mining on the Web: Discovering Locations of Emergency Service Facilities (2012)
Wenwen Li, Michael F. Goodchild, Richard L. Church, and Bin Zhou
GeoDa Center for Geospatial Analysis and Computation, School of Geographical Sciences and Urban Planning, Arizona State University, Tempe, AZ 85287
Department of Geography, University of California, Santa Barbara, Santa Barbara, CA 93106 {good,church}@geog.ucsb.edu
Institute of Oceanographic Instrumentation, Shandong Academy of Sciences, Qingdao, Shandong, China 266001
![Page 12: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/12.jpg)
Google search image of fire station
Actual Location
Google result
![Page 13: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/13.jpg)
Process of Web Crawler
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider or an automatic indexer.
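The crawling process described above can be sketched as a breadth-first traversal of links. This is a generic illustration, not the crawler used in the paper: the `crawl` function, the `LinkParser` class, and the in-memory `FAKE_WEB` dictionary (which stands in for real HTTP fetching) are assumptions made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seed URLs; fetch(url) returns HTML or None."""
    queue = deque(seeds)
    seen = set(seeds)          # URLs already queued, to avoid revisiting
    visited = []               # URLs successfully fetched, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # unreachable page: skip it
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "web" standing in for real pages (hypothetical content).
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a> <a href="/missing.html">X</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}

pages = crawl(["http://example.org/"], FAKE_WEB.get)
```

The `seen` set is what keeps the crawler from looping when pages link back to each other; a real crawler would also respect robots.txt and rate limits.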
![Page 14: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/14.jpg)
Defining New Class Address Structure
![Page 15: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/15.jpg)
Form of street address for Identifying target webpages
![Page 16: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/16.jpg)
d1: Distance between p and the location of the foremost digit in the number block closest to (before) location p.
d2: Distance between p and the location of the last digit of the first number that appears (for detecting 5-digit ZIP code), or the last digit of the second number after p if the token distance of the first and second number block equals
r1: regular expression [1-9][0-9]*[\\s\\r\\n\\t]*([a-zA-Z0-9\\.]+[\\s\\r\\n\\t])+
r2: regular expression "cityPattern"[\\s\\r\\n\\t,]?+("statePattern")?+[\\s\\r\\n\\t,]*\\d{5}(-\\d{4})*
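As a rough illustration of how r1 and r2 behave, the sketch below transcribes them into Python's `re` syntax. Several things here are assumptions rather than part of the source: the doubled backslashes in the slides are read as string-literal escaping and collapsed to single ones; the possessive quantifiers (`?+`) are replaced by ordinary ones for portability; the separator after the city is relaxed from "at most one" to "any number" of whitespace or comma characters so that ", " is accepted; and `CITY_PATTERN` and `STATE_PATTERN` are hypothetical stand-ins for the city and state patterns, which the slides do not spell out.

```python
import re

# r1: a street number followed by one or more whitespace-terminated tokens.
R1 = re.compile(r"[1-9][0-9]*[\s\r\n\t]*([a-zA-Z0-9\.]+[\s\r\n\t])+")

# Hypothetical stand-ins for the city and state patterns.
CITY_PATTERN = r"(?:Santa Barbara|Goleta)"
STATE_PATTERN = r"(?:CA|California)"

# r2: city, optional state, then a 5-digit ZIP code with optional +4 suffix.
# The separator after the city is relaxed to '*' (an assumption, see above).
R2 = re.compile(
    CITY_PATTERN
    + r"[\s\r\n\t,]*"
    + "(?:" + STATE_PATTERN + ")?"
    + r"[\s\r\n\t,]*\d{5}(-\d{4})*"
)

text = "121 W Carrillo St\nSanta Barbara, CA 93101"

street = R1.search(text)   # starts at the street number "121"
tail = R2.search(text)     # matches the city/state/ZIP part

street_text = street.group(0) if street else None
tail_text = tail.group(0) if tail else None
```

Because r1 is greedy, it keeps consuming tokens until it reaches one that is not followed by whitespace, so on its own it can run past the street name; combining it with r2 and the distances d1 and d2 is what narrows the match to an address.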
Cont.
![Page 17: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/17.jpg)
Decision rules of desired addresses by training data based on semantic information
Station + Num
Key word Station and Title web page as fire
Station on web page title
![Page 18: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/18.jpg)
Architecture of Proposed Cyber Miner
Here the input is a set of seed web URLs and the output is the target addresses.
![Page 19: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/19.jpg)
Search Results of Cyber Miner
Locations of all fire stations obtained by Cyber Miner from the address database
![Page 20: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/20.jpg)
Web-based geographic search engine for location aware search in Singapore
Flora S. Tsai, School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore (2010)
![Page 21: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/21.jpg)
Geo search
This search engine finds location-specific information on Singapore-based Web sites. Users can view their search locations on a satellite map instead of the two-dimensional maps currently used in street directories. It can search for locations based on area names, building names, groups of landmark types, business names, and business categories. Users can also supply their current coordinates as a parameter so that results are returned in order of distance from their current location.
![Page 22: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/22.jpg)
Google earth
The search engine uses Google Earth for its searches.
![Page 23: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/23.jpg)
Keyhole Markup Language
Keyhole Markup Language (KML) is a file format used to display geographic data in an earth browser such as Google Earth, Google Maps and Google Maps for mobile.
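To make the format concrete, here is a small sketch that builds a one-placemark KML document with Python's standard library. The `placemark_kml` helper and the sample coordinates are assumptions for illustration (the coordinates are only approximate); the element names and the longitude,latitude,altitude ordering of `<coordinates>` follow the KML 2.2 format.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def placemark_kml(name: str, lat: float, lon: float) -> str:
    """Return a KML document containing a single point placemark."""
    ET.register_namespace("", KML_NS)            # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # KML orders coordinates as longitude,latitude,altitude.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


# Approximate, illustrative coordinates for a location in Singapore.
document = placemark_kml("Causeway Point", 1.436, 103.786)
```

Saving `document` to a `.kml` file should let an earth browser display the point as a named pin.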
![Page 24: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/24.jpg)
Street directory
http://www.streetdirectory.com/
Useful for mobile phones only; it is also a web map service that integrates with Google Earth.
![Page 25: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/25.jpg)
Global Positioning System
Google Earth allows tracks and waypoints to be downloaded from GPS devices and creates KML files for the downloaded waypoints and tracks.
![Page 26: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/26.jpg)
Design
![Page 27: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/27.jpg)
Design Cont.
• BusinessAreaAddress, where the address is stored without the postal code;
• BusinessAreaPostal, where the postal code is stored;
• Area, where the keywords of the area are stored, e.g. Causeway Point;
• General Area, where the General Area of the location is stored, e.g. Yishun.
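One way to picture these fields is as columns of a single table. The sketch below uses an in-memory SQLite database; the table name `business`, the extra `BusinessName` and `Category` columns, the helper function names, and the sample rows are hypothetical, and only the four field names come from the slide. The slides do not say which database system the search engine actually uses.

```python
import sqlite3


def make_database() -> sqlite3.Connection:
    """Create an in-memory table holding the fields listed above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE business (
            BusinessName        TEXT,  -- hypothetical column
            Category            TEXT,  -- hypothetical column
            BusinessAreaAddress TEXT,  -- address without the postal code
            BusinessAreaPostal  TEXT,  -- postal code only
            Area                TEXT,  -- area keywords, e.g. a building name
            GeneralArea         TEXT   -- broader area the location belongs to
        )
        """
    )
    # Placeholder rows for illustration only.
    conn.executemany(
        "INSERT INTO business VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("Shop A", "Apparel", "(street address)", "000001", "Causeway Point", "Woodlands"),
            ("Shop B", "Fast Food", "(street address)", "000002", "Northpoint", "Yishun"),
        ],
    )
    return conn


def search_by_general_area(conn: sqlite3.Connection, general_area: str):
    """Return the names of businesses stored under a given General Area."""
    rows = conn.execute(
        "SELECT BusinessName FROM business WHERE GeneralArea = ? ORDER BY BusinessName",
        (general_area,),
    )
    return [name for (name,) in rows]


db = make_database()
yishun_shops = search_by_general_area(db, "Yishun")
```

Keeping the postal code in its own column, as the slide describes, makes it possible to match on it exactly without parsing the address string.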
![Page 28: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/28.jpg)
Algorithms
The haversine formula is used for faster processing.
• Consider two points on a sphere of radius R with latitudes $\phi_1$ and $\phi_2$, latitude separation $\Delta\phi = \phi_1 - \phi_2$, and longitude separation $\Delta\lambda$.
• With angles in radians, the distance d between the two points is related to their locations by the formula:
$$h = \operatorname{haversin}\left(\frac{d}{R}\right) = \operatorname{haversin}(\Delta\phi) + \cos(\phi_1)\cos(\phi_2)\,\operatorname{haversin}(\Delta\lambda) \qquad (1)$$
![Page 29: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/29.jpg)
Algorithms Cont.
• Let h denote haversin(d/R) as given above. d can then be solved for either by applying the inverse haversine (if available) or by using the arcsine (inverse sine) function:
$$d = R\,\operatorname{haversin}^{-1}(h) = 2R\arcsin\left(\sqrt{h}\right) \qquad (2)$$
• This formula is only an approximation when applied to Earth, because Earth is not a perfect sphere: its radius R varies from 6356.78 km at the poles to 6378.14 km at the equator. The error due to this slight ellipticity is therefore on the order of 0.1%, depending on the location. The geometric mean R = 6367.45 km is assumed.
• The output of this formula is the distance between two coordinates.
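Equations (1) and (2) translate directly into code once haversin(θ) is written as sin²(θ/2). The sketch below is an illustration under that definition, not the paper's implementation: the function names and the sample points are assumptions, and the radius is the 6367.45 km mean value quoted above. The second function shows one way results could be ordered by distance from the user's current coordinates.

```python
import math

EARTH_RADIUS_KM = 6367.45  # mean radius quoted in the slides


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi1 - phi2
    dlam = math.radians(lon1 - lon2)
    # Equation (1), with haversin(x) = sin^2(x / 2)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    h = min(1.0, max(0.0, h))   # guard against rounding slightly outside [0, 1]
    # Equation (2)
    return 2 * radius * math.asin(math.sqrt(h))


def sort_by_distance(user, places):
    """Order (name, lat, lon) tuples by distance from the user's (lat, lon)."""
    return sorted(places, key=lambda p: haversine_km(user[0], user[1], p[1], p[2]))


# Hypothetical sample points, for illustration only.
user_location = (1.30, 103.80)
candidates = [("far", 1.44, 103.79), ("near", 1.31, 103.80)]
ordered = sort_by_distance(user_location, candidates)
```

Clamping h before taking the square root avoids a domain error from `math.asin` when floating-point rounding pushes h marginally above 1 for nearly antipodal points.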
![Page 30: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/30.jpg)
Result
The database from which these results are taken contains 1652 entries in the following categories:
• Apparel, Bank, Cinema, Department Store, Duty Free Shop, Electronics, F&B (food and beverage), Fast Food, Food Court, Furniture, Health and Beauty, Minimart, Musical Instruments, Restaurant, Snack Bar, Sports, Stationery, Seafood, and Supermarket.
• The landmark types searched for are Building, Road, MRT stations, Schools, and Shopping Centres. The General Area searched under Advanced groups various roads into one big area, e.g. Tanjong Katong and Haig Road are both grouped under the Katong area.
![Page 31: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/31.jpg)
Simple search
Input
Output
![Page 32: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/32.jpg)
Advanced search
![Page 33: Web Mining Shah Mohammad Nur Alam Sawn 03/03/2014](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e0e5503460f94af83e7/html5/thumbnails/33.jpg)
Thank you for your patience!