![Page 1: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/1.jpg)
Web Characterization
Week 9
LBSC 690
Information Technology
![Page 2: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/2.jpg)
Outline
• What is the Web?
• What’s on the Web?
• What is the nature of the Web?
• Preserving the Web
![Page 3: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/3.jpg)
Defining the Web
• HTTP, HTML, or URL?
• Static, dynamic or streaming?
• Public, protected, or internal?
![Page 4: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/4.jpg)
Economics of the Web in 1995
• Affordable storage
– 300,000 words/$
• Adequate backbone capacity
– 25,000 simultaneous transfers
• Adequate “last mile” bandwidth
– 1 second/screen
• Display capability
– 10% of US population
• Effective search capabilities
– Lycos (now Google), Yahoo
![Page 5: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/5.jpg)
Nature of the Web
• Over one billion pages by 1999
– Growing at 25% per month!
– Google indexed about 3 billion pages in 2003
• Unstable
– Changing at 1% per week
• Redundant
– 30-40% (near) duplicates
  • e.g., Unix man page tree
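As a rough worked example of what the growth figure implies, the sketch below assumes the 25% monthly rate compounds steadily. That is an idealization for illustration, not a claim from the slide, and the function names are hypothetical.

```python
import math

def doubling_time(rate_per_period: float) -> float:
    """Number of periods needed to double at a constant compound rate."""
    return math.log(2) / math.log(1 + rate_per_period)

def growth_factor(rate_per_period: float, periods: int) -> float:
    """Total multiplication factor after the given number of periods."""
    return (1 + rate_per_period) ** periods

# Assumption: 25% per month, compounding every month.
months_to_double = doubling_time(0.25)    # a little over 3 months
yearly_factor = growth_factor(0.25, 12)   # roughly 14.5x in one year
```

Under that assumption the Web would double in size about every three months, which is why a one-time count of pages goes stale so quickly.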
![Page 6: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/6.jpg)
Source: Michael Lesk, How Much Information is there in the World?
![Page 7: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/7.jpg)
Number of Web Sites
![Page 8: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/8.jpg)
Web Sites by Country, 2002
![Page 9: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/9.jpg)
What’s a Web “Site”?
• OCLC counts any server at port 80
– Misses many servers at other ports
• Some servers host unrelated content
– Geocities
• Some content requires specialized servers
– rtsp
![Page 10: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/10.jpg)
World Trade in 2001
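| Rank | Exporters | Value | Share | Change | Rank | Importers | Value | Share | Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | United States | 730.8 | 11.9 | -6 | 1 | United States | 1180.2 | 18.3 | -6 |
| 2 | Germany | 570.8 | 9.3 | 3 | 2 | Germany | 492.8 | 7.7 | -1 |
| 3 | Japan | 403.5 | 6.6 | -16 | 3 | Japan | 349.1 | 5.4 | -8 |
| 4 | France | 321.8 | 5.2 | -1 | 4 | United Kingdom | 331.8 | 5.2 | -3 |
| 5 | United Kingdom | 273.1 | 4.4 | -4 | 5 | France | 325.8 | 5.1 | -2 |
| 6 | China | 266.2 | 4.3 | 7 | 6 | China | 243.6 | 3.8 | 8 |
| 7 | Canada | 259.9 | 4.2 | -6 | 7 | Italy | 232.9 | 3.6 | -2 |
| 8 | Italy | 241.1 | 3.9 | 0 | 8 | Canada | 227.2 | 3.5 | -7 |
| 9 | Netherlands | 229.5 | 3.7 | -2 | 9 | Netherlands | 207.3 | 3.2 | -5 |
| 10 | Hong Kong, China | 191.1 | 3.1 | -6 | 10 | Hong Kong, China | 202.0 | 3.1 | -6 |
| | domestic exports | 20.3 | 0.3 | -14 | | retained imports | 31.2 | 0.5 | -11 |
| | re-exports | 170.8 | 2.8 | -5 | | | | | |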
Source: World Trade Organization
![Page 11: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/11.jpg)
Global Internet User Population
[Figure: charts for 2000 and 2005; legible segment labels include English and Chinese]
Source: Global Reach
![Page 12: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/12.jpg)
Widely Spoken Languages
[Figure: bar chart of speakers (millions, axis 0 to 800) for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]
Source: http://www.g11n.com/faq.html
![Page 13: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/13.jpg)
Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
![Page 14: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/14.jpg)
Web Page Languages
[Figure: chart of Web page languages; legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]
Source: Jack Xu, Excite@Home, 1999
![Page 15: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/15.jpg)
European Web Size: Exponential Growth
[Figure: chart of billions of words (axis ticks 0, 1, 10, 100, 1,000, 10,000) from Oct-96 to Oct-05, with series for English and Other European]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
![Page 16: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/16.jpg)
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
![Page 17: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/17.jpg)
Live Streams
Almost 2000 Internet-accessible radio and television stations
[Figure: chart of English vs. other-language streams; data labels 529 and 1367]
Source: www.real.com, Feb 2000
![Page 18: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/18.jpg)
Streaming Media
• SingingFish indexes 35 million streams
• 60% of queries are for music
– Then movies
– Then sports
– Then news
![Page 19: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/19.jpg)
Crawling the Web
![Page 20: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/20.jpg)
Web Crawl Challenges
• Temporary server interruptions
• Discovering “islands” and “peninsulas”
• Duplicate and near-duplicate content
• Dynamic content
• Link rot
• Server and network loads
• Have I seen this page before?
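One simple way to approach the "Have I seen this page before?" question is to normalize each URL and keep the normalized forms in a set. The sketch below is a minimal illustration, not a description of any particular crawler: `canonicalize` and `SeenSet` are hypothetical names, and the normalization rules are assumptions. It only catches different spellings of the same URL, not the same content served from different URLs.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal.

    Sketch only: IPv6 literals and percent-encoding are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))

class SeenSet:
    """Remembers which canonical URLs have already been visited."""

    def __init__(self) -> None:
        self._seen = set()

    def first_visit(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

seen = SeenSet()
seen.first_visit("http://example.com/a")        # new URL
seen.first_visit("HTTP://EXAMPLE.com:80/a#top") # same page, different spelling
```

A set held in memory is enough for a classroom-scale crawl; a crawl of billions of pages would need a more compact or disk-backed structure.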
![Page 21: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/21.jpg)
Duplicate Detection
• Structural
– Identical directory structure (e.g., mirrors, aliases)
• Syntactic
– Identical bytes
– Identical markup (HTML, XML, …)
• Semantic
– Identical content
– Similar content (e.g., with a different banner ad)
– Related content (e.g., translated)
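Two of the cases above can be sketched in a few lines. "Identical bytes" can be tested by comparing a hash of the raw page, and "similar content" can be approximated by comparing overlapping word sequences (shingles) with Jaccard similarity. Shingling is one common technique chosen here for illustration; the slide does not name a method, and the function names and the shingle length of 3 are assumptions.

```python
import hashlib

def byte_fingerprint(data: bytes) -> str:
    """Hash of the raw bytes: equal fingerprints mean identical bytes."""
    return hashlib.sha256(data).hexdigest()

def shingles(text: str, k: int = 3) -> set:
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"

same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
similarity = jaccard(shingles(page_a), shingles(page_b))
```

Here the two pages differ in one word, so the byte fingerprints differ while the shingle similarity stays high. Related content such as translations would need a different approach, since translated pages share few word sequences.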
![Page 22: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/22.jpg)
Robots Exclusion Protocol
• Based on voluntary compliance by crawlers
• Exclusion by site
– Create a robots.txt file at the server’s top level
– Indicate which directories not to crawl
• Exclusion by document (in HTML head)
– Not implemented by all crawlers
<meta name="robots" content="noindex,nofollow">
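Because compliance is voluntary, the crawler itself has to check the rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` module to do that. The robots.txt contents, the `ExampleBot` agent name, and the `example.com` URLs are made up for illustration, and the file is parsed from a string rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might appear at the server's top level.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
# parse() takes the file's lines; a real crawler would fetch them first.
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

allowed = may_crawl("http://example.com/index.html")
blocked = may_crawl("http://example.com/private/report.html")
```

The per-document `meta` tag is a separate mechanism: a crawler only sees it after fetching and parsing the page, so it cannot prevent the fetch itself.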
![Page 23: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/23.jpg)
Link Structure of the Web
![Page 24: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/24.jpg)
The Deep Web
• Dynamic pages, generated from databases
• Not easily discovered using crawling
• Perhaps 400-500 times larger than surface Web
• Fastest growing source of new information
![Page 25: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/25.jpg)
Content of the Deep Web
![Page 26: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/26.jpg)
Deep Web
• 60 Deep Sites Exceed Surface Web by 40 Times

| Name | Type | URL | Web Size (GBs) |
|---|---|---|---|
| National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000 |
| NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600 |
| National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940 |
| Alexa | Public (partial) | http://www.alexa.com/ | 15,860 |
| Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640 |
| MP3.com | Public | http://www.mp3.com/ | |
![Page 27: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/27.jpg)
Hands on: The Wayback Machine
• Internet Archive
– Stored Alexa.com Web crawls since 1997
– http://archive.org
• Check out Maryland’s Web site in 1997
• Check out the history of your favorite site
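For the hands-on exercise, archived pages can also be reached directly by URL. The sketch below builds a Wayback Machine address from a 14-digit timestamp (YYYYMMDDhhmmss) and the original page address; the archive is expected to redirect to the capture closest to that time. The helper name, the chosen date, and the use of `http://www.umd.edu/` for Maryland's site are assumptions for illustration.

```python
def wayback_url(page_url: str, timestamp: str = "19970601000000") -> str:
    """Build a Wayback Machine address for a page near the given time.

    timestamp uses the form YYYYMMDDhhmmss; the default is an arbitrary
    mid-1997 date chosen for this exercise.
    """
    return f"https://web.archive.org/web/{timestamp}/{page_url}"

# Assumed address for the University of Maryland home page.
maryland_1997 = wayback_url("http://www.umd.edu/")
```

Pasting the resulting address into a browser should show the nearest stored capture, if one exists for that period.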
![Page 28: Web Characterization](https://reader035.vdocuments.mx/reader035/viewer/2022062309/568143f6550346895db081e9/html5/thumbnails/28.jpg)
Discussion Point
• Can we save everything?
• Should we?
• Do people have a right to remove things?