Web Characterization
Week 9
LBSC 690
Information Technology
Outline
• What is the Web?
• What’s on the Web?
• What is the nature of the Web?
• Preserving the Web
Defining the Web
• HTTP, HTML, or URL?
• Static, dynamic or streaming?
• Public, protected, or internal?
Economics of the Web in 1995
• Affordable storage
– 300,000 words/$
• Adequate backbone capacity
– 25,000 simultaneous transfers
• Adequate “last mile” bandwidth
– 1 second/screen
• Display capability
– 10% of US population
• Effective search capabilities
– Lycos (now Google), Yahoo
Nature of the Web
• Over one billion pages by 1999
– Growing at 25% per month!
– Google indexed about 3 billion pages in 2003
• Unstable
– Changing at 1% per week
• Redundant
– 30-40% (near) duplicates
– e.g., Unix man page tree
Source: Michael Lesk, How Much Information is there in the World?
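A quick worked calculation puts the growth and change rates above in perspective. This is only a sketch: it assumes the rates compound steadily and that every page is equally likely to change each week, neither of which the slide states. The function names are illustrative.

```python
import math


def doubling_time(rate_per_period: float) -> float:
    """Periods needed to double at a steady compound growth rate."""
    return math.log(2) / math.log(1 + rate_per_period)


def fraction_unchanged(change_rate: float, periods: int) -> float:
    """Share of pages never changed after `periods`, assuming each page has
    the same independent chance of changing in every period."""
    return (1 - change_rate) ** periods


if __name__ == "__main__":
    # 25% growth per month (figure from the slide)
    print(f"Doubling time: {doubling_time(0.25):.1f} months")
    # 1% of pages changing per week (figure from the slide), over 52 weeks
    print(f"Unchanged after a year: {fraction_unchanged(0.01, 52):.0%}")
```

Under these assumptions the Web would double roughly every three months, and about four pages in ten would change at least once within a year.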
Number of Web Sites
Web Sites by Country, 2002
What’s a Web “Site”?
• OCLC counts any server at port 80
– Misses many servers at other ports
• Some servers host unrelated content
– Geocities
• Some content requires specialized servers
– rtsp
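The effect of a port-80 counting rule can be sketched in a few lines. The code below is not OCLC's method, only an illustration of the rule as the slide describes it; the function names and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def counted_as_port_80_site(url: str) -> bool:
    """Return True if the URL points at an HTTP server on port 80."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return False  # e.g. rtsp:// streams need a specialized server
    # An http URL with no explicit port defaults to port 80.
    return (parts.port or 80) == 80


def count_sites(urls) -> int:
    """Count distinct hostnames that the port-80 rule would see."""
    return len({urlsplit(u).hostname for u in urls if counted_as_port_80_site(u)})


# Hypothetical sample: five URLs, four hosts, three kinds of miscount.
SAMPLE = [
    "http://www.example.com/",
    "http://intranet.example.org:8080/",   # server on another port: missed
    "rtsp://media.example.com/live",       # streaming server: missed
    "http://hosting.example.net/~alice/",  # two unrelated users on one host
    "http://hosting.example.net/~bob/",    # are counted as a single site
]

if __name__ == "__main__":
    print(count_sites(SAMPLE))
```

Here the rule reports two sites, although a reader might reasonably count five distinct sources of content.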
World Trade in 2001
Exporters
Rank | Exporter | Value | Share | Change
1 | United States | 730.8 | 11.9 | -6
2 | Germany | 570.8 | 9.3 | 3
3 | Japan | 403.5 | 6.6 | -16
4 | France | 321.8 | 5.2 | -1
5 | United Kingdom | 273.1 | 4.4 | -4
6 | China | 266.2 | 4.3 | 7
7 | Canada | 259.9 | 4.2 | -6
8 | Italy | 241.1 | 3.9 | 0
9 | Netherlands | 229.5 | 3.7 | -2
10 | Hong Kong, China | 191.1 | 3.1 | -6
 | domestic exports | 20.3 | 0.3 | -14
 | re-exports | 170.8 | 2.8 | -5

Importers
Rank | Importer | Value | Share | Change
1 | United States | 1180.2 | 18.3 | -6
2 | Germany | 492.8 | 7.7 | -1
3 | Japan | 349.1 | 5.4 | -8
4 | United Kingdom | 331.8 | 5.2 | -3
5 | France | 325.8 | 5.1 | -2
6 | China | 243.6 | 3.8 | 8
7 | Italy | 232.9 | 3.6 | -2
8 | Canada | 227.2 | 3.5 | -7
9 | Netherlands | 207.3 | 3.2 | -5
10 | Hong Kong, China | 202.0 | 3.1 | -6
 | retained imports | 31.2 | 0.5 | -11
Source: World Trade Organization
Global Internet User Population
[Charts for 2000 and 2005; labeled segments include English and Chinese]
Source: Global Reach
Widely Spoken Languages
[Bar chart: speakers (millions), axis from 0 to 800, for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]
Source: http://www.g11n.com/faq.html
Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm
Web Page Languages
[Chart legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]
Source: Jack Xu, Excite@Home, 1999
European Web Size: Exponential Growth
[Chart: billions of words (axis ticks at 0, 1, 10, 100, 1,000, 10,000) by year, Oct-96 through Oct-05; series: English, Other European]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
Live Streams
Almost 2000 Internet-accessible radio and television stations
[Chart: segments of 529 and 1367, labeled English and Other Languages]
Source: www.real.com, Feb 2000
Streaming Media
• SingingFish indexes 35 million streams
• 60% of queries are for music
– Then movies
– Then sports
– Then news
Crawling the Web
Web Crawl Challenges
• Temporary server interruptions
• Discovering “islands” and “peninsulas”
• Duplicate and near-duplicate content
• Dynamic content
• Link rot
• Server and network loads
• Have I seen this page before?
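Several of these challenges show up even in a toy crawler. The sketch below is a minimal breadth-first crawl over an in-memory link graph rather than the live Web; `TOY_WEB`, `toy_get_links`, and all URLs are hypothetical, and the normalization rules are one simple choice among many.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonical form for the "seen" check: no fragment, lowercase host."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl; `get_links(url)` returns links or raises OSError."""
    seen, queue = set(), deque()
    for seed in seeds:
        url = normalize(seed)
        if url not in seen:
            seen.add(url)
            queue.append(url)
    visited, failed = [], []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            links = get_links(url)
        except OSError:
            failed.append(url)  # server interruption or link rot; retry later
            continue
        visited.append(url)
        for link in links:
            link = normalize(link)
            if link not in seen:  # "Have I seen this page before?"
                seen.add(link)
                queue.append(link)
    return visited, failed


# Hypothetical four-page Web; nothing links to the "island".
TOY_WEB = {
    "http://a.example/": ["http://a.example/page#top", "http://B.example/"],
    "http://a.example/page": ["http://a.example/", "http://gone.example/"],
    "http://b.example/": ["http://a.example/"],
    "http://island.example/": ["http://a.example/"],
}


def toy_get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    if url not in TOY_WEB:
        raise OSError("no response")
    return TOY_WEB[url]


if __name__ == "__main__":
    print(crawl(["http://a.example/"], toy_get_links))
```

Starting from one seed, the crawl reaches three pages, records the dead link as a failure, and never discovers the island, because no crawled page links to it.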
Duplicate Detection
• Structural
– Identical directory structure (e.g., mirrors, aliases)
• Syntactic
– Identical bytes
– Identical markup (HTML, XML, …)
• Semantic
– Identical content
– Similar content (e.g., with a different banner ad)
– Related content (e.g., translated)
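Two of these cases are easy to sketch. Identical bytes can be caught with a hash; similar content needs a fuzzier comparison. The example below uses word shingles and Jaccard overlap for the second case, which is one common technique and not necessarily the one the slide has in mind. The sample pages are made up.

```python
import hashlib
import re


def byte_fingerprint(data: bytes) -> str:
    """Identical bytes give identical digests (syntactic duplicates)."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word sequences, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (equal)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: same body text, different banner ad.
PAGE_A = "Ad: buy shoes now. The man page for ls lists directory contents in sorted order."
PAGE_B = "Ad: cheap flights today. The man page for ls lists directory contents in sorted order."

if __name__ == "__main__":
    same_bytes = byte_fingerprint(PAGE_A.encode()) == byte_fingerprint(PAGE_B.encode())
    print("Identical bytes:", same_bytes)
    print("Shingle overlap:", round(jaccard(shingles(PAGE_A), shingles(PAGE_B)), 2))
```

The hash comparison reports the two pages as different, while the shingle overlap stays well above zero. Where to set the near-duplicate threshold is a tuning decision, and related content such as translations would need a different approach entirely.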
Robots Exclusion Protocol
• Based on voluntary compliance by crawlers
• Exclusion by site
– Create a robots.txt file at the server’s top level
– Indicate which directories not to crawl
• Exclusion by document (in HTML head)
– Not implemented by all crawlers
<meta name="robots" content="noindex,nofollow">
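Both forms of exclusion can be checked with Python's standard library. The sketch below assumes a hypothetical robots.txt and a hypothetical crawler name ("ExampleBot"); `allowed_by_site`, `RobotsMetaParser`, and `page_directives` are illustrative helpers, while `RobotFileParser` and `HTMLParser` are the standard-library classes.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed_by_site(robots_txt: str, agent: str, url: str) -> bool:
    """Site-level check: does robots.txt permit `agent` to fetch `url`?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def page_directives(html: str) -> set:
    """Document-level check: which robots directives does the page declare?"""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(allowed_by_site(ROBOTS_TXT, "ExampleBot", "http://www.example.com/private/a.html"))
    print(allowed_by_site(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.html"))
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print(sorted(page_directives(page)))
```

Nothing here enforces anything. A crawler has to choose to run checks like these, which is what voluntary compliance means in practice.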
Link Structure of the Web
The Deep Web
• Dynamic pages, generated from databases
• Not easily discovered using crawling
• Perhaps 400-500 times larger than surface Web
• Fastest growing source of new information
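A toy model shows why link-following misses database-backed pages. Everything in the sketch below is hypothetical (the records, the page paths, the function names); it only illustrates the idea that pages generated in response to a query have no inbound links for a crawler to follow.

```python
# Hypothetical database behind a search form; records have no inbound links.
DATABASE = {
    "station-001": "Temperature readings, station 001",
    "station-002": "Temperature readings, station 002",
    "station-003": "Temperature readings, station 003",
}

# Static (surface) pages and the links between them.
STATIC_LINKS = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # a query form, not a list of links
}


def dynamic_page(query: str) -> str:
    """Generate a page on demand, as a database-backed site would."""
    record = DATABASE.get(query)
    return f"<h1>{query}</h1><p>{record}</p>" if record else "<p>No results</p>"


def link_crawl(start: str) -> set:
    """Follow hyperlinks only, the way a simple crawler discovers pages."""
    found, stack = set(), [start]
    while stack:
        page = stack.pop()
        if page in found:
            continue
        found.add(page)
        stack.extend(STATIC_LINKS.get(page, []))
    return found


if __name__ == "__main__":
    print("Found by crawling:", sorted(link_crawl("/")))
    print("Records reachable only by query:", len(DATABASE))
```

The crawl finds the three static pages, including the search form itself, but none of the records behind it. Reaching those would require submitting queries, which a link-following crawler does not do.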
Content of the Deep Web
Deep Web
• 60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ |
Hands on: The Wayback Machine
• Internet Archive
– Stored Alexa.com Web crawls since 1997
– http://archive.org
• Check out Maryland’s Web site in 1997
• Check out the history of your favorite site
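For the hands-on exercise, Wayback Machine addresses can also be built directly. The sketch below constructs the URLs without fetching them. It assumes "Maryland's Web site" means http://www.umd.edu/, which the slide does not specify, and the helper names are illustrative.

```python
from urllib.parse import urlencode


def wayback_url(original_url: str, timestamp: str) -> str:
    """Address of the archived capture closest to `timestamp` (YYYYMMDD)."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def availability_query(original_url: str, timestamp: str) -> str:
    """Query URL for the Internet Archive availability API (returns JSON)."""
    params = urlencode({"url": original_url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + params


if __name__ == "__main__":
    # Assumed address for Maryland's site; substitute any site of interest.
    site = "http://www.umd.edu/"
    print(wayback_url(site, "19970101"))
    print(availability_query(site, "19970101"))
```

Opening the first address in a browser should land on the capture nearest the requested date, if one exists. The second returns a small JSON description of that capture.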
Discussion Point
• Can we save everything?
• Should we?
• Do people have a right to remove things?