Transcript
Page 1: Web Characterization

Web Characterization

Week 9

LBSC 690

Information Technology

Page 2: Web Characterization

Outline

• What is the Web?

• What’s on the Web?

• What is the nature of the Web?

• Preserving the Web

Page 3: Web Characterization

Defining the Web

• HTTP, HTML, or URL?

• Static, dynamic or streaming?

• Public, protected, or internal?

Page 4: Web Characterization

Economics of the Web in 1995

• Affordable storage

– 300,000 words/$

• Adequate backbone capacity

– 25,000 simultaneous transfers

• Adequate “last mile” bandwidth

– 1 second/screen

• Display capability

– 10% of US population

• Effective search capabilities

– Lycos (now Google), Yahoo

Page 5: Web Characterization

Nature of the Web

• Over one billion pages by 1999

– Growing at 25% per month!

– Google indexed about 3 billion pages in 2003

• Unstable

– Changing at 1% per week

• Redundant

– 30-40% (near) duplicates

– e.g., Unix man page tree
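
As a rough check on what the 25% monthly figure implies, the arithmetic can be sketched as compound growth. This assumes the rate stays constant, which the slide does not claim; the function names are illustrative only.

```python
import math


def doubling_time_months(monthly_rate: float) -> float:
    """Months needed to double in size at a constant monthly growth rate."""
    return math.log(2) / math.log(1 + monthly_rate)


def project(pages: float, monthly_rate: float, months: int) -> float:
    """Size after `months` of compound growth from a starting size."""
    return pages * (1 + monthly_rate) ** months


if __name__ == "__main__":
    # 25% per month is the slide's figure; a constant rate is an assumption.
    print(f"Doubling time: {doubling_time_months(0.25):.1f} months")
    print(f"One billion pages after 12 months: {project(1e9, 0.25, 12):.3g}")
```

At that rate the Web would double roughly every three months, so a figure like this can only describe a short period rather than a long-run trend.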

Page 6: Web Characterization

Source: Michael Lesk, How Much Information is there in the World?

Page 7: Web Characterization

Number of Web Sites

Page 8: Web Characterization

Web Sites by Country, 2002

Page 9: Web Characterization

What’s a Web “Site”?

• OCLC counts any server at port 80

– Misses many servers at other ports

• Some servers host unrelated content

– Geocities

• Some content requires specialized servers

– rtsp

Page 10: Web Characterization

World Trade in 2001

Rank  Exporters           Value  Share  Change    Rank  Importers           Value   Share  Change
1     United States       730.8  11.9   -6        1     United States       1180.2  18.3   -6
2     Germany             570.8  9.3    3         2     Germany             492.8   7.7    -1
3     Japan               403.5  6.6    -16       3     Japan               349.1   5.4    -8
4     France              321.8  5.2    -1        4     United Kingdom      331.8   5.2    -3
5     United Kingdom      273.1  4.4    -4        5     France              325.8   5.1    -2
6     China               266.2  4.3    7         6     China               243.6   3.8    8
7     Canada              259.9  4.2    -6        7     Italy               232.9   3.6    -2
8     Italy               241.1  3.9    0         8     Canada              227.2   3.5    -7
9     Netherlands         229.5  3.7    -2        9     Netherlands         207.3   3.2    -5
10    Hong Kong, China    191.1  3.1    -6        10    Hong Kong, China    202.0   3.1    -6
        domestic exports  20.3   0.3    -14               retained imports  31.2    0.5    -11
        re-exports        170.8  2.8    -5

Source: World Trade Organization

Page 11: Web Characterization

Global Internet User Population

[Figure: charts for 2000 and 2005; labeled segments include English and Chinese]

Source: Global Reach

Page 12: Web Characterization

Widely Spoken Languages

[Figure: bar chart of speakers (millions), axis from 0 to 800, for Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, and Japanese]

Source: http://www.g11n.com/faq.html

Page 13: Web Characterization

Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

Page 14: Web Characterization

Web Page Languages

[Figure legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]

Source: Jack Xu, Excite@Home, 1999

Page 15: Web Characterization

European Web Size: Exponential Growth

[Figure: billions of words (axis ticks 0, 1, 10, 100, 1,000, 10,000) by year, Oct-96 through Oct-05; series: English, Other European]

Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

Page 16: Web Characterization

European Web Content

Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

Page 17: Web Characterization

Live Streams

Almost 2000 Internet-accessible Radio and Television Stations

[Figure: chart with segments labeled English and Other Languages; values shown: 529 and 1367]

Source: www.real.com, Feb 2000

Page 18: Web Characterization

Streaming Media

• SingingFish indexes 35 million streams

• 60% of queries are for music

– Then movies

– Then sports

– Then news

Page 19: Web Characterization

Crawling the Web

Page 20: Web Characterization

Web Crawl Challenges

• Temporary server interruptions

• Discovering “islands” and “peninsulas”

• Duplicate and near-duplicate content

• Dynamic content

• Link rot

• Server and network loads

• Have I seen this page before?
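
The last question above is often answered by keeping a set of canonicalized URLs. The following is a minimal sketch, not a description of any particular crawler; the normalization rules (lowercasing, dropping default ports and fragments) and the names `canonicalize` and `SeenSet` are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is dropped.
    # Any user:password part of the URL is also dropped in this sketch.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenSet:
    """Remembers which canonical URLs a crawl has already visited."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def first_visit(self, url: str) -> bool:
        """Return True the first time a URL is offered, False afterwards."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

This catches only URL-level repeats; two different URLs that serve the same content still need content-based duplicate detection.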

Page 21: Web Characterization

Duplicate Detection

• Structural

– Identical directory structure (e.g., mirrors, aliases)

• Syntactic

– Identical bytes

– Identical markup (HTML, XML, …)

• Semantic

– Identical content

– Similar content (e.g., with a different banner ad)

– Related content (e.g., translated)
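
Two of the cases above can be sketched in a few lines. Byte-identical pages can be caught by comparing cryptographic hashes, and "similar content" can be approximated with word shingles and Jaccard similarity. Shingling is one common technique and is an assumption here, not something the slide names; the shingle size of 3 is likewise arbitrary.

```python
import hashlib


def byte_fingerprint(content: bytes) -> str:
    """Identical bytes give identical fingerprints (syntactic duplicates)."""
    return hashlib.sha256(content).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all shingles; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page_a = "the quick brown fox jumps over the lazy dog"
    page_b = "the quick brown fox jumps over the lazy cat"
    print(byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode()))
    print(jaccard(shingles(page_a), shingles(page_b)))
```

A crawler would pick a similarity threshold above which two pages count as near-duplicates; the right value depends on the collection. Related content such as translations is not detected by either method.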

Page 22: Web Characterization

Robots Exclusion Protocol

• Based on voluntary compliance by crawlers

• Exclusion by site

– Create a robots.txt file at the server’s top level

– Indicate which directories not to crawl

• Exclusion by document (in HTML head)

– Not implemented by all crawlers

<meta name="robots" content="noindex,nofollow">
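
Exclusion by site can be illustrated with Python's standard `urllib.robotparser` module, which a well-behaved crawler might consult before each fetch. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that keeps all crawlers out of two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Compliance is voluntary: the crawler itself must ask before fetching.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/notes.html")
print(allowed, blocked)
```

Nothing on the server enforces these rules, so a crawler that skips the check can still retrieve the excluded pages.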

Page 23: Web Characterization

Link Structure of the Web

Page 24: Web Characterization

The Deep Web

• Dynamic pages, generated from databases

• Not easily discovered using crawling

• Perhaps 400-500 times larger than surface Web

• Fastest growing source of new information

Page 25: Web Characterization

Content of the Deep Web

Page 26: Web Characterization

Deep Web

• 60 Deep Sites Exceed Surface Web by 40 Times

Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ |

Page 27: Web Characterization

Hands on: The Wayback Machine

• Internet Archive

– Stored Alexa.com Web crawls since 1997

– http://archive.org

• Check out Maryland’s Web site in 1997

• Check out the history of your favorite site
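
For the exercise above, archived copies can usually be reached through a URL of the form `https://web.archive.org/web/<timestamp>/<original URL>`. The helper below is a small sketch that only builds such a URL and does not contact the archive. The function name is illustrative, and using `http://www.umd.edu/` for Maryland's site is an assumption.

```python
def wayback_url(original_url: str, timestamp: str = "1997") -> str:
    """Build a Wayback Machine URL for a page near the given time.

    `timestamp` is a run of digits in YYYYMMDDhhmmss order; a shorter
    prefix such as "1997" is assumed to resolve to a nearby capture.
    """
    if not timestamp.isdigit() or len(timestamp) > 14:
        raise ValueError("timestamp must be 1 to 14 digits (YYYYMMDDhhmmss)")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Assumed address for the University of Maryland home page.
    print(wayback_url("http://www.umd.edu/", "1997"))
```

Opening the printed URL in a browser shows whichever capture the archive holds closest to the requested date, if any exists.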

Page 28: Web Characterization

Discussion Point

• Can we save everything?

• Should we?

• Do people have a right to remove things?

