The “Deep Web”

ISC 110 Final Project

Kaila Ryan - 12/12/2013

What is the “Deep Web”?

Web content that is hidden behind an HTML form and generally cannot be indexed by search engines (Madhavan et al., 2008).

Largely made up of web-connected databases (Wright, 2009).

Shopping catalogs

Scientific research data

Public transport information, etc.

Requires “valid input values” to access (Madhavan et al., 2008). In other words, a query or another similar form of typed input.

Web crawlers are not yet sophisticated enough to automate the formulation of relevant queries, so they cannot reach this data.

A bit about search engines...

Most modern search engines use automated “web crawler” programs to index websites

Crawlers follow a “trail” of links from webpage to webpage, indexing each new page they find so that it becomes searchable as part of the “surface web” (Wright, 2009).

Because of the very nature of how they function, traditional crawling methods fail to index some documents, such as:

Databases, which require specific queries to access the information contained in them

Impossible (or at least inefficient and impractical) to use every possible query on every database found.

Task of figuring out how to narrow down possible queries to relevant terminology has been challenging.
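The link-following behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions: the page names, the `LINKS` graph, and the `FORM_RESULTS` table are hypothetical toy data, not a real site or a real crawler. It shows a breadth-first crawl indexing everything reachable by links while never reaching records that exist only as answers to a form query.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "catalog-search"],
    "about": ["home"],
    "catalog-search": [],  # a search form: no outgoing links until a query is typed
}

# Hypothetical records that exist only as responses to a form query.
FORM_RESULTS = {
    ("catalog-search", "red shoes"): "catalog/item-17",
    ("catalog-search", "blue hat"): "catalog/item-42",
}


def crawl(start):
    """Breadth-first link following; returns pages in the order indexed."""
    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        indexed.append(page)
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


surface = crawl("home")
deep = sorted(set(FORM_RESULTS.values()) - set(surface))
print("indexed (surface web):", surface)
print("never reached (deep web):", deep)
```

The crawler indexes the search form page itself, but because it never types a query, the two catalog records stay out of the index.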

Finding the Deep Web:

No single, exhaustive method of locating this data is available yet.

Many competing theories and projects working toward the creation of functioning Deep Web crawlers and search engines.

Primary methods of locating Deep Web content at present:

Directories, like “The Hidden Wiki” (requires Tor browser)

Referral by current users of a particular site/service/database

Many in the field of Information Science are focused on developing technology capable of “surfacing” Deep Web content through new methods of locating and querying databases and indexing the results of these queries.

Google has a team dedicated specifically to this task
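As a rough illustration of what “surfacing” involves, the sketch below submits candidate input values to a form-backed database and indexes whatever comes back. Everything here is an assumption made for the example: the `DATABASE` contents, the `submit_form` stand-in, and the candidate city names are hypothetical, and this is not how any particular search engine implements it.

```python
from itertools import product

# Hypothetical form-backed database: (origin, destination) -> result page text.
DATABASE = {
    ("Albany", "Boston"): "Albany to Boston: bus, 4h",
    ("Boston", "Albany"): "Boston to Albany: bus, 4h",
}


def submit_form(origin, destination):
    """Stand-in for submitting an HTML form; returns None for invalid input."""
    return DATABASE.get((origin, destination))


def surface(candidate_values):
    """Try every pairing of candidate values and index the non-empty results."""
    index = {}
    attempts = 0
    for origin, destination in product(candidate_values, repeat=2):
        attempts += 1
        result = submit_form(origin, destination)
        if result is not None:
            index[(origin, destination)] = result
    return index, attempts


index, attempts = surface(["Albany", "Boston", "Chicago"])
print(f"{len(index)} pages surfaced from {attempts} queries")
```

With two form fields, the number of queries grows with the square of the candidate list, which is why narrowing the candidates to relevant terminology matters more than trying every possible input.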

The Deep Web's value:

You may be asking yourself, “Why should we bother surfacing the 'Deep Net'? What is it worth to us?”

Ability to automate database querying and indexing opens up potential for automated cross-referencing of otherwise unconnected databases.

Invaluable to the field of medical and scientific research.

Important step in the movement toward a semantic web.

Could potentially be used to search for answers to complex questions, for which all of the information is available, but is either not unified, or not easily accessible (“What is the cheapest way to get from X to Y at 9am on a Sunday?”)

In general, ability to discover a wealth of knowledge that is already freely available, but hidden: up to 96% of the Web may be considered the Deep Web.

Sources

Bergman, M. K. (2001, Sept 24). The deep web: Surfacing hidden value. Deep Content. Retrieved from http://grids.ucs.indiana.edu/courses/xinformatics/searchindik/deepwebwhitepaper.pdf

Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang. 2007. Accessing the deep web. Commun. ACM 50, 5 (May 2007), 94-101. DOI=10.1145/1230819.1241670 http://doi.acm.org/10.1145/1230819.1241670

Jayant Madhavan, David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. 2008. Google's Deep Web crawl. Proc. VLDB Endow. 1, 2 (August 2008), 1241-1252.

Wright, A. (2009, Feb 23). Exploring a 'deep web' that google can't grasp. The New York Times. Retrieved from http://cob.jmu.edu/williamson/mktg470/reading/search/2009/Exploring a‘Deep Web’ That Google Can’t Grasp.pdf
