CSE 8337 – Spring 2007 Project 2 – Deep Web Manoj Ravuru - Student ID 22508269
Deep (Invisible) Web
- Manoj Ravuru
Outline
Web and Search Engines
Types of Web
What is the Deep Web? How big is it? Is it important?
What makes it Deep and what is in it?
Deep Web content classification and categories
Crawling and Indexing Deep Web
Deep Web Statistics
Outline
Deep Web Quality
How to find and use the Deep Web?
Deep Web Gateways
Deep Web Issues
Summary
References
Web and Search Engines
In 1991, the Web was created by Tim Berners-Lee, a researcher at the CERN high-energy physics laboratory in Switzerland.
Berners-Lee designed the Web to be platform-independent.
To enable this cross-platform capability, Berners-Lee created HTML, or Hypertext Markup Language, a simplified version of SGML (Standard Generalized Markup Language).
The simplicity of the markup format made it practical to build search engines that users can query to find and retrieve HTML documents of interest on the Web.
The resulting Shallow Web, also known as the Surface Web or Static Web, is the collection of Web sites indexed by automated search engines.
A search engine's Web crawler follows URL links across the Web, indexes every word on every HTML page it reaches, and stores the results in huge databases that can be searched on demand.
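The crawl-and-index loop described above can be sketched in a few lines of Python. This is a simplified illustration, not any real engine's implementation; the in-memory `pages` dict stands in for live HTTP fetches and its contents are hypothetical.

```python
from collections import deque

def crawl_and_index(pages, start_url):
    """Breadth-first crawl over a set of linked pages, building an
    inverted index that maps each word to the URLs containing it.
    `pages` maps URL -> (text, linked_urls), simulating HTTP fetches."""
    index = {}                      # word -> set of URLs
    seen = {start_url}
    frontier = deque([start_url])
    while frontier:
        url = frontier.popleft()
        text, links = pages[url]
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
        for link in links:          # follow hyperlinks, skip revisits
            if link in pages and link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

# Tiny simulated "surface web": every page is reachable via links.
web = {
    "a.html": ("deep web overview", ["b.html"]),
    "b.html": ("search engines index pages", ["a.html"]),
}
idx = crawl_and_index(web, "a.html")
print(sorted(idx["web"]))   # → ['a.html']
```

Note the limitation this exposes: only pages reachable by following links ever enter the index, which is exactly why database-backed pages stay invisible.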
Types of Web
Static Web
Dynamic Web
Opaque Web
Private Web
Proprietary Web
Pay per click Web
What is the Deep Web?
Web pages backed by vast information repositories that search engines cannot or will not index.
The term mainly refers to rich content that search engines have no direct access to, such as databases.
Deep Web pages are dynamically created as the result of a specific search.
The Deep Web is also called the Invisible Web.
The term "invisible" in "Invisible Web" is actually a misnomer:
Deep Web information is available via the Web but isn't accessible to search engines.
How big is the Invisible Web?
Its size cannot be determined accurately. In a word, it's humongous.
The Deep Web is estimated to be roughly 500 times bigger than the searchable, or surface, Web, and may be bigger still.
Considering that Google alone covers around 8 billion pages, that's just mind-boggling.
If the major search engines together index only 20% of the Web, then they miss 80% of its content.
The Deep Web also includes images, sounds, presentations, and many other types of media not visible to search engines.
Is the Deep Web Important?
Think of the Web as a vast library: finding what's needed requires more digging.
Because search engines cover only a very small portion of the Web, the Invisible Web is a very tempting resource; there's a lot more information out there than one could ever imagine.
A significant share of the Deep Web is quality content that lives in documents within searchable databases on the Web, which conventional (well-known and most-used) search engines cannot access.
As a result, businesses, researchers, consumers, and others may not get the quality information they need.
Search engines themselves have problems providing relevant content, at least for complicated or obscure queries.
Why the name "Invisible"?
When spiders crawling the Web run into a page from the Invisible Web, they don't quite know what to do with it.
A spider can record the address of a page it couldn't access, but it can't tell what information the page contains.
The main factors are technical barriers, e.g., databases, password-protected pages, and script-based pages.
What makes it Deep?
Proprietary sites
Sites requiring a registration
Sites with scripts
Dynamic sites
Ephemeral sites
Sites blocked by local webmasters
Sites blocked by search engine policy
Sites with special formats
Searchable databases
Other factors…
Pages excluded by policy.
Spiders/crawlers do not report what they can't index.
The sheer task of actually finding all the pages on the Web.
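One concrete mechanism behind "pages excluded by policy" and "sites blocked by local webmasters" is the robots.txt exclusion protocol, which Python's standard library can evaluate. The rules and URLs below are a made-up example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a crawler might retrieve from a site;
# Disallow rules are how webmasters keep pages out of search indexes.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite spider checks each URL before fetching it.
print(rp.can_fetch("*", "http://example.com/private/report.html"))  # → False
print(rp.can_fetch("*", "http://example.com/about.html"))           # → True
```

A blocked page thus becomes Deep Web content even though it is a perfectly ordinary static HTML page.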
Deep Web resource classification
Dynamic content - Dynamic pages in response to a submitted query
Unlinked content - Pages which are not linked to by other pages
Limited access content - Sites that require registration or limit access to their pages
Scripted content - Pages that are only accessible through links produced by JavaScript or Flash, which require special handling.
Non-text content - Multimedia (image) files, Usenet archives and documents in non-HTML file formats such as PDF and DOC documents
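The "dynamic content" category above can be illustrated with a toy database-backed site: each result page exists only as the answer to a submitted query, so a link-following crawler never encounters it. The table name and rows below are invented for the sketch:

```python
import sqlite3

# A minimal database-backed "site": its pages exist only as query
# results, which is why a link-following crawler never sees them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [
    ("Deep Web Survey", "estimates of hidden content"),
    ("Crawler Design", "following links on the surface web"),
])

def search(term):
    """Build a result 'page' dynamically from a submitted query --
    the mechanism behind the dynamic-content category above."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE body LIKE ?", (f"%{term}%",)
    ).fetchall()
    return [title for (title,) in rows]

print(search("hidden"))   # → ['Deep Web Survey']
```

There is no static URL for the result list; it materializes only when the form is submitted with a specific term.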
Deep web content categories
Crawling & Indexing the Deep Web
Major search engines such as Google, AltaVista, and Inktomi do index dynamic content through the following programs:
Paid partnership programs
Trusted feed services
Premium inclusion programs
Quigo's QUIBOT remotely crawls through pages from the deep Web, enabling it to index a large portion of the deep Web and making this content available to users searching on Quigo and partner portals.
Quigo's DeepWebGateway enables search engines to index deep Web content that they do not access directly. This technology also solves other problems related to deep Web crawling and indexing, such as spider traps and personalization.
Deep Web Statistics
Public information on the deep Web is currently 400 to 550 times larger than the commonly defined World Wide Web.
The deep Web contains 7,500 terabytes of information compared to 29 terabytes of information in the surface Web.
The deep Web contains nearly 550 billion individual documents compared to the 2.5 billion of the surface Web.
Ninety-five percent of the deep web contains publicly accessible information that is not subject to fees or subscriptions.
More than 200,000 deep Web sites presently exist.
60 of the largest deep-Web sites collectively contain about 750 terabytes of information -- sufficient by themselves to exceed the size of the surface Web 40 times.
On average, deep Web sites receive fifty per cent greater monthly traffic than surface sites and are more highly linked to than surface sites.
Deep Web Statistics (contd)
Deep Web sites tend to be narrower, with deeper content, than conventional surface sites.
Total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web.
More than half of the deep Web content resides in topic-specific databases.
Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.
More than 95% of deep Web information is publicly available without restriction.
International Data Corporation predicts that the number of surface Web documents will grow from the current two billion or so to 13 billion within three years, a factor increase of 6.5 times. Deep Web growth should exceed this rate, perhaps increasing about nine-fold over the same period.
Deep Web Quality
There is about a three-fold improved likelihood of obtaining quality results from the deep Web compared to the surface Web.
Overall precision and recall would be higher due to the presence of highly relevant information for each subject area.
The degree of content overlap between deep Web sites is expected to be much less than between surface Web sites.
Observations from working with deep Web sources and data suggest there are important information categories where duplication does exist. Prominent among these are yellow/white pages, genealogical records, and public records with commercial potential such as SEC filings. On the other hand, there are entire categories of deep Web sites whose content appears uniquely valuable. These mostly fall within the categories of topical databases, publications, and internal site indices, which account in total for about 80% of deep Web sites.
Duplication will be lower within the deep Web.
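The notion of content overlap between sites can be made concrete with a Jaccard measure over their document sets. The sites and document identifiers below are hypothetical:

```python
def jaccard(a, b):
    """Jaccard overlap between two sites' document sets: |A∩B| / |A∪B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Invented document identifiers held by three deep Web sites.
site1 = ["doc1", "doc2", "doc3"]
site2 = ["doc3", "doc4"]        # shares one document with site1
site3 = ["doc5", "doc6"]        # no overlap at all

print(jaccard(site1, site2))    # 1 shared / 4 total → 0.25
print(jaccard(site1, site3))    # → 0.0
```

Lower pairwise scores across a collection of sites correspond to the lower duplication claimed for the deep Web.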
Finding Deep Web
General Web directories: www.completeplanet.com, www.thebighub.com
Deep Web search engines that send a single query to dozens of databases simultaneously:
www.alltheweb.com, www.brightplanet.com
Specialized databases: www.nsdl.org, http://catalog.loc.gov
Use Google and other search engines to locate searchable databases.
Example queries for Google & Yahoo: languages database or toxic chemicals database
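A meta search engine of the kind listed above fans one query out to many databases at once and merges the results. A minimal sketch, with stub functions standing in for real deep Web backends (all names and result formats are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Stub backends standing in for independent deep Web databases;
# a real meta search engine would issue HTTP queries instead.
def medline(q):
    return [f"medline:{q}-1"]

def patents(q):
    return [f"patents:{q}-1", f"patents:{q}-2"]

def catalog(q):
    return []  # this source has no hits for the query

BACKENDS = [medline, patents, catalog]

def metasearch(query):
    """Send one query to every backend in parallel, merge the hits."""
    with ThreadPoolExecutor(max_workers=len(BACKENDS)) as pool:
        result_lists = list(pool.map(lambda b: b(query), BACKENDS))
    return [hit for hits in result_lists for hit in hits]

print(metasearch("toxic chemicals"))  # hits merged in backend order
```

Querying the sources concurrently is what makes "dozens of databases simultaneously" practical: total latency is bounded by the slowest backend, not the sum of all of them.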
Deep Web search strategies to follow
Be aware that the Deep Web exists.
Use a general search engine for broad topic searching.
Use a searchable database for focused searches.
Register on special sites and use their archives.
Call the reference desk at a local college if in need of a proprietary Web site. Many college libraries subscribe to these services and provide free on-site searching.
Many libraries offer free remote online access to commercial and research databases for anyone with a library card.
Deep Web Gateways – Web Directories
Infomine [http://infomine.ucr.edu/] is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level.
It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
Infomine is librarian-built. Librarians from the University of California, Wake Forest University, California State University, the University of Detroit - Mercy, and other universities and colleges have contributed to building Infomine.
Infomine Web Directory
Deep Web Gateways – Web Directories
Digital Librarian [http://www.digital-librarian.com/] is a librarian's choice of the best of the Web.
Deep Web Gateways – Search Engines
Turbo10 is a meta search engine that provides a universal interface to Deep Web search engines.
Turbo10 is designed to help users search deeper and browse faster.
Turbo10 has developed search technology since 2001. It connects Internet searchers to Deep Web search engines.
Turbo10 Deep Web Search Engine
Deep Web Gateways – Search Engines
AlltheWeb [http://www.alltheweb.com/] combines one of the largest and freshest indices with the most powerful search features that allow anyone to find anything faster than with any other search engine.
AlltheWeb's index (provided by Yahoo!) includes billions of web pages, as well as tens of millions of PDF and MS Word® files. Yahoo! frequently scans the entire web to keep the content fresh and to eliminate broken links.
AlltheWeb offers a variety of specialized search tools and advanced search features, and supports searching in 36 different languages. Its image, audio, and video searches include hundreds of millions of multimedia files.
AlltheWeb provides the controls necessary to find the most relevant content through some of the most sophisticated advanced search features available.
AlltheWeb – Deep Web Search Engine
Deep Web Gateways – Specialized Databases
NSDL (National Science Digital Library - http://nsdl.org/) was established as an online library which directs users to exemplary resources for science, technology, engineering, and mathematics (STEM) education and research.
NSDL provides an organized point of access to STEM content that is aggregated from a variety of other digital libraries, NSF-funded projects, and NSDL-reviewed web sites.
NSDL also provides access to services and tools that enhance the use of this content in a variety of contexts.
NSDL – Specialized Database
Other notable Deep Web resources
Deep Query Manager (DQM), BrightPlanet's <http://www.brightplanet.com/> powerful search tool designed to retrieve information from thousands of Deep Web databases and search engines at one time.
AlphaSearch <http://www.calvin.edu/library/searreso/internet/as/> is an extremely useful directory of "gateway" sites that collect and organize Web sites that focus on a particular subject.
Many databases that make up GPO Access. <http://www.access.gpo.gov/>.
Telephone directory databases such as Anywho <http://www.anywho.com/>.
Deep Web Issues
Complete indexing of the Deep Web is impossible.
Deep Web content is dynamic and can change faster than content on the static/surface Web.
There is no bright line that separates content sources on the Web; users need to choose the database (Deep Web resource) of interest on their own.
The Deep Web phenomenon is not well known to the Internet-searching public.
Value of deep web content is incalculable.
Summary
World Wide Web
  "Visible/Surface Web"
    Search Directories - Examples: Librarians Index to the Internet, Yahoo
    Search Engines - Examples: Google, Yahoo, Altavista
  "Invisible/Deep Web"
    Specialized, searchable databases (fee-based and free) - Examples: Library Catalogs, digital library archives, Dictionaries, Encyclopedias, Article databases
Summary
Deep Web content is highly relevant to every information need, market, and domain.
The deep Web is the largest growing category of new information on the Internet.
Serious information seekers can no longer avoid the importance or quality of deep Web information.
Deep Web information is only a component of total information available. Searching must evolve to encompass the complete Web.
Directed query technology is the only means to integrate deep and surface Web information.
Summary
Specific vertical market services are already evolving to partially address the deep web challenges. These will likely need to be supplemented with a persistent query system customizable by the user that would set the queries, search sites, filters, and schedules for repeated queries.
Use search directories that offer hand-picked information chosen from the surface Web to meet popular search needs.
Use search engines for more robust surface-level searches and content-aggregation vertical "infohubs" for deep Web information to provide answers where comprehensiveness and quality are imperative.
References
1. Wikipedia, the free encyclopedia, "Deep Web", 24 April 2007. http://en.wikipedia.org/wiki/Hidden_web
2. Wendy Boswell, "The Invisible Web", 21 April 2007. http://websearch.about.com/od/invisibleweb/a/invisible_web.htm
3. Chris Sherman, "The Invisible Web", 20 April 2007. http://www.freepint.co.uk/issues/080600.htm#feature
4. Joe Barker, "Invisible or Deep Web", 9 March 2007. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html
5. Michael K. Bergman, "The Deep Web: Surfacing Hidden Value", September 24, 2001. http://www.press.umich.edu/jep/07-01/bergman.html
6. Laura Cohen, "The Deep Web", 22 November 2006. http://www.internettutorials.net/deepweb.html
7. Marcus P. Zillman, "Deep Web Research", April 23, 2007. http://deepwebresearch.blogspot.com/
8. Paul Bruemmer, "Indexing Deep Web Content", March 27, 2002. http://www.searchengineguide.com/wi/2002/0327_wi2.html
References
9. Danny Sullivan, "Invisible Web Gets Deeper", August 2, 2000. http://searchenginewatch.com/showPage.html?page=2162871
10. Chris Sherman, "Search for the invisible web", September 6, 2001. http://technology.guardian.co.uk/online/story/0,3605,547140,00.html
11. Greg Linden, "Deep Web Strategy", March 2007. http://www.semantic-web.at/10.57.1089.press.greg-linden-on-google-s-deep-web-strategy.htm
12. Alex Wright, "In search of the deep Web", 9 March 2004. http://archive.salon.com/tech/feature/2004/03/09/deep_web/index_np.html
13. Danny Sullivan, "'Invisible Web' Revealed", June 11, 1999. http://searchenginewatch.com/showPage.html?page=2167321
14. Michael Cross, "The hidden potential of the web", April 21, 2004. http://society.guardian.co.uk/e-public/story/0,13927,1195901,00.html
Thank You !!!
Manoj Ravuru