
Upload: stephanie-harrell

Post on 29-Dec-2015


ILS 501 Unit 3

Searching Issues

ILS 501 / Dr. Liu, ILS SCSU

The Rise of Search

Search is the 2nd most popular online activity, after email.

Percentage of net users who search on a typical day grew 70% from 2002 to 2009

Pew Internet and American Life Project


What is a search engine?

A program that searches documents for specified keywords and returns a list of the documents where the keywords were found.

Typically, a search engine works by sending out a spider to fetch as many documents as possible. Another program, called an indexer, then reads these documents and creates an index based on the words contained in each document.
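The indexer step described above can be sketched as a small inverted index. This is only an illustration: the page names and texts are made up, and a real indexer would also handle punctuation, stemming, and much larger collections.

```python
from collections import defaultdict

# Hypothetical documents a spider might have fetched (name -> page text).
fetched = {
    "page1.html": "cats and dogs are popular pets",
    "page2.html": "dogs chase cats",
    "page3.html": "tuna fishermen and dolphins",
}

def build_index(documents):
    """Indexer: map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, keyword):
    """Return the sorted list of documents where the keyword was found."""
    return sorted(index.get(keyword.lower(), set()))

index = build_index(fetched)
print(search(index, "dogs"))  # ['page1.html', 'page2.html']
```

The search never rereads the documents; it only looks the keyword up in the index, which is why building the index first makes searching fast.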

04/19/23 ILS 501 / Dr. Liu, ILS SCSU 4


What is a web search engine?

A search engine is a huge database of web page files that has been assembled automatically by machine.


What does a search engine do?

It uses software indexers (spiders or "robots") to “crawl” around the Web and build indexes based on what they find in available Web pages.

How Do Search Engines Work?

1) Crawling: A ‘spider’ or ‘robot’ explores your site, following links from page to page.

2) Indexing: Data from the crawl is stored in the search engine index. The stored copy is referred to as the ‘cached page’.

3) Ranking: The search engine algorithm looks at a variety of factors (over 200) to determine the importance of a web page and where it should rank for any given keyword phrase.
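The crawling step, following links from page to page, can be sketched on a toy site. The site map below is hypothetical and held in memory, so nothing is actually downloaded. A real spider would fetch each page over HTTP and extract the links from its HTML.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
links = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/a", "/products/b"],
    "/products/a": ["/products"],
    "/products/b": [],
}

def crawl(start, link_graph):
    """Visit pages breadth-first, following links and skipping pages already seen."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

print(crawl("/", links))
```

The `seen` set is what keeps the spider from looping forever when pages link back to each other.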


How do search engines work?

Crawler-Based Search Engines

They "crawl" or "spider" the web, then people search through what they have found.

Human-Powered Directories

Hybrid Search Engines

(Source: http://www.searchenginewatch.com)

Web site to explain PageRank

[Figure: example link graph with pages a1, b1, b2, b3, b4, c1, d1, d2, e1, e2]

PageRank - Motivation

The number of incoming links to a page is a measure of the importance and authority of the page.

Also take into account the quality of the recommendation: a page is more important if the sources of its incoming links are important.
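Both ideas, counting incoming links and weighting them by the importance of their source, can be sketched with a simplified PageRank-style iteration. The five-page graph, the damping factor of 0.85, and the 50 iterations are assumptions made for this illustration, and the code is not Google's actual implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score along its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
graph = {
    "A": ["C", "E"],
    "B": ["C"],
    "C": ["D"],
    "D": ["C"],
    "E": ["C"],
}

ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph D and E each have exactly one incoming link, but D's link comes from the highly ranked page C, so D ends up far above E.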

Expanding the Root Set

PageRank


Three elements of a Crawler-Based Search Engine

The spider (crawler). The spider visits a web page, reads it, and then follows links to other pages within the site. The spider returns to the site on a regular basis, such as every month or two, to look for changes.

The index. It is like a catalog containing a copy of every web page that the spider finds. If a web page changes, then this catalog is updated with new information.

Search engine software. It is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what is most relevant.

(Source: http://www.searchenginewatch.com)


A search engine is an index compiler

Search engines compile their databases by employing "spiders" or "robots" to crawl through web space from link to link, identifying pages.

Once the spiders get to a web site, they typically index most of the words on the publicly available pages at the site.

Two Search Methods

The Searchable Subject Index: searches titles and meta tags, e.g. Yahoo

The Full-Text Search Engine: uses a spider to search not only titles but also content, e.g. Google


What are the top 10 search providers in 2009?

They are ...? Ranked by Nielsen MegaView Search:

Top 10 Search Providers for August 2009, Ranked by Searches (U.S.)


How many types of search engines exist?

Three common types of search engines:

Directory – Subject search
Individual search engine – Keyword search
Metasearch engine – Meta search through multiple engines


DIRECTORY by Subjects

Galaxy, GoGuides, LookSmart, NexTag, OpenDirectory, Yahoo*, Zeal


INDIVIDUAL SEARCH ENGINES by Keywords

AllTheWeb, AltaVista, Entireweb, Google, WiseNut, HotBot, Lycos, Yahoo, NexTag, Overture


What is a metasearch engine?

Metasearch engines do not crawl the web to compile their own searchable databases. Instead, they search the databases of multiple individual search engines simultaneously.

They provide a quick way of finding out which engines are retrieving the best results for you in your search.


What Are "Meta-Search" Engines? How Do They Work?

“In a meta-search engine, you submit keywords in its search box, and it transmits your search simultaneously to several individual search engines and their databases of web pages. Within a few seconds, you get back results from all the search engines queried. Meta-search engines do not own a database of Web pages; they send your search terms to the databases maintained by search engine companies.”

From http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
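The idea in the quotation, passing one query to several engines and combining what comes back, can be sketched as follows. The three "engines" are hypothetical stand-in functions that return fixed result lists, and the merge rule (more engines first, then best position) is an assumption. Real meta-search tools each use their own merging rules and query the engines in parallel.

```python
# Hypothetical stand-ins for individual search engines.
# Each takes a query and returns a ranked list of URLs.
def engine_a(query):
    return ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

def engine_b(query):
    return ["http://example.com/b", "http://example.com/d"]

def engine_c(query):
    return ["http://example.com/b", "http://example.com/a"]

ENGINES = {"EngineA": engine_a, "EngineB": engine_b, "EngineC": engine_c}

def metasearch(query, engines):
    """Send the query to every engine and merge the results without duplicates."""
    found = {}  # url -> list of (engine name, position in that engine's results)
    for name, engine in engines.items():
        for position, url in enumerate(engine(query), start=1):
            found.setdefault(url, []).append((name, position))

    def sort_key(item):
        url, hits = item
        # URLs returned by more engines come first, then the best position.
        return (-len(hits), min(position for _, position in hits), url)

    return [(url, [name for name, _ in hits])
            for url, hits in sorted(found.items(), key=sort_key)]

results = metasearch("dolphin", ENGINES)
for url, sources in results:
    print(url, sources)
```

Note that the meta-searcher keeps no index of its own; everything it returns comes from the engines it queried.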

Better Meta-Searchers (UC Berkeley - Teaching Library Internet Workshops)

Columns: Meta-Search Tool; What's Searched (As of date at bottom of page. They change often.); Complex Search Ability; Results Display

Clusty (clusty.com)
What's Searched: Currently searches a number of free search engines and directories, not Google or Yahoo.
Complex Search Ability: Accepts and "translates" complex searches with Boolean operators and field limiting.
Results Display: Results accompanied with subject subdivisions based on words in search results, intended to give the major themes. Click on these to search within results on each theme.

Dogpile (www.dogpile.com)
What's Searched: Searches Google, Yahoo, LookSmart, Ask.com, MSN search, and more. Sites that have purchased ranking and inclusion are blended in. Watch for Sponsored by... links below search results.
Complex Search Ability: Accepts Boolean logic, especially in advanced search modes.


Meta-Search Engines for SERIOUS Deep Digging (UC Berkeley - Teaching Library Internet Workshops)

Columns: Meta-Search Tool; What's Searched (As of date at bottom of page. They change often.); Complex Search Ability; Results Display

SurfWax (www.surfwax.com)
What's Searched: A better than average set of search engines. Can mix with educational, US Govt tools, and news sources, or many other categories.
Complex Search Ability: Accepts " ", +/-. Default is AND between words. I recommend fairly simple searches, allowing SurfWax's SiteSnaps and other features to help you dig deeply into results.
Results Display: Click on source link to view complete search results there. Click on to view helpful "SiteSnap™" extracted from most sites in frame on right. Many additional features for probing within a site.

Copernic Agent (www.copernic.com)
What's Searched: Select from list of search engines by clicking the Properties button following Advanced Search search box.
Complex Search Ability: ALL, ANY, Phrase, and more. Also Boolean searching within results under Refine (powerful!).
Results Display: Must be downloaded and installed, but Basic version is free of charge. Table comparing versions.


METASEARCH ENGINES

Dogpile, Clusty, Ixquick, Mamma, MetaCrawler, Metor, Profusion, qbSearch, Surfwax, Vivisimo

Free search engine for your site?

For your website: Freefind, Atomz

For your desktop: Google Desktop 5 (search engine), Microsoft desktop search engine, Copernic Desktop Search Professional 3.1, Everything


Atomz Search

Add site search to your site in minutes.

1. Create an account.

2. Crawl your site.

3. Add search box to your site.


http://www.freefind.com/


Add a site search engine to your website

Easy to install:

1. Enter your website address

2. Enter your email address

3. Click the button. You're done!


Why so many search engines?

Because of different ….


Why so many search engines –

Different Coverage

They vary in coverage. In fact coverage is very much incomplete, with the largest search engine providing access to only a minor portion of the web.


Why so many search engines –

Different Search Capabilities

They have different tools and capabilities. Some have NEAR as an operator, some can search by different parameters, and so forth.


Why so many search engines –

Different Spiders or Crawlers

They have different spiders or crawlers indexing the web.

They go out at different intervals, they crawl to different depths (only the first page, the first three pages, or perhaps all pages), and the spiders differ in indexing techniques.


Why so many search engines –

Different Ways of Ranking

They differ in how they rank items for display after the items are retrieved.

Most rank on the basis of how many times the terms you search for are found or where they are found (more weight to higher placement) in the target websites.
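Ranking by how often and where the search terms appear can be sketched as a simple scoring function. The weights here are assumptions chosen for illustration (an occurrence among the first five words counts double), and the pages are made up. Actual engines use their own, more elaborate formulas.

```python
def score(text, terms, top_words=5, top_weight=2.0):
    """Add 1 for each occurrence of a search term, or top_weight if it appears near the top."""
    wanted = {term.lower() for term in terms}
    total = 0.0
    for position, word in enumerate(text.lower().split()):
        if word in wanted:
            total += top_weight if position < top_words else 1.0
    return total

def rank(pages, terms):
    """Return the names of matching pages, highest score first."""
    scored = [(score(text, terms), name) for name, text in pages.items()]
    return [name for value, name in sorted(scored, reverse=True) if value > 0]

# Hypothetical pages.
pages = {
    "p1": "dolphin rescue news from the coast where one dolphin was freed",
    "p2": "fishing report for the week mentions a dolphin near the harbor",
    "p3": "weather forecast for the coast",
}

print(rank(pages, ["dolphin"]))  # ['p1', 'p2']
```

Page p1 wins because it mentions the term twice and once at the very top, while p3 is dropped because it never mentions the term.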


Why so many search engines –

Different Ranking Protocols

They differ in their ranking protocols.

Google, for example, uses an algorithm that ranks output on the basis of the number of other websites that have linked to the websites your search retrieves.


Portal definitions ….

A portal is a Web site that is commonly used as a gateway to other Web sites.

(Source: http://www.searchenginewatch.com)


What is a Portal?

A portal is a client-server application (including web-based interface pages, related Java applets, configuration files, and Perl and C-CGI scripts) for use on an organization’s web server.

It is a set of support materials for target community members.

It is designed to facilitate substantive communication between members in the community(ies).


Boolean Search?

Boolean search is named after 19th century mathematician George Boole, who developed theories for working with sets of information.

Boolean search allows you to specify the relationships among your keywords and phrases.


What are Boolean search commands?

AND

OR

NOT

NEAR

NESTING


Boolean AND command

The Boolean AND command is used to require that all search terms be present on the web pages listed in results.

Your example command is?

Cats AND dogs


Boolean OR command

The Boolean OR command is used to allow any of the specified search terms to be present on the web pages listed in results.

Your example command is?

house OR home


Boolean NOT command

The Boolean NOT command is used to require that a particular search term NOT be present on web pages listed in results.

Examples:

Cats NOT dogs

canine NOT dog


Be careful using the NOT Boolean operator.

If you seek documents on the Mustang automobile, many of the documents retrieved might be about the mustang horse.

"Mustang NOT horse"?

What’s the problem? This search strategy would reject articles or websites that mentioned the term "horse power."
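The problem can be reproduced on three made-up documents. The matching below is a deliberately naive word-level test, written only to illustrate the pitfall. It does not represent how any particular search engine handles NOT.

```python
# Hypothetical documents: d1 is about the animal, d2 and d3 are about the car.
docs = {
    "d1": "the mustang is a free-roaming horse of the american west",
    "d2": "the ford mustang engine delivers more horse power this year",
    "d3": "ford mustang convertible review",
}

def matches(text, include, exclude):
    """True if the text contains the include word and does NOT contain the exclude word."""
    words = set(text.lower().split())
    return include in words and exclude not in words

hits = [doc_id for doc_id, text in docs.items()
        if matches(text, "mustang", "horse")]
print(hits)  # ['d3']
```

Document d2 is about the car, yet it is rejected simply because it mentions "horse power".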


Boolean Nesting command

Nesting ( ) allows you to build complex queries. You nest queries using parentheses.

Example: impeachment AND (clinton OR johnson)
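Since Boolean search works on sets of documents, the commands map directly onto set operations. In this sketch AND is intersection, OR is union, and NOT is difference. The tiny index of document IDs is hypothetical.

```python
# Hypothetical index: search term -> set of documents containing it.
index = {
    "impeachment": {"doc1", "doc2", "doc3"},
    "clinton": {"doc1", "doc4"},
    "johnson": {"doc2", "doc5"},
}

def postings(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return index.get(term.lower(), set())

# impeachment AND (clinton OR johnson)
nested = postings("impeachment") & (postings("clinton") | postings("johnson"))

# (impeachment AND clinton) OR johnson: same words, different grouping
ungrouped = (postings("impeachment") & postings("clinton")) | postings("johnson")

# impeachment NOT clinton
excluded = postings("impeachment") - postings("clinton")

print(sorted(nested))     # ['doc1', 'doc2']
print(sorted(ungrouped))  # ['doc1', 'doc2', 'doc5']
print(sorted(excluded))   # ['doc2', 'doc3']
```

The first two results differ, which is why the parentheses matter: without them, doc5 is returned even though it never mentions impeachment.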


Advanced Search - Google

1) Search with “quotes” for better phrase matching

2) Keyword + site:www.site.com - Search only a specific site

3) Keyword + site:www.site.com/folder/ - Search only a specific folder

4) intitle:keyword phrase – Search titles only for the keyword

5) Keyword + filetype:ppt/doc/mp3/pdf/etc – Search by file type

6) Keyword + site: + folder + filetype – Starting to see the power!

7) Keyword + site: + folder + filetype + DownThemAll + prefs = Research Powerhouse - Check a specific folder on a website for a specific file type, then show them all, and with one click download everything in the folder!
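Combining these operators is just a matter of joining them into one query string. The helper below is a hypothetical convenience function, not part of any Google tool, and the keyword "syllabus" is an invented example. The operators site:, intitle: and filetype: are the ones listed above.

```python
from urllib.parse import quote_plus

def build_query(keywords="", phrase=None, intitle=None, site=None, filetype=None):
    """Join keywords and optional operators into a single search string."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')          # exact phrase in quotes
    if keywords:
        parts.append(keywords)
    if intitle:
        parts.append(f"intitle:{intitle}")   # keyword must appear in the title
    if site:
        parts.append(f"site:{site}")         # limit to a site or folder
    if filetype:
        parts.append(f"filetype:{filetype}") # limit to a file type
    return " ".join(parts)

query = build_query("syllabus", site="www.site.com/folder/", filetype="pdf")
url = "https://www.google.com/search?q=" + quote_plus(query)

print(query)  # syllabus site:www.site.com/folder/ filetype:pdf
print(url)
```

The query string can be typed straight into the search box; the URL form is only needed when the search is launched from a link or a script.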

SEO

SEO = Search Engine Optimization

Using targeted keywords and phrases so a website’s pages will rank high on SERPs.

Note that SEO also stands for Search Engine Optimizer

SERP = Search Engine Results Page

Definition


What do you need to do before searching?

Find the focus of your question
Clarify the key concepts
Determine the key terms for the concepts
Prepare alternative terms to describe these concepts
Choose a way to start looking

Google search with “Index of/”

“Index of/” inurl:lib

1. index of mpeg4
3. index of mp3
4. index of cnki
5. index of rmvb
6. index of rm
7. index of movie
8. index of swf
9. index of jpg
10. index of admin
12. index of pdf
13. index of doc
14. index of wmv
15. index of mdb
16. index of mpg
17. index of mtv
18. index of software
19. index of mov
20. index of asf
23. index of lib
24. index of vod
25. index of rar
27. index of exe
28. index of iso
29. index of video
30. index of book
31. index of soft
32. index of chm
33. index of password
34. index of game
35. index of music
36. index of dvd
37. index of mid
38. index of ebook
40. index of download


Find an exact file you need

“index of/” MTV

“index of/” MPEG

“index of/” rmvb

...


Recall and Precision

Measurement of quality in search?

Recall & Precision?

There is normally an inverse relationship between recall and precision.

Recall is a measure of the proportion of relevant documents that are captured by a search formulation.

\[ \text{Recall} = \frac{\text{N of relevant retrieved docs}}{\text{N of relevant docs}} \]

For example, if you are searching a database with 100 articles dealing with dolphins caught by tuna fishermen and you retrieve only ten of the 100 because you searched only for the terms dolphin AND tuna, your recall would be ten percent.

You can improve your recall by finding more relevant terms and using the Boolean OR to increase the set. Thus porpoise OR dolphin would have better recall.


Precision assesses the purity of the output: the extent to which retrieved documents are relevant.

\[ \text{Precision} = \frac{\text{N of relevant retrieved docs}}{\text{N of retrieved docs}} \]

For example, if half the articles you retrieve are relevant, your precision is fifty percent.
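Both measures are simple ratios, so they can be checked with a few lines of code. The recall figures (10 relevant articles retrieved out of 100 relevant articles) come from the dolphin example. For precision, the total of 20 retrieved articles is an assumed number chosen so that half of them are relevant.

```python
def recall(relevant_retrieved, relevant_total):
    """Share of all relevant documents that the search captured."""
    return relevant_retrieved / relevant_total

def precision(relevant_retrieved, retrieved_total):
    """Share of the retrieved documents that are relevant."""
    return relevant_retrieved / retrieved_total

# 10 relevant articles retrieved out of 100 relevant articles in the database.
print(recall(10, 100))    # 0.1, i.e. ten percent

# Assumed: 20 articles retrieved in total, 10 of them relevant.
print(precision(10, 20))  # 0.5, i.e. fifty percent
```

Both ratios share the same numerator, the number of relevant documents retrieved, and differ only in what that number is compared against.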


Recall and Precision

Recall

The ratio or percentage of relevant documents you retrieve out of the total number of relevant postings in the database.

Precision

The ratio or percentage of the documents/postings your search retrieves that are relevant.

There is normally an inverse relationship between recall and precision.


Recall & Precision–Inverse Result

Recall - get more hits

Precision - get exactly what you need

There is normally an inverse relationship between recall and precision. This means as you increase one, the other declines.
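The trade-off can be seen on a small made-up collection in which each document is marked relevant or not. A narrow AND query is compared with a broad OR query. The documents, the relevance judgments, and the two queries are all assumptions for illustration.

```python
# Hypothetical collection: doc id -> (text, is it relevant to the question?)
collection = {
    "d1": ("dolphin tuna nets", True),
    "d2": ("porpoise caught by fishermen", True),
    "d3": ("dolphin bycatch report", True),
    "d4": ("tuna bycatch of porpoise", True),
    "d5": ("tuna salad recipe", False),
    "d6": ("dolphin show tickets", False),
}

def evaluate(query):
    """Run a query (a function over a set of words) and return (recall, precision)."""
    retrieved = {doc for doc, (text, _) in collection.items()
                 if query(set(text.split()))}
    relevant = {doc for doc, (_, is_relevant) in collection.items() if is_relevant}
    hits = retrieved & relevant
    recall = len(hits) / len(relevant)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

def narrow(words):
    # dolphin AND tuna
    return "dolphin" in words and "tuna" in words

def broad(words):
    # dolphin OR porpoise OR tuna
    return bool(words & {"dolphin", "porpoise", "tuna"})

narrow_recall, narrow_precision = evaluate(narrow)
broad_recall, broad_precision = evaluate(broad)

print("AND query:", narrow_recall, narrow_precision)
print("OR query: ", broad_recall, round(broad_precision, 2))
```

The AND query returns only relevant material but misses most of it, while the OR query finds everything relevant at the cost of pulling in the recipe and the show tickets.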


How to become a professional searcher?


Questions?