Prof. Panos Ipeirotis Search and the New Economy Session 1 Basics of Web Search Engines


Post on 26-Feb-2016


Page 1: Search and the New Economy Session 1 Basics of Web Search Engines

Prof. Panos Ipeirotis

Search and the New Economy

Session 1

Basics of Web Search Engines

Page 2: Search and the New Economy Session 1 Basics of Web Search Engines

Who am I?

• Prof. Panagiotis Ipeirotis (a.k.a. Panos)
– Email: [email protected]
– AIM: ipeirotis
– Office: KMC 8-84
– (see “Staff Information” on Blackboard)

• Joined Stern in 2004, “A Computer Scientist in a Business School”

• Research in web mining and in data integration
– EconoMining Project:
  • Is there positive buzz about iPod Touch? Which characteristic would customers pay the most for?
  • Which seller on eBay has a reputation for delivering fast? How much more can the merchant charge and still make a sale?
– Web Searching:
  • What mergers and acquisitions took place in 2007?

Page 3: Search and the New Economy Session 1 Basics of Web Search Engines

Who are you?

http://pages.stern.nyu.edu/~panos/teaching/W08.html

Mixture of marketing / media + technology backgrounds

70% have LinkedIn accounts

Penetration of Facebook accounts less obvious

Page 4: Search and the New Economy Session 1 Basics of Web Search Engines

Class days, time, place

• KMC 3-120
• Tuesday Jan 29 (6pm-9pm)
• Thursday Jan 31 (6pm-9pm)
• Sunday Feb 3 (9am-12n)
• Sunday Feb 3 (1pm-4pm)
• Tuesday Feb 5 (6pm-9pm)
• Thursday Feb 7 (6pm-9pm)

Course Overview

Teaching assistant

• Nikolay Archak ([email protected])

Page 5: Search and the New Economy Session 1 Basics of Web Search Engines

Course Overview

Class requirements

• 6 Assignments

• One take-home final exam or a project
• Both due on February 14th

• Submit your proposal for a project by February 1st

Page 6: Search and the New Economy Session 1 Basics of Web Search Engines

Blackboard

• http://sternclasses.nyu.edu/
• Use your Stern username and password
• Confirm that you can access the course as soon as possible

• Information about your classroom colleagues
• All readings
• All assignment descriptions
• All assignment submissions (well, almost)
• All online discussions
• Grades, announcements, exam guidelines, stock tips…

Course Overview

Page 7: Search and the New Economy Session 1 Basics of Web Search Engines

Key Objectives of Course

A. Understand the technology behind “search” (Jan 29)

How do search engines discover and rank web pages? How can we identify issues and opportunities in a web site? (Mainly lecture-based)

B. Understand search engine advertising (Jan 31)

Advertising on the web: banner ads, contextual ads, keyword ads. Optimizing a website for organic and paid search (Lecture + example discussion)

C. Harnessing the wisdom of the crowds (Feb 3)

Leveraging social networks for marketing, blog analysis, opinion mining and buzz tracking, long tail and recommender/reputation systems, prediction markets and wikis (Lecture + Case Discussion, focus on cases)

D. Data ownership issues (Feb 5)

Who owns your data? Privacy threats, the changing face of intellectual property (Case presentation + discussion)

At its core: A hands-on, “how-to mentality” class

Page 8: Search and the New Economy Session 1 Basics of Web Search Engines

Questions?

Page 9: Search and the New Economy Session 1 Basics of Web Search Engines

Objectives of today’s class

1. Understand the disruptive power of information

2. Learn how information is stored on the Web

3. Learn how search engines discover and rank information

4. Learn how users search for information (Analytics)

Page 10: Search and the New Economy Session 1 Basics of Web Search Engines

Information is ubiquitous

How has IT changed these industries?

• Telephony
• Newspapers
• Music
• Radio
• Advertising
• Banking
• Email
• Travel
• Video/TV
• Retail / POS
• Stock Market
• Manufacturing

Page 11: Search and the New Economy Session 1 Basics of Web Search Engines

Information technology is ubiquitous

• Telephony
• Newspapers
• Music
• Radio
• Advertising
• Banking
• Email
• Travel
• Video/TV
• Retail / POS
• Stock Market
• Manufacturing

What do all disruptive changes have in common?

Page 12: Search and the New Economy Session 1 Basics of Web Search Engines

Key concepts

1. Digitization

2. Information Asymmetries
– At the root of every disruption caused by search technologies
– “Web search” is only part of the equation

Google's mission is to organize the world's information and make it universally accessible and useful

Page 13: Search and the New Economy Session 1 Basics of Web Search Engines

Objectives of today’s class

1. Understand the disruptive power of information

2. Learn how information is stored on the Web

3. Learn how search engines discover and rank information

4. Learn how users search for information

Page 14: Search and the New Economy Session 1 Basics of Web Search Engines

In Assignment 1 you created a website

• Can you find it on Google?
– If yes, how?
– If no, why not?

Page 15: Search and the New Economy Session 1 Basics of Web Search Engines

Why is this important?

Search Engines Influence Consumers

Page 16: Search and the New Economy Session 1 Basics of Web Search Engines

Slide adapted from Marti Hearst, Lew & Davis

Let’s cover the basics

• Internet and Web are not synonymous
• Internet is a global communication network connecting millions of computers
• World Wide Web (WWW) is one component of the Internet, along with e-mail, chat, etc.

Internet vs. WWW

Page 17: Search and the New Economy Session 1 Basics of Web Search Engines

How Does the WWW Work?

• You created a web page index.html for the class on your PC

• Then you copy the page to a directory /sne/w08/ on the NYU computer that runs a “web server”

• The computer’s name is “homepages.nyu.edu”

Web server

Page 18: Search and the New Economy Session 1 Basics of Web Search Engines

Reading a URL

http://homepages.nyu.edu/sne/w08/index.html

http:// = HyperText Transfer Protocol (i.e., Web)
homepages = service name (often is www)
.nyu = domain name
.edu = top level domain
sne/ = directory name
w08/ = directory name
index.html = file name of web page
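The same breakdown can be reproduced programmatically. The following is a minimal sketch using Python's standard `urllib.parse` module; the variable names (`service`, `domain`, `tld`, `directories`, `filename`) are chosen here to mirror the slide's labels and are not part of any API.

```python
from urllib.parse import urlsplit

url = "http://homepages.nyu.edu/sne/w08/index.html"
parts = urlsplit(url)

scheme = parts.scheme                       # protocol, e.g. "http"
host_labels = parts.hostname.split(".")     # host name split on dots
path_segments = [s for s in parts.path.split("/") if s]

# This unpacking assumes a three-label host name, as in the slide's example.
service, domain, tld = host_labels
# Everything before the last path segment is a directory; the last is the file.
*directories, filename = path_segments

print("protocol:   ", scheme)
print("service:    ", service)
print("domain:     ", domain)
print("top level:  ", tld)
print("directories:", directories)
print("file name:  ", filename)
```

Host names with more or fewer than three labels would need different handling; the three-way unpacking is only meant to match this one URL.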

Page 19: Search and the New Economy Session 1 Basics of Web Search Engines

Publishing on the Web

1. You create the web page on your computer

[Diagram: NYU Student, Internet, NYU Web Server, Random Web User]

Page 20: Search and the New Economy Session 1 Basics of Web Search Engines

Publishing on the Web

2. You send the files to the NYU Web server

[Diagram: the NYU Student sends the files to the NYU Web Server via FTP; Internet; Random Web User]

Page 21: Search and the New Economy Session 1 Basics of Web Search Engines

Publishing on the Web

3. A web user requests your home page URL

[Diagram: the Random Web User sends an http request over the Internet to the NYU Web Server; NYU Student]

Page 22: Search and the New Economy Session 1 Basics of Web Search Engines

Publishing on the Web

4. The NYU Web server serves up your page

[Diagram: the NYU Web Server sends an http response over the Internet to the Random Web User (the client); Stern Student]
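The four publishing steps can be tried out on a single machine. The sketch below is an illustration under stated assumptions, not the NYU setup: it writes the page into a temporary directory instead of uploading it over FTP, uses Python's built-in `http.server` as a stand-in web server on the local loopback address, and then plays the role of the web user by sending an HTTP request. The function name `publish_and_fetch` is hypothetical.

```python
import functools
import pathlib
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def publish_and_fetch(html):
    """Save html as sne/w08/index.html, serve it locally, and fetch it over HTTP."""
    with tempfile.TemporaryDirectory() as root:
        # Steps 1-2: create the page and place it in the server's directory
        # (a local file copy stands in for the FTP upload).
        page_dir = pathlib.Path(root, "sne", "w08")
        page_dir.mkdir(parents=True)
        (page_dir / "index.html").write_text(html, encoding="utf-8")

        # A stand-in web server; port 0 lets the OS pick a free port.
        handler = functools.partial(SimpleHTTPRequestHandler, directory=root)
        server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
        port = server.server_address[1]
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            # Step 3: the web user requests the URL (proxies bypassed for localhost).
            opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
            url = f"http://127.0.0.1:{port}/sne/w08/index.html"
            # Step 4: the server answers with an HTTP response.
            with opener.open(url) as response:
                return response.status, response.read().decode("utf-8")
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    status, body = publish_and_fetch("<h1>My class page</h1>")
    print(status, body)
```

A status of 200 means the server found the file and returned it in the response body.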

Page 23: Search and the New Economy Session 1 Basics of Web Search Engines

Information on the Web

When anyone can publish, how do we find what we need?
• The information is spread across multiple autonomous computers
• With millions of choices, how do we find what we need?

[Diagram: the Internet, marked with a question mark]

Page 24: Search and the New Economy Session 1 Basics of Web Search Engines

Objectives of today’s class

1. Understand the disruptive power of information

2. Learn how information is stored on the Web

3. Learn how search engines discover and rank information

4. Learn how users search for information

Page 25: Search and the New Economy Session 1 Basics of Web Search Engines

How Search Engines Work

Three main parts:

i. Gather the contents of all web pages (using a program called a crawler or spider)

ii. Organize the contents of the pages in a way that allows efficient retrieval (indexing)

iii. Take in a query, determine which pages match, and show the results (ranking and display of results)

Page 26: Search and the New Economy Session 1 Basics of Web Search Engines

How do Search Engines Discover Information?

• How do crawlers find web pages?
– Start with a list of domain names, visit the home pages there.
– Look at the hyperlinks on the home page, and follow those links to more pages.
– Keep a list of URLs visited, and those still to be visited.
– Each time the program loads a new HTML page, add the links in that page to the list to be crawled.
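The crawl loop described above can be sketched in a few lines. To keep the example runnable without network access, it crawls a made-up, in-memory "web" (the `PAGES` dictionary and the `example` host names are hypothetical); a real crawler would download each URL instead of looking it up.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical miniature "web": URL -> HTML of that page.
PAGES = {
    "http://a.example/": '<a href="/about.html">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about.html": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/about.html">A, about</a>',
    "http://c.example/": "<p>No page links here, so the crawler never finds it.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds):
    to_visit = deque(seeds)   # URLs still to be visited
    seen = set(seeds)         # URLs already queued or visited
    visited = []              # URLs visited, in crawl order
    while to_visit:
        url = to_visit.popleft()
        html = PAGES.get(url)
        if html is None:      # page does not exist in our toy web
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


print(crawl(["http://a.example/"]))
```

Note that `http://c.example/` is never visited: no crawled page links to it, which is one reason a newly created site may not show up in a search engine.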

Page 27: Search and the New Economy Session 1 Basics of Web Search Engines

Standard Web Search Engine Architecture

[Diagram: crawler machines send discovered pages to the mothership (Google document storage). From the stored documents an “inverted index” is created: for each word, the pages that contain the word. Search engine servers take the user query, consult the inverted index, and show results to the user.]
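An inverted index of the kind named in the diagram can be sketched as a mapping from each word to the set of pages that contain it. The page URLs and texts below are invented for illustration, and the query logic (return pages containing every query word) is a simplifying assumption rather than a description of any real engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents, for illustration only.
pages = {
    "http://a.example/": "Basics of web search engines",
    "http://b.example/": "Search engine advertising on the web",
    "http://c.example/": "Prediction markets and wikis",
}
index = build_inverted_index(pages)
print(sorted(search(index, "web search")))
```

Looking up a word is a single dictionary access, which is why the index is built once, ahead of time, rather than scanning every page for every query.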

Page 28: Search and the New Economy Session 1 Basics of Web Search Engines

Crawler behavior varies

• Parts of a web page that are indexed
– Until recently, only the first few parts of the page were retrieved/stored

• How deeply a site is indexed
– Google/Yahoo/MSN get only the top few levels

• How frequently the site is crawled
– Can be a few minutes (news), hours (blogs), days, or weeks (my site)

What are the implications?

Page 29: Search and the New Economy Session 1 Basics of Web Search Engines

Indexing

Record the following information about each page

• List of words
– Is the word in the title?
– How far down in the page?
– Was the word in boldface?

• URLs of pages pointing to this one

• Anchor text on pages pointing to this one

• …many other “secret ingredients”

Page 30: Search and the New Economy Session 1 Basics of Web Search Engines

The importance of anchor text

<a href=http://behind-the-enemy-lines …>An MBA course the way it should be</a>

The anchor text summarizes what the website is about.

(This also gives rise to the “Google bombing” phenomenon: http://en.wikipedia.org/wiki/Google_bomb)

<a href=http://behind-the-enemy-lines…>Finally, another course on prediction markets</a>
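One way to picture how anchor text gets recorded is to collect, for every link, the text between `<a>` and `</a>` and file it under the destination URL. The sketch below does this with Python's standard `html.parser`; the `http://course.example/` URL is a placeholder (the slide's own URL is truncated), and `AnchorCollector` is a hypothetical helper name.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Groups anchor text by the URL each link points to."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)   # destination URL -> anchor texts
        self._href = None                  # destination of the link being read
        self._text = []                    # text pieces inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None


# Two hypothetical pages linking to the same placeholder destination.
linking_pages = [
    '<p><a href="http://course.example/">An MBA course the way it should be</a></p>',
    '<p><a href="http://course.example/">Finally, another course on prediction markets</a></p>',
]

collector = AnchorCollector()
for html in linking_pages:
    collector.feed(html)

print(dict(collector.anchors))
```

The destination page is described by words that appear only on other people's pages, which is what makes Google bombing possible.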

Page 31: Search and the New Economy Session 1 Basics of Web Search Engines

Text-based retrieval is not enough

• So far, we examined how text is used for retrieving pages

• However, text alone is not enough. Why?

Page 32: Search and the New Economy Session 1 Basics of Web Search Engines

Measuring Importance of Linking

PageRank Algorithm

• Idea: important pages are pointed to by other important pages

• Method:
– Each link from one page to another is counted as a “vote” for the destination page
  • The number of incoming links is important!
  • But it is not enough!
– But each “vote” is different! PageRank places more importance on votes that come from pages with a large number of votes (and so on, and so on)

• Compare, for example, the circled page in cases A and B

[Figure: two link graphs, labeled A and B, each with a circled page]

Page 33: Search and the New Economy Session 1 Basics of Web Search Engines

Computing PageRank – don’t need to ‘know’

[Diagram: four “People who bought this also bought…” pages. Page A links to Pages B, C, and D; Page B links to Pages A and C; Page C links to Page A; Page D links to Page C.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

(ignoring damping factor for illustration)

Page 34: Search and the New Economy Session 1 Basics of Web Search Engines

Computing PageRank

[Diagram: four “People who bought this also bought…” pages. Page A links to Pages B, C, and D; Page B links to Pages A and C; Page C links to Page A; Page D links to Page C.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

Page 35: Search and the New Economy Session 1 Basics of Web Search Engines

PageRank

[Diagram: the four-page link graph (A links to B, C, and D; B links to A and C; C links to A; D links to C). Every page starts with PageRank .250.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

Page 36: Search and the New Economy Session 1 Basics of Web Search Engines

PageRank

[Diagram: the four-page link graph; every page has PageRank .250. Each page splits its PageRank evenly across its outgoing links: A sends .250/3 to each of B, C, and D; B sends .250/2 to each of A and C; C sends .250 to A; D sends .250 to C.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

Page 37: Search and the New Economy Session 1 Basics of Web Search Engines

PageRank

[Diagram: the four-page link graph. The votes .250/3 (from A to each of B, C, and D), .250/2 (from B to each of A and C), .250 (from C to A), and .250 (from D to C) are summed at each destination. New values: A = .375, B = .083, C = .458, D = .083.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

Page 38: Search and the New Economy Session 1 Basics of Web Search Engines

PageRank

[Diagram: the four-page link graph with values A = .375, B = .083, C = .458, D = .083. Next round of votes: A sends .375/3 to each of B, C, and D; B sends .083/2 to each of A and C; C sends .458 to A; D sends .083 to C.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

Page 39: Search and the New Economy Session 1 Basics of Web Search Engines

PageRank

[Diagram: the four-page link graph. The votes .375/3 (from A to each of B, C, and D), .083/2 (from B to each of A and C), .458 (from C to A), and .083 (from D to C) are summed at each destination. New values: A = .500, B = .125, C = .250, D = .125.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$

Page 40: Search and the New Economy Session 1 Basics of Web Search Engines

PageRank

[Diagram: the four-page link graph with values A = .400, B = .133, C = .333, D = .133. Votes: A sends .400/3 to each of B, C, and D; B sends .133/2 to each of A and C; C sends .333 to A; D sends .133 to C.]

$$PR(i) = \sum_{j \in G(i)} \frac{PR(j)}{OutDegree(j)}$$
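The worked example can be replayed in code. The sketch below assumes the link structure that the vote labels imply (A links to B, C, D; B links to A, C; C links to A; D links to C), reads G(i) as the set of pages linking to page i, and, like the slides, ignores the damping factor. The function names are hypothetical.

```python
# Link graph assumed from the worked example: page -> pages it links to.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank_step(ranks, links):
    """One round of voting: each page splits its rank evenly over its out-links."""
    new_ranks = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)   # PR(j) / OutDegree(j)
        for target in outlinks:
            new_ranks[target] += share
    return new_ranks


def pagerank(links, iterations=100):
    """Start from equal ranks and repeat the voting round (no damping factor)."""
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links)
    return ranks


start = {page: 0.25 for page in LINKS}
first = pagerank_step(start, LINKS)
second = pagerank_step(first, LINKS)
final = pagerank(LINKS)

for label, ranks in [("round 1", first), ("round 2", second), ("final", final)]:
    print(label, {page: round(value, 3) for page, value in ranks.items()})
```

After the first round the values are .375, .083, .458, .083 for A, B, C, D, matching the slides. On this particular graph the repeated rounds settle near .400, .133, .333, .133; without a damping factor that kind of convergence is not guaranteed for every graph.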

Page 41: Search and the New Economy Session 1 Basics of Web Search Engines

How PageRank is used

1. Locate the pages that contain the query text
2. Weight the “text score” with the “link score”
3. Rank results

Lesson: PageRank of competitors matters!

Do not obsess (only) about your PageRank
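The three steps can be sketched as follows. The slides do not say how the text score and the link score are combined, so multiplying them is purely an assumption made for this illustration; the candidate pages and their scores are also invented.

```python
def rank_results(candidates, query_words):
    """Filter pages containing all query words, then sort by combined score."""
    # Step 1: locate the pages that contain the query text.
    matches = [
        page for page in candidates
        if all(word in page["words"] for word in query_words)
    ]
    # Step 2: weight the text score with the link score.
    # ASSUMPTION: a simple product; real engines do not publish their formula.
    def combined(page):
        return page["text_score"] * page["link_score"]
    # Step 3: rank results, best first.
    return sorted(matches, key=combined, reverse=True)


# Hypothetical candidates: "yours" has the better text match,
# but the competitor has a much higher link score.
candidates = [
    {"url": "yours.example", "words": {"london", "hotels"},
     "text_score": 0.9, "link_score": 0.10},
    {"url": "competitor.example", "words": {"london", "hotels", "cheap"},
     "text_score": 0.6, "link_score": 0.30},
    {"url": "unrelated.example", "words": {"paris", "hotels"},
     "text_score": 0.8, "link_score": 0.50},
]

ranking = rank_results(candidates, ["london", "hotels"])
print([page["url"] for page in ranking])
```

Because the ordering is relative, a page's position depends on the scores of the other matching pages as much as on its own, which is the slide's point about competitors.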

Page 42: Search and the New Economy Session 1 Basics of Web Search Engines

Cool! Let’s Get some PageRank

• Obvious incentives to game the system

• Or at least to speed up the process of going up in the results

Page 43: Search and the New Economy Session 1 Basics of Web Search Engines

A few spam technologies

• Cloaking
– Serve fake content to search engine robot
– DNS cloaking: Switch IP address. Impersonate

• Doorway pages
– Pages optimized for a single keyword that redirect to the real target page (typically get real content from legitimate pages and synthesize)

• Keyword Spam
– Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
– Hidden text with colors, CSS tricks, etc.

[Diagram: Cloaking. “Is this a search engine spider?” If yes (Y), serve the fake document; if no (N), serve the spam page.]

Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, Pageing, reservation, sex, mp3, britney spears, viagra, …”

Page 44: Search and the New Economy Session 1 Basics of Web Search Engines

Gaming PageRank: Link spam

• Link spam: Inflating the rank of a page by creating nepotistic links to it
– From own sites: Link farms
– From partner sites: Link exchanges
– From unaffiliated sites (e.g. blogs, guest books, web forums, etc.)

• The more links, the better
– Generate links automatically
– Use scripts to post to blogs
– Synthesize entire web sites
– Synthesize many web sites (DNS spam)

• The more important the linking page, the better
– Buy expired highly-ranked domains
– Post links to high-quality blogs

Page 45: Search and the New Economy Session 1 Basics of Web Search Engines

Gaming PageRank and Trust

TrustRank Algorithm

• Initial votes come only from trusted pages

• Compare, for example, the cases for the circled page in cases A and B

• The main reason behind the initial success of Google

• Get links from trusted, quality sites!

[Figure: two link graphs, labeled A and B, each with a circled page; labels: NYU student, MIT student, links from untrusted sources]

Page 46: Search and the New Economy Session 1 Basics of Web Search Engines

Other ranking factors

• Location, Location, Location...and Frequency
– Query words in title, or in first few sentences
– The more frequent the query words, the better

• Clickthrough measurement
– How often users click on your URL, when they see it
– How long do they stay (using toolbars!)
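The "location and frequency" idea can be made concrete with a toy text score. Everything numeric here is an assumption for illustration: the weights (`title_weight`, `early_weight`), the 20-word cutoff for "early on the page", and the function name `text_score` are hypothetical and not taken from any search engine. Clickthrough signals are left out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def text_score(title, body, query, title_weight=3.0, early_weight=2.0, early_words=20):
    """Score a page: query words count more in the title and early in the body."""
    title_words = tokenize(title)
    body_words = tokenize(body)
    early, late = body_words[:early_words], body_words[early_words:]
    score = 0.0
    for word in tokenize(query):
        score += title_weight * title_words.count(word)   # location: title
        score += early_weight * early.count(word)         # location: early on page
        score += 1.0 * late.count(word)                   # frequency elsewhere
    return score


# Hypothetical pages: same body text, different titles.
with_title = text_score("London hotels", "Cheap hotels in London", "hotels")
without_title = text_score("Home", "Cheap hotels in London", "hotels")
print(with_title, without_title)
```

A scheme that rewards raw frequency this naively is exactly what keyword spam exploits, which is why engines combine it with link-based and other signals.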

Page 47: Search and the New Economy Session 1 Basics of Web Search Engines

How to rank high in the results

• Position your keywords (title, headings, early on page)
• Make text visible (no tiny fonts, no white-on-white)
• “Alt text” for images: Accessibility + search engines
• Frames can kill (Flash and AJAX are also problematic)

• Have relevant content
• Do not change topics
• Build links (nice to build a real community)
• Just say no to search engine spamming

• Submit your key pages
• Verify your listing often

Page 48: Search and the New Economy Session 1 Basics of Web Search Engines

Objectives of today’s class

1. Understand the disruptive power of information

2. Learn how information is stored on the Web

3. Learn how search engines discover and rank information

4. Learn how users search for information (after the break)