1 information management on the world-wide web junghoo “john” cho ucla computer science

1

Information Management on the World-Wide Web

Junghoo “John” Cho

UCLA Computer Science

2

The Web and Information Galore

3

10 Years Ago

Reading papers for research
– Stacks of papers
– Long wait

4

With the Web

5

Challenges (1)

Information overload
– Too much information, too little time

6

Information Overload

“XML” to Google
– 14 million matching documents!

“XML” to Amazon
– 464 matching books!

Which one to read?

7


Hidden Web

– Not indexed by search engines
– “Hidden” from an average user
– Browse every site manually?

…

8


Transience

9


Scattered & unstructured data
– All Computer Science faculty members and graduate students in the US?

10

Projects In Our Group

Web Archive
Hidden Web Integration
Page Ranking Algorithm
User Recommendation System

11

User Recommendation System

464 books on XML
Which one to read?
– The one that my colleagues and friends recommend?

12

Amazon’s Recommendation System

1 – 5 star rating by individual users
Books can be sorted by “average user rating”

13

My Typical Scenario

Sort books by their average user rating
Browse top 20 books to decide what to read

14

Questions

Is “5 star” by one user better than “4.9 star” by 100 users?
– Intuitively, I prefer 4.9 star by 100 users
– More “reliable” rating

How much can I trust the rating of a particular person?
– How do I know that the person’s rating is reliable?
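The first question can be made concrete with a small sketch. The book names and rating lists below are hypothetical; the “4.9 star” book is approximated by 90 five-star and 10 four-star ratings. Sorting by plain average puts the single-rating book on top, which is the behavior the question challenges.

```python
from statistics import mean

# Hypothetical data: one book with a single 5-star rating, and one book
# whose 100 ratings average 4.9 (90 five-star plus 10 four-star ratings).
ratings = {
    "book_a": [5],
    "book_b": [5] * 90 + [4] * 10,
}

def sort_by_average(ratings_by_book):
    """Return book ids ordered by mean rating, highest first."""
    return sorted(
        ratings_by_book,
        key=lambda book: mean(ratings_by_book[book]),
        reverse=True,
    )

ranking = sort_by_average(ratings)
print(ranking)
```

The plain average ignores how many users stand behind each number, so it cannot express that the second rating is more “reliable”.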

15

Our Approach

“Inherent quality” or “rating” of a book
– How many users would recommend the book (i.e., give it a high rating) if all users had read it?

More user ratings mean more information on the “quality” of the book
– An average user is likely to give a high rating to a high-quality book

16

Probabilistic Rating Model

How likely is it that the book has a “4 star rating”?
– Rating probability distribution

[Figure: probability density (y-axis, 0 to 1) versus book rating/quality (x-axis, 1 to 5)]

17

Update of Rating Probability

As more users provide ratings, we update our probability distribution

[Figure: probability density (y-axis, 0 to 1) versus book rating/quality (x-axis, 1 to 5)]

18



[Figure: probability density (y-axis, 0 to 1) versus book rating/quality (x-axis, 1 to 5), after a five-star rating by a user]

19



[Figure: probability density (y-axis, 0 to 1) versus book rating/quality (x-axis, 1 to 5), after a one-star rating by a user]

20



[Figure: probability density (y-axis, 0 to 1) versus book rating/quality (x-axis, 1 to 5), after many ratings]

21

Bayesian Inference Theory

Given a user rating UR, what is the inherent rating IR?

\[ P(IR \mid UR) = \frac{P(UR \mid IR)\, P(IR)}{P(UR)} \]

P(IR): probability of book rating BEFORE user rating
P(IR | UR): probability of book rating AFTER user rating
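A minimal sketch of one such update, under assumptions the slides do not state: the inherent rating takes one of five discrete levels, the prior is uniform, and the likelihood P(UR | IR) decays exponentially with the distance between the user rating and the inherent rating. The names `LEVELS`, `likelihood`, and `bayes_update` are illustrative only.

```python
import math

LEVELS = [1, 2, 3, 4, 5]  # assumed discrete values for the inherent rating IR

def likelihood(user_rating, inherent_rating):
    """Assumed P(UR | IR): ratings close to the inherent rating are more likely."""
    total = sum(math.exp(-abs(r - inherent_rating)) for r in LEVELS)
    return math.exp(-abs(user_rating - inherent_rating)) / total

def bayes_update(prior, user_rating):
    """Return P(IR | UR) computed from P(IR) with Bayes' rule."""
    unnormalized = {ir: likelihood(user_rating, ir) * p for ir, p in prior.items()}
    evidence = sum(unnormalized.values())  # this is P(UR)
    return {ir: value / evidence for ir, value in unnormalized.items()}

prior = {ir: 1 / len(LEVELS) for ir in LEVELS}  # uniform prior, an assumption
after_five_star = bayes_update(prior, 5)
after_one_star = bayes_update(prior, 1)
print(after_five_star)
print(after_one_star)
```

With this assumed likelihood, a five-star rating shifts probability mass toward the high end of the scale and a one-star rating shifts it toward the low end.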

22

User Model

The characteristics of a user

Sensitivity: Slope of the curve
+1: good, –1: bad, 0: not useful

[Figure: two plots of user rating (y-axis, 1 to 5) versus book quality (x-axis, 1 to 5), labeled “Good” and “Bad”]

23

User Model

The characteristics of a user

Bias: Average “height” of the curve

[Figure: two plots of user rating (y-axis, 1 to 5) versus book quality (x-axis, 1 to 5), labeled “Positive bias” and “Negative bias”]
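One way to read the two user characteristics is as the slope and vertical offset of a straight line fitted through a user's ratings against book quality. The sketch below takes that reading as an assumption: sensitivity is the least-squares slope, and bias is the user's mean rating minus the mean quality of the books rated. The function name and the sample data are hypothetical.

```python
from statistics import mean

def user_sensitivity_and_bias(qualities, ratings):
    """Estimate (sensitivity, bias) for one user.

    sensitivity: least-squares slope of rating versus book quality (assumed definition)
    bias: mean rating minus mean book quality (assumed definition)
    """
    q_mean, r_mean = mean(qualities), mean(ratings)
    var_q = sum((q - q_mean) ** 2 for q in qualities)
    cov = sum((q - q_mean) * (r - r_mean) for q, r in zip(qualities, ratings))
    sensitivity = cov / var_q if var_q else 0.0
    bias = r_mean - q_mean
    return sensitivity, bias

qualities = [1, 2, 3, 4, 5]
good_user = user_sensitivity_and_bias(qualities, [1, 2, 3, 4, 5])
bad_user = user_sensitivity_and_bias(qualities, [5, 4, 3, 2, 1])
flat_user = user_sensitivity_and_bias(qualities, [3, 3, 3, 3, 3])
generous_user = user_sensitivity_and_bias([1, 2, 3, 4], [2, 3, 4, 5])
print(good_user, bad_user, flat_user, generous_user)
```

Under these definitions a user whose ratings track quality has sensitivity +1, a user who rates in reverse has –1, and a user who always gives the same rating has 0, matching the scale on the sensitivity slide.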

24

Iterative Model Refinement

As more users rate a book, we get better estimates of book quality

As we estimate a book’s quality better, we get a better idea of a user’s sensitivity and bias
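The loop described above can be sketched as an alternating procedure. This is a simplified illustration, not the system's actual algorithm: it models only user bias (sensitivity is left out), uses plain averages in place of probability distributions, and runs a fixed number of iterations. The function `refine` and the sample ratings are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def refine(ratings, iterations=50):
    """Alternate between book-quality and user-bias estimates.

    ratings maps (user, book) -> star rating.
    """
    bias = defaultdict(float)  # every user starts with zero assumed bias
    quality = {}
    for _ in range(iterations):
        # Step 1: book quality = mean of the bias-corrected ratings it received.
        by_book = defaultdict(list)
        for (user, book), stars in ratings.items():
            by_book[book].append(stars - bias[user])
        quality = {book: mean(values) for book, values in by_book.items()}
        # Step 2: user bias = mean gap between the user's ratings and book quality.
        by_user = defaultdict(list)
        for (user, book), stars in ratings.items():
            by_user[user].append(stars - quality[book])
        bias = defaultdict(float, {u: mean(values) for u, values in by_user.items()})
    return quality, dict(bias)

# Hypothetical data: u2 rates one star higher than u1 on the books both rated,
# and only u1 rated b3.
ratings = {
    ("u1", "b1"): 2, ("u1", "b2"): 3, ("u1", "b3"): 4,
    ("u2", "b1"): 3, ("u2", "b2"): 4,
}
quality, bias = refine(ratings)
print(quality, bias)
```

In this toy data the estimated bias gap between the two users settles at one star, and b3 ends up a full star above b2 even though only the stricter user rated it. Only relative bias is recoverable here: shifting every bias up and every quality down by the same constant fits the ratings equally well.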

25

Iterative Model Refinement

[Diagram: User-provided Rating, Book Rating Estimate, User Characteristics]

26

Final Recommendation

Recommend the book with the highest expected rating
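A sketch of this selection step, assuming each book's rating distribution is a discrete posterior over the levels 1 to 5, built from a uniform prior and an exponentially decaying likelihood. Both are illustrative assumptions, as are the book names and rating lists.

```python
import math

LEVELS = [1, 2, 3, 4, 5]  # assumed discrete values for the inherent rating

def likelihood(user_rating, inherent_rating):
    """Assumed P(UR | IR), decaying with distance from the inherent rating."""
    total = sum(math.exp(-abs(r - inherent_rating)) for r in LEVELS)
    return math.exp(-abs(user_rating - inherent_rating)) / total

def posterior_after(user_ratings):
    """Start from a uniform prior and apply Bayes' rule once per rating."""
    dist = {ir: 1 / len(LEVELS) for ir in LEVELS}
    for ur in user_ratings:
        dist = {ir: likelihood(ur, ir) * p for ir, p in dist.items()}
        norm = sum(dist.values())
        dist = {ir: p / norm for ir, p in dist.items()}
    return dist

def expected_rating(dist):
    """Mean of the rating distribution."""
    return sum(ir * p for ir, p in dist.items())

# Hypothetical books: a single 5-star rating versus 100 ratings averaging 4.9.
books = {
    "one_five_star": [5],
    "hundred_ratings": [5] * 90 + [4] * 10,
}
scores = {name: expected_rating(posterior_after(r)) for name, r in books.items()}
recommended = max(scores, key=scores.get)
print(scores, recommended)
```

With these assumed inputs the book with 100 ratings gets the higher expected rating, because a single rating leaves a noticeable share of probability on lower quality levels.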

27

Initial Results

Our system prefers a 4.9-star book by 100 people to a 5-star book by 1 user

If a user gives random ratings, the system ignores the user’s rating

More thorough evaluation on the way

28

Other Projects

Web Archive
Hidden Web Integration
Page Ranking Algorithm

29

Ph.D. Students on the Projects

Alex Ntoulas
Rob Adams
Victor Liu
– In Dr. Chu’s group

30

Thank You

Questions?
