Computational Journalism at Columbia, Fall 2013, Lecture 5: Hybrid Filtering

Upload: jonathan-stray

Post on 14-Apr-2018

  • 7/27/2019 Computational Journalism at Columbia, Fall 2013, Lecture 5: Hybrid Filtering

    1/24

    Frontiers of Computational Journalism

    Columbia Journalism School

    Week 5: Hybrid Filters

    October 2, 2013

    2/24

    Week 5: Hybrid Filtering

    Filtering Comments by Voting

    User-item recommendation systems

    General Hybrid Filters

    3/24

    Filtering Comments

    Thousands of comments, what are the good ones?

    4/24

    Comment voting

    Problem: putting comments with most votes at top doesn't work. Why?

    5/24

    Reddit Comment Ranking

    Hypothetically, suppose all N users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.

    N = 16

    v = 11

    p = 11/16 = 0.6875

    6/24

    Reddit Comment Ranking

    Actually, only n users out of N vote, giving an observed approximate proportion p' = v'/n, where v' is the number of upvotes among those n voters.

    n = 3

    v' = 1

    p' = 1/3 = 0.333

    7/24

    Reddit Comment Ranking

    Limited sampling can rank comments wrong when we don't have enough data.

    Comment 1: observed p' = 0.333, true p = 0.6875

    Comment 2: observed p' = 0.75, true p = 0.1875
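How often such a reversal happens can be estimated with a small calculation. The sketch below assumes each vote is an independent draw with the comment's true upvote probability, and that both comments receive the same number of votes n; both are simplifying assumptions, and the function names are illustrative rather than taken from the lecture.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k upvotes among n independent voters."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_misranked(p_good, p_bad, n):
    """Chance that the truly worse comment shows a strictly higher
    observed proportion than the better one, with n votes on each."""
    total = 0.0
    for good_votes in range(n + 1):
        for bad_votes in range(good_votes + 1, n + 1):
            total += binomial_pmf(good_votes, n, p_good) * binomial_pmf(bad_votes, n, p_bad)
    return total

# True proportions from the slides; the vote counts are assumed.
few = prob_misranked(0.6875, 0.1875, n=3)
many = prob_misranked(0.6875, 0.1875, n=30)
print(f"n=3: {few:.4f}   n=30: {many:.2e}")
```

With only a handful of votes the wrong order shows up a few percent of the time; with more votes it becomes very unlikely.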

    8/24

    Random error in sampling

    If we observe a proportion p' of upvotes from n random users, what is the distribution of the true proportion p?

    [Figure: distribution of p' when p = 0.5]
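The shape of that distribution can be reproduced numerically. This is a minimal sketch assuming n independent voters, so the number of upvotes is binomial; the sample size n = 10 is an arbitrary choice for illustration and is not taken from the slide.

```python
from math import comb

def observed_proportion_distribution(n, p):
    """Map each possible observed proportion k/n to its probability,
    assuming n independent votes with true upvote probability p."""
    return {k / n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

dist = observed_proportion_distribution(n=10, p=0.5)

# Crude text histogram: one '#' per percentage point of probability.
for proportion, probability in dist.items():
    print(f"{proportion:.1f}  {'#' * round(probability * 100)}")
```

The histogram peaks at the true value 0.5 but leaves noticeable probability on neighboring values, which is the random sampling error the slide refers to.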

    9/24

    Confidence interval

    Given the observed proportion, an interval that the true p has a given probability of lying inside.
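The transcript does not preserve which interval the lecture used. The sketch below assumes the Wilson score interval, which is widely described as the basis of Reddit's "best" comment sort; the default z = 1.96 (roughly 95% confidence) and the function name are assumptions for illustration.

```python
from math import sqrt

def wilson_interval(upvotes, n, z=1.96):
    """Wilson score interval for the true upvote proportion.

    upvotes: observed upvotes; n: total votes observed;
    z: normal quantile for the chosen confidence level (assumed 1.96).
    """
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion could be anything
    p_hat = upvotes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return (center - half_width, center + half_width)

for upvotes, n in [(1, 3), (3, 4), (110, 160)]:
    low, high = wilson_interval(upvotes, n)
    print(f"{upvotes}/{n}: [{low:.3f}, {high:.3f}]")
```

Sorting by the lower bound rather than the raw proportion means a comment with 3 of 4 upvotes no longer outranks one with 110 of 160, because the small sample yields a much wider interval.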

    10/24

    11/24

    Week 5: Hybrid Filtering

    Filtering Comments by Voting

    User-item recommendation systems

    General Hybrid Filters

    12/24

    User-item matrix

    Stores the rating of each user for each item. Could also be a binary variable that says whether the user clicked, liked, starred, shared, purchased...
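One way to hold such a matrix is a nested dictionary keyed by user and then item. The sketch below is only illustrative: the interaction log, the user and item names, and the helper functions are all made up.

```python
from collections import defaultdict

# Hypothetical interaction log of (user, item, rating) tuples.
events = [
    ("alice", "movie_a", 5),
    ("alice", "movie_b", 3),
    ("bob", "movie_a", 4),
    ("carol", "movie_c", 1),
]

def build_user_item_matrix(events):
    """Return {user: {item: rating}}; missing entries are simply absent."""
    matrix = defaultdict(dict)
    for user, item, rating in events:
        matrix[user][item] = rating
    return dict(matrix)

def to_binary(matrix):
    """Collapse ratings to 1, meaning "this user interacted with this item"."""
    return {user: {item: 1 for item in row} for user, row in matrix.items()}

ratings = build_user_item_matrix(events)
clicked = to_binary(ratings)
print(ratings["alice"])
print(clicked["alice"])
```

Storing only the known entries suits data where most user-item pairs have no rating at all.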

    13/24

    User-item matrix

    No content analysis. We know nothing about what is in each item.

    Typically very sparse: a user hasn't watched even 1% of all movies.

    The filtering problem is guessing an unknown entry in the matrix. High guessed values are things the user would want to see.
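A small sketch of what "sparse" and "unknown entry" mean in practice, using a toy matrix. The data and function names are hypothetical; real matrices are far sparser than this one.

```python
# Toy user-item matrix: {user: {item: rating}}; absent pairs are unknown.
ratings = {
    "alice": {"A": 5},
    "bob": {"B": 3, "C": 4},
    "carol": {"A": 2},
}

def all_items(ratings):
    """Every item that at least one user has rated."""
    return {item for row in ratings.values() for item in row}

def density(ratings):
    """Fraction of user-item pairs that actually have a rating."""
    known = sum(len(row) for row in ratings.values())
    return known / (len(ratings) * len(all_items(ratings)))

def unknown_entries(ratings):
    """The (user, item) pairs a filter would have to guess."""
    items = all_items(ratings)
    return sorted((user, item) for user, row in ratings.items()
                  for item in items if item not in row)

print(f"density: {density(ratings):.2f}")
print(unknown_entries(ratings))
```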

    14/24

    Filtering process

    15/24

    How to guess an unknown rating?

    Basic idea: suggest similar items.

    Similar items are rated in a similar way by many different users.

    Remember, a rating could be a click, a like, a purchase.

    "Users who bought A also bought B..."

    "Users who clicked A also clicked B..."

    "Users who shared A also shared B..."
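The "users who bought A also bought B" idea can be sketched as simple co-occurrence counting. The purchase data below is invented, and raw counts are only one possible score; nothing here is claimed to be how any particular site computes it.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: user -> set of items bought.
baskets = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "C"},
    "u4": {"B"},
}

def co_occurrence_counts(baskets):
    """Count, for each ordered item pair, how many users have both."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def also_bought(item, counts):
    """Other items ranked by how many buyers they share with `item`."""
    pairs = [(other, c) for (first, other), c in counts.items() if first == item]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

counts = co_occurrence_counts(baskets)
print(also_bought("B", counts))
```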

    16/24

    Similar items

    17/24

    Item similarity

    Cosine similarity!
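A minimal sketch of item-to-item cosine similarity, where each item is represented by the vector of ratings it received from every user. It assumes a missing rating counts as zero, which is a common simplification but not something the slides specify; the data and names are illustrative.

```python
from math import sqrt

# Toy user-item matrix: {user: {item: rating}}.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 4, "B": 5, "C": 1},
    "carol": {"A": 1, "C": 5},
}

def item_vector(ratings, item):
    """Sparse column of the matrix: user -> rating for this item."""
    return {user: row[item] for user, row in ratings.items() if item in row}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (missing = 0)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

sim_ab = cosine_similarity(item_vector(ratings, "A"), item_vector(ratings, "B"))
sim_ac = cosine_similarity(item_vector(ratings, "A"), item_vector(ratings, "C"))
print(f"A~B: {sim_ab:.3f}  A~C: {sim_ac:.3f}")
```

Items A and B are rated alike by the same users and score close to 1, while A and C are rated in opposite ways and score much lower.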

    18/24

    Other distance measures

    Adjusted cosine similarity

    Subtracts the average rating for each user, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great").
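A sketch of adjusted cosine similarity under one common formulation: only users who rated both items contribute, and each user's mean is taken over all items that user rated. The two raters below are invented to show a harsh and a generous user who agree on relative preferences.

```python
from math import sqrt

# Hypothetical ratings: a harsh rater and a generous one.
ratings = {
    "grump": {"A": 2, "B": 2, "C": 1},
    "fan": {"A": 5, "B": 5, "C": 4},
}

def adjusted_cosine(ratings, item_a, item_b):
    """Cosine similarity after subtracting each user's own mean rating."""
    numerator = sum_sq_a = sum_sq_b = 0.0
    for row in ratings.values():
        if item_a in row and item_b in row:
            mean = sum(row.values()) / len(row)
            dev_a = row[item_a] - mean
            dev_b = row[item_b] - mean
            numerator += dev_a * dev_b
            sum_sq_a += dev_a * dev_a
            sum_sq_b += dev_b * dev_b
    if sum_sq_a == 0 or sum_sq_b == 0:
        return 0.0
    return numerator / sqrt(sum_sq_a * sum_sq_b)

print(adjusted_cosine(ratings, "A", "B"))
print(adjusted_cosine(ratings, "A", "C"))
```

Plain cosine similarity on the raw ratings would call A and C nearly identical (about 0.99), because both users' scores rise together. After centering, both users place C below their own average and A above it, so the adjusted score is negative.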

    19/24

    Generating a recommendation

    Weighted average of item ratings by their similarity.
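The weighted average can be sketched as follows: to guess a user's rating for a target item, average the ratings that user gave to other items, weighting each by its similarity to the target. Normalizing by the sum of absolute similarities is one common convention and is assumed here; the numbers are made up.

```python
def predict_rating(user_ratings, similarities):
    """Similarity-weighted average of the user's known ratings.

    user_ratings: item -> rating this user already gave.
    similarities: item -> similarity between that item and the target item.
    Returns None when no rated item has a nonzero similarity.
    """
    shared = [item for item in user_ratings if item in similarities]
    weight_total = sum(abs(similarities[item]) for item in shared)
    if weight_total == 0:
        return None
    weighted_sum = sum(similarities[item] * user_ratings[item] for item in shared)
    return weighted_sum / weight_total

# Hypothetical input: the user loved A and disliked B; the target item
# is very similar to A and barely similar to B.
guess = predict_rating({"A": 5, "B": 1}, {"A": 0.9, "B": 0.1})
print(guess)
```

The guess lands close to the rating of A, because A dominates the weights.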

    20/24

    Week 5: Hybrid Filtering

    Filtering Comments by Voting

    User-item recommendation systems

    General Hybrid Filters

    21/24

    Different Filtering Systems

    Pure algorithmic: Newsblaster analyzes the topics in the documents. No concept of users.

    Pure social: What I see on Twitter is determined by who I follow. No content analysis.

    Hybrid: Reddit comments filtered by an algorithm that takes votes as input.

    Hybrid: Items recommended based on co-consumption by all users.

    What else is possible?

    22/24

    Item Content: text analysis, topic modeling, clustering...

    My Data: who I follow, what I've read/liked

    Other Users' Data: social network structure, other users' likes

    23/24

    How to evaluate/optimize the filter?

    24/24

    How to evaluate/optimize the filter?

    Netflix: try to predict the rating that the user gives a movie after watching it.

    Amazon: sell more stuff.

    Google web search: human raters, A/B test every change.
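For the rating-prediction goal, one standard score is root mean squared error (RMSE), the metric used in the Netflix Prize competition. The sketch below uses invented predictions and ratings; the slides do not name a specific metric.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings and the filter's guesses for them.
actual = [5, 3, 4]
predicted = [4, 3, 5]
print(f"RMSE: {rmse(predicted, actual):.3f}")
```

A lower value means the guessed ratings are closer to what users actually gave. Goals such as "sell more stuff" cannot be scored this way and need live experiments instead.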