
Semantics-Based News Recommendation with SF-IDF+

International Conference on Web Intelligence, Mining, and Semantics (WIMS 2013)

June 13, 2013

Marnix [email protected]

Michel [email protected]

Frederik [email protected]

Flavius [email protected]

Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

Introduction (1)

• Recommender systems help users to plough through a massive and increasing amount of information

• Recommender systems:
  – Content-based
  – Collaborative filtering
  – Hybrid

• Content-based systems are often term-based

• Common measure: Term Frequency – Inverse Document Frequency (TF-IDF) as proposed by Salton and Buckley [1988]


Introduction (2)

• One could take into account semantics:
  – Semantic Similarity (SS) recommenders:
    • Jiang & Conrath [1997]
    • Leacock & Chodorow [1998]
    • Lin [1998]
    • Resnik [1995]
    • Wu & Palmer [1994]
  – Concepts instead of terms → Concept Frequency – Inverse Document Frequency (CF-IDF):
    • Reduces noise caused by non-meaningful terms
    • Yields fewer terms to evaluate
    • Allows for semantic features, e.g., synonyms
    • Relies on a domain ontology
    • Published at WIMS 2011


Introduction (3)

• One could take into account semantics:
  – Synsets instead of concepts → Synset Frequency – Inverse Document Frequency (SF-IDF):
    • Similar to CF-IDF
    • Does not rely on a domain ontology
    • Published at WIMS 2012
  – Research has shown that relationships like synonymy, hyponymy, … provide structure and contribute to an improved level of interpretability
  – Hence, we coin SF-IDF+, which additionally accounts for synset semantic relationships


Introduction (4)

• Implementations in Ceryx (as a plug-in for Hermes [Frasincar et al., 2009], a news processing framework)

• What is the performance of semantic recommenders?
  – SF-IDF+ vs. SF-IDF
  – SF-IDF+ vs. TF-IDF
  – SF-IDF+ vs. SS


Framework: User Profile

• User profile consists of all read news items

• Implicit preference for specific topics


Framework: Preprocessing

• Before recommendations can be made, each news item is parsed:
  – Tokenizer
  – Sentence splitter
  – Lemmatizer
  – Part-of-Speech tagger


Framework: Synsets

• We make use of the WordNet dictionary and word sense disambiguation (WSD)

• Each word has a set of senses and each sense has a set of semantically equivalent synonyms (synsets):
  – Turkey:
    • turkey, Meleagris gallopavo (animal)
    • Turkey, Republic of Turkey (country)
    • joker, turkey (annoying person)
    • turkey, bomb, dud (failure)
  – Fly:
    • fly, aviate, pilot (operate airplane)
    • flee, fly, take flight (run away)

• Synsets are linked using semantic pointers
  – Hypernym, hyponym, …
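The sense inventory above only helps once each word in a news item is mapped to one sense. The implementation slide names Lesk as the WSD algorithm; the following is a minimal sketch of a simplified Lesk variant, not the Ceryx code. The glosses in `TURKEY_SENSES` are hand-written stand-ins (not WordNet's actual glosses), and the function name and sense labels are hypothetical.

```python
def simplified_lesk(context, senses):
    """Pick the sense whose gloss shares the most words with the context.

    context: list of words surrounding the ambiguous word.
    senses:  dict mapping a sense label to its gloss text.
    Ties are resolved in favour of the sense listed first.
    """
    context_words = {w.lower() for w in context}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hand-written toy glosses (assumption, for illustration only).
TURKEY_SENSES = {
    "turkey (animal)": "large bird bred for food",
    "Turkey (country)": "republic in Asia Minor and the Balkans",
    "turkey (failure)": "an event that fails badly",
}

context = ["the", "farmer", "raised", "a", "bird", "for", "food"]
sense = simplified_lesk(context, TURKEY_SENSES)
```

Here the animal sense wins because its gloss shares three words with the context, against one for the country sense and none for the failure sense.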


Framework: TF-IDF

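• Term Frequency: the occurrence of a term $t_i$ in a document $d_j$, i.e.,

$$\mathit{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

• Inverse Document Frequency: the occurrence of a term $t_i$ in a set of documents $D$, i.e.,

$$\mathit{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

• And hence

$$\text{tf-idf}_{i,j} = \mathit{tf}_{i,j} \times \mathit{idf}_i$$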
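The TF-IDF definition on this slide can be sketched in a few lines. This is a minimal illustration, assuming documents are already tokenized and lemmatized and that the logarithm is the natural one (the slide does not state the base); the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf weight} dict per document.

    documents: list of token lists.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())  # sum over k of n_{k,j}
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights


docs = [
    ["turkey", "election", "vote"],
    ["turkey", "recipe", "oven", "turkey"],
    ["election", "poll"],
]
weights = tf_idf(docs)
```

In the second toy document, "turkey" has a higher term frequency than "recipe" (2/4 against 1/4) but a lower weight, because it also occurs in another document and so has a smaller inverse document frequency.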

Framework: SF-IDF

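• Synset Frequency: the occurrence of a synset $s_i$ in a document $d_j$, i.e.,

$$\mathit{sf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

• Inverse Document Frequency: the occurrence of a synset $s_i$ in a set of documents $D$, i.e.,

$$\mathit{idf}_i = \log \frac{|D|}{|\{j : s_i \in d_j\}|}$$

• And hence

$$\text{sf-idf}_{i,j} = \mathit{sf}_{i,j} \times \mathit{idf}_i$$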

Framework: SF-IDF+

• Synset Frequency: the occurrence of a synset $s_i$ and its related synsets $r_i$ in a document $d_j$, i.e.,

$$\mathit{sf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

• Inverse Document Frequency: the occurrence of synsets $s_i$ and $r_i$ in a set of documents $D$, i.e.,

$$\mathit{idf}_i = \log \frac{|D|}{|\{j : s_i, r_i \in d_j\}|}$$

• Weighting is applied depending on relations, and hence

$$\text{sf-idf+}_{i,j,r} = \mathit{sf}_{i,j} \times \mathit{idf}_i \times w_r$$
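The relation weighting on this slide can be illustrated with a small sketch. It covers only the final step, multiplying an already computed sf × idf score by a relation weight $w_r$, and is one plausible reading rather than the Ceryx implementation. The slides do not give weight values or say how to handle a synset reached in several ways, so the weight of 0.5 and the keep-the-maximum rule are assumptions. The synset labels and the name `extend_with_relations` are hypothetical.

```python
def extend_with_relations(sf_idf_scores, related, relation_weights):
    """Extend a document's synset vector with weighted related synsets.

    sf_idf_scores:    {synset: sf * idf} for synsets found in the document.
    related:          {synset: [(relation, related_synset), ...]}.
    relation_weights: {relation: w_r}; unknown relations get weight 0.
    """
    extended = dict(sf_idf_scores)  # directly observed synsets keep their score
    for synset, score in sf_idf_scores.items():
        for relation, neighbour in related.get(synset, []):
            weighted = score * relation_weights.get(relation, 0.0)
            # Assumption: on a collision, keep the highest score.
            if weighted > extended.get(neighbour, 0.0):
                extended[neighbour] = weighted
    return extended


scores = {"turkey.n.01": 0.4, "oven.n.01": 0.2}
related = {
    "turkey.n.01": [("hypernym", "bird.n.01")],
    "oven.n.01": [("hypernym", "kitchen_appliance.n.01")],
}
extended = extend_with_relations(scores, related, {"hypernym": 0.5})
```

With these toy inputs the document vector gains two related synsets, each at half the score of the synset it was reached from, so a news item about birds now shares a dimension with a profile that only mentions turkeys.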

Framework: SS (1)

• TF-IDF and SF-IDF(+) use cosine similarity:
  – Two vectors:
    • User profile item scores
    • News message item scores
  – Measures the cosine of the angle between the vectors

• Semantic Similarity (SS):
  – Two vectors:
    • User profile synsets
    • News message synsets
  – Jiang & Conrath [1997], Resnik [1995], and Lin [1998]: information content of synsets
  – Leacock & Chodorow [1998] and Wu & Palmer [1994]: path length between synsets
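The cosine similarity used by TF-IDF and SF-IDF(+) compares the user profile vector with a news item vector. A minimal sketch, assuming both vectors are stored as sparse dictionaries keyed by term or synset; the function name and the example weights are hypothetical.

```python
import math


def cosine_similarity(profile, item):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    shared = profile.keys() & item.keys()
    dot = sum(profile[key] * item[key] for key in shared)
    norm_profile = math.sqrt(sum(v * v for v in profile.values()))
    norm_item = math.sqrt(sum(v * v for v in item.values()))
    if norm_profile == 0 or norm_item == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_profile * norm_item)


profile = {"election": 0.3, "vote": 0.4}
same_topic = {"election": 0.6, "vote": 0.8}
other_topic = {"recipe": 0.5, "oven": 0.5}

similar = cosine_similarity(profile, same_topic)
unrelated = cosine_similarity(profile, other_topic)
```

Because only the angle matters, `same_topic` scores close to 1 even though its weights are twice as large as those in the profile, while `other_topic` shares no dimension with the profile and scores 0.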


Framework: SS (2)

• SS score is calculated by computing the pair-wise similarities between synsets in the unread document u and the user profile r:

$$\mathit{rank}(u) = \frac{\sum_{(u,r) \in W} \mathit{sim}(u,r)}{|W|}$$

where W is a vector with all combinations of synsets from r and u that have a common Part-of-Speech, and where sim(u,r) is any of the mentioned SS measures.
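The SS ranking score is an average over same-Part-of-Speech synset pairs. The sketch below assumes synsets are represented as `(name, pos)` tuples and takes the similarity measure as a parameter. `toy_sim` is a placeholder, not one of the cited measures, and all names are hypothetical.

```python
from itertools import product


def ss_rank(unread, profile, sim):
    """Average similarity over all pairs (u, r) that share a Part-of-Speech.

    unread, profile: lists of (synset_name, pos) tuples.
    sim:             callable returning a similarity for one pair.
    """
    pairs = [(u, r) for u, r in product(unread, profile) if u[1] == r[1]]
    if not pairs:
        return 0.0  # W is empty: nothing to compare
    return sum(sim(u, r) for u, r in pairs) / len(pairs)


def toy_sim(u, r):
    # Placeholder measure: 1.0 for identical synsets, 0.25 otherwise.
    return 1.0 if u == r else 0.25


unread = [("election", "n"), ("vote", "v")]
profile = [("election", "n"), ("poll", "n"), ("fly", "v")]
score = ss_rank(unread, profile, toy_sim)
```

Of the six possible combinations only three share a Part-of-Speech, so W has three entries and the score is (1.0 + 0.25 + 0.25) / 3 = 0.5.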

Implementation: Hermes

• Hermes framework is utilized for building a news personalization service for RSS

• Its implementation is the Hermes News Portal (HNP):
  – Programmed in Java
  – Uses OWL / SPARQL / Jena / GATE / WordNet


Implementation: Ceryx

• Ceryx is a plug-in for HNP

• Uses WordNet / Stanford POS Tagger / JAWS lemmatizer / Lesk WSD

• Main focus is on recommendation support

• User profiles are constructed

• Computes TF-IDF, SF-IDF, SF-IDF+, and SS


Evaluation (1)

• Experiment:
  – We let 19 participants evaluate 100 news items
  – We use 8 different user profiles focusing on various topics
  – Ceryx computes TF-IDF, SF-IDF, SF-IDF+, and SS for various cut-off values
  – F1 scores are evaluated
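Evaluating F1 at a given cut-off can be sketched as follows. The sketch assumes an item is recommended when its similarity score is at or above the cut-off, which the slides do not spell out; the scores and relevance judgements are made-up toy values, not data from the experiment.

```python
def f1_at_cutoff(scores, relevant, cutoff):
    """F1 of the items whose score reaches the cut-off.

    scores:   {item_id: similarity score}.
    relevant: set of item ids the user judged interesting.
    """
    recommended = {item for item, s in scores.items() if s >= cutoff}
    true_positives = len(recommended & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(recommended)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy values for illustration only.
scores = {"n1": 0.9, "n2": 0.6, "n3": 0.4, "n4": 0.1}
relevant = {"n1", "n3"}

f1_strict = f1_at_cutoff(scores, relevant, 0.5)
f1_loose = f1_at_cutoff(scores, relevant, 0.3)
```

Lowering the cut-off from 0.5 to 0.3 adds the relevant item `n3` to the recommendations, which raises recall from 1/2 to 1 and F1 from 0.5 to 0.8 in this toy case. Sweeping the cut-off in this way gives one F1 value per recommender per cut-off.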


Evaluation (2)

• Results:


[Figure: results; legend labels TF-IDF, SF-IDF+, SS]


Conclusions

• Content-based recommendation is commonly performed using TF-IDF

• Semantics can be taken into account by considering synsets and their relations

• Semantics-based recommendation outperforms the classic term-based recommendation

• Future work:
  – Also employ the similarity of words (e.g., named entities) missing from WordNet (e.g., based on the Google Distance)
  – Compare SF-IDF, SF-IDF+, and SS with LDA (Latent Dirichlet Allocation) and ESA (Explicit Semantic Analysis)


Questions

