Intelligent Database Systems Lab
N.Y.U.S.T.I. M.
An information-pattern-based approach to novelty detection
Presenter : Lin, Shu-Han
Authors : Xiaoyan Li, W. Bruce Croft
Information Processing and Management (2008)
Outline
Motivation
Objective
Definition
Observation
Methodology
Experiments
Conclusion
Personal Comments
Motivation - specific topic
It is very difficult for traditional word-based approaches to separate the two non-relevant sentences (3 and 4) from the two relevant sentences (1 and 2).
The two non-relevant sentences are very likely to be identified as novel because they contain many new words that do not appear in previous sentences.
Motivation - general topic
It is very difficult for traditional word-based approaches to separate the non-relevant sentence (2) from the relevant sentence (1).
Objectives
To attack the above hard problem:
To provide a new and more explicit definition of novelty. Novelty is defined as new answers to the potential questions representing a user’s request or information need.
To propose a new concept in novelty detection: query-related information patterns. Very effective information patterns for novelty detection at the sentence level have been identified.
To propose a unified pattern-based approach that includes the following three steps: query analysis, relevant sentence detection and new pattern detection. The unified approach works for both specific topics and general topics.
Definition - Information Patterns
Information patterns of specific topics
Information patterns of general topics
Opinion patterns and opinion sentences
Event patterns and event sentences
Table. Word patterns for the five types of NE (named entity) questions
Table. Examples of opinion patterns
Observation – information patterns
Sentence lengths
Relevant sentences on average have more words than non-relevant sentences.
Novel sentences on average have slightly more words than relevant sentences.
Opinion patterns
There are relatively more opinion sentences in relevant (and novel) sentences than in non-relevant sentences.
The percentage of opinion sentences among novel sentences is slightly larger than among relevant sentences.
Table. Statistics of sentence lengths
Table. Statistics on opinion patterns for 22 opinion topics (2003)
Observation – information patterns (Cont.)
NE (named entity) combinations
NEs of the PLD types (PERSON, LOCATION, DATE) are more effective in separating relevant from non-relevant sentences.
NEs of the POLD types (PERSON, ORGANIZATION, LOCATION, DATE) will be used in new pattern detection; NEs of the ORGANIZATION type may provide different sources of new information.
NEs of the PLD types play a more important role in event topics than in opinion topics.
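The idea of using POLD-type named entities for new pattern detection can be sketched as follows. This is only an illustration under stated assumptions: sentences arrive already tagged with named entities (no NER library is used), the function name `novel_by_new_entities` is hypothetical, and the rule "novel if at least one unseen POLD entity appears" is a simplification, not necessarily the exact rule used in the paper.

```python
from typing import List, Set, Tuple

# The POLD entity types named on the slide.
POLD_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "DATE"}

Entity = Tuple[str, str]  # (surface text, entity type)


def novel_by_new_entities(tagged_sentences: List[List[Entity]]) -> List[bool]:
    """Flag each sentence as novel if it contains a POLD entity not seen before.

    Assumption: each sentence is given as a list of pre-tagged entities.
    """
    seen: Set[Entity] = set()
    flags: List[bool] = []
    for entities in tagged_sentences:
        # Keep only POLD entities, normalizing case of the surface text.
        pold = {(text.lower(), etype) for text, etype in entities
                if etype in POLD_TYPES}
        flags.append(bool(pold - seen))  # any entity not seen so far?
        seen |= pold
    return flags


if __name__ == "__main__":
    sentences = [
        [("Clinton", "PERSON"), ("Washington", "LOCATION")],
        [("Clinton", "PERSON")],
        [("Clinton", "PERSON"), ("NATO", "ORGANIZATION")],
        [("42", "NUMBER")],
    ]
    print(novel_by_new_entities(sentences))
```

In this toy input, the third sentence is flagged only because of its ORGANIZATION entity, which mirrors the remark that ORGANIZATION may provide a different source of new information.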
Methodology
Fig. ip-BAND: a unified information-pattern-based approach to novelty detection.
Methodology (Cont.)
(1) Query analysis and question formulation
How many (2)
Where (3)
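Question formulation can be pictured as mapping question words in the query to the named-entity type of the expected answer. The sketch below is hypothetical: the pattern list is an illustrative subset chosen for this example (the slide's table of word patterns is not reproduced in the transcript), and `expected_answer_types` is an invented name.

```python
import re
from typing import List

# Hypothetical word patterns: (question phrase, expected answer NE type).
# Illustrative subset only; the actual pattern table is not in the transcript.
QUESTION_PATTERNS = [
    ("how many", "NUMBER"),
    ("who", "PERSON"),
    ("where", "LOCATION"),
    ("when", "DATE"),
]


def expected_answer_types(query: str) -> List[str]:
    """Return the NE types suggested by question words found in the query."""
    text = " ".join(query.lower().split())  # normalize case and whitespace
    found: List[str] = []
    for phrase, etype in QUESTION_PATTERNS:
        # Whole-word match so that "who" does not match inside other words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text) and etype not in found:
            found.append(etype)
    return found


if __name__ == "__main__":
    print(expected_answer_types("How many people were killed, and where?"))
    print(expected_answer_types("Report the details of the merger"))
```

A query with no matching question words yields an empty list; in the terms of the slides, such a topic would be handled as a general topic rather than a specific one.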
Methodology (Cont.)
(2) Using patterns in relevance re-ranking
Ranking with TFISF (term frequency-inverse sentence frequency) models
TFISF with information patterns
Sentence lengths
Named entities
Opinion patterns
(3) Novel sentence extraction
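A minimal sketch of TFISF ranking for step (2) is given below. The weighting shown (sentence term frequency times query term frequency times a smoothed log inverse sentence frequency) is one common form and is an assumption here; the slides do not give the exact formula, and the adjustments for sentence length, named entities and opinion patterns are not modeled. The names `tokenize` and `tfisf_scores` are hypothetical.

```python
import math
from collections import Counter
from typing import List

_PUNCT = ".,;:!?()\"'"


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (w.strip(_PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def tfisf_scores(query: str, sentences: List[str]) -> List[float]:
    """Score each sentence against the query with a TFISF-style weight.

    Assumed form: sum over query terms of tf_sentence * tf_query * isf,
    with isf = log((n + 1) / (sf + 0.5)) and sf = number of sentences
    containing the term.
    """
    n = len(sentences)
    tokenized = [tokenize(s) for s in sentences]
    sentence_freq: Counter = Counter()
    for tokens in tokenized:
        sentence_freq.update(set(tokens))  # count each term once per sentence
    query_tf = Counter(tokenize(query))

    scores: List[float] = []
    for tokens in tokenized:
        sent_tf = Counter(tokens)
        score = 0.0
        for term, q_count in query_tf.items():
            if sent_tf[term]:
                isf = math.log((n + 1) / (sentence_freq[term] + 0.5))
                score += sent_tf[term] * q_count * isf
        scores.append(score)
    return scores


if __name__ == "__main__":
    sents = [
        "The earthquake killed many victims.",
        "The weather was sunny.",
        "Earthquake relief arrived.",
    ]
    for score, sent in sorted(zip(tfisf_scores("earthquake victims", sents), sents),
                              reverse=True):
        print(f"{score:.3f}  {sent}")
```

Sentences sharing rarer query terms rank higher; a sentence with no query term scores zero.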
Experiments
Baseline approaches
B-NN: initial retrieval ranking
B-NW: new word detection
B-NWT: new word detection with a threshold
B-MMR: Maximal Marginal Relevance (MMR)
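The word-based baselines can be illustrated with a small sketch of new word detection with a threshold. It assumes the threshold is a minimum count of previously unseen words; the slides do not say whether B-NWT thresholds on a count or a ratio, so that choice and the name `new_word_flags` are assumptions. With `threshold=1` the function behaves like plain new word detection.

```python
from typing import List, Set


def new_word_flags(sentences: List[str], threshold: int = 1) -> List[bool]:
    """Flag a sentence as novel if it has at least `threshold` unseen words.

    Assumption: words are lowercase whitespace tokens with surrounding
    punctuation stripped; the threshold is a raw count of new words.
    """
    seen: Set[str] = set()
    flags: List[bool] = []
    for sentence in sentences:
        words = {w.strip(".,;:!?()\"'").lower() for w in sentence.split()}
        words.discard("")  # drop tokens that were pure punctuation
        new_words = words - seen
        flags.append(len(new_words) >= threshold)
        seen |= words
    return flags


if __name__ == "__main__":
    ranked = [
        "the storm hit the coast",
        "the storm hit the coast hard",
        "rescue teams reached the coast",
    ]
    print(new_word_flags(ranked, threshold=1))
    print(new_word_flags(ranked, threshold=2))
```

Because the check looks only at vocabulary, a non-relevant sentence with many unseen words would be flagged as novel, which is the weakness described in the motivation slides.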
Experiments
Performance for specific topics from TREC 2002, 2003, 2004
Note: Data marked with * pass the significance test at the 95% confidence level by the Wilcoxon test; data marked with ** pass at the 90% level. Chg%: improvement over the first (B-NN) baseline in %.
Table. Performance of novelty detection for 8 specific topics (queries) from TREC 2002
Table. Performance of novelty detection for 15 specific topics (queries) from TREC 2003
Table. Performance of novelty detection for 11 specific topics (queries) from TREC 2004
3.4 of 15 novel sentences
10.1 of 15 novel sentences
4.6 of 15 novel sentences
Experiments
Performance for general topics from TREC 2002, 2003, 2004
Note: Data marked with * pass the significance test at the 95% confidence level by the Wilcoxon test; data marked with ** pass at the 90% level. Chg%: improvement over the first (B-NN) baseline in %.
Table. Performance of novelty detection for 41 general topics (queries) from TREC 2002
Table. Performance of novelty detection for 35 general topics (queries) from TREC 2003
Table. Performance of novelty detection for 3 general topics (queries) from TREC 2004
3.2 of 15 novel sentences
7.5 of 15 novel sentences
3.4 of 15 novel sentences
Experiments
Comparison among specific, general and all topics at top 15 ranks
Note: Chg%: Improvement over the first baseline in percentage; Nvl#: Number of true novel sentences; Rdd#: Number of relevant but redundant sentences; NRl#: Number of non-relevant sentences.
Table. Comparison among specific, general and all topics at top 15 ranks
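The quantities in the note (Nvl#, Rdd#, NRl#, Chg%) can be computed from a judged ranked list roughly as follows. This is a sketch under assumptions: the label strings, the function names `top_k_breakdown` and `change_percent`, and the toy input are all hypothetical and are not data from the paper.

```python
from collections import Counter
from typing import Dict, List


def top_k_breakdown(labels: List[str], k: int = 15) -> Dict[str, float]:
    """Count novel, redundant and non-relevant sentences in the top k ranks.

    Assumption: each ranked sentence carries one of the labels
    "novel", "redundant" (relevant but not novel) or "non-relevant".
    """
    counts = Counter(labels[:k])
    return {
        "Nvl#": counts["novel"],
        "Rdd#": counts["redundant"],
        "NRl#": counts["non-relevant"],
        "precision": counts["novel"] / k,  # fraction of top k that is novel
    }


def change_percent(baseline: float, system: float) -> float:
    """Relative improvement of `system` over `baseline`, in percent."""
    return (system - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Toy judgments for one topic (hypothetical, not from the paper).
    ranked_labels = ["novel"] * 6 + ["redundant"] * 4 + ["non-relevant"] * 5
    print(top_k_breakdown(ranked_labels))
    print(round(change_percent(0.25, 0.40), 1))
```

Splitting the non-novel ranks into redundant and non-relevant shows whether errors come from the relevance step or from the novelty step.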
Conclusions
Novelty means new answers to the potential questions representing a user’s request or information need.
The proposed ip-BAND outperforms all baselines for both specific and general topics, and it performs better on specific topics than on general topics.
It is impossible to collect complete novelty judgments in reality.
Baseline selection and evaluation measure by human assessors
Misjudgment of relevance and/or novelty by human assessors and disagreement of judgments between the human assessors
Limitation and accuracy of question formulations
Novelty detection precision will be low since some non-relevant sentences may be treated as novel.
Personal Comments
Advantage …
Drawback …
Application …