effects of overlaying ontologies to textrank graphs project report by kino coursey

Effects of overlaying ontologies to TextRank graphs

Project Report
By Kino Coursey

Outline

• Introduction & Background
• Ontology based Summarization
• Evaluation
• Discussion
• Future Work
• Conclusion

Motivation

• An exponentially increasing volume of information requires summarization
  – Humans are finite
  – Text is being generated faster than a reader can read
  – Need to quickly identify the relevance of documents

Central Question: Does knowing more really help?

• TextRank and a number of other random walk NLP algorithms have been applied to different areas like text summarization and keyword extraction.
• How would additional information from an ontology like WordNet or Cyc affect such algorithms? Would it be better or worse?

Evaluation Criteria

• The evaluation criterion is the change in performance of TextRank when given the extra information.
• The evaluation dataset is the Document Understanding Conference 2002 (DUC-2002) summarization test set.
• The ROUGE summarization evaluation tool is used to measure the performance change.
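ROUGE-S, the ROUGE variant the tuning and per-document review in this report rely on, scores a summary by the skip-bigrams (ordered word pairs with any gap between them) it shares with a reference. The following is a minimal sketch of skip-bigram recall, assuming whitespace tokenization and lowercase matching. It is not the ROUGE tool itself, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every ordered word pair (any gap) in a token list."""
    return Counter(combinations(tokens, 2))


def skip_bigram_recall(candidate, reference):
    """Fraction of the reference's skip-bigrams that also occur in the candidate."""
    cand = skip_bigrams(candidate.lower().split())
    ref = skip_bigrams(reference.lower().split())
    if not ref:
        return 0.0
    overlap = sum((cand & ref).values())  # multiset intersection
    return overlap / sum(ref.values())


recall = skip_bigram_recall("police kill the gunman", "police killed the gunman")
```

The real tool also reports precision and an F-measure and offers stemming and stopword options, which this sketch leaves out.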

Project Plan

• Implement TextRank
• Construct an algorithm to import data from Cyc into TextRank
• Construct evaluation dataset preprocessor
• Develop a parameter tuning process
• Measure performance with optimal parameters
• Analyze and report results

Implementation

• Implemented Intelligent Surfer model in Perl
• Implemented text-to-Cyc graph extraction
  – Denotation map
  – Using: isa, genls, conceptuallyRelated, mainDomain, definingMt
• Explored graph visualization technology (easier to debug what you can see)
  – Nodes3d from BrainMaps.org

Ontology Based Summarization

• Augment TextRank with Cyc relationships
  – Perform initial context-free mapping into Cyc terms
  – Perform ranking process
  – Select the highest ranked sentences as extractive summary

Intelligent Surfer Model

The Standard Model (for all nodes use):

$$PR(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$$

Intelligent Surfer Model (for all nodes use):

$$PR(V_i) = (1 - d) \cdot S_i + d \cdot \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}$$

Constraint on $S_i$:

$$\sum_{i=1}^{N} S_i = N$$

$S_i$ is apportioned as a function of query relevancy. Here words in the input text have $S_i = 1/N$ while all other nodes have $S_i = 0$. When you get tired you jump back to the "problem statement", the input.
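The intelligent surfer update can be sketched in a few lines. This is a minimal illustration, not the Perl implementation from the project: the toy graph, the node names, the damping value d = 0.85, and the fixed iteration count are all assumptions, and N in "1/N" is taken here to be the number of input-text words.

```python
def intelligent_surfer_rank(in_links, out_degree, bias, d=0.85, iterations=50):
    """Iterate PR(V_i) = (1 - d) * S_i + d * sum(PR(V_j) / |Out(V_j)|)."""
    rank = {node: 1.0 for node in bias}
    for _ in range(iterations):
        # Build the new scores from the previous iteration's scores only.
        rank = {
            node: (1.0 - d) * bias[node]
            + d * sum(rank[src] / out_degree[src] for src in in_links.get(node, []))
            for node in bias
        }
    return rank


# Hypothetical toy graph: two input-text words and one ontology-only concept.
edges = [("hurricane", "Storm"), ("gilbert", "Storm"), ("Storm", "hurricane")]
nodes = ["hurricane", "gilbert", "Storm"]
input_words = ["hurricane", "gilbert"]

in_links = {}
out_degree = {node: 0 for node in nodes}
for src, dst in edges:
    in_links.setdefault(dst, []).append(src)
    out_degree[src] += 1

# S_i = 1/N for words in the input text, 0 for every other node.
bias = {node: (1.0 / len(input_words) if node in input_words else 0.0) for node in nodes}
ranks = intelligent_surfer_rank(in_links, out_degree, bias)
```

Because the ontology-only node gets no jump probability of its own, any score it ends up with has to flow in from the input words.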

Weighted Version

Weighted updates:

$$PR^{t+1}(V_i) = (1 - d) \cdot S_i + d \cdot \sum_{k=1}^{N} \frac{W_{i,k}}{O_k} \, PR^{t}(V_k)$$

Summation of the weighted outputs of the currently ranked nodes.

Sum of the outputs:

$$O_k = \sum_{j=1}^{N} W_{j,k}$$
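One weighted update step might look like the sketch below. The dictionary layout (`weights[k][i]` holding the weight on the edge from node k to node i), the toy values, and d = 0.85 are assumptions made for illustration.

```python
def weighted_rank_step(rank, weights, bias, d=0.85):
    """One update: new PR(V_i) = (1 - d) * S_i + d * sum_k (W / O_k) * PR(V_k)."""
    # O_k: the sum of the weights on all edges leaving node k.
    out_sum = {k: sum(targets.values()) for k, targets in weights.items()}
    new_rank = {}
    for i in rank:
        incoming = sum(
            weights[k][i] / out_sum[k] * rank[k]
            for k in weights
            if i in weights[k] and out_sum[k] > 0
        )
        new_rank[i] = (1.0 - d) * bias[i] + d * incoming
    return new_rank


# Hypothetical three-node graph; weights[k][i] is the weight on edge k -> i.
weights = {"a": {"b": 2.0, "c": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}
bias = {"a": 1.0, "b": 0.0, "c": 0.0}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}
rank = weighted_rank_step(rank, weights, bias)
```

Dividing each edge weight by the source node's total outgoing weight means a node distributes its score in proportion to its edge weights rather than evenly.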

From text to Cyc graph

• Text-to-Cyc graph extraction
  – Denotation map
  – Using: isa, genls, conceptuallyRelated, mainDomain, definingMt
  – Each edge has its own weight associated with it
  – Finding the right weight is its own process
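A rough sketch of how per-relation weights could be attached to graph edges follows. The weight values and the example assertions (including the terms WeatherEvent, PhysicalObject and Country) are invented for illustration and are not taken from Cyc or from the tuned system.

```python
# Hypothetical per-relation weights; the project found its values by tuning.
RELATION_WEIGHTS = {
    "isa": 1.0,
    "genls": 0.8,
    "conceptuallyRelated": 0.5,
    "mainDomain": 0.3,
    "definingMt": 0.1,
}


def build_weighted_graph(assertions, relation_weights):
    """Turn (relation, source, target) triples into a weighted adjacency dict."""
    graph = {}
    for relation, source, target in assertions:
        weight = relation_weights.get(relation)
        if weight is None:
            continue  # relation type not used for the graph
        targets = graph.setdefault(source, {})
        targets[target] = targets.get(target, 0.0) + weight
    return graph


# Made-up triples standing in for what the Cyc query would return.
assertions = [
    ("genls", "HurricaneAsEvent", "WeatherEvent"),
    ("genls", "HurricaneAsObject", "PhysicalObject"),
    ("isa", "DominicanRepublic", "Country"),
    ("comment", "DominicanRepublic", "ignored"),
]
graph = build_weighted_graph(assertions, RELATION_WEIGHTS)
```

Keeping the weights in one table per relation type is what makes them a small parameter vector that a search procedure can tune.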

Finding the right terms

(denotation-mapper "Hurricane Gilbert swept toward the Dominican Republic Sunday")

Results:
(("Hurricane" . HurricaneAsObject)
 ("Hurricane" . HurricaneAsEvent)
 ("Gilbert" . JohnGilbert)
 ("Gilbert" . JodyGilbert)
 ("Gilbert" . MelissaGilbert)
 ("Gilbert" . GilbertStuart-TheArtist)
 ("Gilbert" . GilbertGottfried)
 ("swept" . SweepingAnArea)
 ("swept" . (ThingDescribableAsFn Sweep-TheWord Adjective))
 ("toward" . (HypothesizedPrepositionSenseFn Toward-TheWord Preposition))
 ("the Dominican Republic" . DominicanRepublic)
 ("Sunday" . wikip-Sunday)
 ("Sunday" . (ThingDescribableAsFn Sunday-TheWord Adjective)))
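Since the mapping is context free, every candidate term for a word goes into the graph and the ranking is left to sort them out. A small sketch of that step, using some of the pairs from the output above, could look like this. The node tagging and the choice of linking word and term in both directions are assumptions.

```python
# A subset of the (phrase, Cyc term) pairs returned by the denotation mapper.
denotations = [
    ("Hurricane", "HurricaneAsObject"),
    ("Hurricane", "HurricaneAsEvent"),
    ("Gilbert", "JohnGilbert"),
    ("Gilbert", "JodyGilbert"),
    ("Gilbert", "MelissaGilbert"),
    ("Gilbert", "GilbertStuart-TheArtist"),
    ("Gilbert", "GilbertGottfried"),
    ("swept", "SweepingAnArea"),
    ("the Dominican Republic", "DominicanRepublic"),
    ("Sunday", "wikip-Sunday"),
]


def add_denotation_edges(graph, pairs):
    """Link each text phrase to every candidate term, with no disambiguation."""
    for phrase, term in pairs:
        word_node = ("word", phrase)  # tag nodes so words and terms cannot collide
        term_node = ("cyc", term)
        graph.setdefault(word_node, set()).add(term_node)
        graph.setdefault(term_node, set()).add(word_node)
    return graph


graph = add_denotation_edges({}, denotations)
gilbert_senses = sorted(term for _, term in graph[("word", "Gilbert")])
```

In this sample "Gilbert" fans out to five unrelated people, so any disambiguation has to come from how those candidates connect to the rest of the graph.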

The Big View

Tuning the system with Genetic Algorithms

A steady-state genetic algorithm was used to find an optimal weighting, scored with ROUGE-S on a subset of documents.

Genetic Algorithm & Evaluation Function

$$\mathrm{Fitness} = \sum_{i \in \mathrm{EvalDocuments}} \left( \frac{\text{ROUGE-S}(\mathrm{OntoRank}_i, \mathrm{RefText}_i)}{\text{ROUGE-S}(\mathrm{TextRank}_i, \mathrm{RefText}_i)} \right)^{2}$$

1. Select k members for tournament (here k = 4).

2. For all members in tournament, evaluate performance on the task and compute fitness.

3. Perform tournament selection by sorting based on fitness and creating a parent set and a replacement set.

4. Copy parents over replacement set to make children.

5. Do mutation and crossover operations on children.

6. Go to step 1.
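The six steps can be sketched as a single tournament function. This is a toy version under stated assumptions: the genome is a list of five edge weights, the fitness is a made-up distance to a target vector standing in for the ROUGE-based score, and one-point crossover with Gaussian mutation is chosen only for brevity.

```python
import random


def steady_state_step(population, fitness_fn, rng, k=4, mutation_scale=0.1):
    """Run one tournament and overwrite its losers with modified copies of its winners."""
    # Step 1: pick k distinct members for the tournament.
    entrants = rng.sample(range(len(population)), k)
    # Steps 2-3: evaluate, sort by fitness, split into parents and replacements.
    entrants.sort(key=lambda idx: fitness_fn(population[idx]), reverse=True)
    parents, replaced = entrants[: k // 2], entrants[k // 2:]
    # Step 4: children start as copies of the parents.
    children = [list(population[idx]) for idx in parents]
    # Step 5: one-point crossover between the two children, then one mutation each.
    if len(children) == 2 and len(children[0]) > 1:
        cut = rng.randrange(1, len(children[0]))
        children[0][cut:], children[1][cut:] = children[1][cut:], children[0][cut:]
    for child in children:
        child[rng.randrange(len(child))] += rng.gauss(0.0, mutation_scale)
    for slot, child in zip(replaced, children):
        population[slot] = child


# Hypothetical stand-in fitness: closeness to a made-up target weight vector.
TARGET = [1.0, 0.8, 0.5, 0.3, 0.1]


def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))


rng = random.Random(0)
population = [[rng.uniform(0.0, 1.0) for _ in TARGET] for _ in range(8)]
best_before = max(toy_fitness(member) for member in population)
for _ in range(200):  # Step 6: repeat.
    steady_state_step(population, toy_fitness, rng)
best_after = max(toy_fitness(member) for member in population)
```

Only tournament losers are overwritten, so the best member found so far is never lost. That property is what lets a steady-state GA be stopped at any point.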

Initial GA Evaluation

Document  TextRank  OntoRank  Ratio
1         0.0918    0.0952    1.0370
2         0.4095    0.3937    0.9612
3         0.2035    0.1991    0.9787
4         0.2687    0.2823    1.0506
5         0.0546    0.0588    1.0769
6         0.1778    0.2222    1.2500
7         0.3025    0.4034    1.3333
8         0.2507    0.2507    1.0000
9         0.1000    0.0952    0.9524
10        0.1685    0.1575    0.9348

AVG                           1.0575

The GA was run on a random subset of documents that scored below average with default settings, and was run until it provided a +5.75% gain over TextRank on the ROUGE-S scores.
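The average in the table is a mean of per-document ratios, not a ratio of mean scores. A short check, using the rounded scores as printed, recomputes it. The small difference from 1.0575 comes from rounding in the printed scores.

```python
# (TextRank, OntoRank) ROUGE-S scores for documents 1-10, as printed in the table.
scores = [
    (0.0918, 0.0952), (0.4095, 0.3937), (0.2035, 0.1991), (0.2687, 0.2823),
    (0.0546, 0.0588), (0.1778, 0.2222), (0.3025, 0.4034), (0.2507, 0.2507),
    (0.1000, 0.0952), (0.1685, 0.1575),
]

# Per-document ratio of the ontology-augmented score to the baseline score.
ratios = [onto / text for text, onto in scores]
average_ratio = sum(ratios) / len(ratios)
improved = sum(1 for r in ratios if r > 1.0)
```

Note that only five of the ten documents improve. The average is pulled up by two large gains (documents 6 and 7).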

Combined Ranking: HurricaneAsObject vs. HurricaneAsEvent

Cyc draws commonsense distinctions that differ from those of an ontology like WordNet.

HurricaneAsObject: “Hurricane Gilbert moved to the north …”

HurricaneAsEvent: “During Hurricane Gilbert many trees were …”

Combined Ranking: Many Combined Ranking: Many Gilberts but one hurricane topic Gilberts but one hurricane topic ….….

Gilbert is an Gilbert is an ambiguous ambiguous word for Cyc word for Cyc

Yet the Yet the words words primary primary connections connections are topic are topic relatedrelated

Similar to Similar to human name human name association association in contextin context

EVALUATIONS

• Initial GA scores showed a +5% improvement
• Evaluation on the whole dataset
• Shocking Revelation
• Re-Evaluation

First Full Evaluation

• Performed full per-document evaluation on DUC-2002
• Carried out detailed per-document review of relative performance using ROUGE-S

Disappointing full dataset performance

[Figure: Relative Performance. CycRank/TextRank ratio (y-axis, 0.80 to 1.05) against summary length in words (x-axis: 100, 200, 300, 400), with one series each for ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-W-1.2, ROUGE-S* and ROUGE-SU*.]

Debugging via Histogramming

[Figure: Per document relative ratio. Performance ratio (y-axis, 0.0 to 2.5) against document number (x-axis, 1 to 1041).]

• Sorted the relative performance on a per-document basis

• High variance, with average positive effect +15% and average negative effect -14%

• Unfortunately more often negative than positive, so a net negative skew
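The per-document analysis amounts to sorting the ratios and averaging the gains and losses separately. A minimal sketch is below. The five sample ratios are invented and are not values from the DUC-2002 run.

```python
def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def summarize_ratios(ratios):
    """Sort per-document ratios and average the gains and losses separately."""
    ordered = sorted(ratios)
    gains = [r - 1.0 for r in ordered if r > 1.0]
    losses = [r - 1.0 for r in ordered if r < 1.0]
    return {
        "sorted": ordered,
        "mean_gain": mean(gains),
        "mean_loss": mean(losses),
        "better": len(gains),
        "worse": len(losses),
    }


# Hypothetical ratios (ontology-augmented score divided by TextRank score).
summary = summarize_ratios([1.2, 0.8, 0.9, 1.1, 0.85])
```

Reporting the counts next to the two means shows the skew directly: similar-sized gains and losses still produce a net loss when the losses are more frequent.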

Revelation

• While working on a distributed version of TextRank, discovered that DUC-2002 contains two datasets
  – The per-document generative summary
  – The multi-document extractive summary
• Of course the system was using the generative summary to evaluate an extractive system!
• Convert and re-test on the multi-document dataset
• No time to re-evolve using the GA for the multi-document data

Multi-document Re-Evaluation

[Figure: Relative Performance. CycRank/TextRank ratio (y-axis, 0.86 to 1.04) against words in summary (x-axis: 100, 200, 300, 400), with one series each for ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-W-1.2, ROUGE-S* and ROUGE-SU*.]

Evaluation Conclusions

• Much more encouraging when comparing same data types
• Initial weakness prompted analysis of negative result, leading to theory covered in discussion
• No breakthrough

Discussion

• Adding the commonsense graph produces wide variation in TextRank performance, both positive and negative.
  – TextRank tries to preserve the total information present in a graph
  – Adding commonsense to the graph can identify what a reader should be interested in as well as what they probably already know
  – In the first case there is an improvement: disambiguation and context are selected
  – In the second you transmit redundant information (common sense) and reduce the effective bandwidth of the summary

Discussion

• Identification of stopconcepts
  – The ontology version of stopwords
  – Nodes that have so much connectivity that they contain little information
  – Created a stopconcepts list
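One simple way to propose stopconcept candidates is a degree threshold, sketched below. The report does not say how its list was built, so the threshold rule, the cutoff value and the sample edges (including the hub term Thing) are all assumptions.

```python
def find_stopconcepts(edges, max_degree):
    """Return nodes whose degree (in plus out) exceeds max_degree."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return {node for node, count in degree.items() if count > max_degree}


# Made-up edges: a very general hub concept linked from everything else.
edges = [
    ("HurricaneAsEvent", "Thing"),
    ("HurricaneAsObject", "Thing"),
    ("DominicanRepublic", "Thing"),
    ("SweepingAnArea", "Thing"),
    ("HurricaneAsEvent", "SweepingAnArea"),
]
stopconcepts = find_stopconcepts(edges, max_degree=3)
```

Nodes flagged this way could be dropped before ranking, in the same way stopwords are dropped before building a word graph.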

Future Work

• Run the GA on the multi-document data set
• Develop the ability to distinguish novel information from redundant information
• The ontology ranking process itself is useful
  – Ontological debugging
  – Familiarization with the language of the ontology via a form of parallel text

Conclusions

• Adding commonsense graphs to TextRank can affect the performance both positively and negatively
• Need to identify how to modulate the effects of commonsense information
• Having the right data helps!
• Spin-offs for the text-to-ontology graph can be useful

References

[Richardson and Domingos 2002] Richardson and Domingos, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, NIPS 2002

[Mihalcea and Tarau 2004] Mihalcea, R. and Tarau, P. TextRank: Bringing Order Into Texts, EMNLP 2004

[Mihalcea, et al 2004] Mihalcea, R. and Tarau, P. and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING 2004

[Mihalcea, et al 2005] Tarau, P., Mihalcea, R. and Figa, E. Semantic Document Engineering with WordNet and PageRank, in Proceedings of the ACM Conference on Applied Computing (ACM-SAC 2005), New Mexico, March 2005

[Mihalcea and Tarau Patent] Mihalcea, R. and Tarau, P. Graph-based ranking algorithms for text processing, Patent application #20050278325

[Mihalcea and Tarau 2005] Mihalcea, R. and Tarau, P. Multi-Document Summarization with Iterative Graph-based Algorithms, Proceedings of the First International Conference on Intelligent Analysis Methods and Tools (IA 2005), McLean, VA, May 2005

[Conyon and Muldoon 2006] M. J. Conyon and M. R. Muldoon (2006) Ranking the Importance of Boards of Directors.

[Lin and Hovy 2003] Lin, Chin-Yew and E.H. Hovy. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.

[Nordin and Banzhaf 1997] P. Nordin and W. Banzhaf, "Real time control of a Khepera robot using genetic programming," Cybernetics and Control, Vol. 26, No. 3, pp. 533-561, 1997.

[de Jager 2004] de Jager, D., “PageRank: Three distributed algorithms,” M.Sc. thesis, Department of Computing, Imperial College London, London SW7 2BZ, UK, September 2004.

[Brin and Page 1998] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International World Wide Web Conference, Brisbane, Australia, 1998. http://citeseer.nj.nec.com/brin98anatomy.html

[Ding, et al 2004] L. Ding, T. Finin, A. Joshi, R. Pan, R.S. Cost, Y. Peng, P. Riddivari, V. Doshi, and J. Sachs. Swoogle: a search and metadata engine for the semantic web. In Proc. of the 13th ACM Conference on Information and Knowledge Management, pages 652--659, 2004.
