ICHASS Workshop Text Mining

Introduction to Text Mining. High Performance Computing in the Humanities, Arts, and Social Science Workshop, UIUC/NCSA, July 28, 2008. Loretta Auvil, National Center for Supercomputing Applications, University of Illinois at Urbana Champaign

Upload: loretta-auvil

Post on 28-Nov-2014


Page 1: ICHASS Workshop Text Mining

Introduction to Text Mining

High Performance Computing in the Humanities, Arts, and Social Science Workshop

UIUC/NCSA July 28, 2008

Loretta Auvil

National Center for Supercomputing Applications, University of Illinois at Urbana Champaign

Page 2: ICHASS Workshop Text Mining

SEASR Text Analytics Goals

•  Address the scholarly text analytics needs of MONK and the NEH proposal – Abundance of Data by:
   •  Efficiently managing distributed literary and historical textual assets
   •  Structuring extracted information to facilitate knowledge discovery
   •  Extract information from text at a level of semantic/functional abstraction that is sufficiently rich to support question-answering
   •  Devise a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
   •  Devise algorithms for question answering and inference
   •  Develop UI for effective visual knowledge discovery with separate query logic from application logic
   •  Leveraging existing approaches and devise algorithms for clustering, inference, and Q&A
   •  Developing an Interaction UI for effective visual data exploration
   •  Enable the text analytics through SEASR components

Page 3: ICHASS Workshop Text Mining

Text Mining Definition

Many definitions in the literature
•  “The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”
•  An exploration and analysis of textual (natural-language) data by automatic and semiautomatic means to discover new knowledge
•  What is “previously unknown” information?
  –  Strict definition
     •  Information that not even the writer knows
  –  Lenient definition
     •  Rediscover the information that the author encoded in the text

Page 4: ICHASS Workshop Text Mining

Text Mining Process

•  Text Preprocessing
  –  Syntactic Text Analysis
  –  Semantic Text Analysis
•  Features Generation
  –  Bag of Words
  –  Ngrams
•  Feature Selection
  –  Simple Counting
  –  Statistics
  –  Selection based on POS
•  Text/Data Mining
  –  Classification - Supervised Learning
  –  Clustering - Unsupervised Learning
  –  Information Extraction
•  Analyzing Results
  –  Visual Exploration, Discovery and Knowledge Extraction
  –  Query-based – question answering

Page 5: ICHASS Workshop Text Mining

Text Characteristics (1)

•  Large textual database
  –  Enormous wealth of textual information on the Web
  –  Publications are electronic
•  High dimensionality
  –  Consider each word/phrase as a dimension
•  Noisy data
  –  Spelling mistakes
  –  Abbreviations
  –  Acronyms
•  Text messages are very dynamic
  –  Web pages are constantly being generated (removed)
  –  Web pages are generated from database queries
•  Not well structured text
  –  Email/Chat rooms
     •  “r u available?”
     •  “Hey whazzzzzzup”
  –  Speech

Page 6: ICHASS Workshop Text Mining

Text Characteristics (2)

•  Dependency
  –  Relevant information is a complex conjunction of words/phrases
  –  Order of words in the query
     •  hot dog stand in the amusement park
     •  hot amusement stand in the dog park
•  Ambiguity
  –  Word ambiguity
     •  Pronouns (he, she …)
     •  Synonyms (buy, purchase)
     •  Words with multiple meanings (bat – it is related to baseball or mammal)
  –  Semantic ambiguity
     •  The king saw the rabbit with his glasses. (multiple meanings)
•  Authority of the source
  –  IBM is more likely to be an authoritative source than my distant second cousin
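The word-order example on this slide can be checked directly. The following is a minimal sketch, assuming a naive whitespace tokenizer; the function names are illustrative and not part of SEASR. It shows that the two queries have the same bag of words but different bigrams, which is why order-sensitive features such as n-grams are sometimes needed.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(text):
    """Count each word, ignoring word order."""
    return Counter(tokenize(text))

def ngrams(text, n=2):
    """Count each run of n consecutive words, which preserves local order."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)   # identical word counts
same_bigrams = ngrams(q1, 2) == ngrams(q2, 2)     # different word pairs
print(same_bag, same_bigrams)
```

A bag-of-words model cannot tell these queries apart, while a bigram model can, at the cost of a larger feature space.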

Page 7: ICHASS Workshop Text Mining

Text Preprocessing

•  Syntactic analysis
  –  Tokenization
  –  Lemmatization
  –  POS tagging
  –  Shallow parsing
  –  Custom literary tagging
•  Semantic analysis
  –  Information Extraction
     •  Named Entity tagging
  –  Semantic Category (unnamed entity) tagging
  –  Co-reference resolution
  –  Ontological association (WordNet, VerbNet)
  –  Semantic Role analysis
  –  Concept-Relation extraction

Page 8: ICHASS Workshop Text Mining

Syntactic Analysis

•  Tokenization
  –  Text document is represented by the words it contains (and their occurrences)
  –  e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  –  Highly efficient
  –  Makes learning far simpler and easier
  –  Order of words is not that important for certain applications
•  Lemmatization / Stemming
  –  Involves the reduction of corpus words to their respective headwords (i.e. lemmas)
  –  Reduces dimensionality
  –  Identifies a word by its root
  –  e.g., flying, flew → fly
•  Stop words
  –  Identifies the most common words that are unlikely to help with text mining
  –  e.g., “the”, “a”, “an”, “you”
•  Parsing / Part of Speech (POS) tagging
  –  Generates a parse tree (graph) for each sentence
  –  Each sentence is a standalone graph
  –  Find the corresponding POS for each word
  –  e.g., John (noun) gave (verb) the (det) ball (noun)
  –  Shallow Parsing
     •  analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
  –  Deep Parsing
     •  more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
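The first three steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word set is the four examples from the slide, and `LEMMAS` is a hypothetical toy lookup table standing in for a real lemmatizer or stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "you"}      # the examples given on the slide
LEMMAS = {"flying": "fly", "flew": "fly"}   # hypothetical toy lookup table

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def preprocess(text):
    """Tokenize, remove stop words, then map each word to its lemma."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]

title_tokens = set(tokenize("Lord of the rings"))
cleaned = preprocess("You flew the kite; a bird is flying")
print(title_tokens)
print(cleaned)
```

Both "flew" and "flying" collapse to "fly", so the two surface forms count as one dimension in later steps.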

Page 9: ICHASS Workshop Text Mining

Semantic Analysis: Information Extraction

•  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
•  Extract the relevant information and ignore non-relevant information (important!)
•  Link related information and output in a predetermined format

Page 10: ICHASS Workshop Text Mining

Information Extraction

Information Type | State of the art (Accuracy)
Entities: an object of interest such as a person or organization. | 90-98%
Attributes: a property of an entity such as its name, alias, descriptor, or type. | 80%
Facts: a relationship held between two or more entities such as Position of a Person in a Company. | 60-70%
Events: an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction. | 50-60%

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Page 11: ICHASS Workshop Text Mining

Information Extraction Approaches

•  Terminology (name) lists
  –  This works very well if the list of names and name expressions is stable and available
•  Tokenization and morphology
  –  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
•  Use of characteristic patterns
  –  This works fairly well for novel entities
  –  Rules can be created by hand or learned via machine learning or statistical algorithms
  –  Rules capture local patterns that characterize entities from instances of annotated training data
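The first two approaches can be illustrated with a short sketch. The name list, the sample sentence, and the function name below are all hypothetical; the date pattern simply matches the DD/MM/YY shape mentioned on the slide and does not validate the calendar date.

```python
import re

# Hypothetical terminology (name) list mapping known names to entity types.
NAME_LIST = {"Rex Luthor": "Person", "Alderwood": "Location"}

# Dates are recognized by their internal format alone (DD/MM/YY).
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")

def tag_entities(text):
    """Return (offset, matched text, label) tuples sorted by offset."""
    found = []
    for name, label in NAME_LIST.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "Date"))
    return sorted(found)

text = "Mayor Rex Luthor visited Alderwood on 28/07/08."  # invented sample
entities = tag_entities(text)
for offset, surface, label in entities:
    print(offset, surface, label)
```

The list lookup only finds names it already knows, which is why the slide recommends pattern rules for novel entities.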

Page 12: ICHASS Workshop Text Mining

Information Extraction

Relation (Event) Extraction
•  Identify (and tag) the relation between two entities:
  –  A person is_located_at a location (news)
  –  A gene codes_for a protein (biology)
•  Relations require more information
  –  Identification of two entities & their relationship
  –  Predicted relation accuracy
     •  Pr(E1) * Pr(E2) * Pr(R) ≈ (.93) * (.93) * (.93) ≈ .80
•  Information in relations is less local
  –  Contextual information is a problem: the right word may not be explicitly present in the sentence
  –  Events involve more relations and are even harder
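The accuracy estimate on this slide is a product of three probabilities. As a small worked check, assuming the three steps err independently (the function name is illustrative):

```python
def relation_accuracy(p_entity1, p_entity2, p_relation):
    """Expected accuracy when both entities and the relation must all be
    correct, assuming the three errors are independent."""
    return p_entity1 * p_entity2 * p_relation

estimate = relation_accuracy(0.93, 0.93, 0.93)
print(round(estimate, 2))
```

Even with 93% accuracy at each step, roughly one relation in five comes out wrong, which is consistent with the lower accuracy ranges reported for facts and events.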

Page 13: ICHASS Workshop Text Mining

Semantic Analytics

Named Entity (NE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Entity labels: NE:Person, NE:Time, NE:Location, NE:Organization

Page 14: ICHASS Workshop Text Mining

Semantic Analysis

Semantic Category (unnamed entity, UNE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Entity label: UNE:Organization

Page 15: ICHASS Workshop Text Mining

Semantic Analysis

Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Entity label: UNE:Organization

Page 16: ICHASS Workshop Text Mining

Semantic Analysis

Semantic Role Analysis

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Role labels: ACTOR, ACTION, WHEN, OBJECT, WHERE; ACTION, OBJECT, COMPL

Page 17: ICHASS Workshop Text Mining

Semantic Analysis

Concept-Relation Extraction

Graph nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)

Edge labels: actor (who), time (when), object (what), object (what), location (where)

Page 18: ICHASS Workshop Text Mining

IE – Template Extraction - Steps

</VerbGroup> …

Page 19: ICHASS Workshop Text Mining

(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson

…….

The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.

The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …

Template Extraction

<Facility>Finsbury Park Mosque</Facility>

<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>

<Country>England</Country>

<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>

<Country>England</Country>

<Country>France</Country>

<Country>United States</Country>

<Country>Belgium</Country>

<Person>Abu Hamza al-Masri</Person>

<City>London</City>

Page 20: ICHASS Workshop Text Mining

Streaming Text: Knowledge Extraction

•  Leveraging some earlier work on information extraction from text streams

Information extraction
•  process of using advanced automated machine learning approaches
•  to identify entities in text documents
•  extract this information along with the relationships these entities may have in the text documents

The visualization on this slide demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
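The co-occurrence rule described here can be sketched as follows. This is a minimal illustration, not the SEASR implementation: it assumes entities have already been identified and reduced to single tokens, and the window size and sample sentence are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_relations(tokens, entities, window=5):
    """Count pairs of distinct entities that occur at most `window` tokens apart."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = Counter()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            pairs[tuple(sorted((a, b)))] += 1   # unordered pair as the key
    return pairs

tokens = "Luthor announced today a new facility in Alderwood called Boynton".split()
entities = {"Luthor", "Alderwood", "Boynton"}

relations = cooccurrence_relations(tokens, entities, window=5)
print(dict(relations))
```

Widening the window links more entities but also adds spurious relationships, so the window size is a tuning choice.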

Page 21: ICHASS Workshop Text Mining

Semantic Analysis

•  Word Sense Disambiguation
  –  Context based or proximity based
  –  Very accurate

Page 22: ICHASS Workshop Text Mining

Ontological Association (WordNet)

•  WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
•  Search for dog
  –  n: dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
  –  n: frump, dog (a dull unattractive unpleasant girl or woman)
  –  n: dog (informal term for a man)
  –  n: cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
  –  n: frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
  –  n: pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
  –  n: andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
  –  v: chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)

Page 23: ICHASS Workshop Text Mining

Feature Selection

•  Reduce Dimensionality
  –  Learners have difficulty addressing tasks with high dimensionality
•  Irrelevant Features
  –  Not all features help!
  –  Remove features that occur in only a few documents
  –  Reduce features that occur in too many documents
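One common way to apply both rules is to filter terms by document frequency. The sketch below assumes whitespace tokenization; the thresholds `min_df` and `max_df_ratio` and the sample documents are hypothetical choices, not values from the workshop.

```python
from collections import Counter

def select_features(documents, min_df=2, max_df_ratio=0.8):
    """Keep terms found in at least `min_df` documents and in at most
    `max_df_ratio` of all documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))   # count each term once per document
    n_docs = len(documents)
    return {term for term, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df_ratio}

docs = [
    "the whale swam in the sea",
    "the captain chased the whale",
    "the captain sailed the ship",
    "the sea was calm",
]
features = select_features(docs)
print(sorted(features))
```

Here "the" is dropped because it appears in every document, and words such as "calm" are dropped because they appear in only one.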

Page 24: ICHASS Workshop Text Mining

Text Mining: General Application Areas

•  Information Retrieval
  –  Indexing and retrieval of textual documents
  –  Finding a set of (ranked) documents that are relevant to the query
•  Information Extraction
  –  Extraction of partial knowledge in the text
•  Web Mining
  –  Indexing and retrieval of textual documents and extraction of partial knowledge using the web
•  Classification
  –  Predict a class for each text document
•  Clustering
  –  Generating collections of similar text documents

Page 25: ICHASS Workshop Text Mining

Text Mining: Supervised vs. Unsupervised

•  Supervised learning (Classification)
  –  Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  –  Split into training data and test data for model building process
  –  New data is classified based on the model built with the training data
  –  Techniques
     •  Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
•  Unsupervised learning (Clustering)
  –  Class labels of training data are unknown
  –  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Page 26: ICHASS Workshop Text Mining

Results: Social Network (Tom in Red)

Page 27: ICHASS Workshop Text Mining

Results: Timeline

Page 28: ICHASS Workshop Text Mining

Results: Maps

Page 29: ICHASS Workshop Text Mining

Text Mining: T2K and ThemeWeaver

Page 30: ICHASS Workshop Text Mining

Images from Pacific Northwest Laboratory

Text Mining: Themescape and ThemeRiver

•  Visualizing Relationships Between Documents

Page 31: ICHASS Workshop Text Mining

Gather – Analyze – Present

Page 32: ICHASS Workshop Text Mining

Text Mining: Applications

•  Email: Spam filtering
•  News Feeds: Discover what is interesting
•  Medical: Identify relationships and link information from different medical fields
•  Homeland Security
•  Marketing: Discover distinct groups of potential buyers and make suggestions for other products
•  Industry: Identifying groups of competitors' web pages
•  Job Seeking: Identify parameters in searching for jobs

Page 33: ICHASS Workshop Text Mining

Text Mining: Classification Definition

•  Given: Collection of labeled records
  –  Each record contains a set of features (attributes), and the true class (label)
  –  Create a training set to build the model
  –  Create a testing set to test the model
•  Find: Model for the class as a function of the values of the features
•  Goal: Assign a class (as accurately as possible) to previously unseen records
•  Evaluation: What Is Good Classification?
  –  Correct classification
     •  Known label of test example is identical to the predicted class from the model
  –  Accuracy ratio
     •  Percent of test set examples that are correctly classified by the model
  –  Distance measure between classes can be used
     •  e.g., classifying “football” document as a “basketball” document is not as bad as classifying it as “crime”
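The given/find/goal/evaluation steps above can be sketched with a small Bayesian text classifier, one of the supervised techniques named earlier in the deck. This is an illustrative sketch, not the workshop's implementation: the records are invented, tokenization is a whitespace split, and add-one smoothing is an assumed design choice.

```python
import math
from collections import Counter, defaultdict

def train(records):
    """Build per-class word counts and class counts from (text, label) records."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in records:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def predict(model, text):
    """Return the class with the highest log posterior under naive Bayes."""
    word_counts, class_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out a class.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, test_records):
    """Fraction of test records whose predicted class equals the known label."""
    correct = sum(predict(model, text) == label for text, label in test_records)
    return correct / len(test_records)

training_set = [
    ("touchdown pass quarterback", "football"),
    ("quarterback field goal", "football"),
    ("robbery suspect arrested", "crime"),
    ("police arrested suspect", "crime"),
]
testing_set = [
    ("quarterback threw a touchdown", "football"),
    ("police chase suspect", "crime"),
]

model = train(training_set)
acc = accuracy(model, testing_set)
print(acc)
```

The accuracy ratio treats every mistake equally; a class-distance measure, as in the football/basketball/crime example, would weight some mistakes as less severe than others.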

Page 34: ICHASS Workshop Text Mining

Text Mining: Clustering Definition

•  Given: Set of documents and a similarity measure among documents
•  Find: Clusters such that
  –  Documents in one cluster are more similar to one another
  –  Documents in separate clusters are less similar to one another
•  Goal:
  –  Finding a correct set of documents
•  Similarity Measures:
  –  Euclidean distance if attributes are continuous
  –  Other problem-specific measures
     •  e.g., how many words are common in these documents
•  Evaluation: What Is Good Clustering?
  –  Produce high quality clusters with
     •  high intra-class similarity
     •  low inter-class similarity
  –  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
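The evaluation criteria above can be made concrete with a word-overlap measure. The sketch below uses Jaccard similarity (shared words divided by all distinct words) as one possible "words in common" measure; that choice, the function names, and the sample clusters are assumptions for illustration.

```python
from itertools import combinations

def jaccard(doc_a, doc_b):
    """Shared words divided by all distinct words in the two documents."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def cluster_quality(clusters):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    labeled = [(doc, k) for k, docs in enumerate(clusters) for doc in docs]
    intra, inter = [], []
    for (d1, k1), (d2, k2) in combinations(labeled, 2):
        (intra if k1 == k2 else inter).append(jaccard(d1, d2))
    return mean(intra), mean(inter)

clusters = [
    ["whale hunt at sea", "whale ship at sea"],
    ["court trial verdict", "court trial jury"],
]
intra, inter = cluster_quality(clusters)
print(intra, inter)
```

A good clustering shows a high first value and a low second value; comparing the two numbers across candidate clusterings gives a simple quality check.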

Page 35: ICHASS Workshop Text Mining

SEASR

Meandre Workbench

Page 36: ICHASS Workshop Text Mining

Future Work

•  Enhancements to Semantic Analysis
  –  Use of Ontological Association (WordNet, VerbNet)
  –  Improve co-referencing
  –  Improve fact extraction
•  Visual exploration tools

Page 37: ICHASS Workshop Text Mining

SEASR @ Work - MONK