ICHASS Workshop Lab

SEASR Lab: Text Mining. High Performance Computing in the Humanities, Arts, and Social Science Workshop, UIUC/NCSA, July 28, 2008. Loretta Auvil, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Upload: loretta-auvil

Post on 28-Nov-2014


TRANSCRIPT

Page 1: ICHASS Workshop Lab

SEASR Lab: Text Mining

High Performance Computing in the Humanities, Arts, and Social Science Workshop

UIUC/NCSA, July 28, 2008

Loretta Auvil

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Page 2: ICHASS Workshop Lab

SEASR

Meandre Workbench

Page 3: ICHASS Workshop Lab

Text Mining: Clustering Definition

•  Given: Set of documents and a similarity measure among documents
•  Find: Clusters such that
   –  Documents in one cluster are more similar to one another
   –  Documents in separate clusters are less similar to one another
•  Goal:
   –  Finding a correct set of documents
•  Similarity Measures:
   –  Euclidean distance if attributes are continuous
   –  Other problem-specific measures
      •  e.g., how many words are common in these documents
•  Evaluation: What Is Good Clustering?
   –  Produce high quality clusters with
      •  high intra-class similarity
      •  low inter-class similarity
   –  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
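The two kinds of similarity measure named on this slide can be sketched in a few lines of Python. This is only an illustration, not the code used by the SEASR components; the function names `euclidean_distance` and `shared_word_count` are hypothetical, and splitting on whitespace is an assumed stand-in for real tokenization.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_word_count(doc_a, doc_b):
    """Number of distinct words that appear in both documents.

    Assumption: words are lowercased and split on whitespace only.
    """
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b)


# Continuous attributes: smaller distance means more similar.
print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))

# Problem-specific measure: larger count means more similar.
print(shared_word_count("the whale swam away", "the white whale"))
```

Note that the first measure is a distance (lower is closer) while the second is a similarity (higher is closer), so they cannot be swapped without inverting the comparison.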

Page 4: ICHASS Workshop Lab

Text Clustering

•  Load a single document
•  Segment that document approximately every 250 words (property can be adjusted)
•  Part of Speech Tagging
•  Term selection by Part of Speech
•  Cluster each segment based on similarity metric
•  Create visualization
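The segmentation step in the list above might look roughly like the following. This is a minimal sketch under stated assumptions: the helper names `segment_words` and `term_counts` are hypothetical, words are split on whitespace, and segments are cut at exactly `segment_size` words, whereas the flow segments "approximately" every 250 words by rules the slides do not describe.

```python
from collections import Counter


def segment_words(text, segment_size=250):
    """Split a document into consecutive segments of segment_size words.

    The last segment may be shorter than segment_size.
    """
    words = text.split()
    return [words[i:i + segment_size]
            for i in range(0, len(words), segment_size)]


def term_counts(segments):
    """Count lowercased terms in each segment, one Counter per segment."""
    return [Counter(word.lower() for word in segment) for segment in segments]


# A toy 600-word document yields segments of 250, 250, and 100 words.
document = " ".join(["whale"] * 600)
segments = segment_words(document)
print([len(segment) for segment in segments])
print(term_counts(segments)[2]["whale"])
```

Each per-segment term count can then serve as the attribute vector that a clustering step compares.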

Page 5: ICHASS Workshop Lab

Text Clustering Visualization

Dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree. The height of each U represents the distance between the two objects being connected
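To make the "height of each U" concrete, here is a naive agglomerative clustering sketch that records the distance at which each pair of clusters is merged. It assumes single linkage and Euclidean distance, neither of which the slides specify for the HAC component, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage_merges(points):
    """Merge the two closest clusters repeatedly until one remains.

    Returns a list of (members_a, members_b, height) tuples, where the
    members are indices into points and height is the merge distance.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0)]
for left, right, height in single_linkage_merges(points):
    print(left, right, height)
```

Each tuple corresponds to one U in a dendrogram: the two members are the branches being joined, and the height is where the horizontal bar of the U would be drawn.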

Page 6: ICHASS Workshop Lab

Lab Session

•  Open browser for Meandre Workbench to http://demo.seasr.org:1712
•  Login
   –  userid: admin
   –  Password: admin
   –  Server: demo.seasr.org
   –  Port: 1714

Page 7: ICHASS Workshop Lab

SEASR

Meandre Workbench

Repository Panel

Workspace

Details Panel

Output

Page 8: ICHASS Workshop Lab

Load an Existing flow

•  Click on Flows in the Repository Panel to open the saved flows
•  Double Click on TextClustering2 from this list
•  Click on Run Flow to execute this flow on the Meandre server

Page 9: ICHASS Workshop Lab

Dendrogram Results

•  Clicking on a “cluster” shows it in blue and displays the list of words
•  Display shows the average frequency of each word within the cluster and its average frequency over all clusters
•  On Windows, you can double click to drill into a cluster
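The two averages shown in the display could be computed along these lines. This is one plausible reading, not the visualization's actual code: it assumes "average frequency" means the mean count of a word per segment, first over the segments in the selected cluster and then over all segments. The function name `average_frequency` and the toy counts are hypothetical.

```python
from collections import Counter


def average_frequency(word, segment_counts):
    """Mean count of word per segment across the given segments."""
    if not segment_counts:
        return 0.0
    return sum(counts[word] for counts in segment_counts) / len(segment_counts)


# Toy term counts for three segments; the first two form one cluster.
segments = [
    Counter({"whale": 4, "sea": 2}),
    Counter({"whale": 2}),
    Counter({"sea": 6}),
]
cluster = segments[:2]

print(average_frequency("whale", cluster))   # within the cluster
print(average_frequency("whale", segments))  # over all segments
```

A word whose within-cluster average is well above its overall average is one that helps characterize that cluster.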

Page 10: ICHASS Workshop Lab

Changing Properties

Click on a component, then double click in the Property Panel of the Details Panel

•  Push String
   –  string: paste a URL for a text document (ASCII text file)
•  Text Segmentation
   –  segment_size: setting for number of terms per document (segment)
•  HAC
   –  Distance metric: setting for metric
•  Filter POS
   –  tag_list: POS to include in the list of terms
      •  Nouns: NN, NNP, NNPS, NNS, NP, NPS
      •  Verbs: VB, VBD, VBG, VBN, VBP, VBZ
      •  Adjectives: JJ, JJR, JJS, JJSS
      •  Adverbs: RB, RBR, RBS
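The effect of the `tag_list` property can be illustrated with a small filter over tagged tokens. This is a sketch only: the function `filter_by_pos` is hypothetical, the (word, tag) pairs are hand-tagged for the example rather than produced by a tagger, and the real Filter POS component may behave differently.

```python
# Noun tags copied from the list above.
NOUN_TAGS = {"NN", "NNP", "NNPS", "NNS", "NP", "NPS"}


def filter_by_pos(tagged_tokens, tag_list):
    """Keep only the words whose part-of-speech tag is in tag_list."""
    allowed = set(tag_list)
    return [word for word, tag in tagged_tokens if tag in allowed]


# Hand-tagged example tokens (assumed tags, for illustration only).
tagged = [("The", "DT"), ("whale", "NN"), ("swam", "VBD"), ("quickly", "RB")]

print(filter_by_pos(tagged, NOUN_TAGS))
print(filter_by_pos(tagged, ["VBD", "RB"]))
```

Restricting the tag list to nouns tends to cluster segments by topic, while adding verbs or adjectives brings in more of the writing style.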

Page 11: ICHASS Workshop Lab

Run Again

•  Change some properties and rerun…

Page 12: ICHASS Workshop Lab

Documentation

•  http://seasr.org/meandre/documentation/