Text mining the Chinese classics
Donald Sturgeon, Department of Computer Science, Durham University, djs@dsturgeon.net
Overview
• General introduction to ctext.org
  – Organization and principles
• Locating and using texts on ctext.org
  – Setup
  – Search
  – Editing
• Digital tools for textual analysis & visualization
  – Text reuse
  – Pattern search (regular expressions)
  – Visualization
Not a “traditional” text database
• Not a collection of reviewed, authoritative text
  – Databases of this type:
    • Academia Sinica 漢籍全文
    • CHANT / 漢達文庫, etc.
• Instead: methods of navigating primary sources
  – Authority does not derive from expert review
  – Instead: verification of evidence by individual users
• In particular: primary source materials
Interface: 中/英, 繁/簡
Full-text search
Title search
Login & Settings
Textual database
Other sections: Library, Wiki, Dictionary, etc.
Instructions
Finding Texts
• Left-hand side => “Title search”
• Possible results:
  – Transcription (text DB) (not user editable)
  – Transcription (wiki) (user editable)
  – Transcription (OCR, wiki) (uncorrected, editable)
  – Scanned primary source (not a transcription)
• Example:
Indicates this transcription is linked to a scanned representation of the 四庫全書 edition of the text
Editions
[Diagram: the abstract work 論衡 and its editions (四部叢刊, 四庫全書薈要, 漢魏叢書, …), each shown under two columns: “Digital transcription (DB/Wiki)” and “Scanned text (Library)”.]
Hands-on tutorial: Part I
• Overview
  – Setup
  – Finding texts, searching in texts, locating in scans
  – Special functions in the textual database
    • Parallels, translations, commentary
  – Editing
  – Plugins
• (Tutorial: “Practical introduction to ctext.org”)
Hands-on tutorial: Part II
• Text Tools plugin
• Textual analysis tools
  – N-gram counts
  – Text reuse identification using n-grams
  – Regular expressions
  – Cosine similarity
  – Principal Component Analysis
• Visualization tools
  – Network graphs
  – Heatmaps
  – Charts
• (Tutorial: “Text Tools for ctext.org”)
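To make three of the analysis tools listed for Part II more concrete, here is a minimal standalone sketch of character n-gram counts, shared n-grams as a rough text reuse signal, and cosine similarity over n-gram counts. It is not the Text Tools plugin's implementation: the function names are hypothetical, and the two short strings are toy inputs chosen only for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def shared_ngrams(text_a, text_b, n):
    """Return the n-grams present in both texts (a crude text reuse signal)."""
    return set(ngram_counts(text_a, n)) & set(ngram_counts(text_b, n))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Toy inputs for illustration only (punctuation already stripped).
text_a = "學而時習之不亦說乎"
text_b = "子曰學而時習之"

print(sorted(shared_ngrams(text_a, text_b, 4)))
print(cosine_similarity(ngram_counts(text_a, 1), ngram_counts(text_b, 1)))
```

Character n-grams are used here because classical Chinese text is not word-segmented. Longer n-grams make accidental matches less likely but miss reuse that involves small variations.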
CTP URNs
• URNs identify textual objects
• Finding:
  – Open contents page for the text
  – Look at bottom-right corner
  – CTP URNs always begin “ctp:…”
• Decoding:
  – Same as finding texts by title:
    • Paste URN into “Title search” box
    • Click “Search”
    • Contents page for that text will open
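When URNs are collected in a script or spreadsheet before being pasted into the title search box, the "always begin ctp:" rule can serve as a quick sanity check. The sketch below assumes nothing about the site beyond that prefix rule. The helper name is hypothetical, and the sample value is only an illustrative string of the right shape.

```python
CTP_PREFIX = "ctp:"


def looks_like_ctp_urn(value):
    """Return True if the string has the 'ctp:' prefix followed by some identifier.

    This checks the prefix only; it does not confirm that the URN resolves to a text.
    """
    value = value.strip()
    return value.startswith(CTP_PREFIX) and len(value) > len(CTP_PREFIX)


# Illustrative values only.
for candidate in ["ctp:analects", "analects", "ctp:"]:
    print(candidate, looks_like_ctp_urn(candidate))
```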