
Text mining the Chinese classics

Donald Sturgeon
Department of Computer Science
Durham University
djs@dsturgeon.net

Overview
• General introduction to ctext.org
  – Organization and principles
• Locating and using texts on ctext.org
  – Setup
  – Search
  – Editing
• Digital tools for textual analysis & visualization
  – Text reuse
  – Pattern search (regular expressions)
  – Visualization

Fundamental types of data
• Images
• Transcriptions

Not a “traditional” text database
• Not a collection of reviewed, authoritative text
  – Databases of this type:
    • Academia Sinica 漢籍全文
    • CHANT / 漢達文庫, etc.
• Instead: methods of navigating primary sources
  – Authority does not derive from expert review
  – Instead: verification of evidence by individual users
• In particular: primary source materials

Interface
• 中/英 (Chinese/English), 繁/簡 (Traditional/Simplified)
• Full-text search
• Title search
• Login & Settings
• Textual database
• Other sections: Library, Wiki, Dictionary, etc.
• Instructions

Finding Texts
• Left-hand side => “Title search”
• Possible results:
  – Transcription (text DB) (not user editable)
  – Transcription (wiki) (user editable)
  – Transcription (OCR, wiki) (uncorrected, editable)
  – Scanned primary source (not a transcription)
• Example:
  – Indicates this transcription is linked to a scanned representation of the 四庫全書 edition of the text

Editions
[Diagram: an abstract work (論衡) is linked to its digital transcriptions (DB/Wiki) and its scanned texts (Library); each exists in editions such as 四部叢刊, 四庫全書薈要, 漢魏叢書, …]


Hands-on tutorial: Part I
• Overview
  – Setup
  – Finding texts, searching in texts, locating in scans
  – Special functions in the textual database
    • Parallels, translations, commentary
  – Editing
  – Plugins
• (Tutorial: “Practical introduction to ctext.org”)

Hands-on tutorial: Part II
• Text Tools plugin
• Textual analysis tools
  – N-gram counts
  – Text reuse identification using n-grams
  – Regular expressions
  – Cosine similarity
  – Principal Component Analysis
• Visualization tools
  – Network graphs
  – Heat maps
  – Charts
• (Tutorial: “Text Tools for ctext.org”)
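
Several of the analysis techniques listed above can be illustrated with a minimal Python sketch. This is not the Text Tools plugin's own implementation. The function names (`ngram_counts`, `shared_ngrams`, `cosine_similarity`) and the two short sample strings are hypothetical, chosen only for illustration. The sketch counts character n-grams, finds the n-grams two texts share (a very simple form of text reuse detection), computes cosine similarity between n-gram count vectors, and runs one regular-expression pattern search.

```python
import math
import re
from collections import Counter


def ngram_counts(text, n):
    """Count overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def shared_ngrams(a, b, n):
    """Return the set of n-grams that occur in both texts."""
    return set(ngram_counts(a, n)) & set(ngram_counts(b, n))


def cosine_similarity(c1, c2):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)


# Short sample strings for illustration only (punctuation removed).
text_a = "學而時習之不亦說乎"
text_b = "子曰學而時習之"

bigrams_a = ngram_counts(text_a, 2)
bigrams_b = ngram_counts(text_b, 2)

# 4-grams present in both strings: candidate instances of text reuse.
reused = shared_ngrams(text_a, text_b, 4)

# Similarity of the two strings based on their bigram counts.
similarity = cosine_similarity(bigrams_a, bigrams_b)

# Pattern search: "不亦", then any single character, then "乎".
matches = re.findall(r"不亦.乎", text_a)

print(sorted(reused), round(similarity, 3), matches)
```

Character n-grams are used here because classical Chinese text has no spaces marking word boundaries, so the sketch needs no word segmentation. Principal Component Analysis is left out because it would need a numerical library.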

CTP URNs
• URNs identify textual objects
• Finding:
  – Open contents page for the text
  – Look at bottom-right corner
  – CTP URNs always begin “ctp:…”
• Decoding:
  – Same as finding texts by title:
    • Paste URN into “Title search” box
    • Click “Search”
    • Contents page for that text will open
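
When collecting URNs copied from contents pages, a quick check of the prefix can catch copy-and-paste mistakes before they are pasted into the title search box. The sketch below assumes only what is stated above, namely that CTP URNs begin with “ctp:”. The function name `looks_like_ctp_urn` and the sample values are hypothetical, and the check does not confirm that a URN actually resolves to a text on ctext.org.

```python
CTP_PREFIX = "ctp:"


def looks_like_ctp_urn(value):
    """Return True if value starts with "ctp:" and has something after it.

    This checks the prefix only; it does not verify that the URN
    identifies an existing text.
    """
    value = value.strip()
    return value.startswith(CTP_PREFIX) and len(value) > len(CTP_PREFIX)


# Hypothetical sample values, for illustration only.
candidates = ["ctp:example-text", " ctp:example-text ", "ctp:", "example-text"]
valid = [c.strip() for c in candidates if looks_like_ctp_urn(c)]
print(valid)
```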


1. Select function
2. Choose texts
3. Run analysis
4. View output

Hands-on tutorial

• https://dsturgeon.net/maraas