nlp and text mining: an introduction · 2012. 6. 25. · nlp and text mining: an introduction...
TRANSCRIPT
![Page 1: NLP and Text Mining: an Introduction · 2012. 6. 25. · NLP and Text Mining: an Introduction Matteo Romanello (DAI/KCL) Histore Workshop { IHR { June 21, 2012. Introduction Basic](https://reader033.vdocuments.mx/reader033/viewer/2022051814/6039d2a9dafed858e132971c/html5/thumbnails/1.jpg)
NLP and Text Mining: an Introduction
Matteo Romanello (DAI/KCL)
Histore Workshop – IHR – June 21, 2012
Introduction
Basic Concepts
Section 1
Introduction
me
- BA Classics (Greek Literature and Philology)
- MA Digital Humanities (Univ. of Venice)
  - e-journals in Classics
- Currently:
  - PhD in Digital Humanities, King’s College London
    - information extraction from secondary sources
  - Research Associate at German Archeological Institute (Berlin)
    - Digital Infrastructure for Research in the Arts and Humanities (DARIAH)
What and Why?
NLP Methods
< 1990s
- rely heavily on hand-coded rules
  - extract named entities with regexps
  - grammars, parsing, etc.
- top down
- scale poorly
>= 1990s
- emphasis on statistical approaches
- machine learning
- bottom up
- scalable
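To make the pre-1990s, rule-based style concrete, here is a minimal sketch of extracting named entities with a regexp. The rule itself (a title followed by capitalised words) and the sample sentence are assumptions made up for illustration; they are not taken from the talk.

```python
import re

# Hypothetical hand-coded rule: a title, then one or more capitalised words.
PERSON_RULE = re.compile(
    r"\b(?:Mr|Mrs|Dr|Prof)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
)

# Invented sample sentence, for illustration only.
text = "Prof. Ada Rossi replied to Mr. Smith in London."

people = PERSON_RULE.findall(text)
print(people)  # ['Prof. Ada Rossi', 'Mr. Smith']
```

The rule finds only names introduced by one of the listed titles and says nothing about "London"; every new pattern needs another hand-written rule, which is why this approach scales poorly.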
NLP in DH
- increasing need for mediation of NLP knowledge
- adoption and appropriation of technology require
  - understanding of technology
  - familiarising with
    - jargon
    - to code or not to code?
    - basic concepts
  - understanding a field that is
    - evolving quickly
    - backed by a growing body of literature
    - highly specialised
(some of the main) NLP Tasks
Speech Processing
- Machine Translation
- Speech Synthesis
Information Extraction
- Named Entity Extraction
  - Named Entity [Classification | Resolution]
- Relationship Extraction
- Co-reference Resolution
Text Classification
- Sentiment Analysis
- Topic Modelling
My playlist of NLP frameworks
- Voyeur/Voyant tools [web-based]
  - reading, text analysis
  - text visualisation
- Natural Language Toolkit [Python]
- General Architecture for Text Engineering (Uni Sheffield) [Java]
- LingPipe [Java]
- OpenNLP (Apache foundation) [Java]
Challenges for NLP in DH
- tools do not always work straight out of the box
  - issues with
    - character encoding (despite Unicode)
    - output of OCR on historical documents
    - normalisation and pre-processing
- lack of ad-hoc resources
  - datasets for training, testing, evaluation
  - dictionaries and gazetteers
  - previous results for comparison
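As a small illustration of the normalisation and pre-processing issue, the sketch below cleans a string before it is passed to an NLP tool. The three steps chosen (Unicode NFC composition, replacing the historical long s, collapsing whitespace) and the sample string are assumptions for illustration; real historical or OCR'd corpora usually need a task-specific set of rules.

```python
import unicodedata


def normalise(text):
    """Apply a few illustrative clean-up steps to a raw string."""
    # Compose base letters and combining marks (e + U+0301 becomes U+00E9).
    text = unicodedata.normalize("NFC", text)
    # Replace the long s (U+017F), common in early printed books.
    text = text.replace("\u017f", "s")
    # Collapse runs of whitespace left over from line breaks or OCR.
    return " ".join(text.split())


# Invented input: decomposed accent, double space, long s.
raw = "Cafe\u0301  con\u017fent"
clean = normalise(raw)
print(clean)  # Café consent
```

Without the NFC step, two strings that look identical on screen can fail to match in a dictionary or gazetteer lookup, which is one way encoding problems persist despite Unicode.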
Section 2
Basic Concepts
Machine Learning
Supervised
- a model is learned from labelled training data
Models
- Hidden Markov Model
- Support Vector Machine
- Conditional Random Fields
Applications
- sequence labelling
Unsupervised
- a model is fitted to unlabelled data
Models
- Clustering
- Latent Dirichlet Allocation
- Latent Semantic Indexing
Applications
- document clustering
- topic modelling
Machine Learning Cycle (Sequence Labelling)
Evaluation
- TP, FP, TN, FN are defined in relation to a specific task
  - applicable to tasks where what we are looking for is known (quantifiable)
- Information Retrieval
  - retrieval of information relevant to a given search query
- TP True Positives
  - docs we expected to show up and that did show up (relevant, present)
- FP False Positives
  - docs we didn’t expect to show up but that showed up (not relevant, present)
- TN True Negatives
  - docs we didn’t expect to show up and that did not show up (not relevant, missing)
- FN False Negatives
  - docs we expected to show up but that did not show up (relevant, missing)
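The four categories can be sketched for the information retrieval case with plain set operations. The function name `confusion_counts` and the document ids are hypothetical, chosen only for illustration.

```python
def confusion_counts(retrieved, relevant, collection):
    """Count TP, FP, TN, FN for one query over a document collection."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    tp = len(retrieved & relevant)               # relevant, present
    fp = len(retrieved - relevant)               # not relevant, present
    fn = len(relevant - retrieved)               # relevant, missing
    tn = len(collection - retrieved - relevant)  # not relevant, missing
    return tp, fp, tn, fn


# Invented example: six documents, three relevant, three returned by the search.
collection = {"d1", "d2", "d3", "d4", "d5", "d6"}
relevant = {"d1", "d2", "d3"}
retrieved = {"d1", "d2", "d4"}

counts = confusion_counts(retrieved, relevant, collection)
print(counts)  # (2, 1, 2, 1)
```

Note that the counts only make sense once the set of relevant documents is known in advance, which is the "quantifiable" condition stated above.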
Evaluation Metrics
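- precision
  - $\text{precision} = \frac{tp}{tp + fp}$
- recall
  - $\text{recall} = \frac{tp}{tp + fn}$
- accuracy
  - $\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$
- f-score
  - $\text{f-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$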
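The four metrics translate directly into code. This is a minimal sketch; the function names, the choice to return 0.0 when a denominator is zero, and the example counts are assumptions made for illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Invented counts: 100 documents, 13 of them relevant, 10 retrieved.
tp, fp, tn, fn = 8, 2, 85, 5
print(round(precision(tp, fp), 3))         # 0.8
print(round(recall(tp, fn), 3))            # 0.615
print(round(accuracy(tp, tn, fp, fn), 3))  # 0.93
print(round(f_score(tp, fp, fn), 3))       # 0.696
```

In this example accuracy is high mainly because of the large number of true negatives, which is why precision, recall and f-score are usually more informative when relevant items are rare.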
Topic Modelling
M. Jockers, The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors
Key concepts
- the algorithm extracts topics and their representative words
- the human interpreter eventually assigns a name/label to each topic
- the number of topics is decided a priori
- each doc contains all the topics, in different proportions (%)
- diachronic/synchronic exploration of topics
TM frameworks
- Mallet (Java)
- Gensim (Python)
- Stanford Topic Modelling Toolbox
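The key concepts above can be seen in a toy example. The following is a minimal sketch of LDA fitted with collapsed Gibbs sampling in pure Python; it is not the implementation used by any of the frameworks listed, and the function name `lda_gibbs`, the hyperparameter values and the four tiny documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    Returns (per-document topic shares, top words per topic).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Each document gets a share of every topic; shares sum to 1.
    shares = [
        [(n + alpha) / (len(doc) + alpha * num_topics) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [w for w, c in tw.most_common() if c > 0][:3] for tw in topic_word
    ]
    return shares, top_words


# Invented mini-corpus; the number of topics (2) is fixed a priori.
docs = [
    "wine bread olive wine feast".split(),
    "bread olive feast wine".split(),
    "ship sail sea harbour ship".split(),
    "sea sail harbour ship".split(),
]
shares, top_words = lda_gibbs(docs, num_topics=2)
for doc_shares in shares:
    print([round(p, 2) for p in doc_shares])
print(top_words)
```

The output contains only numbered topics with their most frequent words; deciding that one topic is about, say, "food" and the other about "seafaring" is left to the human interpreter.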
Topic Modelling (cont’d)
https://dhs.stanford.edu/algorithmic-literacy/my-definition-of-topic-modeling/
Martha Ballard’s Diary
http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/
Thematic Index of Classics in JSTOR
http://catalog.perseus.tufts.edu/jstor/
Comprehending the Digital Humanities
https://dhs.stanford.edu/comprehending-the-digital-humanities/