Elements of Natural Language Processing
Course on Ontologies and the Semantic Web

Prof. Alfio Ferrara

Academic Year 2012/2013

Source: islab.di.unimi.it/ontoweb/materiale/nlp.pdf

Contents

1 Introduction

2 Building the lexical vocabulary

3 Creating a text corpus

4 Occurrences and distributions

5 Text similarity

1 Introduction

Natural Language Processing (NLP)

• By NLP we mean a family of techniques for processing information expressed in natural language, often found in unstructured sources.

• NLP techniques can be divided into three broad families:

– Syntactic techniques: mainly based on the lexicon and on the statistical distributions of terms

– Grammatical techniques: based on the grammatical or morphological structure of linguistic expressions

– Semantic techniques: based on the meanings of terms and expressions

NLP and the Semantic Web

• NLP techniques, which were developed before and independently of the Semantic Web, can prove useful in many application contexts related to it. In fact:

– Although structured, the information in currently available RDF datasets depends heavily on the meaning of terms and on how they are used

– NLP techniques can be used to extract semantically useful information from many unstructured sources

In particular, we will see how NLP techniques can be adapted to RDF resources by treating RDF entities (abstract objects described by properties and property values) as small texts within a larger corpus (the RDF dataset).

Topics covered

• This short introduction deals mainly with syntactic techniques that can be useful in the Semantic Web context. In particular, we will address the following problems:

– Building a vocabulary of terms

– Computing term relevance and the vector space model

– Identifying compound terms of semantic interest

– Determining the lexical similarity of short texts (e.g. tweets extracted from Twitter profiles, http://www.twitter.com)

• Real examples will be provided, built with Python (NLTK toolkit) and Java (Lucene) tools

Main bibliographic references:

• C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008, Chapters 2 and 6. (online: http://nlp.stanford.edu/IR-book/information-retrieval-book.html)

• S. Bird, E. Klein, E. Loper. Natural Language Processing with Python. O'Reilly, 2009. (online: http://nltk.org/book)

• Lucene documentation: http://wiki.apache.org/lucene-java/HowTo

Reference example

• As a reference example we will use a corpus of about 21,000 tweets collected from the official Twitter timelines of the main Italian political leaders and parties at two points in time: 20 February 2013 and 18 March 2013

2 Building the lexical vocabulary

Token identification

• The first step in building the lexical vocabulary of a document within a corpus is to transform the text into a sequence of tokens

• We speak of "tokens" rather than terms because what counts as a single term in natural language may be represented as several tokens, depending on the text analysis technique chosen and on the language

• Example (four alternative tokenizations of the same form)

aren't → 〈aren't〉 〈arent〉 〈are | n't〉 〈aren | t〉
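The four alternatives in the example can be reproduced with a few lines of standard-library Python. This is only an illustrative sketch: the function name `token_variants` and the regular expressions are assumptions made here, not part of NLTK or Lucene.

```python
import re

def token_variants(term):
    """Return four possible token sequences for a contracted form."""
    return {
        # keep the contraction as one token
        "whole": re.findall(r"\w+'\w+|\w+", term),
        # drop the apostrophe altogether
        "apostrophe_removed": [term.replace("'", "")],
        # split off the clitic n't
        "clitic_split": re.findall(r"\w+(?=n't)|n't|\w+", term),
        # split at the apostrophe
        "apostrophe_split": term.split("'"),
    }

variants = token_variants("aren't")
for name, tokens in variants.items():
    print(name, tokens)
```

Which variant is appropriate depends on the later processing steps: splitting off the clitic, for instance, only pays off if the negation is treated as a token of its own.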

Regular expressions and punctuation

• The most common way to identify tokens in a text is to segment it with regular expressions that take punctuation into account

• The language and the character encoding in use must also be taken into account

Tokenization with NLTK

    import nltk

    text = "RT @LaTorreNormanna: La sovraesposizione mediatica di Casini-Fini-Monti adesso è davvero imbarazzante...#Elezioni2013"

    # Non-capturing groups (?:...) are used because current NLTK releases
    # return the captured groups, not the whole match, when groups capture.
    pattern = r'''(?x)
          (?:[A-Z]\.)+            # abbreviations
        | \w+(?:-\w+)*            # words, including hyphenated compounds
        | \#\w*                   # keep hashtags
        | @\w*                    # keep mentions
        | \$\d+(?:\.\d+)?%?       # percentages and currency
        | \.\.\.                  # ellipsis
        | [\]\[.,;"'?():_`-]      # single punctuation tokens
    '''

    tokens = nltk.regexp_tokenize(text, pattern)
    print(tokens)

Output: ['RT', '@LaTorreNormanna', ':', 'La', 'sovraesposizione', 'mediatica', 'di', 'Casini-Fini-Monti', 'adesso', 'è', 'davvero', 'imbarazzante', '...', '#Elezioni2013']

Tokenization with Lucene (1)

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Runs the given analyzer on the string and collects the resulting tokens.
    public static List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<String>();
        try {
            TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
            while (stream.incrementToken()) {
                result.add(stream.getAttribute(CharTermAttribute.class).toString());
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return result;
    }

Tokenization with Lucene (2)

    String text = "RT @LaTorreNormanna: La sovraesposizione mediatica di Casini-Fini-Monti adesso è davvero imbarazzante...#Elezioni2013";

    // StandardAnalyzer: lowercases and splits on punctuation
    List<String> standard = tokenizeString(new StandardAnalyzer(Version.LUCENE_35), text);
    System.out.println(standard);

Output: [rt, latorrenormanna, la, sovraesposizione, mediatica, di, casini, fini, monti, adesso, è, davvero, imbarazzante, elezioni2013]

    // WhitespaceAnalyzer: splits on whitespace only
    List<String> whitespace = tokenizeString(new WhitespaceAnalyzer(Version.LUCENE_35), text);
    System.out.println(whitespace);

Output: [RT, @LaTorreNormanna:, La, sovraesposizione, mediatica, di, Casini-Fini-Monti, adesso, è, davvero, imbarazzante...#Elezioni2013]

Tokenization is often accompanied by further filtering of the portions of text that are actually kept in the token sequence, in particular to remove frequent terms in common use (stopwords) and punctuation.

Stopword removal

• The purpose of removing stopwords is to reduce the statistical "noise" caused by terms that are in common use and often carry little semantic weight

• English examples: the, of, on, ...

• Italian examples: a, da, che, ...

• Stopword removal relies on known, language-dependent lists
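As a library-free illustration of list-based filtering, the sketch below drops stopwords and pure-punctuation tokens from an already tokenized tweet. The seven-word stopword set and the function name `filter_tokens` are assumptions made for this example; real lists are far longer.

```python
import string

# Tiny illustrative list (an assumption); real lists are language-specific
# and contain hundreds of entries.
ITALIAN_STOPWORDS = {"a", "da", "che", "di", "la", "e", "è"}

def filter_tokens(tokens, stopwords):
    """Drop stopwords (case-insensitively) and tokens made only of punctuation."""
    kept = []
    for tok in tokens:
        if tok.lower() in stopwords:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept

tokens = ["RT", "@LaTorreNormanna", ":", "La", "sovraesposizione", "mediatica",
          "di", "Casini-Fini-Monti", "adesso", "è", "davvero", "imbarazzante",
          "...", "#Elezioni2013"]
filtered = filter_tokens(tokens, ITALIAN_STOPWORDS)
print(filtered)
```

Comparing in lowercase keeps the list short, at the price of also removing capitalized forms such as "La" at the start of a sentence.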

Stopword and punctuation removal with NLTK

    # ... (text defined as in the tokenization example)
    # The pattern has no punctuation alternatives, so punctuation is dropped.
    pattern = r"(?x)(?:[A-Z]\.)+ | \w+(?:-\w+)* | \#\w* | @\w* | \$\d+(?:\.\d+)?%?"
    tokens = nltk.regexp_tokenize(text, pattern)
    print(tokens)

    # Stopword removal (the list is loaded once, as a set)
    stop = set(nltk.corpus.stopwords.words('italian'))
    print([w for w in tokens if w.lower() not in stop])

Output: ['RT', '@LaTorreNormanna', 'sovraesposizione', 'mediatica', 'Casini-Fini-Monti', 'adesso', 'è', 'davvero', 'imbarazzante', '#Elezioni2013']

Note that è survives in this reported output. It belongs to the Italian stopword list, so a run on Unicode strings would be expected to drop it.

Stopword and punctuation removal with Lucene

    ...
    TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
    stream = new LowerCaseFilter(Version.LUCENE_35, stream);
    stream = new StopFilter(Version.LUCENE_35, stream, ItalianAnalyzer.getDefaultStopSet());
    ...

Output: [rt, @latorrenormanna:, sovraesposizione, mediatica, casini-fini-monti, adesso, davvero, imbarazzante...#elezioni2013]

Term normalization

• One of the fundamental needs of NLP techniques is to compare terms and to identify their occurrences correctly

• To this end, the inflected forms of a term must be reduced to a normal form. E.g. parola, parole → parol

• Two main families of techniques are used for this:

– Syntactic → stemming

– Semantic → lemmatization

Stemming

• Stemming techniques cut off the endings of terms according to syntactic and empirical rules that depend only partly on the language. Stemming gives good results for many terms, but it also has several imperfections.

• There are many stemming algorithms, e.g. the Lancaster stemmer, the Porter stemmer, and the Snowball stemmers. The latter are available in different versions for different languages.

• As an example, see Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
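To make the idea of rule-based suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter, Lancaster or Snowball algorithm: the suffix list, the minimum stem length and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Longer suffixes are tried before single vowels.
TOY_SUFFIXES = ["mente", "zione", "are", "ere", "ire", "a", "e", "i", "o"]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["parola", "parole", "essere", "sia"]:
    print(w, "=>", toy_stem(w))
```

Regular inflections such as parola/parole collapse to the same stem, while two forms of an irregular verb (essere, sia) do not. Purely syntactic rules cannot fix that kind of imperfection.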

NLTK examples (1)

    import nltk

    en = nltk.stem.snowball.EnglishStemmer()
    it = nltk.stem.snowball.ItalianStemmer()
    t1 = 'query'
    t2 = 'queries'
    t3 = 'parola'
    t4 = 'parole'
    print("English on English")
    print(t1 + " => " + en.stem(t1))
    print(t2 + " => " + en.stem(t2))
    print("English on Italian")
    print(t3 + " => " + en.stem(t3))
    print(t4 + " => " + en.stem(t4))
    print("Italian on English")
    print(t1 + " => " + it.stem(t1))
    print(t2 + " => " + it.stem(t2))
    print("Italian on Italian")
    print(t3 + " => " + it.stem(t3))
    print(t4 + " => " + it.stem(t4))

NLTK examples (2)

Output

    English on English
    query => queri
    queries => queri
    English on Italian
    parola => parola
    parole => parol
    Italian on English
    query => query
    queries => queries
    Italian on Italian
    parola => parol
    parole => parol

Limits of stemming techniques

    import nltk

    it = nltk.stem.snowball.ItalianStemmer()
    # 'sia' is a subjunctive form of 'essere' (to be), yet the stems differ
    t1 = 'essere'
    t2 = 'sia'
    print(it.stem(t1))
    print(it.stem(t2))

Output: esser sia

Stemming with NLTK

    # ... (text and the Italian stemmer `it` defined as in the previous examples)
    pattern = r"(?x)(?:[A-Z]\.)+ | \w+(?:-\w+)* | \#\w* | @\w* | \$\d+(?:\.\d+)?%?"
    tokens = nltk.regexp_tokenize(text, pattern)
    stop = set(nltk.corpus.stopwords.words('italian'))
    tokens = [w for w in tokens if w.lower() not in stop]
    # Stemming (the Snowball stemmer also lowercases its input)
    print([it.stem(w) for w in tokens])

Output: ['rt', '@latorrenormann', 'sovraesposizion', 'mediat', 'casini-fini-mont', 'adess', 'davver', 'imbarazz', '#elezioni2013']

Stemming with Lucene

    public static List<String> tokenizeString(Analyzer analyzer, String string, boolean stemming) {
        // Execute LowerCaseFilter and StopFilter
        ...
        if (stemming) {
            stream = new SnowballFilter(stream, new ItalianStemmer());
        }
        while (stream.incrementToken()) {
            result.add(stream.getAttribute(CharTermAttribute.class).toString());
        }
        ...
        return result;
    }

Output: [rt, @latorrenormanna:, sovraesposizion, mediat, casini-fini-mont, adess, davver, imbarazzante...#elezioni2013]

Lemmatization

Lemmatization techniques use a vocabulary to determine the correct root (lemma) of terms.

    import nltk

    en = nltk.stem.wordnet.WordNetLemmatizer()
    terms = ['ran', 'went', 'was']
    # 'v' tells the lemmatizer to treat each term as a verb
    lemmas = [en.lemmatize(w, 'v') for w in terms]
    print(', '.join(lemmas))

Output: run, go, be

With NLTK, WordNet is available for lemmatization, but that resource is limited to English. For Italian, a lemmatizer can be written using lists of inflected terms and their corresponding roots.

WordNet: Miller, George A. "WordNet: a lexical database for English." Communications of the ACM 38.11 (1995): 39-41.
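A list-based lemmatizer of the kind mentioned for Italian can be sketched as a simple lookup table. The table below is a hypothetical, hand-written fragment and `lemmatize` is an illustrative name; a usable lemmatizer would need a full morphological lexicon.

```python
# Hypothetical hand-made table: inflected form -> lemma.
LEMMA_TABLE = {
    "parole": "parola",
    "sia": "essere",
    "sono": "essere",
    "mediatica": "mediatico",
    "imbarazzante": "imbarazzare",
}

def lemmatize(token, table=LEMMA_TABLE):
    """Look the lowercased token up; unknown forms are returned unchanged."""
    token = token.lower()
    return table.get(token, token)

lemmas = [lemmatize(t) for t in ["Sia", "parole", "davvero"]]
print(lemmas)
```

Unlike a stemmer, the lookup maps irregular forms such as "sia" to the right lemma, but only for forms that are actually listed.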

Lemmatization example in Java (CorpusAnalyzer © ISLab)

    String text = "RT @LaTorreNormanna: La sovraesposizione mediatica di Casini-Fini-Monti adesso è davvero imbarazzante...#Elezioni2013";

    CorpusAnalyzer a = new CorpusAnalyzer(TextAnalyzer.ITA);
    a.useElisionFilter(true);
    a.useLowerFilter(true);
    a.enableLemmatization();
    a.useStopFilter(true);
    String id = a.addText(text);
    a.analyze();
    Vector<String> tokens = a.getAnalyzedTextByID(id);
    System.out.println(tokens);

Output: [rt, latorrenormanna, sovraesposizione, mediatico, casino, fine, monte, adesso, davvero, imbarazzare, elezioni2013]

CorpusAnalyzer is a Java project developed by the ISLab laboratory. It uses Lucene, together with some methods implemented from scratch, to provide an analyzer for text corpora.


3 Creating a text corpus

After the pre-processing and text preparation phase, we can create an overall corpus in which the individual texts are stored as documents, with a general index and procedures that support the statistical processing of their contents.

Creating and populating a corpus

• The first step of text analysis is to create a corpus of documents, usually from online or on-disk resources, associating every document with an identifier within the corpus

• It is important to decide what counts as a document when the corpus is created

• In our example, different strategies could be adopted for different purposes:

– Create one document per tweet

– Create one document per tweet, annotating the document with its source (e.g. the political party whose timeline produced the tweet)

– Create one document per timeline, treating the individual tweets as sentences within the document (the solution adopted here)
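The difference between these strategies can be sketched with plain Python data structures. The three tweets, the identifier scheme and both function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical (source timeline, tweet text) pairs.
tweets = [
    ("pdnetwork", "primo tweet"),
    ("pdnetwork", "secondo tweet"),
    ("ilpdl", "terzo tweet"),
]

def one_document_per_tweet(tweets):
    """One document per tweet, annotated with the timeline it comes from."""
    return {
        f"{source}/{n}": {"source": source, "text": text}
        for n, (source, text) in enumerate(tweets)
    }

def one_document_per_timeline(tweets):
    """One document per timeline; each tweet becomes a line (sentence)."""
    lines = defaultdict(list)
    for source, text in tweets:
        lines[source].append(text)
    return {source: "\n".join(texts) for source, texts in lines.items()}

per_tweet = one_document_per_tweet(tweets)
per_timeline = one_document_per_timeline(tweets)
print(len(per_tweet), len(per_timeline))
```

The choice matters later on: document-level statistics such as idf are computed over three documents in the first case and over two in the second.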

Creating a corpus from disk with NLTK

    import nltk

    d = "./corpus_dir/"
    tokenizer = nltk.tokenize.WhitespaceTokenizer()
    line_tokenizer = nltk.tokenize.LineTokenizer()
    # Every file in the directory is a document; every line is a sentence
    corpus = nltk.corpus.PlaintextCorpusReader(d, '.*', word_tokenizer=tokenizer,
                                               sent_tokenizer=line_tokenizer)
    for fileid in corpus.fileids():
        print(fileid, corpus.words(fileid)[:65], '...')

Output:

    Fare2013 ['#grillonomics', 'gaia', 'decrescita', 'porta', 'programma', '...']
    LegaNordPadania ['lombardia', 'andrea', 'gibelli', 'nuovo', 'segretario', '...']
    Mov5Stelle ['due', 'gruppi', 'comunicazione', 'camera', 'senato', '...']
    ilpdl ['sivlio', '23', 'marzo', 'ore', '15', '...']
    ingroia ['rt', '@papafrancesco ', 'conferenza', 'stampa', 'giornalisti', '...']
    pdnetwork ['#pdoodle', 'restituiamo', 'moralita', 'politica', 'stop', '...']
    scelta_civica ['rt', '@pierpaolovargiu', 'scheda', 'bianca', '@scelta_civica', '...']
    sinistraelib ['annunciato', 'giorni', 'scorsi', 'sinistra', 'ecologia', '...']

The texts, previously pre-processed with tokenization and filtering techniques, were saved to files in the directory ./corpus_dir/

Creating a corpus from disk with CorpusAnalyzer

In the Java example we create one corpus per party, treating the individual tweets as documents

    DataFactory f = new DataFactory();
    String inst = "pdnetwork";
    Institution i = f.createInstitutionByName(inst);
    CorpusAnalyzer a = new CorpusAnalyzer(TextAnalyzer.ITA);
    a.useElisionFilter(true);
    a.useLowerFilter(true);
    a.useStopFilter(true);
    // Each contribution (tweet) becomes a document identified by its URL
    for (Contribution c : i.getContributions()) {
        String id = c.getProduct().getUrl();
        String text = c.getProduct().getTitle();
        a.addText(id, text);
    }
    System.out.println(a.getAllTextIDs());

Output: [279875833127112705, 303795081561899008, ...]


4 Occurrences and distributions

The first and simplest analyses that can be run on the corpora just created are basic statistics on the number of occurrences of terms and on their distribution across the different texts.

Term frequency distributions

• The number of occurrences of a term in a text is a first, simple indicator of the topics it deals with

• A first example is to compare different documents with respect to the frequency of some terms of interest

• A second example is to find the most frequent terms in a given text
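Both kinds of analysis can be sketched with `collections.Counter` from the standard library. The two toy documents below are made-up data, not taken from the tweet corpus.

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = {
    "doc_a": "lavoro lavoro tasse europa lavoro".split(),
    "doc_b": "tasse europa europa".split(),
}

# One frequency distribution per document.
freq = {name: Counter(tokens) for name, tokens in docs.items()}

# First example: compare documents on a term of interest.
lavoro_counts = {name: f["lavoro"] for name, f in freq.items()}

# Second example: most frequent term of a given document.
top_a = freq["doc_a"].most_common(1)

print(lavoro_counts, top_a)
```

A `Counter` returns 0 for terms it has never seen, which makes the comparison across documents safe even when the term is missing from some of them.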

Comparing frequencies with NLTK

    # ... (loadcorpus is presumably a local helper module that builds the
    # corpus reader shown in the previous example)
    corpus = loadcorpus.load(d)

    def dist(search, corpus):
        print("Distribution of the term: " + search)
        for partito in corpus.fileids():
            f = nltk.probability.FreqDist(corpus.words(partito))
            print(partito + " => " + str(f[search]))

    dist('lavoro', corpus)

Output:

    Distribution of the term: lavoro
    Fare2013 => 19
    LegaNordPadania => 18
    Mov5Stelle => 6
    ilpdl => 16
    ingroia => 6
    pdnetwork => 51
    scelta_civica => 94
    sinistraelib => 61

Counting occurrences with CorpusAnalyzer (1)

    Analysis stats = new Analysis();
    String pdname = "pdnetwork";
    Institution pdnetwork = f.createInstitutionByName(pdname);
    String pdlname = "ilpdl";
    Institution ilpdl = f.createInstitutionByName(pdlname);
    CorpusAnalyzer pd = stats.getNewCorpusPerInstitution(a, pdnetwork, true);
    pd.analyze();
    CorpusAnalyzer pdl = stats.getNewCorpusPerInstitution(a, ilpdl, true);
    pdl.analyze();
    TermsDescriptor pdoccurrences = pd.runOccurrences();
    TermsDescriptor pdloccurrences = pdl.runOccurrences();
    System.out.println("Most frequent terms in pdnetwork");
    pdoccurrences.printTopKSorted(5);
    System.out.println("Most frequent terms in ilpdl");
    pdloccurrences.printTopKSorted(5);

Counting occurrences with CorpusAnalyzer (2)

    Term            pdnetwork   ilpdl
    italiagiusta    693         0
    pbersani        661         0
    http            625         414
    t.co            605         352
    rt              488         1270
    berlusconi      0           397
    ilpdl           0           296

Compound terms and co-occurrences

• Very often a concept that is relevant for a textual resource is expressed by a compound term, which is therefore missed by token-based frequency counts

• Potentially relevant compound terms are identified in two phases:

– Identify all possible n-grams, i.e. sequences of n consecutive terms in a text

– Compute the statistical relevance of an n-gram relative to the relevance of its components taken individually
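The first phase, enumerating n-grams and discarding the rare ones, can be sketched without any library. The helper name `ngrams` and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All sequences of n consecutive tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "banda larga per tutti banda larga subito".split()

bigram_counts = Counter(ngrams(tokens, 2))
# Keep only n-grams seen at least twice (a simple frequency filter).
frequent = {bg: c for bg, c in bigram_counts.items() if c >= 2}

print(frequent)
```

Raw counts only select candidates. Whether a candidate is a meaningful compound term is decided in the second phase, by comparing it with the frequencies of its components.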

Examples of n-grams with NLTK

    fpd = 'pdnetwork'
    tokens = corpus.words(fpd)
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()
    bigrams = nltk.collocations.BigramCollocationFinder.from_words(tokens)
    # Discard n-grams that occur fewer than 2 times
    bigrams.apply_freq_filter(2)
    trigrams = nltk.collocations.TrigramCollocationFinder.from_words(tokens)
    # Discard n-grams that occur fewer than 2 times
    trigrams.apply_freq_filter(2)
    bestbi = bigrams.nbest(bigram_measures.raw_freq, 10)
    besttri = trigrams.nbest(trigram_measures.raw_freq, 10)
    print(bestbi)
    print(besttri)

Result of the example

    [('http', 't'), ('t', 'co'), ('rt', '@pbersani'), ('#italiagiusta', 'http'),
     ('rt', '@youdem'), ('#italiagiusta', 'rt'), ('#italiagiusta', '@pbersani'),
     ('#pdlive', '#italiagiusta'), ('@youdem', '#italiagiusta'),
     ('@youdem', '#progressivealliance')]

    [('http', 't', 'co'), ('#italiagiusta', 'http', 't'),
     ('#italiagiusta', 'rt', '@pbersani'), ('rt', '@youdem', '#progressivealliance'),
     ('diretta', 'http', 't'), ('t', 'co', 'vymipwd8'),
     ('#italiagiusta', 'rt', '@youdem'), ('candidati', 'http', 't'),
     ('co', 'vymipwd8', '#camera'), ('diretta', '@youdem', '#italiagiusta')]

Mutual Information

• The number of occurrences of an n-gram is almost never a good indicator of its relevance

• A better indicator is based on the notion of Mutual Information

$$ mi(t_i, t_j) = \log \left( \frac{p(t_i, t_j)}{p(t_i) \cdot p(t_j)} \right) $$
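The formula can be worked through on a handful of tokens. In this sketch the probabilities are estimated as relative frequencies and the logarithm is natural. Both choices, like the function name `pmi` and the toy token list, are assumptions made for illustration.

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts):
    """log( p(x, y) / (p(x) * p(y)) ), probabilities as relative frequencies."""
    n_unigrams = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    p_xy = bigram_counts[bigram] / n_bigrams
    p_x = unigram_counts[bigram[0]] / n_unigrams
    p_y = unigram_counts[bigram[1]] / n_unigrams
    return math.log(p_xy / (p_x * p_y))

tokens = "rt livia turco rt http rt http livia turco".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Both bigrams occur twice, but their components differ in frequency.
pmi_name = pmi(("livia", "turco"), unigram_counts, bigram_counts)
pmi_noise = pmi(("rt", "http"), unigram_counts, bigram_counts)
print(round(pmi_name, 3), round(pmi_noise, 3))
```

The two bigrams have the same raw count, yet the one whose components rarely appear apart scores higher. This is the behaviour that raw frequency cannot capture.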

Mutual Information with NLTK

    bestbipmi = bigrams.nbest(bigram_measures.pmi, 10)
    besttripmi = trigrams.nbest(trigram_measures.pmi, 10)

Resulting bigrams:

    [('been', 'quoted'), ('cancella', 'radici'), ('cecile', 'kyenge'),
     ('conférence', 'des'), ('consumo', 'suolo'), ('corri', 'fottitene'),
     ('des', 'leaders'), ('emma', 'fattorini'), ('fernando', 'biague'),
     ('flavia', 'nardelli')]

Resulting trigrams:

    [('been', 'quoted', 'my'), ('conférence', 'des', 'leaders'),
     ('corri', 'fottitene', 'orgoglio'), ('ve', 'been', 'quoted'),
     ('my', '#storify', 'story'), ('quoted', 'my', '#storify'),
     ('you', 've', 'been'), ('integrazione', 'cancella', 'radici'),
     ('banda', 'larga', 'ict'), ('giuseppe', 'operaio', 'suicida')]

Bigrams with CorpusAnalyzer (1)

    DataFactory f = new DataFactory();
    CorpusAnalyzer a = new CorpusAnalyzer(TextAnalyzer.ITA);
    a.useElisionFilter(true);
    a.useLowerFilter(true);
    a.useStopFilter(true);
    a.enableNgramsAnalysis();
    a.setNgramsMinOccurrences(2.0);
    a.setNgramsDimension(2);
    a.setNgramsMethod(CorpusAnalyzer.MUTUAL_INFORMATION_NGRAM_FREQ);
    a.setNgramsThreshold(1.0);

    Analysis stats = new Analysis();
    String pdname = "pdnetwork";
    Institution pdnetwork = f.createInstitutionByName(pdname);
    CorpusAnalyzer pd = stats.getNewCorpusPerInstitution(a, pdnetwork, true);
    pd.analyze();
    TermsDescriptor ngrams = pd.getNGrams();
    ngrams.printTopKSorted(10);

Bigrams with CorpusAnalyzer (2)

    livia turco -> 10.249848546772641
    quote latte -> 10.249848546772641
    vibo valentia -> 10.249848546772641
    giuseppe operaio -> 10.249848546772641
    ilvo diamanti -> 10.249848546772641
    unioni civili -> 9.962166474320862
    giampaolo galli -> 9.962166474320862
    banda larga -> 9.962166474320862
    rosaria capacchione -> 9.739022923006653
    doppio turno -> 9.739022923006653

Computing term relevance

• The simplest criterion for discriminating among term occurrences is to associate a weight $w_{ij}$ with every term $t_i$ such that:

– The weight $w_{ij}$ depends on the document $d_j$ in which $t_i$ appears. This means that the same term can have different weights in different documents (contextual relevance)

– The weight $w_{ij}$ grows with the number of occurrences of $t_i$ in $d_j$. This function is called term frequency (tf).

$$ tf(t_i, d_j) = \frac{\text{number of occurrences of } t_i \text{ in } d_j}{|d_j|} $$

– The weight $w_{ij}$ decreases as the number of documents $d_k$ of the corpus $D$ in which $t_i$ appears grows. In other words, we try to penalize terms that have many occurrences but are in very common use. This function is called inverse document frequency (idf)

Inverse document frequency

Let $|D|$ be the number of documents in the corpus and $df(t_i) = |\{d_j \in D : t_i \in d_j\}|$ (document frequency) the number of documents of $D$ that contain the term $t_i$:

$$ idf(t_i) = \log \frac{|D|}{df(t_i)} $$

or, in the smoothed variant that avoids a division by zero for terms that occur in no document:

$$ idf(t_i) = \log \frac{|D|}{df(t_i) + 1} $$
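The two definitions give different numbers, as a small sketch shows. The four toy documents and the function names `idf_plain` and `idf_smoothed` are assumptions made for this example. Natural logarithms are used.

```python
import math

def document_frequency(term, docs):
    """Number of documents that contain the term."""
    return sum(1 for d in docs if term in d)

def idf_plain(term, docs):
    # Raises ZeroDivisionError if the term occurs in no document.
    return math.log(len(docs) / document_frequency(term, docs))

def idf_smoothed(term, docs):
    # The +1 keeps the ratio defined for unseen terms.
    return math.log(len(docs) / (document_frequency(term, docs) + 1))

# Hypothetical corpus: each document is a set of terms.
docs = [{"http", "lavoro"}, {"http", "tasse"}, {"http"}, {"http", "lavoro"}]

for term in ["http", "lavoro", "tasse"]:
    print(term, round(idf_plain(term, docs), 2), round(idf_smoothed(term, docs), 2))
```

A term that occurs in every document gets idf 0 with the first formula and a slightly negative value with the second. The choice therefore changes the scores, although the ranking of terms by rarity stays the same.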

Inverse Document Frequency with NLTK

Recall that in the NLTK example every timeline is a document

    def idf(words, corpus):
        # Map every distinct word to its idf over the corpus documents
        i = {}
        c = nltk.TextCollection(corpus)
        for w in set(words):
            i[w] = c.idf(w)
        return i

From the TextCollection source code:

    matches = len(list(True for text in self._texts if term in text))
    ...
    idf = log(float(len(self._texts)) / matches)

Example (1)

Consider the 10 most frequent terms of 4 timelines and check the corresponding idf values.

pdnetwork

    term                    occurrences   idf
    #italiagiusta           693           0.98
    @pbersani               661           0.47
    http                    625           0.0
    t                       609           0.0
    co                      605           0.0
    rt                      488           0.13
    e                       293           0.0
    #edicolapd              153           1.39
    @youdem                 151           1.39
    #progressivealliance    139           2.08

ilpdl

    term                    occurrences   idf
    rt                      1270          0.13
    http                    414           0.0
    t                       372           0.0
    co                      356           0.0
    @ilpdl                  299           0.69
    #berlusconi             298           0.13
    e                       213           0.0
    berlusconi              111           0.13
    @angealfa               96            0.47
    bersani                 73            0.0

Example (2)

Mov5Stelle

    term                    occurrences   idf
    co                      1826          0.0
    http                    1826          0.0
    t                       1826          0.0
    e                       350           0.0
    diretta                 326           0.0
    seguite                 146           0.47
    hmvgrtnp                140           2.08
    beppe                   130           0.47
    ora                     116           0.0
    cosa                    114           0.13

scelta_civica

    term                    occurrences   idf
    @senatoremonti          902           0.29
    http                    382           0.0
    t                       379           0.0
    co                      377           0.0
    rt                      348           0.13
    #sceltacivica           331           1.39
    e                       298           0.0
    @scelta_civica          167           0.69
    italia                  125           0.0
    #conmontiperlitalia     115           2.08

Inverse Document Frequency with CorpusAnalyzer

Recall that in the CorpusAnalyzer example every timeline is a corpus

    /**
     * dictionary is a map: term t -> set of
     * ids of the documents that contain t
     */
    private Map<String, Set<String>> dictionary = new TreeMap<String, Set<String>>();

    public TermsDescriptor runIDF() {
        this.idf.clear();
        ...
        for (String term : this.dictionary.keySet()) {
            // document frequency of the term
            Double val = (double) this.dictionary.get(term).size();
            // smoothed idf: log(|D| / (1 + df))
            val = Math.log((double) this.analyzedTexts.size() / (1 + val));
            this.idf.put(term, val);
        }
        return this.idf;
    }

Example (1)

pdnetwork

    term                    occurrences   idf
    italiagiusta            693.0         0.77
    pbersani                661.0         0.81
    http                    625.0         0.98
    t.co                    605.0         1.02
    rt                      488.0         1.12
    youdem                  154.0         2.28
    edicolapd               153.0         2.27
    pd                      145.0         2.34
    progressivealliance     139.0         2.37
    italia                  90.0          2.83

ilpdl

    term                    occurrences   idf
    rt                      1270.0        0.16
    http                    414.0         1.29
    berlusconi              397.0         1.32
    t.co                    352.0         1.45
    ilpdl                   296.0         1.61
    pd                      114.0         2.6
    bersani                 102.0         2.68
    grillo                  99.0          2.71
    pdl                     98.0          2.72
    angealfa                97.0          2.72

Example (2)

Mov5Stelle

    term                    occurrences   idf
    t.co                    1826.0        0.0
    http                    1823.0        0.0
    diretta                 326.0         1.55
    seguite                 146.0         2.32
    hmvgrtnp                140.0         2.36
    beppe                   130.0         2.43
    ora                     115.0         2.57
    cosa                    114.0         2.6
    m5s                     102.0         2.7
    grillo                  96.0          2.73

scelta_civica

    term                    occurrences   idf
    senatoremonti           537.0         1.02
    http                    381.0         1.5
    t.co                    376.0         1.51
    rt                      348.0         1.45
    sceltacivica            331.0         1.5
    scelta_civica           170.0         2.17
    italia                  162.0         2.23
    conmontiperlitalia      115.0         2.55
    politica                107.0         2.63
    lavoro                  97.0          2.77

Combining tf and idf

Combining tf and idf yields a term weight that depends both on the relevance of the term within a document and, inversely, on how widely the term is used across the whole corpus.

$$ \text{tf-idf}(t_i, d_j) = tf(t_i, d_j) \cdot idf(t_i) $$
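The product can be followed end to end on a toy corpus. This sketch uses the unsmoothed idf with natural logarithms. The three documents and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def tf(term, doc):
    """Occurrences of the term divided by the document length in tokens."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(|D| / df); 0.0 for terms that occur in no document."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Hypothetical tokenized documents.
docs = [
    "http rt lavoro lavoro".split(),
    "http rt tasse".split(),
    "http europa".split(),
]

weights = {t: tf_idf(t, docs[0], docs) for t in set(docs[0])}
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, round(w, 4))
```

The ubiquitous token "http" ends up with weight 0 even though it occurs in the document, while "lavoro", frequent here and absent elsewhere, gets the highest weight.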

TF-IDF with NLTK

    def tfidf(words, text, corpus):
        i = {}
        c = nltk.TextCollection(corpus)
        for w in set(words):
            # tf is computed on the raw string of the document: substring
            # count of w divided by the length of the string in characters
            i[w] = c.tf_idf(w, corpus.raw(text))
        return i

Example

The example lists the top 10 terms by relevance, computed with tf-idf both on the original terms and on the filtered terms (with hashtags and @-mentions removed).

pdnetwork, original terms

    term                    tf       idf    tf-idf
    #italiagiusta           0.0048   0.98   0.0047
    @pbersani               0.0046   0.47   0.0022
    #progressivealliance    0.001    2.08   0.002
    #edicolapd              0.0011   1.39   0.0015
    @youdem                 0.001    1.39   0.0015
    #pdlive                 0.0005   2.08   0.001
    @unitaonline            0.0003   2.08   0.0006
    #renaissance            0.0003   2.08   0.0006
    #pb2013                 0.0002   2.08   0.0005
    @grasso_p               0.0002   2.08   0.0005

pdnetwork, filtered terms

    term                    tf       idf    tf-idf
    perche                  0.0002   2.08   0.0004
    progressisti            0.0002   1.39   0.0003
    dire                    0.0011   0.13   0.0002
    appuntamento            0.0002   0.69   0.0001
    far                     0.0006   0.13   0.0001
    europa                  0.0005   0.13   0.0001
    via                     0.0005   0.13   0.0001
    berlusconi              0.0004   0.13   0.0001
    anni                    0.0003   0.13   0.0
    proposta                0.0003   0.13   0.0

Combining n-grams and tf-idf

• The simplest way to combine the search for significant co-occurrences with tf-idf is to replace every run of n adjacent tokens that forms a significant n-gram with a single token containing that n-gram, and then to compute tf-idf
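The replacement step could look like the following sketch, which scans the token list left to right and prefers longer n-grams. The function name `merge_ngrams` and the choice of a space as joining character are assumptions made for this example.

```python
def merge_ngrams(tokens, ngrams, sep=" "):
    """Replace each occurrence of a listed n-gram with one joined token."""
    # Longer n-grams first, so that a trigram wins over a bigram it contains.
    ngrams = sorted((tuple(ng) for ng in ngrams if ng), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for ng in ngrams:
            if tuple(tokens[i:i + len(ng)]) == ng:
                merged.append(sep.join(ng))
                i += len(ng)
                break
        else:
            # No n-gram starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["livia", "turco", "sindaco", "castel", "volturno"]
significant = [("livia", "turco"), ("castel", "volturno")]
merged = merge_ngrams(tokens, significant)
print(merged)
```

After this step the merged tokens are counted like any other term, so tf and idf need no change to handle compound terms.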

Example with CorpusAnalyzer

    public TermsDescriptor getMutualInformationNGramRelevance(int dimension, Double filter) {
        TermsDescriptor ngrel = new TermsDescriptor();
        TermsDescriptor ngrams = this.getNGramOccurrences(dimension, filter);
        // Get single terms occurrences
        TermsDescriptor occurrences = this.runOccurrences();
        Double ngoccurrences = ngrams.getValuesSum();
        Double termoccurrences = occurrences.getValuesSum();
        // Run on ngrams
        for (Object ng : ngrams.keySet()) {
            String k = (String) ng;
            String[] terms = k.split(" ");
            // Product of the relative frequencies of the single terms
            Double lp = 1.0;
            for (String t : terms) {
                lp = lp * (occurrences.get(t) / termoccurrences);
            }
            // Relative frequency of the n-gram
            Double np = ngrams.get(ng) / ngoccurrences;
            ngrel.add(ng, Math.log(np / lp));
        }
        return ngrel;
    }

Example

    Analysis stats = new Analysis();
    Institution tline = f.createInstitutionByName(timeline);
    CorpusAnalyzer analyzer = stats.getNewCorpusPerInstitution(ca, tline, true);
    analyzer.analyze();
    TermsDescriptor idf = analyzer.runIDF();
    TermsDescriptor tfidf = analyzer.getAnalyzedTFIDF("292224740150624256");
    System.out.println(analyzer.getTextByID("292224740150624256"));
    tfidf.printSorted();

Result of the example

Tweet 292224740150624256: Livia Turco: nei prossimi giorni incontreremo sindaco di Lampedusa, andremo nelle zone terremotate, torneremo a Castel Volturno e a Torino.

    incontreremo -> 6.620073206530356
    castel -> 6.620073206530356
    lampedusa -> 6.620073206530356
    torneremo -> 6.620073206530356
    terremotate -> 6.620073206530356
    andremo -> 6.620073206530356
    zone -> 6.620073206530356
    volturno -> 6.620073206530356
    livia turco -> 5.926926025970411
    turco -> 5.926926025970411
    prossimi -> 5.926926025970411
    sindaco -> 5.521460917862246
    giorni -> 5.115995809754082
    torino -> 5.0106352940962555


5 Text similarity

Vector space model

• A common way to use weighted terms to compute the similarity between documents is to represent every document as a vector in a vector space

• The idea is to derive a vector $\vec{V}_i$ from each document $d_i$ so that every component of the vector corresponds to the weight of the respective term (the information about the order of the terms in the document is lost)

• Example: Livia Turco: nei prossimi giorni incontreremo sindaco di Lampedusa, andremo nelle zone terremotate, torneremo a Castel Volturno e a Torino

〈6.62, 6.62, 6.62, 6.62, 6.62, 6.62, 6.62, 6.62, 5.92, 5.92, 5.92, 5.52, 5.11, 5.01〉
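For two documents to be comparable, their vectors must share the same components in the same order. A minimal sketch follows, in which the second document and its weights are hypothetical and `to_vector` is an illustrative name.

```python
def to_vector(weights, vocabulary):
    """Weights in vocabulary order; terms missing from the document get 0.0."""
    return [weights.get(term, 0.0) for term in vocabulary]

# Term weights of two documents (the second one is made up for illustration).
d1 = {"lampedusa": 6.62, "sindaco": 5.52}
d2 = {"sindaco": 5.52, "torino": 5.01}

# Shared, sorted vocabulary: one vector component per term.
vocabulary = sorted(set(d1) | set(d2))
v1 = to_vector(d1, vocabulary)
v2 = to_vector(d2, vocabulary)

print(vocabulary)
print(v1)
print(v2)
```

Sorting the vocabulary fixes the meaning of each position, so the k-th component of every vector refers to the same term.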

Cosine similarity

• The vector representation is useful for computing the similarity between two documents as the cosine similarity of their corresponding vectors (a term that is present in document $d_i$ but absent from $d_j$ has weight 0 in $d_j$)

$$ sim(d_i, d_j) = \frac{\vec{V}_i \cdot \vec{V}_j}{|\vec{V}_i| \, |\vec{V}_j|} $$

with

$$ \vec{V}_i \cdot \vec{V}_j = \sum_{k=0}^{M} v_{ik} v_{jk} : v_{ik} \in \vec{V}_i, v_{jk} \in \vec{V}_j \quad \text{(dot product)} $$

$$ |\vec{V}_i| = \sqrt{\sum_{k=0}^{M} v_{ik}^2} \quad \text{(Euclidean length)} $$
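As a small worked instance of these definitions, take two hypothetical two-component vectors (chosen only to keep the arithmetic easy):

```latex
\vec{V}_i = (3, 4), \qquad \vec{V}_j = (4, 3)

\vec{V}_i \cdot \vec{V}_j = 3 \cdot 4 + 4 \cdot 3 = 24

|\vec{V}_i| = \sqrt{3^2 + 4^2} = 5, \qquad |\vec{V}_j| = \sqrt{4^2 + 3^2} = 5

sim(d_i, d_j) = \frac{24}{5 \cdot 5} = 0.96
```

Because term weights are non-negative, the similarity lies between 0 (no shared terms) and 1 (vectors pointing in the same direction).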

Cosine similarity example with NLTK (1)

    import nltk
    from numpy import zeros

    def create_vector(text, cwords, corpus_idf):
        # `corpus` is the corpus reader created earlier (module-level variable)
        words = corpus.words(text)
        numw = len(words)
        o = nltk.FreqDist(words)
        # Computed in the original code but not used below
        filtero = [x for x in o if o[x] > 50 and len(x) > 2]
        # One component per corpus term, in sorted order; absent terms stay 0
        v = zeros(len(cwords))
        for cnt, w in enumerate(sorted(cwords)):
            if w in o:
                tf = float(o[w]) / numw
                v[cnt] = tf * corpus_idf[w]
        return v

Cosine similarity example with NLTK (2)

    import math

    def cos(v1, v2):
        e1 = 0  # squared Euclidean length of v1
        e2 = 0  # squared Euclidean length of v2
        d = 0   # dot product
        for i in range(0, len(v1)):
            e1 = e1 + v1[i] ** 2
            e2 = e2 + v2[i] ** 2
            d = d + v1[i] * v2[i]
        return d / (math.sqrt(e1) * math.sqrt(e2))
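As a usage sketch for a cosine function like the one above, the pairwise similarities of a small set of vectors can be collected into a matrix, together with the distances 1 − similarity that a hierarchical clustering would take as input. The three vectors and their names are hypothetical. The `cosine` helper is rewritten here with a guard for all-zero vectors so that the block is self-contained.

```python
import math
from itertools import combinations

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    # An all-zero vector has no direction: treat its similarity as 0.
    return dot / norms if norms else 0.0

# Hypothetical tf-idf vectors over a shared three-term vocabulary.
vectors = {
    "timeline_a": [0.2, 0.0, 0.1],
    "timeline_b": [0.1, 0.3, 0.0],
    "timeline_c": [0.0, 0.3, 0.1],
}

# Upper triangle of the similarity matrix: one entry per unordered pair.
similarity = {
    (a, b): cosine(vectors[a], vectors[b])
    for a, b in combinations(sorted(vectors), 2)
}
distance = {pair: 1.0 - sim for pair, sim in similarity.items()}

for pair in sorted(similarity):
    print(pair, round(similarity[pair], 4), round(distance[pair], 4))
```

Only the upper triangle is stored because cosine similarity is symmetric and every vector has similarity 1 with itself.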

Example: similarity matrix

[Figure: upper-triangular matrix plot of the pairwise cosine similarities between the eight party timelines (Fare2013, ilpdl, ingroia, LegaNordPadania, Mov5Stelle, pdnetwork, scelta_civica, sinistraelib); axes partito1 and partito2; scale sim from 0.01 to 0.07. The 28 cell values range from 0.0016 to 0.0701.]

Example: hierarchical clustering

[Figure: dendrogram "Partiti (hierarchical clustering)", produced with hclust (*, "complete"); vertical axis: Distance (1 − similarity), from 0.92 to 1.00; leaves in plot order: Mov5Stelle, ilpdl, ingroia, LegaNordPadania, pdnetwork, sinistraelib, Fare2013, scelta_civica.]

Similarity with CorpusAnalyzer

In the CorpusAnalyzer example we compute the similarity between tweets

    static TermsDescriptor getsim(String textid, DataFactory f, CorpusAnalyzer ca) {
        TermsDescriptor similar = new TermsDescriptor();
        TermsDescriptor tfidf = ca.getAnalyzedTFIDF(textid);
        CosineSimilarity cos = new CosineSimilarity();
        // Compare the reference tweet with every text in the corpus
        for (String id : ca.getAllTextIDs()) {
            TermsDescriptor x = ca.getAnalyzedTFIDF(id);
            similar.add(id, cos.sentenceMatch(tfidf, x));
        }
        return similar;
    }

Example

• Reference tweet: Livia Turco: nei prossimi giorni incontreremo sindaco di Lampedusa, andremo nelle zone terremotate, torneremo a Castel Volturno e a Torino.

• 1.0 → Livia Turco: nei prossimi giorni incontreremo sindaco di Lampedusa, andremo nelle zone terremotate, torneremo a Castel Volturno e a Torino.

• 0.14 → Livia Turco: non dimentichiamo questi anni orribili e di aver combattuto con la schiena dritta, guardando all'Italia reale. #nuoviitaliani

• 0.13 → Livia Turco: legislatura di destra ha proposto i medici spia, proseguendo con respingimenti in mare e reato immigrazione clandestina

• 0.09 → @matteorenzi a #leinvasionibarbariche Voglio che @pbersani vinca e che il Pd governi per i prossimi 5 anni #Italiagiusta

• 0.08 → @matteorenzi 'Nelle ultime 48 ore ci giochiamo il futuro per i prossimi 5 anni' #ottoemezzo #italiagiusta ...
