Entropy and Semantic: a mathematical approach to Authorship Attribution, plagiarism detection and key words extraction

Workshop on “Web Information and Quality Evaluation”
Universidad Politécnica de Valencia

M. Degli Esposti [email protected]
Department of Mathematics
University of Bologna

13-15 September 2010

M.D.E. (University of Bologna) Entropy and Semantic 13-15 September 2010 1 / 60


Main objective of the talk

1 Present a (narrow) point of view from mathematical physics on automatic text categorization and information retrieval in general.

2 Bring to your attention some recent results that appeared in the mathematics and physics community.

3 Discuss a “simple” question: how far can we go with “entropy” (or related quantities) alone, without linguistics and computational linguistics?


Entropy and Semantic

Simple, but important, observations...

Although the information to be encoded by language is usually highly complex, it can be readily projected onto a string of words.

In recent years the use of tools drawn from statistical physics and dynamical systems has quantitatively revealed rich linguistic structures at many scales, ranging from the domain of syntax to the organization of whole lexicons and literary corpora.

However, a fundamental question that has not been directly addressed so far is how statistical structures relate to the function of encoding complex information.


Entropy and Semantic

two recent papers....

In the following two papers, quantitative measures have been introduced to capture the relationship between the statistical structure of word sequences and their semantic content.

E. Alvarez-Lacalle, B. Dorow, J.-P. Eckmann and E. Moses: "Hierarchical structures induce long-range dynamical correlations in written texts", Proceedings of the National Academy of Sciences, 103 (21), pp. 7956-7961 (2006)

M. A. Montemurro and D. Zanette: "Towards the quantification of the semantic information encoded in written language", arxiv.org/abs/0907.1558v2 (2009)


Entropy and Semantic
Long-range dynamical correlations in written texts: una parola tira l’altra (one word leads to another)

a (very) big vector space and a few notations

$W_{all}$ is a vector space in which each word of the English language represents a basis vector.

Given a text $x \in A^*$, we restrict the analysis to the subspace $W_{text}$ spanned by the words appearing at least once in $x$.

$D = D(x)$ is the dictionary of $x$, i.e. the set of distinct words, ordered for example by rank.

Each word $\omega_j$ is associated with a canonical vector $e_j$.

Arbitrary directions in this vector space are therefore combinations of words.

Among these combinations, one is interested in those that represent certain topics or concepts discussed in the text.
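As a minimal sketch of this notation (not code from the talk), the dictionary D(x) and the canonical vectors can be built as follows. The helper names `build_dictionary` and `canonical_vector` are hypothetical, and reading "rank" as decreasing frequency with alphabetical tie-breaking is an assumption.

```python
from collections import Counter

def build_dictionary(words):
    """Return the distinct words of a tokenized text, ordered by frequency rank."""
    counts = Counter(words)
    # Assumption: rank = decreasing count; ties are broken alphabetically
    # so that the ordering is deterministic.
    return sorted(counts, key=lambda w: (-counts[w], w))

def canonical_vector(word, dictionary):
    """One-hot vector e_j for the word found at position j of the dictionary."""
    return [1.0 if w == word else 0.0 for w in dictionary]

text = "the whale and the sea and the ship".split()
D = build_dictionary(text)
e_whale = canonical_vector("whale", D)
```

With real books the dictionary has thousands of entries, so a sparse representation would be the more practical choice; dense lists are used here only to keep the correspondence with the vectors on the slide visible.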


Entropy and Semantic
Long-range dynamical correlations in written texts: una parola tira l’altra

the window of attention....

These word groups are looked for within a window of attention of size a, measured in words, e.g. a = 200 words.

This window represents the words that have just been read, and these comprise, at each point of the text, a momentary vector of attention.....

but first, the corpus...(and the stemming)


Corpus, stemming and stop words

the Corpus

In Eckmann’s paper, the authors used 12 books in their English version.

Nine of them were novels:
War and Peace (WP) by Tolstoi,
Don Quixote (QJ) by Cervantes,
The Iliad (IL) by Homer,
Moby-Dick or The Whale (MD) by Melville,
David Crockett (DC) by Abbott,
The Adventures of Tom Sawyer (TS) by Twain,
Naked Lunch (NK) by Burroughs,
Hamlet (HM) by Shakespeare,
The Metamorphosis (MT) by Kafka.

In addition:
Relativity: The Special and the General Theory (EI) by Einstein,
Critique of Pure Reason (KT) by Kant,
The Republic (RP) by Plato.


Corpus, stemming and stop words

the Corpus

Figure: Corpus parameters and results: $m_{thr}$ is the threshold for the number of occurrences and $d_{thr}$ is the number of words kept after thresholding. $P$ is the percentage of the words in the book that pass the threshold, $P = \sum_{j=1}^{d_{thr}} m_j / L$. $d_{conv}$ is the dimension at which a power law is being fit. The absolute values of the negative exponents of the fit are given in the last column, together with their error in parentheses.
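The quantities in this caption can be illustrated with a small sketch. The function name `threshold_stats` is hypothetical, and treating "passes the threshold" as "occurs at least m_thr times" is an assumption; the caption does not say whether the comparison is strict.

```python
from collections import Counter

def threshold_stats(words, m_thr):
    """Return (d_thr, P) for a tokenized book and an occurrence threshold m_thr."""
    counts = Counter(words)
    # Assumption: a word passes when it occurs at least m_thr times.
    kept = [m for m in counts.values() if m >= m_thr]
    d_thr = len(kept)            # number of distinct words kept
    P = sum(kept) / len(words)   # sum of m_j over kept words, divided by L
    return d_thr, P

d_thr, P = threshold_stats("a a a b b c".split(), m_thr=2)
```

Here P is returned as a fraction; multiplying by 100 gives the percentage reported in the caption.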


Corpus, stemming and stop words

Cleaning and Stemming

Each book was processed by eliminating punctuation and extracting the words.

Each word was stemmed by querying WordNet 2.0.

The leading word returned by this query was retained, keeping the information on whether it was originally a noun, a verb, or an adjective.

A list of stop words that carry no significant meaning (determiners, pronouns, and the like) was defined, and each of them was assigned a value of zero.

Moreover, words that occur significantly in at least 11 of the 12 texts in the corpus were rejected.

Books were thus transformed into lists of stemmed words, which were used to construct the mathematical objects we will now discuss. .......
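The cleaning steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop list is invented, the WordNet stemming step is left out entirely, and "occurs significantly" is simplified to "occurs at all". The names `tokenize`, `widespread_words` and `clean` are hypothetical.

```python
import re

# Illustrative stop list only; the list used in the paper is not given in the talk.
STOP_WORDS = {"the", "a", "an", "and", "of", "he", "she", "it"}

def tokenize(text):
    """Lower-case the text and keep alphabetic runs, which drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def widespread_words(corpus, min_texts):
    """Words that appear in at least `min_texts` of the tokenized texts."""
    document_frequency = {}
    for tokens in corpus:
        for word in set(tokens):
            document_frequency[word] = document_frequency.get(word, 0) + 1
    return {w for w, n in document_frequency.items() if n >= min_texts}

def clean(tokens, rejected):
    """Drop stop words and corpus-wide words; no stemming is attempted here."""
    return [w for w in tokens if w not in STOP_WORDS and w not in rejected]

corpus = [tokenize(t) for t in ("The whale said no.", "A ship said yes!", "The sea said it.")]
rejected = widespread_words(corpus, min_texts=3)
cleaned = [clean(tokens, rejected) for tokens in corpus]
```

For the 12-book corpus described on the slide, the corresponding call would use `min_texts=11`.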


The Connectivity Matrix

the vector of attention

Fix a window size a (e.g. a = 200 words).

We define its (normalized) vector of attention V as:

$$V = \Big(\sum_j m_j^2(a)\Big)^{-1/2} \sum_j m_j(a)\, e_j ,$$

where the sums run over the whole dictionary $D(x)$ and $m_j(a)$ is the number of occurrences of word $\omega_j$ inside the window.

Now we would like to project the vector V onto a smaller subspace related to the different concepts or themes that appear in the text.
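A minimal sketch of the normalized vector of attention, assuming that the coefficient of each dictionary word is its number of occurrences inside the window. The function name `attention_vector` and the toy dictionary are hypothetical.

```python
import math
from collections import Counter

def attention_vector(window_words, dictionary):
    """Unit-length vector of word counts in the window, indexed like the dictionary."""
    counts = Counter(window_words)
    m = [float(counts[w]) for w in dictionary]   # Counter gives 0 for absent words
    norm = math.sqrt(sum(v * v for v in m))
    if norm == 0.0:
        return m                                 # window has no dictionary words
    return [v / norm for v in m]

dictionary = ["whale", "sea", "ship"]
V = attention_vector(["whale", "sea", "whale", "storm"], dictionary)
```

Words outside the dictionary (here "storm") are ignored, which mirrors the restriction to the words kept after cleaning and thresholding.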


The Connectivity Matrix

Symmetric Connectivity Matrix

The starting point is the construction of a symmetric connectivity matrix M.

Definition (The symmetric connectivity matrix M)

Given a text $x$, the matrix M has rows and columns indexed by words, and the entry $M_{ij}$ counts how often word $\omega_i$ occurs within a distance a/2 on either side of word $\omega_j$.
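The definition can be illustrated with a direct, unoptimized sketch. The function name `connectivity_matrix` is hypothetical, and using the integer half-width `a // 2` for "a distance a/2" is an assumption about rounding.

```python
def connectivity_matrix(words, dictionary, a):
    """M[i][j] = number of times word i occurs within a//2 positions of word j."""
    index = {w: k for k, w in enumerate(dictionary)}
    n = len(dictionary)
    M = [[0] * n for _ in range(n)]
    half = a // 2
    for p, center in enumerate(words):
        j = index.get(center)
        if j is None:
            continue                      # word not in the dictionary
        lo, hi = max(0, p - half), min(len(words), p + half + 1)
        for q in range(lo, hi):
            if q == p:
                continue                  # do not pair a position with itself
            i = index.get(words[q])
            if i is not None:
                M[i][j] += 1
    return M

M = connectivity_matrix("a b a c".split(), ["a", "b", "c"], a=2)
```

Every pair of nearby positions is visited once from each side, which is why the resulting matrix comes out symmetric. The double loop costs on the order of L × a operations for a text of L words.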


The Connectivity Matrix

the Normalized Symmetric Connectivity Matrix

The connectivity matrix R of an equivalent random/shuffled book:

$$R_{ij} = \frac{a}{L}\, m_i m_j ,$$

where $m_i$ is the number of occurrences of word $\omega_i$ in the text and $L$ is the length of the text in words.

Definition

Given a text $x$ and a context length a, the normalized connectivity matrix N is defined as:

$$N_{ij} = R_{ij}^{-1/2} \left( M_{ij} - R_{ij} \right) .$$

This normalization quantifies the extent to which the analyzed text deviates from a random book (with the same word distribution), measured in units of its standard deviation.
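A small sketch of this normalization, assuming M is given as a list of lists and `counts[i]` holds the number of occurrences of word i in the whole text. The function name `normalized_connectivity` is hypothetical, and returning 0 where the random expectation vanishes is a convention chosen here, not something stated in the talk.

```python
import math

def normalized_connectivity(M, counts, a, L):
    """N[i][j] = (M[i][j] - R[i][j]) / sqrt(R[i][j]) with R[i][j] = a * m_i * m_j / L."""
    n = len(M)
    N = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = a * counts[i] * counts[j] / L   # expected count in a shuffled book
            # Convention assumed here: leave the entry at 0 when r is 0.
            if r > 0:
                N[i][j] = (M[i][j] - r) / math.sqrt(r)
    return N

M = [[0, 2, 1], [2, 0, 0], [1, 0, 0]]   # toy matrix for the text "a b a c", a = 2
N = normalized_connectivity(M, counts=[2, 1, 1], a=2, L=4)
```

Positive entries mark word pairs that co-occur more often than in a shuffled book, negative entries mark pairs that co-occur less often.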


The Connectivity Matrix

Projecting down: SVD

We now project onto a smaller subspace by keeping only the d basis vectors with the highest singular values.

The idea behind this choice of principal directions is that the most important vectors in this decomposition describe concepts.

Given d vectors from the SVD basis, every word can be projected onto a unique superposition of those basis vectors, i.e.:

$$e_k \to \sum_{j=1}^{d} S_{kj}\, v_j ,$$

where $e_k$ is the canonical vector representing word $\omega_k$.
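The projection can be sketched with NumPy, assuming the SVD is taken of the normalized connectivity matrix N (the slide does not name the matrix explicitly). The function names `concept_basis` and `word_coefficients` and the toy matrix are hypothetical.

```python
import numpy as np

def concept_basis(N, d):
    """Return the d leading left singular vectors (as columns) and their singular values."""
    U, s, _ = np.linalg.svd(np.asarray(N, dtype=float))
    # numpy.linalg.svd returns the singular values in descending order.
    return U[:, :d], s[:d]

def word_coefficients(k, basis):
    """Coefficients S_kj = e_k . v_j of word k on the retained vectors v_j."""
    return basis[k, :]

N = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
basis, singular_values = concept_basis(N, d=2)
S_0 = word_coefficients(0, basis)
```

Because the basis vectors are orthonormal, the coefficients of a word are simply the entries of its row in the basis matrix. The sign of each singular vector is arbitrary, so only the direction carries meaning.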


Let us see some concept vectors....

few experiments......


A dynamic Analysis: un tempo per leggere, un tempo per pensare (a time to read, a time to think)

a dynamic analysis

The idea is now to slide the window of attention of fixed size a = 200 along the text and observe how the corresponding vector V moves in the vector space spanned by the SVD basis.

If this vector space were irrelevant to the text, then the trajectory defined in this space would perform a random walk.

If, on the contrary, the evolution of the text is reflected in this vector space, then the trajectory should trace out the concepts in a systematic way, and some evidence of this should be observed (and hopefully measured).


A dynamic Analysis: un tempo per leggere, un tempo per pensare

Trajectories and time

Trajectories in this vector space can be connected to the process of reading the text by replacing the notion of distance along the text with the time it takes to read it:

$$t = \ell \times \delta t ,$$

with $\ell$ the distance into the text and $\delta t$ the average time it takes a hypothetical reader to read a word.


A dynamic Analysis: un tempo per leggere, un tempo perpensare

a dynamic analysis

At each time t we define in this way a vector of attention, V(t) corresponding tothe window [t/δt − a/2, t/δt + a/2].

We project the vector V(t) onto the first d vectors vj :

V(t)←

d∑

j=1

Sj (t)vj ,

The moving unit vector $V(t) \in \mathbb{R}^d$ is a dynamical system, and it is natural to study its autocorrelation function in time:

$$C(\tau) = \langle V(t) \cdot V(t+\tau) \rangle_t,$$

where $\langle \cdot \rangle_t$ is the time average.
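The time average above can be sketched in a few lines of Python. This is only an illustration: it assumes the projected coordinates $S_j(t)$ are already available as a list of rows `S` (one row per time step), and the function name `autocorrelation` is hypothetical, not taken from the talk.

```python
import math

def autocorrelation(S, max_lag):
    """Return [C(0), ..., C(max_lag)] with C(tau) = <V(t) . V(t+tau)>_t.

    S is a list of T rows of length d (assumed input); each row is first
    normalized to a unit vector V(t). Requires max_lag < T.
    """
    V = []
    for row in S:
        norm = math.sqrt(sum(s * s for s in row))
        V.append([s / norm for s in row] if norm > 0 else list(row))
    T = len(V)
    C = []
    for tau in range(max_lag + 1):
        # dot products between V(t) and V(t + tau), averaged over t
        dots = [sum(a * b for a, b in zip(V[t], V[t + tau]))
                for t in range(T - tau)]
        C.append(sum(dots) / len(dots))
    return C
```

For a long-range correlated text one would then inspect `C` on a log-log scale, as in the figures that follow.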


autocorrelation function in Tom Sawyer

Figure: Log-log plot of the autocorrelation function for the Adventures of Tom Sawyer using different numbers of singular components for building the dynamics. For comparison, the autocorrelation of a randomized version of the book is also shown.


autocorrelation function in the other books...

Figure: Autocorrelation functions and fits for seven of the books listed.


The authors claim that this range is much longer than the one found when measuring correlations among sentences without using the concept vectors.


Spectrum of words


Spectrum of words: Pinocchio


Few notations

A given text x of N words is divided into P parts, part j having word-length $N_j$, j = 1, 2, ..., P.


Assume ω is a word that appears $n_j$ times in part j, with j = 1, ..., P: $\mu(\omega|j) := n_j/N_j$ can be considered as the conditional probability of finding word ω in part j.


We also denote by $\mu(j) = N_j/N$ the a priori probability that the word ω appears in part j. Then

$$\sum_{j=1}^{P} \mu(\omega|j)\,\mu(j) = \mu(\omega),$$

where $\mu(\omega) = n/N$, with $n = \sum_j n_j$, is the overall probability of occurrence of the word ω in the whole text.


Bayes’s rule

We look for the inverted probability µ(j|ω), which tells us how likely it is that we are looking into part j given that we saw an instance of word ω in the text:

$$\mu(j|\omega) = \frac{\mu(\omega|j)\,\mu(j)}{\sum_{k=1}^{P} \mu(\omega|k)\,\mu(k)} = \frac{n_j}{n}.$$

Now we can write the Shannon mutual information:

$$I(x, D) = \sum_{\omega \in D} \mu(\omega) \sum_{j=1}^{P} \mu(j|\omega) \log\left(\frac{\mu(j|\omega)}{\mu(j)}\right).$$
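As a rough illustration of the formula, the following Python sketch computes I(x, D) from a list of words. It assumes that the text is cut into P consecutive parts of nearly equal length and that the dictionary D is simply the set of all distinct words; both choices, and the function names, are assumptions made for this example.

```python
import math
from collections import Counter

def split_parts(words, P):
    """Cut a word list into P consecutive parts of (nearly) equal length."""
    N = len(words)
    return [words[j * N // P:(j + 1) * N // P] for j in range(P)]

def mutual_information(words, P):
    """I(x, D) with D = all distinct words of the text (assumption)."""
    parts = split_parts(words, P)
    N = len(words)
    counts = [Counter(part) for part in parts]   # n_j for every word
    total = Counter(words)                       # n for every word
    info = 0.0
    for w, n in total.items():
        mu_w = n / N                             # mu(w) = n / N
        for j, part in enumerate(parts):
            n_j = counts[j][w]
            if n_j == 0:
                continue                         # convention: 0 log 0 = 0
            posterior = n_j / n                  # mu(j|w) = n_j / n
            prior = len(part) / N                # mu(j) = N_j / N
            info += mu_w * posterior * math.log(posterior / prior)
    return info
```

A text whose words are spread evenly over the parts gives a value close to zero, while a text where each part has its own vocabulary gives a value close to log P.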


Entropy of a word (in a given text x)

$$h(x|\omega) := -\sum_{j=1}^{P} \mu(j|\omega) \log \mu(j|\omega), \qquad \mu(j|\omega) = n_j/n$$


Moreover, we also average over shufflings of the text:

$$\langle h(x|\omega) \rangle := -\sum_{j=1}^{P} \langle \mu(j|\omega) \log \mu(j|\omega) \rangle.$$

Definition

Relevant words are ranked w.r.t.

$$h(x|\omega) - \langle h(x|\omega) \rangle.$$
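The ranking quantity can be approximated numerically by shuffling the text a number of times. The sketch below is a Monte Carlo illustration only: the helper names, the number of shuffles and the fixed seed are all assumptions of this example, and the parts are taken to be P consecutive blocks of nearly equal length.

```python
import math
import random

def word_entropy(words, w, P):
    """h(x|w) = -sum_j (n_j/n) log(n_j/n) over P consecutive parts."""
    N = len(words)
    n_j = [words[j * N // P:(j + 1) * N // P].count(w) for j in range(P)]
    n = sum(n_j)
    return -sum((c / n) * math.log(c / n) for c in n_j if c > 0)

def relevance(words, w, P, shuffles=200, seed=0):
    """h(x|w) - <h(x|w)>, the average being estimated over random shuffles."""
    rng = random.Random(seed)
    shuffled = list(words)
    average = 0.0
    for _ in range(shuffles):
        rng.shuffle(shuffled)
        average += word_entropy(shuffled, w, P)
    return word_entropy(words, w, P) - average / shuffles
```

With this sign convention, a word concentrated in a few parts has a lower entropy than in the shuffled text and therefore gets a negative score.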


Shuffling and Averaging...

We can use elementary methods to compute an analytic expression of the entropy $\langle h(x|\omega) \rangle$.


For a word ω that appears $m_j$ times in part j, with n occurrences over the whole text x, this entropy takes the following form:

$$h(x|\omega) := -\sum_{j=1}^{P} \frac{m_j}{n} \log \frac{m_j}{n}.$$


We now compute the average over all possible realizations of the random text:

$$\langle h(x|\omega) \rangle = -\sum_{\substack{m_1+\cdots+m_P = n \\ m_j \le N/P}} \mu(m_1, \dots, m_P) \sum_{j=1}^{P} \frac{m_j}{n} \log \frac{m_j}{n},$$

where $\mu(m_1, \dots, m_P)$ is the probability of finding $m_j$ words ω in part j, with j = 1, ..., P.


Shuffling and Averaging: we can use symmetry

$$\langle h(x|\omega) \rangle = -P \sum_{m=1}^{\min(n, N/P)} \mu(m)\, \frac{m}{n} \log \frac{m}{n},$$

where the marginal probability µ(m) is the probability of finding m instances of word ω in one part, together with (N/P − m) words different from ω, and reads:


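$$\mu(m) = \frac{\binom{n}{m}\binom{N-n}{N/P-m}}{\binom{N}{N/P}}.$$

Using a Gaussian approximation (with the entropy normalized by log P, so that its maximal value is 1), we get:

$$\langle h(x|\omega) \rangle \approx 1 - \frac{P-1}{2n \log P}$$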
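The exact symmetric sum and the approximation can be compared numerically. The sketch below assumes that N is a multiple of P (equal parts of N/P words) and that the entropy is divided by log P so that the two expressions are on the same scale; the function names are hypothetical.

```python
import math

def marginal(m, n, N, P):
    """Hypergeometric mu(m): m of the n occurrences land in one part of N/P words."""
    L = N // P
    return math.comb(n, m) * math.comb(N - n, L - m) / math.comb(N, L)

def shuffled_entropy_exact(n, N, P):
    """-P sum_m mu(m) (m/n) log(m/n), divided by log P (normalization assumed)."""
    L = N // P
    h = -P * sum(marginal(m, n, N, P) * (m / n) * math.log(m / n)
                 for m in range(1, min(n, L) + 1))
    return h / math.log(P)

def shuffled_entropy_approx(n, P):
    """Approximation 1 - (P - 1) / (2 n log P)."""
    return 1 - (P - 1) / (2 * n * math.log(P))
```

For a word that occurs often enough (n much larger than P) the two values should be close; for rare words the approximation is expected to degrade.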


Pinocchio’s words Entropy distribution


Kant’s words Entropy distribution


Dante’s words Entropy distribution


Spectrum of words: Pinocchio


Anna Karenina


Promessi Sposi


Authorship Attribution (A.A.)

A.A. with K-L

Authorship Attribution algorithms based on Relative Entropy (K-L Divergence).


...we start with wrong assumptions (i.e. the author is a stochastic source) and we end up with interesting results...


a mathematical problem: given two unknown stochastic (stationary and ergodic) sources µ and ν, compute/approximate the relative entropy

$$d(\mu\|\nu)$$

just by using two finite realizations $x_1, \dots, x_n$ and $y_1, \dots, y_m$ of µ and ν respectively...


µ ergodic, stationary stochastic source

Just to recall the main definitions...


n-block entropy

$$H_n(\mu) := -\sum_{|\omega|=n} \mu(\omega) \log \mu(\omega).$$


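entropy rate and n-conditional entropy

$$h_n(\mu) := H_{n+1}(\mu) - H_n(\mu) = -\sum_{\omega_1^n \in A^n,\, a \in A} \mu(\omega_1^n a) \log \mu(a|\omega_1^n) =: -E_{\mu_{n+1}}\left(\log \mu(a|\omega_1^n)\right).$$

The difference $H_{n+1}(\mu) - H_n(\mu)$ is the entropy rate at order n; the sum is the conditional entropy of the next symbol a given the previous n symbols.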


Entropy of µ

$$h(\mu) = \lim_{n\to\infty} \frac{H_n(\mu)}{n} = \lim_{n\to\infty} h_n(\mu) = -E_\mu\left(\log \mu(a|\omega_1^\infty)\right)$$
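On a finite sample these quantities can only be estimated. A minimal sketch, assuming a naive plug-in estimator in which the unknown µ is replaced by the observed block frequencies of a string `seq` (function names are hypothetical):

```python
import math
from collections import Counter

def block_entropy(seq, n):
    """Plug-in estimate of H_n from the n-block frequencies of seq (in nats)."""
    if n == 0:
        return 0.0
    blocks = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(blocks.values())
    return -sum(c / total * math.log(c / total) for c in blocks.values())

def conditional_entropy(seq, n):
    """Plug-in estimate of h_n = H_{n+1} - H_n."""
    return block_entropy(seq, n + 1) - block_entropy(seq, n)
```

Such estimates are known to be biased downwards once the number of possible n-blocks becomes comparable to the sample length, which is one motivation for the compression-based methods discussed later in the talk.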


Cross and Relative entropy: h(µ||ν) = h(µ) + d(µ||ν)


n-conditional cross entropy:


$$h_n(\mu\|\nu) = -\sum_{\omega \in A^n,\, a \in A} \mu(\omega a) \log \nu(a|\omega),$$


cross entropy


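$$h(\mu\|\nu) = \lim_{n\to+\infty} \frac{1}{n} H_n(\mu\|\nu) = \lim_{n\to+\infty} h_n(\mu\|\nu),$$

where $H_n(\mu\|\nu) := -\sum_{|\omega|=n} \mu(\omega) \log \nu(\omega)$ is the n-block cross entropy.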


relative entropy (Kullback-Leibler divergence)


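$$d(\mu\|\nu) = \lim_{n\to\infty} E_\mu\left(\log \frac{\mu(\omega_n|\omega_1^{n-1})}{\nu(\omega_n|\omega_1^{n-1})}\right) = \lim_{n\to\infty} \sum_{\omega_1^n \in A^n} \mu(\omega_1^n) \log \frac{\mu(\omega_n|\omega_1^{n-1})}{\nu(\omega_n|\omega_1^{n-1})}.$$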


Three methods for computing K-L divergence

1 Zippers: cross-parsing and Merhav-Ziv Theorem


2 NSRPS: Non Sequential Recursive Pair Substitution


3 BWT: The Burrows-Wheeler Transform


Zippers

LZ78

In LZ78 a parsing into blocks (often referred to as words) of variable length is performed according to the following rule:


the next word is the shortest word that hasn’t been previously seen in the parse


Every new parsed word is added to a dictionary, which can then be used for reference to proceed in the parsing.


an example of LZ78-parsing

$a_1^n$ = accbbabcbcbbabbcbcabbb


The final result of the parse is:

a|c|cb|b|ab|cbc|bb|abb|cbca|bbb
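The parsing rule can be written as a short Python function. This is a minimal sketch of the LZ78 parsing step only (no encoding of the output); the function name `lz78_parse` is chosen for this example.

```python
def lz78_parse(s):
    """Split s into LZ78 words: each word is the shortest block not seen before."""
    seen = set()          # dictionary of words parsed so far
    words = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            words.append(current)
            current = ""
    if current:           # a trailing block may already be in the dictionary
        words.append(current)
    return words

print("|".join(lz78_parse("accbbabcbcbbabbcbcabbb")))
# prints a|c|cb|b|ab|cbc|bb|abb|cbca|bbb
```

The number of parsed words, here 10 for a sequence of 22 symbols, is the quantity that enters compression-based entropy estimates.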


Ziv’s Theorem

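Theorem

If µ is a stationary ergodic process, then

$$\frac{c(n) \log c(n)}{n} \xrightarrow[n\to\infty]{} h(\mu) \quad \text{almost surely},$$

where c(n) is the number of words in the LZ78 parsing of the first n symbols.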


Theorem

(Ziv, Merhav) If X is stationary and ergodic with positive entropy and law P, and Y is a Markov chain with law Q such that $P_n \ll Q_n$ asymptotically, then

$$\lim_{n\to\infty} \frac{c_n(x|y) \log n}{n} = h(P) + d(P\|Q) \quad (P \times Q)\text{-a.s.}$$
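The slides do not spell out how $c_n(x|y)$ is computed. The sketch below assumes the usual cross-parsing convention: x is read from left to right and cut into the longest phrases that occur somewhere in y, a symbol absent from y counting as a phrase of length one. The function names are hypothetical and the substring search is deliberately naive.

```python
import math

def cross_parse_count(x, y):
    """Number of phrases when x is cut into longest substrings found in y."""
    i, count = 0, 0
    while i < len(x):
        length = 0
        # extend the phrase while it still occurs somewhere in y
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        i += max(length, 1)   # a symbol absent from y forms its own phrase
        count += 1
    return count

def cross_entropy_estimate(x, y):
    """c_n(x|y) log n / n, in nats per symbol, with n = len(x)."""
    n = len(x)
    return cross_parse_count(x, y) * math.log(n) / n
```

In an authorship setting, x would be the disputed text and y a reference text of a candidate author: the smaller the estimate, the closer the two sources appear.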


entropy and returning times

Returning and Waiting times

Entropy and cross entropy can be related to the asymptotic behavior of properly defined returning times and waiting times, respectively.


returning time

$$R(w_1^n) = \min\{k > 1 : w_k^{k+n-1} = w_1^n\}$$


waiting time

$$W(w_1^n, z) = \min\{k > 1 : z_k^{k+n-1} = w_1^n\}$$


Note that $W(w_1^n, w) = R(w_1^n)$.
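Both definitions translate directly into string searches. A minimal sketch, assuming sequences are Python strings and positions k are counted from 1 as in the definitions; the function names are chosen for this example, and `None` is returned when the block never reappears in the finite sample.

```python
def waiting_time(block, z):
    """Smallest k > 1 (1-based) such that z[k..k+n-1] equals block, else None."""
    pos = z.find(block, 1)            # 0-based search from index 1, i.e. k = 2
    return pos + 1 if pos != -1 else None

def returning_time(w, n):
    """R(w_1^n): first k > 1 at which the initial n-block of w reappears in w."""
    return waiting_time(w[:n], w)

print(returning_time("abcabcab", 2))  # prints 4
print(waiting_time("ab", "xxabx"))    # prints 3
```

By construction `returning_time(w, n)` equals `waiting_time(w[:n], w)`, which is the identity noted above.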


Two important results

Theorem (Entropy and returning time)

If µ is a stationary, ergodic process, then

$$\lim_{n\to\infty} \frac{1}{n} \log R(w_1^n) = h(\mu) \quad \mu\text{-a.s.}$$


Theorem (Relative entropy and waiting time)

If µ is stationary and ergodic, ν is k-Markov and $\mu_n \ll \nu_n$, then

$$\lim_{n\to\infty} \frac{1}{n} \log W(w_1^n, z) = h(\mu) + d(\mu\|\nu) = h(\mu\|\nu), \quad (\mu \times \nu)\text{-a.s.}$$


A real scenario: Gramsci’s articles

A. Gramsci (1891-1937), journalist and founder of the Italian Communist Party


During the period 1914-1928, Gramsci produced an enormous number of articles in different national newspapers.


Most of these articles (hundreds, if not thousands) are NOT signed


Other possible authors : Bordiga, Serrati, Tasca, Togliatti...the aim isto recognize the articles really written by A. Gramsci...

M.D.E. (University of Bologna) Entropy and Semantic 13-15 September 2010 42 / 60

Page 115: Entropy and Semantic: a mathematical approach to ... · Long-range dynamical correlations in written texts: una parola tira l’altra a (very) big vector space and few notations Wall

entropy and returning times

A real scenario: Gramsci’s articles

A. Gramsci (1891-1937), Journalist and founder of the Italian Comunist Party

During the period 1914-1928, Gramsci produced an enormous numberof articles on different national newspaper.

Most of these article (hundreds, if not thousands) are NOT signed

Other possible authors : Bordiga, Serrati, Tasca, Togliatti...the aim isto recognize the articles really written by A. Gramsci...

Quite positive results for the period 1915-1917

M.D.E. (University of Bologna) Entropy and Semantic 13-15 September 2010 42 / 60

Page 116: Entropy and Semantic: a mathematical approach to ... · Long-range dynamical correlations in written texts: una parola tira l’altra a (very) big vector space and few notations Wall

entropy and returning times

A real scenario: Gramsci’s articles

A. Gramsci (1891-1937), Journalist and founder of the Italian Comunist Party

During the period 1914-1928, Gramsci produced an enormous numberof articles on different national newspaper.

Most of these article (hundreds, if not thousands) are NOT signed

Other possible authors : Bordiga, Serrati, Tasca, Togliatti...the aim isto recognize the articles really written by A. Gramsci...

Quite positive results for the period 1915-1917

M.D.E. (University of Bologna) Entropy and Semantic 13-15 September 2010 42 / 60

Page 117: Entropy and Semantic: a mathematical approach to ... · Long-range dynamical correlations in written texts: una parola tira l’altra a (very) big vector space and few notations Wall

entropy and returning times

A real scenario: Gramsci’s articles

A. Gramsci (1891-1937), Journalist and founder of the Italian Comunist Party

During the period 1914-1928, Gramsci produced an enormous numberof articles on different national newspaper.

Most of these article (hundreds, if not thousands) are NOT signed

Other possible authors : Bordiga, Serrati, Tasca, Togliatti...the aim isto recognize the articles really written by A. Gramsci...

Quite positive results for the period 1915-1917 (!?!?)

M.D.E. (University of Bologna) Entropy and Semantic 13-15 September 2010 42 / 60

Page 118: Entropy and Semantic: a mathematical approach to ... · Long-range dynamical correlations in written texts: una parola tira l’altra a (very) big vector space and few notations Wall

entropy and returning times

A real scenario: Gramsci’s articles

A. Gramsci (1891-1937), journalist and founder of the Italian Communist Party

During the period 1914-1928, Gramsci produced an enormous number of articles in different national newspapers.

Most of these articles (hundreds, if not thousands) are NOT signed.

Other possible authors: Bordiga, Serrati, Tasca, Togliatti... The aim is to recognize the articles really written by A. Gramsci.

Quite positive results for the period 1915-1917 (!?!?)

Joint collaboration with D. Benedetto, E. Caglioti and M. Lana, for the new Edizione Nazionale delle Opere di Gramsci (2007-2008)

Non Sequential Recursive Pair Substitution

Reference

D. Benedetto, E. Caglioti, G. Cristadoro and M. Degli Esposti: "Relative entropy via non-sequential recursive pair substitution", Journal of Statistical Mechanics: Theory and Experiment, in press (2010)

a family of transformations on sequences and the corresponding operators on distributions:

given $a, b \in A$, $\alpha \notin A$ and $A' = A \cup \{\alpha\}$, a pair substitution is a map

$$G^{\alpha}_{ab} : A^{*} \to A'^{*}$$

which substitutes sequentially, from left to right, the occurrences of $ab$ with $\alpha$.

For example

$$G^{2}_{01}(0010001011100100) = 020022110200,$$

or:

$$G^{2}_{00}(0001000011) = 2012211.$$

$G = G^{\alpha}_{ab}$ is always an injective but not surjective map that can be immediately extended also to infinite sequences $w \in A^{\mathbb{N}}$.
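The left-to-right substitution defined above can be sketched in a few lines of Python. This is only an illustration: the function names `pair_substitution` and `substitute_str` are hypothetical, and symbols are assumed to be single characters (or list items). It reproduces the two examples given on the slide.

```python
def pair_substitution(seq, a, b, alpha):
    """Replace occurrences of the pair (a, b) with alpha, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        # A match consumes two symbols, so overlapping occurrences
        # (possible only when a == b) are resolved from the left.
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(alpha)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def substitute_str(s, a, b, alpha):
    """Convenience wrapper (hypothetical) for strings of one-character symbols."""
    return "".join(pair_substitution(list(s), a, b, alpha))


print(substitute_str("0010001011100100", "0", "1", "2"))  # 020022110200
print(substitute_str("0001000011", "0", "0", "2"))        # 2012211
```

Consuming both symbols on a match is what makes the case $a = b$ well defined: a run `000` becomes `20`, not `22`.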

the action of G

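$G$ shortens the original sequence:

$$\frac{1}{Z_{ab}(\omega_1^n)} := \frac{|G^{\alpha}_{ab}(\omega_1^n)|}{|\omega_1^n|} = 1 - \frac{\sharp\{ab \subseteq \omega_1^n\}}{n}.$$

For $\mu$-typical sequences we can pass to the limit and define:

$$\frac{1}{Z^{\mu}} := \lim_{n\to\infty} \frac{|G(\omega_1^n)|}{|\omega_1^n|} =
\begin{cases}
1 - \mu(ab) & \text{if } a \neq b \\
1 - \mu(aa) + \mu(aaa) - \mu(aaaa) + \cdots & \text{if } a = b
\end{cases}$$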
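As a small numerical check of the finite-length identity, the shrink factor can be computed directly from the number of replaced pairs. The helper names below are hypothetical, and the count is taken without overlaps (an assumption that matters only when $a = b$), so that it matches what a left-to-right substitution actually replaces.

```python
def count_replaced_pairs(seq, a, b):
    """Number of pairs ab that a left-to-right substitution would replace."""
    count, i = 0, 0
    while i + 1 < len(seq):
        if seq[i] == a and seq[i + 1] == b:
            count += 1
            i += 2  # skip both symbols: replaced occurrences never overlap
        else:
            i += 1
    return count


def inverse_z(seq, a, b):
    """Empirical 1/Z_ab = |G(seq)| / |seq| = 1 - (replaced pairs) / n."""
    return 1.0 - count_replaced_pairs(seq, a, b) / len(seq)


print(inverse_z("0010001011100100", "0", "1"))  # 0.75
print(inverse_z("0001000011", "0", "0"))        # 0.7
```

Both values agree with the examples above: the 16-symbol string shrinks to 12 symbols, and the 10-symbol string to 7.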

invariance of the entropy

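Invariance of entropy (with $\mathcal{G}$ the operator on measures induced by $G$):

$$h(\mathcal{G}\mu) = Z\, h(\mu).$$

Decreasing of the 1-conditional entropy:

$$h_1(\mathcal{G}\mu) \le Z\, h_1(\mu).$$

$\mathcal{G}$ maps 1-Markov measures into 1-Markov measures:

$$h(\mathcal{G}\mu) \le h_1(\mathcal{G}\mu) \le Z\, h_1(\mu) = Z\, h(\mu) = h(\mathcal{G}\mu)$$

Decreasing of the $k$-conditional entropy:

$$h_k(\mathcal{G}\mu) \le Z\, h_k(\mu).$$

$\mathcal{G}$ maps $k$-Markov measures into $k$-Markov measures.

These properties, roughly speaking, reflect the fact that the amount of information of $G(\omega)$, which is equal to that of $\omega$, is more concentrated on the pairs of consecutive symbols.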
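One heuristic way to read the first identity, offered as a sketch and not as the proof from the reference: since $G$ is injective, a typical block $\omega_1^n$ and its image carry the same information, while the image is shorter by the factor $Z$.

```latex
% Heuristic only. Assumptions: \omega_1^n is \mu-typical, so its information
% content is about n h(\mu), and |G(\omega_1^n)| is about n / Z.
\[
  n\, h(\mu) \;\approx\; \bigl|G(\omega_1^n)\bigr| \, h(\mathcal{G}\mu)
             \;\approx\; \frac{n}{Z}\, h(\mathcal{G}\mu)
  \quad\Longrightarrow\quad
  h(\mathcal{G}\mu) = Z\, h(\mu).
\]
```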

iterating G ...

$A_1, A_2, \ldots, A_N, \ldots$ will be an increasing alphabet sequence.

Given $N$ and chosen $a_N, b_N \in A_{N-1}$:

$\alpha_N \notin A_{N-1}$ is a new symbol and define the new alphabet as $A_N = A_{N-1} \cup \{\alpha_N\}$;

$G_N$ is the substitution map $G_N = G^{\alpha_N}_{a_N b_N} : A^{*}_{N-1} \to A^{*}_{N}$ which substitutes with $\alpha_N$ the occurrences of the pair $a_N b_N$;

$\mathcal{G}_N$ is the corresponding map from the measures on $A^{\mathbb{Z}}_{N-1}$ to the measures on $A^{\mathbb{Z}}_{N}$;

we denote by $Z_N$ the corresponding normalization factor $Z_N = Z^{\alpha_N}_{a_N b_N}$.

over-line to denote iterated quantities

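$$\overline{G}_N := G_N \circ G_{N-1} \circ \cdots \circ G_1, \qquad \overline{\mathcal{G}}_N := \mathcal{G}_N \circ \mathcal{G}_{N-1} \circ \cdots \circ \mathcal{G}_1$$

and also

$$\overline{Z}_N = Z_N Z_{N-1} \cdots Z_1.$$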

asymptotics of $\overline{Z}_N$

The asymptotic properties of $\overline{Z}_N$ clearly depend on the pairs chosen in the substitutions.

In particular, if at any step $N$ the chosen pair $a_N b_N$ is the pair of maximum frequency among the pairs of symbols of $A_{N-1}$ then (Theorem 4.1 in BCG):

$$\lim_{N\to\infty} \overline{Z}_N = +\infty$$
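A minimal sketch of the iteration with the maximum-frequency choice, on a finite sample. Several things here are assumptions made for illustration: symbols are integers, fresh symbols are taken as `max + 1`, pair frequencies are counted with overlaps, and the finite-sample stand-in for $\overline{Z}_N$ is the ratio of the original length to the current length.

```python
import random
from collections import Counter


def substitute(seq, a, b, alpha):
    """Left-to-right replacement of the pair (a, b) with alpha."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(alpha)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def nsrps(seq, steps):
    """Apply up to `steps` substitutions, each on the currently most frequent pair.

    Returns the final sequence and the cumulative factors
    (original length) / (length after N substitutions), N = 1, 2, ...
    """
    seq = list(seq)
    n0 = len(seq)
    fresh = max(seq) + 1          # integer symbols; new ones are appended
    zbars = []
    for _ in range(steps):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:             # fewer than two symbols left
            break
        (a, b), _count = pairs.most_common(1)[0]
        seq = substitute(seq, a, b, fresh)
        fresh += 1
        zbars.append(n0 / len(seq))
    return seq, zbars


random.seed(0)
sample = [random.randint(0, 1) for _ in range(2000)]
_, factors = nsrps(sample, 20)
print(factors[:5])  # non-decreasing, each value >= 1
```

On a finite sample the factor cannot exceed the sample length, so the divergence stated above is only visible as steady growth over the first steps.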

asymptotic properties of the entropy

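Theorem (Entropy via NSRPS)

If

$$\lim_{N\to\infty} \overline{Z}_N = +\infty$$

then

$$h(\mu) = \lim_{N\to\infty} \frac{1}{\overline{Z}_N}\, h_1(\mu_N)$$

i.e. $\mu_N := \overline{\mathcal{G}}_N \mu$ becomes asymptotically 1-Markov.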
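The theorem suggests a plug-in estimate on a finite sample: substitute the most frequent pair a number of times, then divide the empirical 1-conditional entropy of the result by the cumulative length ratio. The sketch below is an assumption-laden illustration, not the procedure from the reference: `h1` is the usual plug-in estimate of $-\sum_{ab}\mu(ab)\log_2\mu(b|a)$, symbols are integers, and the number of steps is left to the user.

```python
import math
import random
from collections import Counter


def substitute(seq, a, b, alpha):
    """Left-to-right replacement of the pair (a, b) with alpha."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(alpha)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def h1(seq):
    """Empirical 1-conditional entropy, in bits per symbol."""
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])
    total = len(seq) - 1
    return -sum(
        (c / total) * math.log2(c / first_counts[a])
        for (a, _b), c in pair_counts.items()
    )


def nsrps_entropy(seq, steps):
    """h1 of the substituted sequence divided by the cumulative length ratio."""
    seq = list(seq)
    n0 = len(seq)
    fresh = max(seq) + 1
    for _ in range(steps):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        seq = substitute(seq, a, b, fresh)
        fresh += 1
    if len(seq) < 2:
        return 0.0
    zbar = n0 / len(seq)
    return h1(seq) / zbar


random.seed(1)
coin = [random.randint(0, 1) for _ in range(20000)]
print(nsrps_entropy(coin, 0))   # close to 1 bit for a fair coin
print(nsrps_entropy(coin, 10))
```

With too many substitutions relative to the sample length, the pair statistics become undersampled and the plug-in `h1` tends to underestimate, so the number of steps has to be kept small compared with the data size.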

generalization to the cross and relative entropy

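Theorem (Invariance of relative entropy for pair substitution)

If $\mu$ is ergodic, $\nu$ is a Markov chain and $\mu_n \ll \nu_n$, then if $G$ is a pair substitution

$$d(\mathcal{G}\mu \,\|\, \mathcal{G}\nu) = Z^{\mu}\, d(\mu \,\|\, \nu)$$

Theorem (KL divergence via NSRPS)

If $\overline{Z}^{\nu}_N \to +\infty$ as $N \to +\infty$,

$$h(\mu \,\|\, \nu) = \lim_{N\to+\infty} \frac{h_1(\overline{\mathcal{G}}_N \mu \,\|\, \overline{\mathcal{G}}_N \nu)}{\overline{Z}^{\mu}_N}$$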

The Burrows-Wheeler Transform

BWT in few words

$\omega = \omega_1\omega_2\cdots\omega_n \in A^n$ is a finite string on some ordered, finite alphabet.

Generate all the $n$ cyclic rotations:

$$\omega_1\omega_2\cdots\omega_n,\quad \omega_2\omega_3\cdots\omega_n\omega_1,\quad \ldots,\quad \omega_n\omega_1\omega_2\cdots\omega_{n-1}.$$

Sort them from right-to-left in lexicographic order.

Form a matrix $M$ whose rows are the sorted cyclic permutations.

$\mathrm{bwt}(\omega)$ is defined as the first column of $M$.
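The construction above can be written down naively as follows. This is a sketch under one stated assumption: "right-to-left lexicographic order" is read as comparing rotations by their reversed text. It builds all rotations explicitly, so it is only meant for short strings, and it adds no end-of-string marker.

```python
def bwt(s):
    """BWT variant described above: sort rotations right-to-left, take the first column."""
    n = len(s)
    rotations = [s[i:] + s[:i] for i in range(n)]
    # Right-to-left lexicographic order: compare rotations by their reversal.
    rotations.sort(key=lambda r: r[::-1])
    return "".join(r[0] for r in rotations)


print(bwt("banana"))  # nnbaaa
```

In the output for `banana`, the first three symbols `nnb` are exactly the characters that (cyclically) follow an `a`, which is the grouping by preceding context that this variant produces.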

an example of BWT

[Figure: worked example of the BWT; image not recovered from the slides]

why the BWT can be important in constructing efficient entropy indicators

Given a fixed finite string $s \in A^N$, for each substring $\omega$ of $s$, all characters in $s$ following $\omega$ are grouped together inside $\mathrm{bwt}(s)$.

Thinking now of $s$ as an asymptotically large string coming from a stochastic source, we might conclude that:

$\mathrm{bwt}(s)$ looks like a piecewise i.i.d. process.

BWT as an indicator for entropy and relative entropy

just a remark

The context-sorting properties of the BWT suggest a method to estimate conditional empirical distributions based on segmentation of the BWT output.

The algorithm in four steps

1 Run the BWT on a realization of the source.

2 Partition the BWT output sequence $x$ into $T_x$ segments, for example using a uniform segmentation strategy.

3 Estimate the first-order distribution within each segment. Let $n_j(a)$ denote the number of occurrences of the symbol $a \in A$ in the $j$th segment, and $\mu(a, j)$ the estimated probability of symbol $a$ in that segment:

$$\mu(a, j) = \frac{n_j(a)}{\sum_{b \in A} n_j(b)}.$$

The contribution of the empirical distribution in the $j$th segment to the entropy estimate is given by

$$\log \mu(j) = \sum_{a \in A} n_j(a) \log \mu(a, j).$$

4 Average the individual estimates over the segments to get the estimate:

$$h(x_1^n) := -\frac{1}{n} \sum_{j=1}^{T_x} \log \mu(j).$$
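
The four steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation behind the slides: the names `bwt` and `bwt_entropy_estimate` are hypothetical, the BWT is computed naively from sorted cyclic rotations without an end-of-text sentinel, and logarithms are taken in base 2 (the slides do not fix the base), so the result is in bits per symbol.

```python
import math
from collections import Counter


def bwt(text: str) -> str:
    """Naive BWT: last column of the sorted cyclic rotations (no sentinel)."""
    n = len(text)
    # Rotation starting at i is text[i:] + text[:i]; its last symbol is text[i - 1].
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    return "".join(text[i - 1] for i in order)


def bwt_entropy_estimate(text: str, segment_length: int) -> float:
    """Entropy estimate from the four steps, with uniform segmentation."""
    n = len(text)
    x = bwt(text)  # step 1: run the BWT on the realization
    total = 0.0
    # step 2: uniform segmentation (the last segment may be shorter)
    for start in range(0, n, segment_length):
        counts = Counter(x[start:start + segment_length])  # n_j(a)
        seg_size = sum(counts.values())                     # sum_b n_j(b)
        # step 3: log mu(j) = sum_a n_j(a) * log mu(a, j)
        total += sum(c * math.log2(c / seg_size) for c in counts.values())
    # step 4: h = -(1/n) * sum_j log mu(j)
    return -total / n
```

For the periodic string `abababab` the BWT output is `bbbbaaaa`: with segments of length 4 every segment is constant and the estimate is 0, while a single segment of length 8 yields 1 bit per symbol. The sorted-rotations BWT is quadratic in memory and is meant for short examples only; a suffix-array construction would be needed for real texts.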


Page 172: Entropy and Semantic: a mathematical approach to ... · Long-range dynamical correlations in written texts: una parola tira l’altra a (very) big vector space and few notations Wall

BWT as an indicator for entropy and relative entropy

the Main Theorem

Theorem

Let $x \in A^n$ be a sequence of length $n$ generated from a stationary ergodic source $\mu$ with entropy $h_\mu$.

The entropy estimator using uniform segmentation with segment length $c(n) = \alpha \cdot n^{\gamma}$ converges to the entropy rate almost surely:

$$\lim_{|x| = n \to \infty} h(x) = h_\mu \quad \text{a.s.}$$
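
A rough reading of the segment-length condition may help. The slide does not state the admissible ranges of $\alpha$ and $\gamma$; assuming, purely for illustration, $\alpha > 0$ and $0 < \gamma < 1$, the number of segments under uniform segmentation is approximately

$$T_x \approx \frac{n}{c(n)} = \frac{n^{1-\gamma}}{\alpha},$$

so both the segment length $c(n)$ and the number of segments $T_x$ grow without bound as $n \to \infty$. For example, with $n = 10^6$, $\alpha = 1$ and $\gamma = 1/2$ one gets $c(n) = 1000$ symbols per segment and $T_x \approx 1000$ segments.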


Page 173: Entropy and Semantic: a mathematical approach to ... · Long-range dynamical correlations in written texts: una parola tira l’altra a (very) big vector space and few notations Wall

BWT as an indicator for entropy and relative entropy

BWT for K-L estimates
