Quantitative and Network Co-Occurrences Analysis in Literature Teaching, by Luca Cinacchio
TRANSCRIPT
8/8/2019 Quantitative and Network Co-Occurences Analysis in Literature Teaching, by Luca Cinacchio
Quantitative and
Network Co-Occurrences Analysis
in Literature Teaching
Presentation at
Mathematica UserGroup Meeting Italia 2010
by Luca Cinacchio
Università di Torino, Corso di Laurea in Fisica
Abstract
For many high school students, literature is a boring discipline.
Often, though, what bores them even more is the apparent arbitrariness of the judgments made when a book is analyzed.
Clear evidence supporting a judgment can be very useful in helping the student understand the critics.
This is where a quantitative analysis of the text can play a successful role.
Teaching Literature only with a traditional approach...
For many high school students, literature is a boring discipline.
Often, though, what bores them even more is the apparent arbitrariness of the judgments made when a book is analyzed.
For example, analyzing a novel that tells the story of a family during the birth of their country, a critic can say:
"The book is a celebration of the family and its values."
And a student, especially one who has not read the book, may think: "Why is the book a celebration of the family? Why is it not a celebration of the roots of this family's country?"
...and also with a quantitative approach
Clear evidence supporting a judgment can be very useful in helping the student understand the critics.
This is where a quantitative analysis of the text can play a successful role.
For example, to the judgment "The book is a celebration of the family and its values" we can add some quantitative information such as: "in fact, the word 'family' is the most frequent one in the text". This can be a good starting point in helping the student understand why that judgment was expressed.
The past...
Quantitative analyses of texts were carried out before computers existed, but they took a long time.
Even for the simplest analysis, the ranking of occurrences (how many times each word occurs, and where), people had to patiently compile lots of cards and annotate each occurrence.
...and the present
With the computer everything has become simpler: many different kinds of quantitative analysis can be done in just a few seconds (or a fraction of a second!) using software dedicated to this task.
Many different quantitative analyses can be performed on a text, and there is a huge bibliography on the subject.
Resources
The availability of a large number of electronic literary texts has increased the attractiveness of quantitative approaches: today it is easy to find on the internet a good collection of the major works of any classic author.
For Italian literature a good starting point is Progetto Manuzio, at the URL http://www.liberliber.it/biblioteca/.
There you will find a large collection of Italian classics; the books can be downloaded in different formats: plain text (txt), HyperText Markup Language (HTML), or Acrobat (pdf).
For use with the utilities provided in this paper I recommend the txt format.
Playing with the text
What is sometimes underestimated is how helpful this kind of approach can be in school, as a complement to the more traditional one, of course.
What we are looking for is a data-centric approach to novels, that is, one that makes use of graphs, maps, and charts.
By doing quantitative analysis on a text, students can feel that they are in charge of the analysis. They become active participants rather than passive subjects, as they are when they have to blindly trust what is written in the course textbook.
The need for some kind of comparative norm suggests that counting more than one text will often be required, and the nature of the research will dictate the appropriate comparison text. In some cases, other texts by the same author will be selected, or texts by contemporary authors.
Having at hand a set of tools that lets the student perform different kinds of analysis quickly and easily can lead to a sort of recreational approach to the text.
The genesis of MathText
There are many dedicated programs for quantitative text analysis. Unfortunately, some of those that allow sophisticated analyses, such as co-occurrence networks, are not free.
Two years ago, two friends of mine wanted to do some quantitative analysis on two different texts: the first on the full corpus of the TV series Lost, and the second on an obscure old French text, Hypolite by Gabriel Gilbert (the first is still a work in progress; the second resulted in a Tesi di Laurea at the Università di Torino, Facoltà di Lingue e letterature straniere).
So I wrote in Mathematica a collection of small utilities for quantitative text analysis, and I called them MathText.
What MathText can do
Some basic operations of data cleansing
Almost any item, feature, or characteristic of a text that can be reliably identified can be counted. Decisions about
what to count can be obvious, problematic, or extremely difficult, and poor initial choices can lead to wasted effort
and worthless results. Even careful planning leaves room for surprises, fortunately often of the happy sort that call for
further or different quantification.
For example, counting the articles and including them in the analysis is not very useful: we already know that in an English text the article "the" will be ranked first, and in some analyses, such as co-occurrence networks, it will clutter the graphics.
So it can be wise to exclude from the analysis words or symbols such as:
articles
prepositions (simple and, for Italian, articulated)
punctuation
numbers (although for some texts they can be useful)
conjunctions
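As a minimal sketch of this kind of filtering (it is not MathText's own code), the following Mathematica fragment splits a sentence into words and drops those found in a stop list. The names stopWords, sentence, words and content, and the sample sentence itself, are hypothetical and chosen only for illustration.

```mathematica
(* Hypothetical stop list: articles, prepositions and conjunctions to ignore *)
stopWords = {"the", "a", "an", "of", "in", "and", "or"};

(* Sample sentence, used only for illustration *)
sentence = "The story of a family and the birth of a nation";

(* Lower-case, split on whitespace, then drop every stop word *)
words = StringSplit[ToLowerCase[sentence]];
content = Select[words, ! MemberQ[stopWords, #] &]
(* {"story", "family", "birth", "nation"} *)
```

Filtering a list of words, rather than deleting substrings from the raw text, avoids removing letters that merely happen to appear inside a longer word.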
What MathText can do (cont.)
MathText provides basic tools for this kind of data cleansing.
The data cleansing code is very basic and rather inelegantly written, but this way even Mathematica beginners will be able to customize this section according to their needs.
Unfortunately, one thing MathText cannot do is reduce different tenses and/or different persons of the same verb to a common root.
E.g. mangio and mangiavo will be considered 2 different words.
E.g. mangio and mangiano will be considered 2 different words.
Writing this kind of tool was beyond my skill. I know that there are utilities that accomplish this task: I hope that in the future somebody will implement it in a better version of MathText.
Another thing you must be aware of is that, at present, MathText considers the singular and the plural form of a word as 2 different words.
E.g. home and homes are 2 different words; casa and case (Italian) are 2 different words.
(Let me open a short digression here: as I said, MathText was written for two friends of mine. My skill in Mathematica programming is very basic, so the result is not as professional as it could be had it been written by one of the Mathematica geeks here today. But everybody can share it and, above all, improve it: if you do, please redistribute your improved version!)
What MathText can do (cont.)
Count of the words in the text
Count of the different words in the text
Index of 'vocabulary richness'
The last one is the ratio of the count of different words to the count of all the words in the text.
The theoretical maximum of the index is 1, which corresponds to a text where all the words are different.
It can be useful in comparative analysis; e.g. do all the works of this author have roughly the same index? How does the author's index compare with that of similar authors? And so on...
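The three figures can be computed in a few lines. This is only a sketch on a made-up word list; the variable names are hypothetical and are not the ones used in the notebook.

```mathematica
(* Sample word list, used only for illustration *)
words = {"la", "casa", "e", "la", "strada", "e", "la", "notte"};

totalWords = Length[words];                        (* 8 *)
differentWords = Length[DeleteDuplicates[words]];  (* 5 *)

(* Vocabulary richness: different words divided by total words *)
richness = N[differentWords/totalWords]            (* 0.625 *)
```

A text that never repeats a word would give exactly 1; the more a text repeats itself, the closer the index gets to 0.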
What MathText can do (cont.)
A table with all the words contained in the text, in alphabetical order, and the number of occurrences of each word.
A very large output was generated. Here is a sample of it:
{{", 1}, {abbagliante, 2}, {abbaglianti, 1}, {abbaiamenti, 3},
{abbaiando, 3}, {abbaiano, 1}, {abbaiare, 5}, {abbaiava, 3}, {abbai, 3},
{abbandonare, 2}, {abbandonarla, 1}, {abbandonarono, 1}, {abbandonato, 5},
{abbandonava, 1}, <<9139>>, {zampe, 5}, {zanne, 3}, {zanzare, 1},
{zanzariera, 1}, {zeppa, 1}, {zigzag, 1}, {zitto, 15}, {zucchero, 2},
{zuffolato, 1}, {zuffolava, 1}, {zuffoli, 1}, {zuffolo, 1}, {zuppa, 1}}
What MathText can do (cont.)
A ranked table of occurrences
A very large output was generated. Here is a sample of it:
{{844, si, 1}, {829, non, 2}, {608, tremalnaik, 3}, {378, ma, 4},
{327, era, 5}, {310, kammamuri, 6}, {278, disse, 7}, {268, tu, 8},
{268, , 9}, {266, pi, 10}, {249, mi, 11}, <<9144>>, {1, abbandon, 9156},
{1, abbandono, 9157}, {1, abbandoni, 9158}, {1, abbandoner, 9159},
{1, abbandoneremo, 9160}, {1, abbandonava, 9161}, {1, abbandonarono, 9162},
{1, abbandonarla, 9163}, {1, abbaiano, 9164}, {1, abbaglianti, 9165}, {1, ", 9166}}
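Both tables, the alphabetical one and the ranked one, can be obtained from a word list with Tally and Sort. The sketch below uses a made-up word list and hypothetical variable names; it only illustrates the idea and is not the notebook's code.

```mathematica
(* Sample word list, used only for illustration *)
words = {"tu", "non", "si", "non", "si", "si", "ma"};

(* Alphabetical table of {word, count} pairs *)
occurrences = Sort[Tally[words]]
(* {{"ma", 1}, {"non", 2}, {"si", 3}, {"tu", 1}} *)

(* Put the count first, sort in decreasing order, then append the rank *)
byCount = Reverse[Sort[Reverse /@ occurrences]];
ranking = MapIndexed[Append[#1, First[#2]] &, byCount]
(* {{3, "si", 1}, {2, "non", 2}, {1, "tu", 3}, {1, "ma", 4}} *)
```

Each row of the ranking reads: number of occurrences, word, rank.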
What MathText can do (cont.)
Co-occurrences with triples
You select a word. MathText splits the whole text into overlapping triples (units of 3 words), then extracts and presents all the triples containing the selected word.
Here is an example with the word "barba":
{{barba, nera, arruffata}, {barba, occhi, scintillanti}, {barba, nera, ma},
{barba, nera, occhi}, {barba, nerissima, folta}, {barba, quattro, uomini},
{barba, grigia, cav}, {piccola, barba, nera}, {nera, barba, occhi},
{lunga, barba, nera}, {folta, barba, nera}, {quarant'anni, barba, nerissima},
{mordeva, barba, quattro}, {mare, barba, grigia}, {coperto, piccola, barba},
{lunga, nera, barba}, {d'una, lunga, barba}, {feroce, folta, barba},
{statura, quarant'anni, barba}, {si, mordeva, barba}, {lupo, mare, barba}}
What MathText can do (cont.)
Co-occurrences with triples (cont.)
Then it shows a list of co-occurrences: all the words that occur together with your selected word inside the triples, each with its count.
Be aware! This produces a sort of "weight" for each word's occurrences relative to your word. In fact, since the triples overlap, a word directly beside your word is counted twice, while a word that is still in a triple but two positions away from your word is counted only once.
The first row represents your selected word: its numeric value is again computed from the triples, and it is just the number of times the word appears in them. Having your word in the first row can be useful for further computations, if you want to quickly identify which word a given list-of-lists result refers to.
Here is an example with the same word "barba":
{{barba, 21}, {nera, 8}, {occhi, 3}, {folta, 3}, {lunga, 3}, {nerissima, 2},
{quattro, 2}, {grigia, 2}, {piccola, 2}, {quarant'anni, 2}, {mordeva, 2},
{mare, 2}, {arruffata, 1}, {scintillanti, 1}, {ma, 1}, {uomini, 1}, {cav, 1},
{coperto, 1}, {d'una, 1}, {feroce, 1}, {statura, 1}, {si, 1}, {lupo, 1}}
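The weighting effect is easy to see on a tiny example. The following sketch uses a made-up word list and hypothetical variable names; it assumes nothing beyond the description above.

```mathematica
(* Sample word list, used only for illustration *)
words = {"una", "lunga", "barba", "nera", "e", "folta"};

(* Overlapping triples: each one starts one word after the previous one *)
triples = Partition[words, 3, 1];

(* Keep only the triples that contain the selected word *)
selected = Select[triples, MemberQ[#, "barba"] &];

(* Count every word that appears in those triples *)
weights = Tally[Flatten[selected]]
```

Here "barba" appears in three triples, so it is counted 3 times; its neighbours "lunga" and "nera" each fall in two of those triples and are counted twice; "una" and "e", two positions away, are counted once.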
What MathText can do: co-occurrence networks
Co-occurrence networks are the collective interconnection of terms based on their paired presence within a specified unit of text.
Rules to define co-occurrence within a text corpus can be set according to the desired criteria.
The criterion that I used works as follows:
You select a list of words. They are chosen according to the hypothesis that you would like to explore.
Let me give you an example (it's pure fantasy!). Imagine that we are analyzing a corpus of speeches by a politician.
We can start by creating an occurrences ranking: which words does he use most often?
We discover that these words are family, nation and communist.
Now we can use a list of these 3 words to see which connections link them in the speeches of our politician.
What MathText can do: co-occurrence networks (cont.)
For each occurrence of each word in your list, a "window", or lexical unit, is created with a specified number of words to the left (before) and to the right (after).
E.g., if you are looking for the word "range" and this word is contained in the sentence
"It unifies a broad range of programming paradigms"
and you choose 2 as the parameter for the window (or lexical unit), this list of terms is created:
{a broad, broad range, range of, of programming}
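The window for this sentence can be reproduced with a short sketch. The variable names are hypothetical, and the code handles only the first occurrence of the word; it is an illustration of the idea, not the notebook's implementation.

```mathematica
sentence = "It unifies a broad range of programming paradigms";
words = StringSplit[ToLowerCase[sentence]];

(* Position of the first occurrence of the selected word *)
pos = Position[words, "range"][[1, 1]];

(* Window parameter: 2 words to the left and 2 to the right *)
n = 2;
window = Take[words, {Max[1, pos - n], Min[Length[words], pos + n]}];

(* Pairs of adjacent words inside the window *)
pairs = Partition[window, 2, 1]
(* {{"a", "broad"}, {"broad", "range"}, {"range", "of"}, {"of", "programming"}} *)
```

Max and Min keep the window inside the sentence when the selected word is close to its beginning or end.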
What MathText can do: co-occurrence networks (cont.)
We can think of a network like this:
What MathText can do: co-occurrence networks (cont.)
Now let us assume that our word range is contained in another sentence of our text:
"There are many things inside range strongly connected with love"
This time we can think of a network like this:
What MathText can do: co-occurrence networks (cont.)
If our whole text were made up of these two sentences, and our analysis were limited to the word range, the final network would look like this:
What MathText can do: co-occurrence networks (cont.)
This was a really simple example. What is done in practice is a little more complex.
In fact, we also look for links between the words contained in our extracted lexical units.
So, imagine having one more sentence in our corpus:
[...] inside broad range [...]
Now the network will be like this:
What MathText can do: co-occurrence networks (cont.)
As you can see, one more link has been added, between inside and broad.
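The three example sentences can be turned into a small network with the sketch below. It assumes that a link is drawn between every two adjacent words inside a window; the helper names tokenize and windowPairs are hypothetical, and this is one possible reading of the construction, not the notebook's own code.

```mathematica
(* Hypothetical helper: lower-case a sentence and split it into words *)
tokenize[s_String] := StringSplit[ToLowerCase[s]];

(* Hypothetical helper: for every occurrence of target, take n words on each
   side and return the pairs of adjacent words inside that window *)
windowPairs[words_List, target_String, n_Integer] :=
  Flatten[
   Table[
    Partition[
     Take[words, {Max[1, pos - n], Min[Length[words], pos + n]}], 2, 1],
    {pos, Flatten[Position[words, target]]}],
   1];

sentences = {
   "It unifies a broad range of programming paradigms",
   "There are many things inside range strongly connected with love",
   "inside broad range"};

pairs = Flatten[windowPairs[tokenize[#], "range", 2] & /@ sentences, 1];

(* One undirected edge per distinct pair; sorting makes {a, b} and {b, a} equal *)
edges = DeleteDuplicates[UndirectedEdge @@@ (Sort /@ pairs)];

Graph[edges, VertexLabels -> "Name"]
```

With these inputs the graph has nine words and nine links: four from each of the first two sentences, plus the link between inside and broad contributed by the third one.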
MathText: the code
MathTExt
Some utilities for text analysis
by Luca Cinacchio - Università di Torino - Corso di Laurea in Fisica
Importing, data cleansing, exporting and re-importing
MathTExt works with any plain-text (txt) file. It has been tested with big texts with no problems.
To get the file into the proper format I used a dirty trick: after importing the file and doing some cleansing, I export it as txt and immediately re-import it with the option "Words".
The data cleansing is very basic and very inelegant, but this way even Mathematica beginners will be able to customize this section according to their needs.
If you scroll through the StringReplace instructions, you will find a section that is only a (* comment *): these are string-deleting instructions for the ENGLISH LANGUAGE only!
Be careful, since each word in the list will be erased from the original text. These were the settings used by my friend for the example analysis of Lost used in this notebook.
Take care: you must set the full path of your txt file, and also change the path of the exported and re-imported file.
(* Change the path to your own; the file should be in ".txt" format. *)
temp = Import["C:\\mathematicafiles\\salgarimisteri.txt"];

(* Punctuation: "." becomes a space, the other marks are deleted.
   Add here any other punctuation marks found in your text. *)
temp = StringReplace[temp, {"," -> "", "." -> " ", ";" -> "", "?" -> ""}];

(* Capital letters A-Z become lower case; digits 0-9 are deleted. *)
temp = StringReplace[temp, Thread[CharacterRange["A", "Z"] -> CharacterRange["a", "z"]]];
temp = StringReplace[temp, Thread[CharacterRange["0", "9"] -> ""]];

(* START ITALIAN SECTION *)
(* Each word below is replaced by a single space, one word after the other. *)
paroleitaliane = {" gli ", " il ", " lo ", " la ", " le ", " i ", " che ", " a ", " a' ",
   " di ", " da ", " in ", " con ", " su ", " per ", " tra ", " fra ",
   " del ", " dello ", " della ", " delle ", " degli ", " dei ",
   " al ", " allo ", " alla ", " agli ", " alle ", " ai ",
   " sul ", " sullo ", " sulla ", " sulle ", " sui ", " sugli ",
   " dal ", " dallo ", " dalla ", " dalle ", " dai ", " dagli ",
   " nel ", " nello ", " nella ", " nelle ", " negli ", " nei ",
   " e ", " ed ", " un ", " una ", " uno ", " a...a... "};
temp = Fold[StringReplace[#1, #2 -> " "] &, temp, paroleitaliane];
(* END OF ITALIAN SECTION *)

(* START ENGLISH SECTION: inside this comment are some string replacements
   for the ENGLISH LANGUAGE only. Be careful, since each word in the list will
   be erased from the original text. These were the settings for the example
   analysis of Lost used in this notebook.
englishwords = {" a ", " an ", " little ", " few ", " the ", " this ", " these ",
   " that ", " those ", " than ", " as ", " one ", " ones ", " many ", " much ",
   " all ", " each ", " every ", " both ", " neither ", " either ", " some ",
   " any ", " no ", " none ", " everyone ", " everybody ", " everything ",
   " else ", " anybody ", " another ", " who ", " whose ", " whom ", " which ",
   " what ", " why ", " when ", " where ", " how ", " 's ", " 'd ", " 've ",
   " 're ", " my ", " mine ", " yours ", " your ", " you ", " his ", " her ",
   " its ", " hers ", " ours ", " our ", " theirs ", " their ", " me ", " us ",
   " we ", " they ", " them ", " 'm ", " it ", " of ", " at ", " most ", " to ",
   " too ", " for ", " from ", " on ", " by ", " before ", " in ", " since ",
   " during ", " till ", " until ", " afterwards ", " after ", " into ",
   " onto ", " off ", " out ", " out of ", " above ", " over ", " under ",
   " below ", " beneath "};
temp = Fold[StringReplace[#1, #2 -> " "] &, temp, englishwords];
END OF ENGLISH SECTION *)

(* Final clean-up: "&" becomes a space, double spaces are collapsed, ":" is deleted. *)
temp = StringReplace[temp, "&" -> " "];
temp = StringReplace[temp, "  " -> " "];
temp = StringReplace[temp, "  " -> " "];
temp = StringReplace[temp, ":" -> ""];
Export["C:\\mathematicafiles\\cleanfile.txt", temp]; (* change with your own path *)

(* If you have changed the previous path, substitute it with the right one here. *)
testo = Import["C:\\mathematicafiles\\cleanfile.txt", "Words"];
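The export and re-import step exists only to turn the cleaned string into a list of words. A possible alternative, sketched here on a made-up sentence, is to split the string in memory with StringSplit. Note the assumptions: ToLowerCase also lower-cases accented capitals, which the letter-by-letter replacements above do not, and the variable names are hypothetical.

```mathematica
(* Sample text, used only for illustration *)
raw = "Il lupo di mare, coperto d'una lunga barba; 3 uomini.";

(* Lower-case everything, turn full stops into spaces,
   delete the other punctuation marks and the digits *)
clean = StringReplace[ToLowerCase[raw],
   {"." -> " ", "," | ";" | ":" | "?" -> "", DigitCharacter -> ""}];

(* Split on whitespace directly, with no intermediate file *)
words = StringSplit[clean]
(* {"il", "lupo", "di", "mare", "coperto", "d'una", "lunga", "barba", "uomini"} *)
```

StringSplit treats any run of whitespace as a single separator, so leftover double spaces need no separate clean-up.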
Occurrences
Execute this cell, and the results will be printed.
Print["The length of the text is ", Length[testo], " words"]
tabellaricorrenze = Sort[Tally[Flatten[testo]]];
vettorericorrenze = Flatten[Table[Part[tabellaricorrenze, i, 2], {i, 1, Length[tabellaricorrenze]}]];
tabellafrequenze = Tally[Reverse[Sort[vettorericorrenze]]];
tabellafrequenze2 = Transpose[{Last /@ tabellafrequenze, First /@ tabellafrequenze}];
Print["The text contains ", Length[tabellaricorrenze], " different words"]
Print["The text has a 'vocabulary richness' of ", N[Length[tabellaricorrenze]/Length[testo]], " (1 corresponds to the theoretical maximum of the index)"]
Print["Here is the occurrences table. Its data are stored in the variable tabellaricorrenze"]
tabellaricorrenze
Print["Here is the frequencies table. Its data are stored in the variable tabellafrequenze"]
tabellafrequenze
The length of the text is 49250 words
The text contains 9166 different words
The text has a 'vocabulary richness' of 0.186112 (1 corresponds to the theoretical maximum of the index)
Here is the occurrences table. Its data are stored in the variable tabellaricorrenze
A very large output was generated. Here is a sample of it:
{{", 1}, {abbagliante, 2}, {abbaglianti, 1}, {abbaiamenti, 3},
{abbaiando, 3}, {abbaiano, 1}, {abbaiare, 5}, {abbaiava, 3},
{abbai, 3}, {abbandonare, 2}, {abbandonarla, 1}, {abbandonarono, 1},
{abbandonato, 5}, {abbandonava, 1}, {abbandoneremo, 1}, <<9136>>,
{zagaglia, 1}, {zampaccie, 1}, {zampe, 5}, {zanne, 3}, {zanzare, 1},
{zanzariera, 1}, {zeppa, 1}, {zigzag, 1}, {zitto, 15}, {zucchero, 2},
{zuffolato, 1}, {zuffolava, 1}, {zuffoli, 1}, {zuffolo, 1}, {zuppa, 1}}
Here is the frequencies table. Its data are stored in the variable tabellafrequenze
{{844, 1}, {829, 1}, {608, 1}, {378, 1}, {327, 1}, {310, 1}, {278, 1}, {268, 2},
{266, 1}, {249, 1}, {248, 1}, {235, 1}, {231, 1}, {229, 2}, {215, 1}, {212, 1},
{193, 1}, {189, 1}, {187, 1}, {185, 1}, {171, 1}, {167, 1}, {166, 1}, {164, 1},
{163, 1}, {161, 1}, {154, 1}, {148, 1}, {141, 1}, {140, 1}, {138, 1}, {137, 1},
{134, 2}, {132, 1}, {129, 1}, {126, 1}, {125, 1}, {123, 1}, {119, 1}, {117, 3},
{114, 1}, {113, 1}, {109, 1}, {107, 1}, {105, 1}, {103, 2}, {100, 2}, {99, 2},
{98, 3}, {97, 2}, {94, 2}, {92, 4}, {91, 1}, {89, 1}, {87, 1}, {86, 1},
{85, 1}, {84, 2}, {83, 2}, {82, 1}, {81, 1}, {80, 1}, {79, 3}, {78, 3}, {77, 2},
{76, 5}, {75, 1}, {74, 2}, {73, 1}, {72, 2}, {71, 2}, {70, 2}, {69, 2}, {68, 2},
{67, 5}, {66, 2}, {65, 1}, {64, 4}, {63, 2}, {62, 3}, {60, 1}, {59, 2}, {58, 3},
{57, 3}, {56, 2}, {55, 6}, {54, 1}, {53, 3}, {52, 1}, {51, 3}, {50, 2}, {49, 4},
{48, 3}, {47, 2}, {46, 9}, {45, 8}, {44, 2}, {43, 8}, {42, 4}, {41, 8}, {40, 3},
{39, 6}, {38, 1}, {37, 7}, {36, 8}, {35, 9}, {34, 7}, {33, 11}, {32, 11},
{31, 7}, {30, 12}, {29, 8}, {28, 7}, {27, 10}, {26, 11}, {25, 8}, {24, 11},
{23, 14}, {22, 21}, {21, 15}, {20, 20}, {19, 20}, {18, 27}, {17, 25}, {16, 36},
{15, 41}, {14, 45}, {13, 45}, {12, 47}, {11, 82}, {10, 91}, {9, 91}, {8, 132},
{7, 140}, {6, 206}, {5, 310}, {4, 472}, {3, 695}, {2, 1444}, {1, 4812}}
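In the frequencies table each pair gives a number of occurrences followed by how many different words have exactly that number of occurrences; the last pair, 1 and 4812, says that 4812 words occur only once. A tiny sketch with made-up counts shows how such a table is obtained:

```mathematica
(* Occurrence counts of five hypothetical words *)
counts = {3, 1, 1, 2, 1};

(* Sort the counts in decreasing order and tally the repeated values *)
Tally[Reverse[Sort[counts]]]
(* {{3, 1}, {2, 1}, {1, 3}} *)
```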
Occurrences Ranking
tabellaricorrenze2 = Transpose[{Last /@ tabellaricorrenze, First /@ tabellaricorrenze}];
rank = Table[i, {i, 1, Length[tabellaricorrenze2]}];
tabellaricorrenze3 = Reverse[Sort[tabellaricorrenze2]];
tabellaricorrenze4 = Table[Append[Part[tabellaricorrenze3, i], i], {i, 1, Length[tabellaricorrenze3]}];
Print["The ranking table shows, in order: number of occurrences, word, rank. Its data are stored in the variable tabellaricorrenze4"]
tabellaricorrenze4
The ranking table shows, in order: number of occurrences, word, rank. Its data are stored in the variable tabellaricorrenze4
A very large output was generated. Here is a sample of it:
{{844, si, 1}, {829, non, 2}, {608, tremalnaik, 3}, {378, ma, 4}, {327, era, 5},
{310, kammamuri, 6}, {278, disse, 7}, {268, tu, 8}, {268, , 9},
{266, pi, 10}, {249, mi, 11}, {248, come, 12}, {235, io, 13}, <<9141>>,
{1, abbassando, 9155}, {1, abbandon, 9156}, {1, abbandono, 9157},
{1, abbandoni, 9158}, {1, abbandoner, 9159}, {1, abbandoneremo, 9160},
{1, abbandonava, 9161}, {1, abbandonarono, 9162}, {1, abbandonarla, 9163},
{1, abbaiano, 9164}, {1, abbaglianti, 9165}, {1, ", 9166}}
Co-Occurrencies for a single word
triplette Partitiontesto, 3, 1;triplette1 ;; 11;selecttriples m_, word_ :
JoinSelect m, MatchQ1 , word &,Select m, MatchQ2 , word &,Select m, MatchQ3 , word &
Usage example
Write in the following cell your words (i.e. "destiny"). Don't forget to write your word as a string, in between to the " ".
The result will be a list of overlapping triples, each containing your selected word. These are all the triples of the text
with your word inside.
q selecttriplestriplette, "barba"
barba, nera, arruffata, barba, occhi, scintillanti, barba, nera, ma,
barba, nera, occhi, barba, nerissima, folta, barba, quattro, uomini,
barba, grigia, cav, piccola, barba, nera, nera, barba, occhi,
lunga, barba, nera, folta, barba, nera, quarant'anni, barba, nerissima,
mordeva, barba, quattro, mare, barba, grigia, coperto, piccola, barba,
lunga, nera, barba, d'una, lunga, barba, feroce, folta, barba,
statura, quarant'anni, barba, si, mordeva, barba, lupo, mare, barba
Executing the following cell produces a list of co-occurrences: all the words that occur together with your selected word.
HOW IT WORKS: I take the triples produced by the previous instruction and count the frequency of each word in them. This
gives a sort of "weight" to each word's occurrences relative to your word. If a word is directly beside
your word, it is counted twice, because it falls in two of the overlapping triples. If a word is still in the triple, but two positions away from your word, it is
counted only once.
The first row represents your selected word: its numeric value is again computed from the triples, and it is simply how
many times the word appears in them. Having your word in the first row can be useful for further computations, if you
want to quickly identify which word that list of lists refers to.
Reverse[Sort[Tally[Flatten[q]], #1[[2]] < #2[[2]] &]]
{{barba, 21}, {nera, 8}, {occhi, 3}, {folta, 3}, {lunga, 3}, {nerissima, 2},
{quattro, 2}, {grigia, 2}, {piccola, 2}, {quarant'anni, 2}, {mordeva, 2},
{mare, 2}, {arruffata, 1}, {scintillanti, 1}, {ma, 1}, {uomini, 1}, {cav, 1}, {coperto, 1}, {d'una, 1}, {feroce, 1}, {statura, 1}, {si, 1}, {lupo, 1}}
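The weighting described above can be checked on a very small example. The following is only a sketch, not the notebook's own code: the word list `sample` is invented for illustration, and `MemberQ` replaces the three position-wise `Select` calls as a simplification.

```mathematica
(* Hypothetical toy text; the notebook works on the cleaned word list testo. *)
sample = {"una", "lunga", "barba", "nera", "arruffata"};

(* Overlapping triples with offset 1, as in Partition[testo, 3, 1]. *)
triples = Partition[sample, 3, 1];

(* Keep the triples that contain the target word in any position. *)
hits = Select[triples, MemberQ[#, "barba"] &];

(* Count every word in the kept triples, largest counts first. *)
weights = SortBy[Tally[Flatten[hits]], -Last[#] &]
```

Here "lunga" and "nera" stand directly beside "barba" and get weight 2, while "una" and "arruffata" are two positions away and get weight 1. The selected word itself gets 3, the number of triples that contain it.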
Co-occurrences table based on a list of selected words
You can choose as many words as you want. For each of them a list of co-occurrences will be produced, and all the
data will be aggregated in a co-occurrences table, with all the words co-occurring with your
selected words in the columns and your selected words in the rows. The cell where a row and a column cross gives the number of co-occurrences
for that pair.
Again, for the co-occurrences I use the triples, counting the frequency of each word. This gives a sort of
"weight" to each word's occurrences relative to your word. If a word is directly beside your word, it is
counted twice. If a word is still in the triple, but two positions away from your word, it is counted only once.
Insert your words here (don't forget the " "):
vecparolescelte = {"famiglia", "sposa", "moglie", "figlio", "figli", "figlia"};
(* rows of the frequency table whose second element is the given word *)
selectparola[lista_, parola_] := Select[lista, MatchQ[#[[2]], parola] &]
(* abort if any chosen word is missing from the text *)
Table[
 If[MatchQ[selectparola[tabellaricorrenze3, vecparolescelte[[i]]], {}],
  Print[" WARNING MESSAGE
One of the chosen words is not in the text. The co-occurrences
table requires that all the words are present
in the text. Aborting procedure."]; Abort[],
  Print["Check words passed"]],
 {i, 1, Length[vecparolescelte]}];
Check words passed
Check words passed
Check words passed
Check words passed
Check words passed
Check words passed
posparola[lista_, parola_] := Flatten[Position[lista, parola]]
selecttriples[m_, word_] := Join[Select[m, MatchQ[#[[1]], word] &],
  Select[m, MatchQ[#[[2]], word] &], Select[m, MatchQ[#[[3]], word] &]]
righecollect = {};
(* one list of triples per chosen word *)
tripletemp =
  Table[selecttriples[triplette, vecparolescelte[[i]]], {i, 1, Length[vecparolescelte], 1}];
Print["The list of the 'words space' associated to your chosen words"]
listparole = Union[Sort[Flatten[tripletemp]]]
The list of the 'words space' associated to your chosen words
{acque, ad, ada, all'ultimo, amato, ancora, andato, avuto, baleni, bengalese,
bevanda, bravi, capitano, capriccio, chiamava, chiese, ci, cibo, colpo, comanda,
compresi, conta, corishant, darei, dell'india, dinanzi, disse, diventar, dov',
d'un, d'una, , ella, empio, entro, era, erro, esclam, famiglia, farebbe,
ferma, figli, figlia, figlio, finalmente, fu, gatto, giammai, gl'indiani,
gridando, guardava, ha, intera, inviato, io, irremovibile, jungla, kl,
l'hai, liberi, l'indiano, lui, ma, mai, mano, me, meglio, mia, miei, minaccia,
moglie, morire, morta, n, nome, non, notte, o, oh, ordinai, ordinate, palla,
parlo, patria, pietrificato, piombo, poi, povera, prode, punto, pure, rapire,
rapita, rinchiusi, ripet, rispose, ritorna, s'accorse, sacre, salve, sar,
sarai, saremo, sar, scomparve, scompose, scorsi, sdegnosamente, se, selvaggio,
si, s, siete, spasimo, spilla, sposa, stata, stessa, sua, suoi, suyodhana,
taci, t'amo, thugs, tremalnaik, tu, tua, uomo, va', vago, vecchio, vostra}
righecollect = {}; Clear[riga1];
initvaluevector = Table[0, {i, 1, Length[listparole]}]; valuevector = initvaluevector;
(* for each chosen word k, fill a row of counts aligned with listparole *)
Table[valuevector = initvaluevector;
  Table[valuevector[[posparola[listparole,
       Reverse[Sort[Tally[Flatten[tripletemp[[k]]]], #1[[2]] < #2[[2]] &]][[i, 1]]]]] =
     Reverse[Sort[Tally[Flatten[tripletemp[[k]]]], #1[[2]] < #2[[2]] &]][[i, 2]];
   riga1 = valuevector,
   {i, 1, Length[Reverse[Sort[Tally[Flatten[tripletemp[[k]]]], #1[[2]] < #2[[2]] &]]], 1}];
  righecollect = Append[righecollect, riga1], {k, 1, Length[vecparolescelte], 1}];
(* put the list of words on top as the header row *)
cooccurences = Insert[Table[righecollect[[i]], {i, 1, Length[vecparolescelte], 1}], listparole, 1];
Print["The cooccurences table, computed with the requested words"]
cooccurences // TableForm
The cooccurences table, computed with the requested words
acque ad ada all'ultimo amato ancora andato avuto baleni bengalese bevanda bravi c
0 1 0 0 0 0 0 0 0 0 1 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 1 0 2 0 1 2 0 0 0
0 0 0 0 0 0 0 0 0 0 0 2 0
0 0 3 0 0 1 0 2 0 0 0 0 1
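As a compact illustration of how such a table can be assembled, here is a minimal sketch on an invented word list. It is not the notebook's code: `sample`, `chosen`, `counts`, `columns` and `rows` are hypothetical names, and it assumes a Wolfram Language version that provides `Association`, `Counts` and `Lookup` (version 10 or later), which the 2010 notebook could not use.

```mathematica
(* Hypothetical toy text and word list, for illustration only. *)
sample = {"la", "mia", "sposa", "disse", "mia", "figlia", "disse", "padre"};
chosen = {"sposa", "figlia"};

triples = Partition[sample, 3, 1];

(* For each chosen word: counts of all words in the triples that contain it. *)
counts = Association[
   Table[w -> Counts[Flatten[Select[triples, MemberQ[#, w] &]]], {w, chosen}]];

(* Columns: every word seen in any kept triple, in sorted order. *)
columns = Union @@ (Keys /@ Values[counts]);

(* Rows: one per chosen word; pairs that never co-occur become 0. *)
rows = Table[Lookup[counts[w], columns, 0], {w, chosen}];

TableForm[rows, TableHeadings -> {chosen, columns}]
```

Adding `TableHeadings` labels each row with its selected word, which makes the table easier to read than a bare grid of numbers.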
Network of co-occurrences of the text
Plot of the co-occurrences of the text. Only directly connected words are taken into account (a word is connected with two neighbours: the one before and the one after it).
Warning: applying this analysis to the full text produces unreadable networks, with too many points, and quite often an
out-of-memory kernel quit. For this reason you can specify a sort of "window" for the analysis, by setting
the numbers of the first and the last word of the text to include.
Each node has a tooltip: hovering the mouse over a node displays its label.
numini = 1000; (* specify the number of the initial word from which you want to start *)
numfin = 1600; (* specify the number of the ending word at which you want to stop.
   Beyond roughly 5000 words you risk an 'out of memory' warning *)
qsx = Map[# -> "a" &, testo]; (* one rule per word, with a placeholder target *)
qsx[[All, 2]] = Insert[Drop[testo, 1], Last[testo], -1]; (* point each word to its successor *)
qsx = Drop[qsx, -1]; (* drop the padded last rule *)
part = Take[qsx, {numini, numfin}];
GraphPlot[part, VertexLabeling -> Tooltip]
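The same idea, linking each word to the one that follows it inside a chosen window, can be sketched more compactly. This is an illustrative variant rather than the notebook's code: `sample` is an invented word list, the window is taken on words rather than on already built pairs, and `Graph` with `UndirectedEdge` assumes version 8 or later.

```mathematica
(* Hypothetical toy text; in the notebook this would be the word list testo. *)
sample = {"una", "lunga", "barba", "nera", "occhi", "scintillanti", "barba", "nera"};

(* Window of the analysis: first and last word positions to include. *)
numini = 3; numfin = 8;
window = Take[sample, {numini, numfin}];

(* One undirected edge per pair of consecutive words in the window. *)
edges = UndirectedEdge @@@ Partition[window, 2, 1];

(* Repeated pairs are merged so that each link is drawn once. *)
g = Graph[DeleteDuplicates[edges], VertexLabels -> "Name"]
```

With this toy input the pair "barba", "nera" occurs twice but is drawn once, so the graph has four nodes and four links.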
Network of co-occurrences with a list of words and free length for the lexical unit
Here a graph of word co-occurrences for your list of selected words is produced.
For each occurrence of each of your words a "window", or lexical unit, is created, made of the num words to its left
and the num words to its right.
So, if you are looking for the word "range" and this word is contained in the sentence "It unifies a broad range of
programming paradigms and uses its unique concept of symbolic programming", then with num = 2 these nodes
and links will be generated for the network: {a broad, broad range, range of, of programming}
Tips: play with the number n, associated with strong or weak deletion of simple words (e.g. the, a, of, ...) in the data-cleansing
section.
With num = 1, if you delete the highest-ranked words, you usually get a network with separate components, and it can be
harder to catch meaningful relationships in the text.
Using a greater num allows you to erase part of the most common words and still retain a net of relationships between
the words.
You can also insert a single word instead of a list of words, still inside the { } and inside the " ".
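The "range" example above can be reproduced with a few lines. This is a minimal sketch under stated assumptions: `sentence`, `positions`, `window` and `pairs` are names introduced here for illustration, and the clipping of the window at the text boundaries is an addition that is not part of the notebook's code.

```mathematica
(* The example sentence, lower-cased and split into words by hand. *)
sentence = {"it", "unifies", "a", "broad", "range", "of", "programming",
   "paradigms", "and", "uses", "its", "unique", "concept", "of",
   "symbolic", "programming"};
num = 2;

(* Positions of the selected word in the word list. *)
positions = Flatten[Position[sentence, "range"]];

(* Window of num words on each side, clipped to the text boundaries. *)
window[p_] := Take[sentence, {Max[1, p - num], Min[Length[sentence], p + num]}];

(* Consecutive pairs inside each window become the links of the network. *)
pairs = Flatten[Partition[window[#], 2, 1] & /@ positions, 1]
```

The resulting pairs are {a, broad}, {broad, range}, {range, of} and {of, programming}, matching the links listed in the text.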
num = 3; (* insert here the number of words that will be considered before and after each selected word *)
vecparolescelte2 = {"figlio", "padre"}; (* insert here your words *)
selectparola[lista_, parola_] := Select[lista, MatchQ[#[[2]], parola] &]
Table[
 If[MatchQ[selectparola[tabellaricorrenze3, vecparolescelte2[[i]]], {}],
  Print[" WARNING MESSAGE
One of the chosen words is not in the text. The co-occurrences
table requires that all the words are present
in the text. Aborting procedure."]; Abort[],
  Print["Check words passed"]],
 {i, 1, Length[vecparolescelte2]}];
(* positions in the text of every occurrence of the selected words *)
listposition =
  Flatten[Table[Position[testo, vecparolescelte2[[k]]], {k, 1, Length[vecparolescelte2]}]];
(* positions of the num words before and after a given position *)
createsingolnpla[posizione_, num_] :=
 Flatten[{Reverse[posizione - Range[num]], posizione, posizione + Range[num]}]
createallnple[posizione_, num_] := Table[createsingolnpla[i, num], {i, Flatten[posizione]}]
(* pairs of consecutive positions inside each window *)
couples = Flatten[Table[Partition[createallnple[listposition, num][[k]], 2, 1],
    {k, 1, Length[listposition]}], 1];
grafdatabis = Table[Take[testo, {couples[[i, 1]], couples[[i, 2]]}], {i, 1, Length[couples]}];
grafdata2bis = Map[#[[1]] -> #[[2]] &, grafdatabis];
GraphPlot[grafdata2bis, VertexLabeling -> Tooltip]
GraphPlot[grafdata2bis, VertexLabeling -> True, ImageSize -> 900]
Check words passed
Check words passed
[Figure: co-occurrence network produced by GraphPlot for the selected words "figlio" and "padre" with num = 3. Node labels recoverable from the plot include padre, io, tu, ada, corishant, negapatnan, pagoda, sacra, vergine, strangolatore, giovanetta.]