
Page 1: Methods for Contextual Discovery and Analysis.pdf

Document Explorer User Guide __________________

Methods for Contextual Discovery and Analysis

Hamilton-Locke, Inc.


Methods for Contextual Discovery and Analysis

QUALITATIVE AND QUANTITATIVE METHODS OF TEXT ANALYSIS FOR DOCUMENT EXPLORER/WORDCRUNCHER USERS

© 2002 Hamilton-Locke, Inc.



Table of Contents

I. DISCOVERY METHODS ............................................................ 1
  INTRODUCTION .................................................................. 1
  SECTION A – USING WORD COUNTS ................................................. 2
    Counts of Total Words ........................................................ 3
      Total Word Counts .......................................................... 3
      Total Unique Words ......................................................... 7
    Counts by Groups of Words ................................................... 12
    Counts of Word Parts ........................................................ 15
    Counts of Related Words ..................................................... 17
    Counts by Collocated Terms .................................................. 19
      Viewable Search Information ............................................... 20
        Neighborhood Width ...................................................... 20
        Rating .................................................................. 20
        Z-score ................................................................. 21
        Sample Frq .............................................................. 21
        Total Frq ............................................................... 21
        Percent ................................................................. 21
        Expected Frq ............................................................ 21
        Std Dev ................................................................. 21
        Sort Sequence ........................................................... 21
  SECTION B – IN-CONTEXT SEARCHING ............................................. 22
    Meaning and Word Use ........................................................ 22
    Consistency in Word Use ..................................................... 23
      Options and Elements of the Reference List Window ......................... 25
        Reference Window ........................................................ 25
        Citation Line ........................................................... 25
        View Option ............................................................. 26
        Occurrence Number ....................................................... 26
        Select Option ........................................................... 26
        Delete Option ........................................................... 26
  SECTION C – DISCOVERING WITH RUNTIME CONCORDANCES ............................ 27
  SECTION D – COMPARING READABILITY LEVELS ..................................... 30
  SECTION E – EXAMINING COLLOCATED WORDS ....................................... 33
  SECTION F – STYLE ANALYSIS .................................................... 35
    What is Style? .............................................................. 35
      The Significance of Style Analysis ........................................ 36
    Discovering Style ........................................................... 37
      1) Patterns in the content: What is in the text? .......................... 37
      2) Patterns in discretionary use: How is the content being used? .......... 38
      3) Patterns in associations: How do things fit together? .................. 38
  SECTION G – TRANSLATION CONSISTENCY: WORD USE, THEMES AND IMAGERY ............ 40



    Translation ................................................................. 40
      Discovery Before Translation .............................................. 41
      Evaluating Prior Translations ............................................. 41
  SECTION H – TEXTUAL ANALYSIS METHODS FOR WRITERS ............................. 44
    Objective Evaluation ........................................................ 44
    Additional Writing Aids ..................................................... 45

PART II. ANALYTICAL METHODS ................................................... 46
  INTRODUCTION ................................................................. 46
  SECTION A. METHODS OF SUMMARIZING OBSERVATIONS ............................... 48
    Levels of Measurement ....................................................... 48
    Types of Data ............................................................... 48
    Creating a Frequency Distribution ........................................... 48
    Describing a Frequency Distribution ......................................... 49
    Measures of Central Tendency ................................................ 50
    Measures of Variability ..................................................... 50
    Various Vocabulary .......................................................... 51
    Parametric vs. Nonparametric Assumptions and Tests .......................... 51
      Advantages of Nonparametric Statistics .................................... 52
      Disadvantages of Nonparametric Procedures ................................. 52
      When to use Nonparametric Procedures ...................................... 52
  SECTION B. METHODS OF HYPOTHESIS TESTING ..................................... 53
    Steps in Hypothesis Testing ................................................. 53
    Test of Significance ........................................................ 53
  SECTION C. VARIABLE SELECTION ................................................ 54
    Measures of Total Words and Total Unique Words .............................. 54
    Readability or Grade Level .................................................. 55
    Word Groupings (Comparing Sets of Key Words) ................................ 56
    Grammatical Discriminators .................................................. 58
    Comparison Against a Pool ................................................... 58
    Strings and Collocated Word Variables ....................................... 58
  SECTION D. STATISTICAL ANALYSIS USING MICROSOFT EXCEL ........................ 59
  SECTION E. METHODS OF COMPARING OBSERVATIONS ................................. 61
    Two Independent Observations ................................................ 62
      T-test for Two Independent Samples ........................................ 63
      Sign Test ................................................................. 64
      Signed-Ranked Two-Sample Test (Mann-Whitney) .............................. 65
    Two Related Observations .................................................... 66
      Paired T-Test ............................................................. 67
      Sign Test for Pairs ....................................................... 68
      Matched Pairs Signed-Ranked (Wilcoxon) .................................... 69
    Three or More Independent Observations ...................................... 71
      One-Way Analysis of Variance .............................................. 72
      Extension of Sign Test .................................................... 74
      Kruskal-Wallis Test ....................................................... 75
      Multiple Comparisons ...................................................... 76
    Three or More Related Samples ............................................... 77



      Two-Way ANOVA ............................................................. 78
      Friedman Two-Way ANOVA by Ranks ........................................... 80
      Multiple Comparison Procedure for use with Friedman Test .................. 82
      Use of Aligned Ranks (Hodges-Lehmann) ..................................... 84
      Page's Test for Ordered Alternatives ...................................... 85
  SECTION F. ASSOCIATION, TREND AND SLOPE COMPARISONS AND TIME SERIES .......... 86
    Scattergram (Scatterplot) ................................................... 86
    Determining Association (Correlation) ....................................... 87
      Pearson Product-Moment Correlation Coefficient ............................ 88
      Spearman Rank Correlation ................................................. 89
      Kendall's Tau ............................................................. 90
      Olmstead-Tukey Corner Test of Association ................................. 92
      Phi Coefficient ........................................................... 92
      Yule's Q Coefficient ...................................................... 94
      Goodman-Kruskal Coefficient ............................................... 95
      Cramer's Statistic ........................................................ 97
      Point Biserial Coefficient of Correlation ................................. 98
      Chi-Square Test of Independence ........................................... 99
      Kendall's Coefficient of Concordance W ................................... 100
      Partial Correlation Coefficient .......................................... 101
    Trend and Slope Comparison (Regression) .................................... 102
      Theil Test ............................................................... 103
      Sign-Test for Trend ...................................................... 105
      Sen, Adichie Test ........................................................ 106
      Jaeckel, Hettmansperger-McKean ........................................... 107
    Time Series ................................................................ 108
      Basic Concepts of Time Series ............................................ 108
      Some Classes of Univariate Time-Series Models ............................ 110
      Autoregressive (AR) Process .............................................. 111
      Moving Average (MA) Process .............................................. 112
      ARMA ..................................................................... 113
      SARIMA ................................................................... 114
      Periodic AR Models ....................................................... 114
      Fractional Integrated ARMA (abbreviated ARFIMA) .......................... 114
      State Space Models ....................................................... 114
      Growth Curve Models ...................................................... 115
      Non-linear Models ........................................................ 115
      Time-series Model Building ............................................... 116
      Forecasting .............................................................. 117
  SECTION G. GOODNESS OF FIT .................................................. 118
    Introduction ............................................................... 118

      Chi-Square Goodness of Fit Test .......................................... 119
      Kolmogorov-Smirnov One-Sample Test ....................................... 120
      Kolmogorov-Smirnov Two-Sample Test ....................................... 122
      Lilliefors ............................................................... 123
  SECTION H. MULTIVARIATE METHODS ............................................. 124



    Factor and Principal Component Analysis .................................... 124
    Cluster Analysis ........................................................... 125
    Discriminant or Classification Analysis .................................... 126
    Multivariate Analysis of Variance (MANOVA) ................................. 127

PART III. TABLES .............................................................. 129
  1. NORMAL DISTRIBUTION – AREAS UNDER THE NORMAL CURVE ....................... 130
  2. BINOMIAL DISTRIBUTION – CRITICAL VALUES OF THE BINOMIAL TEST ............. 131
  3. F DISTRIBUTION – CRITICAL VALUES ......................................... 132
  4. T DISTRIBUTION – CRITICAL VALUES ......................................... 135
  6. CONVERTING R TO Z ........................................................ 139
  7. CHI-SQUARE DISTRIBUTION – CRITICAL VALUES ................................ 141
  8. STUDENTIZED RANGE STATISTIC – CRITICAL VALUES ............................ 142
  9. DUNNETT'S TEST ........................................................... 142
  10. MANN-WHITNEY U TEST ..................................................... 143
  11. WILCOXON RANKED SUMS TEST ............................................... 145
  12. WILCOXON SIGNED RANKS TEST .............................................. 147
  13. SAMPLE SIZE REQUIREMENTS ................................................ 149

PART IV. BIBLIOGRAPHY AND APPENDIX ........................................... 151
  CITATIONS FOR PART I. – DISCOVERY METHODS .................................. 151
  CITATIONS FOR PART II. – ANALYTICAL METHODS ................................ 152
  APPENDIX ................................................................... 154


Introduction 1

I. DISCOVERY METHODS

Introduction

The study of the elements of language has long been relegated to the field of linguistics. Letters combine to form sounds, sounds combine to form words, and words combine to form phrases and sentences. Words, phrases and sentences are the building blocks for conveying thoughts, concepts, themes and imagery. They are used to convey, convince, prove and provoke. They are the foundation of literature, poetry, history, law, government and business.

Language is part of every aspect of human existence. Interpersonal communication depends on some form of symbolic language, whether oral, written or signed. Religions, laws and governments find their foundation in words. Entertainment, whether by reading, television, movies or radio, is based on language. Businesses are founded on branding, name recognition, marketing campaigns and persuasive sales. Language is also the foundation of academia. Literature, Composition, Foreign Language, Linguistics, Public Relations, Business Management, History and all like fields are directly dependent on language. Physics, Mathematics, Biology, Chemistry, Medicine, Engineering and other scientific fields are equally dependent on language, in that they comprehend the nature, motives and ends of their work through language.

Language and humanity are interdependent. Where one is, the other will be found; conversely, where one is not, the other will not be found. In light of this point it is easier to see the benefits of language study in all fields: as a rule, the process of language discovery and analysis is accompanied by a better understanding of humanity. Language study always reveals something about ourselves, our individual and collective perceptions of the universe, the relationships between individuals and groups within a society, and our culture, customs, artistic achievements and the social and political movements of a given era or time period. When we view language as dynamic rather than static, we reveal its changes and its progressions.

The purpose of this manual is to discuss methods for text and discourse analysis so as to make apparent that the methods and tools of Document Explorer are applicable to all academic fields. It is expected that the processes of discovery and analysis will be revelatory to research in all fields.

To study language we examine the building blocks of language. These rudiments include words, phrases, sentences, themes and images. Admittedly, this is a simplistic way of defining language and only one of many, yet it suits the needs of this manual well.

This manual introduces and explains tools designed for electronic text analysis, and gives examples of actual and possible research to show the versatility and practicality of these tools in all fields of academic research. The examples, though specific and categorized, should be seen as templates illustrating the practical application of the Document Explorer tools to an array of academic fields.


Section A – Using Word Counts 2

Section A – Using Word Counts

Word counts are counts of:

1. The total number of words in a document.
2. The total number of unique words in a document.
3. The total number of occurrences of a type or part of a word.
4. The total number of groups of words.
5. Related words.
6. Collocated words.

Each of these counts has the potential to reveal useful information about definitions, themes, concepts and images. Document Explorer incorporates search tools to perform counts of punctuation, individual letters, words, sub-words and word parts, or counts by phrases. In addition, these tools facilitate specialized searches by related words or collocated terms. The following sections discuss each of these types of counts, explain what they are, give applications across several fields of study and outline procedures for Document Explorer users.
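Document Explorer performs these counts internally; as a rough illustration of the underlying idea (not the tool's actual implementation), the first two count types can be sketched in a few lines of Python. The tokenization rule below is a simplifying assumption, since the tool's definition of a "word" depends on how the publisher prepared the book.

```python
import re
from collections import Counter

def word_counts(text):
    """Count total words, unique words, and per-word frequencies.

    The tokenizer (runs of letters and apostrophes, lowercased) is an
    assumption for illustration only.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "total_words": len(words),        # count type 1
        "unique_words": len(set(words)),  # count type 2
        "frequencies": Counter(words),    # basis for the remaining types
    }

counts = word_counts("The cat sat on the mat. The mat was flat.")
print(counts["total_words"], counts["unique_words"])  # 10 7
```

The remaining count types (word parts, groups, related and collocated words) build on the same frequency table by filtering or combining its entries.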



Counts of Total Words

Total word count has a long-standing tradition in the classroom. From grade school to the university, writing instructors pose a minimum word requirement for compositions. Though far from an unfailing indicator, there is often a parallel between the holistic quality of an essay and its length. When it comes to text analysis, total word counts have the potential to reveal more than a vague association with quality.

Total word counts can be divided into two categories: (1) the total number of words and (2) the number of unique words in a text. A count of the total number of words assesses the size of the text; a count of the number of unique words assesses the size of the vocabulary used. These two measurements can also be calculated for sections or subsections, or for different features of a text such as author, theme, time period, source or genre.

Total Word Counts

Total word count is a measure of word economy. It can be assumed that the number of words an author devotes to a text reflects the importance of the theme that text addresses. When the text is made up of several sections, each devoted to a different theme, a count of the words dedicated to each theme can be compared to the overall word economy. Data gathered in this manner can also suggest prejudice and partiality that the author may hold: the true value the author places on the subject treated or the audience addressed may be discerned.
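The comparison of per-theme word counts to the overall word economy can be sketched as follows; the section names and texts in the example are hypothetical.

```python
import re

def section_proportions(sections):
    """Given a mapping of theme -> text, return each theme's share of
    the document's total word count."""
    counts = {name: len(re.findall(r"\w+", text))
              for name, text in sections.items()}
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

# Hypothetical two-theme document: 8 of its 10 words go to "economy".
shares = section_proportions({
    "economy": "Taxes rose. Markets fell sharply across every sector.",
    "defense": "Troops deployed.",
})
print(shares)  # {'economy': 0.8, 'defense': 0.2}
```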

Linguistic Applications:

• How does the verbiage of various languages compare for describing a single event – are some languages inherently more concise than others?
• How does word economy correlate with age and development?
• How many words does one student use to describe an event compared to another student? Analyze by age group, e.g., compare kindergarten students to 3rd-grade students.

Social Science Applications:

• What proportion of words in a State of the Union Address deal with a particular issue compared to the other issues in the speech or to the opposing party rebuttal?

• What proportion of the words in a newspaper, news transcript, or magazine deal with a particular issue? Do the proportions imply prejudice or bias?

• What proportion of the words in textbooks deal with issues of a particular minority group and what can be inferred from such data? i

• How many words compose the tax code for the United States? How many for the tax code of the Philippines? What does this imply about taxation of the two countries, about culture and about types of taxation (word counts in income tax vs. sales tax vs. property tax vs. business tax sections)?

• How many words does politician A use in answering debate questions compared to politician B?

• What is the word count for one inaugural address vs. another? Before TV and after TV. At war or at peace?

• What is the ratio of the number of words in the Constitution of the United States regarding a principle to the number of words that the U.S. Congress or Supreme Court uses to articulate and interpret the related law?

Humanities Applications:

• How does the propensity to use more verbiage vary from author to author or from one culture to another? Why?
• How does one genre compare to another genre?

Business and Market Research Applications:

• How many words are in the market survey? (How long is the survey and how much time will it take to read?)
• In a verbatim response to a survey question, which category of respondents has the longest response? (Possibly demonstrates interest level by respondent.)
• How many total words are in the advertisement, web page, or opinion editorial?
• How does the count in the Annual Report compare to similar works?



Procedures: The Search Category Report

The Search Category Report contains general information about the document. Select Search on the Menu and select the Search Category Report. The following describes the contents of this report.

All Words:

Total Words: The total number of words in this search category. Depending upon the way the publisher set up the data for this book or group of books, some of these words may be numbers or punctuation marks.

Average Length: The average length (mean) of all the words in the search category. It is usually smaller than the average length of the unique words because many small words are high-frequency words, which skews the data toward smaller numbers.

Length Std. Dev.: This statistic gives an insight into the clustering of the length data. In this case it means that approximately 84% of the words will have a length below 6.1 characters (4.0 + 2.1). The tool tip prompt for words 8 characters in length indicates these words are in the 82nd percentile.

Average Frequency: The average frequency (count) of all the words in the search category. This is indicated by the red mark on the Low Frq. histogram. Notice that many words occur only a few times and a few words occur many times. This is common.

Frequency Std. Dev.: This statistic gives an insight into the clustering of the data. From this histogram, it is obvious that this is a different distribution of data than seen with the



word lengths. For this reason, the only thing such a large standard deviation (relative to the mean) can tell us is that the data is not clustered.

Please note that to view the Search Category Report on a subsection, first cut and paste the specific subsection of the document into MS Word™, convert that document into a book by using the conversion icon on the Document Explorer toolbar, then open the book in Document Explorer and view the Search Category Report on the selected subsection.
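The length statistics in the report (mean, standard deviation, and the share of words within one standard deviation above the mean) can be computed directly for any word list. The "approximately 84%" figure assumes a roughly normal distribution of lengths; the empirical share can be checked as in this sketch, whose simple tokenizer is an assumption for illustration.

```python
import re
from statistics import mean, pstdev

def length_summary(text):
    """Return the mean word length, its population standard deviation,
    and the empirical share of words no longer than one standard
    deviation above the mean."""
    lengths = [len(w) for w in re.findall(r"[A-Za-z]+", text)]
    m, s = mean(lengths), pstdev(lengths)
    share = sum(1 for n in lengths if n <= m + s) / len(lengths)
    return m, s, share

m, s, share = length_summary("a bb ccc dddd")
print(round(m, 2), round(share, 2))  # 2.5 0.75
```

For a large, roughly normal sample the share approaches the report's 84%; for small or skewed samples, as here, the empirical value can differ noticeably.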



Total Unique Words

A count of the unique words in a text is a count of the size of the vocabulary of an author. (Note that by this definition vocabulary is used in a broad sense and includes the different inflections of a word, so counting the unique words of a text will not provide a lexicon for the text. Assessing lexicons is addressed in the section titled Counting by Related Words.) Counts of unique words, when compared to counts of total words in a document, show the richness of the vocabulary of the text.
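The richness comparison described above (unique words relative to total words) is often called the type-token ratio. A minimal sketch, assuming a simple tokenizer:

```python
import re

def type_token_ratio(text):
    """Unique words (types) divided by total words (tokens) - a simple
    measure of vocabulary richness. Because the ratio tends to fall as
    a text grows, compare only texts of similar length."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("To be or not to be"))  # 4 types / 6 tokens
```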

Linguistic Applications:

• How does the size of the vocabulary change through the different stages of childhood?

• How do language development and word repetition relate?
• What is the size of the vocabulary of a particular language – how many unique words compose the standard newspaper?
• Compare domestic or foreign language authors (or cultures) for vocabulary richness.
• Compare vocabulary growth in bilingual and monolingual speakers.

Social Science Applications:

• What vocabulary distinguishes a particular culture or sub-culture, e.g., gang terminology, Christian terminology, etc.?
• Which newspaper or magazine has the biggest vocabulary or is the most expressive?
• What is the vocabulary that distinguishes one newspaper from another? What is the difference between different newspapers when covering a particular story or theme?

• Which political speakers have the richest vocabulary?

Humanities Applications:

• When two authors each produce a work of 5,000 total words, one with 2,500 unique words, the other with 4,000 unique words – what does that say about the authors?

• How does an author’s vocabulary richness differ between genres?
• How do plays with a similar theme compare in vocabulary use? ii

Business and Market Research Applications:

• In a verbatim response – what category of respondent had the largest vocabulary?
• In an advertisement or web page – how many unique words are used?



PROCEDURES: TOTAL UNIQUE WORDS

The Search Category Report also contains information on the unique words in a document. The following outlines the information given in the report.

Unique Words:

Total Words: The sum of all the unique words in this search category. This will match the number of words shown below the WordWheel. Depending upon the way the publisher set up the data for this book, some of these words may be numbers or punctuation marks. These words are shown on the WordWheel.

Average Length: The average number (mean) of characters in each unique word in the search category. This is also shown as a mark on the bottom axis of the Unique Words – Lengths histogram on the bottom right side of the dialog box above.

Length Std. Dev.: This statistic gives an insight into the clustering of the data. In this case it means that approximately 84% of the words will have a length below 10.7 characters (mean + standard deviation). The tool tip prompt for unique words 11 characters in length indicates that these words are in the 83rd percentile.


Counts of Individual Words

As a discovery tool, counting individual words has several applications, primary among them identifying themes and topic importance. Once themes are identified, they can be examined more carefully by looking at the words in context and the images they build. Discovering topic importance in a text may also provide direction for further analyses and perhaps a better understanding of an author’s approach to and opinion of the topic and the audience addressed.

The basic idea of individual word counts is to count how many times each word occurs in a text. This type of counting can be done not only for the text as a whole, but also for a particular part or voice in the text, such as a character in a novel or a speaker in a political debate. Once the words in the text are listed in order of frequency of occurrence, one can get an idea of the content of the text almost at a glance. The vocabulary arrayed by counting will help the researcher discover something about the text, the author, or the situation in which the text was authored.

This list of unique words can also reveal word groupings. These word groupings can be used in different ways and are covered in depth in the section called Counts by Groups of Words. With counts of individual words a researcher can also assess the expectation for particular word use: predictive models can be established and authorship examined within the framework of those models. The following examples illustrate applications in the various fields.
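A frequency list of this kind can be sketched in a few lines; the simple regex tokenizer is again an assumption:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs sorted by descending frequency --
    scanning the top of this list gives a quick feel for the themes
    of a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

freqs = word_frequencies("to be or not to be that is the question")
# the list starts with the most frequent words: ('to', 2), ('be', 2), ...
```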

Linguistics Applications:

• What are the most common words or types of words used at different stages of language development?

• What are the most common words used in several languages as measured in newspapers – essentially a comparison of semantic content, e.g., man, hombre, ish?

• What words are overused by children of different ages?

Social Science Applications:

• Which topics (foreign policy, welfare reform, domestic economy, etc.) does an author treat most extensively?

• What is the importance of a specific topic in a political platform?

• How many times does a given president reference deity in his inaugural address?

• By performing word counts on political propaganda, what themes and imagery can be identified – what can be inferred about the authors? iii

• How can counts of individual words help the researcher better understand a political speaker – are there contrasts between the speaker’s choice of words and his self-characterization? iv


Humanities Applications:

• What are the 100 most commonly used contextual words in Shakespeare’s sonnets, tragedies and comedies?

• Which words distinguish one author from another? For example, an author may repeatedly use a preposition such as about where around, on, encircling, upon, up to, concerning, or regarding may be equally applicable.

• What words or what kind of words are most common to a particular genre?

Business and Market Research Applications:

• In designing a market survey – what words are overused?

• In examining a verbatim response – what words were used the most when answering a question? This is important for building classification codes for verbatim responses.


PROCEDURES: THE WORDWHEEL

The WordWheel lists all the words in the text along with individual word counts (frequency), word length and Z-score. The WordWheel in WordCruncher is a Windows® list control. The width of the columns can be changed by dragging the column separators (vertical bars), and the scroll bar repositions the word list. To type in the WordWheel, the keyboard focus must first be activated (highlight a line by clicking on it), and typing must begin without hesitation. Note that the Z-score gives users a feel for the difference between the sample frequency of a word and its expected frequency.


Counts by Groups of Words

Once individual word counts are examined, the researcher can begin to tailor counts by targeting specific types or groups of words. A researcher can discover a great deal about the author, the time period and the culture associated with the language by performing counts of word groups.

The procedure for these counts is that the researcher first categorizes words into baskets of similar terms. Grouped terms carry similar connotations such as optimism, pessimism, activity, delay, caution, recklessness, division or faction, union, rebellion, submission, fear, etc. Word groupings can be very extensive and are usually organized around a hypothesized theme. Once the hypothesized categories are set and words that pertain to them are identified, the researcher searches for those words within a text. The occurrence or omission of words in the text is what becomes revealing. Note that for the purposes of word counts, an idiom (or any phrase that is repeated) can often be treated as a word. Document Explorer can search out a single word, groups of words, a phrase, or any combination of words and phrases.

The applications of this type of counting are myriad, as are the potential implications of the results. Some examples are given below to give a feel for the range of applications for this type of word count.

Linguistic Applications:

• How can word-group analyses be used to evaluate communication? v

• How do words of a particular connotation (positive/negative, friendly, angry, etc.) appear in different media, or in a single medium over a specific time period?

• How has popular music changed with reference to a specific basket of terms?

Social Science Applications:

• How can word-group analyses be used in qualitative evaluation of campaign speeches? vi

• How have word-group analyses been used to evaluate WWII Nazi propaganda in film? vii

• What do word-group counts on presidential speeches reveal about the current status of the United States or of another country?

• What word groups are most common in long-term marriages vs. short-term marriages (those that end in divorce)? What are the ratios of positive words to negative words in successful long-term marriages?

• What is the ratio of positive to negative words in different media over time? How does this change compare to events of national or international significance?


• How do word-group counts in student essays correlate with a propensity toward physical violence – is there a connection between the use of violent words and physical aggression?

Humanities Applications:

• Which author most uses words of a positive connotation? Negative?

• When considering the themes of different works, which author uses the fewest words to build a theme and which uses the most – what can be implied from such data?

• What are the most common word groups in Shakespeare’s tragedies – in his comedies?

Business and Market Research Applications:

• In evaluating verbatim responses to a market research survey question, this area facilitates the coding of responses. (Please see the tutorial on Coding, Classifying and Ranking Contextual Data.)

• In evaluating editorials – can the context be classified as positive, negative, ambiguous, ambivalent, decisive, etc.?


PROCEDURES: GROUPS OF WORDS

Words or phrases can be grouped together by placing a + sign between them in the search. Wild cards (*) can list all words of a particular type, and marking the box for “Use all word forms” extends the search to all related forms of the words.
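Outside Document Explorer, the same kind of grouped count can be approximated as below. The +-joined search argument is modeled as a list of patterns, and `fnmatch` stands in for the * wild card; both are assumptions about a minimal equivalent, not the product's internals.

```python
import re
from collections import Counter
from fnmatch import fnmatch

def group_count(text, patterns):
    """Count the hits for a basket of words or wild-card patterns and
    return the total together with the per-word counts."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    hits = {w: n for w, n in counts.items()
            if any(fnmatch(w, p) for p in patterns)}
    return sum(hits.values()), hits

total, hits = group_count(
    "Hope and fear: the hopeful fear little, the fearful hope for much",
    ["hope*", "fear*"])
# total counts every matching occurrence; hits maps each matched word
```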


Counts of Word Parts

Examining the parts that make up a word is a specialty field of linguistics called morphology. Morphological analysis is based on examining how the parts of a word (including roots, prefixes, infixes and suffixes) are put together. Morphology has stronger implications in some languages than others. For example, English relies on morphology much less than German, where it is common to create a single word from other words or parts of words. “Kindergarten” is a German word used in English that demonstrates how German combines words into one. What morphology reveals depends a great deal on how the language uses it.

Applications of morphology in the field of linguistics are readily visible, though for the humanities and political science such applications may not be immediately apparent. What are the uses of morphological analysis? What can be learned by counting morphological occurrences? What can we say of the writer/speaker or the audience?

Verbs: We can look at verb conjugation and other inflections placed on words. These can vary from person to person or from speech community to speech community within the same language.

Pronunciations and orthography: British, Irish, Indian and American English all have different pronunciations. Where these pronunciations are indicated in the orthography, they can be counted by Document Explorer.

Linguistics Applications:

• When do children begin to use particular morphemic constructions such as past tense verb conjugation and the possessive suffix?

• How does child speech differ from adult speech in the creative use of prefixes and suffixes?

• How do foreign languages differ in the manner in which they use prefixes, suffixes and infixes?

Social Science Applications:

• How do different social classes use prefixes and suffixes?

• Does geographical location correlate with non-standard use of prefixes and suffixes?

• How are prefixes and suffixes used in music lyrics? How does this vary among the different types of music, e.g., rap, country, pop, opera?


Humanities Applications:

• In transcripts of theatrical works and in novels – what is the correlation between dialectical variations as indicated by non-standard prefix and suffix use and character development? What might the correlation reveal about the author’s views of the speech community that typically employs such variations?

• How does Shakespeare use prefixes and suffixes? How does prefix and suffix use differ from Middle English to Modern English?

PROCEDURES: PARTS OF WORDS

The use of word parts can be discovered by first examining the WordWheel to see which prefixes, infixes, or suffixes are attested. Researchers can then use a wild card marker (*) to search for specific substrings (word parts). Note that by using Boolean strings, a combination of word parts and word groups can be searched.
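A hand-rolled version of a suffix search (the analogue of searching `*ness`) might look like the sketch below; the tokenizer is again an assumption:

```python
import re
from collections import Counter

def suffix_counts(text, suffix):
    """Tally every word ending in the given suffix, excluding the bare
    suffix itself -- a simple stand-in for a '*ness'-style search."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w.endswith(suffix) and w != suffix)

hits = suffix_counts("Darkness and kindness met with sadness in the dark", "ness")
# darkness, kindness and sadness match; "dark" does not
```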


Counts of Related Words

A count of the unique words numbers each form of a word separately (e.g., hate, hateful, hatred). A lexicon, by contrast, counts all the various forms of a word as the same word. For example, the words eat, eats, eaten and ate are different, yet are based on the same lexical item and represent one entry in a lexicon. Counted this way, a text might have 8,000 total words and 4,500 unique words, but based on a lexicon only 3,000 words.

Discovering and analyzing the lexicon of an author can reveal a great deal. A lexicon is developed as a result of encounters with the world. Hence, by examining the quantitative and qualitative properties of the lexicon, it is possible to discover something of the level and type of education, or the diversity and nature of life experiences, that an author has had.

An individual’s lexicon changes as time passes, whether intentionally or subconsciously. Authors who are very aware of their word use, probably most common in politics, adapt their words to fit the audience or topic addressed. Other authors’ lexicons change as they traverse life phases. Researchers can examine the lexicon of an author as a whole, or for as many of their works as can be found. By examining the lexicon of an author at particular points in the author’s career and comparing data gathered from those analyses to life events, the researcher often discovers correlations that contribute to a deeper understanding of the author.

Linguistic Applications:

• When do children begin to learn to expand their vocabulary with related words – when do they begin to use all verb inflections vs. just one or two?

• How rich is a particular language in its availability of words for expression?

Social Science Applications:

• Are certain forms of words more prevalent in a particular culture, in a particular newspaper, in speeches by a particular political party or in works by a particular author?

Humanities Applications:

• How many different related words does Shakespeare use to convey a particular thought or image?

• How does an author use related words to stress a particular message?


Business and Market Research Applications:

• Use of the various lexicons in Document Explorer/WordCruncher facilitates “Word Groupings” and helps build classification codes for verbatim questions.

PROCEDURES: ALL WORD FORMS

In the Search window, mark Use all word forms to search for all related word forms. This function automatically includes all related word forms for individual words and groups of words in a single lexical search.
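The effect of folding related forms into one lexicon entry can be sketched with a lemma table. The table below is a toy illustration only; Document Explorer's lexicons are far larger and are supplied with each book.

```python
from collections import Counter

# Toy lemma table for illustration only; a real lexicon maps every
# inflected form in the book to its dictionary headword.
LEMMAS = {"eats": "eat", "ate": "eat", "eaten": "eat",
          "hateful": "hate", "hatred": "hate"}

def lexical_counts(words):
    """Count words by lexicon entry, so eat/eats/ate/eaten all add to
    the single entry 'eat'."""
    return Counter(LEMMAS.get(w, w) for w in words)

counts = lexical_counts(["eat", "ate", "eaten", "hate", "hatred", "dog"])
# six word tokens collapse to three lexicon entries: eat, hate, dog
```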


Counts by Collocated Terms

Collocated terms are words that are co-located, or located near each other. Sometimes referred to as correlated or neighborhood terms, collocation is a great instrument for analyzing the content of a text. For example, in the King James Bible, by performing a collocation for the word love the researcher finds that it is most commonly associated with the words hate, neighbor and husband. These three words are called collocates of the word love in the King James Bible.

Collocated words are listed in an array with the statistical properties of counts, frequency and expected values. The list represents words located within a specified proximity to the search word. Arrayed by frequency, the list lets the user peruse the words most closely linked to the searched term. Simply viewing collocated terms from two different authors, newspapers, or political platforms can be very revealing. Collocations can be a tremendous tool for examining content, authorship, style, theme and image analysis. The statistical properties of collocated terms are addressed in a later section, but it should be pointed out that sorting by the various statistical parameters can reveal a different ranking of the collocated terms.

Applications:

• What terms are generally collocated in a particular text?

• Why does the author collocate those particular terms?

• What are the statistical values associated with the Document Explorer collocation report?

• Are the occurrences of the collocations statistically significant?

• How can collocated terms assist in building classification coding for verbatim responses or media editorial evaluation?


PROCEDURES: SORT BY NEIGHBORS REPORT

The Sort by Neighbors report shows the occurrence-related data of words found adjacent to the search anchor word (the first word in the search argument). This report is always visible on the Sort by Neighbors tab, and all options that affect its appearance can be set on the Sort by Neighbors tab of User Preferences. Below is a list of the data viewable for any search.

Viewable Search Information

Neighborhood Width
This option determines the size of the report. The maximum neighborhood size is 25 words before and after the anchor word. In the example from the Constitution Papers, with the search results for the word "freedom" (67 hits), the following table shows the number of unique words in the neighborhood based upon neighborhood width.

These numbers are representative; the numbers will be different for every search. Notice the range from the maximum (25,25) to the two minimums (0,1 or 1,0). The usual neighborhood size is 10 or less; words removed much further than that usually have fewer associations with the search anchor word. The cases where the minimums are used are also unusual; however, they are valid and for some studies very useful.

Rating
This is a custom Document Explorer/WordCruncher statistic: it varies between 10 and -10. A rating greater than zero is shown in black and means you might wish to pay attention to these words. The explanation that follows is technical and requires exposure


to statistical concepts. The rating is defined as: ((word Z-score − average Z-score) / standard deviation of the Z-scores) × 2. A rating greater than 10.0 is set to 10, and a rating less than -10.0 is set to -10. This statistic has been normalized to allow comparisons between reports.

Z-score
Again, the explanation is technical. This statistic is designed to give you a feel for the difference between the sample frequency of a word in the neighborhood and the expected frequency (based on the ratio of the total neighborhood size to the total size of the document). The difference is divided by the standard deviation of the sample to normalize it for comparisons between two reports. This value is used in computing the rating.

Sample Frq
The frequency count of the number of times this word occurs in the neighborhoods.

Total Frq
The frequency count of the number of times this word occurs in the total book.

Percent
The ratio of sample frequency to total frequency (e.g., if the sample frequency is 10 and the total frequency is 67, the percent is 14.9%). This may be more useful than the sample frequency alone because it incorporates the total frequency as well.

Expected Frq
The ratio of the neighborhood size to the book size, times the total frequency (e.g., if the neighborhood is 1/20th the size of the book, then we would expect to find 1/20th of a word's occurrences in the neighborhood).

Std Dev
The standard deviation of the expected frequency (gives a feel for the clustering of these frequencies).

Sort Sequence
The default sort sequence for this report is by rating (descending). You can change this by clicking on the column header you wish to use as the primary sort key. Clicking the column header again toggles the sort direction between descending and ascending.
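Putting these definitions together, a minimal collocation report can be sketched as follows. The binomial model used for the expected frequency and its standard deviation is an assumption made for the sketch; WordCruncher's exact computation is not spelled out here.

```python
import math
import re
from collections import Counter

def collocation_report(text, anchor, width=5):
    """For each word within `width` words of an anchor hit, compute
    sample frequency, total frequency, expected frequency and a
    z-score, returned sorted by z-score descending."""
    words = re.findall(r"[a-z']+", text.lower())
    total = Counter(words)
    book_size = len(words)
    sample = Counter()
    hood_size = 0
    for i, w in enumerate(words):
        if w == anchor:
            neighbors = words[max(0, i - width):i] + words[i + 1:i + 1 + width]
            sample.update(neighbors)
            hood_size += len(neighbors)
    report = []
    for w, sf in sample.items():
        p = total[w] / book_size           # chance a random word is w
        expected = hood_size * p           # Expected Frq
        std = math.sqrt(hood_size * p * (1 - p)) or 1.0  # Std Dev
        report.append((w, sf, total[w], expected, (sf - expected) / std))
    return sorted(report, key=lambda row: -row[4])
```

Each row holds (word, Sample Frq, Total Frq, Expected Frq, Z-score); overlapping neighborhoods can count the same occurrence more than once, as in the product's report.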


Section B – In-Context Searching

The preceding section of the manual focused on how various types of counts can be used and their possible applications across different fields. This section discusses discovery techniques. Many of the methods discussed in Section A provide a foundation for Section B – for example, counting individual words and word groups provides an overall view of the themes found in a text. These themes can then be scrutinized using the techniques explained in this section. In-context searching facilitates an examination of word use, meaning, and the consistency of both throughout a text or across a group of texts. Document Explorer incorporates in-context search tools that allow the researcher to view how a particular word is used. One objective of searching in context rests on the fact that words may vary in meaning according to authorial discretion; the principle that words can vary in meaning is one aspect of language that makes it so versatile and creative.

Meaning and Word Use

By examining word use in context the researcher is able to see the meaning that the author gives to particular words or phrases. Any word or phrase in a text has the potential to carry a spectrum of meanings. A good example is the word love in English. The left column shows some possible uses of the word love, with the corresponding semantic content on the right:

John loves God.           Devotion or adoration.
Mary loves her husband.   Affection.
Sue loves Bob.            Romance.
He loves Pizza Hut.       Preference.
He loves to golf.         Derives pleasure/enjoyment.
Joe loves women.          Lusts sensuously.
Jim loves money.          Lusts greedily.

The technique of examining how an author uses individual words is a common method of literary and political analysis. It can reveal much about the tone (for example, positive or negative) of the text. Political spin is based on contextual word and phrase use. A fascinating exercise is to compare newspapers, magazines and news programs for contextual definitions of words; the contextual views can reveal slants toward liberal or conservative prejudice. Characteristics of contextual definitions are often a reflection of the author and his appraisal of the content, audience, time period, genre or medium. An extension of searching for contextual definition is searching for definition consistency across a text or group of texts.


Consistency in Word Use

When researchers study the many uses of a word, they often do so across a series of texts rather than within an isolated text. A good example is the field of law, where legal precedent often hangs on the consistent in-context use of a word or phrase. Lawyers and their researchers must look through the documentation of many related cases and identify where and how a term is used to find a contextual definition that suits their needs. Other fields, such as the humanities, examine consistency of term usage within a document or across a series of texts. Searching across authors, genres, or a given time period allows researchers to examine word use and consistency. Political science researchers can scan for term usage across a library of newspaper articles or campaign speech transcripts and compare term use.

In the past, much of this research was done manually. With Document Explorer, the computer shuffles through the paperwork, leaving valuable time for the researcher to observe the data and expand the search to related words or collocated terms. Throughout their career, authors may use a word or phrase consistently or with variation. Variation in word use may be a result of an author’s personal views, experience, writing style, or something else entirely. The following examples provide additional applications of contextual searching.

Linguistic Applications:

• How does an increase of variations in word use correlate with age and development?

• How has the use of euphemisms developed in American newspapers?

• How is a particular term used differently across fields – for bond: in law, bail bond; in business, stocks and bonds; in science, chemical bond; etc.?

• Which English words have the broadest spectrum of meanings?

• Which language averages the most meanings per word – what can be implied from this data?

Social Science Applications:

• How are the terms love, hate, commitment, etc. used in successful vs. non-successful marriages?

• How are terms such as morality, virtue and values used in the news media?

• Do Jay, Hamilton and Madison use tyranny similarly in the Federalist Papers?

• How are politically correct terms used in today’s newspapers compared to those from 1980, 1960, 1940, etc.?


• Are the terms in a proposed bill used the same as they were used in preexisting laws?

Humanities Applications:

• What particular manner of using given terms or phrases distinguishes Shakespeare from another author? viii

• How broadly does an author define a term throughout a single work?

• Does a set of authors use a term consistently or differently – if so, how?

• How do the theme and imagery within a text change as the contextual definitions of terms change?

• How does an author use words to build up imagery – are there patterns in the number of words or kinds of words used?

Business and Market Research Applications:

• Is the market survey designed using a uniform vocabulary across the entire survey? Is the advertisement designed with a uniform vocabulary?

• In a verbatim response to a market survey – what terms are the same yet have different contextual definitions? How does this change the coding and classification of the responses?


PROCEDURES: IN-CONTEXT SEARCHING

After searching for a word or phrase, the Reference List window shows the “hits” in context and gives the location reference for each hit. The Reference List window is a container that holds smaller reference windows. The following options apply to all three tabs:

1. Reference windows can be deleted until only one remains.
2. As many reference windows can be added or opened as the user desires.
3. The scroll bar or mouse wheel changes which reference window is highlighted.

Options and Elements of the Reference List Window

Reference Window
Each reference window has a small text window and a citation line, along with a view option, an occurrence number, a select option and a delete option, each explained below.

Citation Line
The citation line shows the location of the reference within the work being searched.


View Option
Clicking the View option lets you view searched words and phrases in a broader context in a text window.

Occurrence Number
A serial number assigned to each hit (search result).

Select Option
Clicking the Select option transfers a reference from the Reference List window to the Selected References window.

Delete Option
Selecting the Delete option deletes references from the Reference List window.


Section C – Discovering with Runtime Concordances

In its simplest form a concordance is a listing of all the words in a text, given within their respective contexts. For example, a concordance for a literary work such as Mark Twain’s Tom Sawyer would list every word of that book in alphabetical order, each word accompanied by a pre-specified amount of the text that occurs before and after it in the actual document. A runtime concordance, by contrast, lets you search any word and see the contextual uses of that word. The search program in Document Explorer is a runtime concordance.

Advantages of this concordance builder over a printed concordance include (1) speed – users can build a concordance faster than turning pages to find the cited word; (2) flexibility – users can build a concordance for groups of words or phrases; (3) specificity – users can build a concordance based on collocated terms; and (4) wild cards – users can build a concordance using wild cards for prefixes, suffixes, or word roots.

The ideas of examining word use with the search window and building runtime concordances are based on the principle of viewing results in context. Document Explorer can build an enormous reference list of hits that can be exported for review and examination. With the use of wild cards (*), an entire book or library of books can be exported to a physical concordance or viewed at runtime. A physical concordance displays all of the words at once (which can be a huge amount of data) and confines the researcher to the parameters set for the number of words before and after each hit – limiting the concordance’s contextual view. The runtime concordance lets the user quickly broaden the contextual framework, expanding the contextual view of a particular reference.

Concurrent examination of many words and phrases and the ability to control contextual view parameters are powerful tools for research. Although these tools are not limited to linguistics, the humanities, political science and foreign languages, examples of research that has been done in these fields with the use of a concordance are given below.
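A bare-bones runtime concordance in the keyword-in-context style can be sketched like this; the tokenization and matching rules are simplifications of what a real concordancer does:

```python
import re

def concordance(text, term, width=4):
    """List every occurrence of `term` with up to `width` words of
    context on each side, in order of occurrence."""
    words = text.split()
    pattern = re.compile(r"\W*" + re.escape(term) + r"\W*", re.IGNORECASE)
    lines = []
    for i, w in enumerate(words):
        if pattern.fullmatch(w):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

hits = concordance("It was the best of times, it was the worst of times.", "times")
# two hits, each bracketed with up to four words of context on either side
```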

Linguistic Applications:

• Does authorial gender correlate with the way that particular words are used? ix • Use the concordance program to compose a dictionary to teach foreign language

vocabulary. x

Social Science Applications:

• When two political platforms are compared, is term use consistent between them? • When two legal documents are compared, is a specific term used with consistent

or variant contextual meaning?


• When comparing use of “politically correct” terms in newspapers it will be seen that these terms generally change over time. What can be implied from these changes?

• For terms related to Civil Rights – freedom, prejudice, privacy, etc. – how are these terms contextually used under different forms of government, e.g., Communist, Capitalist, Socialist?

Humanities Applications:

• How can concordances be used to detect plagiarism? xi • What are the contextual definitions for the terms justice, peace and freedom as

used by one author compared to another?

Business and Market Research Applications:

• How can the contextual searching and coding applications be expanded to include multiple classification categories?

• How can runtime concordances be useful for adjusting classifications on the fly?


PROCEDURES: SINGLE AND MULTIPLE-WORD CONCORDANCES

The runtime concordance is viewed from the Search Reference List window. Note how users can build multiple concordances very quickly. A runtime concordance can be built for a combination of words and for all word forms. The Reference List can be exported to a file, to be edited or used in print material, by selecting File, Save Reference List.

Single Word Concordance

Combination words and all word forms


Section D – Comparing Readability Levels

Readability levels are a measure of the ease or difficulty with which a text can be understood. A common formula for calculating this is the Flesch-Kincaid grade level: Grade Level = 0.39 × (average words per sentence) + 11.8 × (average syllables per word) − 15.59. The formula is based on the average number of words per sentence plus the average number of syllables per word, adjusted by the constant factors shown. There are many existing formulas for calculating readability, all of which have their strengths and weaknesses. For our purposes, we are not looking at absolute readability but at comparing the readability of different texts. The Flesch-Kincaid formula has proven accurate enough that it is built into Microsoft’s word processor, Word. We will use this method in comparing readability between texts.

In the private, public and commercial sectors, readability levels have been used in various ways. Some use them to promote reading by guiding readers to publications with a readability level that corresponds to the reader’s level – typified by libraries that assign a readability level to books. Others suggest that any document conveying critical information, from legal forms to websites, should be tested for readability in order to increase the probability of successfully conveying the information. The readability level of the Miranda rights has been examined in order to determine the chances of that information being unsuccessfully conveyed. Readability levels can reflect authorship, or the degree of education or intelligence of the author or the audience, as the author may choose to write or speak with more simplicity or complexity depending on the audience. The numerous causes of discrepancies among readability levels are not always immediately perceptible; the process for discovering them is to calculate and compare readability levels.
The following are general examples of comparing readability levels across:

• two or more different authors
• several different speeches from the same author
• different genres, e.g., fiction vs. prose
• one time period and another
• one geographical location and another

These examples of readability comparisons have been used in actual research, and many other types of comparisons can be devised according to the needs and creativity of the researcher. Such research can reveal differences, but in order to claim that a difference is significant the researcher must use the statistical methods outlined in the analytical procedures section.
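For readers who prefer to script the calculation, the Flesch-Kincaid formula can be sketched in a few lines of Python. This is an illustrative implementation, not part of Document Explorer; the syllable counter is a rough heuristic (vowel runs with a silent-e adjustment), so its grade levels will differ slightly from Word's.

```python
import re

def count_syllables(word):
    """Approximate syllable count: vowel runs, with a silent-e adjustment."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Because the formula is linear in its two averages, very short simple sentences can yield grade levels below zero; for comparing texts, only the relative ordering of the scores matters.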


Linguistic Applications:

• How do the readability levels of spoken vs. written texts compare to each other at different ages/developmental stages?

Social Science Applications:

• What are the readability or grade levels of speeches given by several presidents of the United States? xii
• What is the difference in readability levels between responses to questions in a live debate and scripted speeches?
• How do Congressional bills or documentation from federal agencies (IRS, EPA, etc.) compare in readability to media that are commonly and generally accessed by the American people, such as newsprint, news broadcasts, pulp fiction, etc.?

Humanities Applications:

• What does readability level say about an author, his audience, or his topics?
• How does one author's readability level differ from another's? (e.g., How does Shakespeare's readability level compare to another author of that era such as Hobbes or Bacon? Why?)

Business and Market Research Applications:

• What grade level is the advertisement or the opinion editorial?
• How do you classify verbatim responses to a market survey by education level?
• If you have the respondent's education level, the readability level can either confirm it or indicate the interest level of the respondent.


PROCEDURES: CALCULATING FLESCH-KINCAID

This is the procedure for calculating readability levels with Microsoft Word. Select the Tools menu, then Spelling and Grammar. When the Spelling and Grammar checker opens, click Options. Finally, check Show Readability Statistics under the Spelling & Grammar tab (the lowest box). When Microsoft Word finishes checking spelling and grammar, a dialog box will display information about the reading level of the document, including a readability score calculated by the Flesch-Kincaid formula.


Section E – Examining Collocated Words 33

Section E – Examining Collocated Words

Collocated terms are terms that are co-located, or located near each other. Performing a collocation count on a word in a text lists the words that most commonly occur near that word. In the King James Bible, for example, the researcher finds that the word love is most commonly used along with the words hate, neighbor and husband; these three words are called collocates of the word love. Collocation counts can help users discover themes and recognize imagery as well as analyze aspects of an author's style. The associations an author gives to words, both of contrast and of similarity, are much more easily identified in the results of a collocation count. The applications of collocation are limited only by the ingenuity of the researcher. Below are some examples of ways in which collocation has been used in research.
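The core of a collocation count is simple to express in code. The following Python sketch (an illustrative helper, not part of Document Explorer) tallies every word that falls within a given neighborhood width of each occurrence of a node word:

```python
from collections import Counter

def collocates(tokens, node, width=5):
    """Count words occurring within `width` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - width)
            hi = min(len(tokens), i + width + 1)
            for j in range(lo, hi):
                if j != i:          # skip the node word itself
                    counts[tokens[j]] += 1
    return counts
```

Calling `collocates(tokens, "love")` on a tokenized text returns a frequency table whose most common entries are the collocates of love, analogous to the "Sort by Neighbors" listing described below.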

Linguistic Applications:

• What patterns of collocated words characterize a particular author's works?
• What words are most commonly collocated by ESL students?
• What terms are commonly collocated by native English-speaking children at different ages, e.g., at what age do freedom and speech begin to be collocated?
• Which languages consistently rely on collocated terms (including reduplication) for expression? For example, in Mandarin Chinese chr fan is a common collocation which literally means eat rice but has come to be used as eat any food at all.

Social Science Applications:

• What subcultures collocate words such as faith and god or big and government most frequently?
• What are common collocations found in rap, country, classic rock, alternative, classical symphonic, elevator music, etc., and what do these collocations reveal about the cultures that produce and consume such music?
• How do collocations vary from text to text as the nature of the content of those texts varies, e.g., what are collocations common to texts dealing with U.S. military histories vs. collocations found in texts dealing with the history of U.S. civil rights?

Humanities Applications:

• When collocations are performed on an author's works, what aspects of the author's life become salient? xiii
• How can imagery in a text be discovered with collocated terms? xiv


Business and Market Research Applications:

• What words are consistently collocated with the classification grouping words? How does this change the classification word groupings? (bad, not bad)

• What words are commonly collocated with the advertisement “hooks”?

PROCEDURES: THE COLLOCATION WINDOW

Collocations are shown in the “Sort by Neighbors” window. Double-clicking a specific word shows the collocation hit in context. In a collocation search on love in Hamlet, for example, highlighting a collocate (liar) changes the reference list to show the collocated term in context in the upper window. Select Help from the menu for explanations of the statistics of the collocated terms. The parameters for collocation statistics can be set or altered by clicking the Report Preferences icon (the middle icon) on the Search Results window.
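The statistics shown for each collocate (expected frequency, standard deviation, z-score) compare the observed neighborhood frequency with what chance alone would predict. The sketch below illustrates one simple chance model in Python; it is offered for intuition only and is not necessarily the exact formulation Document Explorer/WordCruncher uses.

```python
import math

def collocation_z(observed, collocate_total, corpus_size, node_freq, width):
    """Z-score for a collocate under a simple chance model.

    p        = collocate's overall relative frequency in the corpus
    expected = p * (number of tokens inspected across all neighborhoods)
    """
    p = collocate_total / corpus_size
    window_tokens = 2 * width * node_freq   # tokens inspected around the node word
    expected = p * window_tokens
    std_dev = math.sqrt(expected * (1 - p))
    return (observed - expected) / std_dev
```

For example, if a collocate occurs 50 times in a 100,000-word corpus, the node word occurs 200 times, and the collocate appears 12 times within a width-5 neighborhood, only about one co-occurrence is expected by chance, so the z-score is large and the pairing is worth investigating.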


Section F – Style Analysis 35

Section F – Style Analysis

To begin, the reader should note that there has been a great deal of research in the area of style analysis, with authorship studies at the forefront of the field. Such research projects are usually ongoing and quite lengthy. The aim of this section is not to provide an exhaustive history of style analysis but to reduce the process to its essentials and explain how to apply those essentials to research. Style analysis fundamentally seeks to distinguish one text from another and to establish or discover something of the origins of the text analyzed. This section defines what style is, explains why style analysis is significant and discusses how to formulate research in order to identify styles.

What is Style?

Style is a flexible term. Most students have at least a feeling for what the word can mean; in literature, style is related to the form of expression that an author uses. In order to offer a methodology for style analysis, a clear definition of style must first be provided. For the purposes of textual analysis, style is the conglomerate effect of any number of distinct features which, taken as a whole, set a text apart from other texts. When people speak of Shakespeare's style, they refer to the features of Shakespearean works that together make his works distinctly Shakespearean.

Since features are what make up style, it is also worth defining what is meant by features. Features are components or facets of the text that can be identified and examined independently of other components or facets of the text. Textual features relating to the form of composition include such things as punctuation, orthography, sentence structure and word choice. Features relating to content include theme and poetic devices like metaphor.

These kinds of features are everywhere in a text; one could argue that the text is entirely composed of such features. To build a style, the features must occur in some sort of pattern. To illustrate, consider the feature of sentence complexity. It is part of almost every text, but in order for it to contribute to a style the researcher must identify a pattern of occurrence, such as compound sentences occurring only at the end of paragraphs, or a complete absence of simple sentences throughout a given text. Thus style consists not merely of random features clustered together to form a text but of identifiable patterns in the features of a text. Style analysis and feature pattern analysis are used interchangeably in this section.


The Significance of Style Analysis

Discovering the style of a text would be of little interest if the process never got outside the text itself; if all that is known about a text is how its features are organized, not much is known. As it turns out, style always points back to some aspect of the origin of the text. This correlation between style and origin pushes the study of style beyond the text itself and grounds style analysis in the real world. The aspects of textual origin that correlate with style are:

(1) the genre of the text, such as poetry, expository, narrative, etc.
(2) the medium by which the text is presented, for example book form, newsprint, conversation, etc.
(3) the era in which the text originated, e.g., during WWII or early in the 19th century
(4) the geographical area from which the text originated: Australia, New England, Rome, etc.
(5) the author of the text: Shakespeare, Joseph Conrad, a college student, the United States Supreme Court, etc.
(6) the nature of the content of the text, for example medical, military, academic, personal, etc.
(7) the audience addressed by the text, such as citizens of a nation or a particular ethnic or age group

The point is that whenever patterns in features are discovered, there will be a correlation between the patterns and one or more of these aspects of textual origin. The example questions below illustrate some ways in which feature patterns and aspects of textual origin could correlate. Each bulleted item is followed by a parenthetical label for the aspects of textual origin that the question correlates to style.

Linguistic Applications:

• How can style be analyzed to classify texts by genre? xv (genre)
• What is the difference in speech styles between adolescents and adults? (author)
• What aspects of syntax vary from spoken discourse to written texts? (medium)
• How do features of style vary between texts explaining medical procedures and texts explaining military procedures? (nature of content)

Social Science Applications:

• How do feature patterns in conversational speech differ from those of public discourse? (medium, audience)

• How do the feature patterns of speeches by a given political candidate differ from one audience to another – when the audience is composed mostly of African Americans vs. Caucasians? (audience)

• How do styles of congressional bills vary over time? (era)
• What features of the Federalist Papers of undetermined authorship have been analyzed to evince authorship by one party or another? (author)


Humanities Applications:

• How are patterns in features of expository texts from the early 19th century distinct from patterns in the features of persuasive texts from the same era? (era, geographical area, genre)

• When authorship is unknown, how can style analysis be used to establish a probable author? xvi (author)

• What feature patterns distinguish personal communications by a group of famous authors from their professional writings? (nature of content)

• What contrast does style analysis reveal between music lyrics and written poetry? (medium)

Business and Market Research Applications:

• Is there a particular style to the advertisement?
• Is there a pattern in the survey design?
• Is there bias in the question outline?

Discovering Style

There is no single formula for discovering style; remember that a style consists of enough features occurring in identifiable patterns to make the text distinct from other texts. Every researcher has different motives for performing style analysis: some may want to examine authorship, others seek to discover the distinctness of one medium versus another, and still others might want to know how style has changed over time for a given location or genre. The researcher will begin stylistic analysis with a general research question relative to one or more of the seven aspects of origin, perhaps, “What are the characteristics of Emily Dickinson's style?” or “How do the speaking styles of W. J. Clinton and Ronald Reagan differ?” After collecting the appropriate texts, the researcher will begin a search for patterns in textual features. Most features of a text fall into the three categories below. Under each category are examples that show the researcher how to frame questions in order to discover feature patterns. The bolded words indicate textual features.

1) Patterns in the content: What is in the text?

• What kinds of inflections occur in the texts – which verb conjugations are omitted that are generally common to texts?

• What grammatical devices are used – are sentences simple, complex, compound?
• What kinds of words are used – what is the ratio of function words to content words?


• What orthographic variations are present – analyze vs. analyse?
• What themes are present – good vs. evil, redemption, childhood innocence, etc.?
• What poetic devices are employed – is metaphor, rhyme or meter present?
• What word groups are present – considering the 100 most common words, are they optimistic/pessimistic, legal, dynamic/sedentary, foreign, etc.?
• What collocations are present – freedom, speech and press; right and bear arms?

2) Patterns in discretionary use: How is the content being used?

• Where in the text do inflections occur?
• How are words selected – as a function of character development, poetic device or authorial dialect?
• What terms are used to build up imagery – are themes formed by patterns of rhyming words, by use of idioms or by a set of terms being collocated repeatedly?
• Is a word used denotatively or metaphorically – is head used to mean the body part or the uppermost portion of another entity such as a line of people or a river?
• Is a word used denotatively or connotatively – is home used strictly to indicate a dwelling place or to bring to mind impressions of family, security and inclusion?
• What types of novel or non-standard word use are evident – are there instances of backformation, e.g., orientate for orient; borrowing, e.g., avant-garde, carpe diem, amigo; vulgarity; blending, e.g., smog for smoke + fog; clipping, e.g., meg and net for megabyte and internet; coinage or word manufacture, e.g., musquirt for the clear runny juice that always comes out before the mustard?

• What are the dynamics of the semantics – are contextual definitions consistent or do they vary?

• What are the dynamics of the pragmatic aspects of the text – how are aspects of truth, quantity and relevance of information, etc. manipulated, e.g., to a humorous end or otherwise?

• What is the nature of the collocations in the text – are the collocations repeatedly composed of only two or three words or are they composed of several words? Do the collocations play any role in organizing imagery?

3) Patterns in associations: How do things fit together?

• Which words are regularly collocated – dog and cat, man and woman, friendly and fire, etc?

• What patterns in function word ratios and word groups are evident - what is the ratio of a collocation such as to x to (where x is any word) over all the instances of to?

• What kinds of punctuation are used, and where?
• What words start and end sentences – what is the percentage of sentences that start with a, an, and, in, it; that end with it; that have the or with as the second-to-last word?
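Sentence-position counts like these are easy to automate. The following Python sketch (a hypothetical helper, not a Document Explorer feature) reports the percentage of sentences beginning with each word in a list of candidate starters:

```python
import re

def sentence_start_profile(text, starters=("a", "an", "and", "in", "it")):
    """Percent of sentences beginning with each of the given words."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    profile = {}
    for w in starters:
        hits = sum(1 for s in sentences if s.split()[0].lower() == w)
        profile[w] = 100.0 * hits / len(sentences)
    return profile
```

Comparing these profiles across two texts is one concrete way to turn a suspected feature pattern into numbers that the statistical tests of Part II can evaluate.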


PROCEDURES: STYLE ANALYSIS

Style analysis is basically pattern analysis, and Document Explorer is very useful for pattern discovery. Document Explorer's search program allows the user not only to view all the hits in context but to view the entire document with the hits highlighted. As users search on thematic words, browse the WordWheel for terms or find collocated terms, patterns in lexical choice become immediately evident. Searching all related words also reveals word patterns across a lexicon of related terms.


Section G – Translation Consistency: Word Use, Themes and Imagery

This part of the manual is divided into two segments. The first is a short summary of translation itself: an overview of the essential nature of translation, common problems in translation and an explanation of how to use Document Explorer to avert some of those problems. The second segment discusses how to use Document Explorer tools to evaluate prior translations.

Translation

Before explaining how to use Document Explorer for translation, it will be helpful to identify the assumption this explanation makes: that the essence of translation is to convey, through the text in the second language, the meaning that the author intended to convey through the text of the original language. This is the definition of translation that the manual uses. Translation is not merely a matter of changing all of the words of the original language into corresponding words in the second language; even when the translator is familiar with both languages, several factors make it a difficult task. Here are some examples of the complex nature of translation:

• When the second language does not have the cultural aspects to support the symbolism of the original. Consider modern translations of the Hebrew Old Testament: ancient writers such as Isaiah relied heavily on metaphor and symbolic speech. Phrases such as “the ships of Tarshish” and “cedars of Lebanon” held significance for the Hebrew culture of Isaiah's day, but Modern English lacks the cultural context for those symbols to hold the same significance.

• When technical terms of the original language do not exist in the second language. Imagine translating a tractor repair manual from English to Arabic. How would the term “nine toothed dog” be rendered in Arabic if it indicates a gear in the transmission of the tractor? To translate the words exactly as they stand in English could result in unwanted confusion.

• When a word has various meanings in the original language the translator must decide how the word is being used in order to make an accurate translation. Take run for example – my nose is running, they ran three miles, the river ran dry, his blood ran hot, she ran to the store, they ran the machine all day, Joe ran for a public office, run up the flag, etc.

As these examples illustrate, a word can have many potential meanings. The translator must take into consideration not just the isolated words but their individual contexts as well. Additionally, meaning exists on more levels than just that of the word – idioms and metaphors are composed of several words that convey a holistic meaning which exists above or beyond the meaning of the words themselves. This holistic meaning is what must be translated.


Discovery Before Translation

Document Explorer tools help the translator understand the relationship between the language of the original text and the meaning that it conveys. Discovering the source document before translation is one way to ensure a more accurate translation. The greatest authors speak through imagery and not via mere words. If, when analyzing a document prior to translation, it is difficult even to search out all the occurrences of a particular word, it is that much more complicated to discover themes, find imagery and analyze style – hence the importance of a tool that allows the discovery and comparison of images in a text. These are some ways in which Document Explorer tools can be used to discover a text prior to translation:

• in-context searching – to understand word use and identify themes and imagery
• collocations – to identify themes and imagery
• style analysis – to note how textual features are organized
• run-time concordance – to examine how much variation there is in word selection

Evaluating Prior Translations

Post-translation analyses are done by comparing the translated document to the source document, by comparing several independent translations of the same document to each other, or by a combination of both. Document Explorer enables the researcher to evaluate several translations of a single document by comparing them to each other, even when the researcher is not familiar with the source language. Alternatively, when fluent in the source language, the researcher can use synchronized windows to compare the source document and the translated documents line by line. Procedures for these comparisons are discussed below. These are some ways in which Document Explorer tools can be used to evaluate a text after it has been translated:

• Perform word counts - by counting how many total words are used, how many unique words and how many times each individual word is used.

• Use in-context searching – by checking the consistency in word use, the theme treatment and the imagery.

• Use synchronized screens – to view the original document and translated documents, or two or more independent translations of the original, as they are connected/synchronized paragraph by paragraph, chapter by chapter, etc.
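The word-count comparison in the first bullet can be sketched in Python. This is an illustrative helper (the texts and function names are hypothetical), showing the three figures the manual names: total words, unique words and per-word frequencies.

```python
import re
from collections import Counter

def word_profile(text):
    """Summarize a text: total words, unique words, per-word frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    return {"total": len(words), "unique": len(freq), "freq": freq}

def compare_translations(a, b):
    """Report the count figures side by side for two translations."""
    pa, pb = word_profile(a), word_profile(b)
    return {"total": (pa["total"], pb["total"]),
            "unique": (pa["unique"], pb["unique"])}
```

A translation that uses far more unique words than another may be rendering one source term with several different words, which is exactly the kind of inconsistency the in-context checks in the second bullet are meant to catch.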


Applications:

• Use Document Explorer to discover a poetic text and write an exegesis that serves as a template for translation of the source into several different languages. xvii

• Evaluate a time series of translations of the Iliad, the Bible or any other work that has been translated repeatedly over time. How are the terms and ideas translated differently? Compare, for example, the Early Modern English of the King James Bible to the Modern English of the New International Version.

• Evaluate several independent translations of the same document. Examine the variance in the terms used to represent the same ideas in the source text.

• Compare the several translations of terms used in presidential news briefings as they occur in foreign media. xviii

• Considering the U.S. Constitution, how has the terminology been translated? How is arms in the Second Amendment rendered in other languages?

• How is Tolstoy translated in one culture as compared to another – American vs. French vs. German?

• How do translations of UN Resolutions or WTO Charters into several languages differ?

PROCEDURES: SYNCHRONIZED WINDOWS

Because the Document Explorer tools for much of translation analysis have been covered in previous sections, we will focus here on synchronization tools, along with the use of word counts, in-context searching, run-time concordance, collocation and pattern analysis.

The Synchronous Layout allows you to select the default layout for the books in a book set. This data is used every time a book set is opened. WordCruncher creates separate panes for each book in the book set in accordance with the pane layout determined here. The Help files will walk the user through the synchronization process. Synchronized layouts allow users not only to read files that are connected synchronically, but to search the synchronized files.

Example 1. Reading Synchronized Files

Example 2. Searching Synchronized Files


Section H – Textual Analysis Methods for Writers

This section discusses two ways in which the tools explained in previous sections can be used: 1) to analyze one's own writing in order to attain a greater degree of objectivity in evaluating the quality and impact of one's own texts, and 2) as additional writing tools for authors.

Objective Evaluation

Evaluating one's own writing can be very difficult. Aside from an aversion to conscious self-criticism, the author often has a difficult time attaining the ear of the audience, or the level of objectivity that a dispassionate reader naturally has. Objectivity not only reveals deficiencies in the content of a text but also provides an understanding of the general effect the text might have on a reader. Since it is difficult for an author to perceive the text as one who is not involved in constantly reading and reviewing it, Document Explorer tools can assist in finding a more objective position. Below are some ways that Document Explorer tools can be used to gain a more objective standpoint when reviewing one's own compositions.

• WordWheel and Word Counts: Examine the WordWheel and its counts – is the mixture of words what the author desires it to be, e.g., varied and dynamic or uniform and stale? Are thematic words overused, or are they underused to the extent that the theme is insufficiently reinforced? Is there a necessity or opportunity for variation in word choice, such as negated antonyms for understatement – not bad for good, or not good for bad?

• In-context searching: View term locations in the search window – are the terms used correctly in-context? Search on a list of cliché phrases – is it possible to substitute a more novel, appropriate or convincing phrase? Check the contextual use of specific terms to ensure appropriateness for the intended audience. Search the focus words of a speech for impact within the context.

• Grade level: Perform a readability test on the document. Check for number and context of connectors, conjunctions and subordinators, e.g., and, but, or, however, hence, accordingly, etc. as the use of these words is a factor in complexity level.

• Collocated terms: Check to see if collocations form themes and ideas that are fitting to the audience addressed or the content treated – are jargon phrases compatible with the audience?

• Themes and devotion to themes: How often are thematic words used? Do they build toward a climax or do they reinforce a direction for the theme? Are themes overt as in Bush’s Thousand Points of Light speech or are they underlying/symbolic?

• Imagery and Symbolism: What is the parallel between symbols from the text and the themes discussed? How are images constructed or what are their fundamental parts?


• Style Analysis: What is the style? What do the patterns of textual features say about the author, the expected audience, the time period, etc.?

Additional Writing Aids

• Word count analysis may provide the writer with ideas for an appropriate title.
• The in-context search tools of Document Explorer can be used to organize a thematic index for a manuscript.


PART II. ANALYTICAL METHODS

Introduction

This part of the manual describes statistical procedures that will enable Document Explorer users to expand their document-analysis capabilities. By understanding various procedures for text analysis (some simple and others complex), document exploration may be expanded to include statistically sound comparison and predictive modeling. Hence, a fundamental objective of this section is to explain how Document Explorer users can summarize observations, compare sets of observations and make projections using both deductive and inductive methods.

It is important to note that these statistics can be used to model relationships and estimate predictive models, but they are not meant to establish cause-and-effect relationships. Most cases of cause and effect must be left to theorists and qualitative reasoning; quantitative modeling methods are simply that – “models” of correlation or relationship.

Although the Discovery Methods are simple enough for a broad audience, the Analytical Methods outlined here are complex and may be better suited for students and faculty with at least a rudimentary knowledge of statistics, though the introduction to each section may interest a broad range of researchers, since it explains the procedures in general terms and gives practical applications. In most cases the analysis procedures can be performed by computer software with little difficulty. Still, it is important for any researcher to understand the input data and the nature of each procedure in order to interpret the statistical output. A course in general statistics, including non-parametric statistical procedures, is recommended.

Words, strings, collocation counts, frequencies and distributions are the fundamentals for summarizing observations and making statistical comparisons and projections. Document Explorer readily supplies this output; additionally, data from Document Explorer are easily copied and pasted into spreadsheets and statistical programs (see Section D). The main sections of this part of the manual are summarized below.

Methods of Summarizing Observations: Statistics is a way of summarizing observations through graphics and quantitative methods. This section explains much of the terminology covered in general statistics: types of data, methods of summarizing location and dispersion parameters, creating and describing frequency distributions, measures of central tendency, measures of variability and other important concepts.


Hypothesis Testing: This section explains the formation of hypotheses. The importance of this section cannot be overstated: forming a correct hypothesis is key to variable selection and to choosing the correct test procedure.

Variable Selection: Text analysis has special types of variables, which are outlined in this section. These variables can be combined to create composite variables and ratios.

Comparing Observations: Statistical tests may be used to investigate the differences between means and differences between medians of two sets of observations. These observations may be “independent” or “paired.” This section explains tests comparing two observations and tests comparing three or more.

Association and Trend Analysis (Correlation and Regression Analysis): This section examines methods of studying relationships between two different measures. Association (in terms of correlation) and trend analysis (in terms of regression) are methods for determining the relationships between measures.

Time Series: An extension of association and trend analysis is the examination of an observation across time. This section examines the special nature of time-related variables and various methods for modeling and forecasting.

Comparing Dispersion (Goodness of Fit): Comparing dispersion is a comparison of frequency distributions rather than location parameters. This comparison considers word frequencies or the distribution of variables across a series of works.

Multivariate Analysis of Variance (MANOVA): This section examines methods of comparing multiple variables through classification, discrimination and clustering procedures.


Section A. Methods of Summarizing Observations

“Location” and “dispersion” are terms we use in answering most statistical questions. Measures of location include the mean, the median and the mode; these are methods of computing the central tendency of a distribution. Measures of dispersion describe how much the observations differ; this is called variability. Common measures of variability are the range, standard deviation and variance.

Descriptive statistics: used to describe the data, not to draw predictions. Basically, they are methods and procedures used for presenting and summarizing data. Examples: tables, graphs.

Inferential statistics: used to make inferences or predictions; used to draw conclusions about the population (all the objects that have something in common with one another) from the sample (a set of objects drawn from the population).

Levels of Measurement

Nominal: categorical; identifies mutually exclusive categories; cannot be mathematically manipulated. Examples: hair color, SSN.

Ordinal: rank-order; represents rank-orders but gives no information about the differences between adjacent ranks. Example: order of finish in a horse race.

Interval: considers the relative order of the measures involved and also has equal differences between measurements corresponding to equal differences in the amount of the attribute being measured; does not have a true zero. Example: IQ.

Ratio: has a true zero point; equal differences between measurements correspond to equal differences in the amount of the attribute being measured. Examples: weight, height, blood pressure level.

Types of Data

Continuous: can assume any value within the range of values that defines the limits of that variable. Example: temperature.

Discrete: can only assume a limited number of values. Example: values on the face of a die.

Creating a Frequency Distribution

What do the observations look like graphically? A frequency distribution is a graphical look at the data that lets you see the data in a physical representation. This means graphing counts and looking for general shapes in the data.


Frequency Histogram: a histogram showing the frequency of individual data values on the vertical axis and the data values along the horizontal axis.

Describing a Frequency Distribution

Normal: looks like the classic bell-shaped curve.

Bimodal: has two modes, so it looks like two humps or two bells next to each other.

Skewed: skewness is a measure of the relative symmetry of the distribution; zero indicates symmetry. Positive values show a long right tail; negative values show a long left tail.


Kurtosis: a measure of relative peakedness, based on the size of the tails of a distribution. If a distribution is unimodal and symmetric, then K = 3 indicates a normal, bell-shaped distribution (mesokurtic); K < 3 indicates a platykurtic distribution (flatter than normal, with shorter tails); and K > 3 indicates a leptokurtic distribution (more peaked than normal, with longer tails). Kurtosis is calculated using this formula:

K = ∑(X − µ)⁴ / (Nσ⁴),

where σ is the standard deviation.

Measures of Central Tendency

mode: the data value that occurs most frequently in a sample; not necessarily unique (if there are two modes, the data are called bimodal); most useful for discrete data with a small range.

median: the middle score in a distribution; the point that defines the upper and lower 50 percent of the sample, the exact middle of the data set. If n is odd the median is a member of the data set, while if n is even the median is the average of the two adjacent middle values. The median is a robust measure of central tendency because it is insensitive to outliers and extreme values; it is most commonly used in nonparametric tests.

mean: the average score of the distribution; the average of the sample data; the most common measure of central tendency; not robust to outliers and extreme values.

Example: in the distribution of the following seven scores: 5,6,8,9,13,13,16 the mode is 13, the median is 9 and the mean is 10
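The example above can be checked directly. This is a minimal sketch using Python's standard library (Python itself is not a tool covered by this manual; it stands in for any statistics package here):

```python
# Central tendency of the seven scores from the example above.
import statistics

scores = [5, 6, 8, 9, 13, 13, 16]

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # middle score (n is odd, so a member of the data)
mean = statistics.mean(scores)      # arithmetic average

print(mode, median, mean)  # 13 9 10
```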

Measures of Variability

How is the vocabulary dispersed? How is a basket of key words dispersed?

Range: the difference between the maximum and the minimum value in a sample; a measure of dispersion; not robust to outliers and extreme values.

[Figure: mesokurtic, platykurtic and leptokurtic distribution shapes]


Variance: the mean of the squared deviation scores; the sum of the squared deviations about the mean divided by the sample size minus one. The larger the variance, the greater the dispersion or spread around the mean; not robust to outliers and extreme values.

Standard Deviation: the square root of the variance; a measure of dispersion about the mean; measured in the same units as the mean.

Central Tendency: a description of the location of the middle or characteristic values in a distribution (mean, median, midrange, trimmed mean, modal class); the position where the data tend to center.

Dispersion: a general reference to the “spread” of data values around the center of a distribution, including variance, standard deviation and range.

Various Vocabulary

Parametric tests: require assumptions about the shape of the populations involved.

Nonparametric tests: do not require assumptions about the shape of the populations involved (distribution-free tests).

Outlier: any sample observation that is more than 3 standard deviations from the mean. In general, it is an observation that may be from a different population because it differs markedly from the others in the sample.

Robust: the quality of being unaffected by a particular factor. Example: the median is robust to outliers.

Quartiles: the first quartile (q1) is the point along the x-axis that defines the lower 25 percent of the sample. The second quartile is the median. The third quartile is the point along the x-axis that defines the upper 25 percent of the sample.

Statistic: a characteristic of a sample.

Parameter: a characteristic of a population.

Parametric vs. Nonparametric Assumptions and Tests

Nonparametric statistical tests are especially appropriate when the sample size is small, the data are not continuous, or you do not think your data come from a normal distribution. Parametric tests make specific assumptions about one or more of the population parameters that characterize the underlying distribution for which the test is employed. Nonparametric tests make no such assumptions. Typically, parametric tests use interval or ratio data, while nonparametric tests use categorical/nominal and ordinal/rank-order data.
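The dispersion measures defined above can be illustrated with the same seven scores used in the central tendency example; a minimal Python sketch (the standard library is a stand-in for the statistics packages the manual discusses):

```python
# Range, sample variance and standard deviation for the seven example scores.
import statistics

scores = [5, 6, 8, 9, 13, 13, 16]

rng = max(scores) - min(scores)     # range: maximum minus minimum
var = statistics.variance(scores)   # sum of squared deviations about the mean / (n - 1)
sd = statistics.stdev(scores)       # square root of the variance

print(rng, round(var, 2), round(sd, 2))  # 11 16.67 4.08
```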


Advantages of Nonparametric Statistics

1. Computations are quick and easy.
2. May be applied when the data are measured on a weak measurement scale (such as nominal or ordinal).
3. Depend only on a minimum of assumptions, which makes them quite general.
4. Outliers have limited influence since the observations are usually replaced by signs or ranks.
5. Can be used with data measured on a qualitative rather than quantitative scale.
6. Valid for small sample sizes (less than 25). There is no minimum sample size required for most methods to be valid and reliable.
7. Easy to use and understand.
8. More widely applicable than parametric methods, since the techniques may be applied to phenomena for which it is impractical or impossible to obtain precise measurements on (at least) an interval scale.

Disadvantages of Nonparametric Procedures

1. The arithmetic in many instances is tedious and laborious.
2. May lose efficiency when converting data to simple signs or ranks.
3. Since the calculations for most nonparametric methods are simple and rapid, these procedures are sometimes used when parametric procedures are more appropriate.
4. Less flexible than linear models and ANOVA.

When to Use Nonparametric Procedures

1. The assumptions necessary for the valid use of a parametric procedure are not met.
2. The data have been measured on a scale weaker than that required for the parametric procedure that would otherwise be employed.
3. The hypothesis to be tested does not involve a population parameter.
4. The data contain notable outliers (which cannot be eliminated with transformations).
5. The distribution of the dependent variable is non-normal.
6. Variances are unequal across groups.


Section B. Methods of Hypothesis Testing

Steps in Hypothesis Testing

1. Choose the null and alternative hypotheses.
2. Set the alpha level (usually α = 0.01 or α = 0.05).
3. Choose the appropriate statistical test.
4. Calculate the test statistic.
5. Decide whether or not to reject the null hypothesis.
6. Make a summary statement about the conclusion from the statistical analysis.

Test of Significance

1. Establish the basic assumptions of the experiment or survey.
2. Predict what outcome is expected under the assumptions, using the sampling distribution.
3. Observe the outcome.
4. Calculate the probability, under the assumptions, of outcomes as extreme as our observation, using the sampling distribution.
5. If this probability is large, then the outcome is consistent with the assumptions. If the probability is small, then the outcome is inconsistent with the assumptions; that is, there is statistically significant evidence against the assumptions.

null hypothesis: a statement of no change in status; remains with the status quo; the hypothesis you generally want to disprove; denoted by H0.

alternative hypothesis: a statement of change in the status; the predicted change from normal, describing what you want to prove (this helps you decide whether to do a one- or two-sided test); denoted by H1.

alpha: the boundary value for the credibility of H0 vs. H1; typically a small value, usually 0.05 or smaller; denoted by α.

significance level: the predetermined boundary value for determining statistical significance; your accepted risk of an improbable event.

test statistic: a number computed from the data that we use to test H0, assuming it is true.

p-value: the probability of getting the test statistic value or a more extreme value, assuming H0 is true. If p-value ≤ α, the result is statistically significant and you reject H0; otherwise there is no statistical significance and you do not reject H0.

REMEMBER: if any of the assumptions of the test is seriously violated, the reliability of the computed test statistic may be compromised.
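The test-of-significance procedure above can be sketched for a simple case. The coin-flip setting and the numbers below are illustrative (not from the manual): under the assumption of a fair coin (H0: p = 0.5), we compute the exact one-sided probability of an outcome at least as extreme as the one observed, then compare it to α.

```python
# Exact one-sided binomial test of significance:
# observe 15 heads in 20 flips of a supposedly fair coin.
from math import comb

n, observed = 20, 15
alpha = 0.05

# P(X >= observed) under the Binomial(n, 0.5) sampling distribution
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n

print(f"p-value = {p_value:.4f}")  # p-value = 0.0207
if p_value <= alpha:
    print("Reject H0: statistically significant evidence against a fair coin")
else:
    print("Do not reject H0")
```

Because 0.0207 ≤ 0.05, the outcome is inconsistent with the fair-coin assumption at the 0.05 level.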


Section C. Variable Selection

Textual analysis has a wide variety of individual variables and combinations of variables. The following is a selection of variables to consider in text analysis.

Measures of Total Words and Total Unique Words

• Total Number of Words (N)
• Total Unique Vocabulary (V)

Unique Word Ratio
UWR = V/N

Type-token Ratio
TTR = Ni/Vi

Pace
Pace = Vi/Ni
This statistic represents the rate at which new words are generated by an author.

Entropy

H = −∑i pi log pi

pi = probability of appearance of the ith word type
   = (number of occurrences of the ith word type) / (total number of words in the text)

Increasing internal structure yields decreasing entropy; increasing disorder (randomness) yields increasing entropy.

Adjusting for the length of the sample text:

H = −100 ∑i pi log pi / log N

(the factor of 100 scales the diversity measure)
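The vocabulary measures above can be sketched in a few lines. This is an illustrative Python sketch (the sample text and whitespace tokenization are simplifying assumptions, not the manual's method):

```python
# Sketch: total words (N), unique vocabulary (V), Unique Word Ratio and
# entropy for a short sample text.
from collections import Counter
from math import log

text = "the quick brown fox jumps over the lazy dog the fox"
words = text.lower().split()

counts = Counter(words)
N = len(words)    # Total Number of Words
V = len(counts)   # Total Unique Vocabulary
uwr = V / N       # Unique Word Ratio = V/N

# Entropy: H = -sum(p_i * log p_i), with p_i = occurrences of type i / N
H = -sum((c / N) * log(c / N) for c in counts.values())

# Length-adjusted entropy: -100 * sum(p_i log p_i) / log N
H_adj = 100 * H / log(N)

print(N, V, round(uwr, 3), round(H, 3), round(H_adj, 1))
```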


Readability or Grade Level

• Word Length
• Sentence Length
• Number of Nouns
• Number of Punctuation Marks

Test for Once-used Words (Hapax Legomena)

R = (100 log N) / (1 − V1/V)

where V1 is the number of word types occurring exactly once. This tests the propensity of an author to choose between the alternatives of employing a word used previously or employing a new word. It may also measure change over time in vocabulary richness and be helpful when problems of dating are an issue.

Yule’s Characteristic

A measure of vocabulary richness based on the assumption that the occurrence of a given word is based on chance and can be regarded as a Poisson distribution:

K = 10⁴ (∑r r²Vr − N) / N²

Simpson’s Index

The chance that two members of an arbitrarily chosen pair of tokens will belong to the same type:

D = ∑r r(r − 1)Vr / [N(N − 1)]

(r = 1, 2, 3, …; Vr = the number of types that occur exactly r times in a sample of text)
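Both richness measures are computed from the frequency spectrum Vr. A minimal sketch (the six-token text is illustrative):

```python
# Sketch: Yule's Characteristic K and Simpson's Index D from the frequency
# spectrum Vr (number of word types occurring exactly r times).
from collections import Counter

words = "a b c a b a".split()        # illustrative tokens: a x3, b x2, c x1

freqs = Counter(words)               # word -> number of occurrences
spectrum = Counter(freqs.values())   # r -> Vr
N = len(words)

K = 1e4 * (sum(r * r * Vr for r, Vr in spectrum.items()) - N) / N**2
D = sum(r * (r - 1) * Vr for r, Vr in spectrum.items()) / (N * (N - 1))

print(round(K, 2), round(D, 4))  # 2222.22 0.2667
```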

Word Groupings (Comparing Sets of Key Words)

• Frequent Non-Contextual Words (i.e. the, and, of, that, to, in, a)
• Infrequent Non-Contextual Words (i.e. again, after, among, according, wherefore)
• Preferred Words
• Non-Preferred Words

Grade Level (Readability)

Word length, sentence length, number of nouns, and number of punctuation marks are variables that combine to give an estimate of the grade level of the text. Grade levels can then be compared.

GL* = 0.39 × (average no. of words per sentence) + (average no. of vowels per word) − 15.59

* Flesch-Kincaid Readability Formula adjusted for foreign language use.

Non-contextual Word Ratio

(Top occurring non-contextual words) / (Least occurring non-contextual words)

Rank words by occurrence. Select the top-ranking non-contextual words and the bottom-ranking non-contextual words.

Preferred Words Ratio

(Preferred Words − Non-preferred Words) / (Preferred Words + Non-preferred Words)

Rank words by occurrence. Select author-preferred words and author non-preferred words.

• Rare Words
• Most Common Words
• New Words
• Function Words
• Verb Words (i.e. run, walk...)
• Concept Words (i.e. faith, freedom, abuse)
• Feminine Endings
• Open Lines
• Contractions (i.e. I’m, you’re, we’ve, I’ve, you’ve)
• I do variants (i.e. I do not, I do, I do + verb)
• Metric Fillers (i.e. if that, the which, when that, since that)
• Adversions (i.e. look, look you, you see, do you see, mark my words, hear me, listen)
• -th endings (i.e. fifth, heareth, sayeth, thinketh)
• Prefixes (i.e. where-, there-, un-, fore-, dis-)
• Suffixes (i.e. -less, -able, -ful, -ish, -ible, -ment, -like)
• Positive Intensifiers (i.e. most, many, very, much, more)
• Negative Intensifiers (i.e. none, no one, nothing, few)

Rare/Common Words Ratio

Slope1 = Rare Words / Most Common Words

By plotting the results on a graph, a measure of slope (−/+) for the entire text is given. Slopes can then be compared.

Rare/New Words Ratio

Slope2 = Rare Words / New Words


• Frequency of i-syllable words
• Large words

Grammatical Discriminators

• Parts of speech (i.e. nouns, verbs, adjectives, adverbs, pronouns, noun/verb ratio, verb/adjective ratio, prepositions, conjunctions, articles)
• Sentence Constructs (, ; : .)
• Verb Plots -- measures the word count between verbs (i.e. sat .. (3) . run .. (4) .. go .. (5) .. be)

Comparison Against a Pool

• Distinctiveness Ratio

D = (freq. of word from author1) / (freq. of word from all other authors)

Strings and Collocated Word Variables

• Word Patterns
• Collocated Words

Word Parts and Rhyming Words

• Word Parts (Suffixes, Infixes, Prefixes, Word Roots)
• Rhyming Words (*ing, *ly, *tion)


Section D. Statistical Analysis Using Microsoft Excel

Document Explorer/WordCruncher exports data into formats that can be copied and pasted into most statistical analysis programs, including SAS, SPSS and Minitab. The statistical functions provided in Microsoft Excel are also available, although there has been a great deal of criticism about the reliability of several Excel procedures (especially its random number generator); still, as an educational tool Microsoft Excel has merit and is widely available.

Statistical worksheet functions perform statistical analysis on ranges of data. For example, a statistical worksheet function can provide statistical information about a straight line plotted through a group of values, such as the slope of the line and the y-intercept, or about the actual points that make up the straight line. The following is a list of Excel’s statistical functions.

Microsoft Excel’s Statistical Functions

AVEDEV, AVERAGE, AVERAGEA, BETADIST, BETAINV, BINOMDIST, CHIDIST, CHIINV, CHITEST, CONFIDENCE, CORREL, COUNT, COUNTA, COVAR, CRITBINOM, DEVSQ, EXPONDIST, FDIST, FINV, FISHER, FISHERINV, FORECAST, FREQUENCY, FTEST, GAMMADIST, GAMMAINV, GAMMALN, GEOMEAN, GROWTH, HARMEAN, HYPGEOMDIST, INTERCEPT, KURT, LARGE, LINEST, LOGEST, LOGINV, LOGNORMDIST, MAX, MAXA, MEDIAN, MIN, MINA, MODE, NEGBINOMDIST, NORMDIST, NORMINV, NORMSDIST, NORMSINV, PEARSON, PERCENTILE, PERCENTRANK, PERMUT, POISSON, PROB, QUARTILE, RANK, RSQ, SKEW, SLOPE, SMALL, STANDARDIZE, STDEV, STDEVA, STDEVP, STDEVPA, STEYX, TDIST, TINV, TREND, TRIMMEAN, TTEST, VAR, VARA, VARP, VARPA, WEIBULL, ZTEST

In addition to these statistical functions, the following graphical representations are available for time series analysis: linear trendline, logarithmic trendline, polynomial trendline, power trendline, exponential trendline, and moving average trendline.


Microsoft Excel provides a set of data analysis tools, called the Analysis ToolPak, that is a step-saver when developing complex statistical or engineering analyses. You provide the data and parameters for each analysis; the tool uses the appropriate statistical or engineering macro functions and then displays the results in an output table. Some tools generate charts in addition to output tables. To use these tools, you need to be familiar with the specific area of statistics or engineering for which you want to develop analyses. The following tools are available in Excel:

Analysis of Variance (ANOVA), Correlation, Covariance, Descriptive Statistics, Exponential Smoothing, Fourier Analysis, F-Test: Two-Sample for Variances, Histogram, Moving Average, t-Test, Random Number Generation, Rank and Percentile, Regression, Sampling, z-Test: Two-Sample for Means

Although Excel covers a variety of procedures, it does not include most nonparametric procedures or MANOVA procedures. Dedicated statistical programs like SAS and Minitab cover a broader spectrum of procedures and give more in-depth statistical analyses.


Section E. Methods of Comparing Observations

Statistical tests may be used to investigate the differences between means and the differences between medians of two sets of observations. An example of comparing observations is the comparison of readability levels of the works of William Shakespeare and the works of Francis Bacon. If the average readability level of Shakespeare’s works is 10th grade and the average readability level of Bacon’s works is 12th grade, we cannot say that the readability levels are significantly different without looking at distribution and probability statistics.

An additional example is comparing positive word use with negative word use in the home in order to establish a predictive model for divorce. The counts of positive words are compared to the counts of negative words to see if there is a significant difference between the two counts. Without looking at the distributions of the two counts, the researcher cannot say that the two are significantly different. The test for significant difference is what we address in this section.

Observations are either paired or independent, and learning to recognize data as one or the other is important in selecting the correct method of analysis. Independent observations are samples selected at random. Paired observations may come from before-and-after observations or from observations that are carefully matched except for one factor whose effect is being investigated. The first example above uses independent observations, whereas the second uses paired observations. In either case, we may use statistical tests to determine whether the differences may be explained by chance or whether they are extreme enough to be considered statistically significant, meaning there is evidence that the two groups of data are not the same.

Two general classes of statistical tests are parametric and nonparametric tests. Parametric tests work under the assumption of centrality about the mean. Nonparametric statistical tests use the median and may work without parametric assumptions. As you will see, much of nonparametric statistics uses the rankings of the data rather than the data themselves. Each of the following sections examines both the parametric and the nonparametric methods for comparing both location parameters and dispersion parameters. The following sections address various statistical methods tailored to determine statistical significance in the differences between observations:

• Two Independent Observations
• Two Related Observations
• Three or More Independent Observations
• Three or More Related Observations


Two Independent Observations

Two samples are considered independent if neither sample influences the other. In the example that compares readability levels between Shakespeare and Bacon, the two observations are independent. Given two independent samples, we test whether they come from populations with the same mean or median. Through testing, we gather evidence to answer some interesting questions about our documents. For example, we may want to compare two documents to see if they were written by the same author, or to see how one author treats a basket of words compared to another author. There are a number of comparisons we may make by testing independent observations: the “grade level” of two works, ratios of variables, the writing pace and entropy of two independent texts, strings and collocated word sets, and so on. The important thing to remember is that each of these studies requires variables that are drawn independently of each other. The following statistical procedures test for “statistical significance” in the difference between the two observations. Both parametric and nonparametric procedures are listed, so that the user can apply various procedures based on the assumptions required by the data.


T-test for Two Independent Samples

In this parametric test, the two sample means are employed to estimate the values of the means of the populations from which the samples are taken. If the test is significant, the researcher can conclude there is a high likelihood that the samples represent populations with different mean values. Interval or ratio data are required.

ASSUMPTIONS:

a) random samples
b) the underlying distributions are normal
c) homogeneity of variance (the variance of the underlying population of sample 1 is equal to the variance of the underlying population of sample 2)

HYPOTHESES

H0: µ1 = µ2

In words: the two populations have the same mean. µ represents the population mean, and the subscript 1 or 2 indicates which population it comes from.

H1: µ1 ≠ µ2, or H1: µ1 > µ2, or H1: µ1 < µ2

TEST STATISTIC

First, calculate the mean of each sample:

X̄1 = ∑X1 / n1,  X̄2 = ∑X2 / n2.

Second, calculate the estimated population variance for each sample:

s̃1² = [∑X1² − (∑X1)²/n1] / (n1 − 1),
s̃2² = [∑X2² − (∑X2)²/n2] / (n2 − 1).

Then the formula, for equal or unequal sample sizes, is:

t = (X̄1 − X̄2) / √( [((n1 − 1)s̃1² + (n2 − 1)s̃2²) / (n1 + n2 − 2)] × [1/n1 + 1/n2] )

When interpreting the results of this test statistic, use the Table of the Student’s t distribution, with degrees of freedom df = n1 + n2 − 2.
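The computation can be followed step by step in code. This is a minimal sketch of the pooled-variance t statistic described above; the two samples are illustrative numbers, not data from the manual:

```python
# Sketch: t-test for two independent samples, following the formulas above.
from math import sqrt

x1 = [10, 12, 11, 13, 14]   # illustrative sample 1
x2 = [8, 9, 7, 10, 11]      # illustrative sample 2
n1, n2 = len(x1), len(x2)

mean1, mean2 = sum(x1) / n1, sum(x2) / n2

# Estimated population variances: [sum(X^2) - (sum X)^2 / n] / (n - 1)
var1 = (sum(v * v for v in x1) - sum(x1) ** 2 / n1) / (n1 - 1)
var2 = (sum(v * v for v in x2) - sum(x2) ** 2 / n2) / (n2 - 1)

# Pooled variance, then the t statistic for equal or unequal sample sizes
pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t = (mean1 - mean2) / sqrt(pooled * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(f"t = {t:.3f}, df = {df}")  # t = 3.000, df = 8
```

The obtained t is then compared against the critical value from the Student's t table at the chosen α with df = n1 + n2 − 2.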


Sign Test

This is called a sign test because we convert the data for analysis into a series of plus and minus signs. With this test, the larger the sample, the greater the power and the shorter the confidence interval for a given level. If we reject the null hypothesis, our research interest then becomes determining what the true population median might be. This test is easy to implement and calculate; its drawback is that some information is lost.

Uses of the sign test:

a) testing whether one random variable in a pair (X, Y) tends to be larger than the other (i.e., a comparison between the probabilities of two different types of outcomes)
b) testing a hypothesis about the median of a single population, or about the median difference in a population of paired differences (sign test for location)
c) testing for trend in a series of ordinal measurements
d) testing for correlation

ASSUMPTIONS:

a) independent samples with unknown median M
b) measured on (at least) an ordinal scale
c) the underlying variable of interest is continuous

HYPOTHESES

H0: θ = M0, or H0: p(+) = p(−)
H1: θ ≠ M0, or H1: θ > M0, or H1: θ < M0

where θ represents the population median and M0 represents the hypothesized median.

TEST STATISTIC

The first step is to sort the data. Next, subtract the hypothesized median, M0, from each of the original data values. If the difference is zero (that is, if the data value equals the hypothesized median), that observation is discarded; ties are always discarded. If the difference (Xi − M0) is positive, assign a plus sign (+); if the difference (Xi − M0) is negative, assign a minus sign (−). The test statistic is T = the total number of plus signs (+). You also need n = the total number of (+)’s and (−)’s; remember that any discarded ties are also eliminated from n. Finally, reject H0 if T ≥ n − t, where t is a critical value that comes from the binomial table.
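The sign-test computation above can be sketched with an exact binomial p-value in place of the table lookup; the data and hypothesized median are illustrative:

```python
# Sketch: sign test for a hypothesized median M0, with an exact one-sided
# binomial p-value instead of a table lookup.
from math import comb

data = [3, 5, 7, 8, 9, 12, 15]
M0 = 5                                     # hypothesized median

signs = [x - M0 for x in data if x != M0]  # ties (zero differences) discarded
T = sum(1 for d in signs if d > 0)         # test statistic: number of + signs
n = len(signs)                             # total (+)'s and (-)'s after ties removed

# One-sided p-value: P(X >= T) under Binomial(n, 0.5), i.e. under H0
p_value = sum(comb(n, k) for k in range(T, n + 1)) / 2**n

print(T, n, round(p_value, 4))  # 5 6 0.1094
```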


Rank-Sum Two-Sample Test (Mann-Whitney U)

The hypothesis being tested here is whether two independent samples represent two populations with different median values. The type of data required for this nonparametric test is ordinal, and the data must be in rank-order format. A significant result means there is a significant difference between the two sample medians; this usually means that the samples represent populations with different median values.

ASSUMPTIONS

a) each sample is randomly selected
b) the two samples are independent
c) the original variable observed is a continuous random variable
d) the underlying distributions of both samples are identical in shape

HYPOTHESES

H0: θ1 = θ2
H1: θ1 ≠ θ2, or H1: θ1 > θ2, or H1: θ1 < θ2

TEST STATISTIC

First, the data must be ranked. The ranking protocol for the Mann-Whitney U test is as follows, with N being the total number of subjects in the experiment, n1 the number of subjects in group 1, n2 the number of subjects in group 2, and so on.

a) All N scores are arranged in order of magnitude (irrespective of group membership), beginning on the left with the lowest score and moving to the right as scores increase.
b) All N scores are assigned a rank: a rank of 1 is assigned to the lowest score, a rank of 2 to the second-lowest score (if there are no ties), and so on, with a rank of N (if no ties) assigned to the highest score.
c) If there are ties: when two or more subjects have the same score, the average of the ranks involved is assigned to all scores tied for a given rank.

It is acceptable to reverse the ranking protocol, but the results are easier to interpret if the ranking is kept as described. After ranking the data, sum the rankings for each group to obtain ∑R1 and ∑R2. Then calculate the test statistics U1 and U2, which involve the sample sizes of the two groups and the summed rankings:

U1 = n1n2 + n1(n1 + 1)/2 − ∑R1

U2 = n1n2 + n2(n2 + 1)/2 − ∑R2

To check your calculations, verify that n1n2 = U1 + U2.

Evaluate U with the Table of Critical Values for the Mann-Whitney U statistic (see Part III). The values are listed by the number of subjects in each group. In order to be significant, the obtained value of U must be equal to or less than the tabled critical value at the predetermined level of significance.
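The ranking protocol and U statistics can be sketched as follows; the two groups are illustrative, and the `ranks` helper (which averages tied ranks) is our own construction, not part of Document Explorer:

```python
# Sketch: Mann-Whitney U statistics from two illustrative samples,
# with average ranks assigned to ties.
def ranks(values):
    """Map each value to its rank (1 = lowest), averaging tied ranks."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        i = j
    return rank_of

group1 = [1, 3, 5]
group2 = [2, 4, 6]

rank_of = ranks(group1 + group2)
R1 = sum(rank_of[v] for v in group1)
R2 = sum(rank_of[v] for v in group2)
n1, n2 = len(group1), len(group2)

U1 = n1 * n2 + n1 * (n1 + 1) / 2 - R1
U2 = n1 * n2 + n2 * (n2 + 1) / 2 - R2
assert U1 + U2 == n1 * n2   # consistency check from the text

print(U1, U2)  # 6.0 3.0
```

The smaller of U1 and U2 is then compared against the tabled critical value.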


Two Related Observations

Sometimes referred to as difference scores, paired observations are the result of taking two related observations. The statistical analysis is based on the differences between the two observations, not on the observations themselves. One common experiment is to study before-and-after effects. Most statistical work on paired observations covers weight-loss before-and-after observations or comparing manufacturing processes on widgets. Document Explorer adds a human dimension: that of political, social and literary trends.

Before-and-after studies are now able to examine trends in our vocabulary use, which can tell us about such trends. Consider the implications of answering the following questions. What changes in economic language occur after the Federal Reserve changes interest rates? What is the change in political rhetoric after a change of administration? What is the change in social rhetoric after a Supreme Court decision is released, and can we tell from the rhetoric change whether the effect is large or small? These are examples of before-and-after observations gathered from newspapers (i.e. New York Times, Washington Post, Los Angeles Times), news broadcast transcripts, congressional records, or other media -- paired observations from individual sources.

Other examples of paired comparisons could be drawn from the same source. In the example at the beginning of this section we compare positive word counts with negative word counts across n families to predict divorce. Another possible comparison might be a study that examines the use of a basket of anti-American words and a basket of pro-American words drawn from individual newspapers (a paired variable) to predict bilateral relations. In literature we might test a basket of authors to see if adverb/verb ratios differ across two genres. Rhetoric is a powerful predictive instrument whose potential can be fully realized when the proper method and procedure are selected.

Again, as with the previous tests, the following statistical procedures test for statistical significance in the difference between the two paired observations. Both parametric and nonparametric procedures are included, so that the user can apply various procedures based on the assumptions required by the data.


Paired T-Test

This parametric test answers the question, “Do two dependent samples represent two populations with different mean values?” The test uses (at least) interval/ratio data and is employed when the population variances are unknown. When there is a significant difference between the means of the two paired groups, there is a high likelihood that the conditions represent populations with different mean values.

ASSUMPTIONS:

a) the sample of n subjects was randomly selected
b) the underlying distribution is normal
c) homogeneity of variance

HYPOTHESES:

H0: µ1 = µ2

In words: the two populations have the same mean. µ represents the population mean, and the subscript 1 or 2 indicates which population it comes from.

H1: µ1 ≠ µ2, or H1: µ1 > µ2, or H1: µ1 < µ2

TEST STATISTIC

These computations use the Difference Method. Create a table with each subject representing a row and the scores from condition 1 and condition 2 each in a column. Allow the condition 1 score to be called X1 and the condition 2 score to be called X2. Now calculate the difference score for each subject, D = X1 - X2 . Add this as a new column to the table. Next to this column, add another column that is the squared difference, or D2. From the table we can sum the D column and the D2 column so that we have Σ D and Σ D2 . Σ D represents the sum of the difference scores. When this quantity is divided by the number of subjects, the mean of the difference scores, D is obtained,

nD

D ∑= . Next, calculate 1

)(~

22

−=

∑∑

nnD

DsD . Finally, using this result, the

standard error of the mean difference is calculated n

ss DD

~= . The test statistic for the t-

test for two dependent samples is Ds

Dt = . We use our obtained value of t and evaluate it

using the Table of Student’s t Distribution in the appendix. The degrees of freedom for the t-test for two dependent samples are df = n-1.
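The Difference Method above can be sketched in a few lines of Python. This is a minimal illustration, not part of Document Explorer; the function name and the before/after word counts are hypothetical.

```python
import math

def paired_t(x1, x2):
    """Paired t statistic via the Difference Method; returns (t, df)."""
    d = [a - b for a, b in zip(x1, x2)]   # difference scores D = X1 - X2
    n = len(d)
    sum_d = sum(d)
    sum_d2 = sum(v * v for v in d)
    d_bar = sum_d / n                     # mean of the difference scores
    s_d = math.sqrt((sum_d2 - sum_d ** 2 / n) / (n - 1))  # estimated s.d. of D
    se = s_d / math.sqrt(n)               # standard error of the mean difference
    return d_bar / se, n - 1              # compare t to Student's t with df = n - 1

# Hypothetical before/after counts of a word in three sources
t, df = paired_t([5, 6, 7], [4, 4, 5])
```

The obtained t is then compared against the tabled critical value for n - 1 degrees of freedom, exactly as described above.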


Sign Test for Pairs

This nonparametric test is very similar to the binomial sign test. It requires that each of n subjects has two scores, represented by X1 and X2. A difference between the scores is calculated: D+ if X1 > X2 and D- if X1 < X2. The test evaluates the hypothesis of whether, in the underlying population represented by the sample, the proportion of subjects who obtain a positive signed difference (i.e., a higher score in condition 1) is some value other than 0.5. This test can be employed with a before-after design.

ASSUMPTIONS:

a) The data come from a random sample of the population.
b) The data must be able to be rank-ordered.

HYPOTHESES:

H0: π+ = 0.5

In words: in the underlying population the sample represents, the proportion of subjects who obtain a positive signed difference (X1 > X2) equals 0.5.

H1: π+ ≠ 0.5, or H1: π+ > 0.5, or H1: π+ < 0.5

π+ represents the hypothesized population proportion.

TEST STATISTIC

Calculate D = X1 - X2. Then determine whether this is greater than, less than, or exactly zero. If greater than zero, mark this subject with a “+” sign. If the D quantity is less than zero, mark this subject with a “-” sign. If exactly zero, the pair is a tie and is eliminated from the data analysis. When ties are eliminated from the data analysis, the sample size, n, is reduced accordingly.

Add up the number of “+” signs; this is ΣD+. Add up the number of “-” signs; this is ΣD-. To get the proportions of each, use the equations:

p+ = ΣD+ / n,  p- = ΣD- / n

Make sure you are using the n revised for ties. Now use the table of the Cumulative Probabilities of the Binomial Distribution (see Part III). n represents the number of signed differences (excluding those that are zero) and x represents the number of positive signed differences. Find the n and x values and use the column π = 0.5. The entry represents the probability of obtaining x or more positive signed differences from the total of n signed differences.
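Rather than looking the probability up in the binomial table, a sketch can compute the cumulative binomial probability directly. This is an illustration under the π = 0.5 null hypothesis; the function name and the paired data are hypothetical.

```python
from math import comb

def sign_test(x1, x2):
    """Sign test for pairs: drop ties, count '+' signs, and return the
    binomial probability of obtaining that many or more plus signs
    when the true proportion is 0.5."""
    diffs = [a - b for a, b in zip(x1, x2) if a != b]   # ties eliminated
    n = len(diffs)                                      # n revised for ties
    plus = sum(1 for d in diffs if d > 0)               # number of "+" signs
    p_upper = sum(comb(n, x) for x in range(plus, n + 1)) / 2 ** n
    return plus, n, p_upper

# Six hypothetical pairs; the last pair is a tie and is dropped
plus, n, p = sign_test([2, 3, 4, 5, 1, 6], [1, 1, 1, 1, 2, 6])
```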


Matched Pairs Signed-Ranks (Wilcoxon)

This nonparametric test is employed with ordinal data, testing the hypothesis, “do two dependent samples represent two different populations?” Stated differently, this test evaluates whether, in the underlying population, the median of the difference scores θD equals zero. This test is an extension of the Wilcoxon Signed-Ranks Test. Each of the n subjects must have two interval/ratio scores. The difference score D is computed by subtracting a subject’s score in condition 2 from his score in condition 1. A significant difference in this test implies a high likelihood that the two samples or conditions represent two different populations. This test does not rank the original interval/ratio scores of subjects; instead, it ranks the interval/ratio difference scores. Thus it is a test of ordinal data.

ASSUMPTIONS

a) The sample is randomly selected from the population it represents.
b) The original scores are in interval/ratio data format.
c) The distribution of the difference scores in the populations represented by the two samples is symmetric about the median of the population of difference scores.

HYPOTHESES

H0: θD = 0

In the underlying populations represented by condition 1 and condition 2, the median of the difference scores equals zero. Requiring θD = 0 is the same as saying the sum of the ranks of the positive difference scores equals the sum of the ranks of the negative difference scores.

H1: θD ≠ 0

The median of the differences is some value other than zero.

TEST STATISTIC

Each subject has two scores, X1 and X2, coming from condition 1 and condition 2. The first step is to calculate the difference between the two scores, D = X1 - X2. Next, the difference scores need to be ranked with respect to their absolute values. Use the following guidelines when ranking the D scores:

a) Remember that it is the absolute values of D that are ranked; this means the sign of D is not taken into account.
b) Any difference score that equals zero is not ranked. This is equivalent to eliminating from the analysis any subject who yields a difference score of 0.
c) When scores are tied, the average of the ranks involved is assigned to all scores tied for a given rank.
d) Be certain that a rank of 1 is assigned to the difference score with the lowest absolute value and a rank of n to the score with the highest absolute value, where n is the number of difference scores that have been ranked.

Once the absolute values of the difference scores, |D|, have been ranked, the sign of each difference score is placed in front of its rank. Finally, sum the ranks with a positive sign and call this quantity ΣR+. The sum of the ranks with a negative sign is called ΣR-.

To check the accuracy of the values of ΣR+ and ΣR-, this equation should be true:

ΣR+ + ΣR- = n(n + 1) / 2

If the sample is derived from a population in which θD = 0, then ΣR+ will equal ΣR-. If ΣR+ > ΣR-, it is likely that condition 1 represents a population with higher scores than the population represented by condition 2. If ΣR+ < ΣR-, it is likely that condition 2 represents a population with higher scores than the population represented by condition 1. To show whether the difference is significant, one needs a test statistic: the smaller of ΣR+ and ΣR- is the Wilcoxon test statistic. Use the Table of Critical T Values for Wilcoxon’s Signed-Ranks Test (see Part III) and allow n to be the number of signed ranks. The null hypothesis can be rejected if the test statistic is equal to or less than the tabled critical value at the prespecified level of significance.
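The ranking guidelines above (drop zero differences, rank by absolute value, average tied ranks) can be sketched as follows. The function name and data are hypothetical; the tabled critical-value lookup is left to the reader.

```python
def wilcoxon_T(x1, x2):
    """Wilcoxon matched-pairs statistic: rank |D| with average ranks for
    ties, sum the ranks by sign, return (T, sum R+, sum R-)."""
    d = [a - b for a, b in zip(x1, x2) if a != b]   # zero differences dropped
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(d):                               # average ranks over tied |D|
        j = i
        while j + 1 < len(d) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        mid = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = mid
        i = j + 1
    r_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    r_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    assert r_plus + r_minus == len(d) * (len(d) + 1) / 2   # accuracy check
    return min(r_plus, r_minus), r_plus, r_minus

# Four hypothetical pairs; the (9, 9) pair is a tie and is dropped
T, rp, rm = wilcoxon_T([10, 8, 6, 9], [7, 9, 1, 9])
```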


Three or More Independent Observations

Testing three or more samples is an extension of the two-sample test, but allows for testing across all samples. The example cited earlier comparing readability levels for Shakespeare and Bacon could be extended to compare these two with other authors (such as Hobbes). This related technique is called Analysis of Variance; it allows you to consider data from several samples at the same time and to try to distinguish systematic differences between sample groups from the chance variation found within each group. By comparing three or more samples, we can compare groups of documents with respect to authorship, concepts, genre differences and other variables. In this section we will look at methods for comparing three or more authors, comparing one author against other authors, comparing one of an author's genres against that author's other genres and comparing concepts across various media. We will also look at how these techniques can be used in comparing strings, word baskets, collocations and concordance use. These samples need to be independent both within and among the several samples under study.


One-Way Analysis of Variance

Analysis of Variance is usually abbreviated with the acronym ANOVA. ANOVA procedures are employed to evaluate whether or not there is a difference between at least two means in a set of data for which two or more means can be computed. The test statistic computed for an analysis of variance is based on the F distribution. F values are always positive. If the F value is significant, then there is a significant difference between at least two of the sample means in the set of k means. As a result, the researcher can conclude that there is a high likelihood that at least two of the samples represent populations with different mean values.

ASSUMPTIONS

a) Each sample is randomly selected from the population it represents.
b) The distribution of data in the underlying population from which each of the samples is derived is normal.
c) Homogeneity of variance (the variances of the k underlying populations represented by the k samples are equal to one another).

HYPOTHESES

H0: µ1 = µ2 = µ3

The mean of the population represented by group 1 equals the mean of the population represented by group 2 equals the mean of the population represented by group 3. All groups have the same mean.

H1: Not H0

At least two of the k population means are not equal to each other.

TEST STATISTIC

Create a table to summarize the data. Allow each group to have a set of columns. The scores of the subjects in group 1 are listed in the column labeled X1, the scores of the subjects in group 2 in the column labeled X2 and so on. Next to the columns X1, X2, etc., create columns X1², X2² and X3². Within these columns, list the squares of the scores of the subjects in each of the three groups. Within each group, calculate the sums of each column: ΣX1 and ΣX1² for group 1, ΣX2 and ΣX2² for group 2, and so on for each group. The notation n represents the number of subjects in each group; thus n1 is the number of subjects in group 1 and n2 is the number of subjects in group 2. Then the means can be calculated for each group. The mean of group 1 is X̄1 = ΣX1 / n1; similarly, the mean of group 2 is X̄2 = ΣX2 / n2. The notation N represents the total number of subjects employed in the experiment:

N = n1 + n2 + ... + nk

The value ΣXT represents the total sum of the scores of the N subjects who participate in the experiment:

ΣXT = ΣX1 + ΣX2 + ... + ΣXk

X̄T represents the grand mean, calculated as X̄T = ΣXT / N.

The total sum of the squared scores of the N subjects who participate in the experiment is:

ΣXT² = ΣX1² + ΣX2² + ... + ΣXk²

To compute the total variability, the total sum of squares, SST, and the between-groups sum of squares, SSBG, need to be calculated:

SST = ΣXT² - (ΣXT)² / N

SSBG = Σ_{j=1..k} [ (ΣXj)² / nj ] - (ΣXT)² / N

And since SSWG = SST - SSBG, we can easily calculate SSWG.

MSBG = SSBG / dfBG, where dfBG = k - 1

MSWG = SSWG / dfWG, where dfWG = N - k

Finally, the test statistic is:

F = MSBG / MSWG

The obtained F value is evaluated with the Table of the F Distribution in the appendix. In the table the critical values are listed in reference to the number of degrees of freedom associated with the numerator and the denominator of the F ratio. In order to reject the null hypothesis, the obtained F value must be equal to or greater than the tabled critical value at the prespecified level of significance.
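The sums-of-squares computations above translate directly into code. A minimal Python sketch, with a hypothetical function name and hypothetical word-rate scores for three authors:

```python
def one_way_anova(groups):
    """One-way ANOVA via the sums-of-squares formulas; returns (F, df_bg, df_wg)."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    sum_t = sum(sum(g) for g in groups)             # Sigma X_T
    sum_t2 = sum(x * x for g in groups for x in g)  # Sigma X_T^2
    c = sum_t ** 2 / N                              # correction term (Sigma X_T)^2 / N
    ss_total = sum_t2 - c                           # SS_T
    ss_bg = sum(sum(g) ** 2 / len(g) for g in groups) - c   # SS_BG
    ss_wg = ss_total - ss_bg                        # SS_WG = SS_T - SS_BG
    ms_bg = ss_bg / (k - 1)
    ms_wg = ss_wg / (N - k)
    return ms_bg / ms_wg, k - 1, N - k

# Hypothetical scores for three groups of documents
F, df1, df2 = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

The returned F is compared against the tabled critical F for (k - 1, N - k) degrees of freedom.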


Extension of Sign Test

This is a simple extension of the sign test presented earlier in the handbook.

ASSUMPTIONS

a) Each sample is a random sample of size ni drawn from one of c populations of interest with unknown medians M1, M2, …, Mc.
b) The observations are independent both within and among samples.
c) The measurement scale employed is at least ordinal.
d) If all populations have the same median, then for each population the probability p that an observed value exceeds the grand median is the same.

HYPOTHESES

H0: M1 = M2 = … = Mc
H1: at least one population has a median different from at least one of the others.

TEST STATISTIC

First compute a grand median for all the data. Then, for each sample, determine how many of the observations are above or below that median. The data are easiest to handle when organized into a contingency table.

Combine the c samples, order them and compute the combined sample median. Then classify each observation according to the sample (or population) to which it belongs and according to whether it is larger than, or less than or equal to, the grand median. In the two-way contingency table, each column represents a sample. There are two rows: one representing observations greater than the grand median and the other representing those that are less than or equal to the grand median. Sum across the columns of the table to get the total number of observations that are greater than the grand median; this is a. Similarly, b is the total number of observations that are equal to or less than the grand median.

O1i = the number of observations in the ith sample greater than the grand median
O2i = the number of observations in the ith sample less than or equal to the grand median

Sample      1    2    …    c    Total
> Median    O11  O12  …    O1c  a
≤ Median    O21  O22  …    O2c  b
TOTAL       n1   n2   …    nc   N

The test statistic is:

T = (N² / ab) Σ_{i=1..c} (O1i - ni·a/N)² / ni

This test statistic is approximately distributed as a chi-square with c - 1 degrees of freedom. If the calculated value of the test statistic is greater than the tabulated value of chi-square for c - 1 degrees of freedom and α, then we reject the null hypothesis of equal population medians at the α level of significance.
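The grand-median classification and the T statistic can be sketched as below. The function name and the two small samples are hypothetical; real text data would be word counts or ratios per source.

```python
def median_test(samples):
    """Extension-of-sign-test statistic T, approximately chi-square, df = c - 1."""
    pooled = sorted(x for s in samples for x in s)
    N = len(pooled)
    # combined (grand) median of all observations
    grand = pooled[N // 2] if N % 2 else (pooled[N // 2 - 1] + pooled[N // 2]) / 2
    o1 = [sum(1 for x in s if x > grand) for s in samples]  # O_1i counts
    a = sum(o1)                                             # total above the grand median
    b = N - a                                               # total at or below
    T = (N ** 2 / (a * b)) * sum(
        (o - len(s) * a / N) ** 2 / len(s) for o, s in zip(o1, samples)
    )
    return T, len(samples) - 1

T, df = median_test([[1, 2, 3, 4], [5, 6, 7, 8]])
```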


Kruskal-Wallis Test

This test is essentially an extension of the Mann-Whitney test. It is a nonparametric test employed with ordinal data. A significant test result indicates a significant difference between at least two of the sample medians in the set of k medians, which implies a high likelihood that at least two of the samples represent populations with different median values.

ASSUMPTIONS

a) Each sample is randomly selected from the population it represents.
b) The k samples are independent of one another.
c) The dependent variable is a continuous random variable.
d) The underlying distributions are identical in shape, but they do not have to be normally distributed. This assumption is important because it implies equal dispersion of the data.

HYPOTHESES

H0: θ1 = θ2 = θ3

The median of each group is equal.

H1: at least one of θ1, θ2 or θ3 is unequal.

There is a difference between at least two of the k population medians. This alternative hypothesis is always nondirectional.

TEST STATISTIC

The ranking protocol is the same as for Mann-Whitney. After rank-ordering all N subjects, the sum of the ranks is computed for each group, ΣRj. So, if there are three groups, there will be three sums: ΣR1, ΣR2, ΣR3. The Kruskal-Wallis test statistic is:

H = [ 12 / (N(N + 1)) ] Σ_{j=1..k} (ΣRj)² / nj - 3(N + 1)

In order to reject the null hypothesis, the computed value H must be equal to or greater than the tabled critical chi-square value at the prespecified level of significance. Evaluate H with the Table of the Chi-Square Distribution, with df = k - 1, where k is the number of groups.
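A sketch of the pooled ranking and the H formula, with average ranks assigned to tied values. Function name and data are hypothetical.

```python
def kruskal_wallis(groups):
    """Kruskal-Wallis H: pooled ranks (midranks for ties), then the
    12/(N(N+1)) formula; returns (H, df)."""
    labeled = sorted((x, gi) for gi, g in enumerate(groups) for x in g)
    N = len(labeled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < N:                        # average ranks over runs of tied values
        j = i
        while j + 1 < N and labeled[j + 1][0] == labeled[i][0]:
            j += 1
        mid = (i + j) / 2 + 1
        for m in range(i, j + 1):
            rank_sums[labeled[m][1]] += mid
        i = j + 1
    H = 12 / (N * (N + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (N + 1)
    return H, len(groups) - 1

H, df = kruskal_wallis([[1, 2], [3, 4]])
```

H is then compared against the chi-square critical value with k - 1 degrees of freedom.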


Multiple Comparisons

After rejecting the null hypothesis in the Kruskal-Wallis test and concluding that not all sampled populations are identical, it is natural to ask which populations are different from which others. The best way to determine the differences between the groups is through a multiple-comparison method. When the multiple-comparison procedure is used, it is important to employ an experiment-wise error rate. This is a conservative approach to making multiple comparisons.

In order to use the multiple-comparison procedure, the first step is to obtain the mean of the ranks for each sample, symbolized, for example, by R̄j for the mean of the jth sample and R̄i for the mean of the ith sample. Next, select an experiment-wise error rate of α, which will be the overall level of significance. The choice of α is in part determined by k, the number of samples involved. For k samples, there will be a total of k(k - 1)/2 pairs of samples that can be compared. When making multiple comparisons with an experiment-wise error rate, select a value of α larger than those customarily encountered in single-comparison inference procedures, for example 0.15, 0.20 or perhaps 0.25, depending on the size of k.

Next, find in the Table of the Normal Distribution the value of z that has α/[k(k - 1)] area to its right. Finally, form the inequality

|R̄i - R̄j| ≤ z √{ [ N(N + 1) / 12 ] (1/ni + 1/nj) },

where N is the number of observations in all the samples combined. If the k samples are all of the same size, the inequality reduces to

|R̄i - R̄j| ≤ z √[ k(N + 1) / 6 ].

Any difference |R̄i - R̄j| that is larger than the right-hand side of the inequality is declared significant at the α level. Remember to take the direction of the difference into account.

If there are extensive ties in the data, the above inequalities can be adjusted to ensure a conservative result. Adjusting for ties, the appropriate inequality for unequal sample sizes is

|R̄i - R̄j| ≤ z √{ [ N(N² - 1) - Σ(t³ - t) ] / [ 12(N - 1) ] · (1/ni + 1/nj) },

and the appropriate inequality for equal sample sizes is

|R̄i - R̄j| ≤ z √{ k[ N(N² - 1) - Σ(t³ - t) ] / [ 6N(N - 1) ] },

where t is the number of values in the combined sample that are tied at a given value, and the sum runs over the groups of ties. The adjustment usually has a negligible effect on the results.

Comparing all treatments with a control

In some research situations, one of the k treatments is a control and it is interesting to compare each treatment with the control condition. The procedure is the same as the multiple-comparison procedure described above. The only change is that the appropriate z will have α divided by 2(k - 1), because there will be k - 1 comparisons made.

Three or More Related Samples

As with the extension of the two-sample tests for paired observations, this is an extension of the two-sample paired test. The statistical analysis is based on the differences between the observations and not on the observations themselves. As we mentioned earlier, a common paired experiment is to study before-and-after effects. By using two-way Analysis of Variance you can consider data from several samples at the same time and try to distinguish systematic differences between sample groups from the chance variation found within each group.


Two-Way ANOVA

This test attempts to determine whether, in a set of k dependent samples (where k is at least 2), at least two of the samples represent populations with different mean values. If the computed test statistic is significant, it indicates a significant difference between at least two of the sample means in the set of k means. In order to compute the test statistic for the two-way ANOVA, two variability components are calculated and compared: the between-conditions variability and the residual variability.

ASSUMPTIONS

a) The sample of n subjects has been randomly selected from the population it represents.
b) The distribution of data in the underlying populations that each of the experimental conditions represents is normal.
c) The sphericity assumption is made (that the underlying population variances and covariances are equal).

HYPOTHESES

H0: µ1 = µ2 = µ3. In words: the mean of the population represented by group 1 equals the mean of the population represented by group 2 equals the mean of the population represented by group 3; all groups have the same mean.

H1: Not H0; at least two of the k population means are not equal to each other.

TEST STATISTIC

The computations are very similar to the one-way ANOVA with independent samples; these are the necessary changes. Allow n1 = n2 = n3 = n. ΣSi is the sum of the scores for subject i under conditions 1, 2 and 3.

For three conditions, the between-conditions sum of squares is:

SSBC = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 - (ΣXT)²/N

The between-subjects component of the variability is:

SSBS = Σ_{i=1..n} (ΣSi)²/k - (ΣXT)²/N

Use the equality:

SSres = SST - SSBC - SSBS

MSBC = SSBC / dfBC, where dfBC = k - 1
MSBS = SSBS / dfBS, where dfBS = n - 1
MSres = SSres / dfres, where dfres = (n - 1)(k - 1)
dfT = N - 1

Finally, the test statistic is:

F = MSBC / MSres

The obtained value of F is evaluated with the Table of the F Distribution in the appendix. In this table, the critical values are listed in reference to the number of degrees of freedom associated with the numerator and the denominator of the F ratio. As the F ratio is calculated here, the degrees of freedom for the numerator are dfBC and the degrees of freedom for the denominator are dfres. In order to reject the null hypothesis, the obtained F value must be equal to or greater than the tabled critical value at the prespecified level of significance.
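The repeated-measures computation above can be sketched as follows, where scores[i][j] holds subject i's score under condition j. The function name and the three-subject data set are hypothetical.

```python
def two_way_anova(scores):
    """Repeated-measures F = MS_BC / MS_res; returns (F, df_bc, df_res)."""
    n, k = len(scores), len(scores[0])
    N = n * k
    total = sum(x for row in scores for x in row)           # Sigma X_T
    c = total ** 2 / N                                      # (Sigma X_T)^2 / N
    ss_t = sum(x * x for row in scores for x in row) - c    # SS_T
    ss_bc = sum(sum(r[j] for r in scores) ** 2 / n for j in range(k)) - c
    ss_bs = sum(sum(row) ** 2 / k for row in scores) - c    # (Sigma S_i)^2 / k terms
    ss_res = ss_t - ss_bc - ss_bs
    ms_bc = ss_bc / (k - 1)
    ms_res = ss_res / ((n - 1) * (k - 1))
    return ms_bc / ms_res, k - 1, (n - 1) * (k - 1)

# Three hypothetical subjects measured under two conditions
F, df_bc, df_res = two_way_anova([[1, 2], [2, 4], [3, 3]])
```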


Friedman Two-Way ANOVA by Ranks

This is a nonparametric test employed with ordinal data, using two or more dependent samples. A significant result from this test indicates a significant difference between at least two of the sample medians in the set of k medians. This implies a high likelihood that at least two of the samples represent populations with different median values. Throughout the explanation of this test, k is the number of groups or samples being compared.

This test is very useful when analyzing continuous data with skewed distributions. Since the test assesses medians based on ranks, it is possible to make mistakes; always examine both the means and the medians to make sure the test makes sense. This test is restricted to within-group comparisons only.

ASSUMPTIONS

a) The sample was randomly selected from the population it represents.
b) The dependent variable (the variable to be analyzed) is a continuous random variable.
c) The samples of data are either
   i. multiple observations from a single subject across more than two time periods or conditions, or
   ii. blocks of matched subjects in which the subjects from a given block are randomly assigned to one of the two or more conditions.
d) The subjects or blocks of subjects are independent.
e) There are no interactions between treatments and blocks.

HYPOTHESES

H0: θ1 = θ2 = θ3; there is no difference in the medians (average ranks) of the samples.
H1: at least one of θ1, θ2 or θ3 is unequal.

TEST STATISTIC

Depending on how the data are gathered, different vocabulary may be used to describe the same thing. Just remember that, in this case, treatment and condition are synonyms and a block is the same as a subject. In the computations for this test, there are i = 1, …, b blocks (subjects) and j = 1, …, k treatments (conditions). The ranking procedure for this test involves ranking each of the k scores within each subject. Thus, for each subject, a rank of 1 is assigned to the subject's lowest score, a rank of 2 to the subject's middle score and a rank of 3 to the highest score (if there are three groups or conditions). In the event of tied scores, use the same process as other rank-order statistics; specifically, refer to the Mann-Whitney process of ranking. If the data are gathered as blocks of matched subjects, then obtain the ranks within blocks. It is permissible in this test to reverse the ranking protocol; this will yield the same value for the Friedman test statistic.

After rank-ordering, the sum of the ranks is computed for each of the experimental conditions (treatments). So, if there are three conditions, there will be three sums: ΣR1, ΣR2, ΣR3. Then the test statistic is calculated:

χ²r = [ 12 / (nk(k + 1)) ] Σ_{j=1..k} (ΣRj)² - 3n(k + 1)


In order to reject the null hypothesis, the computed χ²r must be equal to or greater than the tabled critical chi-square value at the prespecified level of significance. Use the Table of the Chi-Square Distribution with df = k - 1 (see Part III).
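The within-subject ranking and the χ²r formula can be sketched as below; average ranks are used for within-subject ties. Function name and data are hypothetical.

```python
def friedman(scores):
    """Friedman chi-square-r: rank the k scores within each subject
    (average ranks for ties), then apply the 12/(nk(k+1)) formula."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:                       # midranks within this subject
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mid = (i + j) / 2 + 1
            for m in range(i, j + 1):
                rank_sums[order[m]] += mid
            i = j + 1
    chi2_r = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    return chi2_r, k - 1

# Three hypothetical subjects, each scored under three conditions
chi2_r, df = friedman([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
```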


Multiple Comparison Procedure for Use with the Friedman Test

Often it isn't enough to know simply that at least one of the groups is different from the others; it can be important to know exactly where the differences are located. In that case, it is appropriate to use a multiple-comparison procedure after the Friedman test.

TEST STATISTIC

When we compare all possible differences between pairs of samples, when the experiment-wise error rate is α and when the number of blocks is large, we declare Rj and Rj′ significantly different if

|Rj - Rj′| ≥ z √[ bk(k + 1) / 6 ],

where Rj and Rj′ are the jth and j′th treatment rank totals and z is the value from the Normal Distribution Table corresponding to α/[k(k - 1)].

Alternatively, the test statistic can be computed as

|Ri - Rj| > t_{(1 - α/2), (b - 1)(k - 1)} √[ 2b(A - B) / ((b - 1)(k - 1)) ],

where A = Σ_{i=1..b} Σ_{j=1..k} [R(Xij)]² and B = (1/b) Σ_{j=1..k} Rj².

Many research situations involve the comparison of two or more treatments, one of which is a control condition. As soon as the researcher concludes that there is a difference among the treatment effects, interest usually focuses on determining which of the other treatments exhibit an effect that is different from the control effect. This technique is an extension of the familiar sign test. Here are the steps (as explained in Wayne Daniel's "Applied Nonparametric Statistics"):

1. Represent by xi0 and xij (i = 1, …, b and j = 1, …, k) the responses to the control and the jth treatment in the ith block of a randomized complete block design. Here k is the number of treatments, excluding the control condition.

2. Compute the signed differences dij = xij - xi0. In other words, pair each treatment with the control condition and, in each block of this pairing, subtract the control measurement from the treatment measurement. There will be k pairings, each containing b differences.

3. Let rj be the number of differences, dij, that have the less frequently occurring sign (either positive or negative) within a pairing of a treatment with the control.

4. Let M0 be the median response of a population of subjects or objects experiencing the control condition and Mj be the median response of a population of objects or subjects receiving the jth treatment. Apply one of the following decision rules:

a. For testing H0: Mj ≥ M0 against H1: Mj < M0, reject H0 if the number of plus signs is less than or equal to the critical value of rj appearing in the “Table of Critical Values of Minimum rj for Comparison of k Treatments Against One Control in b Sets of Observations” for k (the number of treatments excluding the control), b and the chosen experiment-wise error rate.

b. For testing H0: Mj ≤ M0 against H1: Mj > M0, reject H0 if the number of minus signs is equal to or less than the critical value of rj appearing in the Table for k, b and the chosen experiment-wise error rate.


c. For testing H0: Mj = M0, against H1: Mj ≠ M0, reject H0 if the number of minus signs or the number of plus signs (whichever is fewer) is equal to or less than the critical value of rj appearing in the Table for k, b and the chosen experiment-wise error rate.
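The large-b pairwise rule |Rj - Rj′| ≥ z √[bk(k + 1)/6] can be sketched with Python's statistics.NormalDist supplying the z value. The function name, the α default and the rank totals are illustrative.

```python
from statistics import NormalDist

def friedman_pairs(rank_totals, b, alpha=0.15):
    """Flag treatment pairs whose rank-total difference meets the
    large-b multiple-comparison bound (experiment-wise error rate alpha)."""
    k = len(rank_totals)
    # z with alpha / [k(k-1)] area to its right
    z = NormalDist().inv_cdf(1 - alpha / (k * (k - 1)))
    bound = z * (b * k * (k + 1) / 6) ** 0.5
    return [((i, j), abs(rank_totals[i] - rank_totals[j]) >= bound)
            for i in range(k) for j in range(i + 1, k)]

# Rank totals from a hypothetical b = 3, k = 3 Friedman layout
flags = friedman_pairs([3, 6, 9], b=3, alpha=0.15)
```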


Use of Aligned Ranks (Hodges-Lehmann)

The Friedman test is based on b sets of ranks (or b subjects), and the treatments are ranked separately in each set. This allows only for intrablock comparisons. When the number of treatments is small, there may be situations where it is desirable to compare among blocks. In these situations it is appropriate to employ aligned ranks.

TEST STATISTIC

Aligned ranks involve subtracting from each observation within a block some measure of location (the mean or median of the block). The differences are called aligned observations. These aligned observations keep their identities with respect to the block-and-treatment combination to which they belong and are ranked from 1 to kb with respect to each other. In other words, the ranking scheme is the same as that employed with the Kruskal-Wallis test. The ranks assigned to the aligned observations are called aligned ranks. In the absence of ties, the aligned-ranks test statistic may be written as

T = (k - 1) [ Σ_{j=1..k} R̂.j² - (kb²/4)(kb + 1)² ] / { [ kb(kb + 1)(2kb + 1) ] / 6 - (1/k) Σ_{i=1..b} R̂i.² },

where i = 1, …, b indexes blocks (subjects) and j = 1, …, k indexes treatments (conditions), R̂i. is the rank total of the ith block and R̂.j is the rank total of the jth treatment.

If ties are present, replace the denominator of T with

Σ_{i=1..b} Σ_{j=1..k} R̂ij² - (1/k) Σ_{i=1..b} R̂i.².

In order to reject the null hypothesis, the computed test statistic must be equal to or greater than the tabled critical chi-square value at the prespecified level of significance. Use the Table of the Chi-Square Distribution with df = k - 1 (see Part III).
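A sketch of the no-ties form of the statistic, aligning each block by its mean. Function name and the two-block data set are hypothetical.

```python
def aligned_ranks_T(scores):
    """Hodges-Lehmann aligned-ranks statistic (no-ties form).
    scores[i][j] is block i under treatment j."""
    b, k = len(scores), len(scores[0])
    # align: subtract each block's mean from its observations
    aligned = [[x - sum(row) / k for x in row] for row in scores]
    flat = sorted((v, i, j) for i, row in enumerate(aligned)
                  for j, v in enumerate(row))
    block_tot = [0.0] * b                      # R-hat_i. rank totals by block
    treat_tot = [0.0] * k                      # R-hat_.j rank totals by treatment
    for r, (_, i, j) in enumerate(flat, start=1):
        block_tot[i] += r
        treat_tot[j] += r
    num = (k - 1) * (sum(R ** 2 for R in treat_tot)
                     - k * b ** 2 * (k * b + 1) ** 2 / 4)
    den = (k * b * (k * b + 1) * (2 * k * b + 1) / 6
           - sum(R ** 2 for R in block_tot) / k)
    return num / den                           # compare to chi-square, df = k - 1

T = aligned_ranks_T([[1, 2, 4], [2, 3, 6]])
```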


Page’s Test for Ordered Alternatives There are certain multi-sample situations in which an ordered alternative

hypothesis is more meaningful than one in which order is ignored. This is a procedure that is appropriate in two-way analysis of variance hypothesis-testing situations in which an ordered alternative is meaningful. The assumptions are the same as those for the Friedman test. HYPOTHESES Allow τj to represent the effect of the jth treatment. The hypotheses are as follows:

kH τττ ~...~~: 210 === , in words: all treatment effects are equal. In the alternative hypothesis, the treatment effects are ordered.

kH τττ ≤≤≤ ...: 211 TEST STATISTIC

The test statistic is ∑=

+++==k

jkj kRRRjRL

121 ...2 , where R1,…,Rk are the treatment

rank sums obtained in the manner explained in the discussion of the Friedman test. If the treatment effects are ordered as specified in the alternative hypothesis, then

Rj tends to be larger than Ri for i < j. In other words, if there are three treatments and their effects are ordered according to H1, then R1 tends to be smaller than R2 and R2 in turn is smaller than R3. Since the treatment rank sums are weighted by the index of their true position in the ordering specified by H1, L tends to be large when H1 is true.

Reject H0 at the α level of significance if the computed value of L is greater than or equal to the critical value of L for k, b and α given in the Table of Selected Critical Values of L, for Page’s Ordered Alternatives Test (see Part III.)
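The computation of L is simple enough to sketch in a few lines of Python (the function name and example rank sums are illustrative, not from the manual):

```python
# Page's L statistic from treatment rank sums R_1, ..., R_k,
# listed in the order specified by the alternative hypothesis H1.
def pages_L(rank_sums):
    # L = 1*R_1 + 2*R_2 + ... + k*R_k
    return sum(j * R for j, R in enumerate(rank_sums, start=1))

# Hypothetical example: b = 3 blocks each rank k = 3 treatments in the
# order 1 < 2 < 3, giving rank sums R = (3, 6, 9) -- perfect agreement with H1.
print(pages_L([3, 6, 9]))  # 42
```

Because the weights 1, 2, …, k increase, rank sums that grow in the hypothesized order maximize L.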


Section F. Association, Trend and Slope Comparisons and Time Series

In the previous sections we have been concerned with only one type of data at a time (e.g. readability levels, word counts, etc.) The data may have been individual counts, baskets of words, or ratios, but each statistic we have been exploring describes a single set of data. In this section we will look at methods of studying relationships between two different measures. For example, you may be looking at the relationship between the pace and the entropy of an author, you may be interested in comparing the use of a particular concept word over time, or you may be interested in comparing the slope of one author's wordprint to the slope of another author's wordprint. In this section we will cover methods of determining correlation, predictive trends and ratio slope comparisons. We examine correlation where, for example, the use of one word dwindles over time as the use of another becomes more frequent. Note that evidence of correlation does not imply causation.

Scattergram (Scatterplot)

A scattergram is used to display graphically the relationship between two different measurements. The pattern of the scatterplot represents the relationship between the two measurements. Let's say that these example scatterplots are representative of age and reading level. Scatterplot A shows a logical positive relationship between age and reading level. Scatterplot B shows an illogical "negative" relationship between age and reading level.

[Scatterplots A and B: Age (horizontal axis) vs. Reading Level (vertical axis)]

[Scatterplots C and D: Age (horizontal axis) vs. Reading Level (vertical axis)]

Scatterplot C shows a positive relationship, but also shows a definite pattern. Scatterplot D shows no relationship between the two variables.

Determining Association (Correlation)

It is also possible to summarize the relationship between two measures quantitatively using a correlation coefficient. Coefficients generally represent correlation between +1 (positive correlation) and -1 (negative correlation), where 0 shows no correlation.

Correlation can be found between authors, topics, genres and respondents. For example, when analyzing various authors a comparison can be made between unique vocabulary and age of the author.

When analyzing market survey data, after classifying and coding the verbatim responses the codes can be examined by finding correlations between response and sex, age, educational level, or race.


Pearson Product-Moment Correlation Coefficient

This correlation coefficient measures the strength of association between two variables, X and Y. It is used for comparing observations from a bivariate population and only detects linear relationships. It is only appropriate if it is assumed that the distribution of the sampled population is a bivariate normal distribution. Clearly, the assumption of bivariate normal distribution makes this a parametric test. The calculations for this coefficient are shown only for reference, but not much will be said on the use of the coefficient. The data can be summarized as demonstrated in this table:

Subject    X     Y
   1       X1    Y1
   2       X2    Y2
   …       …     …
   n       Xn    Yn

The correlation coefficient is

r = Σ_{i=1}^{n} (Xi - X̄)(Yi - Ȳ) / √[ Σ_{i=1}^{n} (Xi - X̄)² · Σ_{i=1}^{n} (Yi - Ȳ)² ]

r is bounded between -1 and 1:
when r = 0 there is no correlation
when r = 1 there is a perfect positive correlation
when r = -1 there is a perfect negative correlation
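As a minimal sketch of the calculation (the function name and example data are hypothetical):

```python
import math

def pearson_r(x, y):
    # r = sum((X_i - Xbar)(Y_i - Ybar)) /
    #     sqrt(sum((X_i - Xbar)^2) * sum((Y_i - Ybar)^2))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0  (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0 (perfect negative)
```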


Spearman Rank Correlation

This represents a correlation between ranks and can also be used as a test of the independence of ranks. One of the nice things about this correlation coefficient is that it uses all of the data. Spearman's Rank doesn't throw out zeroes or ties.

ASSUMPTIONS
a) The data is a random sample of n pairs of numeric or non-numeric observations (X1, Y1), …, (Xn, Yn).
b) Each pair of observations represents two measurements taken on the same object or individual; this object or individual is called the unit of association.
c) The data is not (necessarily) normally distributed.

HYPOTHESES
H0: X and Y are independent.
H1: X and Y are directly or inversely related,
or there is a direct relationship between X and Y,
or there is an inverse relationship between X and Y.

TEST STATISTIC
You have n pairs of data (Xi, Yi). Rank each Xi relative to all the other X, such that min R(Xi) = 1 and max R(Xi) = n. Rank the Yi relative to the other Y in the same fashion. R(Xi) = 1 if it is the smallest observed value for X; similarly R(Yi) = 1 if it is the smallest observed value for Y. Now compute the test statistic. When there are no ties: define the difference di to be the difference between the rank of the X and the Y,

di = R(Xi) - R(Yi), and then calculate the Spearman Rank correlation:

r_s = 1 - 6 Σ_{i=1}^{n} di² / (n(n² - 1))

If ties occur, each tied value is assigned the mean of the rank positions for which it is tied. When there are ties the test statistic is:

r_s = [ Σ_{i=1}^{n} R(Xi)R(Yi) - n((n+1)/2)² ] / { √[ Σ_{i=1}^{n} R(Xi)² - n((n+1)/2)² ] · √[ Σ_{i=1}^{n} R(Yi)² - n((n+1)/2)² ] }

r_s is a measure of association; it measures the degree of correspondence between the ranks of the observations, rather than the observations themselves. After r_s has been calculated, compare it to Qα in the table Quantiles of Spearman's Rank Correlation (see Part III.)
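A sketch in Python, assigning mean ranks to tied values and then applying the no-ties formula (the helper names are illustrative; for heavily tied data the tie-corrected formula should be preferred):

```python
def mean_ranks(values):
    # Each tied value receives the mean of the rank positions it ties for.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # ranks are 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    # r_s = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), d_i = R(X_i) - R(Y_i)
    rx, ry = mean_ranks(x), mean_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman_rs([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(spearman_rs([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```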


Kendall's Tau

This correlation coefficient is computationally complex. It is based on ranks of observations. Before it can be calculated, some definitions must be established.

Definitions:
Concordance: two pairs (Xi, Yi) and (Xj, Yj) are said to be concordant if the difference between Xi and Xj is in the same direction as the difference between Yi and Yj: Xi > Xj and Yi > Yj, or Xi < Xj and Yi < Yj. Essentially, the change in the x and the change in the y are in the same direction.
Discordance: two pairs (Xi, Yi) and (Xj, Yj) are said to be discordant if the difference between Xi and Xj is in a different direction from the difference between Yi and Yj: Xi > Xj and Yi < Yj, or Xi < Xj and Yi > Yj.
Equality is neither concordant nor discordant.

Kendall's Tau measures the probability of concordance minus the probability of discordance.
When τ = 0, X and Y are independent.
When τ > 0, there is a positive (direct) association between X and Y.
When τ < 0, X and Y are inversely related.

ASSUMPTIONS
a) The data consist of a random sample of n observations of pairs (Xi, Yi) of numeric or non-numeric observations. Each pair of observations represents two measurements taken on the same unit of association.
b) The data is measured (at least) on an ordinal scale.

HYPOTHESES
H0: X and Y are independent
H1: τ ≠ 0, or τ > 0, or τ < 0

TEST STATISTIC

N_C is the number of concordant pairs.
N_D is the number of discordant pairs.

τ = (N_C - N_D) / (n(n-1)/2)

The test statistic is T = N_C - N_D. This needs to be compared to Qα in the Table "Quantiles of the Kendall Test Statistic".

An alternative way of doing the computations: the test statistic is τ̂ = S / (n(n-1)/2), where n is the number of observations (or ranks).

To obtain S, follow these steps:


1. Arrange the observations (Xi, Yi) in a column according to the magnitude of the X’s, with the smallest X first (this is natural order).

2. Compare each Y value, one at a time, with each Y value appearing below it. In making these comparisons, we say that a pair of Y values (one Y being compared to the Y below it) is in natural order if the Y below is larger than the Y above. A pair of Y values is in reverse natural order if Y below is smaller than Y above.

3. Let N_C be the number of pairs in natural order and N_D be the number of pairs in reverse natural order.

4. S = N_C - N_D
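The steps above can be sketched as follows (names and data are illustrative; tied observations are simply skipped in this sketch):

```python
def kendall_tau_hat(x, y):
    # Sort pairs into natural order by x, then count pairs of y values
    # in natural order (N_C) and reverse natural order (N_D).
    pairs = sorted(zip(x, y))
    n = len(pairs)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            if pairs[j][1] > pairs[i][1]:
                nc += 1          # natural order
            elif pairs[j][1] < pairs[i][1]:
                nd += 1          # reverse natural order
    S = nc - nd
    return S / (n * (n - 1) / 2)

print(kendall_tau_hat([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(kendall_tau_hat([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```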


Olmstead-Tukey Corner Test of Association

This test is designed to detect the presence of a correlation between two variables X and Y. It places heavy emphasis on the extreme values of the variables. The computations of this test are easy to perform.

ASSUMPTIONS
a) n pairs of observations constitute a random sample
b) measurement is at least ordinal
c) variables are continuous

HYPOTHESES
H0: X and Y are independent
H1: X and Y are correlated

TEST STATISTIC
As described in Wayne W. Daniel's Applied Nonparametric Statistics, these are the steps to apply the corner test of association:

1. Plot the data points as a scatter diagram.
2. Draw a horizontal line through the median Ym of the Y values and a vertical line through the median Xm of the X values.
3. Label the upper right and lower left quadrants with a plus sign and the upper left and lower right quadrants with a minus sign.
4. Beginning at the top of the scatter diagram, proceed downward. Count the number of points encountered before it is necessary to cross the vertical median in order to count the next point. Record the number of points counted and affix the sign of the quadrant in which they occur.
5. Begin at the right of the scatter diagram, move to the left and count points until it is necessary to cross the horizontal median. Record the number of points counted and affix the sign of the quadrant in which they occur.
6. Repeat steps 4 and 5, beginning at the bottom and left of the scatter diagram. Ignore points lying exactly on one of the median lines in the counting procedure and proceed with the counting as if those points were not present.
7. Add the four numbers, observing signs. Take the absolute value of the sum and call it S. This is the test statistic: S = |quadrant sum|.

Decision Rule
Reject H0 at the α level of significance if the entry in the body of the table "Corner Test for Association" corresponding to S and n is equal to or less than alpha.

Phi Coefficient

There are situations when the Pearson Correlation Coefficient is not an appropriate measure of strength of association; specifically, when the two variables are categorical and the data consist of frequencies that may be displayed in a contingency table. The simplest contingency table is a 2x2 table, where the data is collected from a


model with two dichotomous variables, each of which has only two categories. A dichotomous variable is one that can assume only one of two mutually exclusive values. Examples are gender, or marital status. This is an example of a 2x2 contingency table:

                         Variable I categories
Variable II categories      1        2       Total
         1                  a        b       a + b
         2                  c        d       c + d
       Total              a + c    b + d       N

Using the notation of this table, the Phi Coefficient can be calculated:

φ = (ad - bc) / √[ (a + b)(c + d)(a + c)(b + d) ]

The Phi Coefficient is bounded between 1 and -1, or -1 ≤ φ ≤ 1. Using the conversion χ² = nφ², the value of φ can then be compared to tabulated chi-square values with 1 degree of freedom.
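A sketch of both the coefficient and the chi-square conversion (the cell counts are hypothetical):

```python
import math

def phi_coefficient(a, b, c, d):
    # Cell counts of the 2x2 table laid out as in the text:
    #   row 1: a, b    row 2: c, d
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

phi = phi_coefficient(10, 0, 0, 10)  # perfect diagonal association, n = 20
print(phi)             # 1.0
print(20 * phi ** 2)   # chi-square conversion: n * phi^2 = 20.0
```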


Yule's Q Coefficient

Like the Phi Coefficient, the Yule Coefficient is appropriate when working with two dichotomous variables. It is also applicable when the values of the variable can be meaningfully grouped into two distinct categories. Use the example of the 2x2 contingency table included previously to understand the calculations for this coefficient.

Q = (ad - bc) / (ad + bc)

This coefficient describes the strength of association, and -1 ≤ Q ≤ 1.
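A one-line sketch with hypothetical cell counts:

```python
def yules_q(a, b, c, d):
    # Q = (ad - bc) / (ad + bc), using the same 2x2 cell labels as the Phi Coefficient
    return (a * d - b * c) / (a * d + b * c)

print(yules_q(30, 10, 10, 30))  # (900 - 100) / (900 + 100) = 0.8
```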


Goodman-Kruskal Coefficient

This is an extension of the Yule Coefficient from a 2x2 contingency table to a contingency table of size r x c. Suppose the variables of interest, X and Y, are measured on an ordinal scale. Assume the values of X are ordered in magnitude so that X1 < X2 < … < Xr. Similarly, Y values are ordered Y1 < Y2 < … < Yc. Convert the X measurements to ranks 1, 2, …, r and the Y measurements to ranks 1, 2, …, c and use these ranks to label the rows and columns of the contingency table, as shown. The ranks of the Y measurements provide labels for the c columns and the ranks of the X measurements provide labels for the r rows. Assign the data from the subjects to the table. Use the rank of its X value to determine which row it belongs to and use the rank of its Y value to determine its column.

Let P equal the number of pairs of subjects whose X and Y measurements agree with respect to order and let Q equal the number of pairs of subjects whose X and Y measurements disagree with respect to order. Now, the test statistic is G = (P - Q) / (P + Q).

              Y rank
X rank     1     2    …    j    …    c    Total
   1      N11   N12   …   N1j   …   N1c    N1.
   2      N21   N22   …   N2j   …   N2c    N2.
   …
   i      Ni1   Ni2   …   Nij   …   Nic    Ni.
   …
   r      Nr1   Nr2   …   Nrj   …   Nrc    Nr.
 Total    N.1   N.2   …   N.j   …   N.c     N

As described in Wayne W. Daniel's "Applied Nonparametric Statistics," these are the steps for calculation of G. To obtain P, perform the following calculations:

1. Identify the frequency in the upper left-hand corner of the contingency table (the frequency of cell (1,1), N11) as a multiplier. Call it Multiplier P1.
2. Add the frequencies of all remaining cells of the table that are to the right of and below Multiplier P1 and are not in the same row or column as Multiplier P1. Call the result Sum P1.
3. Compute Product P1 = (Multiplier P1) x (Sum P1).
4. Move to the next cell in the same row as Multiplier P1 (cell 1,2). Call its frequency Multiplier P2.
5. Add the frequencies of all cells of the table that are to the right of and below Multiplier P2, but not in the same row or column with it. Call the result Sum P2.
6. Compute Product P2 = (Multiplier P2) x (Sum P2).
7. Proceed as in Steps 1 through 6 until there are no more cells in row 1 that have cells below and to the right and not in the same row or column.
8. Beginning with the first cell in the row, repeat Steps 1 through 7 for each remaining row.
9. Add all products obtained in Steps 1 through 8. The result is P.


To obtain Q, do the following:

1. Identify the frequency of the cell in the upper right-hand corner of the contingency table as Multiplier Q1.
2. Add the frequencies of all cells that are below and to the left of Multiplier Q1, but not in the same row or column with it. Call the result Sum Q1.
3. Product Q1 = (Multiplier Q1) x (Sum Q1).
4. Multiplier Q2 is the frequency of the cell to the immediate left of the cell containing Multiplier Q1.
5. Sum Q2 is the sum of all frequencies in cells that are to the left of and below the cell containing Multiplier Q2, but are not in the same row or column with it.
6. Product Q2 = (Multiplier Q2) x (Sum Q2).
7. Proceed as in Steps 1 through 6 until there are no cells in row 1 that have cells below and to the left and not in the same row or column.
8. Beginning with the last cell in the row, repeat Steps 1 through 7 for each remaining row.
9. Add the products obtained in Steps 1 through 8. The result is Q.
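The two counting procedures amount to multiplying each cell by the total frequency below and to the right (for P) or below and to the left (for Q); a compact sketch (table values are hypothetical):

```python
def goodman_kruskal_gamma(table):
    # table[i][j] = frequency for X-rank i, Y-rank j (rows/columns in rank order).
    r, c = len(table), len(table[0])
    P = Q = 0
    for i in range(r):
        for j in range(c):
            # cells strictly below and to the right -> concordant pairs
            P += table[i][j] * sum(table[a][b] for a in range(i + 1, r)
                                               for b in range(j + 1, c))
            # cells strictly below and to the left -> discordant pairs
            Q += table[i][j] * sum(table[a][b] for a in range(i + 1, r)
                                               for b in range(j))
    return (P - Q) / (P + Q)

print(goodman_kruskal_gamma([[10, 0], [0, 10]]))  # 1.0 (perfect agreement)
print(goodman_kruskal_gamma([[5, 5], [5, 5]]))    # 0.0 (no association)
```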


Cramer's Statistic

This is an appropriate measure of the strength of association between two categorical variables yielding data that may be displayed in a contingency table of any size. The Cramer Coefficient is defined as

C = √[ χ² / (n(t - 1)) ]

where n is the total sample size, t is the smaller of the number of rows or the number of columns in the contingency table, and χ² = Σ (Oi - Ei)² / Ei, summed over the cells of the table. This statistic C can assume values between 0 and 1. When there is no association between the two variables under study, C will be equal to zero. When C is equal to 1 and r = c there is perfect correlation between the two variables. The advantage of this statistic is that it can be used to compare contingency tables of different sizes.
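A sketch of the conversion from a chi-square value to C (the numbers are hypothetical):

```python
import math

def cramers_c(chi2, n, r, c):
    # C = sqrt(chi2 / (n * (t - 1))), t = min(number of rows, number of columns)
    t = min(r, c)
    return math.sqrt(chi2 / (n * (t - 1)))

# Hypothetical: chi-square of 40 from a 2x2 table with n = 40 observations.
print(cramers_c(40, 40, 2, 2))  # 1.0 (perfect association)
```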


Point Biserial Coefficient of Correlation

This test is used when it is desired to assess the strength of the relationship between two variables when one variable is dichotomous and the other is measured on an interval or ratio scale. All subjects will therefore have two measurements of interest. For example, both gender and income might be measured. The correlation between a dichotomous variable and a continuous variable is called biserial correlation. There are two different coefficients to describe this: the biserial correlation coefficient and the point biserial correlation coefficient. Only the latter is discussed in this manual. The discussion of this coefficient and its use as a descriptive statistic are included here because they are part of nonparametric statistics. Making inferences using this biserial correlation coefficient is part of classical statistics and is not included.

Let Y be the dichotomous variable, with possible values 0 and 1. Allow X to be the other variable. The point biserial correlation coefficient is:

r_pb = √(n1 n0 / n) (x̄1 - x̄0) / √[ Σ (xi - x̄)² ]

In this equation, n1 is the number of 1s, n0 is the number of 0s and n1 + n0 = n, the total sample size. x̄1 is the mean value of x for all n1 subjects and x̄0 is the mean value of x for all n0 subjects. -1 ≤ r_pb ≤ 1.
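A sketch of the coefficient, written in a standard equivalent form (the function name and example data are hypothetical):

```python
import math

def point_biserial(x, y):
    # y is dichotomous (0/1), x is interval/ratio.
    # r_pb = sqrt(n1*n0/n) * (mean_1 - mean_0) / sqrt(sum((x_i - xbar)^2))
    n = len(x)
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n1, n0 = len(x1), len(x0)
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    mx = sum(x) / n
    ss = sum((xi - mx) ** 2 for xi in x)
    return math.sqrt(n1 * n0 / n) * (m1 - m0) / math.sqrt(ss)

print(point_biserial([2, 2, 1, 1], [1, 1, 0, 0]))  # 1.0
```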


Chi-Square Test of Independence

When there is no association between two variables, they are called independent. In other words, knowing the value of one variable doesn't help determine the value of the other. Formally, independence means the distribution of one variable in no way depends on the distribution of the other. This test is used to decide whether two variables in a population are independent.

ASSUMPTIONS
a) The data are a random sample of size n from some population of interest.
b) Each observation belongs to one and only one category of each criterion. The criteria are the variables of interest in a given situation.
c) Either the variables are inherently categorical or they can be classified into mutually exclusive categories.

HYPOTHESES
H0: the two criteria of classification are independent.
H1: the criteria are not independent.

TEST STATISTIC
When doing these calculations, refer to the r x c contingency table (see Part III.)

The test statistic is

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (Oij - Eij)² / Eij

where Eij = Ni. N.j / N and Oij = Nij, r is the number of rows in the table and c is the number of columns. Reject the null hypothesis of independence at the α level of significance if the computed value of χ² exceeds the tabulated value of χ²_{1-α} for (r-1)(c-1) degrees of freedom.
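A sketch computing the expected frequencies and the statistic from a contingency table (table values are hypothetical):

```python
def chi_square_statistic(table):
    # E_ij = (row total i) * (column total j) / N; chi2 = sum (O - E)^2 / E
    r, c = len(table), len(table[0])
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    N = sum(row)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            E = row[i] * col[j] / N
            chi2 += (table[i][j] - E) ** 2 / E
    return chi2

print(chi_square_statistic([[10, 10], [10, 10]]))  # 0.0  (observed = expected)
print(chi_square_statistic([[10, 0], [0, 10]]))    # 20.0 (perfect association)
```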


Kendall's Coefficient of Concordance W

When dealing with more than two sets of rankings, the Kendall coefficient of concordance is appropriate. This test is useful when interested in the degree of agreement among several b (for b > 2) sets of rankings of k objects or individuals. Previously, with Spearman's r_s and Kendall's τ, the evaluation concerned the extent of the agreement or disagreement of two sets of rankings. A set of b rankings can be obtained in one of two ways:

1. Rank a group of k objects or individuals on the basis of each of b characteristics.
2. A panel of b judges or observers may rank a group of k objects or individuals on the same characteristic.

Two things are accomplished with this test: measure the strength of agreement among the b sets of rankings and test the null hypothesis of no association among rankings.

ASSUMPTIONS
a) The data consist of b complete sets of observations or measurements on k objects or individuals.
b) The measurement scale is (at least) ordinal.
c) Either the data are organized as ranks, or they can be converted into ranks.

HYPOTHESES
H0: the b sets of rankings are not associated,
or the b judges are assigning ranks to the subjects independently and at random.
H1: the b sets of rankings are associated.

TEST STATISTIC

W = [ 12 Σ_{j=1}^{k} Rj² - 3b²k(k+1)² ] / [ b²k(k² - 1) ]

where b is the number of sets of rankings, k is the number of individuals or objects that are ranked and Rj is the sum of ranks assigned to the jth object or individual.

Sufficiently large values of W lead us to reject the null hypothesis of no association. For small values of b and k, use the Table of Kendall's coefficient of concordance. Reject the null hypothesis at the α level of significance if the value of P in the table corresponding to the appropriate W, b, k is less than or equal to α. For values of b and k not included in the table, compute χ² = b(k - 1)W and compare the value with the values in the chi-square table corresponding to k - 1 degrees of freedom.

When ties occur, the mean of the rank positions for which an observation is tied should be assigned. Then W is adjusted for ties in this way:

W = [ 12 Σ_{j=1}^{k} Rj² - 3b²k(k+1)² ] / [ b²k(k² - 1) - b Σ(t³ - t) ]

where t is the number of observations in any set of rankings tied for a given rank.
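A sketch for the untied case, starting from the rank sums Rj (function name and example data are hypothetical):

```python
def kendalls_w(rank_sums, b):
    # rank_sums[j] = R_j, the total rank of object j over b sets of rankings.
    # W = (12 * sum(R_j^2) - 3 b^2 k (k+1)^2) / (b^2 k (k^2 - 1))
    k = len(rank_sums)
    num = 12 * sum(R * R for R in rank_sums) - 3 * b * b * k * (k + 1) ** 2
    den = b * b * k * (k * k - 1)
    return num / den

# Three judges (b = 3) rank four objects (k = 4) identically: R = (3, 6, 9, 12).
W = kendalls_w([3, 6, 9, 12], 3)
print(W)              # 1.0 (perfect agreement)
print(3 * (4 - 1) * W)  # chi-square approximation: b(k-1)W = 9.0
```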


Partial Correlation Coefficient

The idea with this coefficient is to hold one or more other variables constant when calculating the correlation. This calculation is similar to pair-wise correlations. This test is applicable when investigating relationships among 3 or more random variables. The multiple correlation coefficient provides an overall measure of the correlation among all variables considered together. The partial correlation coefficient, however, measures the correlation between two variables while holding constant one or more of the other variables.

As an example, for three variables, X, Y and Z with a joint multivariate normal distribution, the partial correlation between X and Y, holding Z constant, would be:

r_xy.z = (r_xy - r_xz r_yz) / √[ (1 - r_xz²)(1 - r_yz²) ]

where r_xy, r_xz and r_yz are the respective Pearson product-moment correlations for X and Y, X and Z, and Y and Z.

When the joint distribution of (X, Y, Z) is unknown, try using Spearman's r_s or Kendall's τ̂. To use r_s, substitute appropriate ranks for actual measurements. Then use a statistical software package to perform the usual parametric multiple correlation analysis. Finally, use the partial correlation coefficient that will be computed as part of the output as the partial rank correlation coefficient.

For τ̂, when there are no ties, calculate

τ̂_xy.z = (τ̂_xy - τ̂_xz τ̂_yz) / √[ (1 - τ̂_xz²)(1 - τ̂_yz²) ]

This represents the partial correlation between X and Y, holding Z constant. This can all be generalized for more than three variables.

HYPOTHESES
The null hypotheses that could be tested using Kendall's τ̂ and their corresponding alternatives are as follows:
1. H0: τ_xy.z = 0, H1: τ_xy.z ≠ 0
2. H0: τ_xy.z ≤ 0, H1: τ_xy.z > 0
3. H0: τ_xy.z ≥ 0, H1: τ_xy.z < 0
Critical values for certain sample sizes and preselected levels of significance are given in the Table Estimates of the Quantiles of Kendall's Partial Rank Correlation Coefficient (see Part III.) The decision rules for each of the above hypotheses are:

1. For H1: τ_xy.z ≠ 0, reject H0 if the computed value of τ̂_xy.z is greater than the value of τ̂_xy.z for n and 1 - α/2 given in the table.
2. For H1: τ_xy.z > 0, reject H0 if the computed value of τ̂_xy.z is greater than the value of τ̂_xy.z for n and 1 - α given in the table.
3. For H1: τ_xy.z < 0, reject H0 if the computed value of τ̂_xy.z is less than the negative of the value of τ̂_xy.z for n and 1 - α given in the table.
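A sketch of the first-order partial correlation from three pairwise coefficients (the input values are hypothetical):

```python
import math

def partial_corr(rxy, rxz, ryz):
    # r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2))
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# If X and Y each correlate 0.5 with Z, and 0.25 with each other, the
# X-Y association is fully explained by Z:
print(partial_corr(0.25, 0.5, 0.5))  # 0.0
# If neither correlates with Z, partialling out Z changes nothing:
print(partial_corr(0.5, 0.0, 0.0))   # 0.5
```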


Trend and Slope Comparison (Regression)

An extension of correlation is finding the trend or slope of the relationships. This is called regression and multiple regression analysis. If two measures are related, it is possible to use one measure to predict the other. This is widely used in economics to examine the relationship between productivity and inputs (e.g. crop yield and rain), in medicine to study health and consumption (blood pressure and salt consumption) and in social studies to examine demographic changes and social attitudes (death rates and cigarette consumption.) The relationship demonstrates the impact on a dependent variable (item of interest) of one or several independent variables (influential factors). For example, if a particular stock price is the dependent variable, the independent variable(s) could be positive or negative word counts about the company in the media. As more positive word counts are used in the press the stock price is likely to increase. As more negative word counts are used the stock price is likely to decline.

Note the following dependent variables and independent variables regarding literature and contextual data analysis.

Dependent Variables                                  Independent Variables
An author's style elements                           The author's age, sex, race, education, etc.
An author's view or definition of a topic            A time period, the original language of the author
(love, justice)
An author's productivity                             A location

This type of study can be important in finding out when or where a particular author changed or altered his paradigm of life. This is an important tool for authorship study and for examining psychoanalysis in literature. The dependent variables are generated from the contextual data itself, where the independent variables are particular to the author's experience. What is the relationship between an author's writings and his life experience?

In examining survey data, once the verbatim responses are classified and coded using Document Explorer/WordCruncher they can then contribute to the model for predicting the propensity of the respondent to buy a particular product or vote a particular way.

The focus of this section is not to explain regression analysis, but to introduce its application for contextual data. Regression analysis is explained widely in textbooks and on the Internet. The following tests are additional non-parametric procedures that can contribute to further analysis of the data.


Theil Test

This procedure is based on the Kendall Tau statistic.

ASSUMPTIONS
a) The appropriate model is Yi = α + βXi + ei, i = 1, …, n, where the Xi's are known constants and α and β are unknown parameters.
b) For each value of Xi there is a subpopulation of Y values.
c) Yi is an observed value of the continuous random variable Y at the value Xi.
d) The Xi are all distinct (no ties) and we take X1 < X2 < … < Xn.
e) The random errors ei are independent and come from the same continuous population.

The data for analysis consist of n pairs of observations, (X1, Y1), (X2, Y2), …, (Xn, Yn), where the ith pair represents measurements taken on the ith unit of association.

HYPOTHESES
A. (Two-sided): H0: β = β0, H1: β ≠ β0
B. (One-sided): H0: β ≤ β0, H1: β > β0
C. (One-sided): H0: β ≥ β0, H1: β < β0

TEST STATISTIC
Since this procedure is based on the Kendall tau statistic, we compute the Kendall statistic by comparing all possible pairs of observations of the form (Xi, Yi - β0Xi), in the same way as described under the Kendall tau statistic section. Here is a brief summary of the steps.

1. Arrange the pairs of observations (Xi, Yi - β0Xi) in a column in natural order with respect to the X values.
2. Compare each Yi - β0Xi with each Yj - β0Xj appearing below it.
3. Let P be the number of such comparisons that result in a pair (Yi - β0Xi, Yj - β0Xj) that is in natural order and let Q be the number of such comparisons that result in a pair that is in reverse natural order.
4. Let S = P - Q. The test statistic is τ̂ = S / (n(n-1)/2).

The decision rules for the three sets of hypotheses stated above:

A. Refer to the Table of Critical Values for Use with the Kendall Tau Statistic (Part III.) Reject H0 at the α level of significance if the computed value of τ̂ is either positive and larger than the τ* entry for n and α/2, or negative and smaller than the negative of the τ* entry for n and α/2.

B. Refer to the Table of Critical Values for Use with the Kendall Tau Statistic. Reject H0 at the α level of significance if the computed value of τ̂ is larger than the τ* entry for n and α.


C. Refer to the Table of Critical Values for Use with the Kendall Tau Statistic. Reject H0 at the α level of significance if the computed value of τ̂ is smaller than the negative of the τ* entry for n and α.
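The steps can be sketched as follows (names and data are hypothetical; the comparison to the critical table value is omitted):

```python
def theil_statistic(x, y, beta0=0.0):
    # Kendall-tau statistic for H0: slope = beta0, computed on the
    # residuals y - beta0*x arranged in natural order of x.
    pairs = sorted(zip(x, y))
    resid = [yi - beta0 * xi for xi, yi in pairs]
    n = len(resid)
    P = sum(1 for i in range(n) for j in range(i + 1, n) if resid[j] > resid[i])
    Q = sum(1 for i in range(n) for j in range(i + 1, n) if resid[j] < resid[i])
    S = P - Q
    return S / (n * (n - 1) / 2)

# Residuals strictly increase with X, so every pair is in natural order:
print(theil_statistic([1, 2, 3, 4], [2, 4, 6, 8], beta0=0.0))  # 1.0
# At the true slope the residuals show no trend:
print(theil_statistic([1, 2, 3, 4], [2, 4, 6, 8], beta0=2.0))  # 0.0
```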


Sign-Test for Trend

A straight line is the simplest monotonic trend. A monotonic trend need not be linear; it may simply express a tendency for observations to increase or decrease subject to certain local or random irregularities. Consider a set of independent observations x1, x2, …, xn ordered in time. If we have an even number of observations, for example n = 2m, we take the differences xm+1 - x1, xm+2 - x2, …, x2m - xm. For an odd number of observations, n = 2m+1, we may proceed as above, omitting the middle value xm+1 and calculating xm+2 - x1, etc.

If there were an increasing trend we would expect most of these differences to be positive. If there were no trend then these differences (in view of the independence assumption) are equally likely to be positive or negative. When the differences are primarily negative, this suggests a decreasing trend. This implies that under the null hypothesis of no trend, the plus (or minus) signs have a binomial distribution with parameters m and p = 1/2. So, once the number of plus signs has been counted, we can use the table of values from the binomial distribution to determine the p-value.

Remember that periodic trends are common and this test applied to data with a periodic trend might miss the trend.
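A sketch of the procedure, including the binomial p-value for an increasing trend (names and data are hypothetical):

```python
from math import comb

def sign_test_for_trend(xs):
    # Split the series in half (dropping the middle value when n is odd),
    # count positive second-half-minus-first-half differences, and compute
    # the one-sided p-value P(X >= plus) under Binomial(m, 1/2).
    n = len(xs)
    m = n // 2
    first, second = xs[:m], xs[n - m:]
    diffs = [b - a for a, b in zip(first, second)]
    plus = sum(1 for d in diffs if d > 0)
    p = sum(comb(m, i) for i in range(plus, m + 1)) / 2 ** m
    return plus, m, p

print(sign_test_for_trend([1, 2, 3, 4, 5, 6, 7, 8]))  # (4, 4, 0.0625)
```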


Sen-Adichie Test

This is an asymptotically distribution-free test for the parallelism of several regression lines. This is useful when concerned with testing equality of the k slope parameters without additional constraints on the corresponding, unspecified intercepts.

ASSUMPTIONS
a) The straight-line model is Yij = αi + βi xij + eij, i = 1, …, k; j = 1, …, ni, where the xij's are known constants and α1, …, αk and β1, …, βk are the unknown intercept and slope parameters, respectively.
b) The N = n1 + … + nk random variables e11, …, e1n1, …, ek1, …, eknk are mutually independent.
c) The random variables are k random samples from a common continuous population.

HYPOTHESES
H0: [β1 = … = βk = β] with β unspecified; in words: the k regression lines have a common but unspecified slope.
H1: [β1, …, βk not all equal]

TEST STATISTIC
To compute the Sen-Adichie statistic V, the first step is to align each of the k regression samples.

The alignment uses the pooled least-squares estimate of the common slope,

\hat{\beta} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) Y_{ij} \Big/ \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, where \bar{x}_i = \sum_{j=1}^{n_i} x_{ij} / n_i, for i = 1, ..., k.

For each of the k regression samples, compute the aligned observations Y^*_{ij} = Y_{ij} - \hat{\beta} x_{ij}, i = 1, ..., k; j = 1, ..., n_i. Order these aligned observations from least to greatest within each of the k regression samples. Let r^*_{ij} denote the rank of a specific aligned observation in the joint ranking of the aligned observations in the ith regression sample.

Compute T_i = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) \frac{r^*_{ij}}{n_i + 1}. Setting C_i^2 = \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, i = 1, ..., k, the Sen-Adichie statistic V is then given by

V = 12 \sum_{i=1}^{k} (T_i / C_i)^2.

Reject H_0 if V \ge \chi^2_{k-1,\alpha}; otherwise, do not reject, where \chi^2_{k-1,\alpha} is the upper \alpha percentile of a chi-square distribution with k - 1 degrees of freedom. If there are ties among the aligned observations, use average ranks to break the ties.


Jaeckel, Hettmansperger-McKean

This is an asymptotically distribution-free rank-based procedure for testing appropriate hypotheses in the setting of multiple linear regression.


Time Series

A time series is a set of observations measured sequentially through time. An example of a time series for contextual data is an examination of classification word counts across time (e.g., word counts referencing deity in Presidential inaugural addresses). Many examples of time series analysis can be seen in newspapers, magazines and other time-sensitive media.

There are two special features of this kind of data. The first is that the observations are (clearly) not independent of each other. The second is that the analysis must take into account the order in which the observations are collected.

Basic Concepts of Time Series

Objectives of time series analysis:

a) Description: describing the data using summary statistics and graphical methods (especially the time plot)
b) Modelling: finding a suitable statistical model to describe the data-generating process (in this handbook we focus only on univariate models, which are based on past values of that variable only)
c) Forecasting or predicting: estimating future values of the series
d) Control: enabling the analyst to take action in the context of forecasts

One of the main tasks of classical time series analysis is to try to decompose the variation. There are four main categories: seasonal variation, trend, other cyclic variation and irregular fluctuations. Seasonal variation can be described as similar patterns of behavior observed at particular times of the year, usually on an annual period. Trend is either a steady upward growth or a downward decline over several successive time periods. The perception of trend can depend on the length of the observed series. Other cyclic variation occurs like seasonal variation, but deals with periods other than a year (such as a 5-year business cycle, or daily biorhythms). Irregular fluctuations describe any “left over” variation.

The first step in describing the data is to plot the observations against time. This is called a time plot. The graph can show features such as trend, seasonality, outliers, turning points and sudden discontinuities. Although conceptually simple, the graph may be challenging to create and difficult to interpret. A transformation of the data may be necessary if the variance appears to increase as the mean increases, if the observations are skewed, or if the seasonal effect is multiplicative. If necessary, an appropriate Box-Cox transformation can be found.

Trend and Seasonality

Trend is stochastically described as \mu_t = \alpha_t + \beta_t t, where \mu_t represents the local level of the mean, \alpha_t represents the local intercept and \beta_t represents the local slope. The trend is defined as the rate of change in \mu_t, or as the slope \beta_t. Another way of defining trend is \mu_t = \mu_{t-1} + \beta.


If seasonality is present, decide whether to measure and/or remove seasonality before measuring trend, since trend and seasonality are inextricably related. Seasonal variation can be additive or multiplicative. It is additive when the seasonality does not depend on the mean level, multiplicative when the size of the seasonal variation is proportional to the local mean level.

Seasonal differencing is a method of deseasonalizing the data. The seasonal index at time t is i_t. The equation describing multiplicative seasonality is X_t = \mu_t i_t (1 + \varepsilon'_t), where \varepsilon'_t = (\varepsilon_t - 1).

Correlation Functions

Deterministic means future values can be predicted exactly from past values. Stochastic, or random, means the future is only partly determined by past values. Autocorrelation is correlation between successive values of the same time series. Stationary means the properties of the underlying model do not change through time.

The autocovariance function, abbreviated ACVF, is standardized to give autocorrelation coefficients \rho_k. All of the \rho_k make up the autocorrelation function, which is abbreviated ACF. \rho_k measures the correlation at lag k between X_t and X_{t+k}; the ACF is an even function of lag.

The partial autocorrelation function (partial ACF) measures the excess correlation at lag k which has not already been accounted for by autocorrelations at lower lags. There is also an inverse ACF.


Some Classes of Univariate Time-Series Models

Univariate means that it is the distribution of a single random variable at time t. Many forecasting procedures are based explicitly or implicitly on a univariate time-series model, so it is helpful to understand a range of possible models.

Purely Random Process: A sequence of uncorrelated, identically distributed random variables with zero mean and constant variance. This process is clearly stationary, with a constant spectrum. Its ACF is

\rho_k = 1 for k = 0, and \rho_k = 0 otherwise.

Random Walk: X_t = X_{t-1} + Z_t, where {Z_t} is a purely random process. Even though this is not a stationary process, the first difference (X_t - X_{t-1}) does form a stationary process. Another, similar process is Random Walk Plus Noise.
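A short simulation makes the point concrete: the walk itself wanders, but its first difference is just the purely random process again. A sketch using only the standard library (the seed and series length are arbitrary):

```python
import random
import statistics

random.seed(1)
z = [random.gauss(0, 1) for _ in range(2000)]

# Build the random walk X_t = X_{t-1} + Z_t
x = [0.0]
for shock in z:
    x.append(x[-1] + shock)

# The first difference recovers the shocks {Z_t}, so it is stationary
# even though the walk itself is not
diff = [x[t] - x[t - 1] for t in range(1, len(x))]
print(statistics.pvariance(diff))  # close to 1, the shock variance
```

Plotting `x` against `diff` as time plots would show the contrast even more clearly: the walk drifts, the differenced series stays centered on zero.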


Autoregressive (AR) Process

The value at time t depends linearly on the last p values, and the model looks like a regression model, hence the term autoregression. A time series {X_t} is said to be AR(p) if it is a weighted linear sum of the past p values plus a random shock, so that

X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + ... + \phi_p X_{t-p} + Z_t,

where {Z_t} is a purely random process with mean zero and variance \sigma_Z^2.

Using the backward shift operator B, such that B X_t = X_{t-1}, an AR(p) can be more succinctly written as \phi(B) X_t = Z_t, where \phi(B) = 1 - \phi_1 B - \phi_2 B^2 - ... - \phi_p B^p is a polynomial of order p.

Properties (for the AR(1) case): stationary provided |\phi| < 1; the ACF is \rho_k = \phi^k for k = 0, 1, 2, ...

The partial ACF is zero at all lags greater than p, so the partial ACF can be used to help determine the order of an AR process by looking at the lag value where the partial ACF “cuts off”. The simplest example is an AR(1): X_t = \phi X_{t-1} + Z_t.
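The geometric decay \rho_k = \phi^k can be checked numerically. A sketch, assuming standard normal shocks; the helper names `simulate_ar1` and `sample_acf` are illustrative:

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate X_t = phi * X_{t-1} + Z_t with standard normal shocks Z_t."""
    rng = random.Random(seed)
    xs, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0, 1)
        xs.append(prev)
    return xs

def sample_acf(series, k):
    """Sample autocorrelation at lag k."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    return sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / c0

xs = simulate_ar1(0.8, 5000)
# Theory for a stationary AR(1): rho_1 = 0.8 and rho_2 = 0.8**2 = 0.64;
# the sample ACF should land near these values
print(sample_acf(xs, 1), sample_acf(xs, 2))
```

The slow geometric decay here contrasts with the sharp cut-off of the MA process described next.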


Moving Average (MA) Process

The value at time t is a sort of moving average of (unobservable) random shocks {Z_t}. The time series {X_t} is said to be a moving average process of order q (abbreviated MA(q)) if it is a weighted linear sum of the last q random shocks, so that

X_t = Z_t + \theta_1 Z_{t-1} + ... + \theta_q Z_{t-q},

where {Z_t} denotes a purely random process with mean zero and constant variance \sigma_Z^2. This process can alternately be written X_t = \theta(B) Z_t, where \theta(B) = 1 + \theta_1 B + ... + \theta_q B^q is a polynomial of order q.

A finite-order MA process is stationary for all parameter values. However, \theta(B) is not uniquely determined by the ACF. As a consequence, given a sample ACF, it is not possible to estimate a unique MA process from a given set of data without putting a constraint on what is allowed. Usually the constraint is that the polynomial \theta(x) has all its roots outside the unit circle. This is the invertibility condition, and it means we can effectively rewrite a finite-order MA process as an AR(\infty) process.

The ACF of an MA(q), writing \theta_0 = 1, is

\rho_k = 1 for k = 0; \rho_k = \sum_{i=0}^{q-k} \theta_i \theta_{i+k} \Big/ \sum_{i=0}^{q} \theta_i^2 for k = 1, 2, ..., q; and \rho_k = 0 for k > q.

Thus the ACF cuts off at lag q. This property may be used to try to assess the order of the process by looking for the lag beyond which the sample ACF is not significantly different from zero. An MA(1) looks like this: X_t = Z_t + \theta Z_{t-1}, where the ACF is

\rho_0 = 1; \rho_1 = \theta / (1 + \theta^2); and \rho_k = 0 for k > 1.

Essentially this means the ACF “cuts off” at lag 1.
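The MA(1) formulas above are easy to verify by simulation: the sample ACF should sit near \theta/(1+\theta^2) at lag 1 and near zero beyond. A sketch with illustrative helper names:

```python
import random

def simulate_ma1(theta, n, seed=42):
    """Simulate X_t = Z_t + theta * Z_{t-1} with standard normal shocks."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n + 1)]
    return [z[t] + theta * z[t - 1] for t in range(1, n + 1)]

def sample_acf(series, k):
    """Sample autocorrelation at lag k."""
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    return sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / c0

theta = 0.6
xs = simulate_ma1(theta, 5000)
# Theory: rho_1 = theta / (1 + theta**2) = 0.6 / 1.36, about 0.441,
# and rho_k = 0 for k > 1, so the sample ACF should cut off after lag 1
print(sample_acf(xs, 1), sample_acf(xs, 5))
```

Comparing this output with the AR(1) example makes the diagnostic rule concrete: an ACF that cuts off suggests MA, one that decays geometrically suggests AR.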


ARMA Process

A mixed autoregressive moving average model with p autoregressive terms and q moving average terms. An ARMA(p, q) is \phi(B) X_t = \theta(B) Z_t. This is stationary and invertible if the roots of both polynomials lie outside the unit circle. This process is important because of parsimony, that is, using as few parameters as possible to accurately describe the process: a mixed model may describe the data with fewer parameters than either pure process alone.

ARIMA Models

ARIMA stands for autoregressive integrated moving average. This is a very important forecasting tool and the basis of many fundamental time-series ideas. These are a more general class of models. In practice most time series are non-stationary, so we cannot apply a stationary AR, MA, or ARMA process directly. But we can difference a non-stationary series so as to make it stationary. First differences: (X_t - X_{t-1}) = (1 - B) X_t. These can also be differenced themselves to get 2nd differences, and so on. In general, the dth difference is (1 - B)^d X_t.

If the original data are differenced d times before fitting an ARMA(p, q), then the model for the undifferenced series is ARIMA(p, d, q), where “I” stands for integrated and d denotes the number of differences taken: \phi(B) (1 - B)^d X_t = \theta(B) Z_t. When p = q = 0 and d = 1, this is an ARIMA(0, 1, 0), described as X_t - X_{t-1} = Z_t, which is the same as a random walk.

The main difficulty with fitting AR and MA models is assessing the order of the process, in other words, determining p and q. With ARIMA models, an added difficulty is needing to choose the order of differencing, or figuring out d. A first-order difference is usually adequate for non-seasonal series, but a second-order difference is occasionally needed. Once stationary, an ARMA can be fitted to the differenced data in the usual way.


SARIMA

This is a seasonal ARIMA model with s time periods per year. Define B^s such that B^s X_t = X_{t-s}. Seasonal differencing is written as (X_t - X_{t-s}) = (1 - B^s) X_t.

With non-seasonal terms (p, d, q) and seasonal terms (P, D, Q), the model is abbreviated SARIMA(p, d, q) x (P, D, Q)_s:

\phi(B) \Phi(B^s) (1 - B)^d (1 - B^s)^D X_t = \theta(B) \Theta(B^s) Z_t,

where \Theta(B^s) is a polynomial in B^s of order Q and \Phi(B^s) is a polynomial in B^s of order P.

Periodic AR Models

These can be used to describe seasonal variation throughout the year. Periodic correlation occurs when the size of the autocorrelation coefficients depends on the lag and on the position in the seasonal cycle. Let X_{r,m} be a random variable in the mth seasonal period of the rth year. Then a PAR(1) model is X_{r,m} = \phi_{1,m} X_{r,m-1} + Z_{r,m}.

Fractional Integrated ARMA (ARFIMA)

These models allow fractional differencing, that is, d need not be an integer. An ARFIMA(p, d, q) is described as \phi(B) (1 - B)^d X_t = \theta(B) Z_t. Computations and interpretations of non-integer differences are difficult to make with these models. Long-memory: the ACF decays very slowly, implying observations that are far apart are still related to some extent. This makes it very difficult to get good estimates of the parameters.

State Space Models

These models were developed by engineers for systems that vary through time. Since it is often difficult to distinguish between types of stationarity and non-stationarity, there is much to be said for choosing a forecasting method which makes few assumptions about the form of the trend but is adaptive in form and robust to changes in the underlying model. State-space models give more robust forecasts than ARIMA models.

The signal at time t is taken to be a linear combination of state variables which constitute the state vector at time t. The number of state variables is m and the (m x 1) state vector is \theta_t. h_t is a known (m x 1) vector and n_t denotes the observation error:

X_t = h_t^T \theta_t + n_t.

Another way of describing state space models is X_t = \mu_t + n_t, where \mu_t = \mu_{t-1} + w_t.

The set of state variables is defined as the minimum set of information from present and past data such that the future behavior of the system is completely determined by the present values of the state variables. In other words, the future is independent of past values. This is also called the Markov property: the latest value is all that is needed to make predictions.


The key assumption of these models is that the state vector evolves according to the equation \theta_t = G_t \theta_{t-1} + w_t.

There are many special cases of these models: the random walk plus noise model, the linear growth model, and Harvey's basic structural model. In fact, ARIMA models can be recast into state-space format.

There is a lack of uniqueness for state vectors. They are also non-stationary and will not have a time-invariant ACF.

Growth Curve Models

X_t = f(t) + \varepsilon_t, where f(t) is a deterministic function of time only and \varepsilon_t is a series of random disturbances.

Non-linear Models

Non-linear models are certainly possible in the real world, but it can be difficult to distinguish between

(i) a non-linear model,
(ii) a linear model with normally distributed disturbances to which some outliers have been added, and
(iii) a linear process with disturbances that are not normally distributed.

There is no clear-cut distinction between linear and non-linear models. A non-linear autoregressive process of order p is abbreviated NLAR(p) and described as X_t = f(X_{t-1}, X_{t-2}, ..., X_{t-p}) + Z_t, where Z_t is strict white noise. Other possible models are time-varying parameter models, threshold autoregressive models, bilinear models, state-dependent models and regime-switching models.

In summary, the possible need for non-linear models can be indicated by:

a) A time plot showing asymmetry, changing variance, etc.
b) Plotting X_t against X_{t-1} and looking for limit points, limit cycles, etc.
c) Looking at squared values of the observed sequence.
d) Applying an appropriate test for non-linearity.
e) Taking into account the context, background knowledge and known theory.


Time-series Model Building

A model serves a variety of purposes:

1. Description of the data: both to model the systematic variation and to model the unexplained variation
2. Facilitating comparisons between sets of data
3. Helping to create forecasts

There are three stages of statistical model building: Model Specification, Model Fitting and Model Checking. Additionally, this is an iterative, interactive process.

Formulation

This stage, along with Model Selection, is part of Model Specification. Context is crucial in determining how to build a model. Consider the purpose of the model: will it be forecasting, or will it be describing past data?

Selection

Select a broad class of candidate models (such as the ARIMA family) and then select a model within that family. First select a potentially plausible set of candidate models (usually based on external contextual considerations). Then examine a variety of statistical pointers. The time plot will show if trend and seasonal terms are present. The partial ACF and correlogram can indicate an appropriate structure. The partial autocorrelation function is especially helpful for indicating the likely order of an AR model. Model selection can also be accomplished with a model-selection criterion.

Checking

Ensure the fitted model is consistent with background knowledge and with the properties of the data. Some sort of residual analysis is important. If the model is good, the residuals should represent a random series. The autocorrelation function of the residual series provides an overall check: none of the residual autocorrelations should exceed 2/\sqrt{N} in absolute magnitude. Also, the production of reliable forecasts provides convincing verification. Outliers either need to be accommodated in the model or adjusted in some way, or some robust estimation and forecasting methods need to be used.

A “good” model (i) is consistent with prior knowledge, (ii) is consistent with the properties of the data, (iii) is unable to predict values which violate known constraints and (iv) gives good forecasts out of sample as well as within sample. Remember the principle of parsimony; that is, a model is desirable if it has a relatively small number of parameters but can still adequately describe the data.


Forecasting

This is a brief description of the Box-Jenkins forecasting procedure:

i. Look at the time plot to assess the presence of trend and seasonality.
ii. Take non-seasonal and seasonal differences until the differenced series is judged to be stationary.
iii. Look at the correlogram and sample partial ACF of the differenced series in order to identify an appropriate ARIMA model.
iv. Estimate the parameters of this model.
v. Carry out diagnostic checks on the residuals from the fitted model. If necessary, adjust the identified model until an adequate model is found. Then forecasts can be computed.

There are other approaches to forecasting, such as using the Kalman filter with state-space models. There are also a variety of ad-hoc forecasting methods.


Section G. Goodness of Fit

Introduction

Goodness of fit is a method of testing whether observed data differ from what we expect. For example, if we expect that men and women receive the same number of speeding tickets, our expected values are 50% of the tickets given to men and 50% given to women. These expected values are then compared to the actual observed values to measure our observations against our expectations.

In literature we might apply this technique in comparing the readability levels between a pool of male authors and a pool of female authors. Or we might study race, age, time period, or nationality compared to optimistic and pessimistic language. In examining survey data, we could analyze verbatim responses to see if there is any difference between the observed and expected responses.

Goodness of fit is a powerful tool that is widely used in social studies and economics. It can also be a very useful tool when applied to textual data for contextual analysis of literature, responses to a political speech and verbatim survey analysis.


Chi-Square Goodness of Fit Test

This test extends the binomial test to a nominal variable with more than two classifications. It compares the observed frequencies for a categorical variable with the expected frequencies from a hypothesized population. The hypothesis is best expressed in words.

ASSUMPTIONS

a) The data available for analysis consist of a random sample of n independent observations.
b) The measurement scale may be nominal.
c) The observations can be classified into r nonoverlapping categories that exhaust all classification possibilities; that is, the categories are mutually exclusive and exhaustive. The number of observations falling into a given category (or bin) is called the observed frequency of that category.

HYPOTHESES

H0: The sample has been drawn from a population that follows a specified distribution
H1: The sample has not been drawn from a population that follows the specified distribution

TEST STATISTIC

Create a contingency table with the observed and expected frequencies. For examples of contingency tables, see the section on Determining Association. The observed frequency is simply the count of how many observations fall into each category of the data set. To calculate the expected frequency for each category, compute the product of n and the corresponding category probability. (For each category there is a probability that an observation will fall in that category. This depends on the specified distribution in the null hypothesis.)

Test statistic:

T = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{r} \frac{O_i^2}{E_i} - N,

where N is the total number of observations. For large samples, the test statistic is distributed as approximately chi-square with r - 1 degrees of freedom. So, if the computed value of T is equal to or greater than the tabulated value of chi-square for r - 1 degrees of freedom and significance level α, we can reject the null hypothesis at the α level of significance.

The result of the test statistic largely depends on how you classify the data into bins, or categories. Remember that in each bin the expected frequency should not be less than 1.
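Using the speeding-ticket example from the introduction, the statistic T is a few lines of arithmetic. A sketch (the helper name `chi_square_gof` is illustrative; the 3.841 critical value is the standard tabulated chi-square value for 1 degree of freedom at α = 0.05):

```python
def chi_square_gof(observed, probabilities):
    """T = sum over categories of (O_i - E_i)^2 / E_i, with E_i = N * p_i.
    Compare T to the chi-square critical value with r - 1 degrees of freedom."""
    N = sum(observed)
    expected = [N * p for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Are speeding tickets split 50/50 between men and women? (r = 2 categories)
T = chi_square_gof([60, 40], [0.5, 0.5])
print(T)  # (60-50)^2/50 + (40-50)^2/50 = 4.0
# With r - 1 = 1 degree of freedom, the tabulated 0.05 critical value
# is 3.841, so T = 4.0 leads to rejection at the 5% level
```

The same function applies unchanged to more than two categories, e.g. word counts classified into several lexicon bins.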


Kolmogorov-Smirnov One-Sample Test

This test is useful when it is necessary to compare the empirical distribution function (the distribution function observed from the data) to another distribution function. Unlike the chi-square test, which is meant for use with nominal data, this test is for continuous data, or data measured at least on an ordinal scale.

The idea with this test is to assume the sampled data came from the cumulative distribution F(x), and then to test the hypothesis H_0: F(x) = F_0(x), where F_0(x) is some hypothesized distribution. This hypothesis is evaluated by looking for the largest vertical distance between the empirical CDF and the hypothesized distribution F_0(x).

ASSUMPTION

The data consist of independent observations, which are a random sample of size n from the unknown distribution function F(x).

HYPOTHESES

A. H_0: F(x) = F_0(x), where F_0(x) is some hypothesized distribution; H_1: F(x) ≠ F_0(x)
B. H_0: F(x) ≥ F_0(x); H_1: F(x) < F_0(x)
C. H_0: F(x) ≤ F_0(x); H_1: F(x) > F_0(x)

TEST STATISTIC

The calculation of the test statistic is actually quite computationally complex. This description of how to perform the calculation by hand comes from Daniel's Applied Nonparametric Statistics.

Let S(x) designate the sample (or empirical) distribution function; S(x) is the cumulative probability function computed from the sample data. Specifically, S(x) = the proportion of sample observations less than or equal to x = the number of sample observations less than or equal to x, divided by n. The test statistic depends on the hypothesis under consideration.

A. For the two-sided test the test statistic is D = \sup_x |S(x) - F_0(x)|. When the two functions are represented graphically, D is the greatest vertical distance between S(x) and F_0(x).
B. For the one-sided test where the alternative specifies that F(x) < F_0(x), the test statistic is D^+ = \sup_x [F_0(x) - S(x)]. Graphically, this statistic denotes the greatest vertical distance between F_0(x) and S(x) where the hypothesized function F_0(x) is above the sample function S(x).
C. For the one-sided test where the alternative is F(x) > F_0(x), the test statistic is D^- = \sup_x [S(x) - F_0(x)]. When graphed, this statistic is the greatest vertical distance between S(x) and F_0(x) where S(x) is above F_0(x).


Reject H_0 at the α level of significance if the test statistic under consideration, D, D^+, or D^-, exceeds the 1 - α quantile shown in the Table of Quantiles of the Kolmogorov Test Statistic (see Part III).
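The two-sided statistic D is less daunting in code than by hand: the supremum is always attained at a sample point, just before or just after a jump of the empirical CDF. A sketch (the helper names and the uniform F_0 are illustrative assumptions; in practice D is compared to the tabulated quantile):

```python
def ks_one_sample(data, F0):
    """Two-sided Kolmogorov statistic D = sup_x |S(x) - F0(x)|,
    checking S just before and just after each of its jumps."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d,
                abs((i + 1) / n - F0(x)),  # S just after the jump at x
                abs(i / n - F0(x)))        # S just before the jump at x
    return d

def uniform_cdf(x):
    """Hypothesized F0: the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))

# All four observations crowd the lower end of [0, 1]
print(ks_one_sample([0.1, 0.2, 0.3, 0.4], uniform_cdf))  # 0.6, at x = 0.4
```

Here the largest gap occurs at x = 0.4, where S(x) has already reached 1.0 while F_0(x) is only 0.4.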


Kolmogorov-Smirnov Two-Sample Test

This test is sensitive to all types of differences that may exist between two distributions. It is a general, or omnibus, test.

ASSUMPTIONS

a) The data for analysis consist of two independent random samples of sizes m and n. The observations may be designated X_1, X_2, ..., X_m and Y_1, Y_2, ..., Y_n.
b) The data are measured on at least an ordinal scale.

HYPOTHESES

Let F_1(x) and F_2(x) designate the unknown distribution functions of the X's and the Y's, respectively. The following two-sided and one-sided tests may be performed.

A. (Two-sided) H_0: F_1(x) = F_2(x) for all x; H_1: F_1(x) ≠ F_2(x) for at least one x
B. (One-sided) H_0: F_1(x) ≤ F_2(x) for all x; H_1: F_1(x) > F_2(x) for at least one x
C. (One-sided) H_0: F_1(x) ≥ F_2(x) for all x; H_1: F_1(x) < F_2(x) for at least one x

TEST STATISTIC

Let S_1(x) and S_2(x) respectively represent the empirical distribution functions of the observed X's and the observed Y's:

S_1(x) = (number of observed X's ≤ x)/m
S_2(x) = (number of observed Y's ≤ x)/n

The test statistic for each of the hypotheses:

A. (Two-sided): D = \max_x |S_1(x) - S_2(x)|
B. (One-sided): D^+ = \max_x [S_1(x) - S_2(x)]
C. (One-sided): D^- = \max_x [S_2(x) - S_1(x)]

For m = n, reject H_0 at the α level of significance if the appropriate test statistic, D, D^+, or D^-, exceeds the 1 - α quantile shown in the Table of “Quantiles of the Smirnov Test Statistic for two samples of equal size n”. If m ≠ n, use the Table of “Quantiles of the Smirnov Test Statistic for two samples of different size”.
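Because S_1 and S_2 are step functions, the maximum difference only needs to be checked at the observed values. A sketch of the two-sided statistic (the helper name `ks_two_sample` is illustrative; the result is then compared to the tabulated quantile):

```python
def ks_two_sample(xs, ys):
    """Two-sided statistic D = max over observed points of |S1(x) - S2(x)|,
    where S1 and S2 are the empirical distribution functions."""
    m, n = len(xs), len(ys)
    points = sorted(set(xs) | set(ys))
    d = 0.0
    for p in points:
        s1 = sum(1 for v in xs if v <= p) / m
        s2 = sum(1 for v in ys if v <= p) / n
        d = max(d, abs(s1 - s2))
    return d

# Two samples of equal size n = 4, the second shifted upward by 2
print(ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

The one-sided statistics D^+ and D^- follow by replacing `abs(s1 - s2)` with `s1 - s2` or `s2 - s1`, respectively.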


Lilliefors Test

This computes a test statistic as in the Kolmogorov-Smirnov test, but uses a different set of critical values. This allows us to estimate unknown population parameters from the sample when performing hypothesis tests in which the population specified in the null hypothesis is either normally or exponentially distributed.

ASSUMPTIONS

The data consist of independent observations from a random sample of size n from some unknown distribution function F(x), with unknown mean \mu and/or unknown variance \sigma^2.

HYPOTHESES

H0: The sampled population is normally distributed
H1: The sampled population is not normally distributed

TEST STATISTIC

Using the same notation as with the Kolmogorov-Smirnov one-sample test, the test statistic is D = \sup_x |S(x) - F_0(x)|, where F_0(x) is evaluated using the parameter estimates computed from the sample.

Reject H0 if the computed value of D is greater than the critical value for n and preselected α shown in the appropriate Table of “Critical Values for the Lilliefors Test”.


Section H. Multivariate Methods

Multivariate methods are used for analyzing data in which many simultaneous measurements have been collected. These methods include procedures for data reduction and grouping, investigation of dependence among variables, models for prediction, and hypothesis testing. In this overview we will focus on 1) sorting and grouping and 2) investigating dependence among variables. The procedures we will examine for sorting and grouping are called Factor Analysis, Principal Component Analysis, Cluster Analysis and Discriminant Analysis. The procedures for investigating dependence among variables are MANOVA and Regression, of which we will focus on MANOVA (having explained regression earlier in the manual).

Factor and Principal Component Analysis

Factor and Principal Component Analysis examine underlying causes. They are methods for answering the questions “What factors go into ...?” or “What are the major components of ...?” These statistical methods require no underlying mathematical model. They aim to transform the observed variables into a new set of variables that are uncorrelated and arranged in decreasing order of importance. The aim is to reduce the dimensionality of the problem and to find new variables that make the data easier to understand. These new variables (components) are linear combinations of the original variables, and it is hoped that the first few components will account for most of the variation in the original data. Factor and Principal Component Analyses are used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions (factors). The statistical approach involves finding a way of condensing the information contained in a number of original variables into a smaller set of dimensions (factors) with a minimum loss of information (Hair et al., 1992).

These statistical methods are used to determine the nature of leadership, intelligence or the nature of a purchasing decision. When examining text they may be applied to determining the nature of style, authorship, or genre. They could be applied to discover the nature of writing during a specific time period (during war or peace) or the nature of religious writing.

Factor analysis requires that users have data in the form of correlations. It uses an estimate of the common variance among the original variables to generate factor solutions. Generally the number of factors will always be less than the number of original variables. In Principal Component Analysis the total variance among the variables is used, so the solution will contain as many factors as variables. There is only one method for Principal


Component Analysis, whereas other multidimensional methods have multiple methods for completing analysis. The basic steps for Factor Analysis are (1.) to collect the data, (2.) to generate the correlation matrix, (3.) to extract the initial factor solution, (4.) interpretation and rotation and (5.) to use the scales develop factor scores or rankings. The output table helps determine the factors or components that can be retained for further analysis. A good rule is to select factors with eigenvalues greater than 1; however if the number of variables is small the analysis may not be meaningful. The selected factors are uncorrelated. The first output table shows the factors, eigenvalues, percentage of variance explained by each individual factor and the cumulative percent of the variance explained by the individual factors. Additional output tables show the variable’s contribution to each of the factors with and without rotation. Generally without rotation the variables show a high degree of contribution to the initial factor. When rotated the variables begin to pool into the various factors. The table becomes easier to interpret when the variables are pooled or loaded between the factors. The final step is to name the factors according to the pooled pattern of the variables. (Ask, are there commonalities among the pooled variables?) Copy the variable names into Microsoft Word and convert them to Document Explorer. By applying the classification lexicons to the variable names, users can test a wide variety of word groupings to find a fitting name for individual factors. Although there are a variety of methods for Factor Analysis, the above section outlines the underlying concepts behind each approach. For further research we refer you to the bibliography (see Part IV.) Cluster Analysis Where Factor Analysis looks at interrelationships among variables, cluster analysis seeks to organize information about the variables into clusters or groups. 
In other words, it takes a series of measurements from a set of observations and identifies which observations are closest to each other. Additional terms that reflect the clustering nature of cluster analysis are similarity, proximity, resemblance and association. Cluster analysis can be used as an additional test of multiple authorship, but more importantly it can be used as an informal method of assessing relationships between blocks of words. The basic steps for cluster analysis are (1) collect the data, (2) generate the cluster matrix and (3) interpret the clusters.
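Steps (2) and (3) can be sketched with SciPy’s hierarchical clustering tools. The observation matrix below is invented for illustration; rows stand for blocks of text and columns for word-frequency measurements:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: each row is a block of text, each column a
# word-frequency measurement for that block (step 1: collect the data).
observations = np.array([
    [10.0, 2.0, 1.0],
    [11.0, 2.5, 0.5],   # close to the first block
    [1.0, 9.0, 8.0],
    [0.5, 8.5, 9.0],    # close to the third block
])

# Step 2: generate the cluster (linkage) matrix from pairwise distances.
links = linkage(observations, method="average", metric="euclidean")

# Step 3: interpret -- cut the tree into two clusters and inspect membership.
# (scipy.cluster.hierarchy.dendrogram(links) would draw the tree plot,
# given matplotlib.)
labels = fcluster(links, t=2, criterion="maxclust")
print(labels)
```

With these well-separated rows, the first two blocks receive one cluster label and the last two the other; on real word counts the stability check described next is what justifies keeping a clustering.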


The main output of cluster analysis is a “dendrogram” or tree plot, which gives you a two-dimensional graphical representation of the closeness or proximity of the variables. When selecting which clusters to keep, choose the cluster structures that remain stable over a long distance, or look for cluster groupings that agree with expected structures. A good practice is to replicate the analysis on a subset of the data and see if the structures are consistent.

Discriminant or Classification Analysis

Whereas Factor Analysis and Cluster Analysis allow the data to group itself, Discriminant Analysis uses predefined groups for classification. Discriminant Analysis is a technique for building a predictive model of “group membership” based on observed characteristics. It is a method of assigning subjects to predefined groups or categories, predicting “group membership” from a set of predictors. For example, it is possible to group newspapers into two groups (liberal or conservative) based on their performance on three variables (counts of liberal or conservative words, size of articles devoted to liberal or conservative issues and the number of positive or negative words used within 5 words of the name of the current liberal or conservative President). In literature, discriminant procedures can determine a set of functions that reveal the configuration of separation among the authors; this is often followed by a classification analysis in which the variables for an author profile are compared to the average profile of each author. The comparisons are made by means of classification functions, which measure how closely one profile matches another. The techniques of discriminant and classification analysis are powerful because they are self-verifying, indicating how well the conceptual classification works on the data being studied. There are basically three types of Discriminant Analysis: direct, hierarchical and stepwise. The direct method considers all the variables at once.
In the hierarchical method the researcher selects the order of entry for the variables; in the stepwise method, statistical criteria determine the order of entry. Most texts examine the stepwise method, because the direct method simply throws the variables into the model and the hierarchical method relies on the subjectivity of the researcher. Difficulties or weaknesses in Discriminant Analysis surface when selecting classification criteria appropriate to the observed data. For example, if we were examining liberal and conservative classification we would not likely draw data from authors’ poetry but from their political speeches, nor would we establish classifications for poetry (love, nature, suffering) by observing political speeches. The variables (word
counts, groupings, frequencies, collocated words) and the classification groupings must be appropriately matched. In addition, the populations must be distinctly separate and non-overlapping, and the unknowns must belong to one of the defined groups and not to an unknown possibility. For example, male/female is well defined with very little chance of a third possibility, but liberal and conservative, though well defined, likely have additional unknown counterparts. This is not meant to reject liberal and conservative as classifiers, but to point out that the classifications must be well defined by the researcher.

Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Variance (MANOVA), an extension of Analysis of Variance to several dependent variables, is a technique that tests for dependence among variables and homogeneity of groups. MANOVA explores the nature of the relationships among variables and helps define the dependent relationships among one or more variables. This procedure is useful in evaluating the similarity of patterns from one author to another. The test includes the creation of new dependent variables that maximize group differences. These newly created dependent variables are linear combinations of the measured dependent variables. The objective is to determine whether the response variables are altered by the manipulation of the independent variables. The results of MANOVA can answer questions regarding the main effects of the independent variables, the interactions among the independent variables, the importance of the dependent variables, the strength of the association between the dependent variables and the effects and uses of covariates. An example of applying MANOVA is a study in which we are testing students’ improvement in two related courses. A MANOVA test could be used because the two measures are probably correlated and we need to take this into account when performing a test for significance.
In addition, we can examine whether the students improved in only one course, in the other, or in both. More applicable to this manual, MANOVA can be used to explore differences in authorship, style, genre, or any of the topics introduced earlier in this manual. Suppose that there exists a set of ten plays ascribed to Shakespeare. However, some scholars hypothesize that Shakespeare wrote only seven of the plays and that the other three were written by an unknown individual. To use MANOVA, we divide the ten plays into two groups, one containing the seven undisputed plays and the other containing the three disputed plays. MANOVA allows us to compare the two groups and determine whether the observed difference in the variables is large in relation to the internal consistency within each group of plays. A large observed difference would support the conclusion
that different authors wrote the two groups of plays, while a small difference would suggest that one author wrote all ten plays. The MANOVA technique can be applied to any number of authors or variables outlined earlier in this manual. Based on the frequencies, MANOVA states the probability of a given set of data arising if a single author wrote all of the materials examined. With MANOVA comes a long list of assumptions regarding distribution, linearity and homogeneity of variances and covariances, and a long list of limitations regarding unequal sample sizes, outliers, multicollinearity and singularity. Before you use MANOVA you may wish to consult a qualified statistical expert.
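The core between-group versus within-group comparison, and the nearest-profile classification step from Discriminant Analysis above, can be sketched in a few lines of NumPy. All numbers here are invented; the function name `wilks_lambda` is ours, and a real MANOVA would go on to convert lambda to an F statistic and check the assumptions just listed:

```python
import numpy as np

# Hypothetical word-rate measurements (two variables per play) for two groups.
undisputed = np.array([[5.1, 2.0], [4.9, 2.2], [5.3, 1.9], [5.0, 2.1]])
disputed = np.array([[7.0, 3.5], [6.8, 3.7], [7.2, 3.4]])

def wilks_lambda(groups):
    """det(W) / det(W + B): near 1 means the groups barely differ;
    near 0 means between-group differences dwarf within-group scatter."""
    all_obs = np.vstack(groups)
    grand_mean = all_obs.mean(axis=0)
    within = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    between = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean,
                                    g.mean(axis=0) - grand_mean) for g in groups)
    return np.linalg.det(within) / np.linalg.det(within + between)

lam = wilks_lambda([undisputed, disputed])

# Classification step (cf. Discriminant Analysis above): assign an unknown
# play to the group whose mean profile it most closely matches.
unknown = np.array([6.9, 3.6])
means = [g.mean(axis=0) for g in (undisputed, disputed)]
nearest = int(np.argmin([np.linalg.norm(unknown - m) for m in means]))
print(f"lambda = {lam:.3f}, unknown play assigned to group {nearest}")
```

With these tightly clustered, well-separated groups lambda comes out very small, the pattern that would support the two-author conclusion in the Shakespeare example above.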

PART III. TABLES

1. Normal Distribution – Areas under the Normal Curve
2. Binomial Distribution – Critical Values of the Binomial Test
3. F Distribution – Critical Values
4. t Distribution – Critical Values
5. Correlation Coefficient – Critical Values of r
6. Converting r to Z
7. Chi-Square Distribution – Critical Values
8. Studentized Range Statistic – Critical Values
9. Dunnett’s Test
10. Mann-Whitney U Test
11. Wilcoxon Rank Sum Test
12. Wilcoxon Signed Ranks Test
13. Sample Size Requirements

Part IV. Bibliography and Appendix 151

PART IV. BIBLIOGRAPHY AND APPENDIX

Citations for Part I. – Discovery Methods

Ashley, J.S. and Jarratt-Ziemski, K. (1999). Superficiality and Bias: The (mis)treatment of Native Americans in U.S. government textbooks. Paper presented at the WSSA Conference, Fort Worth, Texas, April 1999.

Ball, Catherine (1993). Did Mary Shelley Write Like a Man: Explorations in the Methodology of Language and Gender. Paper presented at the Georgetown Women’s Studies Research Colloquia Series, December 1993.

Blain, Michael (1995). Group Defamation and the Holocaust in Group Defamation and Freedom of Speech: The relationship between language and violence, pp. 45-68. Eds. Monroe Freedman and Eric Freedman. Greenwood Press, Westport, CT.

Cobb, Thomas (1997). “From Concordance to Lexicon: Development and Test of a Corpus-Based Lexical Tutor.” Dissertation, Department of Education Technology, Concordia University, Montreal, Quebec, Canada, January 1997.

Hart, Roderick P. (2000). Campaign Talk: Why Elections Are Good for Us. Princeton University Press, Princeton, NJ.

Holmes and Forsythe (1995). “The Federalists Revisited: New Directions in Authorship Attribution.” Literary and Linguistic Computing, Vol. 10, No. 2, pp. 111-127.

Miall, David S. (1990). “Personal Librarian: A Tool for the Literature Classroom.” Literary and Linguistic Computing, Vol. 5, No. 1, pp. 19-23.

Nakamura, J. and Sinclair, J. (1995). “The World of Woman in the Bank of English: Internal Criteria for the Classification of Corpora.” Literary and Linguistic Computing, Vol. 10, No. 2, pp. 100-110.

Schulte-Sasse, Linda (1988). “The Jew as Other under National Socialism.” German Quarterly, Vol. 61, No. 1, pp. 22-49.


Citations for Part II. – Analytical Methods

Conover, W.J. (1999). Practical Nonparametric Statistics. Third Edition. John Wiley & Sons, Inc., New York, NY.

Chatfield, Chris (2001). Time-Series Forecasting. Chapman & Hall / CRC Press, Boca Raton, FL.

Daniel, Wayne W. (1990). Applied Nonparametric Statistics. PWS-KENT, Boston.

Hollander, Myles and Douglas A. Wolfe (1999). Nonparametric Statistical Methods. Second Edition. John Wiley & Sons, Inc., New York, NY.

Sheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Inc., Boca Raton, FL.

Probability levels for the Wilcoxon signed-rank test: Frank Wilcoxon, S.K. Katti and Roberta A. Wilcox, “Critical Values and Probability Levels for the Wilcoxon Rank Sum Test and the Wilcoxon Signed Rank Test.” Originally prepared and distributed by Lederle Laboratories Division, American Cyanamid Company, Pearl River, New York, in cooperation with the Department of Statistics, The Florida State University, Tallahassee, Florida. Revised October 1968. Copyright 1963 by the American Cyanamid Company and The Florida State University.

Quantiles of the Mann-Whitney Test Statistic: Adapted from L.R. Verdooren, “Extended Tables of Critical Values for Wilcoxon’s Test Statistic,” Biometrika, 50 (1963), 177-186.

Upper tail probabilities for the null distribution of the Ansari-Bradley W statistic: Myles Hollander and Douglas A. Wolfe, Nonparametric Statistical Methods, copyright 1973 by John Wiley & Sons, Inc.

Percentiles of the Chi-square distribution: A. Hald and S.A. Sinkbaek, “A Table of Percentage Points of the Chi-square Distribution,” Skandinavisk Aktuarietidskrift, 33 (1950), 168-175.

Critical values of the Kruskal-Wallis Test Statistic: W.H. Kruskal and W.A. Wallis, “Use of Ranks in One-Criterion Variance Analysis,” J. Amer. Statist. Assoc., 47 (1952), 583-621; Addendum, Ibid., 48 (1953), 907-911.

Kendall’s coefficient of concordance: Donald B. Owen, Handbook of Statistical Tables, Reading, Mass.: Addison-Wesley, 1962, and Maurice G. Kendall, Rank Correlation Methods, fourth edition, Charles Griffin & Company, Ltd., High Wycombe, Bucks., England; reprinted by permission.

Critical values of minimum rj for comparison of k treatments against one control in b sets of observations (two-tailed test):


A.L. Rhyne, Jr. and R.G.D. Steel, “Tables for a Treatments versus Control Multiple Comparisons Sign Test,” Technometrics, Vol. 7, No. 3 (Aug. 1965), pp. 297-298; reprinted by permission.

Critical values of L for Page’s ordered alternatives test: E.B. Page, “Ordered Hypotheses for Multiple Treatments: A Significance Test for Linear Ranks,” J. Amer. Statist. Assoc., 58 (1963), 216-230.

Quantiles of the Kolmogorov Test Statistic: L.H. Miller, “Table of Percentage Points of Kolmogorov Statistics,” J. Amer. Statist. Assoc., 51 (1956), 111-121.

Quantiles of the Smirnov Test Statistic for two samples of equal size n: Z.W. Birnbaum and R.A. Hall, “Small-Sample Distributions for Multi-Sample Statistics of the Smirnov Type,” Ann. Math. Statist., 31 (1960), 710-720.

Quantiles of the Smirnov Test Statistic for two samples of different size: Frank J. Massey, Jr., “Distribution Table for the Deviation between Two Sample Cumulatives,” Ann. Math. Statist., 23 (1952), 435-441.

Critical values for the Lilliefors test: Andrew L. Mason and C.B. Bell, “New Lilliefors and Srinivasan Tables with Applications,” Commun. Statist.—Simul., Vol. 15, No. 2 (1986), pp. 457-459.

Critical values of Spearman’s rank correlation coefficient: Jerrold H. Zar, Biostatistical Analysis, 2nd ed., © 1984, pp. 577-578; reprinted by permission of Prentice Hall, Inc., Englewood Cliffs, New Jersey.

Critical values for use with the Kendall tau statistic: L. Kaarsemaker and A. van Wijngaarden, “Tables for Use in Rank Correlation,” Statistica Neerlandica, 7 (1953), 41-54.


Appendix 154

i. What proportion of the words in textbooks deal with issues of a particular minority group, and what can be inferred from such data?

Ashley and Jarratt-Ziemski (1999) counted the total number of words in college history textbooks that dealt with Native American issues and compared this to the number of words addressing issues of other minority groups. Here the researchers measured the count of words according to the theme of the Native American and the genre of college texts. The researchers argued that the data showed that college textbooks exhibited superficiality and bias in their treatment of topics related to Native American history and legal status. This example illustrates one of the most common uses for total word counts – revealing the weight that an author places on a particular theme, audience, story, policy, etc.

ii. How do plays with a similar theme compare in vocabulary use?

A tragedy may contain much vocabulary of a comedic nature, whereas a comedy may contain a substantial count of words of a tragic nature.

iii. By performing word counts on political propaganda, what themes and imagery can be identified – and what can be inferred about the authors?

In Fighting Words, Michael Blain (1995) examined the war rhetoric of Adolf Hitler and propagandists of the Nazi regime. From various sources, particularly Hitler’s book Mein Kampf as well as speech transcripts, pamphlets and newspapers, he searched on individual words such as villain, victim, vindication, violation, destruction, cunning, egotism, devil, monster, demon, innocence, vulnerability, sacrifice, etc. By these counts Blain found that the propaganda revolved around themes of loyalty, medical diagnosis and cure, religious guilt and redemption, as well as a drama of murder and revenge that painted the common people of German ethnicity as God’s purifiers on theological and biological grounds and the Jew as a disease and a murderer.
Additionally, Blain saw that Hitler painted the masses of German society with the image of feminine purity. By contrast, the Jew was depicted as innately lascivious. Thus Hitler urged the people to gather to a dominating male figure, himself. This study was almost entirely based on examining word counts: Blain assayed the texts and located the themes and images. His results were convincing. In the next study the researcher similarly looked at themes and topic importance, this time in American politics.

iv. How can counts of individual words help the researcher to better understand a political speaker – are there contrasts between the speaker’s choice of words and his self-characterization?

Hart (2000) found that in the 1996 presidential campaign “Bob Dole, the self-declared ‘most optimistic man in America,’ used less verbal optimism in his campaign speeches than any Republican since Tom Dewey with one exception” and that “Clinton…used more human-interest language (you, us, people, family) than any candidate from either party between 1948 and the present with the exception of Hubert Humphrey…President Clinton stressed the common ties among the American people, Mr. Dole used twice as many denial words (can’t, shouldn’t, couldn’t)…” (p. 4). It was stated earlier that discovering topic importance can provide direction for further analyses and lead to seeing the author’s views of topics and the expected audience of the text. In these data Dole’s self-portrayal was contradicted by his speech. These kinds of data may lead the researcher
to ask why, and in what other instances, such a contrast is detectable. Clinton’s use of human-interest words might prompt the researcher to analyze the in-context use of such terms. Are they used consistently? Does Clinton define them in the traditional sense or otherwise? And so forth.

v. How can word-group analyses be used to evaluate communications?

Steven Lankton’s consulting firm used a word categorization method to sort verbs into the language groups of visual-based, feeling-based and hearing-based. The firm then gathered transcripts of communications between employees and customers from a variety of media: face-to-face, telephone, documents, etc. Verbs of each category were counted and the ratio of these verbs to the total number of verbs was noted for each document. The results were used to evaluate whether certain media were more conducive to productive communication and whether the media were compatible with the communication styles of employees and customers. In this study words were categorized into language groups, then the researchers determined the connection between the appearance of those words and communication styles. From here the researchers may have noted that texts loaded with visual-based words originated from face-to-face conversations but that in transcripts of phone conversations these kinds of verbs were almost entirely absent. The last step was to infer an explanation for the phenomena. The firm was eventually able to make recommendations for altering media interfaces to be more compatible with the communication styles of employees and clients. These changes led to improvements in communication efficiency, business choices, customer and employee satisfaction and use of business resources. (See http://www.lankton.com/kgroups.htm.) Researchers would be able to apply the same process to the method of counting by language groups whether the texts originate with inaugural addresses, a classic novel or poetry.
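A categorize-and-count pass of this kind can be sketched as follows. The category lexicons and the transcript verbs below are invented for illustration, not Lankton’s actual word lists:

```python
from collections import Counter

# Hypothetical language-group lexicons of verbs (invented for illustration).
CATEGORIES = {
    "visual":  {"see", "look", "show", "appear", "watch"},
    "feeling": {"feel", "grasp", "touch", "handle", "sense"},
    "hearing": {"hear", "listen", "sound", "tell", "discuss"},
}

def category_ratios(verbs):
    """Ratio of each category's verbs to the total verb count in one transcript."""
    counts = Counter()
    for verb in verbs:
        for name, lexicon in CATEGORIES.items():
            if verb in lexicon:
                counts[name] += 1
    return {name: counts[name] / len(verbs) for name in CATEGORIES}

# Verbs extracted (hypothetically) from one face-to-face transcript.
transcript_verbs = ["see", "look", "show", "feel", "hear", "go", "make", "see"]
ratios = category_ratios(transcript_verbs)
print(ratios)  # -> {'visual': 0.5, 'feeling': 0.125, 'hearing': 0.125}
```

Comparing these ratios across transcripts from different media is the step that would reveal, for instance, that visual-based verbs dominate face-to-face conversation.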
vi. How can word-group analyses be used in qualitative evaluation of campaign speeches?

In Campaign Talk: Why Elections Are Good for Us, Roderick Hart (2000) studied US presidential elections from 1948 to 1996. In his study he examined letters, debates, television ads and broadcasts, newsprint and speeches. These texts originated with the press, politicians, relatives, acquaintances and citizens. By categorizing words according to their connotations he produced five scores for texts (pp. 246-251):

(1) Certainty. “Language indicating resoluteness, inflexibility, completeness and a tendency to speak ex cathedra.”
(2) Optimism. “Language endorsing some person, group, concept, or event, or highlighting their positive entailments.”
(3) Activity. “Language featuring movement, change, the implementation of ideas and the avoidance of inertia.”
(4) Realism. “Language describing tangible, immediate, recognizable matters that affect people’s everyday lives.”
(5) Commonality. “Language highlighting the agreed-upon values of a group and rejecting idiosyncratic modes of engagement.”

Hart’s discoveries were very informative. For example, of the 1992 campaign he notes: “the news [media] used virtually no Self-References but instead employed Communication Terms (said, advised, demanded, mentioned), thereby putting politicians, not reporters themselves, on stage.” (p. 188) And again, on the people’s voice in newspaper editorials: “My findings suggest that the richness of lay political expression lies not in its oddness but in its dependability….many voters [are] concerned… with enduring realities and this fundamentalist strain distinguishes the people’s voice.” (pp. 214-215) By categorizing words as Hart did and then searching on those words it is possible to discover much about the authors and the contexts from which texts are produced. Hart got so much information he wrote a book about it – what is given here is a meager portion of his rich analyses. Though analyses like the ones Hart made take time, careful planning and diligence, they have the potential to reveal volumes.


vii. How can word-group analyses be used to evaluate a WWII Nazi propaganda film?

Linda Schulte-Sasse (1988) published her analysis of the Nazi propaganda film Jud Süß. (The arena of political propaganda is rich with examples of word count analysis.) In her publication she noted that the antagonist Süß, a wealthy Jew, used many words and phrases of French origin; this contrasted with an absence of such words in the speech of the protagonists, three volk-type Germans. Schulte-Sasse concluded that the “courtly” speech of Süß indicated that his loyalties and affections were oriented internationally rather than centered on the national well-being, and that his character was such that he favored the old social class distinctions, placing himself with the lords and ladies above the common volk. The example illustrates the point that searching by language groups is often an advanced form of document discovery, whereas in searching for individual words and performing total word counts the researcher is often at the initial stages of discovery. As is often the case with counting by language groups, Schulte-Sasse had a prior understanding of the origins of the text, including the social and political situation of the time, and it is likely that from the outset she expected results similar to those she found. Researchers who have a familiarity with a particular field will be able to design more “advanced” or defined searches as exemplified here – these kinds of searches have the potential to reveal information that is more domain-specific.

viii. What particular uses for given words or phrases distinguish Shakespeare from another author?

G.D. Monsarrat noted that in 1997 A Funeral Elegy by ‘W.S.’ was included in the Riverside, Bevington and Norton editions of Shakespeare, though David Bevington stated, “the attribution [of the Elegy to Shakespeare] remains uncertain.” In his work Monsarrat had observed that the Elegy was much more like the works of John Ford than those of Shakespeare.
Accordingly, Monsarrat examined word and phrase use in the Elegy and compared it to literary works by Ford and Shakespeare. These are two of his word-use analyses: “Shakespeare uses float only once, meaning ‘sea’, but never metaphorically…. It is one of Ford’s favorite metaphors….” And: “Shakespeare uses ‘commonwealth’ with the meaning ‘body politic’, never metaphorically for what has been called ‘the little world of man’”, though Ford did. Monsarrat’s results were convincing that the Elegy was more likely written by Ford than by Shakespeare. This example shows how variations in word use relate to, or may help to determine, authorship. Monsarrat relied on the fact that words can vary in meaning according to authorial discretion in order to successfully argue the authorship of a piece of literature whose authorship has historically been questioned.

ix. Does authorial gender correlate with the way that particular words are used?

Catherine Ball (1993) analyzed the writing of Mary Shelley and other authors of that time period, seeking to determine if Shelley “[wrote] like a man.” Ball used a concordance to examine the uses of the relative pronouns who, whom, whose, which and that in works by Austen, Shelley, Dickens and Charlotte and Emily Brontë. For instance, for each pronoun she noted whether the antecedent was personal or non-personal and what grammatical role the pronoun had in the relative clause, i.e., subject, object, etc. The resulting data allowed her to conclude that female authors use relative pronouns distinctly from male authors. This study illustrates that when a particular word or type of word, in this case a relative pronoun, is examined in a concordance, patterns in the surrounding words are much easier to identify. These patterns may be attributed to gender, as with Ball’s study, or to other
factors such as the time period of the composition, the audience addressed, the themes being treated, etc. This kind of linguistic analysis can be very rewarding.

x. How can concordances be used to compose a dictionary and teach foreign language vocabulary?

Because they show the various ways in which the words of a text are used, concordances can serve in writing a dictionary. A dictionary can have many uses, and the compilation process can also serve a variety of functions. The following is an example of how a concordance was used in writing a dictionary in order to teach foreign language vocabulary. Thomas Cobb (1997) had students from Oman learning English use a concordance of spoken and written English to build their own dictionaries. Cobb observed that in writing their dictionaries students were required to determine the meaning a word was given by its context. Additionally, in using the concordance to write the dictionaries the students saw which English words were more commonly used than others. Cobb noted that this method of using a concordance to teach vocabulary was successful. This is a great example of the creative possibilities for concordance use. Every foreign language learner or instructor knows that the best way to learn a language is by direct exposure, preferably in conversation. Perhaps the use of a concordance in this manner comes in a close second.

xi. How can concordances be used to detect plagiarism?

Concordances greatly facilitate identification of (1) regularities in word use across several texts and (2) consistencies in word use within the same text. In her online concordance tutorial, the above-mentioned Ball suggests a “poor-man’s plagiarism machine.” (See http://www.georgetown.edu/cball/corpora/tutorial.html.) By making a concordance of two or more texts (usually a student paper and sources from the bibliography of the paper) the researcher can look for matching phrases.
Finding matches is easy when the researcher keys on definite and indefinite articles. Matches between the questioned document and likely sources are possible instances of plagiarism.

xii. What are the readability or grade levels of speeches given by several presidents of the United States, and what might these imply?

Researchers for the website YourDictionary.com used the Flesch-Kincaid scale to assess the readability levels of presidential addresses and found the following: Washington’s farewell address was at a 12th-grade level, Roosevelt’s declaration of war was at a grade level of 11.5, John F. Kennedy was at a 9.6 and Nixon at a 9.1. The researchers supposed that the decline in scores was related to speeches having been televised nationally. (See the cited website.)

xiii. What aspects of an author’s life become more transparent by performing collocations on their texts?

David Miall (1990) saw that building a concordance could reveal special properties and associations of the vocabulary an author used. He offered an example from the poetry of Coleridge, whose only sister died at an early age. Miall found that in Coleridge’s works, collocations for the word sister included words such as gloomy, distress, woes, pain and agony as well as instances of love. Here Miall used collocation tools to examine the trends in vocabulary use and by doing so came to see that the death of his sister probably left a profound impression on Coleridge.
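A collocation count like Miall’s reduces to tallying the words that fall within a fixed window of a node word. A minimal sketch follows; the sample line is invented, not Coleridge:

```python
from collections import Counter

def collocates(tokens, node, width=4):
    """Count words occurring within `width` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
            # Tally every token in the window except the node occurrence itself.
            counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

# Invented sample line; a real study would tokenize the complete poems.
text = "my sister in her pain and woes my sister whom i love"
counts = collocates(text.split(), "sister", width=3)
print(counts.most_common(3))
```

Ranking the resulting counts (here via `most_common`) is what surfaces the emotionally loaded neighbors Miall observed around sister.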


xiv. How can imagery in a text be discovered through collocations?

In the King James Bible a collocation count for the word love shows that it is most commonly associated with the words hate, neighbor and husband. The theme of love is being contrasted with that of hate. Images of marital and communal bonds are also made more salient.

xv. How can style be analyzed to classify texts by genre?

Nakamura and Sinclair (1995) analyzed collocations on the word woman as it has been used throughout a variety of media, including books (novels), newsprint, conversation and BBC reports. The procedure was to find all of the occurrences of woman in the texts and to examine which words occurred near woman. Their suspicion was that patterns in word associations would be distinct for each medium. Their findings included the following:

1. In transcripts taken from BBC reports, adjectives commonly associated with woman typically dealt with country, area, citizenship or religion.
2. Adjectives found exclusively in books and collocated with woman usually described a positive or negative quality, e.g., confident, healthy, mature, lonely, gossipy, unhappy, or a physical characteristic, e.g., small, thin, plump, slender, stout.
3. Verbs conjugated in the third-person singular form, e.g., says, wants, finds, feels, were only found collocated with woman in texts taken from books.
4. Woman was collocated with verbs of violence and crime, e.g., shot, injured, convicted, questioning, in the BBC and newspaper texts.

The findings of this study may or may not be of interest to every researcher, but they do exemplify the point that patterns in associations often correspond to the medium of a text. Likewise, collocations often correspond to other aspects of the origin of a text such as genre, era, time, place and author.
The researcher must take care in performing such analyses to test for statistical soundness; instructions on statistical measurements are given in the second half of the manual.

xvi When authorship is unknown, how can style analysis be used to establish a probable author? G. D. Monsarrat (2002) analyzed A Funeral Elegy, a poem that has traditionally been attributed to William Shakespeare. In his analysis Monsarrat compared aspects of the text in question to two separate collections of poems, one by Shakespeare and another by John Ford. Monsarrat eventually concluded that the Elegy was more likely written by Ford than by Shakespeare. His evidence for this conclusion stemmed from the following observations:

1. The Elegy contains this line – Had taught him in both fortunes to be free. Shakespeare never used both fortunes, even though he used fortunes 125 times, but Ford used both fortunes five times.

2. The Elegy uses the word ornament in the sense of moral qualities and virtues. Shakespeare did not use ornaments in this sense but usually in a derogatory manner, e.g., the corrupt ornaments of the world. Ford used ornaments in a manner similar to the Elegy, e.g., speciall ornaments of a prepared minde.

3. The Elegy contains this phrase – Reasons law. Shakespeare never associates reason and law, but Ford does so twice.

4. The Elegy uses the phrase – like a seeled Dove.


Shakespeare never used seeled dove, but the phrase is found more than once in Ford's works. To explain the foregoing examples: number one is an example of how styles may differ by way of association; though both authors used the word fortunes, one exhibits a pattern of using the two words in a particular association and the other does not. Number two is an example of discretionary use of content: playing on the dynamics of word meaning, one author uses ornaments in a metaphorical sense once and in other instances in a denotative, though extended, sense. Number three is another example of patterns in associations. Number four is a simple example of patterns of content: Ford uses the phrase seeled dove while Shakespeare does not. Monsarrat's article is an excellent example of style analysis, contains other examples similar to these, and is recommended.

xvii How has Document Explorer been used to discover a poetic text and to write an exegesis that serves as a template for translating the source into several different languages? In a personal interview, Dr. Cynthia Hallen of Brigham Young University explained how she wrote an exegesis of a scriptural passage from Isaiah. In the King James Bible, Isaiah chapter 54, verse 11 reads, “O thou afflicted…I will lay thy stones with fair colours and lay thy foundations with sapphires.” Hallen began by performing a search on each of the content words and viewing the uses of those words in context. By this approach she found recurring phrasal imagery: perhaps not the same exact phrase, but the same word, or words within the same semantic domain, occurring in close proximity to each other. In this way she was able to examine the imagery and understand the meaning of the symbols. She then wrote up the exegesis for translators to use as a reference in translation – an explanation of the meaning of the passages that made a more accurate translation possible.
The exegesis was formed not by interpreting an isolated passage alone but by examining the use of the same imagery in other scriptures. Here Hallen relied heavily on three features of the program: (1) searching with filters, (2) viewing in context, and (3) in a limited way, using the concordance builder. (Filter use is addressed in Section B of the manual.) This case illustrates well the point that word-for-word translation is not always adequate: Hallen sought to translate the meaning and symbolism of the passage rather than the words alone. Imagine the difficulty of translating Isaiah or any other poetic text into Hmong, Chinese or Navajo. It is difficult to search out all of the occurrences of a certain word, and much more difficult to find the instances of a certain image in a text. In the humanities, metaphor is one of the most powerful and beautiful forms of expression; it can hold many levels of meaning and is very versatile. The greatest authors speak through imagery and not through mere words, hence the importance of a tool that allows the discovery and comparison of images in a text.

xviii In the case of apology as it was translated by Chinese newspapers after a U.S. spy plane was downed in Chinese airspace – was apology translated as an acknowledgement of and sorrow for guilt, or merely as regret?
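Hallen's procedure of searching on content words and viewing each occurrence in context is essentially a keyword-in-context (KWIC) display. The sketch below is a minimal stand-in for the program's search-and-view features, not its actual implementation; the `kwic` function name, the 15-character window, and the use of a fragment of the Isaiah passage are illustrative choices.

```python
import re

def kwic(text: str, node: str, width: int = 25) -> list:
    """Return each occurrence of `node` with `width` characters of context per side."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(node), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

passage = ("I will lay thy stones with fair colours and lay thy "
           "foundations with sapphires.")
for line in kwic(passage, "lay", width=15):
    print(line)
```

Aligning the hits in a column this way makes recurring imagery easy to scan by eye, which is the step that lets a researcher notice words from the same semantic domain clustering around a node word.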