Practical Text Mining With Perl

Roger Bilisoly
Department of Mathematical Sciences
Central Connecticut State University

WILEY

A JOHN WILEY & SONS, INC., PUBLICATION

Contents

List of Figures xiii

List of Tables xv

Preface xvii

Acknowledgments xxiii

1 Introduction 1

1.1 Overview of this Book 1

1.2 Text Mining and Related Fields 2

1.2.1 Chapter 2: Pattern Matching 2

1.2.2 Chapter 3: Data Structures 3

1.2.3 Chapter 4: Probability 3

1.2.4 Chapter 5: Information Retrieval 3

1.2.5 Chapter 6: Corpus Linguistics 4

1.2.6 Chapter 7: Multivariate Statistics 4

1.2.7 Chapter 8: Clustering 5

1.2.8 Chapter 9: Three Additional Topics 5

1.3 Advice for Reading this Book 5

2 Text Patterns 7

2.1 Introduction 7

2.2 Regular Expressions 8

2.2.1 First Regex: Finding the Word Cat 8

2.2.2 Character Ranges and Finding Telephone Numbers 10

2.2.3 Testing Regexes with Perl 12

2.3 Finding Words in a Text 15

2.3.1 Regex Summary 15

2.3.2 Nineteenth-Century Literature 17

2.3.3 Perl Variables and the Function split 17

2.3.4 Match Variables 20

2.4 Decomposing Poe's "The Tell-Tale Heart" into Words 21

2.4.1 Dashes and String Substitutions 23

2.4.2 Hyphens 24

2.4.3 Apostrophes 27

2.5 A Simple Concordance 28

2.5.1 Command Line Arguments 33

2.5.2 Writing to Files 33

2.6 First Attempt at Extracting Sentences 34

2.6.1 Sentence Segmentation Preliminaries 35

2.6.2 Sentence Segmentation for A Christmas Carol 37

2.6.3 Leftmost Greediness and Sentence Segmentation 41

2.7 Regex Odds and Ends 46

2.7.1 Match Variables and Backreferences 47

2.7.2 Regular Expression Operators and Their Output 48

2.7.3 Lookaround 50

2.8 References 52

Problems 52

3 Quantitative Text Summaries 59

3.1 Introduction 59

3.2 Scalars, Interpolation, and Context in Perl 59

3.3 Arrays and Context in Perl 60

3.4 Word Lengths in Poe's "The Tell-Tale Heart" 64

3.5 Arrays and Functions 66

3.5.1 Adding and Removing Entries from Arrays 66

3.5.2 Selecting Subsets of an Array 69

3.5.3 Sorting an Array 69

3.6 Hashes 73

3.6.1 Using a Hash 74

3.7 Two Text Applications 77

3.7.1 Zipf's Law for A Christmas Carol 77

3.7.2 Perl for Word Games 83

3.7.2.1 An Aid to Crossword Puzzles 83

3.7.2.2 Word Anagrams 84

3.7.2.3 Finding Words in a Set of Letters 85

3.8 Complex Data Structures 86

3.8.1 References and Pointers 87

3.8.2 Arrays of Arrays and Beyond 90

3.8.3 Application: Comparing the Words in Two Poe Stories 92

3.9 References 96

3.10 First Transition 97

Problems 97

4 Probability and Text Sampling 105

4.1 Introduction 105

4.2 Probability 105

4.2.1 Probability and Coin Flipping 106

4.2.2 Probabilities and Texts 108

4.2.2.1 Estimating Letter Probabilities for Poe and Dickens 109

4.2.2.2 Estimating Letter Bigram Probabilities 112

4.3 Conditional Probability 115

4.3.1 Independence 117

4.4 Mean and Variance of Random Variables 118

4.4.1 Sampling and Error Estimates 120

4.5 The Bag-of-Words Model for Poe's "The Black Cat" 123

4.6 The Effect of Sample Size 124

4.6.1 Tokens vs. Types in Poe's "Hans Pfaall" 124

4.7 References 128

Problems 129

5 Applying Information Retrieval to Text Mining 133

5.1 Introduction 133

5.2 Counting Letters and Words 134

5.2.1 Counting Letters in Poe with Perl 134

5.2.2 Counting Pronouns Occurring in Poe 136

5.3 Text Counts and Vectors 138

5.3.1 Vectors and Angles for Two Poe Stories 139

5.3.2 Computing Angles between Vectors 140

5.3.2.1 Subroutines in Perl 140

5.3.2.2 Computing the Angle between Vectors 143

5.4 The Term-Document Matrix Applied to Poe 143

5.5 Matrix Multiplication 147

5.5.1 Matrix Multiplication Applied to Poe 148

5.6 Functions of Counts 150

5.7 Document Similarity 152

5.7.1 Inverse Document Frequency 153

5.7.2 Poe Story Angles Revisited 154

5.8 References 157

Problems 157

6 Concordance Lines and Corpus Linguistics 161

6.1 Introduction 161

6.2 Sampling 162

6.2.1 Statistical Survey Sampling 162

6.2.2 Text Sampling 163

6.3 Corpus as Baseline 164

6.3.1 Function vs. Content Words in Dickens, London, and Shelley 168

6.4 Concordancing 169

6.4.1 Sorting Concordance Lines 170

6.4.1.1 Code for Sorting Concordance Lines 171

6.4.2 Application: Word Usage Differences between London and Shelley 172

6.4.3 Application: Word Morphology of Adverbs 176

6.5 Collocations and Concordance Lines 179

6.5.1 More Ways to Sort Concordance Lines 179

6.5.2 Application: Phrasal Verbs in The Call of the Wild 181

6.5.3 Grouping Words: Colors in The Call of the Wild 184

6.6 Applications with References 185

6.7 Second Transition 187

Problems 188

7 Multivariate Techniques with Text 191

7.1 Introduction 191

7.2 Basic Statistics 192

7.2.1 z-Scores Applied to Poe 193

7.2.2 Word Correlations among Poe's Short Stories 195

7.2.3 Correlations and Cosines 199

7.2.4 Correlations and Covariances 201

7.3 Basic Linear Algebra 202

7.3.1 2 by 2 Correlation Matrices 202

7.4 Principal Components Analysis 205

7.4.1 Finding the Principal Components 206

7.4.2 PCA Applied to the 68 Poe Short Stories 206

7.4.3 Another PCA Example with Poe's Short Stories 209

7.4.4 Rotations 209

7.5 Text Applications 211

7.5.1 A Word on Factor Analysis 211

7.6 Applications and References 211

Problems 212

8 Text Clustering 219

8.1 Introduction 219

8.2 Clustering 220

8.2.1 Two-Variable Example of k-Means 220

8.2.2 k-Means with R 223

8.2.3 He versus She in Poe's Short Stories 224

8.2.4 Poe Clusters Using Eight Pronouns 229

8.2.5 Clustering Poe Using Principal Components 230

8.2.6 Hierarchical Clustering of Poe's Short Stories 234

8.3 A Note on Classification 235

8.3.1 Decision Trees and Overfitting 235

8.4 References 236

8.5 Last Transition 236

Problems 236

9 A Sample of Additional Topics 243

9.1 Introduction 243

9.2 Perl Modules 243

9.2.1 Modules for Number Words 244

9.2.2 The StopWords Module 245

9.2.3 The Sentence Segmentation Module 245

9.2.4 An Object-Oriented Module for Tagging 247

9.2.5 Miscellaneous Modules 248

9.3 Other Languages: Analyzing Goethe in German 248

9.4 Permutation Tests 251

9.4.1 Runs and Hypothesis Testing 252

9.4.2 Distribution of Character Names in Dickens and London 254

9.5 References 258

Appendix A: Overview of Perl for Text Mining 259

A.1 Basic Data Structures 259

A.1.1 Special Variables and Arrays 262

A.2 Operators 263

A.3 Branching and Looping 266

A.4 A Few Perl Functions 270

A.5 Introduction to Regular Expressions 271

Appendix B: Summary of R used in this Book 275

B.1 Basics of R 275

B.1.1 Data Entry 276

B.1.2 Basic Operators 277

B.1.3 Matrix Manipulation 278

B.2 This Book's R Code 279

References 283

Index 291