
Abstract The aim of this research is the comparative evaluation of different statistical techniques applied to Automatic Text Categorization in Modern Greek. The statistical methods considered come from the family of multivariate data analysis methods: a) Cluster Analysis, b) Multiple Linear Regression, and c) Discriminant Function Analysis. Furthermore, a number of different predictor variables were used in order to investigate their relative contribution to the overall discriminatory performance of each statistical method.

1.

Discriminant Function Analysis (Mikros & Carayannis 2000; Stamatatos, Fakotakis, Kokkinakis 2001), Cluster Analysis (Tambouratzis et al. 2000), Multiple Regression (Stamatatos, Fakotakis, Kokkinakis 1999; Stamatatos, Fakotakis, Kokkinakis 2000).

2.

900

.

(Table 1):

Table 1:

Category   Words     Mean (words/text)   Std. dev.   Min   Max     Texts
-          33,692    225                 136.3       81    969     150
-          24,087    161                 622.5       80    5,748   150
-          26,976    180                 106.8       83    739     150
-          35,136    234                 124.4       81    1,434   150
-          102,906   686                 1,499.7     81    5,960   150
Media      31,395    209                 117.5       80    671     150
Total      254,192   282                 692.9       -     -       900

1. «Media»

2. (75%) (< 250), 2% 1000 (Mikros 2002). 500 (Baillie 1974; Ledger & Merriam 1994: 244).

3. (register) (Rudman 1998: 357).

3.

WordNet (Junker & Abecker 1997; Scott & Matwin 1999; Buenaga et al. 1997).

(spam mail) (Gómez et al. 2000; Sahami et al. 1998).

(Burrows 1992; Burrows & Craig 1994), (Koller & Sahami 1997; Zaiane & Antonie 2002).

(Karlgren 1999: 161) (Forsyth & Holmes 1996: 170).

(Hoch 1994).

Type/Token ratio: (types). 500 (Biber 1988).

Yule's K: (Tweedie & Baayen 1998: 350).
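The lexical richness measures named here can be sketched in a few lines of Python. This is a minimal illustration, not the study's own implementation: the function names are hypothetical, the 500-token chunk size for the standardized ratio is an assumption based on the figure 500 that appears next to the Biber (1988) citation, and Yule's K follows the usual formulation K = 10^4 (sum over m of m^2 V_m - N) / N^2, where V_m is the number of types that occur m times.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def standardized_ttr(tokens, chunk=500):
    """Mean TTR over consecutive full chunks; plain TTR if the text is shorter."""
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens) - chunk + 1, chunk)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V_m - N) / N^2."""
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())   # m -> V_m
    return 1e4 * (sum(m * m * v for m, v in spectrum.items()) - n) / (n * n)

tokens = "a b a c a b".split()
ttr = type_token_ratio(tokens)
sttr = standardized_ttr(tokens, chunk=3)
k = yules_k(tokens)
```

Averaging the ratio over fixed-size chunks keeps the measure comparable across texts of different length, which matters when most texts in a corpus are short.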

:

« » ( , ,

) ( , ,

, ).

( )

( )

:

, , . . . .

15

1

15 .

: 5

: ( ), ! % -

( ):

.

( ): 50

( ): 20. 6, 120 (20 × 6); 60 (20 × 3).

500

.

4.

(Dixon & Mannion 1993; Matthews & Merriam 1993; Holmes & Forsyth 1995).

Forsyth & Holmes (1996: 163).

Stamatatos, Fakotakis, Kokkinakis (2000) (BNC).

Stamatatos, Fakotakis, Kokkinakis (2001): 30, 50.

(frequency profiling) (Hofland & Johansson 1982; Rayson et al. 1997; Granger & Rayson 1998).

Iakovou, Markopoulos, Mikros (2002), (2003).

:

1. .

2. - : . .

4

4 - ,

. . ( - ( ), -

( ), - ( ), - ( ) . . .

3. ( ) - .

4. -

. ( ) ~ ( , , )

5. k

-

3 - .

6. -

7. n ( 4 k)

n

( 5) χ² (Schütze et al. 1995), mutual information, information gain (Lewis & Ringuette 1994), Principal Component Analysis (Wiener et al. 1993), log likelihood (Dunning 1993).

(Kageura 1997).

χ² (Manning & Schütze 1999: 174).
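Two of the association measures cited in this passage, chi-square and log likelihood (Dunning 1993), can be computed for a single word from its frequency in two corpora. The Python sketch below is illustrative only: the function names, the 2 x 2 contingency layout, and the example counts are assumptions, not the procedure used in the study.

```python
import math

def contingency(a, b, n1, n2):
    """2 x 2 table: the word vs. all other tokens, corpus 1 vs. corpus 2."""
    return [[a, b], [n1 - a, n2 - b]]

def expected(table):
    """Expected counts if word frequency were independent of the corpus."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]

def chi_square(a, b, n1, n2):
    """Pearson chi-square for a word seen a times in n1 tokens and b times in n2."""
    obs = contingency(a, b, n1, n2)
    exp = expected(obs)
    return sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def log_likelihood(a, b, n1, n2):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)) over the four cells."""
    obs = contingency(a, b, n1, n2)
    exp = expected(obs)
    # Cells with an observed count of 0 contribute nothing to the sum.
    return 2 * sum(obs[i][j] * math.log(obs[i][j] / exp[i][j])
                   for i in range(2) for j in range(2) if obs[i][j] > 0)

# Hypothetical counts: 30 occurrences in 1000 tokens vs. 10 in 1000 tokens.
chi2 = chi_square(30, 10, 1000, 1000)
g2 = log_likelihood(30, 10, 1000, 1000)
```

Both functions assume the word occurs at least once in the two corpora combined; otherwise an expected count is zero and the chi-square division fails.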

5.

(cross-validation) leave-one-out.

1.
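Leave-one-out cross-validation holds out each text in turn, trains on all the others, and classifies the held-out text. A minimal Python sketch follows; the nearest-centroid classifier and the toy feature vectors are hypothetical stand-ins, not the classifiers or data evaluated in this study.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train_centroids(X, y):
    """Hypothetical stand-in classifier: one mean vector per category."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)}

def classify(model, x):
    """Assign x to the category with the nearest mean vector."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: dist2(model[label]))

def leave_one_out(X, y):
    """Return one prediction per text, each made without seeing that text."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predictions.append(classify(train_centroids(X_train, y_train), X[i]))
    return predictions

# Toy data: two well-separated categories of three texts each.
X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]]
y = ["a", "a", "a", "b", "b", "b"]
predictions = leave_one_out(X, y)
```

Any classifier with a train step and a classify step can be dropped into `leave_one_out` in place of the centroid model.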

(macro-averaging).

F1:

\[ F_1 = \frac{2PR}{P + R} \qquad (1) \]

where P is precision and R is recall.
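Macro-averaging computes F1 separately for each category and then takes the unweighted mean, so every category counts equally regardless of size. The Python sketch below uses the standard definition of F1 as the harmonic mean of precision and recall; the function names and the toy labels are illustrative assumptions.

```python
def f1_per_category(gold, predicted):
    """F1 = 2PR / (P + R) for each category seen in the gold labels."""
    scores = {}
    for label in set(gold):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores

def macro_f1(gold, predicted):
    """Unweighted mean of the per-category F1 scores."""
    scores = f1_per_category(gold, predicted)
    return sum(scores.values()) / len(scores)

gold = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
per_category = f1_per_category(gold, predicted)
macro = macro_f1(gold, predicted)
```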

: 6

. 3 ( , ,

)

:

150 .

75 .

:

.

:

o

o ( M + )

o +

o +

o

o

o

2 .

.

3 6

3 6 12

6 .

6.

Multiple Linear Regression, Discriminant Function Analysis (DFA), Hierarchical Cluster Analysis.

6.1 Hierarchical Cluster Analysis

Ward (El-Hamdouchi & Willett 1986).

(Pearson Correlation).
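Ward's method merges, at each step, the two clusters whose union gives the smallest increase in the within-cluster sum of squares. A minimal Python sketch of that criterion follows. It works on squared Euclidean distances between cluster centroids, which is the standard Ward formulation and is used here as an assumption; it does not reproduce the Pearson correlation measure mentioned in the text. Function names and the toy vectors are hypothetical.

```python
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clusters(vectors, k):
    """Merge greedily until k clusters remain; returns lists of row indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(
                [vectors[r] for r in clusters[p[0]]],
                [vectors[r] for r in clusters[p[1]]],
            ),
        )
        clusters[i] += clusters[j]   # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
clusters = ward_clusters(vectors, 2)
```

Recomputing centroids at every step keeps the sketch short; it is quadratic per merge and only suited to small examples.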

(Figures 1, 2):

Figure 1:

Figure 2:

[Charts: F1 on the value axis (0-60); label YM++; groupings 150 and 75; series 6 and 3]

6.2 Multiple Linear Regression

(Biebricher et al. 1988; Fuhr et al. 1991; Yang & Chute 1994).

Equation 2:

\[ y = a + w_1 X_1 + w_2 X_2 + \dots + w_n X_n \qquad (2) \]

where:
y = dependent variable
a = constant
w = regression weights
X = independent variables

Stepwise, Enter.

Enter.
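A linear model of the form y = a + w1X1 + ... + wnXn can be fitted by ordinary least squares. The Python sketch below solves the normal equations with plain Gaussian elimination and enters all predictors at once rather than stepwise. The function names and the toy data are hypothetical, and how regression outputs were mapped to text categories in the study is not shown here.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_linear(X, y):
    """Least-squares fit; returns [a, w1, ..., wn]."""
    rows = [[1.0] + list(r) for r in X]          # leading 1 carries the constant a
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(coefs, x):
    """Evaluate a + w1*x1 + ... + wn*xn for one case."""
    return coefs[0] + sum(w * v for w, v in zip(coefs[1:], x))

# Toy data generated from y = 1 + 2*x1 + 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
coefs = fit_linear(X, y)
```

The solver assumes the predictors are not perfectly collinear; with collinear columns a pivot becomes zero and the division fails.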

(Figure 3):

Figure 3:

[Charts: F1 on the value axis (0-80 and 0-90); label ++; groupings 150 and 75; series 6 and 3]

6.3 Discriminant Function Analysis

a priori.

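k-1, k.

Equation 3:

\[ D_{jk} = a + w_1 X_{1k} + w_2 X_{2k} + \dots + w_n X_{nk} \qquad (3) \]

where:
D_{jk} = value of discriminant function j for case k
a = constant
w_i = coefficient of variable i
X_{ik} = value of variable i for case k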
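Once coefficients are known, the linear discriminant function can be evaluated directly for any case. The Python sketch below computes the scores of one case on each function and assigns it to the group with the nearest centroid in discriminant space. All coefficients, centroids, and group names are hypothetical values for illustration, and nearest-centroid assignment is one common rule, not necessarily the one applied in this study.

```python
def discriminant_score(a, w, x):
    """D_jk = a + w_1*X_1k + ... + w_n*X_nk for one function j and one case k."""
    return a + sum(wi * xi for wi, xi in zip(w, x))

def assign_group(scores, centroids):
    """Return the group whose centroid is nearest to the case's scores."""
    def dist2(center):
        return sum((s - c) ** 2 for s, c in zip(scores, center))
    return min(centroids, key=lambda group: dist2(centroids[group]))

# Hypothetical values: k = 3 groups give at most k - 1 = 2 functions.
functions = [(-1.0, [0.5, 0.2]), (0.3, [-0.1, 0.4])]   # (constant a, weights w)
centroids = {"g1": [2.0, 1.0], "g2": [-2.0, 0.5], "g3": [0.0, -2.0]}

case = [6.0, 1.0]                                        # X_1k, X_2k
scores = [discriminant_score(a, w, case) for a, w in functions]
group = assign_group(scores, centroids)
```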

(Karlgren & Cutting 1994).

(Figure 4):

Figure 4:

[Charts: F1 on the value axis (0-100 and 0-90); label ++; groupings 150 and 75; series 6 and 3]

(Figure 5):

Figure 5:

[Scatter plots: Function 1 (horizontal) against Function 2 (vertical); legend: Group Centroids, MEDIA]

7.

(Figure 6):

Figure 6:

92%.

(Figure 7):

Figure 7:

[Charts for Figures 6 and 7: F1 on the value axis (0-100, 0-80, 0-90); label ++; groupings 150 and 75; series 6 and 3]

F.

(Figure 8):

Figure 8:

[Chart: F on the value axis (0-12); label MO (F)]
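For a single variable, Wilks' Lambda is the ratio of the within-group to the total sum of squares, and it maps directly onto the one-way F statistic. The Python sketch below assumes that the per-variable F values reported here come from such univariate tests over 900 texts in 6 categories; that reading is an assumption, and the function names are illustrative.

```python
def wilks_lambda(groups):
    """Univariate Wilks' Lambda: within-group SS divided by total SS."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return ss_within / ss_total

def f_from_wilks(lam, n_cases, n_groups):
    """F statistic with (k - 1, N - k) degrees of freedom for one variable."""
    return (1 - lam) / lam * (n_cases - n_groups) / (n_groups - 1)

# Toy data: two groups of three values each.
lam = wilks_lambda([[1, 2, 3], [4, 5, 6]])
f_toy = f_from_wilks(lam, n_cases=6, n_groups=2)

# Lambda of 0.887 with N = 900 and k = 6.
f_reported = f_from_wilks(0.887, n_cases=900, n_groups=6)
```

With N = 900 and k = 6, a Lambda of 0.887 gives an F of about 22.8, close to the 22.763 listed for the strongest variable; the small gap is consistent with Lambda being rounded to three decimals.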

Table 2:

No.   Variable   Wilks' Lambda   F        Sig.
10    %          0.887           22.763   1.51746E-21
14    "-         0.893           21.421   2.77415E-20
25    -          0.909           17.890   6.22179E-17
54    ,          0.935           12.488   9.77125E-12
77    5          0.946           10.256   1.4206E-09
89    -          0.960           7.537    6.05704E-07
93    :          0.960           7.368    8.81204E-07
94    9          0.961           7.259    1.12183E-06
99    10         0.963           6.826    2.92779E-06
100   11         0.965           6.544    5.45472E-06
106   ;          0.966           6.228    1.09362E-05
107   6          0.967           6.183    1.20659E-05
111   15         0.967           6.098    1.45413E-05
112   -          0.967           6.069    1.55028E-05
116   Yule K     0.969           5.727    3.27666E-05
123   12         0.972           5.200    0.00010324
127   13         0.973           4.978    0.00016708
131   -          0.977           4.240    0.000815163
138   14         0.978           4.063    0.001187014
140   4          0.979           3.924    0.001591351
156   -          0.982           3.298    0.005865007
160   7          0.983           3.120    0.008449015
163   STTR       0.983           3.076    0.009253816
165   -          0.983           3.072    0.00931835
170   -          0.985           2.811    0.015814403
175   -          0.986           2.533    0.027445062
180   -          0.987           2.362    0.038356071
186   "-         0.988           2.083    0.065315221
191   1          0.989           1.954    0.083157476
192   8          0.989           1.945    0.084477584
193   3          0.990           1.884    0.094539568
197   . .        0.991           1.548    0.172454647
202   2          0.995           0.854    0.511480483

References

Baillie, D. W. 1974. "Authorship attribution in Jacobean dramatic texts". Computers in the Humanities, ed. by J. L. Mitchell. Edinburgh: Edinburgh University Press.

Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

Biebricher, Peter, Fuhr, Norbert, Lustig, Gerhard, Schwantner, Michael, Knorz, Gerhard. 1988. "The automatic indexing system AIR/PHYS - from research to application". Proceedings of the 11th International Conference on Research and Development in Information Retrieval (SIGIR 88), 333-342.

Buenaga, Manuel, Gómez, Jose & Diaz, Belen. 1997. "Using WordNet to complement training information in text categorization". Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing, ed. by R. Mitkov, N. Nicolov, and N. Nikolov, 202-207. Tzigov Chark, BL.

Burrows, John F. & Craig, Hugh. 1994. "Lyrical drama and the 'Turbid Mountebanks': Styles of dialogue in romantic and renaissance tragedy". Computers and the Humanities 28. 63-86.

Burrows, John F. 1992. "Not unless you ask nicely: the interpretive nexus between analysis and information". Literary and Linguistic Computing 7. 91-109.

Dixon, Peter & Mannion, David. 1993. "Goldsmith's periodical essays: a statistical analysis of eleven doubtful cases". Literary and Linguistic Computing 8. 1-19.

Dunning, Ted. 1993. "Accurate methods for the statistics of surprise and coincidence". Computational Linguistics 19. 61-74.

El-Hamdouchi, A. & Willett, Peter. 1986. "Hierarchic document classification using Ward's clustering method". Proceedings of the 9th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR 86), 149-156.

Forsyth, Richard S. & Holmes, David I. 1996. "Feature-finding for text classification". Literary and Linguistic Computing 11. 163-174.

Fuhr, Norbert, Hartmann, Stephan, Lustig, G., Schwantner, Michael, Tzeras, Konstadinos. 1991. "AIR/X - a rule-based multistage indexing system for large subject fields". Proceedings of RIAO'91, 606-623.

Granger, Sylviane & Rayson, Paul. 1998. "Automatic profiling of learner texts". Learner English on Computer, ed. by Sylviane Granger, 119-131. London and New York: Longman.

Gómez, Jose & de Buenaga, Manuel. 1997. "Integrating a lexical database and a training collection for text categorization". Proceedings of the ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP. 39-44.

Hoch, Rainer. 1994. "Using IR techniques for text classification in document analysis". 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 94), 31-40.

Hofland, Knut & Johansson, Stig. 1982. Word frequencies in British and American English. Bergen, Norway: The Norwegian Computing Centre for the Humanities.

Iakovou, Maria, Markopoulos, George, Mikros, George. 2003. 6, 18-21 2003.

Iakovou, Maria, Markopoulos, George, Mikros, George. 2002. 2.

Junker, Markus & Abecker, Andreas. 1997. "Exploiting thesaurus knowledge in rule induction for text classification". Proceedings of RANLP-97, 2nd International Conference on Recent Advances in Natural Language Processing, ed. by R. Mitkov, N. Nicolov, and N. Nikolov, 202-207. Tzigov Chark, BL.

Karlgren, Jussi. 1999. "Stylistic experiments in information retrieval". Natural Language Information Retrieval, ed. by T. Strzalkowski, 147-166. Dordrecht: Kluwer.

Karlgren, Jussi & Cutting, Douglass. 1994. "Recognizing text genres with simple metrics using discriminant analysis". Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), volume II, 1071-1075. Kyoto, Japan.

Koller, Daphne & Sahami, Mehran. 1997. "Hierarchically classifying documents using very few words". International Conference on Machine Learning, Nashville, volume 14. 170-178. San Francisco: Morgan Kaufmann.

Ledger, Gerard & Merriam, Thomas V. N. 1994. "Shakespeare, Fletcher, and the Two Noble Kinsmen". Literary and Linguistic Computing 9. 235-248.

Lewis, David D. & Ringuette, Marc. 1994. "A comparison of two learning algorithms for text categorization". Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR 94), 81-93.

Manning, Christopher D. & Schütze, Hinrich. 1999. Foundations of statistical natural language processing. Cambridge, Massachusetts: MIT Press.

Mikros, George & Carayannis, George. 2000. "Modern Greek Corpus Taxonomy". Proceedings of the 2nd International Conference on Language Resources and Evaluation, volume I. 129-134. Athens, Greece.

Mikros, George. 2002. "Quantitative parameters in corpus design: Estimating the optimum text-size in Modern Greek language". Proceedings of the 3rd International Conference on Language Resources and Evaluation, volume III. 834-838. Gran Canaria, Spain.

Rayson, Paul, Leech, Geoffrey & Hodges, Mary. 1997. "Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus". International Journal of Corpus Linguistics 2. 133-152.

Rudman, Joseph. 1998. "The state of authorship attribution studies: some problems and solutions". Computers and the Humanities 31. 351-365.

Sahami, Mehran, Dumais, Susan, Heckerman, David & Horvitz, Eric. 1998. "A Bayesian approach to filtering junk e-mail". Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Tech. Rep. WS-98-05. 55-62.

Schütze, Hinrich, Hull, David A., Pedersen, Jan O. 1995. "A comparison of classifiers and document representations for the routing problem". 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 95), 229-237.

Scott, Sam & Matwin, Stan. 1999. "Feature engineering for text classification". Proceedings of ICML-99, 16th International Conference on Machine Learning, ed. by I. Bratko and S. Dzeroski. 379-388. San Francisco: Morgan Kaufmann.

Stamatatos, Efstathios, Fakotakis, Nikolaos, Kokkinakis, George. 1999. "Automatic authorship attribution". Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL 99), July 8-12, 1999. 158-164. Bergen, Norway.

Stamatatos, Efstathios, Fakotakis, Nikolaos, Kokkinakis, George. 2000. "Automatic Text Categorization in Terms of Genre and Author". Computational Linguistics 26. 471-495.

Stamatatos, Efstathios, Fakotakis, Nikolaos, Kokkinakis, George. 2001. "Computer-based authorship attribution without lexical measures". Computers and the Humanities 35. 193-214.

Tambouratzis, George, Markantonatou, Stella, Xairetakis, Nikolaos, Carayannis, George. 2000. "Automatic style categorization of corpora in the Greek language". Proceedings of the 2nd International Conference on Language Resources and Evaluation, volume I. 135-140. Athens, Greece.

Tweedie, Fiona & Baayen, Harald R. 1998. "How variable a constant can be? Measures of lexical richness in perspective". Computers and the Humanities 32. 323-352.

Wiener, Erik, Pedersen, Jan O., Weigend, Andreas S. 1993. "A neural network approach to topic spotting". Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, 22-34.

Yang, Yiming & Chute, Christopher G. 1994. "An example-based mapping method for text categorization and retrieval". ACM Transactions on Information Systems (TOIS) 12. 252-277.

Zaiane, Osmar R. & Antonie, Maria-Luiza. 2002. "Classifying text documents by associating terms with text categories". Proceedings of the Thirteenth Australasian Database Conference (ADC2002), Melbourne, Australia. Conferences in Research and Practice in Information Technology, volume 5, ed. by X. Zhou, 215-222.
