
Page 1: Arabic keyphrase Extraction


Benha University

Faculty of Engineering at Shoubra

Computer engineering department

Graduation Project

Supervised by: Prof. Dr. Abdulwahab Al-Sammak

Arabic Keyphrase Extraction

كلمتين وبس ("Just two words")

Page 2: Arabic keyphrase Extraction


Prepared by:

Ahmed Ali

Ahmed Mostafa Mohammed

Ahmed Rashad Basiouny

Mohab Tarek El-Shishtawy

Mostafa Mahmoud El-Abady

Sherif Mohammed Nasr

A graduation project submitted to the Computer Engineering Department in fulfillment of the requirements for the degree of B.Sc. in Computer Engineering

Cairo, Egypt

June 14, 2012

Page 3: Arabic keyphrase Extraction


Acknowledgement

In the name of ALLAH, most Gracious, most Merciful. First of all, we thank Allah for the power and the ability He gave us to make this project real. We would also like to express our gratitude to all those who made it possible for us to complete this project:

Prof. Dr. Abdulwahab Al-Sammak, for his advice, guidance, support and encouragement.

Prof. Dr. Tarek El-Shishtawy, for his advice, effort and support.

The Stanford NLP Group, for their valuable resources.

The Linguistic Data Consortium (LDC), for their valuable resources.

Our parents, brothers and sisters who endured this time with us and

were always a great source of encouragement.

Page 4: Arabic keyphrase Extraction


Table of Contents:

1. Introduction 6

2. Data Mining 12
   2.1 Introduction
   2.2 The Scope of Data Mining
   2.3 Background
   2.4 KDD Process
   2.5 The Cross-Industry Standard Process for Data Mining
   2.6 Simplified Process in KDD

3. Keyphrase Extraction 19
   3.1 Introduction
   3.2 Supervised Machine Learning Techniques
      3.2.1 C4.5 decision tree induction algorithm
      3.2.2 GenEx (Genitor and Extractor)
         3.2.2.1 Extractor
         3.2.2.2 Genitor
      3.2.3 Sakhr
      3.2.4 Kea
      3.2.5 Using Linguistic Knowledge and Machine Learning Techniques
   3.3 Unsupervised Machine Learning Techniques
      3.3.1 KP-Miner
         3.3.1.1 System Overview
         3.3.1.2 Candidate keyphrase selection
         3.3.1.3 Candidate keyphrases weight calculation
         3.3.1.4 Final Candidate Phrase List Refinement
         3.3.1.5 Evaluation and Drawbacks

4. Proposed System 37
   4.1 Introduction
   4.2 Pre-Processing Phase

Page 5: Arabic keyphrase Extraction


   4.3 Segmentation
   4.4 POS Tagging Phase
      4.4.1 Training data supplied to POS Tagger
      4.4.2 POS Tag Set
      4.4.3 Lemmatization
   4.5 Candidate Keyphrase
   4.6 Feature Extraction Phase

5. Results and Future Work 54
   5.1 Results
   5.2 Future Work

Appendix 59

Page 6: Arabic keyphrase Extraction


Chapter 1 Introduction

Page 7: Arabic keyphrase Extraction


Many academic journals ask their authors to provide a list of about five to fifteen

keywords, to appear on the first page of each article. Since these key words are often

phrases of two or more words, they are called keyphrases. There is a wide variety of tasks

for which keyphrases are useful.

Background and Related Work:

The task of extracting keyphrases from free-text documents is becoming increasingly important as the uses for such technology expand and the amount of electronic textual content grows. Keyphrases can help manage these large amounts of textual information. They play an important role in digital libraries, web content, and content management systems, especially for cataloging and information retrieval purposes.

The limited number of documents that have author-assigned keyphrases as metadata

description raises the need for a tool that can automatically extract keyphrases from text.

Such a tool can enable many different types of information retrieval and analysis systems.

It can provide the automation of:

• Generating metadata that gives a high-level description of a document's contents. This

provides tools for text-mining related tasks such as document and Web page retrieval

purposes.

• Summarizing documents for prospective readers. Keyphrases can represent a highly

condensed summary of the document in question (Avanzo & Magnini, 2005).

• Highlighting important topics within the body of the text, to facilitate speed reading (skimming), which allows the reader to decide whether the document is relevant or not.

• Measuring the similarity between documents, making it possible to cluster and

categorize documents (Karanikolas & Skourlas, 2006).

• Searching: searches become more precise when keyphrases are used as the basis for search indexes or as a way of browsing a collection of documents.

Many remarkable efforts have been proposed and implemented for automatically extracting keyphrases from English documents and other languages. In contrast, little effort has been devoted to documents written in Arabic. Although some

Page 8: Arabic keyphrase Extraction

researchers have applied their keyphrase extraction systems to Arabic documents, the efficiency of the extracted keyphrases was not satisfactory.

Work on automatic keyphrase extraction started fairly recently. The first attempts to approach this task were purely based on heuristics (Krulwich and Burkey, 1996). However, keyphrases generated by this approach failed to map well to author-assigned keywords, indicating that the applied heuristics were poor (Turney, 1999). Motivated by the spectrum of potential applications of accurate keyphrase extraction and the failings of the heuristic model, Peter Turney devised a powerful, machine-learning-based keyphrase extraction system called GenEx (Turney, 1999; Turney, 2000). In building this system, Turney was the first to approach the task of keyphrase extraction as a supervised learning problem. Turney uses the degree of statistical association, determined through web mining techniques, to measure semantic relatedness. The major drawback of this work is that calculating the coherence feature takes a lot of time (almost 15 minutes per document) (Turney, 2003). In addition, a number of other systems were designed specifically for extracting keyphrases from web documents, such as those presented in (Chen et al., 2005) and (Kelleher and Luz, 2005).

Kea (Frank et al., 1999; Witten et al., 1999, 2000) is another remarkable effort in this area. It identifies candidate keyphrases in the same manner as Extractor, then uses the Naïve Bayes algorithm to classify the candidate phrases as keyphrases or not. In Kea, candidate phrases are classified using only two features: (i) TFxIDF, and (ii) the relative distance. TFxIDF (term frequency times inverse document frequency) captures a word's frequency in a single document compared to its rarity in the whole document collection. It assigns a high value to a phrase that is relatively frequent in the input document (TF component), yet relatively rare in other documents (IDF component).

The relative distance feature of a phrase in a given document is defined as the number of

words that precede the first occurrence of the phrase divided by the number of words in

the document. Kea uses the Naïve Bayes algorithm to calculate the probability of

membership in a class (the probability that the candidate phrase is a keyphrase). Kea

ranks each of the candidate phrases by the estimated probability that they belong to the

keyphrase class. If the user requests N phrases, then Kea gives the top N phrases with the

highest estimated probability as output.
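To make Kea's two features concrete, they can be sketched as follows. This is a minimal illustration: the function and variable names are ours, and Kea itself works on stemmed phrases and discretizes these values before applying Naïve Bayes.

```python
import math

def kea_features(phrase, doc_words, doc_freq, n_docs):
    """Sketch of Kea's two features for a candidate phrase.

    TFxIDF: frequency in this document times rarity in the collection.
    Relative distance: number of words preceding the first occurrence
    of the phrase, divided by the number of words in the document.
    doc_freq maps a phrase to how many collection documents contain it.
    """
    words = phrase.split()
    n, k = len(doc_words), len(words)
    # all positions where the phrase occurs as a contiguous word sequence
    positions = [i for i in range(n - k + 1) if doc_words[i:i + k] == words]
    tf = len(positions) / n
    idf = math.log(n_docs / (1 + doc_freq.get(phrase, 0)))
    rel_dist = positions[0] / n if positions else 1.0
    return tf * idf, rel_dist
```

Candidates would then be ranked by the Naïve Bayes probability estimated from these two features, and the top N returned to the user.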

KP-Miner (El-Beltagy & Rafea, 2008) is an unsupervised machine learning algorithm which uses TFxIDF measures with two boosting factors: the first depends on phrase length, and the second on phrase position in the document. The KP-Miner system does not need to be trained on a particular document set. It also has the advantage of being configurable, as the rules and heuristics adopted by the system are related to the general nature of documents and keyphrases. This implies that users can use their understanding of the input document to fine-tune the system to their particular needs.
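The KP-Miner scoring idea can be sketched roughly as below. The exact boosting formulas in the published system differ (they involve cut-off constants), so treat both boost factors here as illustrative placeholders, not KP-Miner's actual formulas.

```python
def kp_miner_score(tf, idf, phrase_len, first_pos, doc_len):
    """Illustrative KP-Miner-style weight: TFxIDF scaled by a
    length-based boost and a position-based boost (placeholders)."""
    length_boost = phrase_len                    # longer phrases get boosted
    position_boost = 2.0 - first_pos / doc_len   # earlier occurrences get boosted
    return tf * idf * length_boost * position_boost
```

With these placeholder boosts, a phrase first seen at word 10 of a 1000-word document outscores an otherwise identical phrase first seen at word 900.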

Page 9: Arabic keyphrase Extraction


Proposed System: In this work, the automatic keyphrase extraction is treated as a supervised machine

learning task. Two important issues are defined: how to define the candidate keyphrase

terms, and what features of these terms are considered discriminative, i.e., how to

represent the data, and consequently what is given as input to the learning algorithm. Our

motivation is that adding linguistic knowledge (such as lexical features and syntactic rules)

to the extraction process, rather than relying only on statistics, may obtain better results.

Thus, the current work is based on combining the linguistic knowledge and the machine

learning techniques to extract keyphrases from Arabic documents with reasonable

accuracy. Linguistic knowledge will play an important role in different stages of our

proposed system:

1. Analysis stage, where the document is tokenized into sentences and words. Each word is analyzed to extract its:

   • POS tag
   • Lemma

2. Candidate keyphrase extraction stage, where a set of syntactic rules is used to determine the allowed sequences of words in the generated n-gram terms according to their POS tags and lemmas.

3. Features Vector calculation stage, where some of the selected features of each

candidate phrase are linguistic-based, in addition to the statistical-based features.

The proposed system is based on the following main steps: document pre-processing, part-of-speech analysis, lemmatization, candidate phrase extraction, and feature vector calculation. The following sections describe these steps in detail.

Document Preprocessing:

The input document is segmented at two levels. At the first level, the document is segmented into its constituent sentences based on Arabic phrase delimiter characters such as the comma, semicolon, colon, hyphen, and dot. This process is useful for calculating part of the feature vector of the candidate terms. At the second level, each sentence is segmented into its constituent words based on the criterion that words are usually separated by spaces.
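The two segmentation levels can be sketched as below. The delimiter set follows the characters named above (comma, semicolon, colon, hyphen, dot), with their Arabic counterparts added as an assumption; the system's actual delimiter list may be larger.

```python
import re

# Delimiters named above, plus Arabic comma (U+060C), Arabic semicolon
# (U+061B) and Arabic full stop (U+06D4) as assumed counterparts.
DELIMITERS = r"[,\u060C;\u061B:\-.\u06D4]+"

def segment(document):
    """Two-level segmentation: document -> sentences -> words."""
    sentences = [s.strip() for s in re.split(DELIMITERS, document) if s.strip()]
    return [sentence.split() for sentence in sentences]  # words split on spaces
```

For example, `segment("a b, c d.")` yields two sentences, each split into its words.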

Page 10: Arabic keyphrase Extraction


Part of Speech Tagging: Part-of-speech tagging is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. It is the process of identifying words as nouns, verbs, adjectives, adverbs, etc.

Lemmatization:

The process of extracting the abstract form: it describes the basic form from which the given word is logically derived. Usually, this form differs from the word's stem form, which is obtained after removing the prefix and suffix parts of the word. For example, the stem of the word "مرئية" is "مرء", which represents a human-being object. In contrast, the abstract form of the word is "مرئي", which represents the adjective of a visual object. The abstract form of the given word is represented as follows:

• The single form for nouns.

• The single and male form for adjectives.

• The past form for verbs.

• The stem form for stop-words.

The abstract form of the given word is extremely useful during the process of extracting

candidate keyphrases.
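The four abstract-form rules above can be sketched as a simple dispatch. The lookup tables here are toy placeholders standing in for a real morphological lexicon, which the actual system would need; the function name is ours.

```python
# Toy lookup tables; a real system needs a morphological analyzer.
NOUN_SINGULAR = {"كلمات": "كلمة"}      # plural noun -> single form
ADJ_MASC_SINGULAR = {"كبيرة": "كبير"}  # adjective -> single male form
VERB_PAST = {"يكتب": "كتب"}            # present verb -> past form

def abstract_form(word, pos, stem):
    """Apply the abstract-form rule that matches the word's POS."""
    if pos == "noun":
        return NOUN_SINGULAR.get(word, word)      # single form for nouns
    if pos == "adjective":
        return ADJ_MASC_SINGULAR.get(word, word)  # single male form
    if pos == "verb":
        return VERB_PAST.get(word, word)          # past form for verbs
    return stem                                   # stem form for stop-words
```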

Candidate Phrases Extraction: We used the following syntactic rules for extracting candidate phrases:

1- The candidate phrase can start only with certain kinds of nouns: general-noun, place-noun, proper-noun, and declined-noun.

2- The candidate phrase can end only with a general-noun, place-noun, proper-noun, declined-noun, time-noun, augmented-noun, adjective, or adverb.

3- For a three-word phrase, the second word is also allowed to be a count-noun, conjunction, preposition, or comparison, in addition to those cited in rule 2.

4- We created two lists of stop-words: stop-words that should not appear in a keyphrase, and stop-words that can be the middle word of a three-word keyphrase.
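Rules 1-3 can be expressed as a filter over the POS-tag sequence of each n-gram. The tag names below follow the rule descriptions; the tag set actually used by the system may differ, so this is an illustrative sketch rather than the system's implementation.

```python
START_TAGS = {"general-noun", "place-noun", "proper-noun", "declined-noun"}
END_TAGS = START_TAGS | {"time-noun", "augmented-noun", "adjective", "adverb"}
MIDDLE_TAGS = END_TAGS | {"count-noun", "conjunction", "preposition", "comparison"}

def is_candidate(tags):
    """tags: POS tags of a 1- to 3-word n-gram; True if it may be a keyphrase."""
    if not 1 <= len(tags) <= 3:
        return False
    if tags[0] not in START_TAGS or tags[-1] not in END_TAGS:
        return False  # rules 1 and 2
    if len(tags) == 3 and tags[1] not in MIDDLE_TAGS:
        return False  # rule 3
    return True
```

Rule 4's stop-word lists would be applied as an additional filter on the words themselves.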

Page 11: Arabic keyphrase Extraction


Feature Vector Calculation: Each candidate phrase is assigned a number of features used to evaluate its importance. In our algorithm, several factors control the selection of features and their values. The features are:

a) Normalized Phrase Words (NPW).

b) The Phrase Relative Frequency (PRF).

c) The Word Relative Frequency (WRF).

d) Normalized Sentence Location (NSL).

e) Normalized Phrase Location (NPL).

f) Normalized Phrase Length (NPLen).

g) Sentence Contain Verb (SCV).

h) Is It Question (IIT).

i) (Is-Key).

Many authors, starting from Turney (1997, 1999, 2000), have used features (a), (b), and (c); the proposed algorithm uses a different normalization technique to satisfy our hypothesis of feature importance. The original form of each candidate keyphrase is retained for presentation to the user in case the phrase does turn out to be a keyphrase. This process is a straightforward operation. The proposed algorithm computes features for all candidates instead of for unique stemmed keyphrases (as in KEA and Turney), which eliminates the need to select the most frequent keyphrase when several different forms occur.
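For illustration, two of the normalized features above might be computed as below, assuming definitions analogous to Kea's relative distance. The exact formulas used by the system are not reproduced here, so these bodies are assumptions; only the feature names come from the list above.

```python
def normalized_phrase_location(first_word_index, doc_word_count):
    """NPL sketch: relative position of the phrase's first occurrence."""
    return first_word_index / doc_word_count

def normalized_phrase_length(phrase_word_count, max_words=3):
    """NPLen sketch: phrase length over the maximum candidate length (3 words)."""
    return phrase_word_count / max_words
```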

Results: The program was tested on documents from various fields and by many authors. In order to evaluate the performance of the proposed system, many experiments were carried out. A total of 25 documents were used. The first experiment aimed to measure the level of acceptance of the extracted keyphrases. Since there are no author-assigned keyphrases for these documents, a human judge was adopted to evaluate this level. We compared the results with KP-Miner, but we could not compare with Sakhr because its output was not in a form suitable for comparison.

Page 12: Arabic keyphrase Extraction


Chapter 2 Data Mining

Page 13: Arabic keyphrase Extraction


2.1 Introduction Data mining (knowledge discovery) is the computer-assisted process of digging through

and analyzing enormous sets of data and then extracting the meaning of the data. Data

mining tools predict behaviors and future trends, allowing businesses to make proactive,

knowledge-driven decisions. Data mining tools can answer business questions that

traditionally were exhaustively time consuming to resolve. They scour databases for

hidden patterns, finding predictive information that experts may miss because it lies

outside their expectations. Data mining is a step in KDD process aimed at discovering

patterns and relationships in preprocessed and transformed data.

2.2 The Scope of Data Mining Data mining derives its name from the similarities between searching for valuable

business information in a large database — for example, finding linked products in

gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both

processes require either sifting through an immense amount of material, or intelligently

probing it to find exactly where the value resides. Given databases of sufficient size and

quality, data mining technology can generate new business opportunities by providing

these capabilities:

Automated prediction of trends and behaviors. Data mining automates the process of

finding predictive information in large databases. Questions that traditionally required

extensive hands-on analysis can now be answered directly from the data — quickly. A

typical example of a predictive problem is targeted marketing. Data mining uses data on

past promotional mailings to identify the targets most likely to maximize return on

investment in future mailings. Other predictive problems include forecasting bankruptcy

and other forms of default, and identifying segments of a population likely to respond

similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in

one step. An example of pattern discovery is the analysis of retail sales data to identify

seemingly unrelated products that are often purchased together. Other pattern discovery

problems include detecting fraudulent credit card transactions and identifying anomalous

data that could represent data entry keying errors.

Page 14: Arabic keyphrase Extraction


2.3 Background The manual extraction of patterns from data has occurred for centuries. Early methods of

identifying patterns in data include Bayes' theorem (1700s) and regression analysis

(1800s). The proliferation, ubiquity and increasing power of computer technology have

dramatically increased data collection, storage, and manipulation ability. As data

sets have grown in size and complexity, direct "hands-on" data analysis has increasingly

been augmented with indirect, automated data processing, aided by other discoveries in

computer science, such as neural networks, cluster analysis, genetic

algorithms (1950s), decision trees (1960s), and support vector machines (1990s). Data

mining is the process of applying these methods with the intention of uncovering hidden

patterns in large data sets. It bridges the gap from applied statistics and artificial

intelligence (which usually provide the mathematical background) to database

management by exploiting the way data is stored and indexed in databases to execute the

actual learning and discovery algorithms more efficiently, allowing such methods to be

applied to ever larger data sets.

2.4 KDD Process: The Knowledge Discovery in Databases (KDD) process is commonly defined with the following stages:

(1) Selection

(2) Pre-processing

(3) Transformation

(4) Data Mining

(5) Interpretation/Evaluation

Let’s examine the knowledge discovery process in the diagram above:

Page 15: Arabic keyphrase Extraction


Data coming from a variety of sources is integrated into a single data store called the target data. The data is then pre-processed and transformed into a standard format. The data mining algorithms process the data to produce output in the form of patterns or rules. Those patterns and rules are then interpreted into new or useful knowledge or information.

A wide range of organizations in various industries, including manufacturing, marketing, chemical, and aerospace, are making use of data mining to gain advantages over their competitors. The need for a standard data mining process therefore increased dramatically. The data mining process must be reliable and repeatable by business people with little or no data mining background. In the late 1990s, a cross-industry standard process for data mining (CRISP-DM) was first published, after many workshops and contributions from over 300 organizations. Let's examine the cross-industry standard process for data mining in greater detail.

2.5 The Cross-Industry Standard Process for Data Mining (CRISP-DM)

Cross-Industry Standard Process for Data Mining (CRISP-DM) consists of six phases

intended as a cyclical process, as shown in the following figure:

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Page 16: Arabic keyphrase Extraction


Business understanding - In the business understanding phase, we must first understand the business objectives clearly and find out what the client really wants to achieve. Next, we have to assess the current situation by finding out about the resources, assumptions, constraints and other important factors which should be considered. Then, from the business objectives and the current situation, we need to create data mining goals that achieve the business objective within the current situation. Finally, a good data mining plan has to be established to achieve both the business and data mining goals. The plan should be as detailed as possible, with step-by-step tasks to perform during the project, including the initial selection of data mining techniques and tools.

Data understanding - The data understanding phase starts with initial data collection, which gathers data from the available sources in order to get familiar with it. Some important activities must be carried out, including data loading and data integration, to make the data collection successful. Next, the "gross" or "surface" properties of the acquired data need to be examined carefully and reported. Then, the data needs to be explored by tackling the data mining questions, which can be addressed using querying, reporting and visualization. Finally, the data quality must be examined by answering some important questions such as "Is the acquired data complete?" and "Are there any missing values in the acquired data?"

Data preparation - Data preparation normally consumes about 90% of the time. The outcome of the data preparation phase is the final data set. Once the available data sources are identified, they need to be selected, cleaned, constructed and formatted into the desired form. Data exploration at greater depth may be carried out during this phase to notice patterns based on business understanding.

Modeling - First, modeling techniques have to be selected for the prepared dataset. Next, a test scenario must be generated to validate the models' quality and validity. Then, one or more models are created by running the modeling tool on the prepared dataset. Last but not least, the models need to be assessed carefully, involving stakeholders, to make sure that the created models meet the business initiatives.

Evaluation - In the evaluation phase, the model results must be evaluated in the context of the business objectives from the first phase. In this phase, new business requirements may be raised because new patterns have been discovered in the model results or due to other factors. Gaining business understanding is an iterative process in data mining. The go or no-go decision must be made in this step before moving to the deployment phase.

Page 17: Arabic keyphrase Extraction


Deployment - The knowledge or information gained through the data mining process needs to be presented in such a way that stakeholders can use it when they want it. Based on the business requirements, the deployment phase could be as simple as creating a report or as complex as a repeatable data mining process across the organization. In this phase, deployment, maintenance and monitoring plans have to be created for deployment and future support. From a project point of view, the final report needs to summarize the project experiences and review the project to see what needs to be improved, capturing the lessons learned.

CRISP-DM offers a uniform framework for experience documentation and guidelines. In addition, CRISP-DM can be applied in different industries with different types of data.

2.6 Simplified process in KDD: This section covers a simplified process: (1) pre-processing, (2) data mining, and (3) results validation.

2.6.1 Pre-processing Before data mining algorithms can be used, a target data set must be assembled. As data

mining can only uncover patterns actually present in the data, the target dataset must be

large enough to contain these patterns while remaining concise enough to be mined

within an acceptable time limit. A common source for data is a data mart or data

warehouse. Pre-processing is essential to analyze the multivariate datasets before data

mining. The target set is then cleaned. Data cleaning removes the observations

containing noise and those with missing data.

2.6.2 Data Mining Data mining involves six common classes of tasks:[1]

Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.

Association rule learning (Dependency modeling) – Searches for relationships

between variables. For example, a supermarket might gather data on customer

purchasing habits. Using association rule learning, the supermarket can determine

which products are frequently bought together and use this information for

marketing purposes. This is sometimes referred to as market basket analysis.

Clustering – is the task of discovering groups and structures in the data that are in

some way or another "similar", without using known structures in the data.

Page 18: Arabic keyphrase Extraction


Classification – is the task of generalizing known structure to apply to new data.

For example, an e-mail program might attempt to classify an e-mail as "legitimate"

or as "spam".

Regression – Attempts to find a function which models the data with the least

error.

Summarization – providing a more compact representation of the data set,

including visualization and report generation.
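The regression task above, finding a function that models the data with the least (squared) error, can be illustrated with a one-variable least-squares fit in plain Python:

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

print(fit_line([1, 2, 3], [2, 4, 6]))  # exact fit: slope 2, intercept 0
```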

2.6.3 Results Validation The final step of knowledge discovery from data is to verify that the patterns produced by

the data mining algorithms occur in the wider data set. Not all patterns found by the data

mining algorithms are necessarily valid. It is common for the data mining algorithms to

find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data

mining algorithm was not trained. The learned patterns are applied to this test set and the

resulting output is compared to the desired output. For example, a data mining algorithm

trying to distinguish "spam" from "legitimate" emails would be trained on a training set of

sample e-mails. Once trained, the learned patterns would be applied to the test set of e-

mails on which it had not been trained. The accuracy of the patterns can then be

measured from how many e-mails they correctly classify. A number of statistical methods

may be used to evaluate the algorithm, such as ROC curves.
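The held-out evaluation described above can be sketched with a toy rule-based "spam" classifier; the rule and test set here are illustrative only, not a real trained model.

```python
# Toy classifier standing in for patterns learned on a training set.
def classify(email):
    return "spam" if "win money" in email else "legitimate"

# Held-out test set the "model" was never trained on.
test_set = [("win money now", "spam"),
            ("meeting at noon", "legitimate"),
            ("lunch tomorrow", "legitimate")]

correct = sum(classify(text) == label for text, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # fraction of e-mails correctly classified
```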

If the learned patterns do not meet the desired standards, then it is necessary to re-

evaluate and change the pre-processing and data mining steps. If the learned patterns do

meet the desired standards, then the final step is to interpret the learned patterns and

turn them into knowledge.

Page 19: Arabic keyphrase Extraction


Chapter 3 Keyphrase Extraction

Page 20: Arabic keyphrase Extraction


3.1 Introduction Several keyphrase extraction techniques have been proposed and implemented successfully in different contexts. Attempts at keyphrase extraction can be classified into two main streams: supervised machine learning algorithms and unsupervised machine learning algorithms. (Most of the prior work on the document keyphrase extraction problem is based on machine learning techniques.)

3.2 Supervised Machine Learning Techniques We will start with techniques that are based on supervised machine learning, namely:

Turney (1997, 1999, 2000)

Sakhr

Turney was the first to approach the problem of keyphrase extraction as supervised learning, and presented two different machine learning algorithms for extracting keyphrases from a document. The first algorithm is based on the C4.5 decision tree classifier (Quinlan, 1993), and the second is the GenEx (Genitor and Extractor) algorithm (Turney, 1997, 1999, 2000).

3.2.1 C4.5 decision tree induction algorithm The C4.5 decision tree induction algorithm was used to classify phrases as positive or negative examples of keyphrases. In this section, we describe the feature vectors, the settings we used for C4.5's parameters, the bagging procedure, and the method for sampling the training data.

The task of supervised learning is to learn how to assign cases (or examples) to classes.

For keyphrase extraction, a case is a candidate phrase, which we wish to classify as a

positive or negative example of a keyphrase. We classify a case by examining its features.

A feature can be any property of a case that is relevant for determining the class of the

case. C4.5 can handle real-valued features, integer-valued features, and features with

values that range over an arbitrary, fixed set of symbols. C4.5 takes as input a set of

training data, in which cases are represented as feature vectors. In the training data, a

teacher must assign a class to each feature vector (hence supervised learning). C4.5

generates as output a decision tree that models the relationships among the features and

the classes (Quinlan, 1993).

A decision tree is a rooted tree in which the internal vertices are labeled with tests on

feature values and the leaf vertices are labeled with classes. The edges that leave an

internal vertex are labeled with the possible outcomes of the test associated with that

vertex. For example, a feature might be, “the number of words in the given phrase,” and a

Page 21: Arabic keyphrase Extraction


test on a feature value might be, “the number of words in the given phrase is less than

two,” which can have the outcomes “true” or “false”. A case is classified by beginning at

the root of the tree and following a path to a leaf in the tree, based on the values of the

features of the case. The label on the leaf is the predicted class for the given case.

The documents have been converted into sets of feature vectors by first making a list of all phrases of one, two, or three consecutive non-stop words that appear in a given document, with no intervening punctuation.

The Iterated Lovins stemmer has been used to find the stemmed form of each of these phrases. For each unique stemmed phrase, we generated a feature vector, as described in Table 3.1.

Table 3.1: A description of the feature vectors used by C4.5.

Page 22: Arabic keyphrase Extraction


C4.5 has access to nine features (features 3 to 11) when building a decision tree. The

leaves of the tree predict class (feature 12). When a decision tree predicts that the class of

a vector is 1, then the phrase whole_phrase is a keyphrase, according to the tree. This

phrase is suitable for output for a human reader. We used the stemmed form of the

phrase, stemmed_phrase, for evaluating the performance of the tree.

Table 3.2 shows the number of feature vectors that were generated for each corpus. The

large majority of these vectors were negative examples of keyphrases (class 0). In a real-world application, the user would want to specify the desired number of output keyphrases for a given document. However, a standard decision tree does not let the user control the number of feature vectors that are classified as belonging to class 1. Therefore,

we ran C4.5 with the -p option, which generates soft-threshold decision trees (Carter and

Catlett, 1987; Quinlan, 1987, 1990, 1993). Soft-threshold decision trees can generate a

probability estimate for the class of each vector. For a given document, if the user

specifies that K keyphrases are desired, then we select the K vectors that have the highest

estimated probability of being in class 1.
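The final selection step described above, taking the K candidates with the highest estimated probability of being in class 1, can be sketched directly (the function name is ours):

```python
def top_k_keyphrases(probabilities, k):
    """probabilities: dict mapping phrase -> estimated P(class = 1).
    Returns the k phrases with the highest probability."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k]

probs = {"neural networks": 0.9, "the": 0.1, "machine learning": 0.8}
print(top_k_keyphrases(probs, 2))  # ['neural networks', 'machine learning']
```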

Table 3.2

In addition to the -p option, we also used -c100 and -m1 (Quinlan, 1993). These two

options maximize the bushiness of the trees. In our preliminary experiments, we found

that these parameter settings appear to work well when used in conjunction with

bagging. Bagging involves generating many different decision trees and allowing them to

vote on the classification of each example (Breiman, 1996a, 1996b; Quinlan, 1996). In

general, decision tree induction algorithms have low bias but high variance. Bagging

multiple trees tends to improve performance by reducing variance. Bagging appears to

have relatively little impact on bias.

Because we used soft-threshold decision trees, we combined their probability estimates

by averaging them, instead of voting. In preliminary experiments with the training

documents, we obtained good results by bagging 50 decision trees. Adding more trees

had no significant effect. The standard approach to bagging is to randomly sample the


training data, using sampling with replacement (Breiman, 1996a, 1996b; Quinlan, 1996).

In preliminary experiments with the training data, we achieved good performance by

training each of the 50 decision trees with a random sample of 1% of the training data.

The standard approach to bagging is to ignore the class when sampling, so the distribution

of classes in the sample tends to correspond to the distribution in the training data as a

whole. In Table 3.2, we see that the positive examples constitute only 0.2% to 2.4% of the

total number of examples. To compensate for this, we modified the random sampling

procedure so that 50% of the sampled examples were in class 0 and the other 50% were

in class 1. This appeared to improve performance in preliminary experiments on the

training data. This strategy is called stratified sampling (Deming, 1978; Buntine, 1989;

Catlett, 1991; Kubat et al., 1998). Kubat et al. (1998) found that stratified sampling

significantly improved the performance of C4.5 on highly skewed data, but Catlett (1991)

reported mixed results.
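The ensemble procedure described above (stratified sampling, bagging, and averaging of soft probability estimates) can be sketched as follows. This is a minimal illustration only: a one-feature decision stump with made-up probability outputs stands in for a soft-threshold C4.5 tree, and the data, class proportions, and candidate phrases are invented for the example.

```python
import random

def stratified_sample(examples, size):
    # Stratified sampling: draw half the sample from each class, with
    # replacement, instead of mirroring the highly skewed class distribution.
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    half = size // 2
    return ([random.choice(pos) for _ in range(half)] +
            [random.choice(neg) for _ in range(size - half)])

def train_stump(sample):
    # Stand-in for a soft-threshold C4.5 tree: a one-feature decision stump
    # that outputs a probability estimate (0.9 / 0.1) rather than a hard vote.
    # It assumes the positive class lies above the threshold.
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    def predict_proba(x):
        return 0.9 if x > threshold else 0.1
    return predict_proba

def bagged_proba(stumps, x):
    # Soft-threshold trees are combined by averaging their probability
    # estimates instead of voting.
    return sum(stump(x) for stump in stumps) / len(stumps)

random.seed(0)
# Toy data: one feature per candidate phrase; class 1 (keyphrase) is rare.
data = ([(random.uniform(0.5, 1.0), 1) for _ in range(20)] +
        [(random.uniform(0.0, 0.5), 0) for _ in range(980)])
stumps = [train_stump(stratified_sample(data, 50)) for _ in range(50)]

# If the user requests K keyphrases, select the K candidates with the
# highest averaged probability of being in class 1.
candidates = {"neural networks": 0.9, "machine learning": 0.8, "the": 0.1}
K = 2
top = sorted(candidates, key=lambda p: bagged_proba(stumps, candidates[p]),
             reverse=True)[:K]
```

The two mechanisms from the text appear directly: `stratified_sample` forces a 50/50 class balance in each tree's training sample, and `bagged_proba` averages the soft probability estimates of the 50 models.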

Boosting is another popular technique for combining multiple decision trees (Freund and

Schapire, 1996; Quinlan, 1996; Maclin and Opitz, 1997). We chose to use bagging instead

of boosting, because the modifications to bagging that we use here (averaging soft-

threshold decision trees and stratified sampling) are simpler to apply to the bagging

algorithm than to the more complicated boosting algorithm. We believe that analogous

modifications would be required for boosting to perform well on this task.

3.2.2 GenEx (Genitor and Extractor)

GenEx has two components, the Genitor genetic algorithm

(Whitley, 1989) and the Extractor keyphrase extraction algorithm (Turney, 1997, 1999).

Extractor takes a document as input and produces a list of keyphrases as output. Extractor

has twelve parameters that determine how it processes the input text. In GenEx, the

parameters of Extractor are tuned by the Genitor genetic algorithm (Whitley, 1989), to

maximize performance (fitness) on training data. Genitor is used to tune Extractor, but

Genitor is no longer needed once the training process is complete. When we know the

best parameter values, we can discard Genitor. Thus the learning system is called GenEx

(Genitor plus Extractor) and the trained system is called Extractor (GenEx minus Genitor).

The GenEx algorithm was originally used to reduce the number of negative training examples.


3.2.2.1 Extractor

What follows is a conceptual description of the Extractor algorithm. For clarity, we describe Extractor at an abstract level that ignores efficiency considerations. That is, the actual Extractor software is essentially an efficient implementation of the following algorithm.

In the following, the twelve parameters appear in small capitals (see Table 3.3 for a list of the parameters).

There are ten steps to the Extractor algorithm:

1. Find Single Stems: Make a list of all of the words in the input text. Drop words with less

than three characters. Drop stop words, using a given stop word list. Convert all remaining

words to lower case. Stem the words by truncating them at STEM_LENGTH characters.

The advantages of this simple form of stemming (stemming by truncation) are speed and

flexibility. Stemming by truncation is much faster than either the Lovins (1968) or Porter

(1980) stemming algorithms. The aggressiveness of the stemming can be adjusted by

changing STEM_LENGTH. This gives Genitor control over the level of aggressiveness.

2. Score Single Stems: For each unique stem, count how often the stem appears in the

text and note when it first appears. If the stem “evolut” first appears in the word

“Evolution”, and “Evolution” first appears as the tenth word in the text, then the first

appearance of “evolut” is said to be in position 10. Assign a score to each stem. The score

is the number of times the stem appears in the text, multiplied by a factor. If the stem first

appears before FIRST_LOW_THRESH, then multiply the frequency by FIRST_LOW_FACTOR.

If the stem first appears after FIRST_HIGH_THRESH, then multiply the frequency by

FIRST_HIGH_FACTOR. Typically FIRST_LOW_FACTOR is greater than one and

FIRST_HIGH_FACTOR is less than one. Thus, early, frequent stems receive a high score and

late, rare stems receive a low score. This gives Genitor control over the weight of early

occurrence versus the weight of frequency.

3. Select Top Single Stems: Rank the stems in order of decreasing score and make a list of

the top NUM_WORKING single stems. Cutting the list at NUM_WORKING, as opposed to

allowing the list to have an arbitrary length, improves the efficiency of Extractor. It also

acts as a filter for eliminating lower quality stems.

4. Find Stem Phrases: Make a list of all phrases in the input text. A phrase is defined as a

sequence of one, two, or three words that appear consecutively in the text, with no

intervening stop words or punctuation. Stem each phrase by truncating each word in the

phrase at STEM_LENGTH characters. In our corpora, phrases of four or more words are

relatively rare. Therefore Extractor only considers phrases of one, two, or three words.


5. Score Stem Phrases: For each stem phrase, count how often the stem phrase appears in

the text and note when it first appears. Assign a score to each phrase, exactly as in step 2,

using the parameters FIRST_LOW_FACTOR, FIRST_LOW_THRESH, FIRST_HIGH_FACTOR,

and FIRST_HIGH_THRESH. Then make an adjustment to each score, based on the number

of stems in the phrase. If there is only one stem in the phrase, do nothing. If there are two

stems in the phrase, multiply the score by FACTOR_TWO_ONE. If there are three stems in

the phrase, multiply the score by FACTOR_THREE_ONE. Typically FACTOR_TWO_ONE and

FACTOR_THREE_ONE are greater than one, so this adjustment will increase the score of

longer phrases. A phrase of two or three stems is necessarily never more frequent than

the most frequent single stem contained in the phrase. The factors FACTOR_TWO_ONE

and FACTOR_THREE_ONE are designed to boost the score of longer phrases, to

compensate for the fact that longer phrases are expected to otherwise have lower scores

than shorter phrases.

6. Expand Single Stems: For each stem in the list of the top NUM_WORKING single stems,

find the highest scoring stem phrase of one, two, or three stems that contains the given

single stem. The result is a list of NUM_WORKING stem phrases. Keep this list ordered by

the scores calculated in step 2.

Now that the single stems have been expanded to stem phrases, we no longer need the

scores that were calculated in step 5. That is, the score for a stem phrase (step 5) is now

replaced by the score for its corresponding single stem (step 2). The reason is that the

adjustments to the score that were introduced in step 5 are useful for expanding the

single stems to stem phrases, but they are not useful for comparing or ranking stem

phrases.

7. Drop Duplicates: The list of the top NUM_WORKING stem phrases may contain

duplicates. For example, two single stems may expand to the same two-word stem

phrase. Delete duplicates from the ranked list of NUM_WORKING stem phrases,

preserving the highest ranked phrase. For example, suppose that the stem “evolu” (e.g.,

“evolution” truncated at five characters) appears in the fifth position in the list of the top

NUM_WORKING single stems and “psych” (e.g., “psychology” truncated at five characters)

appears in the tenth position. When the single stems are expanded to stem phrases, we

might find that “evolu psych” (e.g., “evolutionary psychology” truncated at five

characters) appears in the fifth and tenth positions in the list of stem phrases. In this case,

we delete the phrase in the tenth position. If there are duplicates, then the list now has

fewer than NUM_WORKING stem phrases.

8. Add Suffixes: For each of the remaining stem phrases, find the most frequent

corresponding whole phrase in the input text. For example, if “evolutionary psychology”

appears ten times in the text and “evolutionary psychologist” appears three times, then


“evolutionary psychology” is the more frequent corresponding whole phrase for the stem

phrase “evolu psych”. When counting the frequency of whole phrases, if a phrase has an

ending that indicates a possible adjective, then the frequency for that whole phrase is set

to zero. An ending such as “al”, “ic”, “ible”, etc., indicates a possible adjective. Adjectives

in the middle of a phrase (for example, the second word in a three-word phrase) are

acceptable; only phrases that end in adjectives are penalized. Also, if a phrase contains a

verb, the frequency for that phrase is set to zero. To check for verbs, we use a list of

common verbs. A word that might be either a noun or a verb is included in this list only

when it is much more common for the word to appear as a verb than as a noun. For

example, suppose the input text contains “manage”, “managerial”, and “management”. If

STEM_LENGTH is, say, five, the stem “manag” will be expanded to “management” (a

noun), because the frequency of “managerial” will be set to zero (because it is an

adjective, ending in “al”) and the frequency of “manage” will be set to zero (because it is a

verb, appearing in the list of common verbs). Although “manage” and “managerial” would

not be output, their presence in the input text helps to boost the score of the stem

“manag” (as measured in step 2), and thereby increase the likelihood that “management”

will be output.

9. Add Capitals: For each of the whole phrases (phrases with suffixes added), find the best

capitalization, where best is defined as follows. For each word in a phrase, find the

capitalization with the least number of capitals. For a one-word phrase, this is the best

capitalization.

For a two-word or three-word phrase, this is the best capitalization, unless the

capitalization is inconsistent. The capitalization is said to be inconsistent when one of the

words has the capitalization pattern of a proper noun but another of the words does not

appear to be a proper noun (e.g., “Turing test”). When the capitalization is inconsistent,

see whether it can be made consistent by using the capitalization with the second lowest

number of capitals (e.g., “Turing Test”). If it cannot be made consistent, use the

inconsistent capitalization. If it can be made consistent, use the consistent capitalization.

For example, given the phrase “psychological association”, the word “association” might

appear in the text only as “Association”, whereas the word “psychological” might appear

in the text as “PSYCHOLOGICAL”, “Psychological”, and “psychological”. Using the least

number of capitals, we get “psychological Association”, which is inconsistent. However, it

can be made consistent, as “Psychological Association”.

10. Final Output: We now have an ordered list of mixed-case (upper and lower case, if

appropriate) phrases with suffixes added. The list is ordered by the scores calculated in

step 2. That is, the score of each whole phrase is based on the score of the highest scoring

single stem that appears in the phrase. The length of the list is at most NUM_WORKING,


and is likely less, due to step 7. We now form the final output list, which will have at most

NUM_PHRASES phrases. We go through the list of phrases in order, starting with the top-

ranked phrase, and output each phrase that passes the following tests, until either

NUM_PHRASES phrases have been output or we reach the end of the list. The tests are (1)

the phrase should not have the capitalization of a proper noun, unless the flag

SUPPRESS_PROPER is set to 0 (if 0 then allow proper nouns; if 1 then suppress proper

nouns); (2) the phrase should not have an ending that indicates a possible adjective; (3)

the phrase should be longer than MIN_LENGTH_LOW_RANK, where the length is

measured by the ratio of the number of characters in the candidate phrase to the number

of characters in the average phrase, where the average is calculated for all phrases in the

input text that consist of one to three consecutive non-stop words; (4) if the phrase is

shorter than MIN_LENGTH_LOW_RANK, it may still be acceptable, if its rank in the list of

candidate phrases is better than (closer to the top of the list than)

MIN_RANK_LOW_LENGTH; (5) if the phrase fails both tests (3) and (4), it may still be

acceptable, if its capitalization pattern indicates that it is probably an abbreviation; (6) the

phrase should not contain any words that are most commonly used as verbs; (7) the

phrase should not match any phrases in a given list of stop phrases (where “match”

means equal strings, ignoring case, but including suffixes).

That is, a phrase must pass tests (1), (2), (6), (7), and at least one of tests (3), (4), and (5).

Although our experimental procedure does not consider capitalization or suffixes when

comparing machine-generated keyphrases to human-generated keyphrases, steps 8 and 9

are still useful, because some of the screening tests in step 10 are based on capitalization

and suffixes. Of course, steps 8 and 9 are essential when the output is for human readers.
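Steps 1 and 2 of the Extractor algorithm (stemming by truncation, then frequency scoring weighted by position of first appearance) can be sketched as follows. The stop word list and the threshold/factor values are illustrative placeholders only; in GenEx these parameters are tuned by Genitor, and Table 3.3 gives their actual ranges.

```python
STOP_WORDS = {"the", "of", "in", "a", "is", "and"}   # stand-in stop list
STEM_LENGTH = 5                                       # tuned by Genitor in GenEx
FIRST_LOW_THRESH, FIRST_LOW_FACTOR = 40, 2.0          # illustrative values only
FIRST_HIGH_THRESH, FIRST_HIGH_FACTOR = 400, 0.5       # illustrative values only

def score_single_stems(text):
    # Step 1: drop short words and stop words, lower-case the rest, and
    # stem by truncating at STEM_LENGTH characters.
    words = [w.strip(".,").lower() for w in text.split()]
    freq, first_pos = {}, {}
    for pos, word in enumerate(words, start=1):
        if len(word) < 3 or word in STOP_WORDS:
            continue
        stem = word[:STEM_LENGTH]            # stemming by truncation
        freq[stem] = freq.get(stem, 0) + 1
        first_pos.setdefault(stem, pos)      # position of first appearance
    # Step 2: score = frequency, boosted for early stems, cut for late ones.
    scores = {}
    for stem, count in freq.items():
        factor = 1.0
        if first_pos[stem] < FIRST_LOW_THRESH:
            factor = FIRST_LOW_FACTOR        # early, frequent stems score high
        elif first_pos[stem] > FIRST_HIGH_THRESH:
            factor = FIRST_HIGH_FACTOR       # late, rare stems score low
        scores[stem] = count * factor
    return scores

scores = score_single_stems("Evolution of evolutionary psychology")
```

Here "Evolution" and "evolutionary" both truncate to the stem "evolu" (frequency 2, first position 1), matching the example given in step 2 above.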

3.2.2.2 Genitor

A genetic algorithm may be viewed as a method for optimizing a string of bits, using

techniques that are inspired by biological evolution. A genetic algorithm works with a set

of bit strings, called a population of individuals. The initial population is usually randomly

generated. New individuals (new bit strings) are created by randomly changing existing

individuals (this operation is called mutation) and by combining substrings from parents to

make new children (this operation is called crossover). Each individual is assigned a score

(called its fitness) based on some measure of the quality of the bit string, with respect to a

given task. Fitter individuals get to have more children than less fit individuals. As the

genetic algorithm runs, new individuals tend to be increasingly fit, up to some asymptote.

Genitor is a steady-state genetic algorithm (Whitley, 1989), in contrast to many other

genetic algorithms, such as Genesis (Grefenstette 1983, 1986), which are generational. A

generational genetic algorithm updates its entire population in one batch, resulting in a

sequence of distinct generations. A steady-state genetic algorithm updates its population


one individual at a time, resulting in a continuously changing population, with no distinct

generations. Typically a new individual replaces the least fit individual in the current

population. Whitley (1989) suggests that steady-state genetic algorithms tend to be more

aggressive (they have greater selective pressure) than generational genetic algorithms.
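The contrast between generational and steady-state updating can be made concrete with a small sketch. This is a generic steady-state genetic algorithm in the spirit of Genitor, not the Genitor implementation itself: the selection scheme, mutation rate, and "one-max" fitness function are illustrative assumptions.

```python
import random

def steady_state_ga(fitness, n_bits=16, pop_size=20, trials=500, seed=1):
    # Steady-state GA: one new individual per trial, which replaces the
    # least fit member, so the population changes continuously with no
    # distinct generations.
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(trials):
        # Rank-based selection: parents are drawn from the fitter half,
        # giving fitter individuals more children.
        ranked = sorted(pop, key=fitness, reverse=True)
        p1 = ranked[rng.randrange(pop_size // 2)]
        p2 = ranked[rng.randrange(pop_size // 2)]
        cut = rng.randrange(1, n_bits)          # one-point crossover
        child = p1[:cut] + p2[cut:]
        if rng.random() < 0.2:                  # mutation (rate is illustrative)
            i = rng.randrange(n_bits)
            child[i] ^= 1
        # Steady-state replacement: overwrite the least fit individual.
        worst = min(range(pop_size), key=lambda i: fitness(pop[i]))
        pop[worst] = child
    return max(pop, key=fitness)

# Toy fitness ("one-max": count the 1-bits). In GenEx the 72-bit string
# would instead encode the ten Extractor parameters, and fitness would be
# the penalized average precision on the training set.
best = steady_state_ga(fitness=sum)
```

A generational GA would instead produce `pop_size` children per iteration and replace the whole population in one batch; here only the single worst individual is replaced per trial, which is the greater selective pressure Whitley describes.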

3.2.2.3 GenEx

The parameters in Extractor are set using the standard machine learning paradigm of

supervised learning. The algorithm is tuned with a dataset, consisting of documents paired

with target lists of keyphrases. The dataset is divided into training and testing subsets. The

learning process involves adjusting the parameters to maximize the match between the

output of Extractor and the target keyphrase lists, using the training data. The success of

the learning process is measured by examining the match using the testing data.

We assume that the user sets the value of NUM_PHRASES, the desired number of

phrases, to a value between five and fifteen. We then set NUM_WORKING to . The

remaining ten parameters are set by Genitor. Genitor uses a binary string of 72 bits to

represent the ten parameters, as shown in Table 3.3. We run Genitor with a population

size of 50 for 1050 trials (these are default settings). Each trial consists of running

Extractor with the parameter settings specified in the given binary string, processing the

entire training set. The fitness measure for the binary string is based on the average

precision for the whole training set. The final output of Genitor is the highest scoring

binary string. Ties are broken by choosing the earlier string.

We first tried to use the average precision on the training set as the fitness measure, but

GenEx discovered that it could achieve high average precision by adjusting the parameters

so that less than NUM_PHRASES phrases were output. This is not desirable, so we

modified the fitness measure to penalize GenEx when less than NUM_PHRASES phrases

were output:

total_matches = total number of matches between GenEx and human (1)

total_machine_phrases = total number of phrases output by GenEx (2)

precision = total_matches / total_machine_phrases (3)

num_docs = number of documents in training set (4)

total_desired = num_docs × NUM_PHRASES (5)

penalty = (total_machine_phrases / total_desired)^2 (6)

fitness = precision × penalty (7)


The penalty factor varies between 0 and 1. It has no effect (i.e., it is 1) when the number

of phrases output by GenEx equals the desired number of phrases. The penalty grows

(i.e., it approaches 0) with the square of the gap between the desired number of phrases

and the actual number of phrases. Preliminary experiments on the training data

confirmed that this fitness measure led GenEx to find parameter values with high average

precision while ensuring that NUM_PHRASES phrases were output.
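Equations (1) through (7) can be collected into a single fitness function. The sketch below is a direct transcription of those equations; the concrete numbers in the usage example are invented to show the effect of the penalty.

```python
NUM_PHRASES = 10   # desired number of output phrases (assumed set by the user)

def genex_fitness(total_matches, total_machine_phrases, num_docs):
    # Equations (1)-(3): precision of the machine-generated phrases.
    precision = total_matches / total_machine_phrases
    # Equations (4)-(5): total number of phrases the user wanted.
    total_desired = num_docs * NUM_PHRASES
    # Equation (6): penalty factor in [0, 1]; it is 1 when the output count
    # equals the desired count and approaches 0 as the squared gap grows
    # (output <= desired is assumed, since Extractor emits at most
    # NUM_PHRASES phrases per document).
    penalty = (total_machine_phrases / total_desired) ** 2
    # Equation (7): fitness = precision x penalty.
    return precision * penalty

full_output = genex_fitness(20, 100, 10)   # all 100 desired phrases output
half_output = genex_fitness(15, 50, 10)    # higher precision, half the phrases
```

The example reproduces the failure mode described above: outputting only half the requested phrases yields a higher raw precision (0.3 vs. 0.2), but the penalty quarters its fitness, so the full-output setting still wins.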

Table 3.3: The twelve parameters of Extractor, with types and ranges.

Since STEM_LENGTH is modified by Genitor during the GenEx learning process, the fitness

measure used by Genitor is not based on stemming by truncation. If the fitness measure

were based on stemming by truncation, a change in STEM_LENGTH would change the

apparent fitness, even if the actual output keyphrase list remained constant. Therefore

fitness is measured with the Iterated Lovins stemmer.

We ran Genitor with a Selection Bias of 2.0 and a Mutation Rate of 0.2. These are the

default settings for Genitor. We used the Adaptive Mutation operator and the Reduced

Surrogate Crossover operator (Whitley, 1989). Adaptive Mutation determines the

appropriate level of mutation for a child according to the Hamming distance between its

two parents; the less the difference, the higher the mutation rate. Reduced Surrogate

Crossover first identifies all positions in which the parent strings differ. Crossover points

are only allowed to occur in these positions.


3.2.3 Sakhr

Sakhr is another remarkable effort in the keyphrase extraction field, but it is closed source, so there is no official information about how it works or which algorithms it uses to extract keyphrases. From our use of their online application (which is not linked anywhere on their website), we found that it takes a very long time to process a document, which suggests that it relies largely on a huge database for extracting keyphrases from the input documents.

3.2.4 Kea

Kea (Frank et al., 1999; Witten et al., 1999, 2000) is another remarkable effort in this area. It identifies candidate keyphrases in the same manner as Extractor. Kea then uses the Naïve

Bayes algorithm to classify the candidate phrases as keyphrases or not. In Kea, candidate

phrases are classified using only two features: (i) the TFxIDF, and (ii) the relative distance.

The TFxIDF (term frequency times inverse document frequency) method captures a

word's frequency in a single document compared to its rarity in the whole document

collection. It is used to assign a high value to a phrase that is relatively frequent in the

input document (TF component), yet relatively rare in other documents (IDF component).

The relative distance feature of a phrase in a given document is defined as the number of

words that precede the first occurrence of the phrase divided by the number of words in

the document. Kea uses the Naïve Bayes algorithm to calculate the probability of

membership in a class (the probability that the candidate phrase is a keyphrase). Kea

ranks each of the candidate phrases by the estimated probability that they belong to the

keyphrase class. If the user requests N phrases, then Kea gives the top N phrases with the

highest estimated probability as output.
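Kea's two features can be sketched as follows. The exact normalisation and smoothing used here are illustrative assumptions, not Kea's own implementation, and the document-frequency count is assumed to be given.

```python
import math

def kea_features(phrase, doc_words, doc_freq, num_docs):
    # doc_words: the document as a token list; doc_freq: number of training
    # documents containing the phrase (assumed precomputed elsewhere).
    target = phrase.lower().split()
    n, total = len(target), len(doc_words)
    positions = [i for i in range(total - n + 1)
                 if [w.lower() for w in doc_words[i:i + n]] == target]
    # Feature (i) TFxIDF: relatively frequent in this document (TF), yet
    # relatively rare in the collection (IDF). Smoothing is illustrative.
    tf = len(positions) / total
    idf = math.log2((num_docs + 1) / (doc_freq + 1))
    tf_idf = tf * idf
    # Feature (ii) relative distance: number of words preceding the first
    # occurrence, divided by the number of words in the document.
    first = positions[0] if positions else total
    distance = first / total
    return tf_idf, distance

words = "Machine learning makes machine learning research useful".split()
tf_idf, distance = kea_features("machine learning", words, doc_freq=2, num_docs=100)
```

From these two numbers per candidate, Naïve Bayes estimates the probability of the keyphrase class, and the top N candidates by probability are returned.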

3.2.5 Arabic Keyphrase Extraction Using Linguistic Knowledge and Machine Learning Techniques

This system was developed by El-Shishtawy and Al-Sammak. A supervised learning

technique for extracting keyphrases of Arabic documents is presented. The extractor is

supplied with linguistic knowledge to enhance its efficiency instead of relying only on

statistical information such as term frequency and distance. During analysis, an annotated

Arabic corpus is used to extract the required lexical features of the document words. The

knowledge also includes syntactic rules based on part of speech tags and allowed word

sequences to extract the candidate keyphrases. In this work, the abstract form of Arabic

words is used instead of their stem form to represent the candidate terms. The abstract

form hides most of the inflections found in Arabic words. The paper introduces new

features of keyphrases based on linguistic knowledge, to capture titles and subtitles of a

document. A simple ANOVA test is used to evaluate the validity of selected features.


Then, the learning model is built using LDA (Linear Discriminant Analysis) and training documents.

The automatic keyphrase extraction is treated as a supervised machine learning task. Two

important issues are defined: how to define the candidate keyphrase terms, and what

features of these terms are considered discriminative, i.e., how to represent the data, and

consequently what is given as input to the learning algorithm. Our motivation is that

adding linguistic knowledge (such as lexical features and syntactic rules) to the extraction

process, rather than relying only on statistics, may obtain better results.

Thus, the current work is based on combining the linguistic knowledge and the machine

learning techniques to extract keyphrases from Arabic documents with reasonable

accuracy. The Linguistic knowledge will play important roles in different stages of our

proposed system:

1. Analysis stage, where the document is tokenized into sentences and words. Each word

is analyzed using an annotated Arabic corpus to extract its POS tags, category, and

abstract form.

2. Candidate keyphrase extraction stage, where a set of syntactic rules is used to

determine the allowed sequence of words of the generated n-gram terms according to

their POS tags and categories.

3. Features Vector calculation stage, where some of the selected features of each

candidate phrase are linguistic-based, in addition to the statistical-based features.
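The candidate keyphrase extraction stage (stage 2) can be sketched as rule-based n-gram filtering over POS-tagged tokens. The POS tags would come from the annotated Arabic corpus used in stage 1; the allowed tag patterns below are illustrative placeholders, not the authors' actual rule set.

```python
# Illustrative syntactic rules: POS tag sequences allowed to form a
# candidate keyphrase (the real system's rules are corpus-derived).
ALLOWED_PATTERNS = {
    ("NOUN",),
    ("NOUN", "NOUN"),
    ("NOUN", "ADJ"),
    ("NOUN", "NOUN", "ADJ"),
    ("NOUN", "ADJ", "ADJ"),
}

def candidate_phrases(tagged_words, max_len=3):
    # Generate all n-grams (n = 1..max_len) and keep only those whose POS
    # tag sequence matches an allowed pattern.
    out = []
    for i in range(len(tagged_words)):
        for n in range(1, max_len + 1):
            gram = tagged_words[i:i + n]
            if len(gram) == n and tuple(tag for _, tag in gram) in ALLOWED_PATTERNS:
                out.append(" ".join(word for word, _ in gram))
    return out

# (word, POS) pairs for a short Arabic fragment; tags are assumed here.
tagged = [("استخراج", "NOUN"), ("الكلمات", "NOUN"),
          ("المفتاحية", "ADJ"), ("يتطلب", "VERB")]
cands = candidate_phrases(tagged)
```

Sequences containing a verb never match any pattern, so they are excluded from the candidate list, while noun and noun-adjective sequences pass through to the feature-vector calculation stage.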

Although the system was trained with a training set from the IT field, it gives good results in other domains such as politics. Its outputs are acceptable, but it has the drawback of depending on a corpus, which makes it impractical to convert into a web application, because the size of the corpus is very large compared to other systems such as KP-Miner, which is described next.

3.3 Unsupervised Machine Learning Techniques

In this part of the chapter we give an overview of the systems that use unsupervised machine learning techniques to extract candidate keyphrases from input documents. The only such system covered here is KP-Miner (El-Beltagy & Rafea, 2008).


3.3.1 KP-Miner

KP-Miner (El-Beltagy, 2006; El-Beltagy, 2009) is a system for the extraction of keyphrases from English and Arabic documents. The keyphrase extraction process in KP-Miner is an unsupervised one.

3.3.1.1 System Overview

KP-Miner is an unsupervised machine learning algorithm for extracting keyphrases. The system has the advantage of being configurable, as the rules and heuristics it adopts are related to the general nature of documents and keyphrases. This implies that the users of this system can use their understanding of the document(s) being input into the system to fine-tune it to their particular needs.

The work on KP-Miner was inspired by the nature of documents and keyphrases, especially the following points:

1. The number of keyphrases in any given document will usually be less than the number of single keywords. Effective keyphrase extraction is then dependent on the determination of an appropriate boosting factor for keyphrases. In this work, this boosting factor is related to the ratio of single to compound terms in each input document.

2. Without the use of linguistic features, the extraction of meaningful keyphrases is dependent on the repetition of these phrases within the document.

3. Using IDF information in phrase weight calculation would bias the extraction towards unseen phrases. This would be unfair when building a general rather than a domain-specific extractor, as the set of possible phrase combinations is much larger than what can be captured from a limited IDF training corpus.

4. The position of the first occurrence of any given phrase is significant in two ways. The first is related to the fact that the more important a term is, the more likely it is to appear ‘sooner’ in the document. The second is based on the observation that after a given threshold is passed in any given document, phrases occurring for the first time are highly unlikely to be keyphrases.

Keyphrase extraction in the KP-Miner system is a three-step process: candidate keyphrase selection, candidate keyphrase weight calculation, and finally keyphrase refinement. Each

of these steps is explained in the following sub-sections.


3.3.1.2 Candidate keyphrase selection

In KP-Miner, a set of rules is employed in order to elicit candidate keyphrases. As a phrase

will never be separated by punctuation marks within some given text and will rarely have

stop words within it, the first condition a sequence of words has to display in order to be

considered a candidate keyphrase, is that it not be separated by punctuation marks or stop words. A total of 187 common stopwords (the, then, in, above, etc.) are used in the

candidate keyphrase extraction step. After applying this first condition on any given

document, too many candidates will be generated; some of which will make no sense to a

human reader. To filter these out, two further conditions are applied. The first condition

states that a phrase has to have appeared at least n times in the document from which

keyphrases are to be extracted, in order to be considered a candidate keyphrase. This is

called the least allowable seen frequency (lasf) factor and in the English version of the

system, this is set to 3. However, if a document is short, n is decremented depending on

the length of the document. The second condition is related to the position where a

candidate keyphrase first appears within an input document. Through observation as well

as experimentation, it was found that in long documents, phrases occurring for the first

time after a given threshold are very rarely keyphrases. So a cutoff constant CutOff is

defined in terms of a number of words after which if a phrase appears for the first time, it

is filtered out and ignored. The initial prototype of the KP-Miner system (El-Beltagy, 2006) set this cutoff value to a constant (850). Further experimentation carried out in (El-Beltagy, 2009) revealed that an optimum value for this constant is 400. In the implementation of the KP-Miner system, the phrase extraction step described above is

carried out in two phases. In the first phase, words are scanned until either a punctuation

mark or a stop word is encountered. The scanned sequence of words and all possible n-

grams within the encountered sequence where n can vary from 1 to sequence length-1,

are stemmed and stored in both their original and stemmed forms. If the phrase (in its

stemmed or original form) or any of its sub-phrases, has been seen before, then the count

of the previously seen term is incremented by one, otherwise the previously unseen term

is assigned a count of one. Very weak stemming is performed in this step using only the

first step of the Porter stemmer (Porter, 1980). In the second phase, the document is

scanned again for the longest possible sequence that fulfills the conditions mentioned

above. This is then considered as a candidate keyphrase. Unlike most of the other

keyphrase extraction systems, the devised algorithm places no limit on the length of

keyphrases, but it was found that extracted keyphrases rarely exceed three terms.
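The two-phase candidate selection described above can be sketched as follows. This is a simplified sketch: the stop word list is a tiny stand-in for the 187-word list, the weak Porter step-1 stemming is omitted, and the rule that lowers the lasf value for short documents is not modelled.

```python
STOP_WORDS = {"the", "of", "in", "a", "is", "and", "to"}   # 187 in the real system

def kpminer_candidates(text, lasf=3, cutoff=400):
    # Phase 1: break the text at stop words and punctuation, counting every
    # n-gram inside each uninterrupted run of words (no limit on phrase length).
    tokens = text.lower().replace(".", " . ").replace(",", " , ").split()
    counts, first_seen = {}, {}
    run, word_pos = [], 0
    for tok in tokens:
        if tok in STOP_WORDS or tok in {".", ","}:
            run = []               # punctuation or a stop word ends the run
            continue
        word_pos += 1
        run.append(tok)
        for n in range(1, len(run) + 1):      # every n-gram ending here
            phrase = " ".join(run[-n:])
            counts[phrase] = counts.get(phrase, 0) + 1
            first_seen.setdefault(phrase, word_pos)
    # Phase 2: keep phrases seen at least `lasf` times whose first
    # occurrence falls before the cutoff position.
    return [p for p in counts
            if counts[p] >= lasf and first_seen[p] <= cutoff]

text = "Data mining is fun. Data mining and data mining tools help data mining."
cands = kpminer_candidates(text)
```

With the English defaults (lasf = 3, cutoff = 400), only phrases repeated at least three times and first seen early enough survive the filter; in the toy text above, "data mining" qualifies while "mining tools" does not.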


3.3.1.3 Candidate keyphrase weight calculation

Single key features obtained from documents by models such as TF-IDF (Salton and

Buckley, 1988) have already been shown to be representative of documents from which

they’ve been extracted as demonstrated by their wide and successful use in clustering and

classification tasks. However, when applied to the task of keyphrase extraction, these

same models performed very poorly (Turney, 1999). By looking at almost any document, it

can be observed that the occurrence of phrases is much less frequent than the occurrence

of single terms within the same document.

So it can be concluded that one of the reasons that TF-IDF performs poorly on its own

when applied to the task of keyphrase extraction, is that it does not take this fact into

consideration, which results in a bias towards single words as they occur in larger

numbers. So, a boosting factor is needed for compound terms in order to balance this bias

towards single terms. In this work for each input document d from which keyphrases are

to be extracted, a boosting factor Bd is calculated as follows:

Bd = |Nd| / (|Pd| × µ), and if Bd > s then Bd = s

Here |Nd| is the number of all candidate terms in document d, |Pd| is the number of

candidate terms whose length exceeds one in document d and µ and s are weight

adjustment constants.

The values used by the implemented system are 3 for s and 2.3 for µ. To calculate the

weights of document terms, the TF-IDF model in conjunction with the introduced boosting

factor, is used. However, another thing to consider when applying TF-IDF for a general

application rather than a corpus specific one, is that keyphrase combinations do not occur

as frequently within a document set as do single terms. In other words, while it is possible

to collect frequency information for use by a general single keyword extractor from a

moderately large set of random documents, the same is not true for keyphrase

information. There are two possible approaches to address this observation. In the first, a

very large corpus of a varied nature can be used to collect keyphrase related frequency

information. In the second, which is adopted in this work, any encountered phrase is

considered to have appeared only once in the corpus. This means that for compound

phrases, frequency within a document as well as the boosting factor are really what

determine its weight as the idf value for all compound phrases will be a constant c

determined by the size of the corpus used to build frequency information for single terms.

If the position rules described in (El-Beltagy, 2009) are also employed, then the position

factor is also used in the calculation for the term weights. In summary, the following


equation is used to calculate the weight of candidate keyphrases whether single or

compound:

wij = tfij * idf * Bi * Pf

Where:

wij = weight of term tj in Document Di

tfi j = frequency of term tj in Document Di

idf = log2 N/n where N is the number of documents in the collection and n is number of

documents where term tj occurs at least once. If the term is compound, n is set to 1.

Bi = the boosting factor associated with document Di

Pf= the term position associated factor.

If position rules are not used this is set to 1.
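As a rough illustration, the boosting factor and the weight equation above can be sketched in Python (a minimal sketch, not the system's actual code; the constants s = 3 and µ = 2.3 are the values stated above):

```python
import math

def boosting_factor(num_candidates, num_compound, mu=2.3, s=3.0):
    """B_d = |N_d| / (|P_d| * mu), capped at s."""
    b = num_candidates / (num_compound * mu)
    return min(b, s)

def term_weight(tf, n_docs, doc_freq, boost, pos_factor=1.0, compound=False):
    """w_ij = tf_ij * idf * B_i * Pf; for compound phrases n is fixed to 1."""
    if compound:
        doc_freq = 1          # idf becomes the constant c mentioned above
    idf = math.log2(n_docs / doc_freq)
    return tf * idf * boost * pos_factor
```

For example, a document with 200 candidate terms of which 40 are compound gets Bd = min(200 / (40 × 2.3), 3) ≈ 2.17.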

3.3.1.4 Final Candidate Phrase List Refinement

The KP-Miner system allows the user to specify a number n of keyphrases s/he wants

back and uses the sorted list to return the top n keyphrases requested by the user. The

default number of n is five. As stated in step one, when generating candidate keyphrases,

the longest possible sequences of words that are uninterrupted by possible phrase terminators are sought and stored, and so are any sub-phrases contained within those sequences, provided that they appear somewhere in the text on their own. For example, if

the phrase ‘excess body weight’ is encountered five times in a document, the phrase itself

will be stored along with a count of five. If the sub-phrase 'body weight' is also encountered on its own, then it will also be stored along with the number of times it

appeared in the text including the number of times it appeared as part of the phrase

‘excess body weight’. This means that an overlap between the count of two or more

phrases can exist. Aiming to eliminate this overlap in counting early on can contribute to

the dominance of possibly noisy phrases or to overlooking potential keyphrases that are

encountered as sub-phrases. However, once the weight calculation step has been

performed and a clear picture of which phrases are most likely to be key ones is obtained,

this overlap can be addressed through refinement. To refine results in the KP-Miner

system, the top n keys are scanned to see if any of them is a sub-phrase of another. If any

of them are, then its count is decremented by the frequency of the term of which it is a

part. After this step is completed, weights are recalculated and a final list of phrases

sorted by weight, is produced. The reason the top n keys, rather than all candidates, are used in this step is so that lower-weighted keywords do not affect the outcome of the final

keyphrase list. It is important to note that the refinement step is an optional one, but


experiments have shown that in the English version of the system, omitting this step leads to the production of keyphrase lists that match better with author-assigned keywords. In

(El-Beltagy, 2009) the authors suggested that employing this step leads to the extraction

of higher quality keyphrases.
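The refinement procedure above can be sketched as follows (a simplified illustration; the substring test as a sub-phrase check and the helper names are our own assumptions, and `weight_fn` stands in for the full weight equation):

```python
def refine(top_n, counts, weight_fn):
    """Decrement the count of every top-n phrase that is a sub-phrase of
    another top-n phrase by the containing phrase's frequency, then
    re-weight and re-sort."""
    adjusted = dict(counts)
    for sub in top_n:
        for phrase in top_n:
            if sub != phrase and sub in phrase:   # substring as sub-phrase test
                adjusted[sub] = max(adjusted[sub] - counts[phrase], 0)
    weights = {p: weight_fn(p, adjusted[p]) for p in top_n}
    return sorted(top_n, key=lambda p: weights[p], reverse=True)
```

With the earlier example, 'body weight' counted 7 times in total and 'excess body weight' 5 times, refinement leaves 'body weight' with a count of 2.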

3.3.1.5 Evaluation and Drawbacks

Despite the fact that KP-Miner was designed as a general-purpose keyphrase extraction system, and despite its simplicity and the fact that it requires no training to function, it performed relatively well when carrying out the task of keyphrase extraction from scientific documents. However, the keyphrases generated by KP-Miner are not accurate in all cases: since the system depends on statistical features only, its output can include wrong candidates such as verbs and stop-words.


Chapter 4 Proposed System


4.1 Introduction

Having discussed in the previous chapter the importance of data mining in general and keyphrase extraction in particular, we now present the components of our system: how it works, which existing tools we used as-is, what we added, and how we integrated all these components together to build our Arabic Keyphrase Extraction System.

In this work, automatic keyphrase extraction processes the original text, adapts it for our modules, and is treated as a supervised machine learning task. Two important issues are defined: how to define the candidate keyphrase terms, and which features of these terms are considered discriminative, i.e., how to represent the data, and consequently what is given as input to the learning algorithm. Our motivation is that adding linguistic knowledge (such as lexical features and syntactic rules) to the extraction process, rather than relying only on statistics, may obtain better results.

Thus, the current work is based on combining linguistic knowledge with machine learning techniques to extract keyphrases from Arabic documents with reasonable accuracy. It is domain-unspecific, as it can run in any environment without causing any problems. Linguistic knowledge plays important roles in different stages of our proposed system:

1. Analysis and pre-processing stage, where the document is corrected by removing any non-Arabic characters and removing diacritics, and the text is tokenized into words and sentences. There is also a sub-stage called the Segmenter, where the document gets appropriate preprocessing to be adapted to other stages, such as POS tagging, to get more accurate results.

2. POS tagging stage, where every word in the text is assigned its proper part of speech (noun, verb, adjective, etc.) using the Stanford POS tagger, to be used in further processes.

3. Lemmatization stage, where every word gets its abstract form without any additional prefixes or suffixes. This is done using the AraMorph module, which we will discuss in detail.

4. Candidate keyphrase extraction stage, where a set of syntactic rules is used to determine the allowed sequences of words of the generated n-gram terms according to their POS tags and categories.

5. Feature extraction stage, during which we calculate formulas and statistics for words, sentences, and the whole document to determine the weight of every candidate keyphrase.


6. Machine learning stage, where the training process happens: weights are assigned to all the features calculated in the last step, and a formula is calculated to determine whether a candidate is a keyphrase or not.

In the next sections of this chapter we will talk in detail about all of these six steps and show how we implemented or used them.

4.2 Pre-Processing Phase

This is the first phase in our project. Its main tasks are correcting the input document and then starting the tokenization process.

The correction process starts by finding the errors in the input document and then correcting them. Before describing how we correct these errors, we should mention some of them. The errors come from using non-Arabic characters in the document, such as using the Latin question mark '?' in an Arabic document; we correct it to the Arabic question mark by checking the Unicode of the most common characters that appear in Arabic documents while being non-Arabic ones (Table 4.1 shows the most common errors). Why should we do that? This correction process helps a lot in getting more accurate results in the next phases, which totally depend on it; errors in this step would cause further errors in the next phases.

Non-Arabic Letters          Arabic Letters
Unicode    Character        Unicode    Character
\u003f     ?                \u061f     ؟
\u002c     ,                \u060c     ،
\u0021     !                \u0021     !

Table 4.1

During this phase we also remove similar punctuation characters if they come one after another, i.e. if we find "؟؟" or "!!!!؟؟؟؟" we keep only one of them, because the repetition does not affect our processing of the input document.

The next snapshot shows an input string with non-Arabic characters and the output after replacing these characters with Arabic ones.
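A minimal sketch of this correction step (assuming the Table 4.1 mapping; the collapsing rule here only merges runs of the same mark):

```python
import re

# Mapping from Table 4.1: Latin punctuation -> Arabic counterpart.
TO_ARABIC = {'\u003f': '\u061f',   # ? -> ؟
             '\u002c': '\u060c'}   # , -> ،

def correct_punctuation(text):
    for latin, arabic in TO_ARABIC.items():
        text = text.replace(latin, arabic)
    # Collapse runs of the same punctuation mark, e.g. "؟؟" -> "؟".
    return re.sub(r'([\u061f\u060c!])\1+', r'\1', text)
```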


After we are done with replacing non-Arabic characters with Arabic ones, we go to the next step, which is removing diacritics "التشكيل" from the input document. Although diacritics could give more accurate results, they are rarely used nowadays, and they conflict with our other modules (getting the lemma of every word), so we remove them to avoid these problems. (The next table shows the diacritics we remove and their Unicode values.)

Diacritic            Unicode
Fathatan  ( ً )       \u064B
Dammatan  ( ٌ )       \u064C
Kasratan  ( ٍ )       \u064D
Fatha     ( َ )       \u064E
Damma     ( ُ )       \u064F
Kasra     ( ِ )       \u0650
Shadda    ( ّ )       \u0651
Sukun     ( ْ )       \u0652

Table 4.2

The next snapshot shows an input string with diacritics and the output after removing them.

Figure 4.1

After removing diacritics, and to make sure no further errors arise, we reduce the spaces between words to a single space between every two words, and then we remove spaces from the start and the end of the input document.
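Both steps can be sketched together (a minimal illustration of removing the U+064B–U+0652 marks of Table 4.2 and normalizing spaces):

```python
import re

# The eight diacritic marks of Table 4.2 (U+064B .. U+0652).
DIACRITICS = re.compile('[\u064B-\u0652]')

def strip_diacritics(text):
    text = DIACRITICS.sub('', text)           # drop tashkeel marks
    return re.sub(r'\s+', ' ', text).strip()  # one space between words, none at the ends
```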


([^\u002d\u003a\u003f\u061f\u0021\u002e\u060c\u061B\u0686\u0698\u06AF\u0621-\u0636\u0637-\u0643\u0644\u0645-\u0648\u0649-\u064B-\u064E\u064F\u0650\u0651\u0652]+)

Now, after we have replaced non-Arabic characters with Arabic ones and removed the diacritics, we move to one of the most important tasks in this phase, which is tokenization. The tokenization process is based on segmenting the input document at two levels: the first is the segmentation of the input document into sentences, and the second is the segmentation of these sentences into words.

In the first level, the segmentation of the input document into sentences, we segment the input document into its constituent sentences based on the Arabic phrase delimiter characters such as comma, semicolon, colon, hyphen, and dot. To do this we used the regex shown above (figure 4.2), which splits the input document into sentences based on the Arabic phrase delimiter characters.

Figure 4.2
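The sentence-level split can be sketched as follows (a simplified stand-in for the regex of figure 4.2, using an assumed subset of the delimiter set):

```python
import re

# An assumed subset of the Arabic phrase delimiters named in the text:
# dot, comma (Latin and Arabic), semicolons, colon, question marks,
# exclamation mark, and hyphen.
DELIMITERS = re.compile(r'[.\u060c,;\u061b:\u061f?!\u002d]+')

def split_sentences(text):
    # Split on delimiter runs and drop empty fragments.
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]
```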

This process is useful for calculating part of the feature vector of the candidate terms, such as Normalized Sentence Location (NSL), Normalized Phrase Location (NPL), Normalized Phrase Length (NPLen), and Sentence Contains Verb (SCV). We will explain these features in more detail when we discuss the machine learning phase. The next snapshot shows the output after segmenting the input into sentences.

Figure 4.3

Now that we have segmented the input document into sentences based on the Arabic phrase delimiters, we move to the second level, which is the segmentation of these sentences into words; each sentence is segmented into its constituent words based on the criterion that words are usually separated by spaces. In this phase we created a method that checks every character in the word: if the first character of the word isn't an Arabic letter, we discard it; otherwise we take it into account. The next snapshot shows the output after segmenting the sentences into words based on spaces.

Figure 4.4
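A minimal sketch of this word-level split (the Arabic-letter range used here is an approximation of the system's check):

```python
def is_arabic_letter(ch):
    # Core Arabic letter block; an approximation of the system's isArabic check.
    return '\u0621' <= ch <= '\u064a'

def split_words(sentence):
    # Keep only tokens whose first character is an Arabic letter.
    return [tok for tok in sentence.split() if is_arabic_letter(tok[0])]
```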

During our work on the preprocessing phase we used some helper methods that allowed us to make more accurate corrections and segmentations: a method called isArabic, which checks whether the input character is Arabic or not; a method that checks for punctuation characters; and a method that checks whether the input character is a stop punctuation or not, which was helpful in the segmentation process mentioned earlier.

The last thing we did in our preprocessing phase was a set of methods that help in the upcoming phases: a method called isNext, which checks whether there is input left to read, and another called getNextSentence, which returns the next sentence to the caller object.
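The iteration helpers described above might look like this (a hypothetical sketch following the method names in the text):

```python
class Preprocessor:
    """Hypothetical sketch of the iteration helpers named in the text."""

    def __init__(self, sentences):
        self._sentences = sentences
        self._pos = 0

    def isNext(self):
        # True while there is still a sentence left to read.
        return self._pos < len(self._sentences)

    def getNextSentence(self):
        # Return the next sentence and advance the cursor.
        sentence = self._sentences[self._pos]
        self._pos += 1
        return sentence
```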


4.3 Segmentation

This is a sub-stage of the preprocessing stage. Tokenization of raw text is a standard pre-processing step for many NLP tasks; Arabic requires more extensive token pre-processing, which is usually called segmentation.

The Segmenter is taken from Stanford. The Stanford Word Segmenter currently supports Arabic and Chinese, and the provided segmentation schemes have been found to work well for a variety of applications. We modified the Segmenter to support Arabic only and to be integrated with our pre-processing module.

Arabic is a root-and-template language with abundant bound morphemes. These

morphemes include possessives, pronouns, and discourse connectives. Segmenting bound

morphemes reduces lexical sparsity and simplifies syntactic analysis.

The Arabic Segmenter model processes raw text according to the Penn Arabic Treebank 3

(ATB) standard.

We integrated the Segmenter to get appropriate preprocessing adapted to the coming stage, POS (Part Of Speech) tagging, to get more accurate results. The Segmenter can separate the prefix and suffix of a word to prepare it for the later stages.

For Example :

if we have a sentence :

And after processing in the Segmenter it will be :


4.4 POS Tagging Phase

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some

language and assigns parts of speech to each word (and other token), such as noun, verb,

adjective, etc., although generally computational applications use more fine-grained POS

tags like 'noun-plural'.

While searching for a POS tagger we found that Stanford University has a group doing research on Arabic, and one of their products is a POS tagger. After testing it for a while we decided to use this tagger in our work, as there are no other open-source projects doing the same job available for free; although it has some weaknesses, which we tried to avoid, we used this system.

This POS tagger module is big, and it generally works the same way for several languages: it supports multiple languages through grammars trained specifically for each one. The available languages are Arabic, English, Chinese, French, and German; the trained data for Arabic is called "arabic-accurate.tagger". As part of our work we tried to eliminate all unused libraries and modules from this tagger, keeping only the modules related to the Arabic language, making it faster to load and more lightweight.

The parser assumes precisely the tokenization of Arabic used in the Penn Arabic

Treebank (ATB). We do now have a software component for segmenting Arabic, but you

have to download and run it first; it isn't included in the parser (see at the end of this

answer). The Arabic parser simply uses a whitespace tokenizer. As far as we are aware,

ATB tokenization has only an extensional definition; it isn't written down anywhere.

Segmentation is done based on the morphological analyses generated by the Buckwalter

analyzer. The segmentation can be characterized thus:

Almost all clitics are separated off as separate words. This includes clitic pronouns,

prepositions, and conjunctions. However, the clitic determiner (definite article)

"Al" (ال) is not separated off. Inflectional and derivational morphology is not

separated off.

[GALE ROSETTA: These separated off clitics are not overtly marked as

proclitics/enclitics, although we do have a facility to strip off the '+' and '#'

characters that the IBM segmenter uses to mark enclitics and proclitics,

respectively. See the example below using the option -escaper

edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper]


Parentheses are rendered -LRB- and -RRB-

Quotes are rendered as (ASCII) straight single and double quotes (' and "), not as

curly quotes or LaTeX-style quotes (unlike the Penn English Treebank).

Dashes are represented with the ASCII hyphen character (U+002D).

Non-break space is not used.

The parsers are trained on unvocalized Arabic. One grammar

(atbP3FactoredBuckwalter.ser.gz or atb3FactoredBuckwalter.ser.gz) is trained on input

represented exactly as it is found in the Penn Arabic Treebank. The other grammars

(arabicFactored.ser.gz and arabicFactoredBuckwalter.ser.gz) are trained on a more

normalized form of Arabic. This form deletes the tatweel character and other diacritics

beyond the short vowel markers which are sometimes not written (Alef with hamza or

madda becomes simply Alef, and Alef maksura becomes Yaa), and prefers ASCII characters

(Arabic punctuation and number characters are mapped to corresponding ASCII

characters). Your accuracy will suffer unless you normalize text in this way, because words

are recognized simply based on string identity. [GALE ROSETTA: This is precisely the

mapping that the IBM ar_normalize_v5.pl script does for you.]

4.4.1 Training Data Supplied to the POS Tagger

The POS Tagger is a machine learning tool that requires prior training.

The trained object has been serialized and stored as a training file, which has to be supplied to the program each time it is used. Stanford University provides two trained files included with the program under the names "arabic-accurate.tagger" and "arabic-fast.tagger". However, the first file has shown relatively higher accuracy than the second one.

Examples

Parsing [sent. 1 len. 8]: و نشر العدل من خلال قضاء مستقل .

(ROOT (S (CC و)
  (VP (VBD نشر) (NP (DTNN العدل))
    (PP (IN من) (NP (NN خلال) (NP (NN قضاء) (JJ مستقل)))))
  (PUNC .)))


4.4.2 POS Tag Set

The parser uses an "augmented Bies" tag set. The so-called "Bies mapping" maps down

the full morphological analyses from the Buckwalter analyzer to a subset of the POS tags

used in the Penn English Treebank (but some with different meanings). We augment this

set to represent which words have the determiner "Al" (ال) cliticized to them. These extra

tags start with "DT", and appear for all parts of speech that can be preceded by "Al", so

we have DTNN, DTCD, etc. This is an early definition of the Bies mapping.

4.4.3 Lemmatization

Abstract form (lemma): it describes the basic form from which the given word is logically derived. Usually, this form differs from the word stem, which is obtained after removing the prefix and suffix parts of the word. For example, the stem of the word "المرئية" is "مرء", which represents a human-being object. In contrast, the abstract form of the word is "مرئي", which represents the adjective of a visual object. This abstract form can be used to represent many different words having the same logical meaning, "visual object", such as "النماذج المرئية", "النموذج المرئي", and "النمذجة المرئية". The abstract form of the given word is represented as follows:

• The single form for nouns.

• The single and male form for adjectives.

• The past form for verbs.

• The stem form for stop-words.

The abstract form of the given word is extremely useful during the process of extracting candidate keyphrases. For example, the words "المشروع" and "المشاريع" have different word stems, "مشروع" and "مشاريع" respectively, but their abstract forms are the same, "مشروع". This abstract form is used in extracting candidate keyphrases by recommending a strong key-term like "مشروع إلكتروني" to represent the terms "المشروع الإلكتروني" and "المشاريع الإلكترونية". Also, for example, the abstract form of the word (حمراء) is (أحمر), for (أشجار) it is (شجرة), and for (سيأكل) it is (أكل).

So this technique will improve the result but if we can find a right tool to extract Abstract

Form or Lemma from the text, searching again leads us this time to open source project

called “AraMorph” which is a Java port of the homonym product developed

in Perl by Tim Buckwalter on behalf of the Linguistic Data Consortium (LDC).


The product includes Java classes for the morphological analysis of arabic text files,

whatever their encoding.

Arabic WordNet consists of 9,228 synsets (6,252 nominal, 2,260 verbal, 606 adjectival, and 106 adverbial), containing 18,957 Arabic expressions. This number includes 1,155 synsets that correspond to named entities, which have been extracted automatically and are being checked by the lexicographers.

This module is able to return different forms (all possible lemma solutions); there can be many, since it does not know the context in which the word is used, and because of ambiguity in the Arabic language caused by diacritics and misspelled words. Along with these forms it also returns, for every solution, an initial POS (based only on the word itself, not on the whole sentence), the prefix and suffix of the word, plus its glossed English word.

For clarification, we give this module the Arabic word "كتاب" and examine its output.

Example

Processing token : كتاب
Transliteration : ktAb
Token not yet processed. Token has direct solutions.

SOLUTION #3
Lemma : kAtib
Vocalized as : كتاب
Morphology : prefix : Pref-0  stem : N  suffix : Suff-0
Grammatical category : stem : كتاب NOUN
Glossed as : stem : authors/writers

SOLUTION #1
Lemma : kitAb
Vocalized as : كتاب
Morphology : prefix : Pref-0  stem : Ndu  suffix : Suff-0
Grammatical category : stem : كتاب NOUN
Glossed as : stem : book

SOLUTION #2
Lemma : kut~Ab
Vocalized as : كتاب
Morphology : prefix : Pref-0  stem : N  suffix : Suff-0
Grammatical category : stem : كتاب NOUN
Glossed as : stem : kuttab (village school)/Quran school

As we can see, the word "كتاب" may have different meanings with the same spelling, differing only in diacritics, so the module returns all possible solutions.

From these different solutions for every word in a sentence, we choose by passing a parameter from the previous module (the Stanford POS tagger), which helps us pick the right form among the suggested solutions. If we still cannot determine the right solution, we sort the solutions by an algorithm depending on their suffixes, prefixes, and alphabetical order, so that we can choose one each time and limit randomness. This may not be the best solution, but its results are not bad, especially on large Arabic texts.

As we said, it depends on a large dataset of Arabic words and their formations, so we can call it domain-unspecific. How does it work inside? It contains three main dictionaries: one for stems, another for all possible prefixes in the language, and a last one for all possible suffixes in Arabic. It uses a brute-force algorithm to try to analyze a word, so for the word "كتاب" (ktAb) it tries the following splits:


prefix stem suffix

ktAb Ø Ø

ktA b Ø

ktA Ø b

kt Ab Ø

kt A b

kt Ø Ab

k tAb Ø

k tA b

k t Ab

k Ø tAb

Ø ktAb Ø

Ø ktA b

Ø kt Ab

Ø k tAb

Ø Ø ktAb
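The brute-force enumeration behind this table can be sketched as follows (a real analyzer keeps only the splits whose parts all appear in the prefix, stem, and suffix dictionaries):

```python
def candidate_splits(word):
    """Enumerate every (prefix, stem, suffix) split of a word, empty parts
    allowed, exactly as in the table above."""
    return [(word[:i], word[i:j], word[j:])
            for i in range(len(word) + 1)
            for j in range(i, len(word) + 1)]
```

For 'ktAb' this yields the 15 rows of the table.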

Dictionaries also include entries such as the following. From the prefix dictionary:

w    wa     Pref-Wa   and      <pos>wa/CONJ+</pos>
f    fa     Pref-Wa   and;so   <pos>fa/CONJ+</pos>

And from the suffix dictionary:

; perfect verb, null suffix: banA-h, daEA-h
h    hu     PVSuff-0ah   he/it <verb> it/him       <pos>+(null)/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS</pos>
hmA  humA   PVSuff-0ah   he/it <verb> them (both)  <pos>+(null)/PVSUFF_SUBJ:3MS+humA/PVSUFF_DO:3D</pos>
hm   hum    PVSuff-0ah   he/it <verb> them         <pos>+(null)/PVSUFF_SUBJ:3MS+hum/PVSUFF_DO:3MP</pos>
hA   hA     PVSuff-0ah   he/it <verb> it/them/her  <pos>+(null)/PVSUFF_SUBJ:3MS+hA/PVSUFF_DO:3FS</pos>
hn   hun~a  PVSuff-0ah   he/it <verb> them         <pos>+(null)/PVSUFF_SUBJ:3MS+hun~a/PVSUFF_DO:3FP</pos>
k    ka     PVSuff-0ah   he/it <verb> you          <pos>+(null)/PVSUFF_SUBJ:3MS+ka/PVSUFF_DO:2MS</pos>
k    ki     PVSuff-0ah   he/it <verb> you          <pos>+(null)/PVSUFF_SUBJ:3MS+ki/PVSUFF_DO:2FS</pos>


kmA  kumA   PVSuff-0ah   he/it <verb> you (both)   <pos>+(null)/PVSUFF_SUBJ:3MS+kumA/PVSUFF_DO:2D</pos>
km   kum    PVSuff-0ah   he/it <verb> you          <pos>+(null)/PVSUFF_SUBJ:3MS+kum/PVSUFF_DO:2MP</pos>
kn   kun~a  PVSuff-0ah   he/it <verb> you          <pos>+(null)/PVSUFF_SUBJ:3MS+kun~a/PVSUFF_DO:2FP</pos>
ny   niy    PVSuff-0ah   he/it <verb> me           <pos>+(null)/PVSUFF_SUBJ:3MS+niy/PVSUFF_DO:1S</pos>
nA   nA     PVSuff-0ah   he/it <verb> us           <pos>+(null)/PVSUFF_SUBJ:3MS+nA/PVSUFF_DO:1P</pos>

And this last one from the stem dictionary:

;--- ktb
;; katab-u_1
ktb     katab     PV          write
ktb     kotub     IV          write
ktb     kutib     PV_Pass     be written;be fated;be destined
ktb     kotab     IV_Pass_yu  be written;be fated;be destined
;; kAtab_1
kAtb    kAtab     PV          correspond with
kAtb    kAtib     IV_yu       correspond with
;; >akotab_1
>ktb    >akotab   PV          dictate;make write
Aktb    >akotab   PV          dictate;make write
ktb     kotib     IV_yu       dictate;make write
ktb     kotab     IV_Pass_yu  be dictated
;; takAtab_1
tkAtb   takAtab   PV          correspond
tkAtb   takAtab   IV          correspond

4.5 Candidate Keyphrase Extraction

After investigating the keyphrases of different Arabic documents, we found that the following syntactic rules are effective for extracting candidate phrases:

1- The candidate phrase can start only with some sort of noun: general-noun, place-noun, proper-noun, or declined-noun.

2- The candidate phrase can end only with a general-noun, place-noun, proper-noun, declined-noun, time-noun, augmented-noun, adjective, or adverb.


3- For three-word phrases, the second word is allowed to be a count-noun, conjunction, preposition, or comparison, in addition to those cited in rule 2. It is worthwhile to note that the rules used are language-dependent, and the given rules are applicable only to the Arabic language.

4- We created two lists of stop-words: stop-words that shouldn't appear in a keyphrase, and stop-words that can be the middle word of a three-word keyphrase.
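Rules 1-3 can be sketched as a filter over the POS tag sequence of each candidate n-gram (the tag names below are illustrative labels for the categories named above, not the actual tag set):

```python
# Illustrative labels for the word categories named in rules 1-3;
# the real system works on the tagger's POS tags.
START_TAGS  = {'general-noun', 'place-noun', 'proper-noun', 'declined-noun'}
END_TAGS    = START_TAGS | {'time-noun', 'augmented-noun', 'adjective', 'adverb'}
MIDDLE_TAGS = END_TAGS | {'count-noun', 'conjunction', 'preposition', 'comparison'}

def is_candidate(tags):
    """Check the POS tag sequence of a 1- to 3-word n-gram against the rules."""
    if not 1 <= len(tags) <= 3:
        return False
    if tags[0] not in START_TAGS or tags[-1] not in END_TAGS:
        return False
    if len(tags) == 3 and tags[1] not in MIDDLE_TAGS:
        return False
    return True
```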

4.6 Feature Extraction Phase

The statistics module is the main repository of all information that needs to be saved. In this module we update the statistics for each sentence and accumulate the statistics of each new sentence with the previous ones, to obtain the statistics over the whole document. The statistics module uses the candidate keyphrases of each sentence, together with their lemmas, their POS (part of speech) tags, and the original words. First we calculate the word count of each sentence, ignoring all punctuation in that sentence, and then we calculate the word count of the whole document. After that we get the maximum phrase length in the document, used in normalizing each phrase length to the maximum length. It is also important to calculate the maximum phrase frequency, used in calculating the feature called Phrase Relative Frequency (PRF), and likewise the maximum word frequency, used in the feature called Word Relative Frequency (WRF).

Within the scope of one sentence, we record whether that sentence contains a verb or not, which is highly effective in calculating the features, and likewise whether that sentence contains a question, because many writers pose questions about their main topic.

Features

Each candidate phrase is assigned a number of features used to evaluate its importance. In our algorithm, three factors control the selection of features and their values:

1. The absolute importance of the phrase, which identifies its importance independent of its original document. Therefore, most feature values are normalized, when necessary, to range from zero to one.

2. Heuristics, where the feature values are computed based on our hypothesis of their importance, after investigating many human-written keyphrases.

3. All the extracted features and values are based on the abstract forms of the phrases.


The following features are adopted:

a) Normalized Phrase Words (NPW), which is the number of words in each phrase normalized to the maximum number of words in a phrase. The values of this feature can be 1, 1/2, or 1/3. The hypothesis is that keyphrases consisting of three words are better than keyphrases containing two words, and so on.

b) The Phrase Relative Frequency (PRF), which represents the frequency of the abstract form of the candidate phrase, normalized by dividing it by the frequency of the most frequent phrase in the given document. PRF has a maximum value of 1, when the candidate keyphrase is the most frequent one in the given document.

c) The Word Relative Frequency (WRF): the frequency of the most frequent single abstract word in a candidate phrase, normalized by dividing it by the maximum number of repetitions of all phrase words in a given document. The feature is calculated as follows: first, the frequency of all unique abstract words used in phrases of a given document is computed. Second, the maximum number of repetitions is found and used to normalize the computed frequencies. Third, for each phrase, the maximum normalized frequency of its words is selected as the WRF. WRF has a maximum value of 1, when the phrase contains the most frequent word of all phrase words in the given document.

d) Normalized Sentence Location (NSL), which measures the location of the sentence containing the candidate phrase within the document. We use the heuristic that keyphrases located near the beginning and end of a document are important phrases. We use the simple distribution function NSL = (2(i/m) - 1)^2, where i is the location of the sentence within the document and m is the total number of sentences in that document. The maximum value of NSL is 1, for the first (i = 0) and last (i = m) sentences in the document.

e) The Normalized Phrase Location (NPL) feature is adopted to measure the location of the candidate phrase within its sentence. The NPL is given by (2(x/n) - 1)^2, where x is the occurrence location of the phrase within the sentence and n is the total number of words of that sentence. Our motivation is that important keyphrases occur near the beginning and end of sentences.

f) Normalized Phrase Length (NPLen), which is the length of the candidate phrase (in words) divided by the number of words of its sentence. This feature has a value of one when the whole sentence is a keyphrase. Our hypothesis is that this will capture titles and subtitles of the document, which are likely to contain keyphrases.

g) Sentence Contains Verb (SCV). This feature has a value of zero if the sentence of the candidate phrase contains a verb; otherwise it has a value of one. Our motivation is that this feature will give more weight to keyphrases written in the titles and subtitles of a document. The feature value is assigned after analyzing the part of speech of the sentence words.

h) Is It Question (IIT): this feature has a value of one if the sentence of the candidate phrase is written in question form; otherwise its value is 0. The hypothesis is that some authors highlight their main concepts in question form. The feature is adopted to capture important keyphrases written in documents as questions. In this work, question forms are identified only through part-of-speech tagging, by detecting question marks and/or question words.
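The location and length features (d, e, f) can be sketched directly from their formulas (a minimal illustration; the frequency and boolean features follow the same pattern):

```python
def nsl(i, m):
    """Normalized Sentence Location: (2(i/m) - 1)^2; 1 at both document ends."""
    return (2 * (i / m) - 1) ** 2

def npl(x, n):
    """Normalized Phrase Location: (2(x/n) - 1)^2; 1 at both sentence ends."""
    return (2 * (x / n) - 1) ** 2

def nplen(phrase_len, sentence_len):
    """Normalized Phrase Length: 1 when the whole sentence is the phrase."""
    return phrase_len / sentence_len
```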


Chapter 5 Results and Future Work


5.1 Results

The program was tested on many documents from various fields and by many authors. In order to evaluate the performance of the proposed system, many experiments were carried out. A total of 25 documents were used. The first experiment aimed to measure the level of acceptance of the extracted keyphrases. Since there are no author-assigned keyphrases for these documents, a human judge was adopted to evaluate this level. We compared the results with KP-Miner, but we couldn't compare with Sakhr because its output was not suitable for comparison.

5.1.1 Overall Results

Total # of documents             25
Categories selected              Politics, sports, community, technology, religion, psychology
Output keyphrases per document   15, 20

Table 5.1

Results for Our System Precision 0.25 (for 15 keyphrase)

0.171 (for 20 keyphrase) Recall 0.443 (for 15 keyphrase)

0.447 (for 20 keyphrase) Table 5.2

Results for Kp-miner

Precision 0.214 (for 15 keyphrase) 0.178 (for 20 keyphrase)

Recall 0.399 (for 15 keyphrase) 0.414 (for 20 keyphrase)

Table 5.3
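Precision and recall above follow the standard definitions over the extracted and judged-relevant keyphrase sets. A quick sketch of the per-document computation (the sets below are hypothetical, not data from the experiments):

```python
def precision_recall(extracted, relevant):
    """Precision = |extracted ∩ relevant| / |extracted|;
    Recall    = |extracted ∩ relevant| / |relevant|."""
    hits = len(set(extracted) & set(relevant))
    return hits / len(extracted), hits / len(relevant)

# e.g. 15 extracted phrases, of which 4 are among 9 judged relevant
extracted = [f"p{i}" for i in range(15)]
relevant = ["p0", "p1", "p2", "p3"] + [f"r{i}" for i in range(5)]
p, r = precision_recall(extracted, relevant)
print(round(p, 3), round(r, 3))  # 0.267 0.444
```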


5.1.2 Results for document samples

Although training was done on the technology field, the system was also tested on other fields such as psychology.

Our system علم ، علم النفس الفسولوج ، علم النفس العصب ،علم النفس االجتماع ،علم

النفس االكلنك ، علم النفس ، النفس ، العلم ف مجال ، علم النفس الشخصة ،

، االكلنك النفس ، االجتماع النفس ، العصب النفس ، الفسولوج النفس

سرعة السارة متغر ، علم النفس التربوي ، علم النفس المعرف ، السارة متغر

مستقل ، علم النفس التنظم ، علم النفس االرتقائ

KP-Miner ،علم النفس، متغر، درس،، قوانن السلوك، فروع علم النفس، ظواهر، دراسة

سلوك، السلوك التعلم، التجرب، واالدراكات، المنهج، العالقة، نفسة، المناهج،

المخدرات التجربة،الفروع، المبكرة، العصب، المعرفة، الفرد،

Table 5.4

The following table shows the worst result we obtained on these test files, compared to a very good output from KP-Miner:

Our system مبارك ، فرتب بقة التفاصل ، سارة ، التسجل ،مصر ، صور ، الجماعة

سكت ، صاحب ، احد ، برامج ، االستودو بمغادرة مطالبت ، المصرة الوطنة

، التفاصل بقة ، العمال ، الشبكة فى مسئول ، سدة ، بقة فرتب ، صاحبنا

الفنانن ، محاكمة مبارك

KP-Miner محاكمة مبارك، بدء التسجل، واالنتخابات الرئاسة، ممتلئ الجسم، تصبب

عرقا، استوقفت سارة أجرة، مدان التحرر، صاحبنا،االخوان المسلمون،

مصر، مبارك، بى، والفتات، سدة، الفنانن، وظلوا، صورة، الرسام، والحظت

Table 5.5

Another test, on a politics document:

Our system ، االخوان ، اسقاط مرشح االخوان ، الخوف من االخوان ، االخوان فوق النقد

االخوان حكم ، االخوان جنة ، مرشح اسقاط مقابل ، مبارك ، مصر فى الوضع

، نظام ومرشح االخوان ، اسالب الى اللجوء ، االخوان من الثقة ، اسود

االخوان ، علمانن من االخوان ، ومرشح االخوان مرشح ، كراهة او االخوان

اسود ، تحفظ على االخوان ، حكم االخوان


KP-Miner اإلخوان، نظام مبارك، محبة الوطن، كراهة اإلخوان، الثورة، النقد، مثل

الدولة، احمد شفق، نبغى، الحد، مسئول، مصر، فإننى، ظلوا، االنتخابات،

محترم مبارك، الفروق،

Table 5.6

5.2 Future Work

There are many improvements to be made to refine the output and produce more efficient results:

1. Add more features that represent the writer's style, so that the extracted keyphrases are more relevant and better suit the content.
2. Use a better training technique to generate an equation that can be used with new features.
3. Optimize the code to improve both the performance and the size of the program.

In addition, features could be added for Arabic punctuation marks and their effects, such as:

a) The "،" symbol, called الفاصلة (comma): it comes as a conjunction between words or phrases and can be treated like the conjunction letter "و".

b) The ":" symbol, called نقطتان فوقيتان (colon): it comes after subtitles to introduce details about them, and also before quotes; the phrase before it could therefore be given more weight, with the exact weight to be tuned by future testing.

c) The "؛" symbol, called الفاصلة المنقوطة (semicolon): it indicates that what follows is an explanation of what precedes it.

d) Any phrase or word placed between two dashes "- -" is called جملة اعتراضية (parenthetical clause); its weight must be decided after extensive testing, because such a phrase can be either negligible or important, so it must be explicitly decided.

e) The "..." symbol, علامة الحذف (ellipsis), marks deleted text; it is important not to ignore sentences or phrases related to this symbol.

f) The "-" symbol, called شرطة (dash): if what follows it is a single word, that word is likely important because it expresses a subject that comes later.
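One way the punctuation features above could be realized is as a per-phrase weight adjustment; a sketch under our own assumptions (the weight values are placeholders, not values from the report, and would have to be tuned by the testing it calls for):

```python
# Placeholder bonus weights for illustration only.
PUNCT_BONUS = {
    "،": 0.0,   # Arabic comma: treated like the conjunction "و"
    ":": 0.2,   # colon: boost the phrase before a subtitle/quote
    "؛": 0.1,   # Arabic semicolon: what follows explains what precedes
    "-": 0.15,  # dash before a single word announcing a late subject
}

def boosted_score(base_score, following_char):
    """Add a punctuation-based bonus to a candidate's base score,
    depending on the punctuation mark that follows the phrase."""
    return base_score + PUNCT_BONUS.get(following_char, 0.0)

print(boosted_score(1.0, ":"))  # 1.2
```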


Appendix

Definitions:

Nouns:

A noun is a name or an attribute of a person (Ali), place (Mecca), thing (house), or quality (honor). The word "noun" comes from the Latin nomen, "name." The noun (substantive) category in Arabic includes, in addition to simple nouns, the pronouns, adjectives, adverbs, and verbids (participles and verbal nouns).

Pronouns:

Pronouns (الضمائر) in Arabic belong to the category of "nouns"; therefore, everything that applies to nouns applies to them. Pronouns have gender, number, and grammatical case. Pronouns are always definite nouns.

General-noun: أسماء عامة

General nouns can be classified into concrete/abstract, human/non-human, and animate/inanimate nouns, which can be used in any type of text to create lexical cohesion. The types of general noun encountered in Arabic are:

a) concrete human nouns: اسماء بشرية ملموسة, e.g. رجال ، نساء

b) abstract human nouns: اسماء بشرية مجردة, e.g. انسانية ، بشرية

c) concrete animate non-human nouns: اسماء غير بشرية حية ملموسة, e.g. مخلوق ، سمك

d) abstract inanimate non-human nouns: اسماء غير بشرية غير حية مجردة, e.g. شئ ، مادة

Place-noun:

A noun of place has the form mafʕal (مفعل) or similar, e.g. maktab مكتب, maktaba مكتبة "library" (from kataba كتب "to write"); maṭbaḫ مطبخ "kitchen" (from ṭabaḫa طبخ "to cook"); masraḥ مسرح "theater" (from saraḥa سرح "to release"). Nouns of place formed from verbs other than Form I have the same form as the passive participle, e.g. mustašfan مستشفى "hospital" (from the Form X verb istašfā استشفى "to cure").


Time-noun:

A noun derived from a verb to indicate the time of occurrence of the act, e.g. mawʕid موعد (from waʕada وعد) or maḏhab مذهب (from ḏahaba ذهب).

Proper-noun:

Refers to unique or particular objects (cannot be preceded by words such as "some" or "any"); typically names of persons or places.

Common-noun:

Refers to non-unique or non-particular objects (can be preceded by words such as "some" or "any").

Adjective:

Adjectives in Arabic follow the nouns or pronouns they modify in gender,

number, grammatical case, and the state of definiteness. They always

come after the words they modify. Adjectives in Arabic belong to the "noun"

category, and there are several types of nouns that can serve as adjectives.

Declined-noun:

Nouns undergo inflection (تصريف), which means that parts of them change in order to express changes in gender, number, case, tense, voice, person, or mood.

The declension of Arabic nouns expresses changes in:

Gender — Arabic nouns have two grammatical genders (مذكر - مؤنث).
Number — Arabic nouns have three grammatical numbers (مفرد - مثنى - جمع).
Case — Arabic nouns have three grammatical cases (رفع - نصب - جر).
State — Arabic nouns have three grammatical states (نكرة - معرفة - مضاف).

Declension (تصريف الأسماء):

Gender: Masculine مذكر / Feminine مؤنث
Number: Singular مفرد / Dual مثنى / Plural جمع
Case: Rafʕ (nom.) مرفوع / Nasb (acc./dat./voc.) منصوب / Jarr (gen./abl.) مجرور
State: Absolute نكرة / Determinate معرفة / Construct مضاف


Mass nouns:

Nouns that refer to single as well as plural units when they are grammatically singular, and to plural units when they are grammatically plural. They usually refer to plants or animals.

Examples:
Singular mass noun: thamar ثمر (fruit/fruits); plural mass noun: thimaar ثمار (fruits)
Singular mass noun: shajar شجر (tree/trees); plural mass noun: 'ashjaar أشجار (trees)

Adverb:

Arabic adverbs are a part of speech. Generally they are words that modify any part of language other than a noun. Adverbs can modify verbs, adjectives (including numbers), clauses, sentences, and other adverbs. In Arabic an adverb is mostly expressed by a noun in the accusative, e.g. هو يتكلم كثيرا عن ابنه, Huwa yatakallam kathiiran 3an ibnihi (he speaks a lot about his son).

Count-noun:

Nouns that refer to single units when they are grammatically singular, and to plural units when they are grammatically plural.

Example:
Singular count noun: rajul رجل; plural count noun: rijaal رجال

Conjunction:

A word that connects sentences, clauses or words, such as "ka-" كـ (related to "as") and "fa-" فـ ("thus, so").

Preposition:

Expresses a relationship between two entities. There are only twenty Arabic prepositions; the most important and commonly used are six: (min, ila, ala, bi, li, fi) (من ، الى ، على ، بـ ، لـ ، فى).


Comparison:

Elative forms of adjectives are used for both comparatives (e.g. "bigger") and superlatives (e.g. "best"). Elative adjectives are invariable and take three regular forms:

1. أفعل (afʕal), e.g. كبير (kibiir) → أكبر (akbar)
2. أفعى (afʕa), corresponding to adjectives that end in ـي (-i) or ـو (-w), e.g. حلو (Helw) → أحلى (aHla)
3. أفعلّ (afaʕll), corresponding to adjectives with a doubled/geminate root, e.g. جديد (gediid) → أجدّ (agadd)

Nominative case - المرفوع (al-marfūʕ):

This case is marked by a Damma. It is the case of a noun or pronoun functioning as the subject of a clause or sentence. Other words such as adjectives may take the nominative case in agreement with a noun. e.g. ذهب الولد الى المدرسة

Accusative case - المنصوب (al-manSūb):

This case is marked by a fatHa. It identifies the direct object of a verb, or certain other grammatical parts. e.g. حضر الرجل اللقاء

Dative (Arabic المفعول به الثاني):

The case that indicates the indirect object of a verb. e.g. اعطى الرجل ابنته قلما

Genitive case - المجرور (al-majrūr):

This case is marked by a kasra; it indicates possession. e.g. صاحب القلم فى بيته

Essive (Arabic الحال):

A case that expresses the temporary state of the referent specified by a noun; it means "while" or "in the capacity of." e.g. مشى الرجل ضاحكا

Locative (Arabic المفعول فيه ، ظرف المكان):

A case that indicates a location. It corresponds to the English prepositions "in," "on," "at," and "by." In Arabic it is used only with place expressions, such as "front" or "back." e.g. وقف الرجل امام الباب


The genitive construct:

In Arabic, two nouns can be placed one after the other in what is called a genitive construct (الإضافة) to indicate possession. First comes the noun being possessed (المضاف), then the noun referring to the owner (المضاف اليه).

Temporal (Arabic المفعول فيه ، ظرف الزمان):

A case that indicates a time. It corresponds to the English prepositions "in," "on," "at," and "by." In Arabic it is used only with time expressions, such as "morning" or "evening." e.g. عمل الرجل صباحا

Partitive (Arabic التمييز):

A case that indicates "partialness," "without result," or "without specific identity." e.g. "thirteen men came."

Cognate Accusative (Arabic المفعول المطلق):

A case that identifies the object of an intransitive verb, with the object having the same root as the verb.

Final (Arabic المفعول لأجله):

A case that indicates the final cause of an action. e.g. اجتهد الطالب امال فى التفوق

Comitative (Arabic المفعول معه):

A case that indicates companionship; it corresponds to the English preposition "with." e.g. ذهب الطالب مع صديقه

Perlative (Arabic المفعول معه):

In Arabic, it indicates a movement along the referent of the marked noun. e.g. "the man walked along the beach."

Vocative (Arabic المنادى):

A case indicating that somebody or something is being directly addressed by the speaker. e.g. أمحمد ، هل معك من مال؟

Ablative:

The case that indicates the source, agent, or instrument of the action of the verb. It marks the object of most common prepositions. e.g. الرجل اتى متاخرا من عمله


BUCKWALTER TRANSLITERATION:

Buckwalter   Unicode   Arabic letter

' U+0621 ARABIC LETTER HAMZA

| U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE

> U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE

& U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE

< U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW

} U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE

A U+0627 ARABIC LETTER ALEF

b U+0628 ARABIC LETTER BEH

p U+0629 ARABIC LETTER TEH MARBUTA

t U+062A ARABIC LETTER TEH

v U+062B ARABIC LETTER THEH

j U+062C ARABIC LETTER JEEM

H U+062D ARABIC LETTER HAH

x U+062E ARABIC LETTER KHAH

d U+062F ARABIC LETTER DAL

* U+0630 ARABIC LETTER THAL

r U+0631 ARABIC LETTER REH

z U+0632 ARABIC LETTER ZAIN

s U+0633 ARABIC LETTER SEEN

$ U+0634 ARABIC LETTER SHEEN

S U+0635 ARABIC LETTER SAD

D U+0636 ARABIC LETTER DAD

T U+0637 ARABIC LETTER TAH

Z U+0638 ARABIC LETTER ZAH

E U+0639 ARABIC LETTER AIN

g U+063A ARABIC LETTER GHAIN

_ U+0640 ARABIC TATWEEL

f U+0641 ARABIC LETTER FEH

q U+0642 ARABIC LETTER QAF

k U+0643 ARABIC LETTER KAF

l U+0644 ARABIC LETTER LAM

m U+0645 ARABIC LETTER MEEM

n U+0646 ARABIC LETTER NOON

h U+0647 ARABIC LETTER HEH

w U+0648 ARABIC LETTER WAW

Y U+0649 ARABIC LETTER ALEF MAKSURA

y U+064A ARABIC LETTER YEH

F U+064B ARABIC FATHATAN

N U+064C ARABIC DAMMATAN


K U+064D ARABIC KASRATAN

a U+064E ARABIC FATHA

u U+064F ARABIC DAMMA

i U+0650 ARABIC KASRA

~ U+0651 ARABIC SHADDA

o U+0652 ARABIC SUKUN

` U+0670 ARABIC LETTER SUPERSCRIPT ALEF

{ U+0671 ARABIC LETTER ALEF WASLA

P U+067E ARABIC LETTER PEH

J U+0686 ARABIC LETTER TCHEH

V U+06A4 ARABIC LETTER VEH

G U+06AF ARABIC LETTER GAF
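Since the table above is a one-to-one character mapping, transliteration reduces to direct character substitution. A sketch covering a subset of the listed pairs (not the project's actual code):

```python
# Subset of the Buckwalter table above; the full table maps every letter.
BUCKWALTER = {
    "\u0621": "'", "\u0627": "A", "\u0628": "b", "\u062A": "t",
    "\u062C": "j", "\u0633": "s", "\u0634": "$", "\u0643": "k",
    "\u0644": "l", "\u0645": "m", "\u0648": "w", "\u064A": "y",
}

def to_buckwalter(text):
    """Transliterate Arabic to Buckwalter, passing unknown chars through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

print(to_buckwalter("\u0643\u062A\u0628"))  # ktb (the root كتب)
```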

POS Tag Set:

JJ adjective

RB adverb

CC coordinating conjunction

DT determiner/demonstrative pronoun

FW foreign word

NN common noun, singular

NNS common noun, plural or dual

NNP proper noun, singular

NNPS proper noun, plural or dual

RP particle

VBP imperfect verb (***nb: imperfect rather than present tense)

VBN passive verb (***nb: passive rather than past participle)

VBD perfect verb (***nb: perfect rather than past tense)

UH interjection

PRP personal pronoun


PRP$ possessive personal pronoun

CD cardinal number

IN subordinating conjunction (FUNC_WORD) or preposition (PREP)

WP relative pronoun

WRB wh-adverb

, punctuation, token is , (PUNC)

. punctuation, token is . (PUNC)

: punctuation, token is : or other (PUNC)

AraMorph

Dictionaries file format:

"dictPrefixes" contains all Arabic prefixes and their concatenations. Sample entries:

w wa Pref-Wa and <pos>wa/CONJ+</pos>

f fa Pref-Wa and;so <pos>fa/CONJ+</pos>

b bi NPref-Bi by;with <pos>bi/PREP+</pos>

k ka NPref-Bi like;such as <pos>ka/PREP+</pos>

wb wabi NPref-Bi and + by/with <pos>wa/CONJ+bi/PREP+</pos>

fb fabi NPref-Bi and + by/with <pos>fa/CONJ+bi/PREP+</pos>

wk waka NPref-Bi and + like/such as <pos>wa/CONJ+ka/PREP+</pos>

fk faka NPref-Bi and + like/such as <pos>fa/CONJ+ka/PREP+</pos>

Al Al NPref-Al the <pos>Al/DET+</pos>

wAl waAl NPref-Al and + the <pos>wa/CONJ+Al/DET+</pos>

fAl faAl NPref-Al and/so + the <pos>fa/CONJ+Al/DET+</pos>

bAl biAl NPref-BiAl with/by + the <pos>bi/PREP+Al/DET+</pos>

kAl kaAl NPref-BiAl like/such as + the <pos>ka/PREP+Al/DET+</pos>


wbAl wabiAl NPref-BiAl and + with/by + the <pos>wa/CONJ+bi/PREP+Al/DET+</pos>

fbAl fabiAl NPref-BiAl and/so + with/by + the <pos>fa/CONJ+bi/PREP+Al/DET+</pos>

wkAl wakaAl NPref-BiAl and + like/such as + the <pos>wa/CONJ+ka/PREP+Al/DET+</pos>

fkAl fakaAl NPref-BiAl and + like/such as + the <pos>fa/CONJ+ka/PREP+Al/DET+</pos>

"dictSuffixes" contains all Arabic suffixes and their concatenations. Sample entries:

p ap NSuff-ap [fem.sg.] <pos>+ap/NSUFF_FEM_SG</pos>

ty atayo NSuff-tay two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS</pos>

tyh atayohi NSuff-tay his/its two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hu/POSS_PRON_3MS</pos>

tyhmA atayohimA NSuff-tay their two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+humA/POSS_PRON_3D</pos>

tyhm atayohim NSuff-tay their two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hum/POSS_PRON_3MP</pos>

tyhA atayohA NSuff-tay its/their/her two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hA/POSS_PRON_3FS</pos>

tyhn atayohin~a NSuff-tay their two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+hun~a/POSS_PRON_3FP</pos>

tyk atayoka NSuff-tay your two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+ka/POSS_PRON_2MS</pos>

tyk atayoki NSuff-tay your two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+ki/POSS_PRON_2FS</pos>


tykmA atayokumA NSuff-tay your two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+kumA/POSS_PRON_2D</pos>

tykm atayokum NSuff-tay your two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+kum/POSS_PRON_2MP</pos>

tykn atayokun~a NSuff-tay your two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+kun~a/POSS_PRON_2FP</pos>

ty atay~a NSuff-tay my two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+ya/POSS_PRON_1S</pos>

tynA atayonA NSuff-tay our two

<pos>+atayo/NSUFF_FEM_DU_ACCGEN_POSS+nA/POSS_PRON_1P</pos>

"dictStems" contains all Arabic stems. Sample entries:

;--- ktb

;; katab-u_1

ktb katab PV write

ktb kotub IV write

ktb kutib PV_Pass be written;be fated;be destined

ktb kotab IV_Pass_yu be written;be fated;be destined

;; kAtab_1

kAtb kAtab PV correspond with

kAtb kAtib IV_yu correspond with

;; >akotab_1

>ktb >akotab PV dictate;make write

Aktb >akotab PV dictate;make write

ktb kotib IV_yu dictate;make write

ktb kotab IV_Pass_yu be dictated

;

;; kitAb_1


ktAb kitAb Ndu book

ktb kutub N books

;; kitAboxAnap_1

ktAbxAn kitAboxAn NapAt library;bookstore

ktbxAn kutuboxAn NapAt library;bookstore

;; kutubiy~_1

ktby kutubiy~ Ndu book-related

;; kutubiy~_2

ktby kutubiy~ Ndu bookseller

ktby kutubiy~ Nap booksellers <pos>kutubiy~/NOUN</pos>

;; kut~Ab_1

ktAb kut~Ab N kuttab (village school);Quran school

ktAtyb katAtiyb Ndip kuttab (village schools);Quran schools

;; kutay~ib_1

ktyb kutay~ib NduAt booklet

;; kitAbap_1

ktAb kitAb Nap writing

;; kitAbap_2

ktAb kitAb Napdu essay;piece of writing

ktAb kitAb NAt writings;essays

;; kitAbiy~_1

ktAby kitAbiy~ N-ap writing;written <pos>kitAbiy~/ADJ</pos>

;; katiybap_1

ktyb katiyb Napdu brigade;squadron;corps

ktA}b katA}ib Ndip brigades;squadrons;corps

ktA}b katA}ib Ndip Phalangists

;; katA}ibiy~_1

ktA}by katA}ibiy~ Nall brigade;corps <pos>katA}ibiy~/NOUN</pos>

ktA}by katA}ibiy~ Nall brigade;corps <pos>katA}ibiy~/ADJ</pos>

;; katA}ibiy~_2


ktA}by katA}ibiy~ Nall Phalangist <pos>katA}ibiy~/NOUN</pos>

ktA}by katA}ibiy~ Nall Phalangist <pos>katA}ibiy~/ADJ</pos>

;; makotab_1

mktb makotab Ndu bureau;office;department

mkAtb makAtib Ndip bureaus;offices

;; makotabiy~_1

mktby makotabiy~ N-ap office <pos>makotabiy~/ADJ</pos>

;; makotabap_1

mktb makotab NapAt library;bookstore

mkAtb makAtib Ndip libraries;bookstores

THE THREE COMPATIBILITY TABLES:

Compatibility table "tableAB" lists compatible Prefix and Stem morphological categories, such as:

NPref-Al N

NPref-Al N-ap

NPref-Al N-ap_L

NPref-Al N/At

NPref-Al N/At_L

NPref-Al N/ap

NPref-Al N/ap_L

Compatibility table "tableAC" lists compatible Prefix and Suffix morphological categories, such as:

NPref-Al Suff-0

NPref-Al NSuff-u

NPref-Al NSuff-a

NPref-Al NSuff-i

NPref-Al NSuff-An

NPref-Al NSuff-ayn


Compatibility table "tableBC" lists compatible Stem and Suffix morphological categories, such as:

PV PVSuff-a

PV PVSuff-ah

PV PVSuff-A

PV PVSuff-Ah

PV PVSuff-at

PV PVSuff-ath
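The three tables drive segmentation validation: a candidate (prefix, stem, suffix) split is accepted only if all three pairwise category combinations are listed. A sketch using tiny excerpts of the tables above (the table contents here are illustrative subsets, not the full data):

```python
# Tiny excerpts of the three compatibility tables.
TABLE_AB = {("NPref-Al", "N"), ("NPref-Al", "N-ap")}          # prefix-stem
TABLE_AC = {("NPref-Al", "Suff-0"), ("NPref-Al", "NSuff-u")}  # prefix-suffix
TABLE_BC = {("N", "Suff-0"), ("N", "NSuff-u")}                # stem-suffix

def compatible(pref_cat, stem_cat, suff_cat):
    """A (prefix, stem, suffix) split is valid only if all three
    pairwise category combinations appear in the tables."""
    return ((pref_cat, stem_cat) in TABLE_AB
            and (pref_cat, suff_cat) in TABLE_AC
            and (stem_cat, suff_cat) in TABLE_BC)

print(compatible("NPref-Al", "N", "NSuff-u"))     # True
print(compatible("NPref-Al", "N-ap", "NSuff-u"))  # False: N-ap+NSuff-u unlisted
```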

Grammatical categories

Prefixes

Category Description

CONJ Conjunction

EMPHATIC_PARTICLE Emphatic particle

FUNC_WORD TODO : to be precisely defined

FUT_PART Future particle

INTERJ Interjection

INTERROG_PART Interrogative particle

IV1S Imperfective 1st person singular

IV2MS Imperfective 2nd person masculine singular

IV2FS Imperfective 2nd person feminine singular

IV3MS Imperfective 3rd person masculine singular

IV3FS Imperfective 3rd person feminine singular


IV2D Imperfective 2nd person dual

IV2FD Imperfective 2nd person feminine dual

IV3MD Imperfective 3rd person masculine dual

IV3FD Imperfective 3rd person feminine dual

IV1P Imperfective 1st person plural

IV2MP Imperfective 2nd person masculine plural

IV2FP Imperfective 2nd person feminine plural

IV3MP Imperfective 3rd person masculine plural

IV3FP Imperfective 3rd person feminine plural

NEG_PART Negative particle

PREP Preposition

RESULT_CLAUSE_PARTICLE Result clause particle

Stems

Category Description

ABBREV Abbreviation

ADJ Adjective

ADV Adverb

DEM_PRON_F Feminine demonstrative pronoun

DEM_PRON_FS Feminine singular demonstrative pronoun

DEM_PRON_FD Dual demonstrative pronoun

DEM_PRON_MS Masculine singular demonstrative pronoun

DEM_PRON_MD Masculine dual demonstrative pronoun

DEM_PRON_MP Masculine plural demonstrative pronoun

DET Determinative ?

INTERROG TODO : to be precisely defined

NO_STEM No stem for the word

NOUN Noun

NOUN_PROP Proper noun

NUMERIC_COMMA Decimal separator

PART Particle

PRON_1S Personal pronoun : 1st person singular

PRON_2MS Personal pronoun : 2nd person masculine singular


PRON_2FS Personal pronoun : 2nd person feminine singular

PRON_3MS Personal pronoun : 3rd person masculine singular

PRON_3FS Personal pronoun : 3rd person feminine singular

PRON_2D Personal pronoun : 2nd person common dual

PRON_3D Personal pronoun : 3rd person common dual

PRON_1P Personal pronoun : 1st person plural

PRON_2MP Personal pronoun : 2nd person masculine plural

PRON_2FP Personal pronoun : 2nd person feminine plural

PRON_3MP Personal pronoun : 3rd person masculine plural

PRON_3FP Personal pronoun : 3rd person feminine plural

REL_PRON Relative pronoun

VERB_IMPERATIVE Imperative verb

VERB_IMPERFECT Imperfective verb

VERB_PERFECT Perfective verb

NO_RESULT Word that could not be analyzed

Suffixes

Category Description

CASE_INDEF_NOM Indefinite, nominative

CASE_INDEF_ACC Indefinite, accusative

CASE_INDEF_ACCGEN Indefinite, accusative/genitive

CASE_INDEF_GEN Indefinite, genitive

CASE_DEF_NOM Definite, nominative

CASE_DEF_ACC Definite, accusative

CASE_DEF_ACCGEN Definite, accusative/genitive

CASE_DEF_GEN Definite, genitive

NSUFF_MASC_SG_ACC_INDEF Nominal suffix : masculine singular, accusative, indefinite

NSUFF_FEM_SG Nominal suffix : feminine singular

NSUFF_MASC_DU_NOM Nominal suffix : dual masculine, nominative

NSUFF_MASC_DU_NOM_POSS Nominal suffix : dual masculine, nominative, construct state

NSUFF_MASC_DU_ACCGEN Nominal suffix : dual masculine, accusative/genitive

NSUFF_MASC_DU_ACCGEN_POSS Nominal suffix : dual masculine, accusative/genitive, construct state


NSUFF_FEM_DU_NOM Nominal suffix : dual feminine, nominative

NSUFF_FEM_DU_NOM_POSS Nominal suffix : dual feminine, nominative, construct state

NSUFF_FEM_DU_ACCGEN Nominal suffix : dual feminine, accusative/genitive

NSUFF_FEM_DU_ACCGEN_POSS Nominal suffix : dual feminine, accusative/genitive, construct state

NSUFF_MASC_PL_NOM Nominal suffix : masculine plural, nominative

NSUFF_MASC_PL_NOM_POSS Nominal suffix : masculine plural, nominative, construct state

NSUFF_MASC_PL_ACCGEN Nominal suffix : masculine plural, accusative/genitive

NSUFF_MASC_PL_ACCGEN_POSS Nominal suffix : masculine plural, accusative/genitive, construct state

NSUFF_FEM_PL Nominal suffix : feminine plural

POSS_PRON_1S Personal suffix : 1st person singular

POSS_PRON_2MS Personal suffix : 2nd person masculine singular

POSS_PRON_2FS Personal suffix : 2nd person feminine singular

POSS_PRON_3MS Personal suffix : 3rd person masculine singular

POSS_PRON_3FS Personal suffix : 3rd person feminine singular

POSS_PRON_2D Personal suffix : 2nd person common dual

POSS_PRON_3D Personal suffix : 3rd person common dual

POSS_PRON_1P Personal suffix : 1st person plural

POSS_PRON_2MP Personal suffix : 2nd person masculine plural

POSS_PRON_2FP Personal suffix : 2nd person feminine plural

POSS_PRON_3MP Personal suffix : 3rd person masculine plural

POSS_PRON_3FP Personal suffix : 3rd person feminine plural

IVSUFF_DO:1S Imperfective verb direct object : 1st person singular

IVSUFF_DO:2MS Imperfective verb direct object : 2nd person masculine singular

IVSUFF_DO:2FS Imperfective verb direct object : 2nd person feminine singular

IVSUFF_DO:3MS Imperfective verb direct object : 3rd person masculine singular

IVSUFF_DO:3FS Imperfective verb direct object : 3rd person feminine singular

IVSUFF_DO:2D Imperfective verb direct object : 2nd person common dual

IVSUFF_DO:3D Imperfective verb direct object : 3rd person common dual

IVSUFF_DO:1P Imperfective verb direct object : 1st person plural

IVSUFF_DO:2MP Imperfective verb direct object : 2nd person masculine plural

IVSUFF_DO:2FP Imperfective verb direct object : 2nd person feminine plural

IVSUFF_DO:3MP Imperfective verb direct object : 3rd person masculine plural

IVSUFF_DO:3FP Imperfective verb direct object : 3rd person feminine plural

IVSUFF_MOOD:I Imperfective verb : indicative mood

IVSUFF_SUBJ:2FS_MOOD:I Imperfective verb : subject marker, 2nd person feminine singular, indicative mood

IVSUFF_SUBJ:D_MOOD:I Imperfective verb : subject marker, dual, indicative mood

IVSUFF_SUBJ:3D_MOOD:I Imperfective verb : subject marker, 3rd person common dual, indicative mood

IVSUFF_SUBJ:MP_MOOD:I Imperfective verb : subject marker, masculine plural, indicative mood

IVSUFF_MOOD:S Imperfective verb : subjunctive/jussive mood

IVSUFF_SUBJ:2FS_MOOD:SJ Imperfective verb : subject marker, 2nd person feminine singular, subjunctive/jussive mood

IVSUFF_SUBJ:D_MOOD:SJ Imperfective verb : subject marker, dual, subjunctive/jussive mood

IVSUFF_SUBJ:MP_MOOD:SJ Imperfective verb : subject marker, masculine plural, subjunctive/jussive mood

IVSUFF_SUBJ:3MP_MOOD:SJ Imperfective verb : subject marker, 3rd person masculine plural, subjunctive/jussive mood

IVSUFF_SUBJ:FP Imperfective verb : subject marker, feminine plural


PVSUFF_DO:1S Perfective verb direct object : 1st person singular

PVSUFF_DO:2MS Perfective verb direct object : 2nd person masculine singular

PVSUFF_DO:2FS Perfective verb direct object : 2nd person feminine singular

PVSUFF_DO:3MS Perfective verb direct object : 3rd person masculine singular

PVSUFF_DO:3FS Perfective verb direct object : 3rd person feminine singular

PVSUFF_DO:2D Perfective verb direct object : 2nd person common dual

PVSUFF_DO:3D Perfective verb direct object : 3rd person common dual

PVSUFF_DO:1P Perfective verb direct object : 1st person plural

PVSUFF_DO:2MP Perfective verb direct object : 2nd person masculine plural

PVSUFF_DO:2FP Perfective verb direct object : 2nd person feminine plural

PVSUFF_DO:3MP Perfective verb direct object : 3rd person masculine plural

PVSUFF_DO:3FP Perfective verb direct object : 3rd person feminine plural

PVSUFF_SUBJ:1S Perfective verb subject : 1st person singular

PVSUFF_SUBJ:2MS Perfective verb subject : 2nd person masculine singular

PVSUFF_SUBJ:2FS Perfective verb subject : 2nd person feminine singular

PVSUFF_SUBJ:3MS Perfective verb subject : 3rd person masculine singular

PVSUFF_SUBJ:3FS Perfective verb subject : 3rd person feminine singular

PVSUFF_SUBJ:2MD Perfective verb subject : 2nd person dual masculine

PVSUFF_SUBJ:2FD Perfective verb subject : 2nd person dual feminine

PVSUFF_SUBJ:3MD Perfective verb subject : 3rd person dual masculine

PVSUFF_SUBJ:3FD Perfective verb subject : 3rd person dual feminine

PVSUFF_SUBJ:1P Perfective verb subject : 1st person plural

PVSUFF_SUBJ:2MP Perfective verb subject : 2nd person masculine plural

PVSUFF_SUBJ:2FP Perfective verb subject : 2nd person feminine plural

PVSUFF_SUBJ:3MP Perfective verb subject : 3rd person masculine plural

PVSUFF_SUBJ:3FP Perfective verb subject : 3rd person feminine plural

CVSUFF_DO:1S Imperative verb direct object : 1st person singular

CVSUFF_DO:3MS Imperative verb direct object : 3rd person masculine singular

CVSUFF_DO:3FS Imperative verb direct object : 3rd person feminine singular


CVSUFF_DO:3D Imperative verb direct object : 3rd person common dual

CVSUFF_DO:1P Imperative verb direct object : 1st person plural

CVSUFF_DO:3MP Imperative verb direct object : 3rd person masculine plural

CVSUFF_DO:3FP Imperative verb direct object : 3rd person feminine plural

CVSUFF_SUBJ:2MS Imperative verb subject : 2nd person masculine singular

CVSUFF_SUBJ:2FS Imperative verb subject : 2nd person feminine singular

CVSUFF_SUBJ:2MP Imperative verb subject : 2nd person masculine plural