stemming technology e-business technologies prof. dr. eduard heindl by ajay singh

Stemming Technology

E-Business Technologies

Prof. Dr. Eduard Heindl

By Ajay Singh

Introduction

Information Retrieval (IR) is the arrangement of documents in a collection to meet a user's need for information.

Representation: the information need is expressed as a query or profile.

A query consists of one or more search terms, optionally with importance weights.

A query has index terms (important words or phrases).

The retrieval decision may be binary (retrieve/reject) or expressed as a degree of relevance.

Definition

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. The process of stemming is often called conflation. Programs that perform it are commonly referred to as stemming algorithms or stemmers.

Utility

The process of stemming is useful in search engines for:

Query expansion.

Indexing.

Natural language processing.

Algorithms

There are several types of stemming algorithms, which differ with respect to performance, accuracy, and how certain stemming obstacles are overcome.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

Brute Force Algorithms

These stemmers employ a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the associated root form is returned.
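A minimal sketch of the table lookup described above. The table contents and the name `brute_force_stem` are illustrative assumptions, and returning the word unchanged when no entry is found is one possible fallback, not something the slides specify.

```python
# Hypothetical lookup table mapping inflected forms to root forms.
# A real table would need an entry for every inflection to be handled.
LOOKUP_TABLE = {
    "cats": "cat",
    "fishing": "fish",
    "fished": "fish",
    "ran": "run",
    "running": "run",
}


def brute_force_stem(word, table=LOOKUP_TABLE):
    """Return the root form stored for `word`, or the word itself if absent."""
    key = word.lower()
    # Assumed fallback: an unknown word is returned unchanged.
    return table.get(key, key)


if __name__ == "__main__":
    for w in ["Cats", "fished", "ran", "stemming"]:
        print(w, "->", brute_force_stem(w))
```

Every word the stemmer should handle needs its own entry, which is why the table grows large and has to be maintained.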

Advantages

Fewer stemming errors.

User friendly.

Problems

They lack the elegance to converge to the result fast.

Time consuming.

Back-end updating.

Difficult to design.

Suffix Stripping Algorithms

Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:

if the word ends in 'ed', remove the 'ed'

if the word ends in 'ing', remove the 'ing'

if the word ends in 'ly', remove the 'ly'
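The three example rules can be sketched as follows. The function name `strip_suffix`, the rule order, and the guard that keeps at least one character are assumptions added for illustration.

```python
# The three example rules, tried in order; the first matching rule wins.
SUFFIX_RULES = ("ed", "ing", "ly")


def strip_suffix(word, rules=SUFFIX_RULES):
    """Remove the first matching suffix from `word`, if any."""
    for suffix in rules:
        # Added guard (an assumption): never strip the entire word away.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["fished", "fishing", "quickly", "fish", "stemming"]:
        print(w, "->", strip_suffix(w))
```

With these rules alone, "stemming" becomes "stemm" rather than "stem", which shows why practical suffix strippers carry additional rules.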

Benefits

Simple

Lemmatisation Algorithms

The more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech.

This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information about the word to be stemmed, then we are able to more accurately apply normalization rules (which are, more or less, suffix stripping rules).
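A small sketch of part-of-speech dependent normalization. It assumes the part of speech has already been determined and is passed in by the caller; the tag names, the rules, and the function `lemmatise` are hypothetical and only illustrate the idea.

```python
# Hypothetical normalization rules per part of speech: (suffix, replacement).
RULES_BY_POS = {
    "noun": [("ies", "y"), ("s", "")],
    "verb": [("ing", ""), ("ed", ""), ("s", "")],
    "adverb": [("ly", "")],
}


def lemmatise(word, pos):
    """Apply the first matching rule for the given part of speech."""
    for suffix, replacement in RULES_BY_POS.get(pos, []):
        # Keep at least one character of the word (an added guard).
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    # Unknown category or no matching rule: leave the word unchanged.
    return word


if __name__ == "__main__":
    print(lemmatise("building", "noun"))  # treated as a noun
    print(lemmatise("building", "verb"))  # treated as a verb
```

The same string "building" is left alone as a noun but reduced to "build" as a verb, so a wrong category directly produces a wrong result.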

Hybrid Approaches

Hybrid approaches use two or more of the approaches described above in unison. A simple example is an algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.
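The hybrid scheme can be sketched as an exception lookup followed by suffix stripping. Apart from "ran => run", the exception entries, the suffix list, and the minimum stem length are assumptions chosen for illustration.

```python
# Small table of frequent exceptions; only "ran" comes from the text above,
# the other entries are hypothetical.
EXCEPTIONS = {"ran": "run", "went": "go", "mice": "mouse"}

# Hypothetical suffix list for the fallback step.
SUFFIXES = ("ing", "ed", "ly", "s")


def hybrid_stem(word):
    """Consult the exception table first, then fall back to suffix stripping."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Require at least two remaining characters (an added guard).
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["ran", "fished", "cats", "fish"]:
        print(w, "->", hybrid_stem(w))
```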

Affix Stemmers

In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name affix stripping.
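A sketch of affix stripping applied to the example word. The prefix and suffix lists, the `min_stem` threshold, and the function name `strip_affixes` are assumptions; a real affix stemmer would need far more careful conditions.

```python
PREFIXES = ("in", "un")          # hypothetical prefix list
SUFFIXES = ("ly", "ing", "ed")   # hypothetical suffix list


def strip_affixes(word, min_stem=3):
    """Remove at most one known prefix and one known suffix from `word`."""
    for prefix in PREFIXES:
        # Only strip if at least `min_stem` characters remain.
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[: -len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(strip_affixes("indefinitely"))
    print(strip_affixes("inch"))
```

A purely string-based prefix rule also fires on words such as "inside", where "in" is not a removable prefix, which is one of the obstacles affix stemmers have to handle.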

Matching Algorithms

These algorithms use a stem database (for example a set of documents that contain stem words). These stems are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of such words as "be", "been" and "being", would not be considered as the stem of the word "beside").
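A sketch of stem matching with a relative-length constraint. The stem database, the threshold `min_ratio`, and the choice to match stems only at the start of the word are assumptions made for this illustration.

```python
# Hypothetical stem database; stems need not be valid words ("brows").
STEM_DATABASE = {"brows", "be", "fish"}


def match_stem(word, stems=STEM_DATABASE, min_ratio=0.35):
    """Return the longest stored stem that starts `word` and covers at least
    `min_ratio` of its length; return the word itself if none qualifies."""
    candidates = [
        s for s in stems
        if word.startswith(s) and len(s) / len(word) >= min_ratio
    ]
    return max(candidates, key=len) if candidates else word


if __name__ == "__main__":
    for w in ["browse", "browsing", "been", "being", "beside"]:
        print(w, "->", match_stem(w))
```

With this threshold, "be" covers 2 of 5 letters in "being" and is accepted, but only 2 of 6 letters in "beside" and is rejected.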

Multilingual Stemming

Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.

Challenges

Hebrew and Arabic are tough languages for stemming.

The morphology, orthography, and character encoding of the target language make stemmer design more complex in some languages.

An Italian stemmer is more complex than an English one (because of more possible verb inflections), and a Russian one is more complex still (more possible noun declensions).

A Hebrew one is even more complex (due to non-concatenative morphology and a writing system without vowels).

A stemmer for Hungarian is easier to implement due to the precise rules in the language for inflection.

Stemming algorithm for German language

Recently a stemming algorithm for morphologically complex languages like German or Dutch was presented. The main idea is not to use stems as common forms in order to make the algorithm simple and fast.

The algorithm consists of two steps:

First, certain characters and/or character sequences are substituted. This step takes linguistic rules and statistical heuristics into account.

Second, a very simple, context-free suffix-stripping algorithm is applied.

Three variations of the algorithm are described: the simplest one can easily be implemented with 50 lines of C++ code, while the most complex one requires about 100 lines of code and a small wordlist. Speed and quality of the algorithm can be scaled by applying further linguistic rules and statistical heuristics.
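The two-step structure (substitute, then strip) can be sketched as below. The substitution pairs, the suffix list, and the minimum length are illustrative assumptions only; they are not the rules of the algorithm referred to above, which the slides do not list.

```python
# Step 1 substitutions (assumed for illustration): fold umlauts and sharp s.
SUBSTITUTIONS = (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss"))

# Step 2 suffixes (assumed for illustration), longer suffixes first.
SUFFIXES = ("en", "er", "es", "e", "n", "s")


def two_step_stem(word, min_length=4):
    """Substitute characters, then strip one suffix without using context."""
    word = word.lower()
    for old, new in SUBSTITUTIONS:
        word = word.replace(old, new)
    for suffix in SUFFIXES:
        # Keep at least `min_length` characters (an assumed guard).
        if word.endswith(suffix) and len(word) - len(suffix) >= min_length:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print(two_step_stem("Häuser"), two_step_stem("Haus"))
```

Because the substitution step removes the umlaut before stripping, the singular and plural forms in the example end up with the same common form.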

Performance

Direct Assessment

The most primitive method for assessing the performance of a stemmer is to examine its behaviour when applied to samples of words, especially words which have already been arranged into 'conflation groups'. This way, specific errors (e.g., failing to merge "maintained" with "maintenance", or wrongly merging "experiment" with "experience") can be identified, and the rules adjusted accordingly. This approach is of very limited utility on its own, but can be used to complement other methods, such as the error-counting approach outlined later.

Components

Information Retrieval Components.

Precision

Recall

Fall-Out

F-measure
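The four measures listed above can be computed from a set of retrieved documents and a set of relevant documents. This sketch uses the standard textbook definitions; the function name `retrieval_metrics`, the document IDs, and the balanced form of the F-measure are assumptions, since the slides only name the measures.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Compute precision, recall, fall-out and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant and retrieved
    false_alarms = len(retrieved - relevant)    # retrieved but not relevant
    non_relevant = collection_size - len(relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fall_out = false_alarms / non_relevant if non_relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fall_out": fall_out,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical run: 10 documents, 3 relevant, 4 retrieved.
    print(retrieval_metrics({1, 2, 3, 4}, {2, 4, 5}, collection_size=10))
```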

Error counting

Stemming can be evaluated by counting two kinds of errors that occur during stemming:

Under-Stemming

This refers to words that should be grouped together by stemming, but aren't. This causes a single concept to be spread over various different stems, which will tend to decrease the Recall in an IR search.

Over-Stemming

This refers to words that shouldn't be grouped together by stemming, but are. This causes the meanings of the stems to be diluted, which will affect the Precision of IR.

Using a sample file of grouped words, these errors are then counted.

Mathematical Notation

There is a method that returns a value for an Under-Stemming (or Conflation) index:

UI = Under-Stemming Index

CI = Conflation Index: proportion of equivalent word pairs which were successfully grouped to the same stem.

UI = 1 - CI

Also the value for an Over-Stemming (or Distinctness) index:

OI = Over-Stemming Index

DI = Distinctness Index: proportion of non-equivalent word pairs which remained distinct after stemming.

OI = 1 - DI
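A sketch of how CI, DI, UI and OI could be counted from a sample file of grouped words. The function name `stemming_indices`, the sample groups, and the toy stemmer that keeps the first six letters are assumptions for illustration.

```python
from itertools import combinations


def stemming_indices(groups, stem):
    """Count word pairs to obtain CI, DI, UI = 1 - CI and OI = 1 - DI.

    `groups` is a list of lists; words in the same list are equivalent.
    """
    labelled = [(w, g) for g, words in enumerate(groups) for w in words]
    equal_total = equal_merged = diff_total = diff_distinct = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        merged = stem(w1) == stem(w2)
        if g1 == g2:                 # equivalent pair: should be merged
            equal_total += 1
            equal_merged += merged
        else:                        # non-equivalent pair: should stay apart
            diff_total += 1
            diff_distinct += not merged
    ci = equal_merged / equal_total if equal_total else 1.0
    di = diff_distinct / diff_total if diff_total else 1.0
    return {"CI": ci, "DI": di, "UI": 1 - ci, "OI": 1 - di}


if __name__ == "__main__":
    sample = [
        ["maintained", "maintenance"],
        ["experiment", "experiments"],
        ["experience", "experienced"],
    ]
    # Toy stemmer (assumption): keep only the first six letters.
    print(stemming_indices(sample, lambda w: w[:6]))
```

In this sample the toy stemmer fails to merge "maintained" with "maintenance" (under-stemming) and merges the "experiment" group with the "experience" group (over-stemming), so both UI and OI are non-zero.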

Stemmer Strength

Number of words per conflation class

This is the average size of the groups of words converted to a particular stem (regardless of whether they are all correct).

This metric is obviously dependent on the number of words processed, but for a word collection of given size, a higher value indicates a heavier stemmer. The value is calculated as follows:

MWC = Mean number of words per conflation class

N = Number of unique words before stemming

S = Number of unique stems after stemming

MWC = N / S

Index Compression

The Index Compression Factor represents the extent to which a collection of unique words is reduced (compressed) by stemming, the idea being that the heavier the stemmer, the greater the Index Compression Factor. This can be calculated by:

ICF = Index Compression Factor

N = Number of unique words before stemming

S = Number of unique stems after stemming

ICF = (N - S) / N
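Both strength measures, MWC = N/S and ICF = (N - S)/N, depend only on the counts N and S, so they can be computed together. The function name `stemmer_strength`, the word list, and the toy stemmer are assumptions for illustration.

```python
def stemmer_strength(words, stem):
    """Return N, S, MWC = N / S and ICF = (N - S) / N for a word list."""
    unique_words = set(words)
    if not unique_words:
        raise ValueError("at least one word is required")
    unique_stems = {stem(w) for w in unique_words}
    n, s = len(unique_words), len(unique_stems)
    return {"N": n, "S": s, "MWC": n / s, "ICF": (n - s) / n}


if __name__ == "__main__":
    # Toy stemmer (assumption): a fixed mapping for the sample words.
    roots = {"fishing": "fish", "fished": "fish", "fisher": "fish", "cats": "cat"}
    words = ["fishing", "fished", "fish", "fisher", "cats", "cat"]
    print(stemmer_strength(words, lambda w: roots.get(w, w)))
```

Here six unique words collapse to two stems, giving MWC = 3 and ICF = 4/6; a heavier stemmer would raise both values.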

Applications

Information retrieval

Usage in commercial products.

