A Comparison of Stemmers on Source Code Identifiers for Software Search
Andrew Wiese, Valerie Ho, Emily Hill
Montclair State University
Thursday, October 6, 2011
Problem: Source Code Search
• Challenge: query words may not exactly match source code words, which hurts search
• Example: “add item” query should match
• add, adds, adding, added
• item, items
• Stemming used by Information Retrieval (IR) systems to strip suffixes
• reduce all words to root form, or stem
• a.k.a. word conflation
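The idea can be made concrete with a minimal suffix-stripping sketch; the suffix list and the `toy_stem` name are illustrative only, not any real stemmer's rules:

```python
# A toy suffix-stripping stemmer (illustrative only; real stemmers such as
# Porter's apply many ordered, condition-guarded rules).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

# The "add item" query words now conflate with their variants:
# add, adds, adding, added -> add;  item, items -> item
```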
What makes stemming source code different from traditional IR?
• Word choice more restrictive in naming identifiers than in natural language (NL) documents
• NL: stem, stems, stemmer, stemming, stemmed
• Code: stem, stemmer
• Classes that encapsulate actions have names with nominalized verbs:
• play → player
• compile → compiler
• Traditional IR prefers the light Porter stemmer
• tends not to stem across parts of speech
• E.g., noun ‘player’ will not stem to verb ‘play’
Stemming Challenges
• Understemming
• stemmer assigns different stems to words in the same concept
• reduces number of relevant results in search (i.e., reduces recall)
• Overstemming
• stemmer assigns the same stem for words with different meanings (e.g., business conflated with busy, university with universe)
• increases number of irrelevant results (i.e., reduces precision)
• Stemmers categorized by type of error
• Light stemmers: understem
• Heavy stemmers: overstem
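The two failure modes can be demonstrated with two deliberately extreme toy stemmers (both hypothetical, for illustration only):

```python
def light_stem(word):
    # Very light: strips only a plural 's', so it never crosses parts of speech.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def heavy_stem(word):
    # Very heavy: crude truncation to the first 7 letters.
    return word[:7]

# Understemming (light): 'stem' and 'stemmer' keep different stems, so a
# search for one misses the other (lower recall).
# Overstemming (heavy): 'university' and 'universe' both truncate to
# 'univers', so unrelated results match (lower precision).
```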
A Brief History of Stemming
• Light Stemmers (tend not to stem across parts of speech)
• Porter (1980): rule-based, simple & efficient
• Most popular stemmer in IR & SE
• Snowball (2001): minor rule improvements
• KStem (1993): morphology-based
• based on word’s structure & hand-tuned dictionary
• in experiments shown to outperform Porter’s
• Heavy Stemmers
• Lovins (1968): rule-based
• Paice (1990): rule-based
• MStem: morphological (PC-Kimmo), specialized for source code using word frequencies
Our Contribution
• Compare performance of 5 stemmers on source code identifiers
• Evaluation 1: compare conflated word classes
• started from 100 most frequently occurring words in 9,000 open source Java programs
• analyzed by 2 human Java programmers in terms of accuracy & completeness
• Evaluation 2: compare effect of using 5 stemmers vs not stemming on 8 search tasks
Stemmer Word Classes Comparison
• accurate: word class contains no unrelated words
• complete: word class not missing related words (rely on greediness & diversity of stemmers)
• context sensitive (CS): multiple senses or disagreement
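Under these definitions, a stemmer's word class can be judged mechanically against a human-judged set of related words (in the study, completeness was judged relative to the union of all stemmers' classes; the function name below is illustrative):

```python
def evaluate_word_class(word_class, related):
    """Return (accurate, complete) for a stemmer's word class.

    accurate: no unrelated words, i.e. the class is a subset of `related`
    complete: no related word missing, i.e. `related` is a subset of the class
    """
    cls, rel = set(word_class), set(related)
    return cls <= rel, rel <= cls

# e.g. with related = {element, elemental, elements}: a class of just
# {element} is accurate but incomplete (understemming); a class that also
# pulls in the unrelated word 'else' is complete but inaccurate.
```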
[Bar chart: No. Accurate & Complete word classes (0–100) per stemmer — None, Porter, Paice, Snowball, KStem, MStem — with context-sensitive (CS) cases shown separately; labeled percentages: 58%, 53%, 37%, 32%, 29%]
Word Classes Example
• Stemmer comparison for 2 examples
• Underlined words in all stemmer classes

Table I. Stemmer word class comparisons for 4 examples (underlined words are in the word classes for all stemmers). The parenthetical after each word names the stemmer(s) that were accurate & complete (A & C) for it.

element (A & C: MStem)
  Porter:   element, elemental, elemente, elements
  Snowball: element, elemental, elemente, elements
  KStem:    element
  MStem:    element, elemental, elements
  Paice:    el, ela, ele, element, elemental, elementary, elemente, elementen, elements, elen, eles, eli, elif, elise, elist, ell, elle, ellen, eller, els, else, elseif, elses, elsif

import (A & C: KStem)
  Porter:   import, importable, importance, important, imported, importer, importers, importing, imports
  Snowball: import, importable, importance, important, importantly, imported, importer, importers, importing, imports
  KStem:    import, importable, imported, importer, importers, importing, imports
  MStem:    import, importable, importance, important, importantly, imported, importer, importers, importing, imports
  Paice:    import, importable, importance, important, importantly, importar, imported, importer, importers, importing, imports

add (A & C: context sensitive)
  Porter:   add, adde, addes, adds
  Snowball: add, adde, addes, adds
  KStem:    add, addable, added, addes, adding, adds
  MStem:    add, addable, added, adder, adding, addition, additional, additionally, additions, additive, additivity, adds
  Paice:    ad, ada, add, addable, adde, added, adder, addes, adding, adds, ade, ads

name (A & C: none)
  Porter:   name, named, namely, names, naming
  Snowball: name, named, namely, names, naming
  KStem:    name, nameable, named, namer, names, naming
  MStem:    name, named, nameless, namely, namer, names, naming, surname
  Paice:    name, nameable, namely, names
In contrast, the sense of ‘add’ being used to join something to a list is not typically related to ‘addition’. The word classes for ‘add’, as well as 3 other examples, are shown in Table I.

Overall, the annotators found the morphological parsers MStem and KStem to be the most accurate. The results of these two subjects indicate that morphology may be more important than degree of under- or overstemming, since MStem is a heavy stemmer and KStem light. MStem was the only accurate and complete stemmer for 12 of the words, whereas KStem was accurate and complete for 11. In contrast, the rule-based stemmers Porter and Snowball were uniquely accurate and complete stemmers for 2 words, and Paice 6. Of the rule-based stemmers, light Snowball has a clear advantage over light Porter and heavy Paice overall.

As expected with heavy stemmers, MStem and Paice both tend to overstem, although for different reasons. MStem frequently stems across different parts of speech, which generally leads to increased completeness. However, occasionally this tendency conflates words that do not represent the same concept, as in conflating the adjective ‘true’ with the adverb ‘truly’ and noun ‘truth’. In contrast, Paice frequently conflates unrelated words, such as ‘element’ with ‘else’ and ‘static’ with ‘state’, ‘statement’, ‘station’, ‘stationary’, ‘statistic’, and ‘status’.

The annotators observed a difference between the morphological stemmers (MStem and KStem) and the rule-based stemmers (Porter, Paice, and Snowball), which frequently and inaccurately associated non-words or foreign-language words. For example, all 3 rule-based stemmers conflated ‘method’ with French ‘methode’ and ‘methodes’; ‘panel’ with Spanish ‘paneles’; and ‘any’ with non-words ‘anys’ and, in the case of Porter and Snowball, ‘ani’. MStem and KStem were less prone to these errors because MStem uses word frequencies to eliminate unlikely stems, and KStem uses an English dictionary.
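KStem's dictionary safeguard can be sketched in a few lines: accept a candidate stem only if it is a known English word, otherwise leave the word unchanged. The tiny `DICTIONARY` and `safe_stem` below are stand-ins for KStem's hand-tuned lexicon, not its actual rules:

```python
# Tiny stand-in for KStem's English dictionary.
DICTIONARY = {"method", "panel", "any", "import", "name"}

def safe_stem(word, suffixes=("es", "s", "e")):
    """Strip a suffix only if the resulting stem is a dictionary word."""
    for suf in suffixes:
        if word.endswith(suf):
            candidate = word[: -len(suf)]
            if candidate in DICTIONARY:
                return candidate
    return word

# 'methodes' -> 'method' and 'paneles' -> 'panel', but a non-word stem
# like 'ani' can never be produced, since it is not in the dictionary.
```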
C. Threats to Validity

Because the words were selected exclusively from Java programs, these results may not generalize to all programming languages. MStem was trained on the same set of 9,000+ Java programs that were used to create the 100 most frequent word set annotated by the human evaluators. Due to the large size of the entire word set (over 700,000 words), it is unlikely that MStem was over-trained on the subset of 100 words. Since completeness is based on the union of word classes created by the stemmers, the observations may not generalize to all morphological and rule-based stemmers. Because determining accuracy and completeness can be ambiguous, we limited this threat by separating out the ‘context sensitive’ examples in our analysis.
III. EFFECT OF STEMMING ON SOURCE CODE SEARCH
In this section, we compare the effect of using Porter, Snowball, KStem, Paice, and MStem with no stemming (None) on searching source code.
A. Study Design

To compare the effect of stemming on software search, we use the common tf-idf scoring function [9] to score a method’s relevance to the query. Tf-idf multiplies two component scores together: term frequency (tf) and inverse document frequency (idf). The intuition behind tf is that the more frequently a word occurs in a method, the more relevant the method is to the query. In contrast, idf dampens the tf by how frequently the word occurs in the code base. Because we recalculate idf values for each program and stemmer combination, the tf-idf scores can vary widely between heavy and light stemmers.
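A minimal sketch of this scoring follows; the method names and word lists are hypothetical, and the study's actual implementation may differ in weighting details:

```python
import math

def tfidf_scores(query_words, methods):
    """Score each method's relevance to the query with tf-idf.

    methods: dict mapping a method name to its list of (stemmed) words.
    """
    n = len(methods)
    # document frequency: in how many methods each query word occurs
    df = {w: sum(w in words for words in methods.values()) for w in query_words}
    scores = {}
    for name, words in methods.items():
        score = 0.0
        for w in query_words:
            if df[w] == 0:
                continue  # query word absent from the whole code base
            tf = words.count(w)           # more occurrences -> more relevant
            idf = math.log(n / df[w])     # dampens words common in the code base
            score += tf * idf
        scores[name] = score
    return scores
```

Because idf depends on the whole corpus, restemming the corpus changes the df counts, which is why the scores must be recomputed per program-and-stemmer combination.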
We use 8 of 9 concerns and queries from a previous source code search study of 18 developers [10]. For one concern, no subject was able to formulate a query returning any relevant results, leaving us with 8 concerns. For each concern, 6 developers formulated queries, totaling 48 queries, 29 of which are unique. The concerns are mapped at the method
Stemming and Source Code Search
• search technique: tf-idf
• search tasks: 8 with 48 queries from prior study [Shepherd et al. ’07]
• Paice: overstemming & understemming mistakes improved results for 2 tasks (e.g., textfield report element)
[Boxplot: Area Under the Curve (0.5–1.0) per stemmer: NoStem, Porter, Snowbl, KStem, MStem, Paice]
Conclusion
• Morphological stemmers appear to be more accurate & complete than rule-based
• In search, stemming more consistently produces relevant results than not stemming
• Heavy stemmers like MStem & Paice appear to be more effective in searching source code than light stemmers like Porter
• Future work: more examples (less frequent & more domain-specific), more human judgements, more search tasks, other SE tasks beyond search