Text Classification using Machine Learning
Student: Hung Vo
Course: CP-SC 881
Instructor: Professor Luo Feng
Clemson University
04/27/2011

Text classification
Applications
Topic spotting
Email routing
Language guessing
Webpage type classification
A product review classification task

Methods
Naive Bayes classifier
Tf-idf
Latent semantic indexing
Support vector machines (SVM)
Artificial neural network
kNN
Decision trees, such as ID3 or C4.5
Concept mining
Rough set based classifier
Soft set based classifier

20 topics, ~1000 documents each
74.49% correct

Project
Build Naive Bayes classifier from scratch
MS Visual C# 2008
MS .NET Framework 3.5
Data set
Training Data
Test Data

Approach
[Diagram labels: Learning Algorithm, Learn Model, Model, Apply Model]

Training Data
ID   Att1  Att2  Cat
1    Yes   0.5   A
...
N    No    0.7   B

Test Data (before applying the model)
ID   Att1  Att2  Cat
1    No    0.8   ?
...
M    Yes   0.2   ?

Test Data (after applying the model)
ID   Att1  Att2  Cat
1    No    0.8   B
...
M    Yes   0.2   A

Model
Based on words
Probability of words in document/category
Need to extract important words first, then build parameters

Building Model
Tokenize
Remove stop words
Stemmer
Calculate parameters

Tokenize
Collect all data in one category
Tokenize
[Figure: all text in a category]
[Figure: tokenized result]
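
The tokenizing step can be sketched as follows. This is a minimal Python illustration, not the project's C# code; the function name `tokenize` and the rule of keeping only lowercase alphabetic runs are assumptions made for the example.

```python
import re

def tokenize(text):
    """Split raw text into lowercase word tokens.

    Assumed rule: a token is a run of letters a-z; digits and
    punctuation act as separators.
    """
    return re.findall(r"[a-z]+", text.lower())

# All text of one category is concatenated first, then tokenized.
category_text = "The quick brown fox. The lazy dog!"
tokens = tokenize(category_text)
print(tokens)
```

For the sample string this prints `['the', 'quick', 'brown', 'fox', 'the', 'lazy', 'dog']`.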
Remove Stop Words
Stop words
High frequency
Appear in almost all documents
Least important
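
Stop-word removal can be sketched as a simple filter. The stop-word list below is a tiny hypothetical sample chosen for illustration; the list actually used in the project is not given in the slides.

```python
# Hypothetical, tiny stop-word list for illustration only;
# real lists usually hold a few hundred words.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "price", "of", "a", "graphics", "card", "is", "high"]
print(remove_stop_words(tokens))
```

For the sample tokens this prints `['price', 'graphics', 'card', 'high']`. A set is used so that each membership check takes constant time on average.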
[Figure: stop words removed]

Stemmer
Remove prefix, suffix
Return the base, root or stem of words
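
A toy suffix stripper gives the flavor of the stemming step. It is only a sketch: the suffix list and the minimum stem length are assumptions, prefixes are not handled, and the slides do not say which stemming algorithm the project used.

```python
# Assumed suffix list, checked longest first; a real stemmer
# (for example the Porter stemmer) has many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(token, min_stem_len=3):
    """Strip the first matching suffix if enough characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token

print([stem(t) for t in ["testing", "classes", "words", "is"]])
```

For the sample tokens this prints `['test', 'class', 'word', 'is']`. The minimum stem length keeps short words such as "is" from being reduced to a single letter.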
[Figure: stemmed tokens]

Vocabulary and frequency
Sort
Count
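
Building the vocabulary with frequencies amounts to counting the stemmed tokens and sorting the result. The sketch below uses Python's `collections.Counter`; the function name `build_vocabulary` and the tie-breaking order are assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(tokens):
    """Count each token and return (token, count) pairs, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

stemmed = ["card", "price", "card", "driver", "price", "card"]
for token, count in build_vocabulary(stemmed):
    print(token, count)
```

For the sample tokens this prints `card 3`, `price 2`, and `driver 1`, one pair per line.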
[Figure: vocabulary of first category]

Remove low frequency tokens
Drop tokens/terms that appear fewer times than a threshold
Chosen threshold: 1069604 -> 12881
Faster processing

Calculate parameters
Model
D: set of documents
N = #d(D)
V: set of vocabulary/tokens/terms
C: set of categories
For each category c in C
    Nc: number of documents in c
    prior[c] = Nc/N
    textc: all text in category c
return V, prior, condprob

Apply model for text classification
Need: C, V, prior, condprob, d
d: document to be classified
W: tokens extracted from (V, d)
foreach c in C
    do score[c] = log(prior[c])
    foreach t in W
        do score[c] += log(condprob[t,c])
return argmax over c in C of score[c]

Result
74.49%
Near topic

Demo
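
The low-frequency pruning, the parameter calculation, and the classification loop above can be sketched together in Python. Several details are assumptions, because the slides do not state them: `condprob` uses add-one (Laplace) smoothing, the threshold is applied to token counts summed over all categories, and the names `train`, `classify`, and `min_count` are hypothetical. The project itself was written in C#.

```python
import math
from collections import Counter

def train(docs_by_category, min_count=1):
    """docs_by_category maps a category to a list of token lists (one per document)."""
    n_docs = sum(len(docs) for docs in docs_by_category.values())  # N
    # Token counts per category (built from text_c in the pseudocode).
    counts = {c: Counter(t for doc in docs for t in doc)
              for c, docs in docs_by_category.items()}
    # Vocabulary V: drop tokens whose total count is below the threshold (assumed rule).
    total = Counter()
    for cat_counts in counts.values():
        total.update(cat_counts)
    vocab = {t for t, n in total.items() if n >= min_count}

    prior, condprob = {}, {}
    for c, docs in docs_by_category.items():
        prior[c] = len(docs) / n_docs  # Nc / N
        # Assumption: add-one smoothing, so unseen tokens never give log(0).
        denom = sum(counts[c][t] for t in vocab) + len(vocab)
        for t in vocab:
            condprob[t, c] = (counts[c][t] + 1) / denom
    return vocab, prior, condprob

def classify(categories, vocab, prior, condprob, doc_tokens):
    """Return the category with the highest log score for doc_tokens."""
    words = [t for t in doc_tokens if t in vocab]  # W: tokens of d that are in V
    score = {}
    for c in categories:
        score[c] = math.log(prior[c])
        for t in words:
            score[c] += math.log(condprob[t, c])
    return max(categories, key=lambda c: score[c])

# Made-up toy data, already tokenized, filtered, and stemmed.
training = {
    "hardware": [["card", "driver", "card"], ["disk", "driver"]],
    "sports": [["game", "team"], ["team", "score", "game"]],
}
vocab, prior, condprob = train(training)
print(classify(list(training), vocab, prior, condprob, ["driver", "card", "unknown"]))
```

On this toy data the call prints `hardware`; the token "unknown" is ignored because it is not in the vocabulary. Summing logarithms instead of multiplying probabilities, as the pseudocode does, avoids floating-point underflow on long documents.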
Future Work
More methods
Compare them
Build different models
Threshold
Word phrase

Thank you