Text Classification Using Machine Learning
Student: Hung Vo
Course: CP-SC 881
Instructor: Professor Luo Feng
Clemson University
04/27/2011


Text classification

Applications

Topic spotting
Email routing
Language guessing
Webpage type classification
Product review classification

Methods

Naive Bayes classifier
Tf-idf
Latent semantic indexing
Support vector machines (SVM)
Artificial neural network
kNN
Decision trees, such as ID3 or C4.5
Concept mining
Rough set based classifier
Soft set based classifier

20 topics, ~1000 documents each
74.49% correct

Project

Build Naive Bayes classifier from scratch
MS Visual C# 2008
MS .NET Framework 3.5

Data set

Training Data

ID   Att1   Att2   Cat
1    Yes    0.5    A
N    No     0.7    B

Test Data

ID   Att1   Att2   Cat
1    No     0.8    ?
M    Yes    0.2    ?

Approach

Diagram: Training Data -> Learning Algorithm -> Learn Model -> Model -> Apply Model -> Test Data

Test Data (after applying the model)

ID   Att1   Att2   Cat
1    No     0.8    B
M    Yes    0.2    A

Model

Based on words
Probability of words in document/category
Need to extract important words first, then build parameters

Building Model

Tokenize
Remove stop words
Stemmer
Calculate parameters

Tokenize

Collect all data in one category

Tokenize

Figures: all text in a category; tokenized result
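
A minimal sketch of the tokenize step, written in Python for brevity (the project itself was built in C#). The function name `tokenize` and the rule that a token is a run of ASCII letters are assumptions for illustration; the slides do not say how tokens were delimited.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    # Assumption: a token is a maximal run of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

category_text = "The team won the game. The game was close!"
tokens = tokenize(category_text)
print(tokens)
```

Digits and punctuation are dropped by this rule; a different delimiter choice would change the vocabulary that later steps see.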

Remove Stop Words

Stop words
High frequency
Appear in almost all documents
Least important
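
Stop word removal can be sketched as a simple filter. This is a Python illustration, not the project's C# code; the `STOP_WORDS` set below is a tiny hand-picked assumption, since the slides do not give the list that was used.

```python
# Assumption: a tiny illustrative stop list; real lists hold a few hundred words.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words(["the", "team", "won", "the", "game"])
print(filtered)
```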

Figure: stop words removed

Stemmer

Remove prefixes and suffixes
Return the base, root, or stem of each word
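
To make the stemming idea concrete, here is a toy suffix stripper in Python. It is an assumption-laden sketch, not the stemmer used in the project (the slides do not name one): the `SUFFIXES` list and the `min_stem` parameter are hypothetical, and prefixes are not handled at all.

```python
# Assumption: a short illustrative suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [simple_stem(w) for w in ["playing", "played", "boxes", "wins", "win"]]
print(stems)
```

A rule-based stemmer such as Porter's applies many more rules and conditions; this version only shows how different surface forms collapse onto one token.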

Figure: stemmed tokens

Vocabulary and frequency

Sort
Count
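
The sort and count step can be sketched in Python with `collections.Counter`. The function name `build_vocabulary` and the alphabetical tie-break are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(tokens):
    """Count each token and return (token, count) pairs, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

vocab = build_vocabulary(["game", "team", "game", "win", "team", "game"])
print(vocab)
```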

Figure: vocabulary of first category

Remove low frequency tokens

Drop tokens/terms that appear fewer times than a threshold
Chosen threshold: 1069604 -> 12881
Faster processing

Calculate parameters

Model

D: set of documents
N = #d(D)
V: set of vocabulary/tokens/terms
C: set of categories
For each category c in C
    Nc: number of documents in c
    prior[c] = Nc/N
    textc: all text in category c
return V, prior, condprob

Apply model for text classification

Need: C, V, prior, condprob, d
d: document to be classified
W: extracted tokens from (V, d)
foreach c in C
    do score[c] = log(prior[c])
       foreach t in W
           do score[c] += log(condprob[t,c])
return argmax over c in C of score[c]

Result

74.49%
Near topic

Demo
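
The training and classification pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's C# implementation: the slides do not show how condprob is computed, so add-one (Laplace) smoothing is assumed here, and the names `train`, `classify`, and `min_count` are hypothetical. Input documents are assumed to be already tokenized, filtered, and stemmed.

```python
import math
from collections import Counter

def train(docs_by_category, min_count=1):
    """docs_by_category maps a category name to a list of token lists."""
    n_docs = sum(len(docs) for docs in docs_by_category.values())
    # Token counts per category (the slides' "textc", already tokenized).
    counts = {c: Counter(t for doc in docs for t in doc)
              for c, docs in docs_by_category.items()}
    total = Counter()
    for category_counts in counts.values():
        total.update(category_counts)
    # Drop tokens that occur fewer than min_count times overall.
    vocab = {t for t, n in total.items() if n >= min_count}
    prior = {c: len(docs) / n_docs for c, docs in docs_by_category.items()}
    condprob = {}
    for c in docs_by_category:
        # Assumption: add-one smoothing; the slides do not show this formula.
        denom = sum(counts[c][t] for t in vocab) + len(vocab)
        for t in vocab:
            condprob[t, c] = (counts[c][t] + 1) / denom
    return vocab, prior, condprob

def classify(categories, vocab, prior, condprob, doc_tokens):
    """Return the category with the highest log score."""
    words = [t for t in doc_tokens if t in vocab]
    score = {}
    for c in categories:
        score[c] = math.log(prior[c])
        for t in words:
            score[c] += math.log(condprob[t, c])
    return max(score, key=score.get)

# Made-up toy data, only to exercise the two functions.
training = {
    "sports": [["game", "team", "win"], ["team", "score", "game"]],
    "tech": [["code", "bug", "compile"], ["code", "test", "bug"]],
}
vocab, prior, condprob = train(training)
label = classify(list(training), vocab, prior, condprob, ["team", "game", "bug"])
print(label)
```

Summing logarithms instead of multiplying probabilities avoids floating-point underflow on long documents, which is why the pseudocode scores with log(prior) and log(condprob).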

Future Work

More methods
Compare them
Build different models
Threshold
Word phrases

Thank you