data mining - simple guide for beginners.pdf
TRANSCRIPT
-
7/27/2019 DATA MINING - Simple Guide For Beginners.pdf
1/5
www.dwbiconcepts.com
DATA MINING - A Simple Guide for Beginners!
This paper introduces the subject of data mining in simple, lucid language and moves on to build more complex concepts. Start here if you are a beginner.
Author: Akash Mitra
Dated: 21 Mar 2012
Source: http://www.dwbiconcepts.com
Data Mining. I have an allergy to this term. Not because I hate the subject of data mining itself, but because the term is so over-used, misused, exploited and commercialized, and so often conveyed inaccurately, in inappropriate places, and with intentional vagueness. So when I decided to write about what data mining is, I was convinced that I first needed to write about what is NOT data mining, in order to build a formal definition of data mining.
What is Data Mining? (And what it is not)
Here is the Wikipedia definition of data mining:
"Data mining is the process of discovering new patterns from large data sets."
Now the question is: what does the above definition really mean, and how does it differ from finding information in databases? We often store information in databases (as in data warehouses) and retrieve it when we need it. Is that data mining? The answer is no. We will soon see why.
Let's start with the big picture first. Data mining is one of the steps in the process of knowledge discovery in databases (KDD). The knowledge discovery process is divided into 5 steps:
Selection
Pre-processing
Transformation
Data Mining
Evaluation
Selection is the step where we identify the data; pre-processing is where we cleanse and profile the data; transformation prepares the data for mining; and then comes data mining itself. Lastly, we use evaluation to test the result of the data mining.
Notice the term "Knowledge" in Knowledge Discovery in Databases (KDD). Why do we say knowledge? Why not information or data? This is because there are differences among the terms data, information and knowledge. Let's understand this difference through an example.
You run a local department store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day. For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store these records in your store database. This is data. Any time you want to know which visitors buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold from your store on each day of the week, you can again query your database and get the answer. That is also information. But suppose there are 1000 other customers who also buy candles from you every Sunday (mostly, with some percentage of variation) and all of them are Christian by religion. From this you can infer that Alex, Jessica and Paul are also Christian.
Now, the religion of Alex, Jessica and Paul was not given to you as data. It could not be retrieved from the database as information. But you learnt this fact indirectly. This is the knowledge that you discovered, and the discovery was made through a process called data mining. There is a chance that you are wrong about Alex, Jessica and Paul, but there is a fair chance that you are actually right. That is why it is very important to evaluate the result of the KDD process.
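The contrast between information and knowledge in the store example can be sketched in a few lines of Python. This is only an illustration under assumptions: the purchase log, the partially known attribute table and the helper name `infer_attribute` are all invented here and are not part of the original article. The two queries at the top return information; the function infers something that was never stored and reports how often the pattern holds, which is a very small stand-in for the evaluation step.

```python
from collections import Counter

# Hypothetical purchase log: (customer, weekday, item). All values are invented.
purchases = [
    ("Alex", "Sun", "candle"), ("Jessica", "Sun", "candle"),
    ("Paul", "Sun", "candle"), ("Maria", "Sun", "candle"),
    ("John", "Sun", "candle"), ("Ruth", "Sun", "candle"),
    ("Dev", "Sun", "candle"), ("Omar", "Tue", "candle"),
    ("Maria", "Mon", "bread"),
]

# An attribute recorded for only some customers (assumed for illustration).
known_religion = {"Maria": "Christian", "John": "Christian",
                  "Ruth": "Christian", "Dev": "Other", "Omar": "Other"}

# Information: answers that a plain query can read straight off the data.
candle_buyers = sorted({c for c, _, item in purchases if item == "candle"})
candles_per_day = Counter(day for _, day, item in purchases if item == "candle")


def infer_attribute(purchases, known, day="Sun", item="candle"):
    """Guess an unrecorded attribute from a shared buying pattern."""
    # Customers who match the pattern (e.g. buy candles on Sundays).
    matching = {c for c, d, i in purchases if d == day and i == item}
    # Attribute values among the matching customers for whom it is recorded.
    labels = Counter(known[c] for c in matching if c in known)
    if not labels:
        return None, 0.0, sorted(matching)
    label, count = labels.most_common(1)[0]
    # Evaluation: the share of known matching customers that carry the label.
    confidence = count / sum(labels.values())
    unknown = sorted(matching - set(known))
    return label, confidence, unknown


label, confidence, customers = infer_attribute(purchases, known_religion)
print(f"Inferred '{label}' for {customers} with confidence {confidence:.2f}")
```

With this toy data the inference holds for three of the four Sunday candle buyers whose attribute is known, so the guess about Alex, Jessica and Paul comes with a confidence below 1, which matches the point that the result may be wrong.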
Now, the reason I gave you this example is that I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important for answering our first question: why retrieving information from deep down in your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it is still not data mining. As long as you are not doing predictive analysis or discovering new patterns from the existing data, you are not doing data mining.
What are the applications of Data Mining?
When it comes to applying data mining, your imagination is the only barrier (not really: there are technological hindrances as well, as we will see later). But it is true that data mining is applied in almost every field, from genetics to the study of human rights violations. One of the most important applications is in machine learning. Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today's world are being solved by applying machine learning, because solving them otherwise (e.g. through a deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.
Let me start with one example of an application of data mining: enabling a machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have a driver and moves around the road all by itself. It maneuvers and overcomes obstacles by analyzing the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on data from its past experiences.
There are notable applications of data mining in subjects such as:
Voice recognition
Think of Siri on the iPhone. How does it understand your commands? Clearly it is not deterministically programmable, as everybody has a different tone, accent and voice. And not only does it understand, it also adapts better to your voice as you keep using it more and more.
Classification of DNA sequences
A DNA sequence contains biological information. One of the many approaches to analyzing DNA sequences is sequence mining, where data mining techniques are applied to find statistically relevant patterns, which are then compared with previously studied sequences to understand the given sequence.
Natural language processing
Consider the following conversation between a customer (Mike) and a shopkeeper (Linda).
Mike: You have playing cards?
Linda: We have one blue stack from Jackson's and also one other from Deborah's.
Mike: What is the price?
Linda: Jackson's $4 and Deborah's $7.
Mike: Okay, give me the blue one please.
Now consider this. What if Linda were an automated machine? You could probably still have the same kind of conversation, but it would probably be much more unnatural.
Mike: You have playing cards?
Robot: Yes.
Mike: What type of playing cards do you have?
Robot: We have Jackson's and Deborah's playing cards.
Mike: What are the colors of the playing cards?
Robot: Which company's playing card do you want to know the color of?
Mike: What is the color of Jackson's playing cards?
Robot: Blue.
Mike: What are the prices of Jackson's and Deborah's playing cards?
Robot: Jackson's playing cards cost you $4 and Deborah's playing cards cost you $7.
Mike: Ok, then can I buy the blue ones?
Robot: We do not have any product called "blue ones".
Mike: Can I have the blue color playing cards please?
Robot: Sure!
I know the above example is a bit of an overshoot, but you get the idea. Machines do not understand natural language, and it is a challenge to make them understand it. Until we do that, we won't be able to build a really useful human-computer interface. Real advances in natural language processing have come only recently, with the application of data mining. Prior implementations of language-processing tasks typically involved the direct hand-coding of large sets of rules. The machine-learning paradigm instead uses general learning algorithms (often, although not always, grounded in statistical inference) to automatically learn such rules through the analysis of large corpora of typical real-world examples.
Methods of data mining
Now, if the above examples interest you, let's continue learning more about data mining. One of the first things we have to do next is understand the different approaches that are used in the field of data mining. The most important methods are described below:
Anomaly Detection
This is the method of detecting patterns in a given data set that do not conform to an established normal behavior. It is applied in a number of different fields, such as network intrusion detection and share market fraud detection.
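As a minimal sketch of the idea, one simple approach (by no means the only one) is to treat a value as anomalous when it lies more than a chosen number of standard deviations from the mean. The data, the function name `find_anomalies` and the threshold of 2.0 are assumptions made for this illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical daily login counts for one account; one day looks unusual.
daily_logins = [102, 98, 105, 99, 101, 97, 103, 100, 480, 104]
print(find_anomalies(daily_logins))
```

Note that an extreme value also inflates the mean and standard deviation it is measured against, so on small samples this simple rule can miss anomalies; methods based on the median are often preferred for that reason.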
Association Rule Learning
This is a method of discovering interesting relations between variables in large databases. Ever seen "Buyers who bought this product also bought these:" type of messages on e-commerce websites (e.g. on Amazon.com)? That's an example of association rule learning.
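A rough sketch of how such "also bought" rules can be scored is shown below. It counts item pairs across shopping baskets and keeps rules whose support (share of baskets containing both items) and confidence (share of baskets with the first item that also contain the second) pass chosen thresholds. The baskets, the thresholds and the function name `pair_rules` are assumptions for illustration; real systems use algorithms such as Apriori to handle larger item sets efficiently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Return sorted (lhs, rhs, support, confidence) rules between single items."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


for lhs, rhs, support, confidence in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```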
Clustering
Clustering is the method of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
Cluster analysis is widely used in market research when working with multivariate data. Market researchers often use it to create customer segments, product segments and so on.
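To make the idea concrete, here is a small sketch of k-means, one common clustering algorithm, reduced to a single numeric feature. The spending figures, the function name `kmeans_1d` and the deterministic choice of starting centroids are assumptions for this example; practical implementations work on many features and usually start from random or k-means++ centroids.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Group numbers into k clusters by repeatedly moving centroids to cluster means."""
    if k < 2 or k > len(values):
        raise ValueError("k must be between 2 and the number of values")
    ordered = sorted(values)
    # Assumed initialisation: spread the starting centroids across the sorted data.
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend per customer: occasional vs. heavy shoppers.
monthly_spend = [20, 25, 22, 30, 210, 190, 205, 28, 200]
centroids, clusters = kmeans_1d(monthly_spend, k=2)
print(centroids, clusters)
```

In a market research setting, each resulting cluster would be read as a customer segment.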
Classification
This method generalizes known structure so that it can be applied to new data. For example, an email program might attempt to classify an email as legitimate or spam.
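The spam example can be sketched with a tiny naive Bayes classifier, which is one of several techniques that could be used here. The training messages, the labels "spam" and "ham", and the function names `train` and `classify` are invented for illustration; the sketch also assumes every label has at least one training example.

```python
import math
from collections import Counter


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: known structure to generalize from.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("win a free prize now", "spam"),
    ("meeting at noon", "ham"),
    ("lunch with the team", "ham"),
    ("project report attached", "ham"),
]
model = train(training)
print(classify("free money now", model))
print(classify("team meeting at noon", model))
```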
Regression
Regression attempts to find a function that models the data with the least error. The above example of autonomous driving uses this method.

Next we will learn about each of these methods in greater detail, with examples of their application. Meanwhile, let me know if you have any questions or suggestions on this article. Please visit www.dwbiconcepts.com to find more.
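Returning to the regression method listed above, the simplest instance of "finding a function that models the data with the least error" is fitting a straight line by ordinary least squares. The sketch below is an assumption-laden illustration: the data points and the function name `fit_line` are invented, and a real application such as autonomous driving would use far richer models and inputs.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations that roughly follow y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted line can then be used to predict y for an x that was never observed, which is what distinguishes regression from simply looking up stored values.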
The original paper is located here: http://www.dwbiconcepts.com/data-warehousing/11-data-mining/97-data-mining-for-beginners.html