data mining - simple guide for beginners.pdf

Upload: jayaprakash-reddy

Post on 14-Apr-2018


  • 7/27/2019 DATA MINING - Simple Guide For Beginners.pdf

    www.dwbiconcepts.com

    DATA MINING: A Simple Guide for Beginners!

    This paper introduces the subject of data mining in simple, lucid language and moves on to build more complex concepts. Start here if you are a beginner.

    Author: Akash Mitra
    Dated: 21 Mar 2012
    Source: http://www.dwbiconcepts.com

    Data mining. I have an allergy to this term. Not because I hate the subject of data mining itself, but because the term is so over-used, misused, exploited and commercialized, and so often conveyed inaccurately, in inappropriate places and with intentional vagueness. So when I decided to write about what data mining is, I was convinced that I needed to write about what is NOT data mining first, in order to build a formal definition of data mining.

    What is Data Mining? (And what it is not)

    Here is the Wikipedia definition of data mining: "Data mining is the process of discovering new patterns from large data sets."

    Now the question is: what does the above definition really mean, and how does it differ from finding information in databases? We often store information in databases (as in data warehouses) and retrieve it when we need it. Is that data mining? The answer is no. We will soon see why.

    Let's start with the big picture first. Data mining is one of the steps in the process of knowledge discovery in databases (KDD). The knowledge discovery process is divided into 5 steps:

    1. Selection
    2. Pre-processing
    3. Transformation
    4. Data Mining
    5. Evaluation

    Selection is the step where we identify the data, pre-processing is where we cleanse and profile the data, transformation is required for data preparation, and then comes data mining. Lastly, we use evaluation to test the result of the data mining.

    Notice the term "knowledge" in Knowledge Discovery in Databases (KDD). Why do we say knowledge? Why not information or data? This is because there are differences among the terms data, information and knowledge. Let's understand this difference through one example.

    You run a local department store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day. For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store this in your store database. This is data.

    Any time you want to know which visitors buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold on each day of the week, you can again query your database and get the answer. That is also information.

    But suppose there are 1000 other customers who also buy candles from you every Sunday (with some percentage of variation) and all of them are Christian by religion. So you can conclude that Alex, Jessica and Paul are most likely Christian as well.

    Now, the religion of Alex, Jessica and Paul was not given to you as data. It could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the knowledge that you discovered, and the discovery was made through a process called data mining. There is a chance that you are wrong about Alex, Jessica and Paul, but there is a fair chance that you are actually right. That is why it is very important to evaluate the result of the KDD process.
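To make the distinction concrete, here is a minimal Python sketch of the store example. Everything in it is an assumption made for illustration: the purchase log, the extra customer names, the `known_religion` table and the function names are all hypothetical. The first function is a plain query (information); the second guesses an attribute that was never stored for Alex, Jessica and Paul (knowledge) and reports how strongly the known cases support the guess, which is a very crude form of evaluation.

```python
from collections import Counter

# Hypothetical purchase log: (customer, weekday, item). This is the raw data.
purchases = [
    ("Alex", "Sunday", "candle"),
    ("Jessica", "Sunday", "candle"),
    ("Paul", "Sunday", "candle"),
    ("Maria", "Sunday", "candle"),
    ("John", "Sunday", "candle"),
    ("Ravi", "Tuesday", "soap"),
    ("Ravi", "Sunday", "candle"),
    ("Chen", "Friday", "candle"),
]

# Hypothetical side table: the religion is known for only a few customers.
known_religion = {"Maria": "Christian", "John": "Christian", "Ravi": "Hindu"}


def sunday_candle_buyers(log):
    """Information: a plain query over the stored data."""
    return sorted({c for c, day, item in log if day == "Sunday" and item == "candle"})


def infer_religion(log, known):
    """Knowledge: guess an unrecorded attribute from a pattern in the data.

    Returns (guess, confidence), where confidence is the share of Sunday
    candle buyers with a known religion who match the guess.
    """
    buyers = sunday_candle_buyers(log)
    labels = Counter(known[c] for c in buyers if c in known)
    guess, count = labels.most_common(1)[0]
    return guess, count / sum(labels.values())


buyers = sunday_candle_buyers(purchases)
guess, confidence = infer_religion(purchases, known_religion)
print(buyers)
print(guess, round(confidence, 2))
```

The confidence value is the reason the evaluation step matters: with this toy data, one of the three buyers whose religion is known does not fit the pattern, so the guess about Alex, Jessica and Paul could well be wrong.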


    Now, the reason I gave you this example is that I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important for answering our first question: why retrieving information from deep down in your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it is still not data mining. As long as you are not doing predictive analysis or discovering new patterns in the existing data, you are not doing data mining.

    What are the applications of Data Mining?

    When it comes to applying data mining, your imagination is the only barrier (not really: there are technological hindrances as well, as we will see later). But it is true that data mining is applied in almost every field, from genetics to the investigation of human rights violations. One of the most important applications is in machine learning. Machine learning is a branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. Machine learning makes it possible for computers to take autonomous decisions based on the data available from past experiences. Many of the standard problems of today's world are being solved by the application of machine learning, as solving them otherwise (e.g. through a deterministic algorithmic approach) would be impossible given the breadth and depth of the problem.

    Let me start with one example of an application of data mining that enables a machine-learning algorithm to drive an autonomous vehicle. This vehicle does not have a driver; it moves around the road all by itself. It maneuvers and overcomes obstacles by taking the images that it sees (through a VGA camera) and then using data mining to determine the course of action based on the data of its past experiences.

    There are notable applications of data mining in subjects such as:

    Voice recognition

    Think of Siri on the iPhone. How does it understand your commands? Clearly it is not deterministically programmable, as everybody has a different tone, accent and voice. And not only does it understand, it also adapts better to your voice as you keep using it more and more.

    Classification of DNA sequences

    A DNA sequence contains biological information. One of the many approaches to analyzing DNA sequences is sequence mining, where data mining techniques are applied to find statistically relevant patterns, which are then compared with previously studied sequences to understand the given sequence.

    Natural Language processing

    Consider the following conversation between a customer (Mike) and a shopkeeper (Linda).

    Mike: You have playing cards?
    Linda: We have one blue stack from Jackson's and also one other from Deborah's.
    Mike: What is the price?
    Linda: Jackson's $4 and Deborah's $7.
    Mike: Okay, give me the blue one please.

    Now consider this. What if Linda were an automated machine? You could probably still have the same kind of conversation, but it would probably be much more unnatural.

    Mike: You have playing cards?
    Robot: Yes.
    Mike: What type of playing cards do you have?
    Robot: We have Jackson's and Deborah's playing cards.
    Mike: What are the colors of the playing cards?
    Robot: Which company's playing card do you want to know the color of?
    Mike: What is the color of Jackson's playing cards?
    Robot: Blue.
    Mike: What are the prices of Jackson's and Deborah's playing cards?
    Robot: Jackson's playing cards cost you $4 and Deborah's playing cards cost you $7.
    Mike: Ok, then can I buy the blue ones?
    Robot: We do not have any product called "blue ones".
    Mike: Can I have the blue color playing cards please?
    Robot: Sure!

    I know the above example is a bit of an exaggeration, but you get the idea. Machines do not understand natural language, and it is a challenge to make them understand it. Until we do that, we won't be able to build a really useful human-computer interface. Real advances in natural language processing have come recently, with the application of data mining. Prior implementations of language-processing tasks typically involved the direct hand-coding of large sets of rules. The machine-learning paradigm instead uses general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples.


    Methods of data mining

    Now, if the above examples interest you, let's continue learning more about data mining. One of the first things we have to do next is to understand the different approaches that are used in the field of data mining. The list below shows most of the important methods:

    Anomaly Detection

    This is the method of detecting patterns in a given data set that do not conform to an established normal behavior. It is applied in a number of different fields, such as network intrusion detection, share market fraud detection etc.
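As a rough illustration of anomaly detection, the sketch below flags values that lie far from the mean of a series, measured in standard deviations (a z-score test). This is only one simple technique among many, chosen here for brevity; the `logins_per_hour` data, the function name and the threshold of 2.0 are assumptions made for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical hourly login counts for one account; the last hour is unusual.
logins_per_hour = [12, 15, 11, 14, 13, 12, 16, 14, 13, 95]
print(find_anomalies(logins_per_hour))
```

Note that a single extreme value also inflates the mean and the standard deviation it is compared against, so on small samples a fixed threshold needs care. More robust variants use the median instead of the mean.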

    Association Rule Learning

    This is a method of discovering interesting relations between variables in large databases. Ever seen "Buyers who bought this product also bought these:" type of messages on e-commerce websites (e.g. Amazon.com)? That's an example of association rule learning.
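The idea behind "also bought" messages can be sketched with two standard measures: support (how often a pair of items appears together) and confidence (how often the second item appears given the first). The example below is a brute-force sketch over item pairs only, not a full algorithm such as Apriori; the baskets, thresholds and function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.7):
    """Find rules 'lhs -> rhs' between single items.

    support    = share of baskets containing both items
    confidence = share of baskets containing lhs that also contain rhs
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


for lhs, rhs, support, confidence in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With this toy data, "bread -> butter" qualifies but "butter -> milk" does not, because only half of the butter baskets also contain milk.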

    Clustering

    Clustering is the method of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.

    Cluster analysis is widely used in market research when working with multivariate data. Market researchers often use it to create customer segmentation, product segmentation etc.
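A customer segmentation of the kind mentioned above could be sketched with k-means, one common clustering algorithm (the article does not name a specific one). The customer data, the choice of k = 2 and the naive initialisation from the first k points are all assumptions made to keep the example short and deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, max_iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    centroids = [tuple(p) for p in points[:k]]  # naive, deterministic start
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ended up empty.
        updated = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(1, 10), (2, 12), (1, 11), (9, 80), (10, 85), (8, 78)]
centroids, clusters = kmeans(customers, k=2)
for c, members in zip(centroids, clusters):
    print(c, members)
```

On this data the algorithm separates occasional low spenders from frequent high spenders. Real uses typically scale the features first and try several starting points, since k-means is sensitive to both.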

    Classification

    This method is used for the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
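The spam example can be sketched with a tiny naive Bayes classifier, which is one of several techniques an email program might use (the article does not say which). The training messages, labels and function names below are hypothetical, and a real filter would need far more data.

```python
import math
from collections import Counter

# Hypothetical labelled examples: the "known structure" to generalize from.
training = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday with the team", "legitimate"),
]


def train(examples):
    """Count words per label; returns (word_counts, label_counts, vocabulary)."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
print(classify("free prize", model))
print(classify("monday meeting", model))
```

The classifier never saw either test message during training; it generalizes from word frequencies in the labelled examples, which is the essence of classification as described above.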

    Regression

    Regression attempts to find a function that models the data with the least error. The above example of autonomous driving uses this method.

    Next, we will learn about each of these methods in greater detail, with examples of their application. Meanwhile, let me know if you have any questions or suggestions on this article. Please visit www.dwbiconcepts.com to find more.

    Original paper is located here: http://www.dwbiconcepts.com/data-warehousing/11-data-mining/97-data-mining-for-beginners.html