

Steps for effective text data cleaning (with case study using Python)

SHIVAM BANSAL, NOVEMBER 2014

The days when one would get data in tabulated spreadsheets are truly behind us. A moment of silence for the data residing in spreadsheet pockets. Today, more than 80% of data is unstructured: it is either present in data silos or scattered around digital archives. Data is being produced as we speak, from every conversation we have on social media to every piece of content generated by news sources. In order to produce any meaningful, actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, my bread and butter comes from deriving meaningful insights from unstructured text information.

One of the first steps in working with text data is to pre-process it. This is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy in nature; to achieve better insights, or to build better algorithms, it is necessary to work with clean data. Social media data, for example, is highly unstructured. It is informal communication, and typos, bad grammar, slang, and unwanted content such as URLs, stopwords and expressions are the usual suspects.

In this blog, therefore, I discuss these possible noise elements and how you can clean them step by step. I provide ways to clean the data using Python.

As a typical business problem, assume you are interested in finding out which features of the iPhone are most popular among its fans. You have extracted consumer opinions related to the iPhone, and here is a tweet you extracted:

    "I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"

    Steps for data cleaning:

    Here is what you do:

1. Escaping HTML characters: Data obtained from the web usually contains a lot of HTML entities like &lt; &gt; &amp; which get embedded in the original data. It is thus necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. Another approach is to use appropriate packages and modules (for example, Python's HTML parser), which can convert these entities back to standard characters. For example, &lt; is converted to "<" and &amp; is converted to "&".


2. Decoding data: Text data may come in a variety of encodings, so for better analysis it is necessary to keep the complete data in a standard encoding format. UTF-8 encoding is widely accepted and is recommended.

    Snippet:

    tweet = original_tweet.decode("utf8").encode("ascii", "ignore")

Output: the tweet with any non-ASCII characters (such as the emoji) stripped out.
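Note that this snippet uses Python 2 string methods (str.decode). On Python 3, where strings are already Unicode, one possible equivalent for keeping only ASCII characters is the following sketch (not taken from the original post):

    # Python 3: encode to ASCII, silently dropping anything non-ASCII
    # (such as the emoji), then decode back to a str for further processing.
    tweet = original_tweet.encode("ascii", "ignore").decode("ascii")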



Snippet (slang lookup):

    tweet = _slang_lookup(tweet)

Outcome: the tweet with known slang terms replaced by their standard words.
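The _slang_lookup helper is not defined in this excerpt; it stands for a step that converts slang terms into standard words. A minimal sketch of what such a helper could look like, assuming a small hand-built slang dictionary:

    # Hypothetical slang dictionary; real lists are much larger and are
    # usually loaded from a file or an online slang resource.
    SLANG_MAP = {
        "luv": "love",
        "awsm": "awesome",
        "thx": "thanks",
    }

    def _slang_lookup(text):
        # Replace known slang tokens with their standard forms, word by word.
        # Tokens not present in the dictionary are left unchanged.
        return " ".join(SLANG_MAP.get(word.lower(), word) for word in text.split())

    print(_slang_lookup("I luv my iphone it is awsm"))
    # I love my iphone it is awesome

A real implementation would also handle punctuation attached to tokens and elongated words like "happppppy", which this sketch ignores.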


Advanced data cleaning:

1. Grammar checking: Grammar checking is largely learning based: models are trained on large amounts of correct text and then used for grammar correction. There are many online tools available for this purpose.

2. Spelling correction: In natural language, misspelled words are common. Companies like Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms like Levenshtein distance, dictionary lookup, etc., or other modules and packages to fix these errors.
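As an illustration of the Levenshtein distance idea mentioned above, here is a small sketch of the classic dynamic-programming edit distance, together with a naive dictionary lookup that simply picks the closest vocabulary word (illustrative only, not a production spell checker):

    def levenshtein(a, b):
        # Minimum number of single-character edits (insertions, deletions,
        # substitutions) needed to turn string a into string b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    def correct(word, vocabulary):
        # Naive dictionary lookup: return the vocabulary word closest to `word`.
        return min(vocabulary, key=lambda w: levenshtein(word, w))

    print(levenshtein("awsm", "awesome"))                    # 3
    print(correct("awsm", ["apple", "awesome", "display"]))  # awesome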

     

End Notes:

I hope you found this article helpful. These are some tips and tricks I have learnt while working with a lot of text data. If you follow the above steps to clean the data, you can drastically improve the accuracy of your results and draw better insights. Do share your views and doubts in the comments section, and I will be happy to participate.