Steps for effective text data cleaning (with case study using Python)
Post on 07-Jul-2018
8/18/2019 Steps for Effective Text Data Cleaning
BIG DATA · BUSINESS ANALYTICS · PYTHON
SHIVAM BANSAL, NOVEMBER 2014
The days when one would get data in tabulated spreadsheets are truly behind us. A moment of silence for the data residing in the spreadsheet pockets. Today, more than 80% of the data is unstructured – it is either present in data silos or scattered around the digital archives. Data is being produced as we speak – from every conversation we have on social media to every piece of content generated by news sources. To produce any meaningful actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, my bread and butter comes from deriving meaningful insights from unstructured text information.
One of the first steps in working with text data is to pre-process it. It is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy in nature – to achieve better insights or to build better algorithms, it is necessary to work with clean data. For example, social media data is highly unstructured – it is informal communication – typos, bad grammar, usage of slang, presence of unwanted content like URLs, stopwords, expressions etc. are the usual suspects.
In this blog, therefore, I discuss these possible noise elements and how you could clean them step by step. I am providing ways to clean the data using Python.
As a typical business problem, assume you are interested in finding which features of an iPhone are more popular among the fans. You have
extracted consumer opinions related to the iPhone, and here is a tweet you extracted:
"I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com"
Steps for data cleaning:
Here is what you do:
1. Escaping HTML characters: Data obtained from the web usually contains a lot of HTML entities like &lt; &gt; &amp; which get embedded in the original data. It is thus necessary to get rid of these entities. One approach is to directly remove them by the use of specific regular expressions. Another approach is to use appropriate packages and modules (for example htmlparser of Python), which can convert these entities to standard HTML tags. For example: &lt; is converted to "<" and &amp; is converted to "&".
2. Decoding data: Text data may come in a mix of encodings, so for better analysis it is necessary to keep the complete data in a standard encoding format. UTF-8 encoding is widely accepted and is recommended to use.
Snippet:

tweet = original_tweet.decode("utf8").encode("ascii", "ignore")
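The snippet above is written for Python 2, where strings are byte strings. In Python 3, where str is already Unicode, a sketch of the same normalization (dropping any non-ASCII characters) would be; the sample tweet with a non-ASCII heart character is an assumption for illustration:

```python
# Hypothetical raw tweet containing a non-ASCII character (a heart emoji)
original_tweet = "I luv my \u2764 iphone, sooo happppppy"

# Encode to ASCII, silently dropping anything ASCII cannot represent,
# then decode back to an ordinary str
tweet = original_tweet.encode("ascii", "ignore").decode("ascii")
print(tweet)  # I luv my  iphone, sooo happppppy
```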
Slang lookup: Informal text is full of slang words ("luv" for "love", "awsm" for "awesome" and so on). These can be replaced with their standard forms using a lookup dictionary.

Snippet:

tweet = _slang_lookup(tweet)
Advanced data cleaning:

1. Grammar checking: Grammar checking is majorly learning based; a huge amount of proper text data is learned and models are created for the purpose of grammar correction. There are many online tools available for grammar correction purposes.
2. Spelling correction: In natural language, misspelled errors are encountered. Companies like Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms like the Levenshtein Distance, Dictionary Lookup etc. or other modules and packages to fix these errors.
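As a sketch of the Levenshtein Distance approach, here is a standard dynamic-programming edit distance paired with a toy dictionary lookup (the dictionary and misspelled word are assumptions for illustration):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Dictionary lookup: pick the known word closest to the misspelling
dictionary = ["apple", "happy", "iphone"]
word = "hapy"
best = min(dictionary, key=lambda w: levenshtein(word, w))
print(best)  # happy
```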
End Notes:

Hope you found this article helpful. These were some tips and tricks I have learnt while working with a lot of text data. If you follow the above steps to clean the data, you can drastically improve the accuracy of your results and draw better insights. Do share your views/doubts in the comments section and I would be happy to participate.