Steps for effective text data cleaning (with case study using Python)
Post on 07-Jul-2018
8/18/2019 Steps for Effective Text Data Cleaning
BIG DATA · BUSINESS ANALYTICS · PYTHON
SHIVAM BANSAL, NOVEMBER 2014
The days when one would get data in tabulated spreadsheets are truly behind us. A moment of silence for the data residing in the spreadsheet pockets. Today, more than 80% of the data is unstructured – it is either present in data silos or scattered around the digital archives. Data is being produced as we speak – from every conversation we have on social media to every piece of content generated by news sources. To produce any meaningful actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, my bread and butter comes from deriving meaningful insights from unstructured text information.
One of the first steps in working with text data is to pre-process it. It is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy in nature – to achieve better insights or to build better algorithms, it is necessary to work with clean data. For example, social media data is highly unstructured – it is informal communication – typos, bad grammar, usage of slang, presence of unwanted content like URLs, stopwords, expressions etc. are the usual suspects.
In this blog, therefore, I discuss these possible noise elements and how you could clean them step by step. I am providing ways to clean the data using Python.
As a typical business problem, assume you are interested in finding which features of an iPhone are more popular among the fans. You have
extracted consumer opinions related to the iPhone, and here is a tweet you extracted:
"I luv my &lt;3 iphone &amp; you're awsm apple. DisplayIsAwesome, sooo happppppy http://www.apple.com"
Steps for data cleaning:
Here is what you do:
1. Escaping HTML characters: Data obtained from the web usually contains a lot of HTML entities like &lt; &gt; &amp; which get embedded in the original data. It is thus necessary to get rid of these entities. One approach is to directly remove them by the use of specific regular expressions. Another approach is to use appropriate packages and modules (for example htmlparser of Python), which can convert these entities to standard HTML tags. For example: &lt; is converted to "<" and &amp; is converted to "&".
2. Decoding data: Text data may come in a mix of encodings, so for better analysis it is necessary to keep the complete data in a standard encoding format. UTF-8 encoding is widely accepted and is recommended to use.
Snippet:

tweet = original_tweet.decode("utf8").encode("ascii", "ignore")
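The snippet above is written for Python 2, where strings are byte strings. In Python 3, where str is already Unicode, a sketch of the same normalization (dropping any non-ASCII characters) would be; the sample tweet with a non-ASCII heart character is an assumption for illustration:

```python
# Hypothetical raw tweet containing a non-ASCII character (a heart emoji)
original_tweet = "I luv my \u2764 iphone, sooo happppppy"

# Encode to ASCII, silently dropping anything ASCII cannot represent,
# then decode back to an ordinary str
tweet = original_tweet.encode("ascii", "ignore").decode("ascii")
print(tweet)  # I luv my  iphone, sooo happppppy
```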
Slang lookup: Informal text is full of slang words ("luv" for "love", "awsm" for "awesome" and so on). These can be replaced with their standard forms using a lookup dictionary.

Snippet:

tweet = _slang_lookup(tweet)
Advanced data cleaning:

1. Grammar checking: Grammar checking is majorly learning based; a huge amount of proper text data is learned and models are created for the purpose of grammar correction. There are many online tools available for grammar correction purposes.
2. Spelling correction: In natural language, misspelled errors are encountered. Companies like Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms like the Levenshtein Distance, Dictionary Lookup etc. or other modules and packages to fix these errors.
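As a sketch of the Levenshtein Distance approach, here is a standard dynamic-programming edit distance paired with a toy dictionary lookup (the dictionary and misspelled word are assumptions for illustration):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance:
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Dictionary lookup: pick the known word closest to the misspelling
dictionary = ["apple", "happy", "iphone"]
word = "hapy"
best = min(dictionary, key=lambda w: levenshtein(word, w))
print(best)  # happy
```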
End Notes:

Hope you found this article helpful. These were some tips and tricks I have learnt while working with a lot of text data. If you follow the above steps to clean the data, you can drastically improve the accuracy of your results and draw better insights. Do share your views/doubts in the comments section and I would be happy to participate.