Data Profiling with R
Want to follow along with this session using R?
Download the script and data from the session
scheduler. Also download R and RStudio.
It’s easy to follow along!
© 2016 RED PILL Analytics
www.RedPillAnalytics.com @RedPillA © 2016 RED PILL Analytics
Using R for Data Profiling
Michelle Kolbe
medium.com/@datacheesehead @mekolbe linkedin.com/in/michellekolbe
Do you have a data quality problem?
What to Check for?
• Accuracy
• Consistency
• Completeness
• Uniqueness
• Distribution
• Range
Why Profile Your Data?
Benefits
• Trust in data
• Find problems in advance
• Shorten development time on projects
• Improve understanding of data & business knowledge
Why R?
• Free!
• Easy to use
• Flexible
• Powerful analytics
• Great community!
Getting Started in R
What is R?
• A programming environment
• Fairly simple to use & understand
• Allows a user to manipulate & analyze data
• Open source
• Real power comes from the packages you can install from a large community
• Easy to learn with a programming background
• Con: memory management & speed compared with C++ or Python
Tools for R
• First download R from r-project.org
• Then download RStudio, the best R IDE
R Basics
• Case sensitive
• <- assigns to a variable
• # begins a comment
• ??<keyword> will search R documentation for help
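The basics above can be tried in a few lines. This is only a sketch; the names `total` and `Total` and their values are made up for illustration.

```r
# Assignment uses <- ; everything after # on a line is a comment
total <- 1
Total <- 2   # a different object, because R is case sensitive

identical(total, Total)   # FALSE: the two names hold different values

# ??summary   # uncomment in an interactive session to search the documentation
```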
Using Packages
• First install
install.packages("<package name>")
• Once installed, load the package
library("<package name>")
• Note that every time you open R you’ll need to load the packages you’ll be using
• RStudio shows which packages are installed and loaded
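A common way to combine the two steps is to install only when the package is missing and then load it. The sketch below uses `tools`, which ships with R, purely so it runs without a network connection; substitute the package you actually need (for example ggplot2).

```r
# "tools" is a stand-in package name chosen so the example runs offline
pkg <- "tools"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load it for this session; character.only lets library() take a variable
library(pkg, character.only = TRUE)

search()   # attached packages now include "package:tools"
```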
Connecting to Data in R
• Data should be read into R and stored in an object
• Easiest with CSV
• Can download datasets from a URL or read them from a local drive
d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
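Reading from a local drive works the same way as reading from a URL. The sketch below writes a tiny CSV to a temporary file first so it needs no network access; the column names and values are invented for illustration and are not the session's data.

```r
# Create a small CSV on disk (hypothetical columns and values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,read,write", "1,57,52", "2,68,59", "3,44,33"), csv_path)

# Read the file into a data frame and store it in an object
d <- read.csv(csv_path)

str(d)    # column names and types
head(d)   # first rows
```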
Connecting to Oracle
• RODBC
• Load package in R
library(RODBC)
• View available data sources
odbcDataSources()
• Can read tables and send SQL queries
con <- odbcConnect("Oracle Sample", uid="system", pwd="oracle")
d <- sqlQuery(con, "select sysdate from dual")
Callout: ODBC Connection Name
Connecting to Oracle
• RJDBC
• Load package
library(RJDBC)
• Create connection driver
jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="lib/ojdbc6.jar")
• Open connection
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//database.hostname.com:port/service_name_or_sid", "username", "password")
• Query
dbGetQuery(jdbcConnection, "select sysdate from dual")
• Close connection
dbDisconnect(jdbcConnection)
ROracle
• Open source but maintained by Oracle
• Faster: 79 times faster than RJDBC and 2.5 times faster than RODBC
• Provides scalability and stability
Variables
• Can store data in variables using <- or =
• Do not need to define variable first
• RStudio shows your variables on the right
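A short sketch of both assignment operators; the variable names and values here are hypothetical.

```r
# Both operators create the variable on first use; no declaration is needed
player_count <- 32
team = "Blue"

# A variable can later be rebound to a value of a different type
player_count <- "thirty-two"

class(player_count)   # "character"
ls()                  # lists the variables in the current environment
```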
Using RStudio
Our Data Set to Profile
First, Load the Data into R
Summarize the Data
• summary is an R function that shows basic details about each column in your dataset
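The session's dataset is not included in this transcript, so the sketch below builds a small stand-in data frame. The `players` object and its columns (`Position`, `Age`, `Yards`) are assumptions made for illustration only.

```r
# Hypothetical stand-in for the session's dataset
players <- data.frame(
  Position = factor(c("QB", "RB", "WR", "RB", "QB", "WR")),
  Age      = c(24, 27, 31, 22, 35, 26),
  Yards    = c(3100, 640, 880, 215, 4020, NA)
)

# Numeric columns get min, quartiles, median, mean, max and an NA count;
# factor columns get a count per level
s <- summary(players)
s
```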
Filter the dataset
• Use function nesting to get a subset of data in the summary
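Function nesting here means passing the result of a filtering call straight into `summary`. The slide's exact call is not in the transcript; this sketch uses `subset` and bracket indexing on a hypothetical `players` data frame.

```r
# Hypothetical stand-in for the session's dataset
players <- data.frame(
  Position = c("QB", "RB", "WR", "RB", "QB", "WR"),
  Age      = c(24, 27, 31, 22, 35, 26),
  Yards    = c(3100, 640, 880, 215, 4020, 760)
)

# The inner call filters the rows; the outer call summarizes what is left
rb_summary <- summary(subset(players, Position == "RB"))
rb_summary

# The same idea with bracket indexing, restricted to one column
rb_yards <- summary(players[players$Position == "RB", "Yards"])
rb_yards
```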
Bad Data?
• If the mean is 218 for Yards, is it possible to have a max of 5177, or is this bad data?
Group Data by Position
• Here we are grouping with the by function and getting the mean of 4 columns
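A sketch of grouping with `by`. The four metric columns chosen here (`Age`, `Games`, `Yards`, `Touchdowns`) are guesses for illustration; the transcript does not say which four columns the slide used.

```r
# Hypothetical stand-in for the session's dataset
players <- data.frame(
  Position   = c("QB", "RB", "WR", "RB", "QB", "WR"),
  Age        = c(24, 27, 31, 22, 35, 26),
  Games      = c(16, 12, 15, 9, 16, 14),
  Yards      = c(3100, 640, 880, 215, 4020, 760),
  Touchdowns = c(22, 5, 6, 1, 30, 4)
)

metric_cols <- c("Age", "Games", "Yards", "Touchdowns")

# by() splits the selected columns by Position and applies colMeans to each group
means_by_position <- by(players[, metric_cols], players$Position, colMeans)
means_by_position
```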
Visualizing Data
Grammar of Graphics Package
• ggplot2 provides many graphing and charting capabilities with R
• Based on The Grammar of Graphics by Leland Wilkinson
Bar Chart
• Let’s view our distribution by Age. Since this is basically discrete data, we’ll use a Bar Chart.
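A minimal ggplot2 sketch of a bar chart by age. It assumes ggplot2 is installed, and the `players` data frame and its `Age` values are made up; the slide's own code is not in the transcript.

```r
library(ggplot2)

# Hypothetical ages standing in for the session's dataset
players <- data.frame(Age = c(22, 23, 23, 24, 24, 24, 25, 27, 27, 31))

# geom_bar() counts the rows for each distinct x value;
# factor() treats each age as its own category
age_chart <- ggplot(players, aes(x = factor(Age))) +
  geom_bar() +
  labs(x = "Age", y = "Number of players")

age_chart
```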
Histogram
• Our data imported into R with Factors for some metrics
• Change to Int by converting to a matrix then back to data frame
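The slide's matrix round trip is not reproduced in the transcript. As an alternative (not the presenter's method), a factor can be converted through its character labels. The sketch below also shows why calling `as.integer` directly on a factor is a trap; the values are invented.

```r
# A metric that was imported as a factor (hypothetical values)
yards <- factor(c("218", "5177", "0", "640"))

# Pitfall: this returns the internal level codes, not the original numbers
as.integer(yards)

# Convert the labels to character first, then to integer
yards_int <- as.integer(as.character(yards))
yards_int
```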
Histogram with Some Data Cleanup
• Removed low values
Distribution
• Density charts are often considered superior to histograms because you do not need to choose bins
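To see why bins matter, the sketch below compares two histograms of the same simulated data and then draws a density estimate. It uses base R rather than ggplot2 so it runs without extra packages, and the data are simulated, not the session's. Note that a density estimate still has a smoothing bandwidth (`bw`), which R chooses by default.

```r
set.seed(42)

# Simulated yards with two clusters (illustrative only)
yards <- c(rnorm(200, mean = 300, sd = 80), rnorm(100, mean = 900, sd = 120))

# The same data binned coarsely and finely gives different pictures
coarse <- hist(yards, breaks = 5,  plot = FALSE)
fine   <- hist(yards, breaks = 40, plot = FALSE)
length(coarse$counts)
length(fine$counts)

# A kernel density estimate needs no bin choice
dens <- density(yards)
plot(dens, main = "Density of simulated yards")
```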
Distribution with 0 value data back in
Quick Clean Up
rm removes a variable or dataset
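For example, with a throwaway object named `scratch` (a made-up name):

```r
scratch <- 1:10
exists("scratch")   # TRUE

# Remove the object from the environment
rm(scratch)
exists("scratch")   # FALSE
```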
Group the Chart by a Dimension
• We can add a “facet wrap” to group our charts by a dimension
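In ggplot2 this is the `facet_wrap` function, which draws one panel per value of the chosen column. The sketch assumes ggplot2 is installed and uses a hypothetical `players` data frame; the chart type on the slide is not known from the transcript, so a histogram is used here.

```r
library(ggplot2)

# Hypothetical stand-in for the session's dataset
players <- data.frame(
  Position = rep(c("QB", "RB", "WR"), each = 4),
  Age      = c(24, 35, 29, 31, 27, 22, 25, 23, 31, 26, 24, 28)
)

# One histogram panel per Position
faceted <- ggplot(players, aes(x = Age)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ Position)

faceted
```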
Distributions for Categorical Data
• Can get a count of how many records exist for each value in a table format
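Base R's `table` function produces this kind of count. The `position` values below are made up, and `useNA = "ifany"` is added so that missing values are counted too.

```r
# Hypothetical categorical column with one missing value
position <- c("QB", "RB", "WR", "RB", "QB", "WR", "WR", NA)

# Count the records for each distinct value, including NA if present
counts <- table(position, useNA = "ifany")
counts
```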
Distribution for 2 data points
• Can change this to a two-way cross-tab distribution
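Passing two columns to `table` gives a cross tab. The `Position` and `Team` columns here are hypothetical.

```r
# Hypothetical stand-in for the session's dataset
players <- data.frame(
  Position = c("QB", "RB", "WR", "RB", "QB", "WR"),
  Team     = c("A", "A", "B", "A", "B", "B")
)

# Rows are positions, columns are teams, cells are record counts
cross <- table(players$Position, players$Team)
cross

# Add row and column totals
addmargins(cross)
```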
Scatterplot with Regression
Add a Bar Chart to the Line
Stacked Bars are Rarely Helpful
What about Text fields?
Missing Data
NULL vs NA in R
R treats NA the way other languages treat NULL
| | NULL | NA |
|---|---|---|
| Definition | Null object, a reserved word | Logical constant of length 1 containing a missing value indicator |
| Behavior in Vector | Not allowed. Won’t save within vector. | Exists and represents missing value. |
| Behavior in List (such as Data Frame) | Can exist if not assigned but created with it. | Exists and represents missing value. |
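The differences in the table can be checked directly. This is a small sketch with made-up values.

```r
# In a vector, NULL is dropped while NA keeps its slot
with_null <- c(1, NULL, 3)
with_na   <- c(1, NA, 3)
length(with_null)   # 2
length(with_na)     # 3

# NA propagates through calculations unless it is removed explicitly
mean(with_na)                 # NA
mean(with_na, na.rm = TRUE)   # 2

# A list keeps a NULL element when it is created with one...
as_list <- list(a = 1, b = NULL)
length(as_list)      # 2
is.null(as_list$b)   # TRUE

# ...but assigning NULL to an existing element removes it
as_list$a <- NULL
length(as_list)      # 1
```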
Nulls on Import
Our dataset had nulls in it when we pulled it into R. How were they assigned?
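The slide's answer is not in the transcript, but the default behavior of `read.csv` can be shown on a small inline example: blank fields become NA in numeric columns and stay as empty strings in character columns. The names and values below are hypothetical.

```r
# Inline CSV with a blank numeric field, a blank text field and a literal NA
csv_text <- "Name,Position,Yards
Ann,QB,
Bo,,640
Cy,WR,NA"

players <- read.csv(text = csv_text, stringsAsFactors = FALSE)

is.na(players$Yards)   # blank and "NA" both become NA in a numeric column
players$Position       # the blank text field is kept as an empty string
```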
Finding Missing Data
But look what else we found in Jeff’s records!
Make Missing Data Consistent in R
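One way to do this is to map every placeholder that means "missing" onto NA. The placeholder list and the `players` data frame below are assumptions for illustration; the slide's own code is not in the transcript.

```r
# Hypothetical data with several different "missing" markers
players <- data.frame(
  Name     = c("Ann", "Bo", "Cy", "Di"),
  Position = c("QB", "", "N/A", "WR"),
  stringsAsFactors = FALSE
)

# Values that should all be treated as missing (assumed list)
missing_markers <- c("", "N/A", "NULL")

# Replace every marker with NA
players$Position[players$Position %in% missing_markers] <- NA
players$Position

# The same list can be given at import time: read.csv(..., na.strings = missing_markers)
```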
Check the whole dataset now
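A sketch of a whole-dataset check using `is.na` and `complete.cases`, again on a hypothetical `players` data frame.

```r
# Hypothetical data after missing values were made consistent
players <- data.frame(
  Name     = c("Ann", "Bo", "Cy", "Di"),
  Position = c("QB", NA, NA, "WR"),
  Yards    = c(3100, 640, NA, 880),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(players))
na_per_column

# Rows that have at least one missing value
incomplete_rows <- players[!complete.cases(players), ]
incomplete_rows
```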
What to do about missing & bad data?
Handling Bad Data in ETL
• Reject
• Clean & Fill In
• Load As Is
Using Data Quality Package
DataQualityR Package
Categorical Results
In Summary
• R gives you a quick and easy way to learn about your data before investing time into ETL
• Open source means no investment into tools
• R isn’t scary or all statistical and stuff!
Using R for Data Profiling
Session #1805
Michelle Kolbe
medium.com/@datacheesehead @mekolbe linkedin.com/in/michellekolbe
Fill out a session survey in the mobile app!