TRANSCRIPT
Introduction to Web Data Analytics Using R and Python
Mamata Jenamani
Associate Professor Department of Industrial & Systems Engineering,
Indian Institute of Technology, Kharagpur
http://kr.renesas.com/edge_ol/global/13/index.jsp
http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a
Predicted Demand
http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a
What is the right environment for analytics?
http://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html
http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages
The 2015 Top Ten Programming Languages
(Figure: 2015 ranking vs. 2014 ranking)
Recent trend in programming languages for teaching
http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-us-universities/fulltext
Recent trend in programming languages
http://blog.codeeval.com/codeevalblog/2016/2/2/most-popular-coding-languages-of-2016
Based on hundreds of thousands of data points collected by processing more than 1,200,000 challenge submissions in 26 different programming languages.
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
Recent trend in programming languages
Rank  Language      Share    Trend
1     Java          23.8 %   -0.7 %
2     Python        13.0 %   +2.3 %
3     PHP           10.5 %   -0.7 %
4     C#             9.0 %   -0.3 %
5     JavaScript     7.7 %   +0.7 %
6     C++            7.2 %   -0.4 %
7     C              7.0 %   -0.2 %
8     Objective-C    4.5 %   -0.8 %
9     R              3.2 %   +0.6 %
Worldwide change in Github contribution, July 2016 compared to a year ago:
http://pypl.github.io/PYPL.html
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
http://www.kdnuggets.com/polls/2015/analytics-data-mining-data-science-software-used.html
Top 10 most in-demand skills in big data market
http://www.forbes.com/sites/louiscolumbus/2014/12/29/where-big-data-jobs-will-be-in-2015/#48c70566404a
http://www.forbes.com/sites/louiscolumbus/2015/11/16/where-big-data-jobs-will-be-in-2016/#14b0163df7f1
What is Analytics
• Analytics is the scientific process of transforming data into insights for the purpose of making better decisions. The Institute for Operations Research and the Management Sciences (INFORMS)
https://www.informs.org/About-INFORMS/News-Room/INFORMS-in-the-News/Best-definition-of-analytics
• Analytics is the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.269.7294&rep=rep1&type=pdf
• Analytics is the discovery, interpretation, and communication of meaningful patterns in data. Analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favours data visualization to communicate insight.
https://en.wikipedia.org/wiki/Analytics
Data Analysis and Data Analytics
• Data analysis refers to hands-on data exploration and evaluation.
• Data analytics is a broader term and includes data analysis as a necessary subcomponent. Analytics defines the science behind the analysis: understanding the cognitive processes an analyst uses to understand problems and explore data in meaningful ways.
• Analytics also includes data extraction, transformation, and loading; specific tools, techniques, and methods; and how to successfully communicate results.
http://www.kdnuggets.com/2015/02/interview-david-kasik-boeing-data-analytics.html
Data Mining and Data Analytics
• Data analytics is distinguished from data mining by the scope, purpose and focus of the analysis. Data miners sort through huge data sets using sophisticated software to identify undiscovered patterns and establish hidden relationships. Data analytics focuses on inference, the process of deriving a conclusion based on what is already known by the researcher. http://searchdatamanagement.techtarget.com/definition/data-analytics
Web Data Analytics
• Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.
• Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining techniques due to the heterogeneity and semi-structured or unstructured nature of the Web data.
Bing Liu, Web Data Mining
• Web data analytics aims to use the discovered information to aid decision making.
Web data mining categories
Web structure mining
• Discovers useful knowledge from hyperlinks
• Ex. Discovering communities of users who share common interests (Social Network Mining)
Web content mining
• Extracts useful information from Web page contents
• Ex. Analyzing customer reviews and forum postings to discover consumer opinions and sentiment
Web usage mining
• Discovers user access patterns from Web usage logs (clickstream data)
• Ex. Web user behavior modeling, Website personalization
Data analytics steps
Pre-processing
• Discretization, data cleaning, data integration, data transformation, data reduction
Processing
• Application of data mining (statistics and machine learning) and operations research tools
Post-processing
• Application of evaluation and visualization techniques
Decision making
• Application of domain knowledge to interpret the data for decision making
Characteristics of Data
• Data on user, item, or rating are described by a random variable and are categorized as continuous or discrete.
• Continuous Variable: A variable that can assume any value on a continuous scale within a range is said to be continuous.
  – Examples: 1) time spent by a buyer on a particular page, 2) interestingness of a joke as rated by the user in Jester
• Discrete Variable: Variables that can assume a finite or countably infinite number of values are said to be discrete.
  – Examples: 1) profession of a user, 2) rating of a product on Amazon
Scales of measurement
Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
– One-to-one correspondence between the numbers and the characteristics being measured.
– The rules for assigning numbers should be standardized and applied uniformly.
– Rules must not change over objects or time.
• Scaling involves creating a continuum upon which measured objects are located.
Naresh K. Malhotra, Marketing Research: An Applied Orientation, Pearson, 6th edition, 2009
Characteristics of Scale
Description: By description, we mean the unique labels or descriptors that are used to designate each value of the scale. All scales possess description.
Order: By order, we mean the relative sizes or positions of the descriptors. Order is denoted by descriptors such as greater than, less than, and equal to.
Distance: The characteristic of distance means that absolute differences between the scale descriptors are known and may be expressed in units.
Origin: The origin characteristic means that the scale has a unique or fixed beginning or true zero point.
Primary Scales of Measurement (runner example)
Scale     Illustration                             Values
Nominal   Numbers assigned to runners              7, 3, 8
Ordinal   Rank order of winners                    Third place, Second place, First place
Interval  Performance rating on a 0 to 10 scale    8.2, 9.1, 9.6
Ratio     Time to finish in seconds                15.2, 14.1, 13.4
Primary Scales of Measurement

Nominal
• Basic characteristics: numbers identify and classify objects
• Common examples: Social Security nos., numbering of football players
• Permissible statistics: percentages, mode (descriptive); chi-square, binomial test (inferential)

Ordinal
• Basic characteristics: numbers indicate the relative positions of objects but not the magnitude of differences between them
• Common examples: quality rankings, rankings of teams in a tournament
• Permissible statistics: percentile, median (descriptive); rank-order correlation, Friedman ANOVA (inferential)

Interval
• Basic characteristics: differences between objects can be compared; the zero point is arbitrary
• Common examples: temperature (Fahrenheit, Celsius)
• Permissible statistics: range, mean, standard deviation (descriptive); product-moment correlation, t tests, regression (inferential)

Ratio
• Basic characteristics: the zero point is fixed; ratios of scale values can be compared
• Common examples: length, weight
• Permissible statistics: geometric mean, harmonic mean (descriptive); coefficient of variation (inferential)
Statistical methods for understanding the data

Descriptive Statistics
• Descriptive statistics are brief descriptive coefficients that summarize a given data set.
• Univariate analysis: involves describing the distribution of a single variable, including
  – Central tendency: mean, median, and mode
  – Dispersion: range, quantiles, interquartile range
  – Spread: variance and standard deviation
  – Shape of the distribution: skewness and kurtosis
  – Characteristics of a variable's distribution in graphical form: histograms, stem-and-leaf plots, box plots
• Bivariate analysis: to describe the relationship between pairs of variables.
  – Cross-tabulations and contingency tables
  – Graphical representation via scatterplots
  – Quantitative measures of dependence: correlation, covariance
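As a quick illustration of the bivariate measures, here is a minimal Python sketch that computes covariance and Pearson correlation from first principles; the paired observations are made up for illustration only.

```python
import statistics

# Hypothetical paired observations: time on site (minutes) vs. pages viewed
x = [2.0, 3.5, 5.0, 6.5, 8.0]
y = [3, 4, 6, 7, 9]

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Sample covariance: average co-deviation from the means (n - 1 denominator)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Pearson correlation: covariance scaled by both standard deviations
corr_xy = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(cov_xy)
print(corr_xy)
```

A correlation near +1 here confirms what a scatterplot of the two variables would show: the pairs move together almost linearly.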
Univariate analysis for understanding the data
• Central tendency: mean, median, and mode
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
Mean ≈ 24, Median = 24, Mode = 24
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517
Mean ≈ 54, Median = 24.5, Mode = 24
The median is less sensitive to outliers (extreme scores) than the mean and thus a better measure than the mean for highly skewed distributions
Univariate analysis for understanding the data
• Dispersion
  – Range, quantiles, interquartile range
  – Spread: variance and standard deviation

Minimum, First Quartile, Median, Third Quartile, Maximum
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
• Range = Max - Min = 44
• Standard Deviation (SD) = 11.2
• Variance = SD^2 = 126.4
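The dispersion measures for this sample can be reproduced with the standard library (`statistics.quantiles` requires Python 3.8+; its default quartile convention may differ slightly from other tools):

```python
import statistics

data = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]

data_range = max(data) - min(data)            # Range = Max - Min = 44
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles of the sample
iqr = q3 - q1                                 # interquartile range
sd = statistics.stdev(data)                   # sample standard deviation ≈ 11.2
var = statistics.variance(data)               # sample variance ≈ 126.4

print(data_range, q1, q2, q3, iqr, sd, var)
```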
Univariate analysis for understanding the data
• Skewness: measures asymmetry of data
  – Positive or right skewed: longer right tail
  – Negative or left skewed: longer left tail
Let x1, x2, ..., xn be n observations with mean x̄. Then

  Skewness = [ (1/n) Σ (xi − x̄)^3 ] / [ (1/n) Σ (xi − x̄)^2 ]^(3/2)
• Kurtosis: measures peakedness of the distribution of data. The kurtosis of a normal distribution is 0.

Let x1, x2, ..., xn be n observations with mean x̄. Then

  Kurtosis = [ (1/n) Σ (xi − x̄)^4 ] / [ (1/n) Σ (xi − x̄)^2 ]^2 − 3
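Translating the two moment formulas above directly into Python gives a minimal sketch (the function names are mine; SciPy's `scipy.stats.skew` and `scipy.stats.kurtosis` implement the same biased, excess-kurtosis definitions by default):

```python
import statistics

def skewness(xs):
    """Third standardized moment: measures asymmetry of the data."""
    n = len(xs)
    m = statistics.mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5

def kurtosis(xs):
    """Excess kurtosis: 0 for a normal distribution, per the formula above."""
    n = len(xs)
    m = statistics.mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m4 = sum((x - m) ** 4 for x in xs) / n   # fourth central moment
    return m4 / m2 ** 2 - 3
```

A perfectly symmetric sample gives skewness 0; a sample with a long right tail gives a positive value.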
Univariate analysis for understanding the data
• Characteristics of a variable's distribution in graphical form – Bar diagram and Pie charts are used for categorical variables – Histogram and Box-plot are used for numerical variable.
Figure 3: Age Distribution (histogram of the number of subjects by age in months, 40 to 140+)

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60
Data Preparation (Preprocessing)
• Data in the real world has many problems
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • Example: occupation=""
  – noisy: containing errors or outliers
    • Example: Salary="-10", Age="222"
  – inconsistent: containing discrepancies in codes or names
    • Example: Age="42", Birthday="03/07/1997"
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
  – Computational stability of the algorithms
1. http://paginas.fe.up.pt/~ec/files_0910/slides/aula_2_DataPreparation.pdf
2. Han and Kamber, Data Mining: Concepts and Techniques, Chapter 2
3. http://www.cs.unm.edu/~mueen/Teaching/CS591/Lectures/3_Data.pdf
Major Tasks in Preprocessing
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume but produces the same or similar analytical results
Discretization of Continuous Variables
• Divide the range of a continuous attribute into intervals
  – Some methods require discrete values, e.g. most versions of Naïve Bayes
  – Reduce data size by discretization
  – Prepare for further analysis
• Useful for generating a summary of data
• Also called binning
  – Equal width binning
  – Equal height binning
  – Other methods: entropy based, Holte's 1R
Binning
• Equal width binning
  – Divides the range into N intervals of equal size (range): a uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N
• Equal height binning
  – Divides the range into N intervals, each containing approximately the same number of samples
  – Generally preferred because it avoids clumping
  – In practice, "almost-equal" height binning is used to give more intuitive break points
  – Additional considerations:
    • don't split frequent values across bins
    • create separate bins for special values (e.g., 0)
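The two binning schemes above can be sketched in a few lines of Python (function names are mine, not from any library; pandas offers the same ideas as `pd.cut` and `pd.qcut`):

```python
def equal_width_bins(values, n_bins):
    """Equal-width binning: split [min, max] into n_bins intervals
    of width W = (B - A) / N, as in the slide."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Map each value to a bin index 0..n_bins-1 (the maximum lands in the last bin)
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_height_bins(values, n_bins):
    """Equal-height (equal-frequency) binning: roughly the same
    number of samples in every bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins
```

On skewed data, equal-width bins clump most samples into one or two intervals, while equal-height bins keep the counts balanced, which is why the latter is generally preferred.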
Data cleaning
• Fill in missing values (manual vs. automatic)
  – Ignore
  – Constant: "unknown", a new class?!
  – Attribute mean (of entire set or subset)
  – Most probable value: inference-based
• Identify outliers and smooth out noisy data
  – Binning method
  – Clustering
  – Combined computer and human inspection
  – Regression
• Correct inconsistent data
• Resolve redundancy caused by data integration
Outlier Detection in Univariate Data
• Compute the mean and standard deviation. If a value is two or three standard deviations away from the mean, it may be considered an outlier.
• An observation is an extreme outlier if it lies outside the interval (Q1 - 3×IQR, Q3 + 3×IQR), and a mild outlier if it lies outside the interval (Q1 - 1.5×IQR, Q3 + 1.5×IQR), where IQR = Q3 - Q1 is the interquartile range.
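The IQR fences translate into a short Python helper (a sketch using the default quartile convention of `statistics.quantiles`, Python 3.8+; other quartile conventions can shift the fences slightly):

```python
import statistics

def classify_outliers(data):
    """Flag mild and extreme outliers using the IQR fences above."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in data:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)        # outside the extreme fences
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)           # outside the mild fences only
    return mild, extreme
```

Applied to the earlier sample with 517 appended, only 517 falls outside the extreme fences.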
Outlier Detection in Multivariate Data
• Statistical methods
  – Mahalanobis distance
  – Outliers: multivariate data points with large distances
• Data mining methods
  – Distance-based measures: an observation is a distance-based outlier if at least a fraction β of the observations in the dataset are further than r from it.
  – Clustering-based methods consider clusters of small size, including clusters of a single observation, as outliers.
http://www2.cs.uh.edu/~ceick/7362/T1-1.pdf
Handling missing values
• Ignore records (use only cases with all values)
  – Usually done when the class label is missing, as most prediction methods do not handle missing data well
  – Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  – Use only features (attributes) with all values (may leave out important features)
• Fill in the missing value manually
  – tedious + infeasible?
Handling missing values
• Use a global constant to fill in the missing value
  – e.g., "unknown" (may create a new class!)
• Use the attribute mean to fill in the missing value
  – It will do the least harm to the mean of the existing data
• Use the attribute mean for all samples belonging to the same class to fill in the missing value
• Use the most probable value to fill in the missing value
  – Inference-based, such as a Bayesian formula or decision tree
  – Identify relationships among variables
    • Linear regression, multiple linear regression, nonlinear regression
  – Nearest-neighbour estimator
    • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
    • Finding neighbours in a large dataset may be slow
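For example, mean imputation (the "least harm to the mean" option above) takes only a few lines of Python; here `None` marks missing entries in a made-up age column:

```python
import statistics

# Hypothetical records with None marking missing ages
ages = [25, 31, None, 40, None, 28]

# Mean imputation: fill gaps with the mean of the observed values,
# which leaves the overall mean of the attribute unchanged
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)
filled = [a if a is not None else mean_age for a in ages]

print(filled)
```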
Data Integration
• Combines data from multiple sources into a coherent store
• Removes redundancies
• Schema integration
  – integrate metadata from different sources
  – Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real-world entity, attribute values from different sources differ
  – possible reasons: different representations, different scales, e.g., metric vs. British units
Data transformation
• Smoothing: remove noise from data
  – binning, regression, clustering
• Aggregation
  – summarization, data cube construction
• Generalization: concept hierarchy climbing
• Attribute/feature construction
  – New attributes constructed from the given ones (e.g., add an attribute Area based on height and width)
• Normalization
  – Scale values to fall within a smaller specified range
Data Normalization
• min-max normalization
  v' = (v - min) / (max - min) × (new_max - new_min) + new_min
• z-score normalization (standardization)
  v' = (v - mean) / standard_deviation
• normalization by decimal scaling
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
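The three normalization schemes can be sketched in Python as follows (function names are mine; scikit-learn's `MinMaxScaler` and `StandardScaler` implement the first two):

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score standardization: zero mean, unit (sample) standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of 10 making all |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

Min-max preserves the shape of the original distribution but is sensitive to outliers in min and max; z-score is the usual choice when downstream methods assume centered data.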
Data Reduction
• Data cube aggregation
  – Aggregation operations are applied to the data in the construction of a data cube
• Attribute subset selection
  – Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed
  – Sometimes according to your knowledge of the business
• Dimensionality reduction
  – Encoding mechanisms are used to reduce the data set size
• Numerosity reduction
  – The data are replaced or estimated by alternative, smaller data representations
Data cube aggregation
Attribute subset selection
Dimensionality reduction
• Matrix decomposition methods
  – Singular Value Decomposition
• Principal Component Analysis
  – Finds the major directions: computes k orthonormal vectors
• Signal processing techniques
  – Discrete Fourier Transform
  – Discrete Wavelet Transform
Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
  – Parametric methods
    • Assume the data fits some model, estimate the model parameters, and use the estimated values instead of the actual data
    • Regression and log-linear models
  – Non-parametric methods
    • Do not assume models
    • Histograms, clustering, sampling
  – Discretization and concept hierarchy generation
    • Raw data values for attributes are replaced by ranges or higher conceptual levels
    • Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies
Parametric Methods: Regression
Non-parametric methods
Concept Hierarchy Generation and Discretization
Sampling