TRANSCRIPT
Introduction to Web Data Analytics Using R and Python
Mamata Jenamani
Associate Professor Department of Industrial & Systems Engineering,
Indian Institute of Technology, Kharagpur
http://kr.renesas.com/edge_ol/global/13/index.jsp
http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a
Predicted Demand
http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a
What is the right environment for analytics?
http://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html
http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages
The 2015 Top Ten Programming Languages
(Figure: 2015 ranking vs. 2014 ranking)
Recent trend in programming languages for teaching
http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-us-universities/fulltext
Recent trend in programming languages
http://blog.codeeval.com/codeevalblog/2016/2/2/most-popular-coding-languages-of-2016
Based on hundreds of thousands of data points collected by processing more than 1,200,000 challenge submissions in 26 different programming languages.
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
Recent trend in programming languages
Rank  Language      Share    Trend
1     Java          23.8 %   -0.7 %
2     Python        13.0 %   +2.3 %
3     PHP           10.5 %   -0.7 %
4     C#             9.0 %   -0.3 %
5     JavaScript     7.7 %   +0.7 %
6     C++            7.2 %   -0.4 %
7     C              7.0 %   -0.2 %
8     Objective-C    4.5 %   -0.8 %
9     R              3.2 %   +0.6 %
Worldwide change in Github contribution, July 2016 compared to a year ago:
http://pypl.github.io/PYPL.html
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
Recent trend in programming languages
https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community
http://www.kdnuggets.com/polls/2015/analytics-data-mining-data-science-software-used.html
Top 10 most in-demand skills in big data market
http://www.forbes.com/sites/louiscolumbus/2014/12/29/where-big-data-jobs-will-be-in-2015/#48c70566404a
http://www.forbes.com/sites/louiscolumbus/2015/11/16/where-big-data-jobs-will-be-in-2016/#14b0163df7f1
What is Analytics
• Analytics is the scientific process of transforming data into insights for the purpose of making better decisions. The Institute for Operations Research and the Management Sciences (INFORMS)
https://www.informs.org/About-INFORMS/News-Room/INFORMS-in-the-News/Best-definition-of-analytics
• Analytics is the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.269.7294&rep=rep1&type=pdf
• Analytics is the discovery, interpretation, and communication of meaningful patterns in data. Analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favours data visualization to communicate insight.
https://en.wikipedia.org/wiki/Analytics
Data Analysis and Data Analytics
• Data analysis refers to hands-on data exploration and evaluation.
• Data analytics is a broader term and includes data analysis as a necessary subcomponent. Analytics defines the science behind the analysis: understanding the cognitive processes an analyst uses to understand problems and explore data in meaningful ways.
• Analytics also includes data extraction, transformation, and loading; specific tools, techniques, and methods; and how to successfully communicate results.
http://www.kdnuggets.com/2015/02/interview-david-kasik-boeing-data-analytics.html
Data Mining and Data Analytics
• Data analytics is distinguished from data mining by the scope, purpose and focus of the analysis. Data miners sort through huge data sets using sophisticated software to identify undiscovered patterns and establish hidden relationships. Data analytics focuses on inference, the process of deriving a conclusion based on what is already known by the researcher. http://searchdatamanagement.techtarget.com/definition/data-analytics
Web Data Analytics
• Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.
• Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining techniques due to the heterogeneity and semi-structured or unstructured nature of the Web data.
Bing Liu, Web Data Mining
• Web data analytics aims to use the discovered information to aid decision making.
Web data mining categories
Web structure mining
• Discovers useful knowledge from hyperlinks
• Ex. Discovering communities of users who share common interests (Social Network Mining)
Web content mining
• Extracts useful information from Web page contents
• Ex. Analyzing customer reviews and forum postings to discover consumer opinions and sentiment
Web usage mining
• Discovers user access patterns from Web usage logs (clickstream data)
• Ex. Web user behavior modeling, Website personalization
Data analytics steps
Pre-processing
• Discretization, data cleaning, data integration, data transformation, data reduction
Processing
• Application of data mining (statistics and machine learning) and operations research tools
Post-processing
• Application of evaluation and visualization techniques
Decision making
• Application of domain knowledge to interpret the data for decision making
Characteristics of Data
• Data on user, item, or rating are described by a random variable and are categorized as continuous or discrete.
• Continuous Variable: A variable that can assume any value on a continuous scale within a range is said to be continuous.
  – Examples: 1) time spent by a buyer on a particular page, 2) interestingness of a joke as rated by the user in Jester
• Discrete Variable: Variables that can assume a finite or countably infinite number of values are said to be discrete.
  – Examples: 1) profession of a user, 2) rating of a product on Amazon
Scales of measurement
Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
– One-to-one correspondence between the numbers and the characteristics being measured.
– The rules for assigning numbers should be standardized and applied uniformly.
– Rules must not change over objects or time.
• Scaling involves creating a continuum upon which measured objects are located.
Naresh K. Malhotra, Marketing Research: An Applied Orientation, Pearson, 6th edition, 2009
Characteristics of Scale
Description: By description, we mean the unique labels or descriptors that are used to designate each value of the scale. All scales possess description.
Order: By order, we mean the relative sizes or positions of the descriptors. Order is denoted by descriptors such as greater than, less than, and equal to.
Distance: The characteristic of distance means that absolute differences between the scale descriptors are known and may be expressed in units.
Origin: The origin characteristic means that the scale has a unique or fixed beginning or true zero point.
Primary Scales of Measurement (runner example)
Scale     Illustration                             Values
Nominal   Numbers assigned to runners              7, 3, 8
Ordinal   Rank order of winners                    Third place, Second place, First place
Interval  Performance rating on a 0 to 10 scale    8.2, 9.1, 9.6
Ratio     Time to finish in seconds                15.2, 14.1, 13.4
Primary Scales of Measurement

Nominal
• Basic characteristics: numbers identify and classify objects
• Common examples: Social Security nos., numbering of football players
• Permissible statistics: percentages, mode (descriptive); chi-square, binomial test (inferential)

Ordinal
• Basic characteristics: numbers indicate the relative positions of objects but not the magnitude of differences between them
• Common examples: quality rankings, rankings of teams in a tournament
• Permissible statistics: percentile, median (descriptive); rank-order correlation, Friedman ANOVA (inferential)

Interval
• Basic characteristics: differences between objects can be compared; the zero point is arbitrary
• Common examples: temperature (Fahrenheit, Celsius)
• Permissible statistics: range, mean, standard deviation (descriptive); product-moment correlation, t tests, regression (inferential)

Ratio
• Basic characteristics: the zero point is fixed; ratios of scale values can be compared
• Common examples: length, weight
• Permissible statistics: geometric mean, harmonic mean (descriptive); coefficient of variation (inferential)
Statistical methods for understanding the data

Descriptive Statistics
• Descriptive statistics are brief descriptive coefficients that summarize a given data set.
• Univariate analysis: involves describing the distribution of a single variable, including
  – Central tendency: mean, median, and mode
  – Dispersion: range, quantiles, interquartile range
  – Spread: variance and standard deviation
  – Shape of the distribution: skewness and kurtosis
  – Characteristics of a variable's distribution in graphical form: histograms, stem-and-leaf plots, box plots
• Bivariate analysis: to describe the relationship between pairs of variables.
  – Cross-tabulations and contingency tables
  – Graphical representation via scatterplots
  – Quantitative measures of dependence: correlation, covariance
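As a quick illustration of the bivariate measures, here is a minimal Python sketch that computes covariance and Pearson correlation from first principles; the paired observations are made up for illustration only.

```python
import statistics

# Hypothetical paired observations: time on site (minutes) vs. pages viewed
x = [2.0, 3.5, 5.0, 6.5, 8.0]
y = [3, 4, 6, 7, 9]

n = len(x)
mean_x, mean_y = statistics.mean(x), statistics.mean(y)

# Sample covariance: average co-deviation from the means (n - 1 denominator)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Pearson correlation: covariance scaled by both standard deviations
corr_xy = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(cov_xy)
print(corr_xy)
```

A correlation near +1 here confirms what a scatterplot of the two variables would show: the pairs move together almost linearly.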
Univariate analysis for understanding the data
• Central tendency: mean, median, and mode
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
Mean ≈ 24, Median = 24, Mode = 24
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517
Mean ≈ 54, Median = 24.5, Mode = 24
The median is less sensitive to outliers (extreme scores) than the mean and thus a better measure than the mean for highly skewed distributions
Univariate analysis for understanding the data
• Dispersion
  – Range, quantiles, interquartile range
  – Spread: variance and standard deviation

Minimum, First Quartile, Median, Third Quartile, Maximum
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
• Range = Max - Min = 44
• Standard Deviation (SD) = 11.2
• Variance = SD^2 = 126.4
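The dispersion measures for this sample can be reproduced with the standard library (`statistics.quantiles` requires Python 3.8+; its default quartile convention may differ slightly from other tools):

```python
import statistics

data = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]

data_range = max(data) - min(data)            # Range = Max - Min = 44
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles of the sample
iqr = q3 - q1                                 # interquartile range
sd = statistics.stdev(data)                   # sample standard deviation ≈ 11.2
var = statistics.variance(data)               # sample variance ≈ 126.4

print(data_range, q1, q2, q3, iqr, sd, var)
```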
Univariate analysis for understanding the data
• Skewness: measures asymmetry of data
  – Positive or right skewed: longer right tail
  – Negative or left skewed: longer left tail
Let x1, x2, ..., xn be n observations with mean x̄. Then

  Skewness = [ (1/n) Σ (xi − x̄)^3 ] / [ (1/n) Σ (xi − x̄)^2 ]^(3/2)
• Kurtosis: measures peakedness of the distribution of data. The kurtosis of a normal distribution is 0.

Let x1, x2, ..., xn be n observations with mean x̄. Then

  Kurtosis = [ (1/n) Σ (xi − x̄)^4 ] / [ (1/n) Σ (xi − x̄)^2 ]^2 − 3
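Translating the two moment formulas above directly into Python gives a minimal sketch (the function names are mine; SciPy's `scipy.stats.skew` and `scipy.stats.kurtosis` implement the same biased, excess-kurtosis definitions by default):

```python
import statistics

def skewness(xs):
    """Third standardized moment: measures asymmetry of the data."""
    n = len(xs)
    m = statistics.mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5

def kurtosis(xs):
    """Excess kurtosis: 0 for a normal distribution, per the formula above."""
    n = len(xs)
    m = statistics.mean(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n   # second central moment
    m4 = sum((x - m) ** 4 for x in xs) / n   # fourth central moment
    return m4 / m2 ** 2 - 3
```

A perfectly symmetric sample gives skewness 0; a sample with a long right tail gives a positive value.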
Univariate analysis for understanding the data
• Characteristics of a variable's distribution in graphical form – Bar diagram and Pie charts are used for categorical variables – Histogram and Box-plot are used for numerical variable.
Figure 3: Age Distribution (histogram of the number of subjects by age in months, 40 to 140+)

Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60
Data Preparation (Preprocessing)
• Data in the real world has many problems
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • Example: occupation=""
  – noisy: containing errors or outliers
    • Example: Salary="-10", Age="222"
  – inconsistent: containing discrepancies in codes or names
    • Example: Age="42", Birthday="03/07/1997"
• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
  – Computational stability of the algorithms
1. http://paginas.fe.up.pt/~ec/files_0910/slides/aula_2_DataPreparation.pdf
2. Han and Kamber, Data Mining: Concepts and Techniques, Chapter 2
3. http://www.cs.unm.edu/~mueen/Teaching/CS591/Lectures/3_Data.pdf
Major Tasks in Preprocessing
• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume but produces the same or similar analytical results
Discretization of Continuous Variables
• Divide the range of a continuous attribute into intervals
  – Some methods require discrete values, e.g. most versions of Naïve Bayes
  – Reduce data size by discretization
  – Prepare for further analysis
• Useful for generating a summary of data
• Also called binning
  – Equal width binning
  – Equal height binning
  – Other methods: entropy based, Holte's 1R
Binning
• Equal width binning
  – Divides the range into N intervals of equal size (range): a uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N
• Equal height binning
  – Divides the range into N intervals, each containing approximately the same number of samples
  – Generally preferred because it avoids clumping
  – In practice, "almost-equal" height binning is used to give more intuitive break points
  – Additional considerations:
    • don't split frequent values across bins
    • create separate bins for special values (e.g., 0)
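The two binning schemes above can be sketched in a few lines of Python (function names are mine, not from any library; pandas offers the same ideas as `pd.cut` and `pd.qcut`):

```python
def equal_width_bins(values, n_bins):
    """Equal-width binning: split [min, max] into n_bins intervals
    of width W = (B - A) / N, as in the slide."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Map each value to a bin index 0..n_bins-1 (the maximum lands in the last bin)
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_height_bins(values, n_bins):
    """Equal-height (equal-frequency) binning: roughly the same
    number of samples in every bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / n_bins
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), n_bins - 1)
    return bins
```

On skewed data, equal-width bins clump most samples into one or two intervals, while equal-height bins keep the counts balanced, which is why the latter is generally preferred.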
Data cleaning
• Fill in missing values (manual vs. automatic)
  – Ignore
  – Constant: "unknown", a new class?!
  – Attribute mean (of entire set or subset)
  – Most probable value: inference-based
• Identify outliers and smooth out noisy data
  – Binning method
  – Clustering
  – Combined computer and human inspection
  – Regression
• Correct inconsistent data
• Resolve redundancy caused by data integration
Outlier Detection in Univariate Data
• Compute the mean and standard deviation. If a value is two or three standard deviations away from the mean, it may be considered an outlier.
• An observation is an extreme outlier if it lies outside the interval (Q1 - 3×IQR, Q3 + 3×IQR), and a mild outlier if it lies outside the interval (Q1 - 1.5×IQR, Q3 + 1.5×IQR), where IQR = Q3 - Q1 is the interquartile range.
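The IQR fences translate into a short Python helper (a sketch using the default quartile convention of `statistics.quantiles`, Python 3.8+; other quartile conventions can shift the fences slightly):

```python
import statistics

def classify_outliers(data):
    """Flag mild and extreme outliers using the IQR fences above."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in data:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)        # outside the extreme fences
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)           # outside the mild fences only
    return mild, extreme
```

Applied to the earlier sample with 517 appended, only 517 falls outside the extreme fences.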
Outlier Detection in Multivariate Data
• Statistical methods
  – Mahalanobis distance
  – Outliers: multivariate data points with large distances
• Data mining methods
  – Distance-based measures: an observation is a distance-based outlier if at least a fraction β of the observations in the dataset are further than r from it.
  – Clustering-based methods consider clusters of small size, including clusters of a single observation, as outliers.
http://www2.cs.uh.edu/~ceick/7362/T1-1.pdf
Handling missing values
• Ignore records (use only cases with all values)
  – Usually done when the class label is missing, as most prediction methods do not handle missing data well
  – Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  – Use only features (attributes) with all values (may leave out important features)
• Fill in the missing value manually
  – tedious + infeasible?
Handling missing values
• Use a global constant to fill in the missing value
  – e.g., "unknown" (may create a new class!)
• Use the attribute mean to fill in the missing value
  – It will do the least harm to the mean of the existing data
• Use the attribute mean for all samples belonging to the same class to fill in the missing value
• Use the most probable value to fill in the missing value
  – Inference-based, such as a Bayesian formula or decision tree
  – Identify relationships among variables
    • Linear regression, multiple linear regression, nonlinear regression
  – Nearest-neighbour estimator
    • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
    • Finding neighbours in a large dataset may be slow
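For example, mean imputation (the "least harm to the mean" option above) takes only a few lines of Python; here `None` marks missing entries in a made-up age column:

```python
import statistics

# Hypothetical records with None marking missing ages
ages = [25, 31, None, 40, None, 28]

# Mean imputation: fill gaps with the mean of the observed values,
# which leaves the overall mean of the attribute unchanged
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)
filled = [a if a is not None else mean_age for a in ages]

print(filled)
```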
Data Integration
• Combines data from multiple sources into a coherent store
• Removes redundancies
• Schema integration
  – integrate metadata from different sources
  – Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
  – for the same real-world entity, attribute values from different sources differ
  – possible reasons: different representations, different scales, e.g., metric vs. British units
Data transformation
• Smoothing: remove noise from data
  – binning, regression, clustering
• Aggregation
  – summarization, data cube construction
• Generalization: concept hierarchy climbing
• Attribute/feature construction
  – New attributes constructed from the given ones (e.g., add an attribute Area based on height and width)
• Normalization
  – Scale values to fall within a smaller specified range
Data Normalization
• min-max normalization
  v' = (v - min) / (max - min) × (new_max - new_min) + new_min
• z-score normalization (standardization)
  v' = (v - mean) / standard_deviation
• normalization by decimal scaling
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
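The three normalization schemes can be sketched in Python as follows (function names are mine; scikit-learn's `MinMaxScaler` and `StandardScaler` implement the first two):

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score standardization: zero mean, unit (sample) standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of 10 making all |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

Min-max preserves the shape of the original distribution but is sensitive to outliers in min and max; z-score is the usual choice when downstream methods assume centered data.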
Data Reduction
• Data cube aggregation
  – Aggregation operations are applied to the data in the construction of a data cube
• Attribute subset selection
  – Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed
  – Sometimes according to your knowledge of the business
• Dimensionality reduction
  – Encoding mechanisms are used to reduce the data set size
• Numerosity reduction
  – The data are replaced or estimated by alternative, smaller data representations
Data cube aggregation
Attribute subset selection
Dimensionality reduction
• Matrix decomposition methods
  – Singular Value Decomposition
• Principal Component Analysis
  – Finds the major directions: computes k orthonormal vectors
• Signal processing techniques
  – Discrete Fourier Transform
  – Discrete Wavelet Transform
Numerosity reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
  – Parametric methods
    • Assume the data fits some model, estimate the model parameters, and use the estimated values instead of the actual data
    • Regression and log-linear models
  – Non-parametric methods
    • Do not assume models
    • Histograms, clustering, sampling
  – Discretization and concept hierarchy generation
    • Raw data values for attributes are replaced by ranges or higher conceptual levels
    • Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies
Parametric Methods: Regression
Non-parametric methods
Concept Hierarchy Generation and Discretization
Sampling