introduction to web data analytics using r and python

Introduction to Web Data Analytics Using R and Python Mamata Jenamani Associate Professor Department of Industrial & Systems Engineering, Indian Institute of Technology, Kharagpur

Upload: sujoy-bag

Post on 10-Feb-2017


TRANSCRIPT

Page 1: Introduction to Web Data Analytics Using R and Python

Introduction to Web Data Analytics Using R and Python

Mamata Jenamani

Associate Professor Department of Industrial & Systems Engineering,

Indian Institute of Technology, Kharagpur

Page 2: Introduction to Web Data Analytics Using R and Python
Page 3: Introduction to Web Data Analytics Using R and Python

http://kr.renesas.com/edge_ol/global/13/index.jsp

Page 4: Introduction to Web Data Analytics Using R and Python

http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a

Page 5: Introduction to Web Data Analytics Using R and Python

Predicted Demand

http://www.forbes.com/sites/louiscolumbus/2015/05/25/roundup-of-analytics-big-data-business-intelligence-forecasts-and-market-estimates-2015/#5c7790b4869a

Page 6: Introduction to Web Data Analytics Using R and Python

What is the right environment for analytics?

Page 7: Introduction to Web Data Analytics Using R and Python

http://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html

Page 8: Introduction to Web Data Analytics Using R and Python

http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages

The 2015 Top Ten Programming Languages

2015 ranking 2014 ranking

Page 9: Introduction to Web Data Analytics Using R and Python

Recent trend in programming languages for teaching

http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-us-universities/fulltext

Page 10: Introduction to Web Data Analytics Using R and Python

Recent trend in programming languages

http://blog.codeeval.com/codeevalblog/2016/2/2/most-popular-coding-languages-of-2016

Based on hundreds of thousands of data points we've collected by processing over 1,200,000+ challenge submissions in 26 different programming languages.

Page 11: Introduction to Web Data Analytics Using R and Python

Recent trend in programming languages

https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community

Page 12: Introduction to Web Data Analytics Using R and Python

Recent trend in programming languages

Rank  Language     Share    Trend
1     Java         23.8 %   -0.7 %
2     Python       13.0 %   +2.3 %
3     PHP          10.5 %   -0.7 %
4     C#            9.0 %   -0.3 %
5     Javascript    7.7 %   +0.7 %
6     C++           7.2 %   -0.4 %
7     C             7.0 %   -0.2 %
8     Objective-C   4.5 %   -0.8 %
9     R             3.2 %   +0.6 %

Worldwide change in Github contribution, July 2016 compared to a year ago:

http://pypl.github.io/PYPL.html

Page 13: Introduction to Web Data Analytics Using R and Python

Recent trend in programming languages

https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community

Page 14: Introduction to Web Data Analytics Using R and Python

Recent trend in programming languages

https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community

Page 15: Introduction to Web Data Analytics Using R and Python

https://www.codementor.io/learn-programming/beginner-programming-language-job-salary-community

Page 16: Introduction to Web Data Analytics Using R and Python

http://www.kdnuggets.com/polls/2015/analytics-data-mining-data-science-software-used.html

Page 17: Introduction to Web Data Analytics Using R and Python

Top 10 most in-demand skills in big data market

http://www.forbes.com/sites/louiscolumbus/2014/12/29/where-big-data-jobs-will-be-in-2015/#48c70566404a

Page 18: Introduction to Web Data Analytics Using R and Python

http://www.forbes.com/sites/louiscolumbus/2015/11/16/where-big-data-jobs-will-be-in-2016/#14b0163df7f1

Page 19: Introduction to Web Data Analytics Using R and Python

What is Analytics

• Analytics is the scientific process of transforming data into insights for the purpose of making better decisions. (The Institute for Operations Research and the Management Sciences, INFORMS) https://www.informs.org/About-INFORMS/News-Room/INFORMS-in-the-News/Best-definition-of-analytics

• Analytics is the process of developing actionable insights through problem definition and the application of statistical models and analysis against existing and/or simulated future data. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.269.7294&rep=rep1&type=pdf

• Analytics is the discovery, interpretation, and communication of meaningful patterns in data. Analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favours data visualization to communicate insight. https://en.wikipedia.org/wiki/Analytics

Page 20: Introduction to Web Data Analytics Using R and Python

Data Analysis and Data Analytics

• Data analysis refers to hands-on data exploration and evaluation.

• Data analytics is a broader term and includes data analysis as a necessary subcomponent. Analytics defines the science behind the analysis. The science means understanding the cognitive processes an analyst uses to understand problems and explore data in meaningful ways.

• Analytics also includes data extraction, transformation, and loading; specific tools, techniques, and methods; and how to successfully communicate results. http://www.kdnuggets.com/2015/02/interview-david-kasik-boeing-data-analytics.html

Page 21: Introduction to Web Data Analytics Using R and Python

Data Mining and Data Analytics

• Data analytics is distinguished from data mining by the scope, purpose and focus of the analysis. Data miners sort through huge data sets using sophisticated software to identify undiscovered patterns and establish hidden relationships. Data analytics focuses on inference, the process of deriving a conclusion based on what is already known by the researcher. http://searchdatamanagement.techtarget.com/definition/data-analytics

Page 22: Introduction to Web Data Analytics Using R and Python

Web Data Analytics

• Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data.

• Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining techniques due to the heterogeneity and semi-structured or unstructured nature of the Web data.

Bing Liu, Web Data Mining

• Web data analytics aims to use the discovered information to aid decision making

Page 23: Introduction to Web Data Analytics Using R and Python

Web data mining categories

• Web structure mining
  – Discovers useful knowledge from hyperlinks
  – Ex. Discovering communities of users who share common interests (Social Network Mining)

• Web content mining
  – Extracts useful information from Web page contents
  – Ex. Analyzing customer reviews and forum postings to discover consumer opinions and sentiment

• Web usage mining
  – Discovers user access patterns from Web usage logs (click stream data)
  – Ex. Web user behavior modeling, Website personalization

Page 24: Introduction to Web Data Analytics Using R and Python

Data analytics steps

• Pre-processing
  – Discretization, data cleaning, data integration, data transformation, data reduction

• Processing
  – Application of data mining (statistics and machine learning) and operations research tools

• Post-processing
  – Application of evaluation and visualization techniques

• Decision making
  – Application of the domain knowledge to interpret the data for decision making

Page 25: Introduction to Web Data Analytics Using R and Python

Characteristics of Data

• Data on user, item or rating are described by a random variable and are categorized as continuous or discrete.

• Continuous variable: a variable that can assume any value on a continuous scale within a range is said to be continuous.
  – Example: 1) time spent by a buyer on a particular page, 2) interestingness of a joke as rated by the user in Jester

• Discrete variable: a variable that can assume a finite or countably infinite number of values is said to be discrete.
  – Example: 1) profession of a user, 2) rating of a product on Amazon

Page 26: Introduction to Web Data Analytics Using R and Python

Scales of measurement

• Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.

– One-to-one correspondence between the numbers and the characteristics being measured.

– The rules for assigning numbers should be standardized and applied uniformly.

– Rules must not change over objects or time.

• Scaling involves creating a continuum upon which measured objects are located.

Naresh K. Malhotra, Marketing Research: An Applied Orientation, Pearson, 6th edition, 2009

Page 27: Introduction to Web Data Analytics Using R and Python

Characteristics of Scale

• Description: By description, we mean the unique labels or descriptors that are used to designate each value of the scale. All scales possess description.

• Order: By order, we mean the relative sizes or positions of the descriptors. Order is denoted by descriptors such as greater than, less than, and equal to.

• Distance: The characteristic of distance means that absolute differences between the scale descriptors are known and may be expressed in units.

• Origin: The origin characteristic means that the scale has a unique or fixed beginning or true zero point.

Page 28: Introduction to Web Data Analytics Using R and Python

Primary Scales of Measurement

Scale     Example                                 Values for the three runners at the finish
Nominal   Numbers assigned to runners             7             3              8
Ordinal   Rank order of winners                   Third place   Second place   First place
Interval  Performance rating on a 0 to 10 scale   8.2           9.1            9.6
Ratio     Time to finish in seconds               15.2          14.1           13.4

Page 29: Introduction to Web Data Analytics Using R and Python

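Primary Scales of Measurement

• Nominal
  – Basic characteristics: Numbers identify and classify objects
  – Common examples: Social Security numbers, numbering of football players
  – Permissible descriptive statistics: Percentages, mode
  – Permissible inferential statistics: Chi-square, binomial test

• Ordinal
  – Basic characteristics: Numbers indicate the relative positions of objects but not the magnitude of differences between them
  – Common examples: Quality rankings, rankings of teams in a tournament
  – Permissible descriptive statistics: Percentile, median
  – Permissible inferential statistics: Rank-order correlation, Friedman ANOVA

• Interval
  – Basic characteristics: Differences between objects can be compared; zero point is arbitrary
  – Common examples: Temperature (Fahrenheit, Celsius)
  – Permissible descriptive statistics: Range, mean, standard deviation
  – Permissible inferential statistics: Product-moment correlation, t tests, regression

• Ratio
  – Basic characteristics: Zero point is fixed; ratios of scale values can be compared
  – Common examples: Length, weight
  – Permissible descriptive statistics: Geometric mean, harmonic mean
  – Permissible inferential statistics: Coefficient of variation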

Page 30: Introduction to Web Data Analytics Using R and Python

Statistical method for understanding the data

Page 31: Introduction to Web Data Analytics Using R and Python

Descriptive Statistics

• Descriptive statistics are brief descriptive coefficients that summarize a given data set

• Univariate analysis: involves describing the distribution of a single variable, including
  – Central tendency: mean, median, and mode
  – Dispersion
    • Range, quantiles, interquartile range
    • Spread: variance and standard deviation
    • Shape of the distribution: skewness and kurtosis
  – Characteristics of a variable's distribution in graphical form
    • Distribution, histograms, stem-and-leaf plot, box plot

• Bivariate analysis: to describe the relationship between pairs of variables
  – Cross-tabulations and contingency tables
  – Graphical representation via scatterplots
  – Quantitative measures of dependence: correlation, covariance
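The bivariate measures listed above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names `covariance`, `correlation` and `cross_tab` are hypothetical, and the covariance is assumed to be the sample version (dividing by n - 1).

```python
from collections import Counter
from math import sqrt


def covariance(xs, ys):
    """Sample covariance of two equal-length numeric lists (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson product-moment correlation, built from the sample covariance."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def cross_tab(a, b):
    """Cross-tabulation: count how often each pair of categories occurs together."""
    return Counter(zip(a, b))


if __name__ == "__main__":
    print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))
    print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))
    print(cross_tab(["m", "f", "m"], ["yes", "yes", "no"]))
```

A correlation close to +1 or -1 indicates a strong linear relationship; the cross-tabulation is the counting step behind a contingency table for two categorical variables.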

Page 32: Introduction to Web Data Analytics Using R and Python

Univariate analysis for understanding the data

• Central tendency: mean, median, and mode

Data: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50

Mean ≈ 24, Median = 24, Mode = 24

Data with one extreme value added: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517

Mean ≈ 54.7, Median = 24, Mode = 24

The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions.
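The effect of the extreme value on the mean can be checked with Python's standard `statistics` module. The helper name `central_tendency` is hypothetical; the data are the two samples from the slide.

```python
from statistics import mean, median, mode

# Sample from the slide, without and with the extreme value 517.
scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
scores_with_outlier = scores + [517]


def central_tendency(values):
    """Return (mean, median, mode) of a list of numbers."""
    return mean(values), median(values), mode(values)


if __name__ == "__main__":
    print(central_tendency(scores))
    print(central_tendency(scores_with_outlier))
```

Adding the single value 517 moves the mean from about 24 to about 55, while the median and mode stay at 24.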

Page 33: Introduction to Web Data Analytics Using R and Python

Univariate analysis for understanding the data

• Dispersion
  – Range, quantiles, interquartile range
  – Spread: variance and standard deviation

Five-number summary: Minimum, First Quartile, Median, Third Quartile, Maximum

Data: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50

• Range = Max - Min = 44
• Standard Deviation (SD) = 11.2
• Variance = s^2 = 126.4
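The dispersion figures for this sample can be reproduced with the standard `statistics` module. Two assumptions are made in this sketch: the variance and standard deviation are the sample versions (dividing by n - 1), which is what matches the 126.4 and 11.2 quoted on the slide, and the quartiles use the default "exclusive" method of `statistics.quantiles`, which is only one of several quartile conventions.

```python
from statistics import quantiles, stdev, variance

scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]

value_range = max(scores) - min(scores)      # Max - Min
sample_variance = variance(scores)           # sample variance, divides by n - 1
sample_sd = stdev(scores)                    # square root of the sample variance
q1, q2, q3 = quantiles(scores, n=4)          # quartiles, default "exclusive" method
iqr = q3 - q1                                # interquartile range

if __name__ == "__main__":
    print("range:", value_range)
    print("variance:", round(sample_variance, 1))
    print("standard deviation:", round(sample_sd, 1))
    print("quartiles:", q1, q2, q3, "IQR:", iqr)
```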

Page 34: Introduction to Web Data Analytics Using R and Python

Univariate analysis for understanding the data

• Skewness: measures asymmetry of data
  – Positive or right skewed: longer right tail
  – Negative or left skewed: longer left tail

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{3/2}}$$

• Kurtosis: measures peakedness of the distribution of data. The kurtosis of normal distribution is 0.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations. Then,

$$\text{Kurtosis} = \frac{n\,\sum_{i=1}^{n}(x_i-\bar{x})^{4}}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right)^{2}} - 3$$
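A minimal sketch of both shape measures follows. It assumes the common moment-based definitions, skewness = sqrt(n) * sum(d^3) / (sum(d^2))^(3/2) and kurtosis = n * sum(d^4) / (sum(d^2))^2 - 3, where d is the deviation of each observation from the mean; other texts and software packages apply small-sample corrections, so their values can differ slightly. The function names are hypothetical.

```python
from math import sqrt


def skewness(values):
    """Moment-based skewness: positive for a longer right tail, negative for a longer left tail."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_cube = sum((x - mean) ** 3 for x in values)
    return sqrt(n) * sum_cube / sum_sq ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution scores about 0."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2 - 3


if __name__ == "__main__":
    symmetric = [1, 2, 3, 4, 5]
    right_tailed = [1, 1, 1, 10]
    print(skewness(symmetric), skewness(right_tailed))
    print(excess_kurtosis(symmetric))
```

Symmetric data give a skewness of zero, and a sample with one large value on the right gives a positive skewness.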

Page 35: Introduction to Web Data Analytics Using R and Python

Univariate analysis for understanding the data

• Characteristics of a variable's distribution in graphical form
  – Bar diagrams and pie charts are used for categorical variables
  – Histograms and box plots are used for numerical variables

Figure 3: Age Distribution (histogram; x-axis: Age in Month, with bins 40, 60, 80, 100, 120, 140 and More; y-axis: Number of Subjects, from 0 to 16)

Mean 90.41666667

Standard Error 3.902649518

Median 84

Mode 84

Standard Deviation 30.22979318

Sample Variance 913.8403955

Kurtosis -1.183899591

Skewness 0.389872725

Range 95

Minimum 48

Maximum 143

Sum 5425

Count 60

Page 36: Introduction to Web Data Analytics Using R and Python

Data Preparation (Preprocessing)

• Data in the real world has many problems
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • Example: occupation=“”
  – noisy: containing errors or outliers
    • Example: Salary=“-10”, Age=“222”
  – inconsistent: containing discrepancies in codes or names
    • Example: Age=“42”, Birthday=“03/07/1997”

• Data need to be formatted for a given software tool

• Data need to be made adequate for a given method
  – Computational stability of the algorithms

1. http://paginas.fe.up.pt/~ec/files_0910/slides/aula_2_DataPreparation.pdf
2. Chapter 2: Han and Kamber, Data Mining Book
3. http://www.cs.unm.edu/~mueen/Teaching/CS591/Lectures/3_Data.pdf

Page 37: Introduction to Web Data Analytics Using R and Python

Major Tasks in Preprocessing

• Data discretization
  – Part of data reduction but with particular importance, especially for numerical data

• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

• Data integration
  – Integration of multiple databases, data cubes, or files

• Data transformation
  – Normalization and aggregation

• Data reduction
  – Obtains reduced representation in volume but produces the same or similar analytical results

Page 38: Introduction to Web Data Analytics Using R and Python

Discretization of Continuous Variables

• Divide the range of a continuous attribute into intervals
  – Some methods require discrete values, e.g. most versions of Naïve Bayes
  – Reduce data size by discretization
  – Prepare for further analysis

• Useful for generating a summary of data

• Also called binning
  – Equal width binning
  – Equal height binning
  – Other methods: entropy based, Holte's 1R

Page 39: Introduction to Web Data Analytics Using R and Python

Binning

• Equal width binning
  – It divides the range into N intervals of equal size (range): uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N

• Equal height binning
  – It divides the range into N intervals, each containing approximately the same number of samples
  – Generally preferred because it avoids clumping
  – In practice, “almost-equal” height binning is used to give more intuitive break points
  – Additional considerations:
    • don’t split frequent values across bins
    • create separate bins for special values (e.g., 0)
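Both binning schemes can be sketched in plain Python. The function names are hypothetical, and the equal height version below is the simplest rank-based variant: it does not implement the "almost-equal" refinements from the slide, so tied values can end up in different bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def equal_height_bins(values, n_bins):
    """Assign bin indices by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


if __name__ == "__main__":
    scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
    print(equal_width_bins(scores, 3))
    print(equal_height_bins(scores, 3))
```

On this sample, equal width binning puts 5, 8 and 2 values in the three bins (the clumping the slide mentions), while equal height binning puts 5 values in each.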

Page 40: Introduction to Web Data Analytics Using R and Python

Data cleaning

– Fill in missing values (manual vs. automatic)
  • Ignore
  • Constant: “unknown”, a new class?!
  • Attribute mean (of entire set or subset)
  • Most probable value: inference-based

– Identify outliers and smooth out noisy data
  • Binning method
  • Clustering
  • Combined computer and human inspection
  • Regression

– Correct inconsistent data

– Resolve redundancy caused by data integration

Page 41: Introduction to Web Data Analytics Using R and Python

Outlier Detection in Univariate Data

Compute the mean and standard deviation. If a value is two or three standard deviations away from the mean, it may be considered an outlier.

An observation is declared an extreme outlier if it lies outside the interval (Q1-3×IQR, Q3+3×IQR), and a mild outlier if it lies outside the interval (Q1-1.5×IQR, Q3+1.5×IQR), where IQR is the interquartile range, IQR = Q3-Q1.
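Both univariate rules can be sketched with the standard `statistics` module. The function names are hypothetical, the standard deviation is assumed to be the sample version, and the quartiles use the default "exclusive" method of `statistics.quantiles`; a different quartile convention can shift the fences slightly.

```python
from statistics import mean, quantiles, stdev


def sd_outliers(values, k=3.0):
    """Values lying more than k sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]


def iqr_outliers(values):
    """Return (mild, extreme) outliers using the 1.5 x IQR and 3 x IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)
    return mild, extreme


if __name__ == "__main__":
    scores = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
    print(iqr_outliers(scores))
    print(sd_outliers(scores + [517]), iqr_outliers(scores + [517]))
```

Note that an extreme value inflates the mean and standard deviation it is tested against, which is one reason the quartile-based fences are often the more robust choice.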

Page 42: Introduction to Web Data Analytics Using R and Python

Outlier Detection in multivariate Data

• Statistical Methods
  – Mahalanobis Distance
  – Outliers: multivariate data points with large distances

• Data mining Methods
  – Distance based measures: An observation is defined as a distance based outlier if at least a fraction β of the observations in the dataset are further than r from it.
  – Clustering based methods consider clusters of small size, including clusters of a single observation, as clustered outliers.

http://www2.cs.uh.edu/~ceick/7362/T1-1.pdf
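The distance based definition above translates almost directly into code. This is a brute-force sketch under two assumptions: distances are Euclidean, and the fraction β is measured over the other observations (excluding the point itself). The function name is hypothetical.

```python
from math import dist


def distance_based_outliers(points, beta, r):
    """Points for which at least a fraction beta of the other points lie farther than r."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if dist(p, q) > r)
        if others and far / len(others) >= beta:
            outliers.append(p)
    return outliers


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(distance_based_outliers(points, beta=0.9, r=3))
```

The pairwise comparison costs O(n²) distance computations, so for large datasets an index structure or sampling would normally be used instead.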

Page 43: Introduction to Web Data Analytics Using R and Python

Handling missing values

• Ignore records (use only cases with all values)
  – Usually done when the class label is missing, as most prediction methods do not handle missing data well
  – Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes

• Ignore attributes with missing values
  – Use only features (attributes) with all values (may leave out important features)

• Fill in the missing value manually
  – tedious and possibly infeasible

Page 44: Introduction to Web Data Analytics Using R and Python

Handling missing values

• Use a global constant to fill in the missing value
  – e.g., “unknown”. (May create a new class!)

• Use the attribute mean to fill in the missing value
  – It will do the least harm to the mean of existing data

• Use the attribute mean for all samples belonging to the same class to fill in the missing value

• Use the most probable value to fill in the missing value
  – Inference-based such as Bayesian formula or decision tree
  – Identify relationships among variables
    • Linear regression, multiple linear regression, nonlinear regression
  – Nearest-Neighbour estimator
    • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
    • Finding neighbours in a large dataset may be slow
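Two of the simpler strategies above, filling with the attribute mean and filling with the class-wise attribute mean, can be sketched as follows. Missing values are represented by `None` here, which is an assumption of this sketch, and the function names are hypothetical.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None with the mean of the observed values of the attribute."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None with the mean of the observed values that share the same class label.

    Assumes every class has at least one observed value; otherwise a KeyError is raised.
    """
    by_class = {}
    for v, label in zip(values, labels):
        if v is not None:
            by_class.setdefault(label, []).append(v)
    class_means = {label: mean(vs) for label, vs in by_class.items()}
    return [class_means[label] if v is None else v for v, label in zip(values, labels)]


if __name__ == "__main__":
    ages = [20, None, 40, 60, None]
    classes = ["a", "a", "a", "b", "b"]
    print(fill_with_mean(ages))
    print(fill_with_class_mean(ages, classes))
```

The class-wise version usually distorts the data less, because each gap is filled from records that resemble the incomplete one.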

Page 45: Introduction to Web Data Analytics Using R and Python

Data Integration

• Combines data from multiple sources into a coherent store

• Remove redundancies

• Schema integration
  – Integrate metadata from different sources
  – Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

• Detecting and resolving data value conflicts
  – For the same real world entity, attribute values from different sources are different
  – Possible reasons: different representations, different scales, e.g., metric vs. British units

Page 46: Introduction to Web Data Analytics Using R and Python

Data transformation

• Smoothing: remove noise from data
  – Binning, regression, clustering

• Aggregation
  – Summarization, data cube construction

• Generalization: concept hierarchy climbing

• Attribute/feature construction
  – New attributes constructed from the given ones (e.g., add an attribute Area based on height and width)

• Normalization
  – Scale values to fall within a smaller specified range

Page 47: Introduction to Web Data Analytics Using R and Python

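Data Normalization

• min-max normalization

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

• z-score normalization (standardization)

$$v' = \frac{v - \bar{A}}{\sigma_A}$$

• normalization by decimal scaling

$$v' = \frac{v}{10^{j}}$$

where j is the smallest integer such that Max(|v'|) < 1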
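The three normalizations can be sketched in plain Python using their standard textbook forms: min-max maps v to (v - min) / (max - min), rescaled to a target range; z-score maps v to (v - mean) / standard deviation; decimal scaling divides by 10^j. The function names are hypothetical, the z-score here assumes the population standard deviation, and the sketch assumes the attribute is not constant (otherwise the first two would divide by zero).

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer that makes every |v'| less than 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


if __name__ == "__main__":
    print(min_max([10, 20, 30]))
    print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
    print(decimal_scaling([-986, 917]))
```

Min-max is sensitive to outliers because a single extreme value stretches the range, whereas the z-score is the usual choice when the minimum and maximum are unknown or unreliable.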

Page 48: Introduction to Web Data Analytics Using R and Python

Data Reduction

• Data cube aggregation
  – Aggregation operations are applied to the data in the construction of a data cube.

• Attribute subset selection
  – Irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed
  – Sometimes according to your knowledge of the business

• Dimensionality reduction
  – Encoding mechanisms are used to reduce the data set size.

• Numerosity reduction
  – The data are replaced or estimated by alternative, smaller data representations

Page 49: Introduction to Web Data Analytics Using R and Python

Data cube aggregation

Page 50: Introduction to Web Data Analytics Using R and Python

Attribute subset selection

Page 51: Introduction to Web Data Analytics Using R and Python

Dimensionality reduction

• Matrix decomposition methods
  – Singular Value Decomposition

• Principal Component Analysis
  – Finding major directions
  – Computes k orthonormal vectors

• Signal processing techniques
  – Discrete Fourier Transform
  – Discrete Wavelet Transform

Page 52: Introduction to Web Data Analytics Using R and Python

Numerosity reduction

• Reduce data volume by choosing alternative, smaller forms of data representation
  – Parametric methods
    • Assume the data fits some model, estimate the model parameters, and store only the estimated parameters instead of the actual data
    • Regression and log-linear models
  – Non-parametric methods
    • Do not assume models
    • Histograms, clustering, sampling
  – Discretization and concept hierarchy generation
    • Raw data values for attributes are replaced by ranges or higher conceptual levels
    • Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.
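One parametric and one non-parametric reduction can be sketched side by side. In the parametric case, a least-squares line replaces all (x, y) pairs with just two numbers; in the non-parametric case, a simple random sample without replacement stands in for the full data. The function names and the fixed seed are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def simple_random_sample(records, k, seed=0):
    """Draw k records without replacement; the seed only makes the sketch repeatable."""
    return random.Random(seed).sample(records, k)


if __name__ == "__main__":
    xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
    intercept, slope = fit_line(xs, ys)
    print(intercept, slope)  # two parameters kept in place of the four data points
    print(simple_random_sample(list(range(100)), 5))
```

With the fitted parameters, any y value can be re-estimated as intercept + slope * x, at the cost of whatever scatter the line does not capture.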

Page 53: Introduction to Web Data Analytics Using R and Python

Parametric Methods -Regression

Page 54: Introduction to Web Data Analytics Using R and Python

Non-parametric methods

Page 55: Introduction to Web Data Analytics Using R and Python

Concept Hierarchy Generation and Discretization

Page 56: Introduction to Web Data Analytics Using R and Python

Sampling