Conceptual Data Science

Study Program: Manajemen Bisnis Telekomunikasi & Informatika (Telecommunication and Informatics Business Management)
Course: Big Data and Data Analytics
By: Teaching Team
CONCEPTUAL DATA SCIENCE


Post on 12-Aug-2020




Telkom University

Creating the great business leaders

Study Program: Manajemen Bisnis Telekomunikasi & Informatika

Lecturer: Yudi Priyadi, M.T.

Fakultas Ekonomi dan Bisnis (School of Economics and Business)

OUTLINE

o Data Simulation (Monte Carlo)

o Data Preprocessing

o Conceptual Learning Data / Machine Learning

o Model Evaluation / Accuracy

o Case Study / Exercise


Modeling and Simulation

Modeling and simulation (M&S) refers to using models (physical, mathematical, or otherwise logical representations of a system, entity, phenomenon, or process) as a basis for simulations (methods for implementing a model, either statically or over time) to develop data as a basis for managerial or technical decision making. M&S helps to obtain information about how something will behave without actually testing it in real life. (Wikipedia)

An Example of Simulation: Monte Carlo Methods

Monte Carlo

Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution.
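As a small illustration of repeated random sampling used for numerical integration, the sketch below estimates pi from the fraction of random points in the unit square that fall inside the quarter circle of radius 1. The function name, the seed, and the sample sizes are assumptions made for this example; they do not come from the slides.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # Count the point if it lies inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # (area of quarter circle) / (area of square) = pi / 4
    return 4.0 * inside / num_samples


# The estimate tends to get closer to pi as the number of samples grows.
for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The error of such an estimate shrinks roughly with the square root of the number of samples, which is why Monte Carlo runs often use very many draws.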


Monte Carlo Example


GoldSim Video Monte Carlo Simulation


Why Simulation

• Simulations are generally cheaper, safer, and sometimes more ethical than conducting real-world experiments. For example, supercomputers are sometimes used to simulate the detonation of nuclear devices and their effects in order to support better preparedness in the event of a nuclear explosion. Similar efforts are conducted to simulate hurricanes and other natural catastrophes.

• Simulations can often be even more realistic than traditional experiments, as they allow the free configuration of environment parameters found in the operational application field of the final product. Examples are supporting deep-water operations of the US Navy or simulating the surface of neighboring planets in preparation for NASA missions.

• Simulations can often be conducted faster than real time. This allows using them for efficient if-then-else analyses of different alternatives, in particular when the necessary data to initialize the simulation can easily be obtained from operational data. This use of simulation adds decision support simulation systems to the toolbox of traditional decision support systems.


Data Preprocessing (Why?)

Measures for data quality: a multidimensional view

• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, …
• Timeliness: timely update?
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?


Major Tasks in Data Preprocessing

1. Data cleaning
   • Fill in missing values
   • Smooth noisy data
   • Identify or remove outliers
   • Resolve inconsistencies

2. Data reduction
   • Dimensionality reduction
   • Numerosity reduction
   • Data compression

3. Data transformation and data discretization
   • Normalization
   • Concept hierarchy generation

4. Data integration
   • Integration of multiple databases or files
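The normalization step listed under data transformation can be sketched as follows. This is a minimal illustration using two common variants, min-max scaling and z-score scaling; the slides do not say which variant is intended, and the function names and income figures are hypothetical.

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall in [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score_normalize(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [12_000, 35_000, 58_000, 73_600, 98_000]  # hypothetical values
print(min_max_normalize(incomes))
print(z_score_normalize(incomes))
```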


Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  o e.g., Occupation=“ ” (missing data)

• Noisy: containing noise, errors, or outliers
  o e.g., Salary=“−10” (an error)

• Inconsistent: containing discrepancies in codes or names
  o e.g., Age=“42”, Birthday=“03/07/2010”
  o Was rating “1, 2, 3”, now rating “A, B, C”
  o Discrepancy between duplicate records

• Intentional (e.g., disguised missing data)
  o Jan. 1 as everyone’s birthday?
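The kinds of dirty data listed on this slide can be flagged with simple rule checks. The sketch below is only an illustration: the record layout, the field names, and the reading of “03/07/2010” as March 7, 2010 are assumptions, and real cleaning rules depend on the data set.

```python
from datetime import date


def find_quality_issues(record, today):
    """Return a list of data quality problems found in one record (a dict)."""
    issues = []
    # Incomplete: an empty or blank attribute value.
    if not (record.get("occupation") or "").strip():
        issues.append("incomplete: occupation is missing")
    # Noisy: a value outside the plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: salary is negative")
    # Inconsistent: the stated age disagrees with the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != age:
            issues.append("inconsistent: age does not match birthday")
    # Intentional: Jan. 1 may be a disguised missing value.
    if birthday is not None and (birthday.month, birthday.day) == (1, 1):
        issues.append("suspicious: Jan. 1 birthday may hide a missing value")
    return issues


dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(2010, 3, 7)}
for issue in find_quality_issues(dirty, today=date(2020, 8, 12)):
    print(issue)
```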


Incomplete (Missing) Data

• Data is not always available
  o E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

• Missing data may be due to
  o equipment malfunction
  o data that was inconsistent with other recorded data and was therefore deleted
  o data not entered due to misunderstanding
  o certain data not being considered important at the time of entry
  o history or changes of the data not being registered

• Missing data may need to be inferred
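One simple way to infer missing values is to replace them with the mean of the observed values of the same attribute. The sketch below shows only this one strategy; the function name and the sample incomes are hypothetical, and other strategies (a constant, the class-wise mean, a model-based estimate) may suit a given data set better.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to infer from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


customer_income = [4_000, None, 6_500, None, 5_100]  # hypothetical sales data
print(fill_missing_with_mean(customer_income))
```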


Data Reduction Strategies

• Data Reduction
  o Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

• Why Data Reduction?
  o A database or data warehouse may store terabytes of data
  o Complex data analysis may take a very long time to run on the complete data set

• Data Reduction Strategies
  1. Dimensionality reduction
     1. Feature extraction
     2. Feature selection
  2. Numerosity reduction (sometimes simply called data reduction)
     • Regression and log-linear models
     • Histograms, clustering, sampling
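Two of the numerosity reduction techniques named above, histograms and sampling, can be sketched in a few lines. This is an illustrative sketch: the equal-width binning rule, the function names, and the sample values are assumptions, not part of the slides.

```python
import random


def equal_width_histogram(values, num_bins):
    """Summarize values as (bin_start, bin_end, count) triples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all values identical: everything goes in the first bin
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def simple_random_sample(values, k, seed=0):
    """Draw k values without replacement."""
    return random.Random(seed).sample(values, k)


prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical values
print(equal_width_histogram(prices, 3))
print(simple_random_sample(prices, 4))
```

Both reduce the volume stored or processed: the histogram keeps only bin boundaries and counts, while the sample keeps a subset of the original rows.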


General Methods in Data Analytics

1. Estimation:
   Linear Regression, Neural Network, Support Vector Machine, etc.

2. Prediction/Forecasting:
   Linear Regression, Neural Network, Support Vector Machine, etc.

3. Classification:
   Naive Bayes, K-Nearest Neighbor, C4.5, ID3, CART, Linear Discriminant Analysis, Logistic Regression, etc.

4. Clustering:
   K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means, etc.

5. Association:
   FP-Growth, Apriori, Coefficient of Correlation, Chi-Square, etc.
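To make one of the listed classification methods concrete, the sketch below implements a bare-bones K-Nearest Neighbor classifier: a query point receives the majority label among its k closest training points. Euclidean distance, k=3, and the toy data are assumptions chosen for illustration; tools such as RapidMiner or Weka offer more complete implementations.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbors."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1.5, 1.5)))
print(knn_predict(points, labels, (8.5, 8.5)))
```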


Evaluation (Accuracy, Error)

1. Estimation:
   Error: Root Mean Square Error (RMSE), Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), etc.

2. Prediction/Forecasting:
   Error: RMSE, MSE, MAPE, etc.

3. Classification:
   Confusion matrix: accuracy
   ROC curve: Area Under Curve (AUC)

4. Clustering:
   Internal evaluation: Davies–Bouldin index, Dunn index
   External evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, confusion matrix

5. Association:
   Lift charts: lift ratio
   Precision and recall (F-measure)
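The error measures for estimation and forecasting, and accuracy for classification, follow directly from their definitions. The sketch below uses the standard formulas; the function names and the sample numbers are made up for illustration.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined if an actual value is 0)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def accuracy(actual_labels, predicted_labels):
    """Share of correctly classified examples (the diagonal of the confusion matrix)."""
    correct = sum(a == p for a, p in zip(actual_labels, predicted_labels))
    return correct / len(actual_labels)


print(mse([2, 4], [1, 5]), rmse([2, 4], [1, 5]), mape([2, 4], [1, 5]))
print(accuracy(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]))
```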


Machine Learning

In the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data. (Wikipedia)

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. (Stanford/Coursera)

Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. (whatis.com)


Data Split

The Split Data operator takes a dataset as its input and delivers the subsets of that dataset through its output ports. The sampling type parameter decides how the examples should be shuffled in the resulting partitions:

1. Linear sampling: Linear sampling simply divides the dataset into partitions without changing the order of the examples. Subsets with consecutive examples are created.

2. Shuffled sampling: Shuffled sampling builds random subsets of the dataset. Examples are chosen randomly for making subsets.

3. Stratified sampling: Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset. In the case of a binominal (two-class) classification, stratified sampling builds random subsets so that each subset contains roughly the same proportions of the two values of the label.

We split the data into two groups: training data and testing data.
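The three sampling types can be sketched in plain Python as follows. This is not the Split Data operator itself, only an illustration of the idea behind each option; the function names, the 70/30 default ratio, and the seeds are assumptions.

```python
import random
from collections import defaultdict


def linear_split(examples, train_ratio=0.7):
    """Keep the original order: the first part trains, the rest tests."""
    cut = round(len(examples) * train_ratio)
    return examples[:cut], examples[cut:]


def shuffled_split(examples, train_ratio=0.7, seed=0):
    """Shuffle a copy of the examples, then split it linearly."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, train_ratio)


def stratified_split(examples, labels, train_ratio=0.7, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append((example, label))
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * train_ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


data = list(range(10))
labels = ["yes"] * 8 + ["no"] * 2  # hypothetical imbalanced label
print(linear_split(data))
print(shuffled_split(data))
print(stratified_split(data, labels, train_ratio=0.5))
```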


Cross-Validation Methods

• Cross-validation is used so that the testing data of different runs do not overlap
• Cross-validation steps:
  o Divide the data into k subsets of equal size
  o Use each subset in turn as testing data and the rest as training data
• This method is also called k-fold cross-validation
• We often apply stratified sampling before the cross-validation process, because it reduces the variance of the estimate
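The cross-validation steps can be sketched as a generator that yields one train/test split of example indices per fold. This is a minimal sketch without stratification; the function name is an assumption, and when the data does not divide evenly the first folds simply receive one extra example.

```python
def k_fold_indices(num_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(num_examples))
    # Fold sizes differ by at most one when num_examples is not divisible by k.
    fold_sizes = [
        num_examples // k + (1 if i < num_examples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "test:", test_idx)
```

Each example is used for testing exactly once and for training in the other k - 1 folds.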


10-Fold Cross-Validation

Experiment   Accuracy
1            93%
2            91%
3            90%
4            93%
5            93%
6            91%
7            94%
8            93%
9            91%
10           90%

Average accuracy: 92%

Orange box in the dataset diagram: the k-th subset (testing data)
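The reported average can be checked by taking the mean of the ten fold accuracies from the table. The variable names below are chosen for this example only.

```python
from statistics import mean

# Accuracy (in percent) of each of the 10 experiments listed in the table.
fold_accuracies = [93, 91, 90, 93, 93, 91, 94, 93, 91, 90]

average_accuracy = mean(fold_accuracies)
print(f"exact mean: {average_accuracy}%, rounded: {round(average_accuracy)}%")
```

The exact mean is 91.9%, which rounds to the 92% shown on the slide.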


Case Study: NBA


Exercise:

1. Use one of the following tools: RapidMiner, R, Orange, Weka

2. Create a prediction model (predicting the electability of legislative candidates) from the training data in the election data set (datapemilukpu.xls), using the following algorithms:
   1. Decision Tree (C4.5)
   2. Naïve Bayes (NB)
   3. K-Nearest Neighbor (K-NN)

3. Evaluate the accuracy using 10-fold cross-validation

            C4.5     NB       K-NN
Accuracy    92.45%   77.46%   88.72%
AUC         0.851    0.840    0.5