© 2017 The MathWorks, Inc.
Big Data Processing and Machine Learning Techniques
Application Engineer
엄준상, Manager
Data Analytics
Turn large volumes of complex data into actionable information
source: Gartner
Data Analytics Workflow
Access and Explore Data
– Files
– Databases
– Sensors
Preprocess Data
– Working with Messy Data
– Data Reduction/Transformation
– Feature Extraction
Develop Predictive Models
– Model Creation e.g. Machine Learning
– Model Validation
– Parameter Optimization
Integrate Analytics with Systems
– Desktop Apps
– Enterprise Scale Systems
– Embedded Devices and Hardware
Example: Working with Big Data in MATLAB
Objective: Create a model to predict the cost of a taxi ride in New York City
Inputs:
– Monthly taxi ride log files
– The local data set is small (~20 MB)
– The full data set is big (~25 GB)
Approach:
– Access data
– Preprocess and explore data
– Develop and validate predictive model (linear fit)
Work with subset of data for prototyping
Scale to full data set on a cluster
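The approach above (access, preprocess, fit, validate) can be prototyped on a small in-memory sample before scaling out. The following is only a minimal sketch under stated assumptions: the data is synthetic, and the variable names `distance`, `duration` and `fare` are hypothetical stand-ins, not the columns of the real taxi logs. It uses plain least squares from base MATLAB rather than whichever fitting function the demo used.

```matlab
% Hypothetical stand-in for one monthly taxi log: synthetic trips, not real data.
rng(0);                                    % reproducible random numbers
n = 1000;
distance = 10 * rand(n, 1);                % assumed predictor: trip distance
duration = 3 * distance + 5 * rand(n, 1);  % assumed predictor: trip duration
fare = 2.5 + 2.0 * distance + 0.5 * duration + 0.2 * randn(n, 1);
trips = table(distance, duration, fare);

% Preprocess: drop rows with missing values or non-positive fares.
trips = rmmissing(trips);
trips = trips(trips.fare > 0, :);

% Hold out 20% of the rows for validation.
nRows = height(trips);
idx = randperm(nRows);
nTest = round(0.2 * nRows);
testSet = trips(idx(1:nTest), :);
trainSet = trips(idx(nTest+1:end), :);

% Linear fit by least squares: fare ~ b0 + b1*distance + b2*duration.
Xtrain = [ones(height(trainSet), 1), trainSet.distance, trainSet.duration];
coeffs = Xtrain \ trainSet.fare;

% Validate on the held-out rows.
Xtest = [ones(height(testSet), 1), testSet.distance, testSet.duration];
predicted = Xtest * coeffs;
rmse = sqrt(mean((testSet.fare - predicted).^2));
fprintf('Validation RMSE: %.3f\n', rmse);
```

Prototyping on a subset keeps each iteration fast; the same steps can later be pointed at the full data set.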
Data Access and Pre-processing – Challenges
Data aggregation
– Different sources (files, web, etc.)
– Different types (images, text, audio, etc.)
Data clean up
– Poorly formatted files
– Irregularly sampled data
– Redundant data, outliers, missing data etc.
Data specific processing
– Signals: Smoothing, resampling, denoising, wavelet transforms, etc.
– Images: Image registration, morphological filtering, deblurring, etc.
Dealing with out of memory data (big data)
Data preparation accounts for about 80% of the work of data scientists (Forbes)
Data Analytics Workflow: Data Access
C Java Fortran Python
Hardware
Software
Business and Transactional Data
– Servers and Databases
– Repositories: SQL, NoSQL, etc.
– File I/O: Text, Spreadsheet, etc.
– Web Sources: RESTful, JSON, etc.
Engineering, Scientific and Field Data
– Real-Time Sources: Sensors, GPS, etc.
– File I/O: Image, Audio, etc.
– Communication Protocols: OPC (OLE for Process Control), CAN (Controller Area Network), etc.
Data Analytics Workflow: Big Data Access and Pre-processing
Big Data in Recent Releases
datastore
– Tabular text files
– Images
– Excel spreadsheets
– (SQL) Databases
– HDFS (Hadoop)
– S3 (Amazon Web Services)
MATLAB MapReduce
– Scales from Desktop to Hadoop
airdata = datastore('*.csv');
airdata.SelectedVariableNames = {'Distance', 'ArrDelay'};
data = read(airdata);
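A single `read` call returns only one block of the datastore. Processing a whole collection usually means looping until `hasdata` reports that nothing is left. The sketch below assumes a tiny hypothetical file, `flights.csv`, written to a temporary folder so that it runs without the airline data set; the column names mirror the snippet above, and `Month` is an invented extra column.

```matlab
% Write a tiny CSV so the sketch runs without the real data set.
folder = tempname;
mkdir(folder);
T = table([100; 250; 400], [5; -3; 12], [1; 2; 3], ...
    'VariableNames', {'Distance', 'ArrDelay', 'Month'});
writetable(T, fullfile(folder, 'flights.csv'));

% Point a datastore at every CSV file in the folder.
ds = datastore(fullfile(folder, '*.csv'));

% Import only the columns that are needed.
ds.SelectedVariableNames = {'Distance', 'ArrDelay'};

% Read block by block, keeping only running totals in memory.
totalDelay = 0;
numRows = 0;
while hasdata(ds)
    chunk = read(ds);                    % one block, returned as a table
    totalDelay = totalDelay + sum(chunk.ArrDelay);
    numRows = numRows + height(chunk);
end
meanDelay = totalDelay / numRows;
fprintf('Mean arrival delay over %d rows: %.2f\n', numRows, meanDelay);
```

Keeping only running totals is what lets this pattern work on files that do not fit in memory.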
tall arrays in
New data type designed for data that doesn’t fit into memory
Lots of observations (hence “tall”)
Looks like a normal MATLAB array
– Supports numeric types, tables, datetimes, strings, etc…
– Supports several hundred functions for basic math, stats, indexing, etc.
– Statistics and Machine Learning Toolbox support
(clustering, classification, etc.)
tall arrays
Automatically breaks data up into small “chunks” that fit in memory
Tall arrays scan through the dataset one “chunk” at a time
Processing code for tall arrays is the same as for ordinary arrays
[Figure: a tall array streamed chunk by chunk through the memory of a single machine]
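As a rough illustration of "same code as ordinary arrays", the sketch below builds a tall table on top of a datastore and filters and averages it with ordinary table syntax. The file `trips.csv` and its columns `Distance` and `Fare` are hypothetical and generated on the spot. Operations on tall arrays are deferred, and `gather` triggers the chunked pass over the data.

```matlab
mapreducer(0);   % evaluate in the local MATLAB session, without a parallel pool

% Hypothetical data file, written to a temporary folder for this sketch.
folder = tempname;
mkdir(folder);
T = table((1:1000)', 2 * (1:1000)', 'VariableNames', {'Distance', 'Fare'});
writetable(T, fullfile(folder, 'trips.csv'));

% A tall table backed by the datastore; nothing is loaded yet.
ds = datastore(fullfile(folder, '*.csv'));
tt = tall(ds);

% Ordinary indexing and math; evaluation is deferred.
longTrips = tt(tt.Distance > 500, :);
meanFare = mean(longTrips.Fare);

% gather runs the computation chunk by chunk and returns an in-memory result.
meanFare = gather(meanFare);
fprintf('Mean fare of long trips: %.1f\n', meanFare);
```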
tall arrays
With Parallel Computing Toolbox, process several “chunks” at once
Can scale up to clusters with MATLAB Distributed Computing Server
[Figure: chunks of a tall array processed in parallel, either in the memory of a single machine or across the memory of a cluster of machines]
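One way to picture the scaling step: the tall-array code stays the same, and only the execution environment chosen with `mapreducer` changes. The sketch below is an assumption about how one might write this defensively. It tries to use a parallel pool when Parallel Computing Toolbox appears to be installed and otherwise falls back to the local session. Cluster configuration is not shown, since it depends on site-specific settings that the slides do not give.

```matlab
% Choose where tall-array computations run.
usePool = false;
if exist('parpool', 'file') == 2       % Parallel Computing Toolbox seems available
    try
        pool = gcp;                    % current pool, or start the default one
        mapreducer(pool);              % process several chunks at once
        usePool = true;
    catch
        usePool = false;               % no licence or the pool could not start
    end
end
if ~usePool
    mapreducer(0);                     % serial fallback in the local session
end

% The analysis code is identical in either environment.
x = tall((1:100000)');
total = gather(sum(x));
fprintf('Sum: %d (parallel pool used: %d)\n', total, usePool);
```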
Demo: Working with Tall Arrays
Data Access and Pre-processing – Challenges and Solutions
Challenges
Data aggregation
– Different sources (files, web, etc.)
– Different types (images, text, audio, etc.)
Data clean up
– Poorly formatted files
– Irregularly sampled data
– Redundant data, outliers, missing data etc.
Data specific processing
– Signals: Smoothing, resampling, denoising, wavelet transforms, etc.
– Images: Image registration, morphological filtering, deblurring, etc.
Dealing with out of memory data (big data)
Solutions
Point and click tools to access a variety of data sources (files, signals, databases, images)
High-performance environment for big data
Built-in algorithms for data preprocessing, including sensor, image, audio, video and other real-time data
Consider Machine/Deep Learning When
Problem is too complex for hand written rules or equations
– Examples: Speech Recognition, Object Recognition, Engine Health Monitoring
– Because algorithms can learn complex non-linear relationships
Program needs to adapt with changing data
– Examples: Weather Forecasting, Energy Load Forecasting, Stock Market Prediction
– Because algorithms can update as more data becomes available
Program needs to scale
– Examples: IoT Analytics, Taxi Availability, Airline Flight Delays
– Because algorithms can learn efficiently from very large data sets
Different Types of Learning
Machine Learning
Supervised Learning: Develop predictive model based on both input and output data
– Classification: Output is a choice between classes (True, False) (Red, Blue, Green)
– Regression: Output is a real number (temperature, stock prices)
Unsupervised Learning: Discover an internal representation from input data only
– Clustering: No output, find natural groups and patterns from input data only
Machine Learning with Big Data
• Descriptive statistics (skewness, tabulate, crosstab, cov, grpstats, …)
• K-means clustering (kmeans)
• Visualization (ksdensity, binScatterPlot; histogram, histogram2)
• Dimensionality reduction (pca, pcacov, factoran)
• Linear and generalized linear regression (fitlm, fitglm)
• Discriminant analysis (fitcdiscr)
• Linear classification methods for SVM and logistic regression (fitclinear)
• Random forest ensembles of classification trees (TreeBagger)
• Naïve Bayes classification (fitcnb)
• Regularized regression (lasso)
• Prediction applied to tall arrays
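To make the list concrete, here is a minimal sketch of one entry, `fitlm` on a tall table followed by prediction on tall data. It assumes Statistics and Machine Learning Toolbox is installed, and it uses synthetic data with the hypothetical variable names `x` and `y` in place of a real data set.

```matlab
mapreducer(0);                       % local session; no parallel pool needed
rng(1);                              % reproducible synthetic data

% Hypothetical data: y depends linearly on x, plus a little noise.
x = 10 * rand(5000, 1);
y = 3 + 2 * x + 0.1 * randn(5000, 1);
tt = tall(table(x, y));              % tall table built from in-memory data

% Fit a linear model directly on the tall table.
mdl = fitlm(tt, 'y ~ x');
slope = mdl.Coefficients.Estimate(2);

% Predict on tall data; gather brings the result into memory.
yhat = gather(predict(mdl, head(tt, 5)));
fprintf('Estimated slope: %.3f\n', slope);
```

With tall inputs the fit is computed in passes over the data rather than by loading everything at once, so the same call can be pointed at a datastore-backed tall table.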
Regression Learner
Demo: Training a Machine Learning Model
Regression Learner
App to apply advanced regression methods to your data
Added to Statistics and Machine Learning Toolbox in R2017a
Point and click interface – no coding required
Quickly evaluate, compare and select regression models
Export and share MATLAB code or trained models
Classification Learner
App to apply advanced classification methods to your data
Added to Statistics and Machine Learning Toolbox in R2015a
Point and click interface – no coding required
Quickly evaluate, compare and select classification models
Export and share MATLAB code or trained models
Tuning Machine Learning Models
Get more accurate models in less time
Automatically select best machine learning “features”
– NCA: Neighborhood Component Analysis
– Select best “features” to keep in model from over 400 candidates
Automatically fine-tune machine learning parameters
– Hyperparameter Tuning
Machine Learning Hyperparameters
Hyperparameters
– Tune a typical set of hyperparameters for this model
– Tune all hyperparameters for this model
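The two options above appear to correspond to the `'auto'` and `'all'` settings of the `'OptimizeHyperparameters'` name-value argument accepted by the toolbox's fitting functions. A minimal sketch follows, assuming Statistics and Machine Learning Toolbox and its bundled `fisheriris` sample data; the choice of `fitcknn` and of 10 evaluations is arbitrary and only for illustration.

```matlab
load fisheriris                      % sample data shipped with the toolbox
rng(0);                              % the optimizer is randomized

% Keep the search short and quiet for this sketch.
opts = struct('MaxObjectiveEvaluations', 10, ...
              'ShowPlots', false, ...
              'Verbose', 0);

% 'auto' tunes a typical set of hyperparameters; 'all' would tune every one.
mdl = fitcknn(meas, species, ...
    'OptimizeHyperparameters', 'auto', ...
    'HyperparameterOptimizationOptions', opts);

% Estimate the generalization error of the tuned model by cross-validation.
cvLoss = kfoldLoss(crossval(mdl));
fprintf('Cross-validated loss after tuning: %.3f\n', cvLoss);
```

By default the search uses Bayesian optimization, which picks each new candidate based on the results of the earlier evaluations.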
Bayesian Optimization in Action
MATLAB Production Server
Server software
– Manages packaged MATLAB programs and worker pool
MATLAB Runtime libraries
– Single server can use runtimes from different releases
RESTful JSON interface
Lightweight client libraries
– C/C++, .NET, Python, and Java
[Figure: enterprise applications and application/database servers call MATLAB Production Server through the MPS client library or the RESTful JSON interface; the request broker and program manager dispatch the requests to MATLAB Runtime workers]
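From the client side, a call through the RESTful JSON interface amounts to posting a small JSON document to a URL that names the deployed archive and function. The sketch below only builds such a request. The server address, the archive name `fareModel`, the function name `predictFare` and the argument values are all hypothetical, and the request layout (`nargout` plus a `rhs` list of inputs) reflects my understanding of the interface, so it should be checked against the server documentation before use.

```matlab
% Hypothetical deployment details: none of these names come from the slides.
serverUrl = 'http://localhost:9910';   % assumed host and port
archive = 'fareModel';                 % assumed deployable archive name
func = 'predictFare';                  % assumed function inside the archive

% Request body: number of outputs wanted and the list of input arguments.
request = struct('nargout', 1, 'rhs', {{2.5, 12}});
payload = jsonencode(request);

% The endpoint combines server address, archive and function name.
endpoint = sprintf('%s/%s/%s', serverUrl, archive, func);
fprintf('POST %s\n%s\n', endpoint, payload);

% Sending the request needs a running server, so it is left commented out:
% opts = weboptions('MediaType', 'application/json');
% response = webwrite(endpoint, request, opts);
```

Because the interface is plain HTTP and JSON, the same request can be issued from any language, which is what makes the client libraries lightweight.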
Integrate analytics with systems
Embedded Hardware
– C, C++, HDL, PLC
Enterprise Systems (MATLAB Runtime)
– Standalone Application
– Excel Add-in
– C/C++, Java, .NET, Python
– Hadoop/Spark
– MATLAB Production Server
Key Takeaways
1. MATLAB Analytics work with business and engineering data
– Utilize all of your data.
2. MATLAB enables domain experts to do Data Science
– Apply advanced analytics techniques.
3. MATLAB Analytics run anywhere
– Operationalize analytics to enterprise systems and embedded devices.
Resources to learn and get started
mathworks.com/machine-learning
eBook
mathworks.com/big-data