MADlib and MLbase
Quan Fang Xiaofeng Tao
MADlib Analytics Library
What is a mad lib?
- From bigdatablog.emc.com
MAD Methodology
• Cohen, Jeffrey, et al. "MAD skills: new analysis practices for big data." Proceedings of the VLDB Endowment 2.2 (2009): 1481-1492.
• What does MAD stand for?
Magnetic
• Use all data sources regardless of format or quality
Magnetic Agile
• Use all data sources regardless of format or quality
• Support dynamic workflow and whatever data analysts need
Magnetic Agile Deep
• Use all data sources regardless of format or quality
• Support dynamic workflow and whatever data analysts need
• Analyze large datasets without sampling
Magnetic Agile Deep
• Goal: Give analysts access to familiar math concepts, statistical methods, and algorithms inside the database
• Use traditional SQL and a wide range of extension languages
Data Parallel Statistics
• Give analysts access to familiar math concepts and statistical methods inside the database
• Database methods are data-parallel and have an SQL-like interface
• Database methods fall into one of the following layers of abstraction:
  – Scalar
  – Vector
  – Matrix
  – Function
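To make "data-parallel" concrete, here is a minimal Python sketch of a scalar statistic computed the way a parallel database aggregate typically is: each partition folds its rows into a small state, the partial states are merged, and a final step produces the answer. The names `MomentState`, `transition`, `merge`, and `finalize` are illustrative assumptions, not MADlib or Greenplum APIs.

```python
from dataclasses import dataclass


@dataclass
class MomentState:
    """Partial aggregate state: row count, sum, and sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0


def transition(state, x):
    """Fold one row into the state; runs independently on each partition."""
    return MomentState(state.n + 1, state.total + x, state.total_sq + x * x)


def merge(a, b):
    """Combine two partial states that were built on different partitions."""
    return MomentState(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def finalize(state):
    """Turn the merged state into (mean, population variance)."""
    mean = state.total / state.n
    variance = state.total_sq / state.n - mean * mean
    return mean, variance


def parallel_mean_variance(partitions):
    """Simulate the aggregate sequentially; each inner loop could run on its own node."""
    partials = []
    for rows in partitions:
        state = MomentState()
        for x in rows:
            state = transition(state, x)
        partials.append(state)
    merged = MomentState()
    for partial in partials:
        merged = merge(merged, partial)
    return finalize(merged)
```

Because `merge` is associative and commutative, the result does not depend on how the rows are split across partitions. The sum-of-squares formula is kept for brevity; it can lose precision on data with a large mean.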
MAD DBMS
• Loading and Unloading
  – Scatter/Gather streaming for fully parallel access
  – External tables
• Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT)
MAD DBMS
• Multiple storage formats
  – External tables
  – Heap format
  – Append-mostly format
• Different storage formats per table partition
• Atomic partition exchange
MAD Programming
• Traditional SQL queries with extensions
• Map/Reduce functions written in Python, Perl, or R
• Wide range of extension languages
MAD Implementations
• MADlib
  – Library for data analytics that can run on Greenplum or PostgreSQL
• Greenplum database
  – Massively parallel processing database
MADlib
• Library for widely used statistics, data mining, and machine learning that runs on top of a SQL database
  – Regression models
  – Machine learning
  – Linear systems
  – Topic modelling
  – Descriptive statistics
  – And more… (see doc.madlib.net)
MADlib History
• Still in early stages of development
• In use by some research universities and companies
• Heavily sponsored by Greenplum
MADlib Interface
• Interface consists of extensible, SQL-like scripts that call MADlib functions
• Wide range of extension languages possible, but C++ / Python recommended
• Designed for portability
MADlib User Extensions
• User-defined aggregates and functions in C++
• Driver functions use Python to wrap multiple MADlib SQL calls
• Templated queries are supported with Python wrappers
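The driver-function idea can be sketched as follows: a Python function fills table and column names into SQL templates and issues several statements in sequence. This is only an illustration under stated assumptions. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL or Greenplum, and `center_column`, `quote_ident`, and the templates are hypothetical names, not MADlib code.

```python
import sqlite3

# Hypothetical templates; the driver substitutes quoted identifiers for the placeholders.
MEAN_TEMPLATE = "SELECT AVG({column}) FROM {table}"
CREATE_TEMPLATE = "CREATE TABLE {out_table} (centered REAL)"
CENTER_TEMPLATE = "INSERT INTO {out_table} SELECT {column} - ? FROM {table}"


def quote_ident(name):
    """Quote an SQL identifier so a table or column name cannot inject SQL."""
    return '"' + name.replace('"', '""') + '"'


def center_column(conn, table, column, out_table):
    """Driver: run three templated statements that together mean-center a column."""
    names = {
        "table": quote_ident(table),
        "column": quote_ident(column),
        "out_table": quote_ident(out_table),
    }
    # Step 1: an aggregate query computes the mean inside the database.
    (mean,) = conn.execute(MEAN_TEMPLATE.format(**names)).fetchone()
    # Step 2: create the output table.
    conn.execute(CREATE_TEMPLATE.format(**names))
    # Step 3: a second pass writes centered values; the mean is a bound parameter.
    conn.execute(CENTER_TEMPLATE.format(**names), (mean,))
    return mean
```

Identifiers go through the template while data values are passed as bound parameters, which keeps the templated queries safe to reuse across tables.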
MADlib: Logistic regression
MADlib Example: K-means
• Run K-means
    SELECT * FROM madlib.kmeanspp(
        'km_sample', 'points', 2,
        'madlib.squared_dist_norm2', 'madlib.avg',
        20, 0.001
    );
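For readers who want to see what such a call computes, here is a minimal pure-Python sketch of Lloyd's k-means iteration. It assumes the arguments in the query above mean: squared Euclidean distance, the mean as the centroid aggregate, at most 20 iterations, and a stopping threshold of 0.001 on the fraction of reassigned points. The initial centroids are passed in explicitly; the k-means++ seeding that the `kmeanspp` name suggests is not implemented here.

```python
def squared_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_centroids, max_iterations=20, min_frac_reassigned=0.001):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = list(initial_centroids)
    assignment = [None] * len(points)
    for _ in range(max_iterations):
        reassigned = 0
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            nearest = min(range(len(centroids)),
                          key=lambda c: squared_dist(p, centroids[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                reassigned += 1
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave a centroid in place if it has no members
                centroids[c] = mean_point(members)
        # Stop once (almost) no point changed cluster in this pass.
        if reassigned / len(points) < min_frac_reassigned:
            break
    return centroids, assignment
```

In the database version, the assignment step is a scan over the points table and the update step is a grouped aggregate, so both steps parallelize over partitions.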
MADlib Example: K-means
• Calculate silhouette coefficient
    SELECT * FROM madlib.simple_silhouette(
        'km_sample', 'points',
        (SELECT centroids FROM km_centroids),
        'madlib.squared_dist_norm2'
    );
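As a rough guide to what a simplified silhouette coefficient measures, the sketch below scores each point by comparing its distance to the closest centroid (a) with its distance to the second-closest centroid (b), using (b - a) / max(a, b), and averages the scores. This is the common centroid-based simplification and is assumed, not confirmed, to match what `simple_silhouette` computes; the function names are illustrative.

```python
def squared_dist(p, q):
    """Squared Euclidean distance, matching the distance named in the query above."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def simplified_silhouette(points, centroids, dist=squared_dist):
    """Average of (b - a) / max(a, b) over all points.

    a is the distance to the nearest centroid and b the distance to the
    second-nearest one. Values near 1 indicate well-separated clusters.
    """
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    total = 0.0
    for p in points:
        distances = sorted(dist(p, c) for c in centroids)
        a, b = distances[0], distances[1]
        # If both distances are zero the point sits on two centroids; score it 0.
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)
```

A point lying exactly on its centroid scores 1, and a point equidistant from two centroids scores 0.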
Live demo
• Postgres command line
• Principal Component Analysis
  – pca_train()
  – pca_project()
Conclusion
• Due to the economics of data (cheap, high volume), data warehouses in the future need to focus on data analytics
  – Magnetically attract all data sources
  – Agile analysis workflow
  – Deeply analyze large datasets without sampling
• MADlib provides analysts with such a toolbox that is portable across SQL databases
MLbase: A Distributed Machine-Learning System
Four foci
• (1) A new declarative language to specify ML tasks
• (2) A novel optimizer to select ML algorithms
• (3) Provide answers early and continuously refine
• (4) Design a distributed run-time optimized for the data access patterns of ML
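Foci (2) and (3) can be illustrated together with a toy sketch: try candidate algorithms one after another and yield the best model found so far, so the caller gets an early answer that later improves. Everything here is a hypothetical illustration (the candidate list, `fit_mean`, `fit_line`, and `refine` are invented for the example) and says nothing about how the actual MLbase optimizer works.

```python
def fit_mean(xs, ys):
    """Cheapest candidate: always predict the mean of the training targets."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares line through the training points (1-D inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def mse(model, xs, ys):
    """Mean squared error of a model on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def refine(candidates, train, valid):
    """Yield (name, error, model) each time a better candidate is found."""
    best_err = None
    for name, fit in candidates:
        model = fit(*train)
        err = mse(model, *valid)
        if best_err is None or err < best_err:
            best_err = err
            yield name, err, model


# Cheap candidates first, so the first answer arrives quickly.
CANDIDATES = [("mean", fit_mean), ("line", fit_line)]
```

Because `refine` is a generator, a caller can use the first yielded model immediately and swap in later ones as they arrive.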
A Declarative Approach to ML
• SQL → Result
• MQL → Model
MLbase Architecture
[Figure: MLbase architecture diagram. Labels: User, ML Developer, Declarative ML Task, ML Contract + Code, Master Server, Parser, COML (Optimizer), Executor/Monitoring, Meta-Data, Statistics, Binders of Algorithms, LLP, PLP, Master, four Runtime nodes, result (e.g., fn-model & summary). Callouts: (1) MLI, the interface between MLbase and the distributed infrastructure; (2) Algorithms, Data, Statistics; (3) Adaptive Optimizer.]
[Figure: MLbase architecture diagram with callout 1 highlighted: MLI (Machine Learning Interface), the interface between MLbase and the distributed infrastructure.]
MLI: Machine Learning Interface
• Shield ML developers from low-level details
• Independence between ML algorithm and run-time
• Currently supported run-times:
  – TupleWare
[Figure: MLbase architecture diagram with callout 2 highlighted: Algorithms Pool (algorithms, data, statistics).]
Algorithms Pool
[Figure: Extensibility. An ML developer's implementation + MLbase-defined contracts feed the algorithms pool.]
MLbase-defined contracts
• Type (e.g., classification)
• Parameters
• Runtime (e.g., O(n))
• Input specification
• Output specification
• …
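One way to picture a contract is as a small record that travels with each implementation, so that a pool can be queried by task type. The sketch below is a hypothetical data model built from the fields listed above; `Contract`, `AlgorithmPool`, and their methods are illustrative names, not MLbase classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Contract:
    """Metadata an ML developer declares alongside an implementation."""
    task_type: str               # e.g. "classification"
    parameters: Dict[str, str]   # parameter name -> short description
    runtime: str                 # stated complexity, e.g. "O(n)"
    input_spec: str
    output_spec: str


@dataclass
class AlgorithmPool:
    """Registry pairing each contract with the code that fulfils it."""
    entries: List[Tuple[Contract, Callable]] = field(default_factory=list)

    def register(self, contract, implementation):
        """Add a developer-supplied implementation under its contract."""
        self.entries.append((contract, implementation))

    def candidates(self, task_type):
        """Return every (contract, implementation) pair for a task type."""
        return [(c, impl) for c, impl in self.entries if c.task_type == task_type]
```

With the contract kept separate from the code, an optimizer could compare candidates by their declared type and runtime without having to execute them.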
MLbase Architecture
[Figure: MLbase architecture diagram with callout 3 highlighted: Optimizer, LLP & PLP.]
Query Optimization (Logical Learning Plan)
Run Time (Physical Learning Plan)
• (1) The master distributes ML tasks to workers
• (2) Monitor progress
• (3) Handle failures
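The three responsibilities above can be sketched on a single machine with a thread pool standing in for the cluster: the master submits tasks, reports progress as results arrive, and resubmits tasks whose worker raised an error. This is a minimal illustration under those assumptions; `run_tasks` and its parameters are hypothetical and unrelated to the MLbase run-time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tasks(tasks, worker, max_workers=4, max_retries=2, on_progress=None):
    """Run worker(payload) for every task id; retry failed tasks.

    tasks maps a task id to its payload. Returns a dict of task id -> result.
    """
    results = {}
    attempts = {task_id: 0 for task_id in tasks}
    pending = dict(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # (1) Distribute: hand every pending task to the worker pool.
            futures = {pool.submit(worker, payload): task_id
                       for task_id, payload in pending.items()}
            pending = {}
            for future in as_completed(futures):
                task_id = futures[future]
                try:
                    results[task_id] = future.result()
                except Exception:
                    # (3) Handle failures: reschedule until retries run out.
                    attempts[task_id] += 1
                    if attempts[task_id] > max_retries:
                        raise
                    pending[task_id] = tasks[task_id]
                else:
                    # (2) Monitor: report how many tasks have finished.
                    if on_progress is not None:
                        on_progress(len(results), len(tasks))
    return results
```

A real run-time would also need to detect workers that hang or disappear, which this sketch does not attempt.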
Conclusion
• MLbase's core is its optimizer
• Transforms a declarative ML task into a sophisticated learning plan
• Quickly returns a first quality answer, then improves it in the background
• Designed to be fully distributed, with a run-time able to exploit ML algorithms