![Page 1: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/1.jpg)
How to do Complex Analytics
Michael Stonebraker
![Page 2: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/2.jpg)
Big Volume - Little Analytics
• SQL aggregates, group_by
• Find me the average closing price of MSFT on all trading days within the last 3 years
• Find me the average closing price of each stock in the DJIA on trading days in the last 5 years
• High performance on SQL analytics available from the data warehouse crowd
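As a rough illustration of the "little analytics" case, the second query above could be written as a single aggregate with a GROUP BY. This is only a sketch: the `prices` table, its columns, and the toy rows are invented for the example, and SQLite stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical schema and toy rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, trade_date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("MSFT", "2012-01-03", 10.0),
        ("MSFT", "2012-01-04", 20.0),
        ("IBM", "2012-01-03", 100.0),
        ("IBM", "2012-01-04", 110.0),
        ("IBM", "2009-01-05", 500.0),  # outside the date window, filtered out
    ],
)

# Average closing price per stock over a date window:
# one filter, one aggregate, one GROUP BY.
avg_close = dict(
    conn.execute(
        "SELECT symbol, AVG(close) FROM prices "
        "WHERE trade_date >= ? GROUP BY symbol",
        ("2010-01-01",),
    ).fetchall()
)
print(avg_close)
```

ISO-formatted date strings sort chronologically, which is why a plain string comparison is enough for the date filter here.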
![Page 3: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/3.jpg)
Big Data - Big Analytics
• Complex math operations (machine learning, clustering, trend detection, …)
— The world of the “quants”
— Mostly specified as linear algebra on array data
• A dozen or so common ‘inner loops’
— Matrix multiply
— QR decomposition
— SVD decomposition
— Linear regression
![Page 4: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/4.jpg)
Big Data - Big Analytics: An Example
• Consider closing price on all trading days for the last 5 years for two stocks A and B
• What is the covariance between the two time-series?
cov(A, B) = (1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))
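The formula above translates almost directly into code. The following is a minimal sketch in plain Python; the function name `covariance` and the sample prices are made up for illustration, and it uses the 1/N (population) form shown on the slide rather than the 1/(N-1) sample form.

```python
def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


# Made-up closing prices for two hypothetical stocks A and B.
prices_a = [1.0, 2.0, 3.0, 4.0]
prices_b = [2.0, 4.0, 6.0, 8.0]
cov_ab = covariance(prices_a, prices_b)
print(cov_ab)
```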
![Page 5: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/5.jpg)
Now Make It Interesting …
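• Do this for all pairs of 4000 stocks
— The data is the following 4000 x 1000 matrix

| Stock | t1 | t2 | t3 | t4 | t5 | t6 | t7 | … | t1000 |
|-------|----|----|----|----|----|----|----|---|-------|
| S1    |    |    |    |    |    |    |    |   |       |
| S2    |    |    |    |    |    |    |    |   |       |
| …     |    |    |    |    |    |    |    |   |       |
| S4000 |    |    |    |    |    |    |    |   |       |

Hourly data? All securities?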
![Page 6: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/6.jpg)
Solution
• Except for the constant and subtracting off the means:
— Stock * Stock^T
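To make the "constant and subtracting off the means" concrete: if each row of the matrix is first centered on its own mean, then the centered matrix times its transpose, divided by N, gives the covariance of every pair of rows at once. The sketch below uses plain Python lists and a tiny made-up 3 x 4 matrix. The helper names are hypothetical, and a real system would use an optimized matrix multiply rather than nested loops.

```python
def center_rows(matrix):
    """Subtract each row's mean from that row (one row per stock)."""
    centered = []
    for row in matrix:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def times_transpose(matrix):
    """Return matrix * matrix^T as a list of lists."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def covariance_matrix(stock):
    """All-pairs population covariance: (1/N) * C * C^T, with C the row-centered matrix."""
    n = len(stock[0])  # number of time points per stock
    product = times_transpose(center_rows(stock))
    return [[value / n for value in row] for row in product]


# Made-up data: 3 stocks (rows) by 4 time points (columns).
stock = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cov = covariance_matrix(stock)
```

Entry `cov[i][j]` is the covariance between stock i and stock j, so the result is a symmetric 3 x 3 matrix here and would be 4000 x 4000 for the data on the previous slide.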
![Page 7: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/7.jpg)
Big Data - Big Analytics: Requirements
• SQL-style data management
— Filters, joins, …
• Complex array manipulation
![Page 8: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/8.jpg)
Big Data - Big Analytics: Solution Options
• Math package
• RDBMS
• RDBMS + math package
• Array database
• Hadoop
![Page 9: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/9.jpg)
Solution Options: R, SAS, Matlab, et al.
• Weak or non-existent data management
— Do the correlation only for companies with revenue > $1B?
• File system storage
• R doesn’t scale and is not a parallel system
— Revolution does a bit better
![Page 10: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/10.jpg)
Solution Options: RDBMS alone
• SQL simulator (MADlib) is slooooow
— And only does some of the required operations
• Coding operations as UDFs still requires you to simulate arrays on top of tables: sloooow
— And current UDF model not powerful enough to support iteration
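To see what "simulating arrays on top of tables" means in practice, here is a small sketch that stores a matrix as (row, column, value) tuples and computes Stock * Stock^T with a self-join and a GROUP BY. The table layout is an assumption made for illustration, SQLite stands in for a generic RDBMS, and this is not how MADlib or any particular product implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical encoding: one table row per matrix cell, as (i, j, val).
conn.execute("CREATE TABLE stock (i INTEGER, j INTEGER, val REAL)")
matrix = [[1.0, 2.0], [3.0, 4.0]]
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [(i, j, v) for i, row in enumerate(matrix) for j, v in enumerate(row)],
)

# Stock * Stock^T: pair up cells that share a column index,
# multiply them, and sum the products for each (row, row) pair.
rows = conn.execute(
    "SELECT a.i, b.i, SUM(a.val * b.val) "
    "FROM stock AS a JOIN stock AS b ON a.j = b.j "
    "GROUP BY a.i, b.i ORDER BY a.i, b.i"
).fetchall()
product = {(r, c): v for r, c, v in rows}
print(product)
```

Every multiply-add becomes a joined row that the engine has to materialize and group, which gives some intuition for why this style tends to be much slower than a native matrix multiply.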
![Page 11: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/11.jpg)
Solution Options: R + RDBMS
• Have to extract and transform the data from RDBMS table to math package data format (e.g. data frames)
• ‘move the world’ nightmare
• Need to learn 2 systems
• And R still doesn’t scale and is not a parallel system
• Some RDBMS vendors are working on these issues
![Page 12: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/12.jpg)
Array DBMS (e.g. Paradigm4/SciDB)
• Array SQL data management
• With massively scalable array analytics
• In a single system!
• Open source
• Runs in the cloud or private grid of commodity HW
![Page 13: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/13.jpg)
Array Versus Relational Tables
• Math functions run directly on native storage format
• Dramatic storage efficiencies as # of dimensions & attributes grows
• High performance on both sparse and dense data
[Figure: storage comparison, 48 cells versus 16 cells]
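The figure itself is not reproduced here, so the following is only one plausible reading of the two cell counts: a dense 4 x 4 array with a single attribute needs 16 cells when the coordinates are implicit in the position, but 48 cells when each array cell becomes a table row of (i, j, value). The function names and the 4 x 4 shape are assumptions for illustration.

```python
def table_cells(shape, attributes=1):
    """Cells used when every array cell becomes a row of (coordinates..., attributes...)."""
    rows = 1
    for extent in shape:
        rows *= extent
    return rows * (len(shape) + attributes)


def array_cells(shape, attributes=1):
    """Cells used when coordinates are implicit in the storage position (dense array)."""
    cells = 1
    for extent in shape:
        cells *= extent
    return cells * attributes


print(table_cells((4, 4)), array_cells((4, 4)))
```

Under this counting the table form costs one extra cell per dimension for every value stored, so the overhead grows with the number of dimensions.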
![Page 14: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/14.jpg)
Hadoop
• Awful performance on data management
— No indexes, no statistics, …
• Low level interface
— 40 years of DBMS research points to high level interfaces
• At the very least move to Pig, Hive, …
— Another moving part to integrate
![Page 15: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/15.jpg)
Hadoop
• No Math
— Roll your own or
— Use Mahout (yet another moving part to integrate)
• And Hadoop is very inefficient on math that is not “embarrassingly parallel”
![Page 16: Michael Stonebraker How to do Complex Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022051314/557a84a9d8b42acf638b473c/html5/thumbnails/16.jpg)
Summary
• RDBMS good on data management, bad on math
• Math products don’t scale and have no data management
• Hadoop is slow and has too many moving parts that are not well integrated
— Not good at either task!
• Opportunity for a new DBMS?