Modern Big Data Systems for Machine Learning
Antonio Roldao, Ph.D. CQF.
10/July/2015, Thomson Reuters, London, UK
About Me
http://anton.io @roldao
This Talk on Big Data Systems
Data
  Big Data as a Buzzword and the useless 4V’s
  Basic Aspects of Data
  Advanced Aspects of Data
  Small Data Innovations
Algorithms for Machine Learning
  ML Overview
  Optimization Problems
  Solving Systems of Linear Equations
  Accelerating ML Using Different Technologies
Distributed Computing
  Computing at Scale
  Platform Examples
Big Data
The 4 Vs of Big Data?! Volume, Variety, Velocity, Veracity
Too simplistic and technically useless!
“Any amount of data that is too big for Excel to process.”
1956 Hard-drive with 5 MB
Mostly a marketing buzzword which means different things to different people.
Understanding Data – Basic
Storage formats
  Uncompressed <-> Compressed
  Unencrypted <-> Encrypted
  Human-readable <-> Binary
  Rigid <-> Templated <-> Self-describing
  Mainly regular <-> Irregular
  Different types and encodings…
Generation (write) modes
  parallel <-> sequential
  append-only
  in-place updates
  random inserts…
Consumption (read) modes
  parallel <-> sequential
  random <-> well defined access…
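To make the storage-format trade-offs concrete, here is a minimal sketch that encodes the same records as human-readable, self-describing JSON and as a rigid binary layout, then compresses both. The tick records, field names and the 20-byte layout are assumptions made up for illustration; only Python standard-library modules are used.

```python
import json
import struct
import zlib

# Hypothetical tick records: (timestamp in ms, price, volume).
ticks = [(1436486400000 + i, 100.0 + 0.25 * i, 10 * i) for i in range(1000)]

# Human-readable and self-describing: field names are repeated in every record.
as_json = json.dumps(
    [{"ts": ts, "price": p, "volume": v} for ts, p, v in ticks]
).encode("utf-8")

# Binary and rigid: little-endian int64, float64, int32 with no padding (20 bytes).
RECORD = struct.Struct("<qdi")
as_binary = b"".join(RECORD.pack(ts, p, v) for ts, p, v in ticks)

# Compressed variants of both encodings.
json_z = zlib.compress(as_json)
binary_z = zlib.compress(as_binary)


def read_record(blob, i):
    """Random access on a rigid layout: record i starts at byte i * RECORD.size."""
    return RECORD.unpack_from(blob, i * RECORD.size)


for name, blob in [("json", as_json), ("binary", as_binary),
                   ("json+zlib", json_z), ("binary+zlib", binary_z)]:
    print(f"{name:12s} {len(blob):7d} bytes")
```

The rigid layout allows direct random reads such as `read_record(as_binary, 500)`, while the JSON form has to be parsed as a whole but can be inspected by eye and extended with new fields.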
Understanding Data – Advanced
Represents:
  How concepts are connected (graph)
  How connections evolve with time (time series)
Bitemporal (e.g. value depends on time frame)
Time value of data (e.g. useful today, but not tomorrow)
Sensitivity (e.g. Medical, Economical, Political, Privacy…)
Interdependency (e.g. one wrong bit destroys everything)
Cleanliness (e.g. how noisy it is)
Truthfulness (e.g. how accurate it is)
Redundancy (e.g. how safe does it need to be)
Density (e.g. how redundant it is)
Accessibility (e.g. Local <-> Global)
Cost / Budget
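The bitemporal aspect can be illustrated with a small sketch. It assumes one common reading of "bitemporal": every fact carries both the time it became valid and the time it was recorded, so a query can ask what was believed at a given moment. The `Fact` class, the `as_of` helper and the integer day numbers are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    value: float
    valid_from: int   # day on which the value became true in the real world
    recorded_at: int  # day on which the store learned about it


def as_of(facts, valid_time, known_at):
    """Value valid at `valid_time`, using only facts recorded by `known_at`."""
    visible = [f for f in facts
               if f.recorded_at <= known_at and f.valid_from <= valid_time]
    if not visible:
        return None
    # The latest validity period wins; within a period, the latest recording wins.
    best = max(visible, key=lambda f: (f.valid_from, f.recorded_at))
    return best.value


history = [
    Fact(value=100.0, valid_from=1, recorded_at=1),
    Fact(value=105.0, valid_from=5, recorded_at=5),
    Fact(value=101.0, valid_from=1, recorded_at=8),  # late correction of the day-1 value
]

print(as_of(history, valid_time=3, known_at=3))  # what was believed on day 3
print(as_of(history, valid_time=3, known_at=9))  # the same question after the correction
```

The same question about day 3 gives different answers depending on when it is asked, which is the property that makes such data awkward for stores that only keep the latest value.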
Myriad of Data-stores/bases
File-Systems
  local, distributed, p2p,…
  rom, tape, spindle, flash, ram,…
Key-Value Stores
Relational
Object
Geo-location
Row-based
Column-based
Time-Series
Graph-based
ACID compliant or not
Sharding Support
Replication Support
HA Support
Blockchain
LayerFS
…
Recent Innovations in “Small-data”
XML (1996)
YAML (2001)
JSON
BSON
Google Protocol Buffers (initial release 2008)
Cap’n Proto
Thrift
Avro
FAST
FIX/BFIX
FlatBuffers
Simple Binary Encoding (2014)
Dynamically Adaptive Encoding (Future)
http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
Processing Data
Machine Learning
ML / AI – Boils down to…
Given an input (X) and/or a state (S), produce an output (Y)
X may include an index or time element (e.g. time series)
S may include:
  a feedback loop (e.g. reinforcement learning)
  a model previously trained on a dataset (e.g. supervised learning)
Y divides into two types:
  predictions (e.g. weather, trading, ...)
  categorizations
    known categories (e.g. object/speech recognition, …)
    unknown categories (e.g. insight generation, …)
Dimensionality Reduction
Principal Component Analysis
First component
Subsequent components
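The formulas from this slide did not survive in the transcript. As a reference point, the standard textbook formulation of PCA is sketched below; the notation is an assumption, not necessarily the one used on the slide: X is the column-centered data matrix (one observation per row) and w_(k) is the unit weight vector of the k-th component.

```latex
% First component: the unit vector along which the projected data has maximal variance
\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\| = 1} \|\mathbf{X}\mathbf{w}\|^{2}
                 = \arg\max_{\mathbf{w} \neq 0}
                   \frac{\mathbf{w}^{\mathsf{T}} \mathbf{X}^{\mathsf{T}} \mathbf{X} \mathbf{w}}
                        {\mathbf{w}^{\mathsf{T}} \mathbf{w}}

% Subsequent components: subtract the first k-1 components, then repeat the search
\hat{\mathbf{X}}_{k} = \mathbf{X} - \sum_{s=1}^{k-1} \mathbf{X} \mathbf{w}_{(s)} \mathbf{w}_{(s)}^{\mathsf{T}}

\mathbf{w}_{(k)} = \arg\max_{\|\mathbf{w}\| = 1} \|\hat{\mathbf{X}}_{k} \mathbf{w}\|^{2}
```

Equivalently, the w_(k) are the eigenvectors of X^T X ordered by decreasing eigenvalue.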
Clustering
k-Means
Cluster the observations x into k partitions S_1, ..., S_k so as to minimize the within-cluster sum of squared distances, where μ_i represents the mean of the points in S_i
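A minimal sketch of k-means using Lloyd's algorithm, the usual heuristic for this objective (the slide does not say which solver was meant). The function names, the random initialisation from the observations, and the toy points are assumptions for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # initial means: k distinct observations
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: each mean moves to the centroid of its cluster.
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:  # converged: assignments can no longer change
            break
        means = new_means
    return means, clusters


def wcss(means, clusters):
    """Within-cluster sum of squares, the quantity k-means tries to minimize."""
    return sum(sq_dist(p, m) for m, c in zip(means, clusters) for p in c)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
means, clusters = kmeans(points, k=2)
print(sorted(means), wcss(means, clusters))
```

Lloyd's algorithm only finds a local minimum in general; on well-separated toy data like this it recovers the two obvious groups.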
General AI
Genetic Algorithms
For n mutations, select the mutation m_i that minimizes the difference between its output y_i and a given reference (r).
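The selection step can be sketched as follows. This is a deliberately reduced, mutation-only variant (no crossover, a single surviving parent), so it is closer to a simple evolution strategy than to a full genetic algorithm. The `model` function, the Gaussian mutation and all parameter values are hypothetical choices for illustration.

```python
import random


def model(m):
    """Hypothetical system being tuned: maps parameters m to an output y."""
    return 2.0 * m[0] + m[1]


def evolve(start, reference, n=20, generations=200, sigma=0.5, seed=0):
    """Repeatedly mutate the best candidate and keep the closest match to the reference."""
    rng = random.Random(seed)
    best = list(start)
    best_err = abs(model(best) - reference)
    for _ in range(generations):
        # Create n mutations m_1..m_n of the current best candidate.
        mutations = [[g + rng.gauss(0.0, sigma) for g in best] for _ in range(n)]
        # Select the m_i whose output y_i is closest to the reference r.
        candidate = min(mutations, key=lambda m: abs(model(m) - reference))
        err = abs(model(candidate) - reference)
        if err < best_err:  # keep the parent unless a mutation improves on it
            best, best_err = candidate, err
    return best, best_err


start = [0.0, 0.0]
initial_err = abs(model(start) - 10.0)
best, best_err = evolve(start, reference=10.0)
print(best, best_err)
```

Because the parent is only replaced by a strictly better mutation, the error never increases from one generation to the next.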
Artificial Neural Networks
Deep Convolutional Neural Network (d-CNN)
Optimization involving Stochastic Gradient Descent + Back-propagation
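A deep convolutional network is too large to show here, so the sketch below applies stochastic gradient descent to the smallest possible model, a single linear unit y = w*x + b, where the gradient follows from one application of the chain rule (the single-layer case of what back-propagation does layer by layer). The data, learning rate and function name are assumptions for illustration.

```python
import random


def sgd_linear(data, lr=0.1, epochs=1000, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a new random order each epoch
        for x, y in samples:
            err = (w * x + b) - y  # forward pass and residual
            # Gradient of 0.5 * err**2 with respect to w and b (chain rule).
            w -= lr * err * x
            b -= lr * err
    return w, b


# Noise-free toy data generated from y = 3x + 1.
data = [(i / 10, 3.0 * (i / 10) + 1.0) for i in range(11)]
w, b = sgd_linear(data)
print(w, b)
```

Each update uses a single sample rather than the full dataset, which is what distinguishes the stochastic variant from plain gradient descent.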
All About Optimization
All these schemes involve solving for some constants that minimize or maximize some cost function
These require fundamental optimization algorithms such as:
  Direct Methods
  Combinatorial Algorithms
    Greedy Algorithm
    Minimax Algorithm with alpha-beta pruning
    …
  Iterative Methods
    Gradient Methods
    Karmarkar’s Algorithm
    …
At the Core of Optimization…
…there is the solution of a system of linear equations of the form:
  A x = b
with x subject to some constraints.
These need algorithms that can be subdivided into two categories:
  Direct Methods: Gaussian elimination, LU, QR, Cholesky, LDL, …
  Iterative Methods: MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
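The two categories can be contrasted on a small linear system, written here as A x = b with the names A, x and b assumed for this example. The sketch solves it once with a direct method (Cholesky factorization) and once with an iterative method (conjugate gradient). Both routines assume A is symmetric positive definite; the 2x2 system is a toy example, and real workloads would use an optimized library rather than pure Python.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def cholesky_solve(A, b):
    """Direct method: factor A = L L^T, then forward and back substitution."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):  # solve L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # solve L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Iterative method: refine x until the residual norm drops below tol."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)  # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]  # next conjugate direction
        rs = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_direct = cholesky_solve(A, b)
x_iter = conjugate_gradient(A, b)
print(x_direct, x_iter)
```

The direct solver does a fixed amount of work and returns the answer in one pass, while the iterative solver only needs matrix-vector products and can stop early at a chosen tolerance, which is why it suits large sparse systems and parallel hardware.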
Accelerating Machine Learning
CPU GPGPU FPGA
Sequential Processing Parallel Processing
High Flexibility, High Abstractions, Many Libraries…, Direct Methods
Ultra-Low-Latency, High Bandwidth, Fine-grain optimization…, Iterative Methods, Neural Networks, Markov Chains, Monte Carlo
Networked Computing Systems
Mainframe Computing, Cluster Computing, Distributed Computing, Grid Computing
Orbital Computing, Interstellar Computing, Galactic Computing, Inter-Universe Computing
Cloud Computing
Modern Big Data Systems – Basic Components
Dynamic (abstraction) + Statically-Typed (speed) Languages
Need to rethink and re-engineer main systems:
  Data & Code Stores
  Logging
  Code Revision and Deployment
  Compute Nodes and Brokers Management
  Graceful Failure and Recovery
  Credentials and Access Controls
  Task Schedulers
  Messaging Bus
  Web/Mobile Interfaces
  Regression Testing…
Containerize and Standardize Services
Examples – Modern Big Data Systems
Finance
  Athena/Hydra @ JP Morgan
  Quartz/Sandra @ Bank of America
  Slang/SecDB @ Goldman Sachs
  Optimus/DAL @ Morgan Stanley
  WSQ Tech @ n-prop shops &
  datapark.io @ quants / prop-shops
Machine Learning
  Alpha/DL @ Muse.Ai
Thank you
http://anton.io @roldao