Modern Big Data Systems for Machine Learning

Modern Big Data Systems for Machine Learning Antonio Roldao, Ph.D. CQF. 1 10/July/2015, Thomson Reuters, London, UK

Upload: zpektral

Post on 18-Aug-2015


Category:

Technology



TRANSCRIPT

Page 1: Modern Big Data Systems for Machine Learning

Modern Big Data Systems for Machine Learning

Antonio Roldao, Ph.D. CQF.

10/July/2015, Thomson Reuters, London, UK

Page 2: Modern Big Data Systems for Machine Learning

About Me

http://anton.io @roldao


Page 3: Modern Big Data Systems for Machine Learning

This Talk on Big Data Systems

Data
  Big Data as a Buzzword and the useless 4V’s
  Basic Aspects of Data
  Advanced Aspects of Data
  Small Data Innovations

Algorithms for Machine Learning
  ML Overview
  Optimization Problems
  Solving Systems of Linear Equations
  Accelerating ML Using Different Technologies

Distributed Computing
  Computing at Scale
  Platform Examples


Page 4: Modern Big Data Systems for Machine Learning

Big Data

4Vs of BD?! Volume, Variety, Velocity, Veracity

Too simplistic and technically useless!

“Any amount of data that is too big for Excel to process.”

1956 Hard-drive with 5 MB

Mostly a marketing buzzword which means different things to different people.


Page 5: Modern Big Data Systems for Machine Learning

Understanding Data – Basic

Storage formats
  Uncompressed <-> Compressed
  Unencrypted <-> Encrypted
  Human-readable <-> Binary
  Rigid <-> Templated <-> Self-describing
  Mainly regular <-> Irregular
  Different types and encodings…

Generation (write) modes
  parallel <-> sequential
  append-only
  in-place updates
  random inserts…

Consumption (read) modes
  parallel <-> sequential
  random <-> well defined access…


Page 6: Modern Big Data Systems for Machine Learning

Understanding Data – Advanced

Represents:
  How concepts are connected (graph)
  How connections evolve with time (time series)

Bitemporal (e.g. value depends on time frame)
Time value of data (e.g. useful today, but not tomorrow)
Sensitivity (e.g. Medical, Economical, Political, Privacy…)
Interdependency (e.g. one wrong bit destroys everything)
Cleanliness (e.g. how noisy it is)
Truthfulness (e.g. how accurate it is)
Redundancy (e.g. how safe does it need to be)
Density (e.g. how redundant it is)
Accessibility (e.g. Local <-> Global)

Cost / Budget

Page 7: Modern Big Data Systems for Machine Learning

Myriad of Data-stores/bases

File-Systems
  local, distributed, p2p,…
  rom, tape, spindle, flash, ram,…

Key-Value Stores
Relational
Object
Geo-location
Row-based
Column-based
Time-Series
Graph-based
ACID compliant or not
Sharding Support
Replication Support
HA Support
Blockchain
LayerFS
…


Page 8: Modern Big Data Systems for Machine Learning

Recent Innovations in “Small-data”

XML (1996)
YAML (2001)
JSON
BSON
Google Protocol Buffers (initial release 2008)
Cap’n Proto
Thrift
Avro
FAST
FIX/BFIX
Flat Buffers
Simple Binary Encoding (2014)
Dynamically Adaptive Encoding (Future)

http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
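To make the trade-off between self-describing text formats and rigid binary formats concrete, here is a minimal sketch using only the Python standard library. The `tick` record, its field names, and the fixed 20-byte layout are illustrative assumptions; `struct` merely stands in for a schema-based binary encoding and is not the wire format of any of the formats listed.

```python
import json
import struct

# Hypothetical record: a market tick with a symbol, a price and a size.
tick = {"symbol": "VOD.L", "price": 231.45, "size": 1200}

# Self-describing, human-readable encoding: field names travel with every record.
as_json = json.dumps(tick).encode("utf-8")

# Rigid binary encoding: the layout (8-byte symbol, float64 price, uint32 size,
# little-endian, no padding) is agreed out of band, so only the values are stored.
LAYOUT = struct.Struct("<8sdI")
as_binary = LAYOUT.pack(tick["symbol"].encode("ascii"), tick["price"], tick["size"])


def decode_binary(buf):
    """Decode a record written with LAYOUT back into a dict."""
    symbol, price, size = LAYOUT.unpack(buf)
    return {"symbol": symbol.rstrip(b"\x00").decode("ascii"), "price": price, "size": size}


print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as fixed-layout binary")
```

The binary form is smaller and cheaper to parse, but a reader that does not know the layout cannot interpret it, which is the rigid versus self-describing axis in practice.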


Page 9: Modern Big Data Systems for Machine Learning

Processing Data


Page 10: Modern Big Data Systems for Machine Learning

Machine Learning


Page 11: Modern Big Data Systems for Machine Learning

ML / AI – Boils down to…

Given an input (X) and/or state (S), produce an output (Y)

X may include an index or time element (e.g. time series)

S may include:
  a feedback-loop (e.g. reinforcement learning)
  a previously trained dataset (e.g. supervised learning)

Y divides into two types:
  predictions (e.g. weather, trading, ...)
  categorizations
    known categories (e.g. object/speech recognition, …)
    unknown categories (e.g. insight generation, …)


Page 12: Modern Big Data Systems for Machine Learning

Dimensionality Reduction

Principal Component Analysis

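First component: the unit vector along which the mean-centred data matrix X (with rows x_(i)) has the largest variance, in the standard formulation:

$$\mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\|=1} \sum_i \left(\mathbf{x}_{(i)} \cdot \mathbf{w}\right)^2 = \arg\max_{\|\mathbf{w}\|=1} \|\mathbf{X}\mathbf{w}\|^2$$

Subsequent components: subtract the first k-1 components from X, then repeat on what remains:

$$\hat{\mathbf{X}}_k = \mathbf{X} - \sum_{s=1}^{k-1} \mathbf{X}\mathbf{w}_{(s)}\mathbf{w}_{(s)}^{\mathsf{T}}, \qquad \mathbf{w}_{(k)} = \arg\max_{\|\mathbf{w}\|=1} \|\hat{\mathbf{X}}_k\mathbf{w}\|^2$$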
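One way to compute the first and subsequent principal components is power iteration followed by deflation. The sketch below is a minimal pure-Python illustration under that assumption; the function names and the five sample points are hypothetical, and production code would normally use an SVD routine from a numerical library instead.

```python
import math
import random


def center(rows):
    """Subtract the column means so that every feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def first_component(rows, iters=200, seed=0):
    """Power iteration on X^T X: converges to the unit w that maximizes ||Xw||^2."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    w = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(r[j] * w[j] for j in range(d)) for r in rows]        # X w
        w = [sum(proj[i] * rows[i][j] for i in range(n)) for j in range(d)]  # X^T (X w)
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w


def deflate(rows, w):
    """Remove from every row the part that lies along w."""
    out = []
    for r in rows:
        p = sum(a * b for a, b in zip(r, w))
        out.append([a - p * b for a, b in zip(r, w)])
    return out


def pca(rows, k):
    """Return the first k principal directions of rows, one after another."""
    x = center(rows)
    components = []
    for _ in range(k):
        w = first_component(x)
        components.append(w)
        x = deflate(x, w)
    return components


# Hypothetical sample: noisy points close to the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
components = pca(data, 2)
print(components)
```

The sign of each component is arbitrary, so only its direction is meaningful.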


Page 13: Modern Big Data Systems for Machine Learning

Clustering

k-Means

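Given a set of observations x, cluster them into k partitions S = {S_1, …, S_k} so as to minimize the within-cluster sum of squares:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$$

where μ_i represents the mean of the points in S_i.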
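The k-means objective is usually minimized heuristically with Lloyd's algorithm: assign each point to its nearest mean, recompute the means, and repeat until nothing changes. The sketch below assumes that algorithm; the helper names and the eight sample points are hypothetical, and the result is a local optimum that depends on the random initial means.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squares, the quantity k-means minimizes."""
    return sum(dist2(p, mu) for mu, c in zip(centroids, clusters) for p in c)


# Hypothetical sample: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids), wcss(centroids, clusters))
```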


Page 14: Modern Big Data Systems for Machine Learning

General AI

Genetic Algorithms

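For n mutations, select the m_i that minimizes the difference between its output y_i and a given reference (r):

$$m^{*} = \arg\min_{m_i,\; i = 1, \dots, n} \| y_i - r \|$$

where y_i is the output produced by mutation m_i.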
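The selection step described above can be sketched as a mutate-and-select loop. This is a deliberately reduced illustration: it uses mutation and elitist selection only, with no crossover and no population beyond the current best, and the quadratic `output` model, the reference values, and all parameter defaults are assumptions made for the example.

```python
import random


def output(m, xs):
    """Hypothetical model: the quadratic a*x^2 + b*x + c evaluated at each x."""
    a, b, c = m
    return [a * x * x + b * x + c for x in xs]


def error(m, xs, r):
    """Squared difference between the output of m and the reference r."""
    return sum((y - ref) ** 2 for y, ref in zip(output(m, xs), r))


def evolve(xs, r, n=20, generations=200, sigma=0.1, seed=0):
    """Each generation: create n mutations of the best candidate, keep the fittest."""
    rng = random.Random(seed)
    best = [0.0, 0.0, 0.0]
    history = [error(best, xs, r)]
    for _ in range(generations):
        mutants = [[g + rng.gauss(0, sigma) for g in best] for _ in range(n)]
        candidate = min(mutants, key=lambda m: error(m, xs, r))
        # Elitism: only replace the current best if the mutant is strictly better.
        if error(candidate, xs, r) < history[-1]:
            best = candidate
        history.append(error(best, xs, r))
    return best, history


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
reference = output([1.0, -2.0, 0.5], xs)  # reference produced by known coefficients
best, history = evolve(xs, reference)
print(best, history[-1])
```

Because the best candidate is never discarded, the error can only stay flat or fall from one generation to the next.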


Page 15: Modern Big Data Systems for Machine Learning

Artificial Neural Networks

Deep Convolutional Neural Network (d-CNN)

Optimization involving Stochastic Gradient Descent + Back-propagation
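Stochastic gradient descent with back-propagation can be shown on a network far smaller than a d-CNN. The sketch below trains a fully connected network with one tanh hidden layer (no convolutions) on XOR; the architecture, learning rate, epoch count, and seed are all assumptions chosen for illustration.

```python
import math
import random


def train_xor(epochs=3000, lr=0.1, hidden=4, seed=0):
    """Train a 2-hidden-1 network on XOR with per-sample SGD and manual back-propagation."""
    rng = random.Random(seed)
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        """Return the hidden activations and the (linear) network output."""
        h = [math.tanh(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j]) for j in range(hidden)]
        y = sum(w2[j] * h[j] for j in range(hidden)) + b2
        return h, y

    def loss():
        """Mean squared error over the four XOR cases."""
        return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    losses = [loss()]
    for _ in range(epochs):
        rng.shuffle(data)                      # "stochastic": visit samples in random order
        for x, t in data:
            h, y = forward(x)
            dy = 2.0 * (y - t)                 # dLoss/dy for squared error
            for j in range(hidden):
                # Back-propagate through the output weight and tanh (uses w2[j] before its update).
                dh = dy * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * dy * h[j]
                w1[j][0] -= lr * dh * x[0]
                w1[j][1] -= lr * dh * x[1]
                b1[j] -= lr * dh
            b2 -= lr * dy
        losses.append(loss())
    return losses, forward


losses, predict = train_xor()
print("loss before:", losses[0], "loss after:", losses[-1])
```

A convolutional network applies the same update rule; the difference is that its layers share weights across spatial positions.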


Page 16: Modern Big Data Systems for Machine Learning

All About Optimization

All these schemes involve solving for some constants that minimize or maximize some cost function.

They require fundamental optimization algorithms such as:

Direct Methods: Combinatorial Algorithms, Greedy Algorithm, Minimax Algorithm with alpha-beta pruning, …

Iterative Methods: Gradient Methods, Karmarkar’s Algorithm, …


Page 17: Modern Big Data Systems for Machine Learning

At the Core of Optimization…

…there is the solution of a system of linear equations of the form:

$$A\mathbf{x} = \mathbf{b}$$

with x subject to some constraints.

Solving it requires algorithms that fall into two categories:

Direct Methods: Gaussian elimination, LU, QR, Cholesky, LDL, …

Iterative Methods: MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
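The difference between the two categories can be seen by solving the same small system both ways. The sketch below pairs Gaussian elimination (direct) with the conjugate gradient method (iterative); it is a pure-Python illustration, the 2x2 system is a hypothetical example, and conjugate gradient as written assumes a symmetric positive-definite matrix.

```python
def gaussian_solve(A, b):
    """Direct method: Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Iterative method: conjugate gradient for symmetric positive-definite A."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                       # residual b - A x for the initial guess x = 0
    p = list(r)                       # first search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:           # stop once the residual norm is small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_direct = gaussian_solve(A, b)
x_iterative = conjugate_gradient(A, b)
print(x_direct, x_iterative)
```

A direct method finishes in a fixed number of operations, while an iterative method refines an approximation and can stop early, which matters when A is large and sparse.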


Page 18: Modern Big Data Systems for Machine Learning

Accelerating Machine Learning

CPU <-> GPGPU <-> FPGA

Sequential Processing <-> Parallel Processing

High Flexibility, High Abstractions, Many Libraries, …, Direct Methods

Ultra-Low-Latency, High Bandwidth, Fine grain optimization, ..., Iterative Methods, Neural Networks, Markov Chains, Monte Carlo


Page 19: Modern Big Data Systems for Machine Learning

Networked Computing Systems

Mainframe Computing, Cluster Computing, Distributed Computing, Grid Computing

Orbital Computing, Interstellar Computing, Galactic Computing, Inter-Universe Computing

Cloud Computing


Page 20: Modern Big Data Systems for Machine Learning

Modern Big Data Systems – Basic Components

Dynamic (abstraction) + Statically-Typed (speed) Languages

Need to rethink and re-engineer main systems:
  Data & Code Stores
  Logging
  Code Revision and Deployment
  Compute Nodes and Brokers Management
  Graceful Failure and Recovery
  Credentials and Access Controls
  Task Schedulers
  Messaging Bus
  Web/Mobile Interfaces
  Regression Testing…

Containerize and Standardize Services


Page 21: Modern Big Data Systems for Machine Learning

Examples – Modern Big Data Systems

Finance

Athena/Hydra @ JP Morgan
Quartz/Sandra @ Bank of America
Slang/SecDB @ Goldman Sachs
Optimus/DAL @ Morgan Stanley
WSQ Tech @ n-prop shops &
datapark.io @ quants / prop-shops

Machine Learning

Alpha/DL @ Muse.Ai


Page 22: Modern Big Data Systems for Machine Learning

Thank you

http://anton.io @roldao