Scalable, Distributed, Machine Learning for Big Data

Description: Big Data, parallel computing, cloud computing, Lambda architecture, MapReduce, GFS, Hadoop, HDFS, Percolator, Caffeine, Pregel, Drill, Chukwa, Hive, Pig, Scribe, Flume, Thrift, YARN, Storm, Summingbird, S4, ZooKeeper, Data Freeway, Puma 1/2/3, NoSQL, BigTable, Dynamo, Cassandra, HBase, Kafka, Samza, large-scale machine learning, overfitting, curse of dimensionality, load balancing, auto scaling, job scheduling, workflow, Spark, Mahout, Jubatus, GraphLab, BSP, Dremel, Giraph, Hama, Vowpal Wabbit, Trident-ML, Storm-Pattern, Samoa.


Page 1: Scalable, Distributed, Machine Learning for Big Data

Scalable, Distributed, Machine Learning for Big Data

Yu Huang

Sunnyvale, California

[email protected]

Page 2: Scalable, Distributed, Machine Learning for Big Data

Outline

Big Data - Volume, Variety, Velocity

Parallel Computing and Cloud computing

Lambda architecture: Batch, Speed and Serving Layers

◦ Hadoop: MR implementation from Yahoo

◦ Apache Thrift: scalable cross-language services from Facebook

◦ Chukwa: data collection system

◦ Apache Flume: stream data collection

◦ Hive: data warehouse

◦ Pig: high level data-flow language

◦ Zookeeper: high-performance coordination service

◦ YARN (MRv2 or next gen Hadoop)

◦ Summingbird: a library for writing MapReduce-style programs that run on batch and streaming platforms, from Twitter

◦ Storm: Stream processing from Twitter

◦ S4: Stream processing from Yahoo

◦ Scribe: server for stream data aggregating at Facebook

◦ Data Freeway: data stream at Facebook

◦ Puma: Stream processing from Facebook

◦ Kafka: distributed messaging system at Linkedin

◦ Samza: stream processing from LinkedIn

◦ Kinesis: real-time stream processing at Amazon

Page 3: Scalable, Distributed, Machine Learning for Big Data

Outline

NoSQL - Not Only SQL databases

◦ Google Bigtable

◦ Amazon Dynamo

◦ Cassandra by Facebook

◦ HBase: like Bigtable

Large Scale Machine Learning

◦ Spark – Lightning-Fast Cluster Computing

◦ Mahout - Scalable ML on Hadoop

◦ Jubatus – Distributed Online Real-time ML

◦ GraphLab – Big Learning on Graphs, from CMU

◦ Vowpal Wabbit – Fast Learning at Yahoo/MS

◦ Trident ML and Storm Pattern: ML on Storm, YARN

◦ Upcoming --- Samoa: ML on S4, Storm

Key Issues in Scalable Distributed ML for Big Data

◦ Load balancing

◦ Auto scaling

◦ Job Scheduling

◦ Workflow management

Reference

Page 4: Scalable, Distributed, Machine Learning for Big Data

Big Data

Volume (large amounts of data gathered);

Variety (various degrees of structure);

Velocity (how data flow, at high rates);

Value (business);

Variability (changes);

Veracity (quality).

Data as a Service (DaaS) in the cloud;

Two main strategies for dealing with big data:

◦ Sampling;

◦ Distributed systems.

Big Data Challenges

◦ Protecting privacy;

◦ Integration of big data technologies into enterprise landscape;

◦ Addressing increasing real time needs with increasing data volume and varieties;

◦ Leveraging cloud computing with big data storage and processing.

Page 5: Scalable, Distributed, Machine Learning for Big Data

Big Data Instances

One billion data instances
◦ Web-scale

◦ Guaranteed to contain data in different formats: ASCII text, pictures, JavaScript code, PDF documents…

◦ Guaranteed to contain (near) duplicates

◦ Likely to be badly preprocessed

◦ Storage is an issue

One trillion data instances
◦ Beyond the reach of modern technology

◦ Peer-to-peer paradigm is (arguably) the only way to process the data

◦ Data privacy / inconsistency / skewness issues:
Can't be kept in one location

Is intrinsically hard to sample

Page 6: Scalable, Distributed, Machine Learning for Big Data

Big Data Analysis Pipeline

Page 7: Scalable, Distributed, Machine Learning for Big Data

Parallel Computing

Data/instruction stream taxonomy;

◦ SIMD, MIMD, …

Data intensive

◦ Cloud computing

Compute intensive

◦ GPU computing

Shared memory: OpenMP

Distributed memory: MPI,

Hybrid: MR

Page 8: Scalable, Distributed, Machine Learning for Big Data

Cloud Computing

A model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage), usually for large Internet services;

Dynamic provision of services & resource pools in a coordinated fashion;

Cloud computing infrastructure is just a web service interface to operating system virtualization (via a hypervisor);

Heterogeneous by virtualization;

Everything as a service (XaaS);

Data intensive: big data;

Distributed parallel, more like utility computing;

Not grid computing.

Page 9: Scalable, Distributed, Machine Learning for Big Data

X-as-a-Service

Page 10: Scalable, Distributed, Machine Learning for Big Data

Lambda Architecture

The equation "query = function(all data)" is the basis of all data systems (data is more than information);

Human fault-tolerance – the system is robust to data loss or corruption caused by human error;

Data immutability – store data in its rawest form, immutable and kept in perpetuity.

Re-computation – with the two principles above it is always possible to (re)-compute results

Layered structure:

◦ Batch layer: unrestrained batch compute, horizontal scalable, high latency, read-only database, raw dataset, override speed layer (like Hadoop);

◦ Speed layer: only new data, stream processing, continuous compute, transactional, limited storage of windowed data (such as Storm);

◦ Serving layer: query batch views by load and random access.

Can discard any view, batch and real time, and just recreate everything from the master data.

Mistakes are corrected via recomputation.

◦ Write bad data? Remove the data & recompute.

◦ Bug in view generation? Just recompute the view.

Data storage is highly optimized.
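To make the batch/speed/serving split concrete, a minimal single-process Python sketch follows (all names such as record_event and recompute_batch_view are hypothetical, not from any particular Lambda implementation): the batch layer recomputes its view from the immutable master dataset, the speed layer keeps an incremental view of only the new data, and a query merges the two.

```python
# Illustrative, in-memory stand-ins for the batch, speed, and serving layers.
master_dataset = []          # batch layer input: immutable, append-only raw events
realtime_view = {}           # speed layer: incremental counts for recent data only

def record_event(user, ts):
    """All events land in the master dataset; they also update the speed layer."""
    master_dataset.append({"user": user, "ts": ts})
    realtime_view[user] = realtime_view.get(user, 0) + 1

def recompute_batch_view():
    """Batch layer: recompute the view from scratch over ALL raw data (high latency)."""
    view = {}
    for event in master_dataset:
        view[event["user"]] = view.get(event["user"], 0) + 1
    return view

def query(user, batch_view):
    """Serving layer: query = function(all data) = batch view merged with realtime view."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

record_event("alice", 1); record_event("alice", 2); record_event("bob", 3)
batch_view = recompute_batch_view()   # once the batch run finishes...
realtime_view.clear()                 # ...the speed-layer view for that data can be discarded
record_event("alice", 4)              # a new event arriving after the batch run
print(query("alice", batch_view))     # 3: two from the batch view, one from the speed layer
```

Because the master data is immutable, a bug in view generation is corrected simply by rerunning recompute_batch_view().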

Page 11: Scalable, Distributed, Machine Learning for Big Data

Lambda Architecture Flowchart

Page 12: Scalable, Distributed, Machine Learning for Big Data

Data Analytics System Architecture

(Architecture diagram: an online transaction processing path alongside Facebook and Apache open-source components.)

Page 15: Scalable, Distributed, Machine Learning for Big Data

Map-Reduce

A program model borrowed from functional programming

Separates the details of the original problem from the details of parallelization;

◦ map() produces one or more intermediate key/value pairs from the split input ("shards");

◦ reduce() combines the intermediate key/value pairs into final files after partitioning and sorting by key;

Scale to a large cluster of machines from a single machine;

Fault tolerance: Map or Reduce;

Locality: Distributed GFS chunks;

Bottleneck: the Reduce phase can't start until the Map phase is completely finished (batch, not stream, processing):

◦ May not be suitable for real-time processing and in-depth analysis.
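The map()/reduce() contract can be illustrated with a toy single-machine word count in plain Python (the function names are made up and nothing here is tied to Hadoop); a real framework would distribute the map, shuffle/sort, and reduce phases across a cluster.

```python
# A toy, single-process illustration of the MapReduce programming model (word count).
from collections import defaultdict

def map_fn(_, line):
    # map(): emit intermediate key/value pairs from one input record
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce(): combine all values for one key into the final output
    yield word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):          # map phase
            intermediate[k].append(v)            # group/partition by key ("shuffle")
    output = {}
    for k in sorted(intermediate):               # sort by key, then reduce phase
        for out_k, out_v in reduce_fn(k, intermediate[k]):
            output[out_k] = out_v
    return output

docs = [(0, "big data big learning"), (1, "big data systems")]
print(mapreduce(docs, map_fn, reduce_fn))  # {'big': 3, 'data': 2, 'learning': 1, 'systems': 1}
```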

Page 16: Scalable, Distributed, Machine Learning for Big Data

Map-Reduce Pipeline

Page 17: Scalable, Distributed, Machine Learning for Big Data

Hadoop

HDFS: data storage and transfer, serving as GFS in Hadoop;
◦ NameNode (cf. job tracker), DataNode (cf. task tracker);

◦ Master Node, Slave Node;

◦ Error handling: replication (3 by default);

Job Tracker: scheduling, JobConf and JobClient;

Task Tracker: status, TaskRunner, map or reduce;

Data In/Out: ◦ HDFS block size in Input Splits

◦ # of reducers in Output;

Task Failure: report;

Job Scheduler: ◦ FIFO, Fair, Capacity,…

Page 18: Scalable, Distributed, Machine Learning for Big Data

Map-Reduce in Hadoop

(Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes each run task instances.)

Page 19: Scalable, Distributed, Machine Learning for Big Data
Page 20: Scalable, Distributed, Machine Learning for Big Data

Hive & Pig

Hive: A database/warehouse on top of Hadoop;

◦ SQL as a familiar data warehousing tool

◦ Extensibility – Types, Functions, Formats, Scripts

◦ Scalability and Performance;

◦ Rich data types (structs, lists and maps);

◦ Efficient implementations of SQL filters, joins and group-bys on top of MR;

◦ Easy interactions with different programming languages;

HQL is like SQL.

Pig: A platform for easier analyzing large data sets;

◦ Pig Latin: data flow language similar to scripting languages;

◦ Pig Engine: parses, optimizes and automatically executes Pig Latin scripts;

◦ User-defined functions for column transformation (TOUPPER) or aggregation (SUM);

UDFs to take advantage of the combiner.

◦ Four join implementations built in: hash, fragment-replicate, merge, skewed;

◦ Writing load and store functions is easy once an I/O format exists;

◦ Piggybank - a collection of user contributed UDFs;

◦ DataFu - LinkedIn's collection of Pig UDFs.

Page 21: Scalable, Distributed, Machine Learning for Big Data

Apache Thrift

Software framework for scalable cross-language services, developed at Facebook;

A software stack + a code generation engine to build services across C++, Java, Python, PHP, Ruby, Erlang, Perl, C#, OCaml and Delphi, etc.;

Key components in this open source:

◦ Type: for users to develop using completely natively defined types;

◦ Transport: used by the generated code to facilitate data transfer;

◦ Protocol: a certain messaging structure used in data transport, agnostic to encoding;

◦ Versioning: staged rollouts of changes to deployed services;

◦ RPC implementation: TProcessor instance for data stream processing to realize remote procedure calls (RPC), and TServer abstraction;

The interface definition language (IDL) allows for definition of Types:

◦ A Thrift IDL file is processed by the code generator to produce code for the target languages to support the defined structs and services in the IDL file.

Similar systems

◦ SOAP. Designed for web services via HTTP, excessive XML parsing overhead;

◦ CORBA. Relatively comprehensive, debatably overdesigned and heavyweight;

◦ Avro. Dynamic typing, untagged data, no manually-assigned field IDs;

◦ COM. Embraced mainly in Windows client software. Not entirely open solution;

◦ Pillar. Lightweight and high-performance, but missing versioning & abstraction;

◦ Protocol Buffers. Closed-source, owned by Google.

Page 23: Scalable, Distributed, Machine Learning for Big Data

Apache Thrift

The Thrift stack is a common class hierarchy implemented in each language that abstracts out the tricky details of protocol encoding and network communication.

Page 24: Scalable, Distributed, Machine Learning for Big Data

Chukwa

A data collection system for monitoring large distributed systems;

Provides flexible/powerful toolkit to display, monitor, and analyze results;

Architecture:

◦ Agents - run on each machine and emit data;

◦ Collectors - receive data from the agent and write it to stable storage;

◦ MapReduce jobs - parsing and archiving the data;

◦ Hadoop Infrastructure Care Center - a web-portal style interface.

Page 25: Scalable, Distributed, Machine Learning for Big Data

ZooKeeper in Hadoop

A shared hierarchical name space of data registers;

Exposes common services in simple interface:

◦ Naming, configuration management, locks & synchronization, group membership services, leader selection;

Each node in the namespace is called a ZNode.

◦ Persistent Nodes, Ephemeral Nodes, Sequence Nodes;

◦ Every ZNode has data and can optionally have children;

◦ Read requests are processed locally at the server;

◦ Write requests are forwarded to the leader.

ZNode paths: every ZNode exists at some path

◦ Canonical, absolute, slash-separated;

◦ No relative references;

◦ Names can have Unicode characters.

ZNode watches can be set on ZNodes

◦ One time change triggers, always ordered.

A client connects to ZooKeeper and initiates a session;

Consistency guarantees;

Support Kerberos security.

Page 26: Scalable, Distributed, Machine Learning for Big Data

ZooKeeper Service

ZooKeeper service is replicated over a set of machines;

All machines store a copy of the data (in memory);

A leader is elected on service startup;

Clients connect to a single ZooKeeper server and maintain a TCP connection;

Clients can read from any ZooKeeper server; writes go through the leader and need majority consensus.
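As a hedged illustration of these ideas, the sketch below uses the third-party kazoo Python client (assuming a ZooKeeper server on 127.0.0.1:2181 and that kazoo is installed); the paths and data are made up. It registers an ephemeral sequential ZNode for group membership and sets a child watch.

```python
# Hedged sketch using the third-party `kazoo` client; paths and payloads are invented.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()                                   # open the session (TCP connection to one server)

zk.ensure_path("/demo/workers")              # persistent parent ZNode (absolute, slash-separated)

# Ephemeral + sequence ZNode: it disappears when this session dies, so the children of
# /demo/workers are effectively the live group membership.
me = zk.create("/demo/workers/worker-", b"host=node1", ephemeral=True, sequence=True)

@zk.ChildrenWatch("/demo/workers")           # watch fires on membership changes
def on_membership_change(children):
    print("live workers:", sorted(children))

print("registered as", me)
# Reads are served by the connected server; the create() above (a write) went through the leader.
zk.stop()
```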

Page 27: Scalable, Distributed, Machine Learning for Big Data

Percolator

Describes how the web search index is kept up to date at Google;

◦ Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines;

Incremental updates to big data: code in Java, no need for a batch process;

Provides transactions/locking, based on GFS, built on top of BigTable;

Architecture:

◦ Applications are a sequence of observers;

◦ An observer is called via a notification;

◦ A notification is triggered when table data changes;

◦ Applications call BigTable’s TableServers via RPC;

◦ TableServers call GFS ChunkServer;

Page 28: Scalable, Distributed, Machine Learning for Big Data

Percolator

Random accesses to the document repository while maintaining data invariants;

Faster than the comparable MapReduce pipeline: improved latency (100x) and reduced the documents' average age by 50%;

Time stamping and locking via the Chubby lock server;

Page 29: Scalable, Distributed, Machine Learning for Big Data

Caffeine

Caffeine is a new search indexing scheme (algorithm) based on Percolator;

Even with changes, most white hat optimization tactics continue to prevail;

More competition for single, generic-type keywords, less stability of rankings, and increased focus on long-tail keywords in SEO;

Feature site titles and snippets with higher phrase/keyword density.

Faster index: returned at faster speeds

Fresher results: more current, such as blog posts the last few days.

More emphasis on social media, like Facebook, LinkedIn, Blogger, etc.

Less emphasis on universal search.

◦ Lower on the page to make paid search more visible.

Increased prominence of video.

◦ As prominently featuring video listings.

Keywords in domain name.

◦ Do weigh keyword domain names even higher.

◦ For a new site, a microsite with your keywords embedded within the URL.

“organize the world's information, make it universally accessible and useful”.

Page 30: Scalable, Distributed, Machine Learning for Big Data

Panda/Farmer Update

SEO (Search Engine Optimization): 'crawl' the web (the spider / Crawl Team); create the page Index (the Quality Team); the Spam Team throws away stuff in the Index that shouldn't be there (with help from the Crawl Team).

Google Mayday update: degrade lower-quality websites, place more weight on quality signals, lowering weight of textual relevancy signals.

◦ Anti-spam and user behavior.

Google Farmer (renamed Panda) update: hurt "content farms", i.e. sites that contain huge amounts of poor-quality content in order to rank on as many keyword combinations as possible;

◦ Placing the emphasis on user experience (average time spent on the site/specific page, bounce rate, Click Through Rate etc. )

◦ The social trend - “+1” buttons was added near each result;

◦ Personalized Search - The changes in results between users could arise from geographic differences, daytime changes; If the user is logged in to Google account the results would be adjusted even further since Google’s servers collect information about the user and his browsing habits.

Page 32: Scalable, Distributed, Machine Learning for Big Data

Dremel

Scalable, interactive ad-hoc query system for analysis of nested data;

◦ multi-level execution trees and SQL-like language to express ad hoc queries

◦ column-striped storage representation of nested data

BigQuery: an interactive query service that is the external implementation of Dremel;

◦ Hive and Pig are slow

Data model is based on strongly-typed nested records

Tablet Storage and Horizontal Partitioning to save space

Levels are packed as bit sequence;

Queries are scheduled based on their priorities and load balancing, with fault tolerance

◦ Slots and histograms

◦ Handles stragglers

◦ Tablets are three-way replicated

Interoperates with Google's data management tools

◦ In situ data access (e.g., GFS, Bigtable)

◦ MapReduce pipelines

Page 34: Scalable, Distributed, Machine Learning for Big Data

Apache Drill

• Open source implementation of Google BigQuery

• Flexibility: broader range of query languages

Fast

◦ Low latency queries

◦ Columnar execution: like Google Dremel

◦ Complement native interfaces and MapReduce/Hive/Pig

Open

◦ Community driven open source project

◦ Under Apache Software Foundation

Modern

◦ Standard ANSI SQL:2003 (select/into)

◦ Nested/hierarchical data support

◦ Schema is optional
◦ Query any HBase, Cassandra or MongoDB table

◦ Supports RDBMS, Hadoop and NoSQL

DrQL: SQL-like query language

Mongo Query Language

Page 35: Scalable, Distributed, Machine Learning for Big Data

YARN

Yet Another Resource Negotiator: MRv2 (Next Gen. Hadoop);

◦ Predictable Latency – A major customer concern;

◦ Support for alternate programming paradigms to MR.

Separate the tasks of Job Tracker

◦ Resource management

◦ Job Scheduling / Management

Resource Manager: Manages the global assignment of compute resources to applications;

◦ A pure scheduler (capacity/fair scheduler) and an Application Manager to accept job submissions for Application Master;

Node Manager: the per-machine framework agent for monitoring the resource usage, reporting to the Scheduler;

Application Master: manages the application’s life cycle (scheduling and coordination), a single job or a DAG of jobs.

Container: a process started by Node Manager to grant an application the privilege to use a certain amount of resources.

Page 36: Scalable, Distributed, Machine Learning for Big Data

YARN Architecture

(Diagram: a client submits jobs to the ResourceManager; NodeManagers report node status and launch containers, including the App Master, which issues resource requests and tracks MapReduce status.)

Page 37: Scalable, Distributed, Machine Learning for Big Data

Storm: Distributed, Real-Time

Built by BackType, acquired by Twitter, written in Clojure;

Tuples: ordered list of elements;

Streams: Unbounded sequence of tuples

Spout: Source of Stream

◦ E.g. Read from Twitter streaming API, event data,…

Bolts: Processes input streams and produces new streams

◦ E.g. Functions, Filters, Aggregation, Joins,…

Topologies: a DAG of spouts and bolts;

Tasks: instances of Spouts and Bolts;

Stream grouping between spout and bolt: 7 options

◦ All grouping, non grouping;

◦ Global grouping, local grouping;

◦ Shuffle grouping, direct grouping;

◦ Fields grouping.

Guaranteed message processing;

Multilang support and transactional topologies;

Applied for stream processing, continuous computation, distributed RPC;

Trident: high-level abstraction on top of Storm, like Pig of Hadoop.
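A toy sketch in plain Python (not the Storm API; all class and function names are invented) of how stream groupings route tuples from a spout to parallel bolt tasks: shuffle grouping scatters tuples randomly, while fields grouping hashes a field so that equal values always reach the same task.

```python
# Toy model: a spout emits word tuples; three bolt task instances count them.
import random

class WordCountBoltTask:
    def __init__(self):
        self.counts = {}
    def execute(self, tup):                      # a bolt processes one tuple at a time
        word = tup["word"]
        self.counts[word] = self.counts.get(word, 0) + 1

def shuffle_grouping(tasks, tup):
    return random.choice(tasks)                  # spread tuples randomly across tasks

def fields_grouping(tasks, tup, field="word"):
    return tasks[hash(tup[field]) % len(tasks)]  # same word -> same task, so counts stay consistent

def sentence_spout():                            # spout: source of a (here, finite) stream
    for s in ["storm storm trident", "storm spout bolt"]:
        for w in s.split():
            yield {"word": w}

tasks = [WordCountBoltTask() for _ in range(3)]  # bolt parallelism = 3 task instances
for tup in sentence_spout():
    fields_grouping(tasks, tup).execute(tup)     # swap in shuffle_grouping to see counts scatter
print([t.counts for t in tasks])
```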

Page 38: Scalable, Distributed, Machine Learning for Big Data

Storm: System Architecture

(Diagram: Nimbus is the master node, like the JobTracker; Supervisors are worker nodes that manage workers; ZooKeeper stores metadata; a web-based UI; spouts and bolts form the topology.)

Page 39: Scalable, Distributed, Machine Learning for Big Data

Summingbird

A library to write MapReduce-style programs in Scala on distributed platforms, with Storm (stream) & Scalding (batch, on top of Cascading);

Data: stream and snapshot;

Components in Summingbird:

◦ Producer: data stream abstraction for Platform to compile MR workflow

◦ Platform: implemented for any stream MR library;

◦ Source: stream of data

◦ Store: the "reduce" of streaming MR; represents a snapshot of all key-value pairs;

◦ Sink: materializes an un-aggregated "stream" representation, not a snapshot;

◦ Service: performs a "lookup join" against a Store's snapshot or a Sink's stream;

◦ Plan: final representation of the MR flow produced by a Platform.

Related projects:

◦ Algebird is an abstract algebra library for Scala;

◦ Bijection's Injection typeclass to share serialization between different platforms and clients;

◦ Chill augments Kryo with options, and provides integration with Storm, Scala, Hadoop;

◦ Storehaus's async key-value store traits implement Summingbird's client;

◦ Tormenta provides a layer over Storm’s Scheme and Spout interfaces.

Page 40: Scalable, Distributed, Machine Learning for Big Data

S4: Simple Scalable Streaming System

Apache Incubator S4 by Yahoo, written in Java;

◦ Real-time/decentralized/scalable/event-driven/stream processing;

◦ Actors programming model (PEs);

◦ All in-memory, no disk bottlenecks;

◦ Pluggable event serving policies: load shedding, throttling, blocking;

◦ Failover, checkpointing , replication and recovery;

◦ Dynamic load balancing/adaptive load management.

Communication, scheduling & distribution across containers;

◦ S4 applications are built as a graph of: processing elements (PEs) for event handling, and streams of events that interconnect PEs;

◦ S4 processing nodes: distributed PE containers/hosts for PE;

◦ S4 clusters define named ensembles of S4 processing nodes;

◦ S4 events are dispatched to nodes according to their key.

◦ PEs communicate asynchronously by sending events on streams.

◦ Communication Layer: cluster management/failover (ZooKeeper);

◦ S4 adapters: applications that convert external streams into S4 events. Adapters are also S4 applications, so they scale easily.

Page 42: Scalable, Distributed, Machine Learning for Big Data

Data Freeway

A scalable data stream framework at Facebook

Scribe: Simple push/RPC-based logging system

Calligraphus: Call sync every 7 seconds

◦ RPC -> File System: each log category is represented by 1 or more FS directories

Each directory is an ordered list of files

◦ Bucketing support: application buckets are application-defined shards.

Infrastructure buckets allow log streams from x B/s to x GB/s

Continuous Copier: File System to File System

◦ Low latency and smooth network usage

◦ Deployment: implemented as a long-running map-only job; can move to any simple job scheduler

◦ Coordination: Use lock files on HDFS for now, move to Zookeeper soon

PTail: File System -> Stream ( -> RPC )

◦ Checkpoints inserted into the data stream

◦ Can roll back to tail from any data checkpoints

◦ No data loss/duplicates

Page 43: Scalable, Distributed, Machine Learning for Big Data

Data Freeway Architecture

Page 44: Scalable, Distributed, Machine Learning for Big Data

Scribe

Scribe is a server for aggregating streaming log data, designed to scale to a very large number of nodes and robust to network/node failures;

Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph;

Scribe is unique in that clients log entries consisting of two strings, a category and a message;
◦ The category is a high-level description of the intended destination of the message and can have a specific configuration in the Scribe server;

The server allows for configurations based on category prefix, and a default configuration that can insert the category name in the file path;

Flexibility and extensibility are provided through the "store" abstraction;
◦ Stores are loaded dynamically based on a configuration file, and can be changed at runtime without stopping the server;

◦ Stores are implemented as a class hierarchy, and stores can contain other stores;

Scribe is implemented as a Thrift service using a non-blocking C++ server.

Page 45: Scalable, Distributed, Machine Learning for Big Data

Apache Flume

Flume is a distributed service for collecting, aggregating, and moving large amounts of log data;

◦ A simple/flexible architecture based on streaming data flows;

◦ Robust/Fault tolerant with tunable reliability mechanisms and failover and recovery mechanisms;

◦ Use an extensible data model that allows for online analytic application;

Data flow model:

◦ An Event is a unit of data that flows through a Flume agent; a Flume agent is a process (JVM) that hosts the components;

◦ A Flume source stores an event into one or more channels (passive stores) until it’s consumed by a Flume sink;

◦ The sink puts the event into HDFS or forwards it to the Flume source of the next Flume agent (next hop) in the flow;

◦ The source and sink within the given agent run asynchronously with the events staged in the channel.

Set up:

◦ Flume agent configuration is stored in a local file (source, sink, channel);

◦ The agent knows what individual components to load and how they are connected to constitute the flow;

◦ Build multi-hop flows where events travel through multiple agents before reaching the final destination.

Page 47: Scalable, Distributed, Machine Learning for Big Data

Puma: Real-Time MR

Real-time data pipeline developed at Facebook, to be open source soon

◦ Utilize existing log aggregation pipeline (Scribe-HDFS)

◦ Extend low-latency capabilities of HDFS (Sync+PTail)

◦ High-throughput writes (HBase)

Support for real time reliable aggregation: Unique user count, most frequent elements

◦ Utilize HBase atomic increments to maintain roll-ups

◦ Complex HBase schemas for unique-user calculations

◦ Store checkpoint information directly in HBase

Multiple Group-By operations per log line

The first key in Group-By is always time/date-related

Two newer versions:

◦ Puma2: simple

◦ Puma3: better performance

PQL – Puma Query Language

(Pipeline: log stream → aggregations → storage → serving.)

Page 48: Scalable, Distributed, Machine Learning for Big Data

Puma2: Real-Time MR

Map phase with PTail:
◦ PTail provides parallel data streams

◦ Divide the input log stream into N shards

◦ 1st version only supported random bucketing

◦ Now supports application-level bucketing

Reduce phase with HBase:
◦ HBase does a single increment on multiple columns

◦ Every row+column in HBase is an output key

◦ Aggregate key counts using atomic counters

◦ Also maintain per-key lists or other structures

(Pipeline: PTail → Puma2 → HBase → serving.)

Page 49: Scalable, Distributed, Machine Learning for Big Data

Puma3: Real-Time MR

Puma3 is sharded (split) by aggregation key.

Each shard is a hash map in memory.

Each entry in hash map is a pair of an aggregation key and a

user-defined aggregation.

HBase as persistent key-value storage.

(Pipeline: PTail → Puma3 → HBase → serving, with write, checkpoint, and read workflows plus a join step.)

Special Aggregations

Unique counts calculation
◦ Adaptive sampling
◦ Bloom filter (future)

Most frequent item (future)
◦ Lossy counting
◦ Probabilistic lossy counting

Page 50: Scalable, Distributed, Machine Learning for Big Data

Amazon Kinesis

Kinesis scales for real-time processing of streaming big data;

Kinesis requires that a user create at least two applications—a “Producer” and a Kinesis application (also called a “Worker”)—using Amazon’s Kinesis APIs;

The “Producer” takes data from some source and converts it into a "Kinesis Stream," a continuous flow of 50-kilobyte data chunks sent in the form of HTTP PUTs;

The "Worker" takes the data from the Kinesis Stream and does whatever processing is required;

The Kinesis application can run on any type of Amazon EC2 instance, and Kinesis will auto-scale the instances to handle varying streaming loads;

The Kinesis SDK libraries, used to create Kinesis Producers and applications, are only available for Java, but Kinesis applications can be written in any language by simply calling the Kinesis APIs directly;

Stream output is sent to Amazon’s S3, DynamoDB, or Redshift;

Kinesis can create DAGs of Kinesis applications and data streams.


Page 51: Scalable, Distributed, Machine Learning for Big Data

Amazon Kinesis


Page 52: Scalable, Distributed, Machine Learning for Big Data

NoSQL - Not Only SQL

Class of non-relational data storage systems (non-RDBMS)

Usually do not require a fixed table schema nor use concept of joins

All NoSQL offerings relax one or more of the ACID/BASE properties:
◦ Strong: ACID (Atomicity, Consistency, Isolation, Durability)

◦ Weak: BASE (Basically Available, Soft-state, Eventual consistency)

Three major papers for the NoSQL movement:
◦ BigTable (Google)

◦ Dynamo (Amazon): gossip protocol (discovery and error detection), distributed key-value data store, eventual consistency

◦ CAP Theorem: consistency, availability and partition tolerance

NoSQL solutions fall into two major areas:
◦ Key/Value, or 'the big hash table':

Amazon S3 (Dynamo)

◦ Schema-less in multiple flavors: column, document or graph-based.

Cassandra (column-based)

CouchDB (document-based)

Neo4J (graph-based)

HBase (column-based)

(CAP triangle: consistency, availability, partition tolerance.)

Page 53: Scalable, Distributed, Machine Learning for Big Data

BigTable at Google

A Bigtable is a sparse, distributed, persistent multi-dim. sorted map

The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Rows with consecutive keys are grouped together as “tablets”.

Column keys are grouped into sets called “column families”, which form the unit of access control.

A column key is named using the following syntax: family:qualifier.

Bigtable uses the distributed Google File System (GFS) to store log and data files.

The Google SSTable file format is used internally to store Bigtable data.
◦ An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.

Bigtable relies on a persistent distributed lock service

◦ Chubby (a name space).
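The data model can be mimicked with a tiny in-memory sketch in Python (purely illustrative, nothing here is Bigtable's real API): a sparse, sorted map from (row key, family:qualifier, timestamp) to uninterpreted bytes, with multi-version cells and row-range scans.

```python
# Minimal in-memory model of Bigtable's abstraction; all names are made up.
import bisect, time

table = {}   # {(row, "family:qualifier"): [(timestamp, value_bytes), ...]} newest first

def put(row, column, value, ts=None):
    ts = ts if ts is not None else time.time()
    cell = table.setdefault((row, column), [])
    cell.insert(0, (ts, value))                 # keep multiple timestamped versions

def get(row, column):
    versions = table.get((row, column), [])
    return versions[0][1] if versions else None # latest version

def scan_row_range(start_row, end_row):
    """Rows with consecutive keys would live in the same 'tablet'; emulate a range scan."""
    keys = sorted(table)                        # sorted by (row, column)
    lo = bisect.bisect_left(keys, (start_row, ""))
    return [(r, c, get(r, c)) for (r, c) in keys[lo:] if r < end_row]

put("com.cnn.www", "contents:", b"<html>...</html>")
put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
print(get("com.cnn.www", "anchor:cnnsi.com"))
print(scan_row_range("com.a", "com.z"))
```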

Page 54: Scalable, Distributed, Machine Learning for Big Data

Dynamo: Key-value Store

Distributed, highly available storage system from Amazon;

◦ SLA: the application can deliver its functionality in a bounded time.

Simple interface associated with a key: get(key) and put(key, data)

◦ Binary objects (data<1MB) identified by a unique key.

Partitioning to scale incrementally:
◦ Consistent hashing: the output range of a hash function is treated as a "ring".

◦ "Virtual nodes": each physical node can be responsible for more than one virtual node.

Replication for high availability and durability

◦ “preference list”: The list of nodes that is responsible for storing a particular key.

Data versioning: vector clocks to capture causality between versions;

◦ A vector clock is a list of (node, counter) pairs.

Execution of get ()/put (): client, coordinator

Handling failures: “sloppy quorum” and hinted handoff

Replica synchronization: anti-entropy protocol using Merkle (hash) tree

Membership/Failure Detection

◦ Gossip-based protocol

Implementation: Java and APIs over HTTP

◦ BDB or MySQL;

Note: Amazon S3 service powered by Dynamo.
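A small, illustrative Python sketch of Dynamo-style partitioning follows (not Amazon's code; class and node names are invented): keys and virtual nodes are hashed onto a ring, and the preference list is found by walking clockwise until N distinct physical nodes are collected.

```python
# Consistent-hashing ring with virtual nodes and a preference list (illustrative only).
import bisect, hashlib

def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8, replicas=3):
        self.replicas = replicas
        # each physical node owns several points ("virtual nodes") on the ring
        self.ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def preference_list(self, key):
        """Walk clockwise from the key's position, collecting N distinct physical nodes."""
        idx = bisect.bisect(self.points, _h(key)) % len(self.ring)
        prefs = []
        while len(prefs) < self.replicas:
            node = self.ring[idx][1]
            if node not in prefs:
                prefs.append(node)
            idx = (idx + 1) % len(self.ring)
        return prefs

ring = Ring(["nodeA", "nodeB", "nodeC", "nodeD"])
print(ring.preference_list("user:42"))   # the N nodes responsible for storing this key
# Adding or removing a node only remaps the keys adjacent to its virtual nodes.
```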

Page 55: Scalable, Distributed, Machine Learning for Big Data

Cassandra

A Decentralized Structured Storage System;

Design goals:

◦ High availability, eventual consistency, incremental scalability, optimistic replication, “knobs” to tune tradeoffs between consistency, durability and latency, low total cost of ownership, and minimal administration;

Architecture:

Each node communicates with each other through the Gossip protocol, which exchanges information across the cluster every second;

A commit log on each node to capture write activity; data durability assured

Data is also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SSTable);

It is a row-oriented, column-family structure;

A key space is akin to a database in the RDBMS world;

A column family is similar to an RDBMS table but is more flexible/dynamic;

A row in a column family is indexed by its key; other columns may be indexed

Cassandra ~= Bigtable + Dynamo

Page 56: Scalable, Distributed, Machine Learning for Big Data

HBase

HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on Google BigTable;

◦ Part of the Hadoop ecosystem (written in Java)

◦ Native connections to Map-Reduce

HBase by default manages a ZooKeeper instance as the authority on cluster state;

Structures data as tables of column-oriented rows

◦ Large, variable number of columns per row

◦ Rows stored in sorted order

◦ Region: contiguous set of sorted rows, made of Stores

◦ Table: split roughly into equal sized regions

◦ RegionServer: keeps log of every update, manage region split

◦ Master: assigns Table Regions to RegionServers

◦ MemStore: Holds in-memory modifications to the Store.

HBase is not fully ACID-compliant

Can random read and write (no built-in joins)

◦ Single row operations (put, get, scan)

◦ Multiple row operations (scan, multiPut)
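As a hedged example, the snippet below uses the third-party happybase client, which talks to HBase through its Thrift gateway (it assumes a Thrift server on localhost and a pre-created table 'pageviews' with column family 'cf'; all names are illustrative).

```python
# Hedged happybase sketch; table and column names are invented.
import happybase

conn = happybase.Connection(host="localhost")      # speaks to the HBase Thrift server
table = conn.table("pageviews")

# Single-row put: rows are stored in sorted order, columns addressed as family:qualifier.
table.put(b"com.example/index|20140908", {b"cf:count": b"1", b"cf:agent": b"bot"})

# Single-row get.
print(table.row(b"com.example/index|20140908"))

# Scan a contiguous, sorted key range (a single region would typically serve it).
for key, data in table.scan(row_prefix=b"com.example/"):
    print(key, data)

conn.close()
```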

Page 57: Scalable, Distributed, Machine Learning for Big Data

HBase Architecture

Page 59: Scalable, Distributed, Machine Learning for Big Data

Kafka

A distributed publish-subscribe messaging system;

Maintains feeds of messages in categories called topics: a category or feed name to which messages are published;

Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log;

◦ Each partition has one server which acts as the "leader" and zero or more servers which act as "followers";

Messaging traditionally has two models: queuing and publish-subscribe;

◦ Producers: publish messages to a Kafka topic;

◦ Consumers: subscribe to topics and consume the published messages by pulling;

A Kafka cluster is comprised of one or more servers, called brokers, that store published messages.

Efficiency on a single partition

◦ A very simple storage: log == list of files, message addressed by a log offset;

◦ Efficient data transfer: No message caching, zero-copy transfer, FS buffering;

◦ Stateless broker: Consumer maintains its own state, SLA-based retention.

Distributed coordination: Auto load balancing

◦ Make a partition within a topic the smallest unit of parallelism;

◦ No central “master” node, but ZooKeeper helps for a consensus service;

Delivery guarantees: messages in order delivered to a consumer.

◦ Built-in replication to store each message in multi brokers.
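The storage abstraction can be sketched in a few lines of plain Python (a toy model, not the Kafka client API): a topic is a set of partitions, each an append-only ordered log where messages are addressed by offset, and consumers track their own offsets so brokers stay stateless.

```python
# Toy model of topics, partitions, offsets, and consumer-maintained state.
class Partition:
    def __init__(self):
        self.log = []                       # ordered, immutable sequence of messages
    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1            # the message's offset

class Topic:
    def __init__(self, partitions=2):
        self.partitions = [Partition() for _ in range(partitions)]
    def produce(self, key, msg):            # a partition is the unit of parallelism
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append(msg)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)   # consumer keeps its own position
    def poll(self, partition):
        log, off = self.topic.partitions[partition].log, self.offsets[partition]
        batch = log[off:]
        self.offsets[partition] = len(log)  # in-order delivery within a partition
        return batch

clicks = Topic()
clicks.produce("user1", "click:/home"); clicks.produce("user2", "click:/about")
c = Consumer(clicks)
print(c.poll(0), c.poll(1))
```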

Page 60: Scalable, Distributed, Machine Learning for Big Data

Kafka Cluster

Page 61: Scalable, Distributed, Machine Learning for Big Data

Samza

Samza is a stream processing framework on top of Hadoop (MRv2.0)

◦ Simple API: a very simple call-back based "process message" API;

◦ Managed state: snapshotting and restoration;

◦ Fault tolerance: work with YARN to migrate your tasks;

◦ Durability: uses Kafka to guarantee no messages get lost;

◦ Scalability: partitioned and distributed at every level;

◦ Pluggable: provides a pluggable API to run Samza with other messaging systems;

◦ Processor isolation: works with YARN, to give security and resource scheduling.

Concepts in Samza:

◦ Streams: immutable messages of a similar type or category;

◦ Jobs: code that performs a logical transformation on a set of input streams ;

◦ Partitions: each partition in the stream is a totally ordered sequence of messages;

◦ Tasks: the unit of parallelism of the job;

◦ Dataflow graphs: nodes - streams containing data, edges - jobs performing transformations

◦ Containers: the unit of physical parallelism that runs one or more tasks.

Samza architecture: a stream processing built with Kafka and YARN

◦ Streaming: Kafka

◦ Execution: YARN

◦ Processing: Samza API

◦ Uses YARN and Kafka to provide stage-wise stream processing /partitioning.

Page 62: Scalable, Distributed, Machine Learning for Big Data

Samza Concepts and Architecture

Page 63: Scalable, Distributed, Machine Learning for Big Data

Machine Learning

"Machine Learning is programming computers to optimize a performance criterion using example data or past experience"

◦ Supervised model: labeled data;

◦ Unsupervised model: unlabeled data;

◦ Semi-supervised model: both labeled and unlabeled data;

◦ Reinforcement Learning: learn by interacting with an environment.

Types of ML algorithms

◦ Prediction: predicting a variable from data

◦ Classification: assigning records to predefined groups

◦ Clustering: splitting records into groups based on similarity

◦ Association learning: seeing what often appears together with what

Relationship with others

◦ Artificial intelligence: emulate how the brain works with programming; ML is a branch of AI

◦ Data mining: building models in order to detect the patterns;

◦ Statistical analysis: probabilistic models, on which to infer using data;

◦ Information retrieval: retrieval of information from a collection of data (doc).

Page 64: Scalable, Distributed, Machine Learning for Big Data

Some Issues in ML

Training/testing data (70%/30%)

Data unbalanced (one class’ data more than others)

◦ Sampling, learning algorithm modification (cost-sensitive), ensemble,…

“Open set” (how to handle unknown or unfamiliar classes);

Feature extraction

◦ Sparse coding, vector quantization,…

Curse of Dimensionality: Sensitivity to “noise”

◦ Dimension reduction, manifold learning/distance metric learning

Linear or non-linear model

◦ Local/Global minimum (convex/concave obj. function): Learning rate

◦ Regularization: L-1/L-2 norm

◦ Kernel trick: mapping nonlinear feature space to high dim. linear space

Discriminative or generative model

◦ Bottom up (conditional distribution) /Top down (joint distribution)

Over-fitting: Learn the “noise”

◦ Cross validation with grid search

Vanishing gradient and sensitivity of initialization

Performance evaluation

◦ Precision/recall, confusion matrix, ROC (receiver operating characteristic)

Page 65: Scalable, Distributed, Machine Learning for Big Data

"Data Unbalancing" Issue in ML

Resampling methods for balancing the data set:

Over-sampling, under-sampling, importance sampling;

Modification of existing learning algorithms. Cost-sensitive learning;

One class classification;

Classifier ensemble (bagging, boosting, random forest…)

Measuring classifier performance in imbalanced domains: ROC, F-measure, …

Relationship between class imbalance and other data complexity characteristics.

Page 66: Scalable, Distributed, Machine Learning for Big Data

"Open Set" Issue in ML

How to handle unknown or unfamiliar classes?

Label as one of known classes or as unknown;

Zero shot learning/unseen class detection;

Novelty detection with null space methods; One class SVM;

Multiple classes: Artificial super class from all given classes;

Combine several one class classifiers learned separately;

K-nearest neighbors;

Page 67: Scalable, Distributed, Machine Learning for Big Data

“Curse of Dimensionality” in ML

Curse of dimensionality: distributing bins or basis functions uniformly in the input space may work in 1 dimension, but becomes exponentially useless in higher dimensions;

Learning a "state of nature" from a number of samples in a high-dimensional feature space, with each feature having a number of possible values, requires an enormous amount of training data to ensure that there are several samples for each combination of values;

With a fixed number of training samples, the predictive power reduces as the dimensionality increases; this is known as the Hughes effect or Hughes phenomenon;

How to avoid it?
◦ Dimension reduction: PCA, LDA, MDS;
◦ Manifold learning: ISOMAP, LLE, Eigenmap;
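As one concrete remedy, here is a minimal PCA-by-SVD sketch in numpy (illustrative only, with synthetic data): project high-dimensional data onto its top-k principal components and report the variance they explain.

```python
# Minimal PCA via the SVD as a dimension-reduction example (numpy only).
import numpy as np

def pca(X, k):
    """Project n x d data X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # top-k right singular vectors
    explained = (S[:k] ** 2) / (S ** 2).sum()
    return Xc @ components.T, components, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))               # 50-D synthetic data...
X[:, 0] = 3 * X[:, 1] + rng.normal(scale=0.1, size=200)   # ...with one strong linear direction
Z, comps, ratio = pca(X, k=2)
print(Z.shape, ratio)                        # (200, 2) and the variance captured by each PC
```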

Page 68: Scalable, Distributed, Machine Learning for Big Data

“Over-fitting” Issue in ML

A statistical model describes "noise" instead of the underlying relationship;

Over-fitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations;

A model which has been over-fit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data;

How to avoid over-fitting?
◦ Explicitly penalize overly complex models;
◦ Test the model's ability to generalize by evaluating its performance on a set of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter;
◦ Methods: cross-validation, regularization, early stopping, pruning, Bayesian priors on parameters or model comparison;
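A hedged example of "cross validation with grid search" using scikit-learn (assumed installed; the synthetic dataset and parameter grid are illustrative): the 70%/30% split plus 5-fold cross-validation selects hyperparameters that generalize rather than fit the noise.

```python
# Cross-validated grid search to control over-fitting (illustrative scikit-learn sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)  # 70%/30%

# Regularization strength C and kernel width gamma trade off model complexity;
# 5-fold cross-validation picks the combination that generalizes rather than memorizes.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_tr, y_tr)
print("best params:", grid.best_params_)
print("held-out accuracy:", grid.score(X_te, y_te))
```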

Page 69: Scalable, Distributed, Machine Learning for Big Data

Large Scale Machine Learning

Data independent model

◦ Assumes each data instance can be independently computed, such as Hadoop

Data locally dependency model

◦ Assumes many data vertices locally connected with its neighbor vertices, each data vertex updates its own status in parallel according to the status of its connected neighbors vertices, like GraphLab.

Deep learning: learn multiple layers of data representations (features)

◦ Unsupervised pre-training + supervised fine tuning

◦ MLP, CNN, DBN, DBM, SDAE,…;

Online ML: fast and memory‐efficient

◦ Stochastic/incremental gradient descent (SGD)

Ensemble learning: easy to be distributed, scalable

◦ Boosting, bagging, stacking, random forest,…

Open Sources: Mahout, R, WEKA, MLPack, MLBase,…

ML on parallel machines

◦ GPU, cloud or cluster (distributed), multi-core,…

Page 70: Scalable, Distributed, Machine Learning for Big Data

Trade-off in Large Scale ML

Small scale vs Large scale

◦ We have a small-scale learning problem when the active budget constraint is the number of examples 𝑛.

◦ We have a large-scale learning problem when the active budget constraint is the computing time 𝑇.

Statistical Perspective

◦ It is good to optimize an objective function that ensures a fast estimation rate when the number of examples increases.

Optimization Perspective

◦ To efficiently solve large problems, it is preferable to choose an optimization algorithm with strong convergence properties.

Incorrect Conclusion

◦ To address large-scale learning problems, use the best algorithm to optimize an objective function with fast estimation rates

Learning with approximate optimization

◦ Stochastic gradient descent (historically associated with BP)
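A small numpy sketch of SGD for logistic regression (illustrative, not from any of the libraries above): each update touches a single example, so the cost per step does not grow with n and the computing-time budget becomes the active constraint.

```python
# Stochastic gradient descent for logistic regression on a synthetic toy problem.
import numpy as np

def sgd_logistic(X, y, lr=0.1, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):            # stream examples in random order
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))      # predicted probability
            w -= lr * (p - y[i]) * X[i]              # gradient of the log-loss on one example
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)   # a linearly separable toy target
w = sgd_logistic(X, y)
print(w, ((X @ w > 0) == y).mean())                       # learned weights and accuracy
```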

Page 71: Scalable, Distributed, Machine Learning for Big Data

Some Issues in Large Scale ML

Job scheduling

◦ Schedule and monitor “batch” jobs;

Parallel execution

◦ Distributed

◦ SIMD;

Auto Scaling

◦ Scale up (vertical)

◦ Scale out (horizontal)

Monitoring

Fault tolerance

◦ Failover

◦ Recovery

Load balancing

◦ Distribute work load across the cluster

Work flow management

◦ Choreography

◦ Orchestration

Page 72: Scalable, Distributed, Machine Learning for Big Data


Job Scheduling

A job scheduler is a program that enables an enterprise to schedule and, in some cases, monitor computer "batch" jobs (units of work, such as the running of a payroll program).

A job scheduler can initiate and manage jobs automatically by processing prepared job control language statements or through equivalent interaction with a human operator.

Functions:

Avoid starvation;

Maximize throughput;

Minimize response time;

Optimal use of resources.

Hadoop: FIFO, FAIR (Facebook), Capacity, Dynamic Priority Schedulers.

Page 73: Scalable, Distributed, Machine Learning for Big Data


Auto-Scaling

Auto-scaling: scales up/down when the load increases/decreases; the ability to handle an increasing amount of work gracefully;

Vertical scalability (scaling up): maintain performance levels as concurrent requests increase;

Horizontal scalability (scaling out): meet demand through replication, across a pool of servers;

Dimensions

Load. Handling increasing load by adding resources;

Geographic. Maintain performance in the case of geographically distributed systems;

Functional. Adding new features using minimum effort.

Amazon’s Cloud Watch: EC2 (CPU, Disk/Network I/O), ELB.

Page 74: Scalable, Distributed, Machine Learning for Big Data


Load Balancing

Load balancing distributes workload across one or more servers, network interfaces, hard drives, or other computing resources;

A load balancer provides the means by which instances of applications can be provisioned and de-provisioned automatically, without requiring changes to the network or its configuration;

Determine the maximum connection rate that the various solutions are capable of supporting;

Failover: continuation of the service after the failure;

Amazon Elastic Load Balancer (ELB): facilitates distributing incoming traffic among multiple AWS instances (like HAProxy); it spans Availability Zones (AZs) and can distribute traffic to different AZs;

Page 75: Scalable, Distributed, Machine Learning for Big Data


Workflow Management

A workflow is a loosely-coupled parallel application, consisting of a set of computational tasks linked via data/control-flow dependencies;

It defines how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked.

An activity is a discrete step in a business process (workflow);

Activities are orchestrated together in a workflow;

"Service choreography" – description of coordination between two/more parties.

"Service orchestration" – the business process is modeled using workflows.

Amazon Simple Workflow (SWF): task coordination and state management for cloud apps;

Azkaban (from LinkedIn): a workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow.

Page 76: Scalable, Distributed, Machine Learning for Big Data

Apache Spark - 1

Spark: an open source cluster computing system to make data analytics fast, both fast to run and fast to write (up to 100x faster than Hadoop MR), developed at UC Berkeley;

Spark application: A driver program runs the main function and executes various parallel operations on a cluster.

Resilient distributed dataset (RDD): Collection of elements partitioned across nodes of cluster operated in parallel;

◦ RDDs automatically recover from node failures.

◦ Two types of RDD:

Parallelized collections: take an existing Scala collection and run functions on it in parallel;

Hadoop datasets: run functions on each record of a file in HDFS or another storage system supported by Hadoop;

◦ Two types of operations:

Transformations: create a new dataset from an existing one ("distributed reduce" transformations operate on RDDs of key-value pairs);

Actions: return a value to the driver program after computation on the dataset.

Shared Variables: used in parallel operations

◦ broadcast variables, cache a value in memory on all nodes

◦ accumulators, only “added” to, such as counters and sums
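A hedged PySpark sketch of these pieces, assuming a local Spark installation (the data and app name are made up): a parallelized collection, transformations, an action, caching, and the two kinds of shared variables.

```python
# Illustrative PySpark word count with caching and shared variables.
from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-demo")

lines = sc.parallelize(["big data big learning", "spark makes analytics fast"])
counts = (lines.flatMap(lambda line: line.split())        # transformation
               .map(lambda word: (word, 1))               # transformation
               .reduceByKey(lambda a, b: a + b)           # "distributed reduce" on key-value RDD
               .cache())                                   # persist across later actions

print(counts.collect())                                   # action: results return to the driver

stopwords = sc.broadcast({"makes"})                        # broadcast variable, read-only on workers
kept = sc.accumulator(0)                                   # accumulator, only "added" to
def keep(pair):
    if pair[0] not in stopwords.value:
        kept.add(1)
        return True
    return False
print(counts.filter(keep).collect(), "kept =", kept.value)

sc.stop()
```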

Page 77: Scalable, Distributed, Machine Learning for Big Data

Apache Spark - 2

Spark originally written in Scala, added with Java and Python API;

Spark streaming: Large scale stream processing framework.

Spark persists (caches) a dataset in memory across operations

Spark can run at Amazon Elastic MapReduce;

MLlib: implementation of some machine learning functionality, as well as associated tests and data generators;

◦ Binary classification: SVM and Logistic Regression (SGD);

◦ Linear regression: SGD

◦ Clustering: k-means++

◦ Collaborative filtering for recommender systems: Alternating Least Squares

◦ Gradient Descent and Stochastic GD

Bagel: implementation of Google's Pregel

◦ A graph processing framework.

Page 78: Scalable, Distributed, Machine Learning for Big Data

Spark Runtime and Spark Streaming

Page 79: Scalable, Distributed, Machine Learning for Big Data

Mahout - Scalable ML

• Apache Software Foundation Java library;

• Scalable “machine learning“ and “data mining“ library that runs on Apache Hadoop mostly using map/reduce (M/R) paradigm;

• Currently Mahout supports the 3 "C"s + extras use cases:

• Collaborative Filtering for recommendation:
• Non-distributed: Taste

• Distributed: user-based or item-based on Hadoop

• Collaborative filter with matrix factorization

• Classification: Perceptron, RBM, Winnow, Logistic regression, Naïve Bayes, Complementary NB, HMM, boosting, random forest, SVM, …

• Clustering: Canopy, k-means, mean shift, Dirichlet process, spectral, min-hash, hierarchical, LDA (Latent Dirichlet Allocation), EM,…

• Frequent Patten Mining: Parallel FP-Growth (PFP);

• Locally Weighted Linear Regression;

• SVD (singular value decomposition), PCA, ICA,...

• Evolutionary algorithm: genetic algorithm, ....

• Mahout can produce vector representations from a Lucene (Solr) index;

• Mahout can run on Amazon EC2 and Amazon EMR.

Page 80: Scalable, Distributed, Machine Learning for Big Data

Mahout ML Algorithms

(Diagram: Mahout's component stack: applications and examples built on Recommenders, Clustering, Classification, Frequent Pattern Mining, and Genetic algorithms; supported by Math (vectors/matrices/SVD), Utilities (Lucene vectorizer), and Collections (primitives); running on Apache Hadoop.)

Page 81: Scalable, Distributed, Machine Learning for Big Data

Jubatus: real-time ML

Real-time and highly-scalable ML platform from NTT (Japan)

Online learning algorithms in Jubatus:
◦ Linear classification: Perceptron / Passive Aggressive / Confidence Weighted Learning / Soft CWL / AROW / Normal HERD

◦ Regression: PA-based regression

◦ Nearest neighbor: LSH/Min-Hash/Euclid LSH

◦ Recommendation: Based on nearest neighbor

◦ Anomaly detection: LOF based on nearest neighbor

◦ Graph analysis: Shortest path/Centrality (PageRank)

◦ Simple statistics

Why Jubatus?
◦ Online learning requires frequent model updates

◦ A naïve distributed architecture leads to too many synchronization operations

Solution in Jubatus: Loose model sharing

Basic operations in Jubatus only work locally, to achieve real time:
◦ UPDATE: the local model is updated by each input sample, never shared!

◦ ANALYZE: each sample goes to a randomly chosen server and the result goes back to the client;

◦ MIX: send out model difference which is merged and distributed.

Everything in the memory (process the data on the fly)

Page 82: Scalable, Distributed, Machine Learning for Big Data

Jubatus

(Diagram: Jubatus positioning: real-time (online) vs. batch (stored) big data processing, from simple analysis (statistics) to in-depth analysis (classification, estimation, prediction).)

Page 83: Scalable, Distributed, Machine Learning for Big Data

GraphLab in C++

A graph-based, high performance, distributed computation framework from Select Lab, CMU;

A unified multicore and distributed API;

Scalable: data and computation;

Access data directly from HDFS;

Data graph is graph with data associated with every vertex/edge;

Update functions are operations applied to a vertex that transform data in the scope of the vertex

Scheduler determines order of update Function evaluations

◦ Static: Synchronous or round robin schedule;

◦ Dynamic: Update functions insert new tasks into the schedule.

Shared Data table: global constant parameters can be stored;

Sync operation, similar to MR’s “reduce”: accumulate and apply;

Ensures race-free operation;

Guarantees sequential consistency.

Page 84: Scalable, Distributed, Machine Learning for Big Data

GraphLab Model

(Diagram: the GraphLab model: data graph, shared data table, scheduling, and update functions with scopes.)

Page 85: Scalable, Distributed, Machine Learning for Big Data

BSP Model

Bulk Synchronous Parallel model (Leslie Valiant, 1990);

Advantages over MapReduce and MPI:

◦ Supports a message-passing style of application development

◦ Provides flexible, simple, and easy-to-use small APIs

◦ Performs better than MPI for communication-intensive applications

◦ Guarantees impossibility of deadlocks or collisions in the communication mechanisms

A BSP computation proceeds in a series of global supersteps.

A superstep consists of three components:

◦ Concurrent computation: each process computes using local data values;

◦ Communication: processes exchange messages with one another;

◦ Barrier synchronization: every process waits until all others have finished the superstep.

Page 87: Scalable, Distributed, Machine Learning for Big Data

Pregel for Graph Computing

It is a master/slave model for large-scale graph processing;

BSP model-based;

Vertex-centric model: for each vertex

◦ An arbitrary “value” that can be get/set.

◦ List of messages sent to it

◦ List of outgoing edges (edges have a value too)

◦ A binary state (active/inactive)

Combiners:
◦ Sometimes vertices only care about a summary value for the messages;

◦ Combiners allow for this (examples: min, max, sum, avg)

◦ Messages combined locally and remotely

Aggregators:
◦ Compute aggregate statistics from vertex-reported values

◦ During a superstep, each worker aggregates values from its vertices

◦ At the end of a superstep, partially aggregated values in a tree structure

Fault tolerance: save at the checkpoint and recover if necessary

Confined recovery:
◦ The failed worker "catches up" to the rest, and other workers resend messages to it (under development?)
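A toy single-process imitation of the vertex-centric model is sketched below (plain Python, not Pregel's API; the graph is invented): in each superstep every vertex combines its incoming messages, updates its value (PageRank here), and sends messages along its outgoing edges.

```python
# Vertex-centric BSP-style PageRank over a tiny hand-built graph.
def pregel_pagerank(out_edges, supersteps=20, damping=0.85):
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}                # per-vertex mutable "value"
    for _ in range(supersteps):                            # one global superstep per iteration
        # message phase: each vertex sends value/out_degree along its outgoing edges
        inbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            for t in targets:
                inbox[t].append(value[v] / len(targets))
        # compute phase: combine incoming messages (a sum combiner could do this remotely)
        for v in out_edges:
            value[v] = (1 - damping) / n + damping * sum(inbox[v])
    return value

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pregel_pagerank(graph)
print({v: round(r, 3) for v, r in sorted(ranks.items())})
```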

Page 88: Scalable, Distributed, Machine Learning for Big Data

Apache Giraph

Developed at Yahoo! and used by Facebook (an open-source counterpart of Pregel);

Reuses Hadoop, running as a Map-Reduce job;

Focuses on graph-based bulk synchronous parallel (BSP) computing;

ZooKeeper: responsible for computation state;
◦ partition/worker mapping

◦ checkpoint paths, aggregator values, statistics

Master: responsible for coordination

◦ assigns partitions to workers

◦ coordinates synchronization

◦ requests checkpoints

◦ aggregates aggregator values

◦ collects health statuses

Worker: responsible for vertices

◦ invokes active vertices compute() function

◦ sends, receives and assigns messages

◦ computes local aggregation values

Page 89: Scalable, Distributed, Machine Learning for Big Data

Apache Hama

General-purpose BSP computing, not only for graph processing;

◦ Job management / monitoring

◦ Checkpoint recovery

◦ Pluggable message transfer architecture

Written In Java;

Local & (Pseudo) Distributed run modes

MapReduce like I/O API

Supports running in the cloud using Apache Whirr;

Supports running with Hadoop 2.0 (YARN);

Graph API

Applications besides graph processing include machine learning:

Collaborative filtering

Clustering: k-means

Gradient descent for training classifiers

Page 91: Scalable, Distributed, Machine Learning for Big Data

Vowpal Wabbit @ Yahoo/MS

Scalable, fast, efficient linear ML engine written in C/C++

Hadoop compatible AllReduce (not MPI style)

◦ “Map” job moves program to data;

◦ Read (and cache) all data, before initializing AllReduce;

◦ Use map-only Hadoop for process control and error recovery.

◦ Use AllReduce code to sync state;

◦ Always save input examples in a cache file to speed later passes;

◦ Use hashing trick to reduce input complexity.

Algorithms in VW 7.0

◦ Binary classification and regression

◦ Multiclass classification, NN, active learning; reductions: One Against All, Cost-Sensitive OAA, or Sequence Prediction

◦ Latent Dirichlet Allocation, and matrix factorization

◦ Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS);

◦ Stochastic Gradient Descent (SGD) and CG;

Online learning/Active learning: no need to load all data into memory

◦ Dimension correction

Feature caching (adaptive learning)

Feature hashing (importance update)
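The hashing trick mentioned above can be shown in a few lines of plain Python (illustrative only, not VW's implementation; the feature names are made up): raw feature strings are hashed directly into a fixed-size weight space, so no feature dictionary needs to be built.

```python
# The hashing trick: sparse named features -> indices in a fixed-size weight vector.
import hashlib

NUM_BITS = 18                                   # 2^18 weights (vw's default table size is 18 bits)
DIM = 1 << NUM_BITS

def hash_features(pairs):
    """Map a sparse {feature_name: value} input to {index: value} in a fixed space."""
    x = {}
    for name, value in pairs.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        idx = h % DIM                           # collisions are allowed and rarely hurt much
        x[idx] = x.get(idx, 0.0) + value
    return x

def predict(weights, x):
    return sum(weights.get(i, 0.0) * v for i, v in x.items())

example = {"text^machine": 1.0, "text^learning": 1.0, "meta^length": 0.7}
x = hash_features(example)
print(len(x), "hashed features in a", DIM, "dimensional space")
print("score:", predict({}, x))                 # an all-zero model scores 0.0
```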

Page 92: Scalable, Distributed, Machine Learning for Big Data

AllReduce in VW

Every node begins with a number (vector)

Every node ends up with the sum;

Ideal to sum local gradients, weights;

Creates a spanning tree over the nodes;

At each iteration

Nodes compute gradient over local data;

AllReduce computes the gradient over the entire data (each node receives a subset of the data);
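A toy AllReduce over a spanning tree, in plain Python and a single process (illustrative only; the tree and values are made up): partial sums flow up to the root and the total is broadcast back down, which is the pattern used to combine local gradients.

```python
# Tree-based AllReduce: reduce (sum) up the spanning tree, then broadcast the total down.
tree = {"root": ["n1", "n2"], "n1": ["n3", "n4"], "n2": [], "n3": [], "n4": []}
local = {"root": 1.0, "n1": 2.0, "n2": 3.0, "n3": 4.0, "n4": 5.0}   # e.g. local gradients

def reduce_up(node):
    """Each node adds its children's partial sums to its own local value."""
    return local[node] + sum(reduce_up(child) for child in tree[node])

def broadcast_down(node, total, result):
    result[node] = total
    for child in tree[node]:
        broadcast_down(child, total, result)
    return result

total = reduce_up("root")
print(broadcast_down("root", total, {}))   # every node now holds the same sum, 15.0
```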

Page 93: Scalable, Distributed, Machine Learning for Big Data

Trident-ML

A real-time online ML library using scalable online algorithms;

Built on top of Storm, running on a cluster of machines and supports horizontal scaling;

Based on Trident, a high-level abstraction of Storm;

Trident-ML currently supports:

◦ Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW, KLD for text)

◦ Linear regression (Perceptron, Passive-Aggressive)

◦ Clustering (KMeans)

◦ Feature scaling (standardization, normalization)

◦ Text feature extraction

◦ Stream statistics (mean, variance)

◦ Pre-Trained Twitter sentiment classifier

Trident-ML is hosted on Clojars (a Maven repository).

Trident-ML processes unbounded streams of data, represented as an infinite collection of Instance or TextInstance objects.

Page 94: Scalable, Distributed, Machine Learning for Big Data

Storm-Pattern

Based on Cascading's sub-project, Pattern, which uses flows as containers for ML models;

◦ Cascading is a de-facto Java application framework that enables easy large-scale development of rich data analytics and data management applications with Apache Hadoop and its API.

Importing PMML (Predictive Model Markup Language) model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.

◦ PMML is an XML-based file format developed by the Data Mining Group;

Working in tandem with the Lingual JDBC driver, modeling tools can pull data directly off Hadoop to train or test model quality;

◦ Lingual executes SQL queries as Cascading applications on Hadoop clusters.

Current support for PMML includes:

◦ Random Forest in PMML 4.0+ exported from R/Rattle;

◦ Linear Regression in PMML 1.1+;

◦ Hierarchical Clustering and K-Means Clustering in PMML 2.0+;

◦ Logistic Regression in PMML 4.0.1+.

Page 95: Scalable, Distributed, Machine Learning for Big Data

Samoa: Big Data Mining

Scalable Advanced Massive Online Analysis;

◦ A platform for stream data mining on S4/Storm from Yahoo, to be open source soon;

Short term algorithms:

◦ Hoeffding tree, K-means, Gradient boosted decision tree;

Long term algorithms:

◦ Integrate with add-ons packages (like R);

◦ Implement most common ML methods (like Mahout).

Algorithmic implementation: horizontal/vertical data parallelism;

Platform design: ML Developer API;

Deployment and runtime.

Page 96: Scalable, Distributed, Machine Learning for Big Data

Samoa Runtime

Page 97: Scalable, Distributed, Machine Learning for Big Data

Reference

N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications Co., Shelter Island, NY, 2013.

J. Dean and S. Ghemawat. “MapReduce: Simplified data processing on large clusters”. Communications of the ACM, 51(1), 2008

F Chang, et al. "Bigtable: A Distributed Storage System for Structured Data", 7th Symp. on Operating System Design & Implementation. 2006.

G. DeCandia et al., "Dynamo: Amazon's highly available key-value store". 21st ACM SIGOPS Symposium on Operating Systems Principles, pages 205-220. ACM, 2007.

J. Lin and D. Ryaboy. “Scaling big data mining infrastructure: The twitter experience”. SIGKDD Explorations,14(2), 2012.

R. Bekkerman, M. Bilenko and J. Langford, "Scaling Up Machine Learning", Tutorial, KDD 2011.

J. Gray and D. Borthakur, "Realtime Apache Hadoop at Facebook", SIGMOD, 2011.

L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. “S4: Distributed Stream Computing Platform”. ICDM Workshops, 2010.

S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, Manning Publications Co., Shelter Island, NY, 2012.

Hadoop, http://hadoop.apache.org.

Storm, http://storm-project.net.

S4, http://incubator.apache.org/s4/.

Zookeeper, http://zookeeper.apache.org/.

HBase, http://hbase.apache.org/.

Spark, http://spark.incubator.apache.org/

Mahout, http://mahout.apache.org/

Jubatus, http://jubat.us/en/overview.html.

Graphlab, http://graphlab.org/home/

Vowpal Wabbit, http://hunch.net/~vw/

Samza, http://samza.incubator.apache.org/

Samoa, http://samoa-project.net/

Page 98: Scalable, Distributed, Machine Learning for Big Data

Thanks!