
Page 1: Big Data Update

Big Data Update 2016

by H. Michael Covert and Victoria Loewengart

January 22, 2016

Proprietary and Confidential

Page 2: Big Data Update

Agenda

• What does Big Data mean in 2016

• Big Data Trends in 2016

• Real Time Analytics

• Mainstream Adoption Use Cases

• Machine Learning in 2016

• Looking ahead


Page 3: Big Data Update

Overview

• It’s no longer acceptable to equate big data and Hadoop

• Big Data means:

– Hadoop

– Spark

– Kafka

– Other technologies in their ecosystem

– Non-Apache technologies with similar volume, velocity, and variety

• Expect updates in machine learning, real-time data-as-a-service, algorithm markets, Spark, and more.


Page 4: Big Data Update

Big Data Trends for 2016

• The importance of data will continue to be elevated to create incremental improvement in business performance.

• Insight will become a key competitive weapon, as firms move beyond big data and solve problems with data-driven thinking.

• Chief data officers will gain power, prestige, and presence


http://blogs.forrester.com/brian_hopkins/15-11-09-forresters_2016_predictions_turn_data_into_insight_and_action

Page 5: Big Data Update

Big Data Trends for 2016 –Tools and Features

• There will be more tools and features that expose data directly to the people who use it – business users. For example:

– Microsoft PowerApps (https://powerapps.microsoft.com/en-us/)

– CloudBI by RJMetrics (https://rjmetrics.com/product/cloudbi)

– Orange (http://orange.biolab.si/)

– Birst, Btime, Adaptive Insights, and others

– New “data blending” tools like Alteryx, Datawatch, Domo, and others (note that Tableau and Qlik are also such toolsets)


Page 6: Big Data Update

Big Data Trends for 2016 - Embedding the Analytics

– Prescriptive analytics
• IDC predicts that by 2020 half of all business analytics software will incorporate prescriptive analytics built on cognitive computing functionality.

– Autonomous agents and things
• Gartner names “autonomous agents and things” as one of the up-and-coming trends, already marked by the arrival of robots, autonomous vehicles, virtual personal assistants, and smart advisers.

– Advanced machine learning
• According to Gartner, advanced machine learning and deep neural nets (DNNs) move beyond classic computing and information management to create systems that can autonomously learn to perceive the world on their own. The explosion of data sources and the complexity of information make manual classification and analysis infeasible and uneconomic. DNNs automate these tasks and make it possible to address key challenges related to the “information of everything” trend.


https://www.idc.com/research/viewtoc.jsp?containerId=259835
http://www.gartner.com/newsroom/id/3143521

Page 7: Big Data Update

Big Data Trends for 2016 – Jobs!

• Talent shortage is not over!

– According to Gartner, 220,000 Big Data jobs will need to be filled in 2016.

– The shortage will extend from data scientists to data architects and experts in data management (IDC)

• Companies will use different strategies to acquire talent:

– Rely on new university programs

– Internal and external training


http://iianalytics.com/analytics-resources/2016-predictions
https://www.idc.com/research/viewtoc.jsp?containerId=259835

Page 8: Big Data Update

Big Data Jobs


Page 9: Big Data Update

Big Data Jobs


Source: O’Reilly Media’s 2015 Annual Report

Page 10: Big Data Update

Data Science


(Diagram: the mix of skills that makes up data science, including IT skills.)

Page 11: Big Data Update

Big Data Trends for 2016 – Real Time Streaming

• Real-time streaming ingestion of data and analytics will become a must-have in 2016

– The window for turning data into action is narrowing

– The next 12 months will be about distributed, open source streaming alternatives built on open source projects like Kafka and Spark (Forrester)

– Spark is growing quickly!


http://www.informationweek.com/big-data/big-data-analytics/big-data-predictions-for-2016/d/d-id/1323671

Page 12: Big Data Update

Real Time Analytics

• Major performance improvements – big data analytics has moved well beyond its batch roots and now delivers real-time performance.

• The emergence of in-memory computing has taken hold, and a new paradigm, the Lambda Architecture, has emerged as the primary means of delivering real-time analytics to the enterprise.


Nathan Marz

Page 13: Big Data Update

Real Time Analytics

• Big data analytics has now moved beyond batch

• Some have been “retrofitted” – Hive and Pig

• New technology has also arrived:

– YARN

– Tez

– Spark

– Storm

– Kafka

– Impala

• A reference architecture stack has now been defined in Hadoop version 2


Page 14: Big Data Update

Real Time Analytics

• Tez – a new architecture designed to create more flexible execution graphs and to minimize data exchanges (and writes to disk)

• It can be retrofitted into some existing systems


Page 15: Big Data Update

Real Time Analytics

• Storm is a Complex Event Processing (CEP) system designed to handle high-velocity streamed data

– It was invented by Twitter but is Apache open source

– It gets its data from “spouts” and processes it in “bolts”

– Each bolt is a programmable unit which can transform, store, apply rules, or apply machine learning in order to detect patterns in real time.

– Storm runs on top of YARN
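
To make the spout/bolt idea concrete, here is a plain-Python analogy. It is only a conceptual sketch: Storm's real spout and bolt APIs are Java classes wired into a topology, and every name below (sensor_spout, threshold_bolt, the field names) is hypothetical.

```python
# Conceptual sketch only: Storm itself is a JVM system and its spout/bolt API is
# Java-based; this plain-Python analogy merely illustrates the spout -> bolt flow
# described above. All names are hypothetical.
import itertools
import random

def sensor_spout():
    """A 'spout' emits an unbounded stream of tuples (here, synthetic readings)."""
    for i in itertools.count():
        yield {"sensor_id": i % 3, "temperature": 20 + random.gauss(0, 5)}

def threshold_bolt(stream, limit=30.0):
    """A 'bolt' transforms or filters each tuple; this one flags high readings."""
    for reading in stream:
        if reading["temperature"] > limit:
            yield {**reading, "alert": True}

# Wire the "topology" together and pull five alerts off the stream.
for alert in itertools.islice(threshold_bolt(sensor_spout()), 5):
    print(alert)
```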


Page 16: Big Data Update

Real Time Analytics

• Kafka is an extremely high-speed, scalable message bus

– Designed for scalability and the ability to absorb massive flows of messages
• Elastic linear scalability

– Developed originally at LinkedIn but open sourced in 2011
• Written in Scala

• In 2014, engineers from LinkedIn formed Confluent, which supports Kafka
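
As a rough illustration of the publish/consume model, here is a minimal sketch using the third-party kafka-python client; the broker address and the "clickstream" topic name are assumptions made for the example, not details from the deck.

```python
# A minimal sketch using the third-party kafka-python client (not part of Kafka
# itself). The broker address and topic name below are illustrative assumptions.
from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages onto a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("clickstream", f"event-{i}".encode("utf-8"))
producer.flush()  # block until the messages are actually written

# Read them back; Kafka retains messages, so consumers can replay from the start.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.offset, message.value)
```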


Page 17: Big Data Update

Real Time Analytics

• Spark is a new architecture that is designed for real-time analytics

– It also uses a persistent server structure

– It utilizes in-memory datasets called Resilient Distributed Datasets (RDDs)

– It provides a graph database component called GraphX

– It provides a machine learning component called MLlib


Page 18: Big Data Update

Real Time Analytics

• RDDs are created, transformed into other RDDs, and then actions can be performed on them to yield results.

– RDDs are immutable! And transformations are “lazy!”
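
A minimal PySpark sketch of this create/transform/act cycle might look like the following; the input path is hypothetical, and the point is that transformations stay lazy until an action forces computation.

```python
# A minimal PySpark sketch of the RDD model described above; the input path is
# hypothetical. Transformations (filter, map) are lazy and produce new RDDs;
# nothing executes until an action (count, take) is called.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("hdfs:///data/events.log")        # base RDD
errors = lines.filter(lambda line: "ERROR" in line)   # transformation (lazy)
lengths = errors.map(len)                             # another transformation (lazy)

print(errors.count())   # action: triggers the actual computation
print(lengths.take(5))  # action: pulls a few results back to the driver

sc.stop()
```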


Page 19: Big Data Update

Real Time Analytics

• Spark uses an innovative in-memory technology called the Resilient Distributed Dataset (RDD), originally designed to handle complex machine learning tasks


Page 20: Big Data Update

Real Time Analytics

• Kudu is a new data storage layer created by Cloudera designed to bridge some of the gaps that exist with Hadoop Distributed File System (HDFS).

– Unlike HDFS, it is not immutable. It allows inserts, updates, and deletes.

– It is optimized for rapidly changing schemas, data streaming, and analytic functionality

– It is also distributed, but can be replicated across data centers, allowing for disaster recovery

– It is columnar and makes heavy usage of in-memory and caching

– It is not yet production ready!


Page 21: Big Data Update

Big Data Trends for 2016 – New Business Models

• Data-as-a-service business model is coming

– The window for turning data into action is narrowing

– The next 12 months will be about distributed, open source streaming alternatives built on open source projects like Kafka and Spark (Forrester)


http://www.informationweek.com/big-data/big-data-analytics/big-data-predictions-for-2016/d/d-id/1323671?image_number=8

Page 22: Big Data Update


Page 23: Big Data Update

Innovative Use Cases - Energy


Fracking alone has produced hundreds of petabytes of data in the form of images, seismic recordings, videos, sound recordings, and field notes. The data is vast in its volume and variety.

Page 24: Big Data Update

Innovative Use Cases – Insurance


Page 25: Big Data Update

Innovative Use Cases – Health Care

Pathway / Benefit / Hadoop Use Case

• Right Living

– Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.

– Hadoop Use Case (Predictive Analytics): Patients enter data at home manually or with biometric device data. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

• Right Care

– Benefit: Patients get the most timely, appropriate treatment available.

– Hadoop Use Case (Real-time Monitoring): Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, or if more complex patterns are detected, staff can be notified to attend to the patient immediately.

• Right Provider

– Benefit: Provider skill sets are matched to the complexity of the assignment – for instance, nurses or physicians’ assistants performing tasks that do not require a doctor. Also the specific selection of the provider with the best outcomes.

– Hadoop Use Case (Historical EMR Analysis): Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.

• Right Value

– Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.

– Hadoop Use Case (Medical Device Management): For biomedical device maintenance, use geo-location and sensor data to manage medical equipment. The biomedical team knows where all the equipment is, so they don’t waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

• Right Innovation

– Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system. Also improving the innovation engines themselves.

– Hadoop Use Case (Research Cohort Selection): Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Internal Review Board for approval, without ever having seen uniquely identifiable information.


From McKinsey & Company, “The ‘Big Data’ Revolution in Healthcare”

Page 26: Big Data Update

Innovative Use Cases – Health Care

• Biomedical Big Data is being driven by multiple areas and disciplines:

– Genomics (genotyping, gene expression, sequencing)
• 4 TB per person, now at the rate of thousands of people per month!
• The computational complexity here is very high

– Molecular pathology and predictive analytics

– Survivor analytics and pattern detection

– Predictive technologies are very important
• Disease propensity analysis

• Real-time population health and tele-medicine become tightly connected


Page 27: Big Data Update

Big Data Trends for 2016 – Machine Learning Gains Momentum!

• Machine learning has become a “checklist item” for data preparation and predictive analytics (Ovum)

• It is a top strategic trend for 2016 (Gartner)

• Machine learning will replace manual data wrangling and data governance dirty work. The time freed up will accelerate data strategies.


http://www.ovum.com/press_releases/ovum-reveals-the-reality-of-big-data-for-2016-cloud-and-appliances-will-drive-the-next-wave-of-adoption-with-spark-the-fastest-growing-workload/

http://www.gartner.com/newsroom/id/3143521

Page 28: Big Data Update

What do we mean by “Learn?”


• Stimulus-response learning. Learning to give a precise response to a given stimulus.
– Signal learning. Learning to give a precise response to a signal.

• Chaining. Learning to give a precise response to two or more stimuli.
– Verbal association. A form of chaining that is special because of the tight linkage between language and thought in human beings.

• Multiple discrimination. Learning to classify responses into categories based on multiple stimuli.

• Concept learning. Learning to give a precise response to a class of stimuli even though the individual members of that class may differ widely from each other.

• Principle learning. Learning to give a precise response to a principle (a chain of two or more concepts). It functions to organize higher-order “behavior.”

• Problem solving. Previously acquired concept and principle learning are combined to classify an unresolved or ambiguous set of events.

Robert Gagné (1965)

Page 29: Big Data Update

How do Machines “Learn?”


• Machine learning consists of:

– Observing the world for a while (data)

– Developing a model that classifies these observations
• Detecting patterns that exist in the observations

– Predicting the class of a new, unseen observation based on this model

– If correctness can be determined, modify (strengthen) the model with this information.

Robert Gagné (1965)

Page 30: Big Data Update

Machine Learning Workflow


(Flowchart: a machine learning workflow organized into three phases – Design, Use, and Strengthen and learn. Its steps include defining desired targets, analyzing the available data (does the data support the desired results?), selecting training sets, uploading training data, training the model and analyzing the results, checking whether the model meets acceptance criteria, building and uploading the model, placing the model into production, uploading batch data sets or classifying single records, writing results to a database and querying for classification summaries and results, creating a training record set from classified records, retraining the model and analyzing the results, and promoting the retrained model into production once it meets acceptance criteria.)

CRISP-DM is a very good framework, much like this one.
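
To ground the train / check-acceptance-criteria / retrain loop in code, here is a schematic scikit-learn sketch; the dataset, model choice, and acceptance threshold are illustrative assumptions rather than anything prescribed by the workflow above.

```python
# A schematic sketch of the train / check acceptance criteria / retrain loop,
# using scikit-learn; the dataset, model, and threshold are illustrative guesses.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ACCEPTANCE_THRESHOLD = 0.95       # "does the model meet acceptance criteria?"

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)       # train the model and analyze results
score = model.score(X_test, y_test)

if score >= ACCEPTANCE_THRESHOLD:
    print(f"accuracy {score:.3f}: promote the trained model into production")
else:
    # strengthen and learn: adjust parameters or add training data, then retrain
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print(f"retrained; new accuracy {model.score(X_test, y_test):.3f}")
```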

Page 31: Big Data Update

A Little Machine Learning Theory Now


• Machine Learning is VERY complex. It requires an understanding of:

– Linear algebra and the usage of vectors and matrices

– Statistics and probability

– Calculus and derivatives

– And the more data, the better the model – Big Data!

Page 32: Big Data Update

Machine Learning Algorithms


• Supervised Learning:
– The machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
– Includes algorithms such as Linear Regression, Logistic Regression, Decision Trees, Random Forests, etc.

• Unsupervised Learning:
– The machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.
– Includes algorithms such as k-means, Apriori, etc.

• Reinforcement Learning:
– Machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
– Includes algorithms such as Markov decision processes.

http://www.analyticsvidhya.com/blog/2015/12/10-machine-learning-algorithms-explained-army-soldier/
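
A small sketch contrasting the first two categories, using scikit-learn (the library choice and the toy data are assumptions; the deck does not prescribe an implementation):

```python
# Supervised vs. unsupervised learning in a few lines of scikit-learn;
# the data points below are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: labeled examples -> learn a function from inputs to labels.
X = np.array([[1.0, 1.2], [0.9, 1.0], [3.1, 3.0], [3.3, 2.9]])
y = np.array([0, 0, 1, 1])                      # the "supervisory signal"
clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.0, 3.2]]))                # -> array([1])

# Unsupervised: no labels -> discover structure (here, two clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                               # cluster assignment per row
```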

Page 33: Big Data Update

Supervised Learning


• We use a “linear discriminant” to make predictions on new data that is given to us.

What color should this green data point be?
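
The slide's question can be phrased directly in code with a linear discriminant; the toy points below and the use of scikit-learn's LinearDiscriminantAnalysis are assumptions for illustration.

```python
# A toy version of the slide's question: which side of the learned linear
# discriminant does the unlabeled ("green") point fall on? Points are made up.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two labeled groups of 2-D points: class 0 ("red") and class 1 ("blue").
X = np.array([[1, 2], [2, 1], [1.5, 1.8], [6, 7], [7, 6], [6.5, 6.8]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)

# The new, unlabeled point: the model predicts the class on its side of the boundary.
print(lda.predict([[5.5, 6.0]]))   # -> array([1])
```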

Page 34: Big Data Update

Unsupervised Learning

● K-Means algorithm attempts to cluster things that “look alike”


Page 35: Big Data Update

Reinforcement Learning

● A hidden Markov model (HMM) combines probabilities over what we can observe with hidden states that remain unknown to us

● How a doctor might view observations about a patient


(Diagram: hidden states such as “Sick” connected by transition probabilities; observed symptoms are used to predict the hidden state.)
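
A tiny NumPy version of the doctor/patient idea, using the forward algorithm to keep a running belief over the hidden state; all probabilities below are made-up illustrative values, not numbers from the deck.

```python
# A minimal hidden Markov model sketch in NumPy; the probabilities are invented
# in the spirit of the slide's doctor/patient example.
import numpy as np

states = ["Healthy", "Sick"]                   # hidden states
obs_names = ["normal", "cold", "dizzy"]        # what the doctor can observe

start = np.array([0.6, 0.4])                   # P(initial state)
trans = np.array([[0.7, 0.3],                  # P(next state | current state)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],              # P(observation | Healthy)
                 [0.1, 0.3, 0.6]])             # P(observation | Sick)

observed = ["normal", "cold", "dizzy"]         # a sequence of visits

# Forward algorithm: update the belief over the hidden state with each observation.
alpha = start * emit[:, obs_names.index(observed[0])]
for symptom in observed[1:]:
    alpha = (alpha @ trans) * emit[:, obs_names.index(symptom)]

print("P(observation sequence) =", alpha.sum())
print("Most likely current state:", states[int(alpha.argmax())])
```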

Page 36: Big Data Update

Deep Learning

• Geoffrey Hinton – University of Toronto, now Google (along with Terrence J. Sejnowski)

• Research in facial recognition led to a breakthrough

• Uses specialized neural networks called “Restricted Boltzmann Machines” that are “stacked” in a feed-forward fashion

• A very basic form of a neural network that can be optimized for training

• Can learn “significant” features of a set of data

• Can “throw away” insignificant features (noise or features with little contribution to the model)
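
One hedged way to experiment with the “RBM as a feature learner” idea is scikit-learn's BernoulliRBM feeding a simple classifier; the dataset and hyperparameters below are illustrative guesses, and a full deep belief network would stack several RBM layers rather than the single layer shown here.

```python
# A sketch of an RBM learning features that a simple classifier then uses;
# dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0                      # RBMs expect inputs scaled to [0, 1]

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=15, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)                   # the RBM learns features; the classifier uses them
print("training accuracy:", model.score(X, y))
```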


Page 37: Big Data Update

Mahout

• Mahout is:

– the keeper, trainer, and driver of the elephant.

– a Hadoop-specific set of algorithms designed to conduct machine learning against large sets of data.

– a large Java product set consisting of many algorithms, algorithm families, and useful data processing utilities designed to simplify various machine learning tasks.

– programmable at the Java level, but very approachable through a series of command line utilities as well.

– most useful as a set of Java libraries used by one or more Java programs.


Page 38: Big Data Update

Spark MLlib

• MLlib is a set of machine learning algorithms that uses Spark as its underlying fabric
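
For example, clustering with the RDD-based MLlib API (roughly as it existed around Spark 1.x) looks like the sketch below; the toy points are assumptions for illustration.

```python
# A minimal MLlib sketch: k-means on an RDD of toy points (points are made up).
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-sketch")

data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(data, k=2, maxIterations=10)   # distributed training on the RDD

print(model.clusterCenters)        # learned cluster centers
print(model.predict([9.0, 9.0]))   # cluster assignment for a new point

sc.stop()
```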


Page 39: Big Data Update

Python

• Python is a dynamic programming language designed to do a lot of work in a relatively small amount of code.

• It is often used as a scripting language, but it is much, much more than that.

– It has an enormous code base with many libraries.

– The Natural Language Toolkit (NLTK) is a great NLP package.

– There are many Machine Learning libraries as well.

• It is easy to create MapReduce jobs in Python using the Hadoop streaming interface (a minimal mapper sketch follows this list).

• Python is also a core language of Spark
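
As a sketch of the Hadoop streaming point above: the mapper is just a script that reads lines from stdin and emits tab-separated key/value pairs on stdout; Hadoop streaming handles the shuffling between mapper and reducer. The matching reducer and the job submission command are omitted here.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch (word count): Hadoop pipes each input
# line to this script on stdin and collects "word<TAB>1" pairs from stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```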


Page 40: Big Data Update

R

• R is a statistical language based on the “S” language by John Chambers from Bell Labs. It is roughly based on Scheme, a LISP derivative.

– Like Python, it is dynamic, and is a functional programming language.

– Its emphasis is on statistical programming, and as a result it has been a favorite of financial services and “quants.”

– Also like Python, it can be integrated into Hadoop using the streaming interface, but R is not inherently a parallel processing language!

• Now we finally have RHadoop, in which many of the R algorithms can run in parallel on a Hadoop cluster.

• And we have SparkR!


Page 41: Big Data Update

What is next?


• New big data computing is being done on GPUs

– Facebook open-sourced its “Big Sur” AI hardware design

– NVIDIA Tesla GPUs

• Algorithms rule!

– Algorithms are being commoditized and also sold as a service.

• Data warehouse market growth has slowed and NoSQL database growth is increasing (Gartner, 2015)

– Volume and variety of IoT devices will help drive this trend

– Biomed and personal devices – telemedicine and self-serve

• Over 50% of IT projects in 2016 will choose open source.

Page 42: Big Data Update

Questions and Answers

[email protected]@AnalyticsInside.us

http://www.AnalyticsInside.us
