Big Data Update 2016
by H. Michael Covert and Victoria Loewengart
January 22, 2016
Proprietary and Confidential
Agenda
• What does Big Data mean in 2016
• Big Data Trends in 2016
• Real Time Analytics
• Mainstream Adoption Use cases
• Machine Learning in 2016
• Looking ahead
Overview
• It’s no longer acceptable to equate big data and Hadoop
• Big Data means:
– Hadoop
– Spark
– Kafka
– Other technologies in their ecosystem
– Non-Apache technologies with similar volume, velocity, and variety
• Expect updates in machine learning, real-time data-as-a-service, algorithm markets, Spark, and more.
Big Data Trends for 2016
• The importance of data will continue to be elevated to create incremental improvement in business performance.
• Insight will become a key competitive weapon, as firms move beyond big data and solve problems with data driven thinking.
• Chief data officers will gain power, prestige, and presence
http://blogs.forrester.com/brian_hopkins/15-11-09-forresters_2016_predictions_turn_data_into_insight_and_action
Big Data Trends for 2016 –Tools and Features
• There will be more tools and features that expose data directly to the people who use it: business users. Such as:
– Microsoft PowerApps (https://powerapps.microsoft.com/en-us/)
– CloudBI by RJMetrics (https://rjmetrics.com/product/cloudbi)
– Orange (http://orange.biolab.si/)
– Birst, Btime, Adaptive Insight….
– New “data blending” tools like Alteryx, Datawatch, Domo, and others (Note that Tableau and Qlik are also such toolsets)
Big Data Trends for 2016 - Embedding the Analytics
– Prescriptive analytics:
• IDC predicts that by 2020 half of all business analytics software will incorporate prescriptive analytics built on cognitive computing functionality.
– Autonomous agents and things
• Gartner lists “autonomous agents and things” as one of the up-and-coming trends, one that is already marking the arrival of robots, autonomous vehicles, virtual personal assistants, and smart advisers.
– Advanced Machine Learning
• According to Gartner, in advanced machine learning, deep neural nets (DNNs) move beyond classic computing and information management to create systems that can autonomously learn to perceive the world on their own. The explosion of data sources and complexity of information makes manual classification and analysis infeasible and uneconomic. DNNs automate these tasks and make it possible to address key challenges related to the information of everything trend.
https://www.idc.com/research/viewtoc.jsp?containerId=259835
http://www.gartner.com/newsroom/id/3143521
Big Data Trends for 2016 – Jobs!
• Talent shortage is not over!
– According to Gartner, 220,000 Big Data jobs will need to be filled in 2016.
– The shortage will extend from data scientists to data architects and experts in data management (IDC)
• The companies will use different strategies to acquire talent:
– Rely on new university programs
– Internal and external training
http://iianalytics.com/analytics-resources/2016-predictions
https://www.idc.com/research/viewtoc.jsp?containerId=259835
Big Data Jobs
Source: O’Reilly Media’s 2015 Annual Report
Data Science
© 2015 Analytics Inside, LLC. All Rights Reserved.
IT Skills
Big Data Trends for 2016 – Real Time Streaming
• Real-time streaming ingestion of data and analytics will become a must-have in 2016
– The window for turning data into action is narrowing
– The next 12 months will be about distributed, open source streaming alternatives built on open source projects like Kafka and Spark (Forrester)
– Spark is growing quickly!
http://www.informationweek.com/big-data/big-data-analytics/big-data-predictions-for-2016/d/d-id/1323671
Real Time Analytics
• Major performance improvements: big data analytics has moved well beyond its batch roots and now delivers real-time performance.
• In-memory computing has taken hold, and a new paradigm, the Lambda Architecture, has emerged as the primary means of delivering real-time analytics to the enterprise.
Nathan Marz
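As a rough illustration of the Lambda Architecture idea, the sketch below keeps an append-only master dataset, a batch view that is recomputed from scratch, and a speed view that covers events since the last batch run; queries merge the two. The class and method names (`LambdaCounter`, `ingest`, `run_batch`, `query`) are hypothetical and chosen for this example only. A real deployment would spread these layers over systems such as Hadoop, Storm, or Spark.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter with a batch view and a speed (real-time) view."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # recomputed from scratch by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, page):
        self.master.append(page)     # every event lands in the master dataset
        self.speed_view[page] += 1   # and is visible immediately via the speed view

    def run_batch(self):
        self.batch_view = Counter(self.master)  # full recomputation over all data
        self.speed_view.clear()                 # the batch view now covers these events

    def query(self, page):
        # Serving step: merge the older batch view with the fresh speed view.
        return self.batch_view[page] + self.speed_view[page]


views = LambdaCounter()
for page in ["home", "home", "pricing"]:
    views.ingest(page)
views.run_batch()
views.ingest("home")  # arrives after the batch run, so only the speed view has it
print(views.query("home"), views.batch_view["home"])
```

The point of the design is that the batch layer can stay simple and slow, because the speed layer only has to cover the short gap since the last batch run.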
Real Time Analytics
• Big data analytics has now moved beyond batch
• Some have been “retrofitted”
– Hive and Pig
• New technology has also arrived
– YARN
– Tez
– Spark
– Storm
– Kafka
– Impala
• A reference architecture stack has now been defined in Hadoop version 2
Real Time Analytics
• Tez – a new architecture designed to create more flexible execution graphs and to minimize data exchanges (and writes to disk)
• It can be retrofitted into some existing systems
Real Time Analytics
• Storm is a Complex Event Processing (CEP) system designed to handle high velocity streamed data
– It was created at BackType, open sourced by Twitter, and is now an Apache open source project
– It gets its data from “spouts” and processes it in “bolts”
– Each bolt is a programmable unit which can transform, store, apply rules or machine learning in order to detect patterns in real-time.
– Storm can run on top of YARN
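The spout-and-bolt idea can be sketched with plain Python generators. This is not the Storm API; the function names, the `sensor,value` record format, and the threshold are assumptions made up for the illustration.

```python
def sensor_spout(readings):
    """Spout: the source of the stream; here it simply replays a fixed list."""
    for reading in readings:
        yield reading


def parse_bolt(stream):
    """Bolt: transform raw 'sensor,value' strings into (sensor, value) tuples."""
    for raw in stream:
        sensor, value = raw.split(",")
        yield sensor, float(value)


def threshold_bolt(stream, limit):
    """Bolt: apply a rule and emit only the tuples that exceed the limit."""
    for sensor, value in stream:
        if value > limit:
            yield sensor, value


raw = ["pump1,71.5", "pump2,64.0", "pump1,93.2"]
# Wire the topology: spout -> parse bolt -> threshold bolt.
alerts = list(threshold_bolt(parse_bolt(sensor_spout(raw)), limit=90.0))
print(alerts)
```

Each stage handles one record at a time as it arrives, which is the property that lets a real topology run continuously over an unbounded stream.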
Real Time Analytics
• Kafka is an extremely high speed, scalable message bus
– Designed for scalability and the ability to absorb massive flows of messages
• Elastic linear scalability
– Developed originally at LinkedIn but open sourced in 2011
• Written in Scala
• In 2014, engineers from LinkedIn formed Confluent, which supports Kafka
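One way to picture how a message bus like Kafka lets many consumers read the same flow is an append-only log in which each consumer group remembers its own read position. The sketch below is a single-process toy under that assumption. `TopicLog`, `produce`, and `consume` are hypothetical names, not the Kafka client API.

```python
class TopicLog:
    """Toy single-partition topic: an append-only list of messages."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer group name -> next offset to read

    def produce(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to this message

    def consume(self, group, max_messages=10):
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)  # advance only this group's position
        return batch


log = TopicLog()
for event in ["order-1", "order-2", "order-3"]:
    log.produce(event)

first = log.consume("billing", max_messages=2)
second = log.consume("billing")
audit = log.consume("audit")  # a different group starts from the beginning
print(first, second, audit)
```

Because reading does not remove messages, adding another consumer group puts no extra burden on producers.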
Real Time Analytics
• Spark is a new architecture that is designed for real-time analytics
– It also uses a persistent server structure
– It utilizes in-memory datasets called Resilient Distributed Datasets (RDDs)
– It provides a graph processing component called GraphX
– It provides a machine learning component called MLlib
Real Time Analytics
• RDDs are created, transformed into other RDDs, and then actions can be performed on them to yield results.
– RDDs are immutable! And transformations are “lazy!”
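To make “immutable” and “lazy” concrete, here is a small pure-Python stand-in for an RDD. It is not the Spark API: `LazyDataset` is a hypothetical class, and only the method names `map`, `filter`, and `collect` are borrowed from Spark. Transformations return a new object and only record what to do. Nothing runs until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, with lazy transformations."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded chain of transformations, not yet run

    def map(self, fn):
        # Transformation: returns a NEW dataset; this one is left unchanged.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: only now is the recorded chain actually executed."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # track when the function is really invoked
    return x * x


numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # still zero: nothing has been computed
result = even_squares.collect()
print(calls_before_action, result, numbers.collect())
```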
Real Time Analytics
• Spark uses innovative in-memory technology called a Resilient Distributed Dataset (RDD) originally designed to handle complex machine learning tasks
Real Time Analytics
• Kudu is a new data storage layer created by Cloudera designed to bridge some of the gaps that exist with Hadoop Distributed File System (HDFS).
– Unlike HDFS, it is not immutable. It allows inserts, updates, and deletes.
– It is optimized for rapidly changing schemas, data streaming, and analytic functionality
– It is also distributed, but can be replicated across data centers, allowing for disaster recovery
– It is columnar and makes heavy use of in-memory processing and caching
– It is not yet production ready!
Big Data Trends for 2016 – New Business Models
• Data-as-a-service business model is coming
– The window for turning data into action is narrowing
– The next 12 months will be about distributed, open source streaming alternatives built on open source projects like Kafka and Spark (Forrester)
http://www.informationweek.com/big-data/big-data-analytics/big-data-predictions-for-2016/d/d-id/1323671?image_number=8
Innovative Use Cases - Energy
Fracking alone has produced hundreds of petabytes of data in the form of images, seismic recordings, videos, sound recordings, and field notes. The data is vast in its volume and variety.
Innovative Use Cases – Insurance
Pathway: Right Living
Benefit: Patients can build value by taking an active role in their own treatment, including disease prevention.
Hadoop Use Case: Predictive Analytics: Patients enter data at home manually or with biometric device data. Algorithms analyze the data and flag patterns that indicate a high risk of readmission, alerting a physician.

Pathway: Right Care
Benefit: Patients get the most timely, appropriate treatment available.
Hadoop Use Case: Real-time Monitoring: Patient vital statistics are transmitted from wireless sensors every minute. If vital signs cross certain risk thresholds, or if more complex patterns are detected, staff can be notified to attend to the patient immediately.

Pathway: Right Provider
Benefit: Provider skill sets matched to the complexity of the assignment (for instance, nurses or physicians’ assistants performing tasks that do not require a doctor). Also the specific selection of the provider with the best outcomes.
Hadoop Use Case: Historical EMR Analysis: Hadoop reduces the cost to store data on clinical operations, allowing longer retention of data on staffing decisions and clinical outcomes. Analysis of this data allows administrators to promote individuals and practices that achieve the best results.

Pathway: Right Value
Benefit: Ensure cost-effectiveness of care, such as tying provider reimbursement to patient outcomes, or eliminating fraud, waste, or abuse in the system.
Hadoop Use Case: Medical Device Management: For biomedical device maintenance, use geo-location and sensor data to manage its medical equipment. The biomedical team can know where all the equipment is, so they don’t waste time searching for an item. Over time, determine the usage of different devices, and use this information to make rational decisions about when to repair or replace equipment.

Pathway: Right Innovation
Benefit: The identification of new therapies and approaches to delivering care, across all aspects of the system. Also improving the innovation engines themselves.
Hadoop Use Case: Research Cohort Selection: Researchers at teaching hospitals can access patient data in Hadoop for cohort discovery, then present the anonymous sample cohort to their Internal Review Board for approval, without ever having seen uniquely identifiable information.
Innovative Use Cases – Health Care
From McKinsey & Company, “The ‘Big Data’ Revolution in Healthcare”
Innovative Use Cases – Health Care
• Biomedical Big Data is being driven from multiple areas and disciplines:
– Genomics (genotyping, gene expression, sequencing)
• 4 TB per person, now at the rate of 1,000s of people per month!
• The computational complexity here is very high
– Molecular pathology and predictive analytics
– Survivor analytics and pattern detection
– Predictive technologies are very important
• Disease propensity analysis
• Real-time population health and tele-medicine become tightly connected
Big Data Trends for 2016 – Machine Learning Gains Momentum!
• Machine learning has become a “checklist item” for data preparation and predictive analytics (Ovum)
• It is a top strategic trend for 2016 (Gartner)
• Machine learning will replace manual data wrangling and data governance dirty work. The freeing up of time will accelerate data strategies
http://www.ovum.com/press_releases/ovum-reveals-the-reality-of-big-data-for-2016-cloud-and-appliances-will-drive-the-next-wave-of-adoption-with-spark-the-fastest-growing-workload/
http://www.gartner.com/newsroom/id/3143521
What do we mean by “Learn?”
• Stimulus-response learning. Learning to give a precise response to a given stimulus.
– Signal learning. Learning to give a precise response to a signal.
• Chaining. Learning to give a precise response to two or more stimuli.
– Verbal association. A form of chaining that is special because of the tight linkage between language and thought in human beings.
• Multiple discrimination. Learning to classify responses into categories based on multiple stimuli.
• Concept learning. Learning to give a precise response to a class of stimuli even though the individual members of that class may differ widely from each other.
• Principle learning. Learning to give a precise response to a Principle (a chain of two or more concepts.) It functions to organize higher order “behavior.”
• Problem solving. Previously acquired concept and principle learning are combined to classify an unresolved or ambiguous set of events.
Robert Gagné (1965)
How do Machines “Learn?”
• Machine learning consists of:
– Observing the world for a while (data)
– Developing a model that classifies these observations
• Detecting patterns that exist in the observations
– Predicting the class of a new, unseen observation based on this model
– If correctness can be determined, modify (strengthen) the model with this information.
Robert Gagné (1965)
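The observe, model, predict, and strengthen loop described for machines can be sketched with a very simple classifier that keeps a running mean per class and assigns a new observation to the nearest mean. The nearest-mean technique, the class name `NearestMeanClassifier`, and the sample numbers are all assumptions picked for this illustration. The slides do not prescribe a particular algorithm.

```python
class NearestMeanClassifier:
    """Classifies an observation by the closest per-class mean of past observations."""

    def __init__(self):
        self.sums = {}    # label -> per-feature running sums
        self.counts = {}  # label -> number of observations seen

    def learn(self, features, label):
        # Observe: fold one labeled observation into the model.
        total = self.sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        self.counts[label] = self.counts.get(label, 0) + 1

    def predict(self, features):
        # Predict: pick the label whose mean is nearest (squared Euclidean distance).
        def distance(label):
            n = self.counts[label]
            return sum((f - s / n) ** 2 for f, s in zip(features, self.sums[label]))

        return min(self.counts, key=distance)


model = NearestMeanClassifier()
history = [([1.0, 1.2], "low"), ([0.8, 1.0], "low"),
           ([4.0, 4.2], "high"), ([4.4, 3.9], "high")]
for features, label in history:
    model.learn(features, label)

guess = model.predict([1.1, 0.9])  # a new, unseen observation
model.learn([1.1, 0.9], "low")     # correctness confirmed, so strengthen the model
print(guess, model.counts)
```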
Machine Learning Workflow
[Figure: machine learning workflow flowchart with three phases: Design, Use, and Strengthen and learn. Steps shown include: define desired targets; analyze available data; select training sets; build and upload model; upload training data; train model and analyze results; promote trained model into production; upload batch data set; classify single record; classify; results written to database; results queried from database; query for classification summary; query for classification result(s); create training record set from classified records; upload training record set; retrain model and analyze results; place model into production. Decision points: “Does data support desired results?” and “Does model meet acceptance criteria?”, each with a “No” branch.]
CRISP-DM is a very good framework much like this one.
A Little Machine Learning Theory Now
• Machine Learning is VERY complex. It requires an understanding of:
– Linear algebra and the usage of vectors and matrices
– Statistics and probability
– Calculus and derivatives
– And the more data, the better the model: Big Data!
Machine Learning Algorithms
• Supervised Learning:
– The machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
– Includes algorithms such as Linear Regression, Logistic Regression, Decision Tree, Random Forest, etc.
• Unsupervised Learning:
– The machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning.
– Includes algorithms such as k-means, Apriori, etc.
• Reinforcement Learning:
– Machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
– Includes models such as the Markov decision process.
http://www.analyticsvidhya.com/blog/2015/12/10-machine-learning-algorithms-explained-army-soldier/
Supervised Learning
• We use a “linear discriminant” to make predictions on new data that is given to us.
What color should this green data point be?
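One simple way to learn a linear discriminant is the perceptron rule, sketched below. The slide does not say which method produced its line, so the perceptron, the function names, and the made-up red and blue points are assumptions for illustration. Integer coordinates and a step size of 1 keep the arithmetic exact.

```python
def train_perceptron(points, labels, epochs=20):
    """Learn weights w and bias b so that sign(w.x + b) separates the two classes."""
    w, b = [0, 0], 0
    for _ in range(epochs):
        for (x, y), label in zip(points, labels):
            predicted = 1 if w[0] * x + w[1] * y + b > 0 else -1
            if predicted != label:
                # Misclassified: nudge the line toward the correct side.
                w[0] += label * x
                w[1] += label * y
                b += label
    return w, b


def classify(w, b, point):
    """Colour a new point by which side of the learned line it falls on."""
    return "blue" if w[0] * point[0] + w[1] * point[1] + b > 0 else "red"


# Labeled training data: -1 means red, +1 means blue (values invented for the example).
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_perceptron(points, labels)
new_point_color = classify(w, b, (4.5, 4.5))  # the unlabeled "green" point
print(w, b, new_point_color)
```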
Unsupervised Learning
● The k-means algorithm attempts to cluster things that “look alike”
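A minimal k-means sketch in plain Python follows. It assumes two-dimensional points and fixed starting centroids so the run is repeatable. The data and the `kmeans` helper are invented for the example, and practical implementations usually pick starting centroids randomly and test for convergence.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```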
Reinforcement Learning
● A hidden Markov model (HMM) combines the probabilities of what we can observe with hidden features (states) that remain unknown to us
● How a doctor might view observations about a patient
[Figure: HMM diagram of a patient example, showing states (including “Sick”), transition probabilities, and observed versus predicted values.]
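The doctor example can be sketched with the Viterbi algorithm, which finds the most likely sequence of hidden states behind a series of observations. Every probability below is an invented illustrative number, not a value from the slide, and the “Healthy” state and the symptom names are assumptions added to complete the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] = (probability, path) of the best path so far that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda candidate: candidate[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda candidate: candidate[0])


states = ["Healthy", "Sick"]                      # hidden: the doctor cannot see these
start_p = {"Healthy": 0.6, "Sick": 0.4}           # illustrative numbers only
trans_p = {"Healthy": {"Healthy": 0.7, "Sick": 0.3},
           "Sick": {"Healthy": 0.4, "Sick": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Sick": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

# What the doctor actually observes on three consecutive visits.
probability, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, probability)
```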
Deep Learning
• Geoffrey Hinton – University of Toronto, now Google (also Terrence J. Sejnowski)
• Research in facial recognition led to a breakthrough
• Uses specialized neural networks called “Restricted Boltzmann Machines” that are “stacked” in a feed-forward fashion
• A very basic form of a neural network that can be optimized for training
• Can learn “significant” features of a set of data
• Can “throw away” insignificant features (noise or little contribution to the model)
Mahout
• Mahout is:
– the keeper, trainer, and driver of the elephant.
– a Hadoop-specific set of algorithms designed to conduct machine learning against large sets of data.
– a large Java product set consisting of many algorithms, algorithm families, and useful data processing utilities designed to simplify various machine learning tasks.
– programmable at the Java level, but very approachable through a series of command line utilities as well.
– most useful as a set of Java libraries used by one or more Java programs.
Spark MLlib
• MLlib is a set of machine learning algorithms that uses Spark as its underlying fabric
Python
• Python is a dynamic programming language designed to do a lot of work in a relatively small amount of code.
• It is often used as a scripting language, but it is much, much more than that.
– It has an enormous code base with many libraries.
– The Natural Language Toolkit (NLTK) is a great NLP package.
– There are many Machine Learning libraries as well.
• It is easy to create MapReduce jobs in Python using the Hadoop streaming interface.
• Python is also a core language of Spark
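A word count is the usual first MapReduce job for the Hadoop streaming interface, in which the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines. The sketch below is an assumption-laden illustration: the function names and the `run` helper are hypothetical, and the map, sort, and reduce steps are simulated locally on a small sample rather than submitted to a cluster.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; the framework delivers records sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(role, stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming script could call with 'map' or 'reduce'."""
    step = mapper if role == "map" else reducer
    for record in step(stdin):
        stdout.write(record + "\n")


# Local simulation of map -> sort -> reduce on two sample lines.
sample = ["Big data is big", "data about data"]
word_counts = list(reducer(sorted(mapper(sample))))
print(word_counts)
```

On a cluster, the mapper and reducer would typically live in separate scripts that call something like `run("map")` and `run("reduce")`, with the framework handling the sort between them.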
R
• R is a statistical language based on the “S” language by John Chambers from Bell Labs. It is roughly based on Scheme, a LISP derivative.
– Like Python, it is dynamic, and is a functional programming language.
– Its emphasis is on statistical programming, and as a result it has been a favorite of financial services and “quants.”
– Also like Python, it can be integrated into Hadoop using the streaming interface, but R is not inherently a parallel processing language!
• Now we finally have RHadoop, in which many of the R algorithms can run in parallel on a Hadoop cluster.
• And we have SparkR!
What is next?
• New Big data computing is being done on GPUs
– Facebook has open sourced its “Big Sur” AI hardware schematics
– NVidia Tesla engine
• Algorithms rule!
– Algorithms are being commoditized and also sold as a service.
• Data warehouse market growth has slowed and NoSQL database growth is increasing. (Gartner 2015)
– Volume and variety of IoT devices will help drive this trend
– Biomed and personal devices – telemedicine and self-serve
• Over 50% of IT projects in 2016 will choose open source.
Questions and Answers
[email protected]@AnalyticsInside.us
http://www.AnalyticsInside.us