TRANSCRIPT
What’s Big About Big Data? The Volume? The Size? The Value?
November 2013 Andy Ward – Senior Principal Consultant
Just what is Big Data?
– We hear a lot about Big Data, but what is it?
– The advantages and disadvantages: there are a lot of benefits for those that leap into the Big Data world, but potentially some downsides for the rest of us
What part does DB2 for z/OS have to play?
– What advances are being made in the DB2 world to embrace Big Data?
This is a BIG topic; this presentation is designed to be a primer
Agenda
Big Data in Action…1
The US Census – the birth of the Big Data problem
– 1880 census – took 8 years to complete; the data was obsolete
– 1890 census – tabulated in just one year using Herman Hollerith’s system of punch cards and tabulating machines – the foundation of IBM
Google
– Data volumes began to stress traditional data processing engines
– MapReduce was born (circa 2004): massive parallel processing to churn through and analyse the data
– Hadoop is the open source equivalent – more on this later; Hadoop was originally developed by Yahoo
The need to know rather than to sample
The Origins of Big Data
Sampling can be very effective
– For simple yes/no questions, with roughly equal odds, a sample of 1,100 will provide very accurate answers
– Remarkably, 19 times out of 20 a sample of this size comes within +/-3% of the true answer for the overall population (regardless of the population’s size)
However, what about subcategories?
– If you need to know how people are GENERALLY likely to vote, sampling is likely to produce good results
– What if you want to know how females between 25 and 30, with one male child, who have spent at least 10 years living outside the country, will vote? Sampling just doesn’t cut it – you need all the data
The Problem With Sampling
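As a quick sanity check on the 1,100 figure (a worked example, not from the original slides), the standard margin-of-error formula for a sampled proportion, assuming a 95% confidence level (z ≈ 1.96) and the worst case p = 0.5:

```latex
\mathrm{ME} = z\sqrt{\frac{p(1-p)}{n}}
            = 1.96\sqrt{\frac{0.5 \times 0.5}{1100}}
            \approx 0.0295 \approx 3\%
```

The 95% confidence level is exactly the ‘19 times out of 20’ above, and since the population size never appears in the formula, the result holds regardless of how large the population is.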
Where Is All This Data Coming From? Datafication – Today and Tomorrow
Data Volumes are Increasing at an Alarming Rate
Terabyte (10^12) – 1000 Gigabytes
Petabyte (10^15) – 1000 Terabytes
Exabyte (10^18) – 1000 Petabytes
Zettabyte (10^21) – 1000 Exabytes
Yottabyte (10^24) – 1000 Zettabytes
Getting Your Head Around Big Data Volumes
Big Data Needs Refining
Firstly, the need for transactional, traditional processing hasn’t gone away – And Big Data needs to be distilled down to manageable chunks
But…
– Data volumes have increased (estimated at >1.2 zettabytes today; <2% is analog)
The internet of things
Google processes over 24 petabytes of data every day
Facebook receives over 10 million new photos every hour
– Data has become less structured
Tweets and YouTube tags, for example
– The increasing volume means we can’t process the data in traditional ways
– There is likely to be a lot of untapped value in all this data
Will ‘Small’ Data Disappear?
We need to start asking the right questions – We can now answer questions we couldn’t before
The ‘WHAT’ not the ‘WHY?’ – Correlation versus Causality
Traditionally we strive to understand WHY things happen
– For example, searching for the God particle
– But theories change – consider quantum theory
– Ultimately, our ‘WHYs’ are often just a very educated guess
For example
– If analysing huge datasets allowed us to see a strong correlation between cancer remission and drinking in excess of 3 litres of water a day, do we need to understand why?
Big Data Requires a Change in Our Thinking – More Data, More Answers
Some Big Data Buzzwords
The Three Vs of Big Data
Volume
Variety
Velocity
Big Data changes the way we manage data
Today most of our business decisions are made looking at data at rest – Stored and recorded
Some data may never be stored (unless it deviates from the norm) – Think sensor data
Consider the advantages that could be gained by processing data as it moves into your line of sight
Data In-Motion – Analysing streaming data: the next big thing?
Big Data provides a lot of data variety
Traditional data is very structured and data stores are generally well designed
However, even unstructured data has some sort of structure
– Consider a Tweet: it has a structure of 140 characters
– But within those 140 characters the data is very unstructured
It could be about any topic
It may contain acronyms, slang and misspellings
It may be totally incorrect
The Big Data world is only really exploiting textual unstructured data today
Unstructured Data
In a small data world it was our instinct to eradicate inaccurate data – Sampling means small errors can have a big impact on the answer
Some data of course has to be right
– £10,000 is a small amount for a large bank, but if it’s removed from my account in error it’s a big deal to me
Messiness is a problem of the recording, not a Big Data problem per se
– For example, incorrect and misspelt YouTube tags
Big Data needs to be thought of as more probabilistic than completely precise
Messy Data
Big Data in Action…2 Target
When dealing with many terabytes or even petabytes of data, traditional processing just takes too long
– Parallelism is the answer
Given the data volumes, the data needs to be close to the processing engine
– Moving datasets of this size while analysing them is not plausible
Unneeded data needs to be removed quickly and early
– The IBM DB2 Analytics Accelerator is very good at this – more later
Processing Big Data
Big Data provides big opportunities for those that harness it – Companies using Big Data outperform their peers by 30% on average
This is becoming a boom industry
An example of Big Data analytics
– How often have you seen an advert on a webpage for something you have recently searched for? This is very likely to be Big Data in action once more
– Processing millions of web visits and finding patterns
In the good old days these were random adverts, hoping to be of interest
Personalisation of the Internet
Analytics – Where the magic happens
Not having to know the $ value of data
– Today’s data warehouses are expensive to build and maintain, and generally contain data that is already known to be valuable
Running queries on relatively cheap ‘commodity’ hardware – Away from the expensive corporate processing power
Looking at all the data (n=all), not a subset
– Being able to make true data-based decisions rather than decisions based on data that fits
– In governmental terms: ‘evidence-based policy, not policy-based evidence’
The Ability to Analyse More Data More Cheaply
This is a massive topic – we can only scratch the surface today
Simplifying the processing of Big Data – Removing some of the complexities of data analysis – Schema-less – ideal for unstructured data
Utilises MapReduce
– Map – splitting the request up into smaller parts and sending these to where the data is stored to be processed; a little like query parallelism on a massive scale (with a few differences!)
– Reduce – recombining all the results returned from the Map process
Data is accessed via Java programs – a minimal sketch follows below
– Although more DBA-familiar methods are available (Hive, Pig, HBase)
Hadoop
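To make the Map and Reduce steps concrete, here is the classic word-count job written against the Hadoop Java MapReduce API – a minimal illustrative sketch rather than code from the presentation; the class name and the input/output paths passed on the command line are placeholders:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: runs where the data lives; emits (word, 1) for each word in its split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: recombines the mappers' partial results into one count per word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each map task runs on the node holding its block of the input – the processing is sent to the data – and the reduce tasks recombine the partial counts into the final result.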
Hadoop Architecture
No Hadoop distribution has currently been certified for the z platform
However…
– Hadoop is a Linux-based Java program, so essentially it can run on System z
– Perhaps even simpler: run Hadoop on a zBX attached to a relatively cheap z114 (or a z196 or EC12)
– Hadoop is already certified to run on x86 blades
The downside is that these blades are not cheap commodity servers
We will have to wait to understand the appetite of users
– IBM may well be keen to port Hadoop to System z in the future
Hadoop and the Mainframe
In the old ‘small data’ world
– Data was generally very specific, collected for one purpose
In the new ‘big data’ world
– Data can be re-analysed, potentially for completely different purposes
– Although it is likely that data quality will degrade over time: with Amazon purchase information, your interests of 10 years ago may no longer be relevant
Recombining data
CAPTCHA & reCAPTCHA – reCAPTCHA verifies unclear words in digitised text
Data Reuse
The valuable digital trail we leave behind us
Many companies design their systems to harvest this data – Where we click, where we type, how long we view a webpage
If a certain Google search term results in people clicking a link on the 4th page of results, Google’s PageRank takes this into account
– Moving the page higher up the result set
Google spell check
– Google actually ratifies its analysis: ‘Did you mean xxxxx?’
Data Exhaust
Took translations from all over the web
– Including official documents: over a trillion words and 95 billion English sentences
Translating from English to French – the power of Google Translate
– The words ‘watch’ and ‘wind’ each have more than one English definition
– ‘His watch was old, he had to wind it every day’
Sa montre était vieux, il avait pour enrouler tous les jours
– ‘I had to watch the replay, the strong wind interrupted the satellite signal’
J'ai dû regarder le replay, le vent fort interrompu le signal satellite
– ‘I didn't know which watch to watch’
Je ne savais pas qui montre à regarder.
Big Data in Action…4 Google Translate (project started in 2006)
Big Data can indicate our propensity to do something
– Not too much of a problem if it’s buying something, a bit more of a problem if it’s related to criminal actions
Agencies know more about you than ever before, especially web ones
– People are under constant surveillance
– In some respects this is very useful: airlines knowing where you like to sit, for example
– In others… if I see an advert for another watch…
Correlation does not imply causation
The Downside of Big Data
The Accelerator’s data must originally come from DB2 (at the moment)
– This implies the data is already structured
– But of course structured data can still be Big Data; however, it’s only a small proportion of the data available to us
A Netezza appliance is used to process the data
– More on this in a moment
– Like Hadoop, it solves the problem through parallelism
The optimizer makes the choice as to where the query runs
– Native DB2 or the Analytics Accelerator
Data management is controlled by Stored Procedures
Where Does DB2 Fit In? IBM DB2 Analytics Accelerator
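To illustrate how that optimizer choice is surfaced to an application, here is a minimal JDBC sketch using the DB2 for z/OS special register CURRENT QUERY ACCELERATION (the register is real; the connection URL, credentials, and the SALES table are placeholders invented for the example):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AcceleratedQuery {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details for a DB2 for z/OS subsystem
    try (Connection con = DriverManager.getConnection(
             "jdbc:db2://zhost:446/DB2A", "user", "password");
         Statement stmt = con.createStatement()) {

      // Ask DB2 to route eligible queries to the Accelerator,
      // falling back to native DB2 execution if acceleration fails
      stmt.execute("SET CURRENT QUERY ACCELERATION = ENABLE WITH FAILBACK");

      // A typical long-running analytical query (table is illustrative);
      // the optimizer decides where it actually runs
      try (ResultSet rs = stmt.executeQuery(
               "SELECT REGION, SUM(AMOUNT) FROM SALES GROUP BY REGION")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + " " + rs.getBigDecimal(2));
        }
      }
    }
  }
}
```

With ENABLE WITH FAILBACK, eligible queries are routed to the Accelerator and simply run natively in DB2 if the Accelerator cannot execute them.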
Controlled by a number of Stored Procedures that (among other things):
– Add an Accelerator to a subsystem
– Add or remove tables from an Accelerator
– Load data from DB2 to the Accelerator
Data Management
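As a flavour of what driving these procedures looks like, a hedged JDBC sketch follows. SYSPROC.ACCEL_ADD_TABLES is a genuine Accelerator stored procedure, but the accelerator name and the XML parameter values shown here are abbreviated placeholders – the full XML formats are defined in the Accelerator documentation:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Types;

public class AcceleratorAdmin {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details for the DB2 for z/OS subsystem
    try (Connection con = DriverManager.getConnection(
             "jdbc:db2://zhost:446/DB2A", "user", "password");
         // Register DB2 tables with an Accelerator so they can be loaded
         CallableStatement cs =
             con.prepareCall("CALL SYSPROC.ACCEL_ADD_TABLES(?, ?, ?)")) {

      cs.setString(1, "IDAA1");  // accelerator name - illustrative
      // Table-specification XML, heavily abbreviated here
      cs.setString(2, "<tableSpecifications>...</tableSpecifications>");
      // The message parameter is an in/out XML document
      cs.setString(3, "");
      cs.registerOutParameter(3, Types.CLOB);
      cs.execute();
      System.out.println("Messages: " + cs.getString(3));
    }
  }
}
```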
Netezza ZoneMaps
– Highlighting where the data isn’t, as opposed to an index, which shows you where it is
– Contain high and low values for a disk extent
– Automatically managed for certain datatypes, e.g. smallint, integer, date, time
Materialized Views
Cluster Based Tables
What No Indexes?
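A toy sketch of the ZoneMap idea – illustrative Java, not Netezza internals, with every name invented for the example: record the high and low values per disk extent, then skip any extent whose range cannot contain the search value.

```java
import java.util.ArrayList;
import java.util.List;

public class ZoneMapSketch {

  // One entry per disk extent: just the low/high key values stored there
  record Zone(int min, int max, int[] rows) {}

  public static void main(String[] args) {
    List<Zone> zones = List.of(
        new Zone(1, 100, new int[] {5, 42, 99}),
        new Zone(101, 200, new int[] {150, 180}),
        new Zone(201, 300, new int[] {250}));

    int searchValue = 180;
    List<Integer> hits = new ArrayList<>();

    for (Zone z : zones) {
      // The zone map tells us where the data ISN'T:
      // skip the whole extent without reading it
      if (searchValue < z.min() || searchValue > z.max()) {
        continue;
      }
      // Only extents that might contain the value are scanned
      for (int row : z.rows()) {
        if (row == searchValue) hits.add(row);
      }
    }
    System.out.println("Matched rows: " + hits);
  }
}
```

Whole extents are eliminated without any I/O; only zones whose min/max range could contain the value are scanned, which is why no index is needed.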
Unnamed Analytics Accelerator customer
Initial load
– 5.1GB – 1m 25s (24M rows)
– 400GB – 29m (570M rows)
Query acceleration
– Traditional DB2 – 2hrs 39m
– Accelerator – 5s (nearly 2000x faster)
CPU utilization (24M rows)
– Traditional DB2 – 56.6s
– Accelerator – 0.4s (99% less)
Some Very Impressive Performance
Source: IDAA: A DB2 Insider Story for Users, Maryela Weihrauch, IBM, IDUG EMEA 2012
Accelerator v3.1
– High Performance Storage Saver (HPSS)
– EC12 support
– Incremental update
– High capacity – up to 1.28 petabytes
Accelerator v4.1
– Supports static SQL
– Incremental update & HPSS improvements
– DB2 11 support
– Improved monitoring
The Latest Advances – Selected enhancements
The ever increasing pool of data – Mostly unstructured and messy
The three Vs of Big Data
The ‘what’ not the ‘why?’
Some negatives to consider
Changing our traditional view
– The questions to ask, analysing more data (known $ value is no longer a constraint), data in-motion vs data at rest
Processing Big Data – The need for parallelism – Hadoop
Big Data in action
Big Data Summary
Structured data only at present
Things are moving forward quickly though (JSON)
And it still provides the ability to get answers to questions that were formerly out of scope due to cost and processing time
Netezza – FPGA – ZoneMaps
Very impressive performance figures
Big Data and DB2 Summary