
What’s Big About Big Data? The Volume? The Size? The Value?

November 2013 Andy Ward – Senior Principal Consultant

Just what is Big Data? – We hear a lot about Big Data, but what is it? – The advantages and disadvantages: there are a lot of benefits for those that leap into the Big Data world, but potentially some downsides for the rest of us

What part does DB2 on z/OS have to play? – What advances are being made in the DB2 World to embrace Big Data?

This is a BIG topic; this presentation is designed to be a primer

Agenda

Just What is Big Data?

Big Data in Action…1

Versus

Presenter
Presentation Notes
Tracking flu epidemics is a big problem. Flu is easily spread, and global travel adds its own problems. In 2009 H1N1 caused major concerns of a potential pandemic, and doctors were told to report all flu cases. That data was hopelessly out of date – there was around a two-week lag, because people don't go straight to the doctor and the data isn't tabulated every day. Just before H1N1 hit the media, Google released a paper citing how they could track the spread of winter flu. The system looked for correlations between search terms and the spread of flu among the 50M most common search terms. They found 45 terms with a very strong correlation, so they could track the spread of flu with a high level of accuracy in a very short span of time.
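Google's actual model isn't reproduced here; as a minimal sketch of the idea – assuming we had weekly counts for a few candidate search terms and official flu-case counts (the data below is invented) – ranking terms by how strongly they correlate with reported cases might look like this:

```python
import numpy as np

# Illustrative only: weekly counts for candidate search terms and
# officially reported flu cases over the same weeks.
weeks = 52
rng = np.random.default_rng(0)
flu_cases = rng.poisson(100, size=weeks).astype(float)
search_counts = {
    "flu symptoms": flu_cases * 3 + rng.normal(0, 20, weeks),    # strongly related
    "cough remedy": flu_cases * 1.5 + rng.normal(0, 50, weeks),  # weakly related
    "holiday deals": rng.normal(500, 40, weeks),                 # unrelated
}

# Rank terms by Pearson correlation with reported cases and keep the strongest.
correlations = {
    term: np.corrcoef(counts, flu_cases)[0, 1]
    for term, counts in search_counts.items()
}
for term, r in sorted(correlations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:15s} r = {r:+.2f}")
```

The terms that survive this ranking are the ones whose search volume can stand in for the (much slower) official case counts.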

The US Census – the birth of the Big Data problem: 1880 census – 8 years to complete – data obsolete

1890 census – tabulated in just one year – Herman Hollerith's system of punch cards and tabulation machines – the foundation of IBM

Google – Data volumes began to stress traditional data processing engines – MapReduce was born (circa 2004): massively parallel processing to churn and analyse the data

– Hadoop is the open source equivalent – more on this later; Hadoop was originally developed by Yahoo

The need to know rather than to sample

The Origins of Big Data

Presenter
Presentation Notes
Sampling is fine when the sample set is very random and you want an answer to a wide-ranging question, e.g. predicting voting patterns. Sampling is not so useful when you want to understand detail, e.g. which way women under the age of 25, with one child, who have lived in Chicago for less than 5 years are likely to vote. Sampling data is like looking in detail at a small section of a photograph – it quickly becomes blurry.

Sampling can be very effective – For simple yes/no questions, with roughly equal odds, a sample of 1,100 will provide very accurate answers. Remarkably, 19 times out of 20 a sample of this size is within +/-3% of the overall population (regardless of its size) – see the worked calculation after the notes below

However, what about subcategories? – If you need to know how GENERALLY people are likely to vote, sampling is likely to produce good results – What if you want to know how females between 25 and 30, with one male child, who have spent at least 10 years living outside the country will vote? Sampling just doesn't cut it – you need all the data

The Problem With Sampling

Presenter
Presentation Notes
Accurate answers because the benefit gained from each incremental data point diminishes as the sample grows – there is a lot of statistical science behind this. Sampling gives good results – look at the exit polls and Gallup (UK) polls run prior to elections.
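As a quick worked check of those figures (standard sampling arithmetic, not from the presentation): for a yes/no question with roughly even odds, the 95% margin of error for a simple random sample of n = 1,100 is

```latex
\mathrm{MoE} \approx z_{0.975}\sqrt{\frac{p(1-p)}{n}}
             = 1.96\sqrt{\frac{0.5 \times 0.5}{1100}}
             \approx 0.0295 \approx 3\%
```

The "19 times out of 20" is simply the 95% confidence level, and the result is essentially independent of population size as long as the population is much larger than the sample.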

Where Is All This Data Coming From? Datification - Today and Tomorrow

Presenter
Presentation Notes
Smart phones: Apple iPhones passing information about where people are – now stopped, I believe – but just imagine the data transferred for that alone. The Internet of Things – more and more equipment is IP enabled.

Data Volumes are Increasing at an Alarming Rate

Presenter
Presentation Notes
Before the invention of the printing press by Johannes Gutenberg (around 1450) data was scarce – Cambridge University had only 122 books at the beginning of the 15th century. Digitally stored data has rocketed: as little as a quarter of the world's data was held digitally in 2000; now less than 2% is non-digital. Global data is around 2.7 ZB today – only an estimated 0.5% is analyzed. Enterprise-created data vs enterprise-managed data: more and more external data will be used by organisations.

Terabyte (10^12) – 1,000 Gigabytes

Petabyte (10^15) – 1,000 Terabytes

Exabyte (10^18) – 1,000 Petabytes

Zettabyte (10^21) – 1,000 Exabytes

Yottabyte (10^24) – 1,000 Zettabytes

Getting Your Head Around Big Data Volumes

Presenter
Presentation Notes
Terabyte: in 1991 a 1 GB HDD cost upwards of $2,699 – today a 1 TB HDD costs $80; the first consumer 1 TB drive was introduced in 2007 by Hitachi. Petabyte: in 2009 Google was processing around 24 petabytes of data a day; in May 2013 Microsoft's move from Hotmail to Outlook.com migrated over 150 petabytes of data in six weeks. Exabyte: 16 exabytes is the amount of addressable storage in a z/OS 64-bit address space (8.6 billion times more storage than the 31-bit 2 GB limit); if 2 GB were 1 mm, 16 EB would be more than 8,500,000 metres. Zettabyte: Mark Liberman, an American linguist, calculated that every word ever spoken would equate to 42 zettabytes of data if digitized as 16 kHz, 16-bit audio. Yottabyte: to store a yottabyte on 1 TB hard drives would require a million data centres the size of a US city block.
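A quick check of the address-space arithmetic quoted above (just verifying the stated figures, not from the slides):

```latex
\frac{2^{64}\ \text{bytes}}{2^{31}\ \text{bytes}} = 2^{33} \approx 8.6 \times 10^{9}
\qquad\text{and}\qquad
8.6 \times 10^{9}\ \text{mm} \approx 8{,}590\ \text{km} > 8{,}500{,}000\ \text{m}
```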

Big Data Needs Refining

Presenter
Presentation Notes
Think of Big Data as crude oil: it needs to be refined into something usable

Firstly, the need for transactional, traditional processing hasn’t gone away – And Big Data needs to be distilled down to manageable chunks

But… – Data volumes have increased (estimated at >1.2 zettabytes today, of which <2% is analog) – The Internet of Things

Google processes over 24 petabytes of data every day

Facebook receives over 10 million new photos every hour

– Data has become less structured – Tweets and YouTube tags, for example

– The increasing volume means we can’t process the data in traditional ways – There is likely to be a lot of untapped value in all this data

Will ‘Small’ Data Disappear?

Presenter
Presentation Notes
The Internet of Things: fridge doors opening, cars starting, Fitbit, home control. 3rd century BC – Ptolemy II, King of Egypt, and the Great Library of Alexandria; with all the data of today, each person would have over 320 times more information than the library held.

We need to start asking the right questions – We can now answer questions we couldn’t before

The ‘WHAT’ not the ‘WHY?’ – Correlation versus Causality

Traditionally we strive to understand 'WHY' things happen – For example searching for the God particle – But theories change – consider quantum theory

Ultimately, our 'WHYs' are often just a very educated guess

For example – If analysing huge datasets allowed us to see a strong correlation between cancer remission and drinking in excess of 3 litres of water a day, do we need to understand why?

Big Data Requires A Change in Our Thinking More Data, More Answers

Some Big Data Buzzwords

Presenter
Presentation Notes
Flume – allowing data from a source to flow into Hadoop: source (e.g. a TCP port or a text file), sink (the target), decorator (a data manipulator, e.g. compress or prune the data). Sharding – the secret of processing Big Data; think of it a little like DB2 partitioning. NoSQL – non-relational, schema-free databases like Hadoop. Pig/Hive – simplifying the loading and processing of data in these NoSQL databases; you no longer need to be able to code the Map and Reduce elements yourself.
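To make "sharding" concrete – an illustrative sketch only, with a made-up shard count, not taken from the presentation – data is spread across shards by hashing a key, much as DB2 partitioning spreads rows across partitions:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster of four shards

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard by hashing, so each shard holds roughly 1/N of the data."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Each shard can then be scanned and processed in parallel, which is the point.
for customer in ["alice", "bob", "carol", "dave", "erin"]:
    print(customer, "-> shard", shard_for(customer))
```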

The Three Vs of Big Data

Volume

Variety

Velocity

Presenter
Presentation Notes
Volume: organisations are struggling – the amount of data available to them is increasing while the proportion they can process is decreasing. Is the data in that 'blind spot' useful or not? Variety: data arrives from everywhere – social media, a myriad of sensors, the internet, log files, email, etc. It is structured, semi-structured and unstructured, and traditional analytic platforms don't handle variety well – today we like to understand our data and plan for it. Velocity: this isn't just about how fast data is generated (that sits more inside the Volume characteristic); it is how fast data moves, and the dual need to process it in motion as well as at rest. Seconds can make big differences – think of buying 10M shares just before they rise by a cent and then selling them before they fall. Analysing data in motion is very important.

Big Data changes the way we manage data

Today most of our business decisions are made looking at data at rest – Stored and recorded

Some data may never be stored (unless it deviates from the norm) – Think sensor data

Consider the advantages that could be gained by processing data as it moves into your line of sight

Data In-Motion Analysing streaming data – the next big thing?

Presenter
Presentation Notes
Sensor data – potentially millions of readings a second arriving; if everything is fine they aren't needed and are never stored. At-rest data – withdrawing money: if I have £100 and want £1,000 my request will be declined, even though a credit for £2,000 may be moving through the organisation – only at-rest data is traditionally considered. Advantages – for example share price fluctuations: a few cents, when applied to millions of shares, is worth leveraging.
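A minimal sketch of the "only keep what deviates from the norm" idea for in-motion sensor data (the threshold band and the feed below are invented for illustration):

```python
from typing import Iterable, Iterator

NORMAL_RANGE = (18.0, 25.0)  # hypothetical acceptable temperature band

def interesting_readings(stream: Iterable[float]) -> Iterator[float]:
    """Process readings as they arrive and yield only those worth storing."""
    low, high = NORMAL_RANGE
    for value in stream:
        if value < low or value > high:   # deviation from the norm
            yield value                   # everything else is dropped, never stored

# Millions of readings could flow through here; only the anomalies are kept.
sensor_feed = [21.3, 22.0, 30.5, 21.9, 17.2, 23.4]
print(list(interesting_readings(sensor_feed)))  # -> [30.5, 17.2]
```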

Big Data provides a lot of data variety

Traditional data is very structured and data stores are generally well designed

However, even unstructured data has some sort of structure – Consider a Tweet: it has a structure of 140 characters – But within those 140 characters the data is very unstructured – it could be about any topic

It may contain acronyms, slang, misspellings

It may be totally incorrect

The Big Data world is only really exploiting textual unstructured data today

Unstructured Data

In a small data world it was our instinct to eradicate inaccurate data – Sampling means small errors can have a big impact on the answer

Some data of course has to be right – £10,000 is a small amount for a large bank; if it's removed from my account in error it's a big deal to me

Messiness is a problem of the recording not a Big Data problem per se – For example incorrect and misspelt YouTube tags

Big Data needs to be thought of as more probabilistic than completely precise

Messy Data

Big Data in Action…2 Target

Presenter
Presentation Notes
Target – a large retailer in the US – and the power of loyalty cards. Target analysed the shopping habits of women who signed up for its baby gift registry and used Big Data to find correlations between purchasing patterns and pregnancy – knowing a person is pregnant without them having to tell you. This matters because at this point shopping patterns often change, along with more expensive brand loyalties. Find out who these people are and send them coupons etc. for baby-orientated items. It caused a problem – an angry father, whose daughter was still in school, wanted to know why Target was sending her information on baby products. It turned out the daughter was in fact pregnant and hadn't told her parents.

When dealing with many terabytes or even petabytes of data traditional processing just takes too long – Parallelism is the answer

Given the data volumes, the data needs to be close to the processing engine – Moving datasets of this size whilst analyzing them is not plausible

Un-needed data needs to be removed quickly and early – The IBM DB2 Analytics Accelerator is very good at this – more later

Processing Big Data

Big Data provides big opportunities for those that harness it – Companies using Big Data outperform their peers by 30% on average

This is becoming a boom industry

An example of Big Data analytics – How often have you seen an advert on a webpage for something you have recently searched for? This is very likely to be Big Data in action once more

– Processing millions of web visits and finding patterns

In the good old days these were random adverts, hoping to be of interest

Personalisation of the Internet

Analytics Where the magic happens

Presenter
Presentation Notes
There are companies that process online activity and provide information to advertisement brokers, enabling them to supply the right ad to the right individual in real time. Story – Sara's watch for Christmas.

Not having to know the $ value of data – Today's data warehouses are expensive to build and maintain and generally contain data that is known to be valuable

Running queries on relatively cheap ‘commodity’ hardware – Away from the expensive corporate processing power

Looking at all the data (n=all), not a subset – Being able to make true data-based decisions rather than decisions based on data that fits – in governmental terms, 'evidence-based policy, not policy-based evidence'

The Ability to Analyse More Data More Cheaply

Presenter
Presentation Notes
The $ value of data: today retailers know that the way their stores are configured plays an important part in their revenues, so spending time and money analysing what customers buy is time and money well spent. UK retailer – the nappies and beer story: they noticed that individual nappy purchases happened late in the evening and no loyalty card was presented (no loyalty card is often associated with male shoppers). Nappies were moved close to the beer – beer sales increased. However, retailers don't know that all data is as important to their revenues. If a customer visits on a Monday, do they spend more than if they visit on a Thursday? Is it worth mining this data to find out – for example, emailing them coupons on a Sunday?

This is a massive topic – we can only scratch the surface today

Simplifying the processing of Big Data – Removing some of the complexities of data analysis – Schema-less – ideal for unstructured data

Utilises MapReduce – Map – splitting the request up into smaller parts and sending these to where the data is stored to be processed; a little like query parallelism on a massive scale (with a few differences!)

– Reduce – recombining all the results returned from the Map process (a minimal word-count sketch follows this slide)

Data is accessed via Java programs – Although more DBA-familiar methods are available (Hive, Pig, HBase)

Hadoop
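Real Hadoop jobs are written in Java against Hadoop's APIs; purely as an illustration of the Map and Reduce steps named above (a sketch, not Hadoop code), here is a tiny word count:

```python
from collections import defaultdict
from itertools import chain

def map_phase(split: str):
    """Map: run where the data lives, emitting (key, value) pairs for each record."""
    for word in split.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: recombine all the values emitted for the same key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two 'splits' standing in for blocks stored on different Hadoop nodes;
# each would be mapped in parallel, then the results are reduced together.
splits = ["big data needs refining", "big data in action"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, ...}
```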

Hadoop Architecture

Presenter
Presentation Notes
Nodes – built-in redundancy (a little like RAID: data is replicated across nodes); each node contains everything it needs to process the data – CPU and the data itself.

No Hadoop distribution has currently been certified for the z platform

However… – Hadoop is a Linux-based Java program, so essentially it can run on z Series

– Perhaps even simpler… Run Hadoop on a zBX attached to a relatively cheap z114 (or a z196 or EC12)

– Hadoop is already certified to run on x86 blades – the downside is these blades are not cheap commodity servers

We will have to wait to understand the appetite of users – IBM may well be keen to port Hadoop to z series in the future

Hadoop and the Mainframe

Big Data in Action…3

In the old ‘small data’ world – Data was generally very specific, collected for one purpose

In the new 'big data' world – Data can be re-analysed, potentially for completely different purposes – Although it is likely that data quality will degrade over time: Amazon purchase information, for example – your interests of 10 years ago may no longer be relevant

Recombining data

Captcha & ReCaptcha – ReCaptcha verifies unclear words in digitized text

Data Reuse

Presenter
Presentation Notes
Recombining data: a Danish mobile-phone cancer-risk study took datasets from mobile operators, medical records and socio-economic data. Captcha (page 98 of Big Data): Captchas take up around 150,000 hours of people's time every day, which should be put to good use – hence ReCaptcha. ReCaptcha shows a known and an unknown word; if the known word is typed correctly, the response for the unknown word is counted, and once enough people agree the unknown word is assumed to be correct.
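A minimal sketch of that two-word voting mechanism (an illustration under assumed names and thresholds, not ReCaptcha's actual implementation):

```python
from collections import Counter, defaultdict

CONSENSUS = 3  # hypothetical number of agreeing answers needed

known_answers = {"challenge_17": "refine"}   # word the system already knows
votes = defaultdict(Counter)                 # readings of the unknown word, per challenge

def submit(challenge_id: str, known_typed: str, unknown_typed: str) -> bool:
    """Accept the user if the known word matches; record their reading of the unknown word."""
    if known_typed.lower() != known_answers[challenge_id]:
        return False                         # failed the human test, discard the vote
    votes[challenge_id][unknown_typed.lower()] += 1
    return True

def transcription(challenge_id: str):
    """Once enough users agree, the unknown word is assumed to be correct."""
    if not votes[challenge_id]:
        return None
    word, count = votes[challenge_id].most_common(1)[0]
    return word if count >= CONSENSUS else None

for _ in range(3):
    submit("challenge_17", "refine", "tabulate")
print(transcription("challenge_17"))  # -> 'tabulate'
```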

The valuable digital trail we leave behind us

Many companies design their systems to harvest this data – Where we click, where we type, how long we view a webpage

If a certain Google search term results in people clicking a link on the 4th page of results, Google's PageRank takes this into account – Moving the page higher up the result set

Google spell check – Google actually ratifies their analysis: 'Did you mean xxxxx?'

Data Exhaust

Took translations from all over the web – Including official documents: over a trillion words and 95 billion English sentences

Translating from English to French – the power of Google Translate – The words 'watch' and 'wind' have more than one English definition:

– 'His watch was old, he had to wind it every day' → Sa montre était vieux, il avait pour enrouler tous les jours

– 'I had to watch the replay, the strong wind interrupted the satellite signal' → J'ai dû regarder le replay, le vent fort interrompu le signal satellite

– 'I didn't know which watch to watch' → Je ne savais pas qui montre à regarder.

Big Data in Action…4 Google Translate (project started in 2006)

Presenter
Presentation Notes
Still not perfect, but probably the best – supports 60 languages (14 via voice input). Uses English as a bridge for languages that are difficult to translate directly between (Hindi and Catalan, for example). Messy data, misspellings, mistranslations – but 'more trumps better'.

Some Caution Needs to be Applied

Big Data can indicate our propensity to do something – Not too much of a problem if it's buying something, a bit more of a problem if it's related to criminal actions

Agencies know more about you than ever before, especially web ones – People are under constant surveillance – In some respects this is very useful – airlines knowing where you like to sit, for example

– In others…if I see an advert for another watch…

Correlations do not imply causation

The Downside of Big Data

Presenter
Presentation Notes
Who has seen Minority Report?

DB2 and Big Data

The Accelerator data must originally come from DB2 (at the moment) – This intimates the data is already structured – But of course structured data can still be Big Data; however, it's only a small proportion of the data available to us

A Netezza appliance is used to process the data – More on this in a moment – Like Hadoop it solves the problem through parallelism

The optimizer makes the choice as to where the query runs – Native DB2 or the Analytics Accelerator

Data management is controlled by Stored Procedures

Where Does DB2 Fit In? IBM DB2 Analytics Accelerator

Presenter
Presentation Notes
Used to be known as the IDAA. Rumour has it that an American alcoholics group with the same initials complained; it is now generally known as the DB2 Accelerator.

Netezza Architecture

The Magic of the Field Programmable Gate Array

Controlled by a number of Stored Procedures that (among other things): – Add an Accelerator to a subsystem – Add or remove tables from an Accelerator – Load data from DB2 to the Accelerator

Data Management

Netezza ZoneMaps – Highlighting where the data isn't, as opposed to an index, which shows you where it is

– Contain high and low values for a disk extent – Automatically managed for certain datatypes, e.g. smallint, integer, date, time (a small illustration follows the notes below)

Materialized Views

Cluster Based Tables

What No Indexes?

Presenter
Presentation Notes
Materialized views – cutting down the columns. Cluster-based tables – very similar to DB2 partitioning.
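To illustrate why zone maps can stand in for indexes on scans (an illustrative sketch only – the real zone maps are maintained automatically by the appliance): keep the low and high value per disk extent and skip any extent whose range cannot contain the predicate value.

```python
# Illustrative zone map: (min, max) kept per disk extent of an integer/date column.
extents = [
    {"id": 0, "min": 1,   "max": 100, "rows": range(1, 101)},
    {"id": 1, "min": 101, "max": 200, "rows": range(101, 201)},
    {"id": 2, "min": 201, "max": 300, "rows": range(201, 301)},
]

def scan_equal(value: int):
    """Only read extents whose [min, max] range could contain the value; skip the rest."""
    hits = []
    for ext in extents:
        if ext["min"] <= value <= ext["max"]:   # zone map says "might be here"
            hits.extend(r for r in ext["rows"] if r == value)
        # otherwise the whole extent is skipped without being read
    return hits

print(scan_equal(250))  # reads only extent 2
```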

Un-named Analytics Accelerator customer

Initial Load – 5.1GB – 1m 25s (24M rows) – 400GB – 29m (570M rows)

Query acceleration – Traditional DB2 – 2hrs 39m – Accelerator – 5s (nearly 2000x faster)

CPU utilization (24M rows) – Traditional DB2 – 56.6s – Accelerator – 0.4s (99% less)

Some Very Impressive Performance

Source: IDAA: A DB2 Insider Story for Users, Maryela Weihrauch, IBM, IDUG EMEA 2012

Accelerator v3.1 – High Performance Storage Saver (HPSS) – EC12 Support – Incremental Update – High Capacity: up to 1.28 Petabytes

Accelerator v4.1 – Supports Static SQL – Incremental Update & HPSS improvements – DB2 11 Support – Improved monitoring

The Latest Advances Selected enhancements

Presenter
Presentation Notes
HPSS – storage of data off-host in the Accelerator. Large disk capacity, low cost – great for archived 'static' data, with fantastic performance improvements when analysing it. All queries that touch that data run on the Accelerator.

In Summary

The ever increasing pool of data – Mostly unstructured and messy

The three Vs of Big Data

The ‘what’ not the ‘why?’

Some negatives to consider

Changing our traditional view – The questions to ask, analysing more data (known $ value is no longer a constraint), data in-motion vs data at rest.

Processing Big Data – The need for parallelism – Hadoop

Big Data in action

Big Data Summary

Structured data only at present

Things are moving forward quickly though (JSON)

And it still provides the ability to get answers to questions that were formerly out of scope due to cost and processing time

Netezza – FPGA – ZoneMaps

Very impressive performance figures

Big Data and DB2 Summary

Thank You. Any Questions?

[email protected]