bdia findings
TRANSCRIPT
Findings Webcast June 25, 2014
Big Data Information Architecture
Roundtable Webcast April 9, 2014
Exploratory Webcast January 22, 2014
#BigDataArch
Moore’s Law Cubed
u The biggest databases are NEW databases
u They grow at the cube of Moore’s Law
u Moore’s Law = 10x every 6 years
u VLDB: 1000x every 6 years
• 1991/2 megabytes
• 1997/8 gigabytes
• 2003/4 terabytes
• 2009/10 petabytes
• 2015/16 exabytes
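The arithmetic behind "Moore's Law cubed" can be sketched in a few lines. This is only an illustration of the figures on the slide: the function names and the choice of 1991 as the base year are assumptions made for the example, not part of the original material.

```python
# Sketch of the slide's arithmetic: Moore's Law (10x per 6 years) versus
# the largest databases, which grow at its cube (1000x per 6 years).
MOORE_FACTOR = 10                  # growth per 6-year period, from the slide
VLDB_FACTOR = MOORE_FACTOR ** 3    # "Moore's Law cubed" = 1000x per period
PERIOD_YEARS = 6
BASE_YEAR = 1991                   # assumed start of the slide's timeline

# Units the slide attaches to each successive 6-year period.
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes", "exabytes"]


def growth(factor, periods):
    """Total growth after a whole number of 6-year periods."""
    return factor ** periods


def vldb_unit(year):
    """Unit of the largest databases for the period containing `year`."""
    periods = (year - BASE_YEAR) // PERIOD_YEARS
    if not 0 <= periods < len(UNITS):
        raise ValueError("year is outside the timeline shown on the slide")
    return UNITS[periods]


if __name__ == "__main__":
    for periods in range(len(UNITS)):
        year = BASE_YEAR + periods * PERIOD_YEARS
        print(year, vldb_unit(year),
              "Moore:", growth(MOORE_FACTOR, periods),
              "VLDB:", growth(VLDB_FACTOR, periods))
```

Each step from one unit to the next (megabytes to gigabytes, and so on) is a factor of 1000, which is why one unit per 6-year period matches a 1000x growth rate.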
Observations…
u Software architectures change: centralized, C/S, 3 tier/web, SOA, etc.
u Applications migrate according to latencies
u Wholly new applications appear because of lower latencies, e.g., VMs, CEP
u THIS CURVE IS NO LONGER VALID…
Memory is Becoming Hierarchical Store
u On chip speed v RAM
• L1 (32K) = 100x
• L2 (256K) = 30x
• L3 (8-20MB) = 8.6x
u RAM v SSD
• RAM = 300x
u SSD v Disk
• SSD = 10x
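To get a feel for how far apart the ends of this hierarchy are, the ratios can be chained together. The sketch below assumes the quoted ratios simply multiply from tier to tier, which is a simplification made for this example; the function name is hypothetical.

```python
# Speed ratios between adjacent tiers, as quoted on the slide.
CACHE_VS_RAM = {"L1": 100, "L2": 30, "L3": 8.6}   # on-chip cache vs RAM
RAM_VS_SSD = 300                                  # RAM vs SSD
SSD_VS_DISK = 10                                  # SSD vs disk


def speedup_over_disk(tier):
    """Approximate speedup of `tier` relative to disk.

    Assumes the per-tier ratios can be multiplied together, which is a
    rough simplification rather than a measured result.
    """
    if tier == "disk":
        return 1
    if tier == "SSD":
        return SSD_VS_DISK
    ram = RAM_VS_SSD * SSD_VS_DISK
    if tier == "RAM":
        return ram
    return CACHE_VS_RAM[tier] * ram


if __name__ == "__main__":
    for tier in ["disk", "SSD", "RAM", "L3", "L2", "L1"]:
        print(f"{tier:>4}: about {speedup_over_disk(tier):,.0f}x disk")
```

Under that assumption RAM comes out roughly 3,000 times faster than disk and L1 cache roughly 300,000 times faster, which is the sense in which memory is becoming a hierarchical store.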
u Disk will soon turn up its toes
Note: Vector instructions and data compression
Putting a SoC in IT
u It’s possible that the CPU/Memory split will vanish, possibly soon
u This requires the emergence of the commodity SoC
u There are already ARM SoCs that run Linux
u Grids of SoCs would replace grids of servers
Parallelism: The Imp is Out of the Bottle
u Multicore chips enabled parallelism
u It has changed the whole performance equation
u It enabled Big Data
u Big Data is really Big Processing
Tech Revolutions
TECH REVOLUTION | ARCHITECTURE
Computer | Batch
On-line | Centralized
PC | Client/server
Internet | Multi-tier
Mobile | Service orientation
Internet of things | Event driven/Big Data
Event Types
u Instantiation Event
u A State Report
u A Trigger Event
u A Correction Event
We also need to consider: Data Refinement | Aggregations | Homogeneous collections | Derived Data
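One way to picture the four event types is as a small data model. The slide only names the types, so everything else here is an assumption made for illustration: the `Event` fields, the `current_state` helper, and the reading that instantiation, state report and correction events carry state while trigger events only request an action.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Iterable


class EventType(Enum):
    """The four event types named on the slide."""
    INSTANTIATION = "instantiation"
    STATE_REPORT = "state_report"
    TRIGGER = "trigger"
    CORRECTION = "correction"


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative only."""
    entity_id: str
    event_type: EventType
    payload: Dict[str, Any]
    timestamp: datetime


def current_state(events: Iterable[Event]) -> Dict[str, Any]:
    """Fold the events for one entity into its latest known state.

    Assumption: trigger events carry no state, and later events
    (including corrections) overwrite earlier values.
    """
    state: Dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.event_type is EventType.TRIGGER:
            continue
        state.update(event.payload)
    return state


if __name__ == "__main__":
    def at(hour):
        return datetime(2014, 6, 25, hour, tzinfo=timezone.utc)

    history = [
        Event("sensor-1", EventType.INSTANTIATION, {"location": "hall"}, at(9)),
        Event("sensor-1", EventType.STATE_REPORT, {"temp_c": 21.5}, at(10)),
        Event("sensor-1", EventType.TRIGGER, {"action": "alert"}, at(11)),
        Event("sensor-1", EventType.CORRECTION, {"temp_c": 20.5}, at(12)),
    ]
    print(current_state(history))
```

Derived data and aggregations would then be further functions over the same event stream, kept separate from the raw events.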
The Evolution of Hadoop
u Hadoop is far too useful and popular to fade away
u YARN and Tez have changed the picture
u Hadoop will become the default scale out file system
u And a critical component of the DATA HUB
Big Data and Analytics
There MAY be some Big Data applications that are not about data analytics.
If so, nobody is talking about them…
A Process, Not an Activity
u Data Analytics is a multi-disciplinary end-to-end process
u Until recently it was a walled garden. But the walls were torn down by:
• Data availability
• Scalable technology
• Open source tools
Data Flow (The Paradox)
Our Architectures need to cater for DATA FLOW,
not data at rest
However, DO NOT MOVE THE DATA unless you absolutely have to
The Corporate Data Flows
u There need to be two data flows (at minimum)
u Currently we can distinguish between:
• Real-time/business time applications
• Analytical applications
u We will build specific architectures for each
Data Flow
The role of Hadoop is as the STAGING AREA FOR REFINEMENT
And also as a SCALE-OUT FILE SYSTEM
The CRITICAL Workload Issue
u Previously, we viewed database workloads as an I/O optimization problem
u With the BDIA, the workload is a variable MIX of I/O, transformation and calculation
u No databases were built precisely for this – not even Big Data databases