Intro to Big Data


Page 1: Intro to Big Data

Zohar Elkayam CTO, Brillix

[email protected]: @realmgic

Introduction to Big Data

Page 2: Intro to Big Data

Agenda

• What is Big Data and the 3 Vs

• Introduction to Hadoop

• Who Handles Big Data and Data Science

• NoSQL


Page 3: Intro to Big Data

Who am I?

• Zohar Elkayam, CTO at Brillix

• Oracle ACE Associate

• DBA, team leader, instructor and senior consultant for over 16 years

• Editor (and manager) of ilDBA – Israel Database Community

• Blogger – www.realdbamagic.com


Page 4: Intro to Big Data

What is Big Data?


Page 5: Intro to Big Data


Page 6: Intro to Big Data
Page 7: Intro to Big Data

So, What is Big Data?

• When the data is too big or moves too fast to handle in a sensible amount of time.

• When the data doesn’t fit conventional database structure.

• When the solution becomes part of the problem.

Page 8: Intro to Big Data

Big Problems with Big Data

• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• Low quality
• And generally messy
• Oh, and there is a lot of it

Page 9: Intro to Big Data


Page 10: Intro to Big Data

Sample of Big Data Use Cases Today

• MEDIA/ENTERTAINMENT: viewers / advertising effectiveness
• COMMUNICATIONS: location-based advertising
• EDUCATION & RESEARCH: experiment sensor analysis
• CONSUMER PACKAGED GOODS: sentiment analysis of what's hot, problems
• HEALTH CARE: patient sensors, monitoring, EHRs; quality of care
• LIFE SCIENCES: clinical trials; genomics
• HIGH TECHNOLOGY / INDUSTRIAL MFG.: manufacturing quality; warranty analysis
• OIL & GAS: drilling exploration sensor analysis
• FINANCIAL SERVICES: risk & portfolio analysis; new products
• AUTOMOTIVE: auto sensors reporting location, problems
• RETAIL: consumer sentiment; optimized marketing
• LAW ENFORCEMENT & DEFENSE: threat analysis via social media monitoring and photo analysis
• TRAVEL & TRANSPORTATION: sensor analysis for optimal traffic flows; customer sentiment
• UTILITIES: smart meter analysis for network capacity
• ON-LINE SERVICES / SOCIAL MEDIA: people & career matching; web-site optimization

Page 11: Intro to Big Data

Most Requested Uses of Big Data

• Log Analytics & Storage
• Smart Grid / Smarter Utilities
• RFID Tracking & Analytics
• Fraud / Risk Management & Modeling
• 360° View of the Customer
• Warehouse Extension
• Email / Call Center Transcript Analysis
• Call Detail Record Analysis

Page 12: Intro to Big Data

The Challenge


Page 13: Intro to Big Data

The Big Data Challenge (3V)

Page 14: Intro to Big Data

Big Data: Challenge to Value

• Today's challenges: high variety, high volume, high velocity
• Tomorrow's business value: deep analytics, high agility, massive scalability, real time

Page 15: Intro to Big Data

Volume

• Big data comes in one size: big. Size is measured in terabytes, petabytes, and even exabytes and zettabytes.

• The storing and handling of the data becomes an issue.

• Producing value out of the data in a reasonable time is also an issue.

Page 16: Intro to Big Data

Velocity

• The speed at which the data is generated and collected.

• Streaming data and large-volume data movement.

• High velocity of data capture requires rapid ingestion.

• What happens during downtime (the backlog problem)?

Page 17: Intro to Big Data

Variety

• Big Data extends beyond structured data to include semi-structured and unstructured information: logs, text, audio, and video.

• Wide variety of rapidly evolving data types requires highly flexible stores and handling.

Page 18: Intro to Big Data

Big Data is ANY data

Unstructured, Semi-Structured, and Structured

• Some has fixed structure

• Some is "bring your own structure"

• We want to find value in all of it

Page 19: Intro to Big Data

Structured & Un-Structured

Un-Structured            Structured
Objects                  Tables
Flexible                 Columns and Rows
Structure Unknown        Predefined Structure
Textual and Binary       Mostly Textual

Page 20: Intro to Big Data

Handling Big Data


Page 21: Intro to Big Data
Page 22: Intro to Big Data

Big Data in Practice

• Big data is big: it needs technological infrastructure solutions.

• Big data is messy: data sources must be cleaned before use.

• Big data is complicated: it needs developers and system administrators to manage data intake.

Page 23: Intro to Big Data

Big Data in Practice (cont.)

• Data must be broken out of silos in order to be mined, analyzed and transformed into value.

• The organization must learn how to communicate and interpret the results of analysis.

Page 24: Intro to Big Data

Infrastructure Challenges

• Infrastructure that is built for:
  • Large-scale
  • Distributed
  • Data-intensive jobs that spread the problem across clusters of server nodes

Page 25: Intro to Big Data

Infrastructure Challenges – Cont.

• Storage:
  • Efficient and cost-effective enough to capture and store terabytes, if not petabytes, of data
  • With intelligent capabilities to reduce your data footprint, such as:
    • Data compression
    • Automatic data tiering
    • Data deduplication (sketched below)
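To make the last item concrete, here is a minimal sketch of content-based deduplication in Python (a toy model; the 4 KB chunk size and the in-memory dict standing in for the storage layer are illustrative assumptions, not any specific product's design):

import hashlib

def dedup_store(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks; keep one copy per unique chunk."""
    store = {}    # chunk hash -> chunk bytes (each unique chunk stored once)
    recipe = []   # ordered hashes needed to rebuild the original data
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # duplicates are not stored again
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    return b"".join(store[digest] for digest in recipe)

# Repetitive data deduplicates well: 100 identical blocks, stored once.
data = b"x" * (4096 * 100)
store, recipe = dedup_store(data)
assert rebuild(store, recipe) == data
print(len(recipe), "chunks referenced,", len(store), "stored")

The same idea (hash a chunk, store it once, keep references) is what lets a storage layer shrink the footprint of repetitive big data.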

Page 26: Intro to Big Data

Infrastructure Challenges – Cont.

• Network infrastructure that can quickly import large data sets and then replicate them to various nodes for processing

• Security capabilities that protect a highly distributed infrastructure and data

Page 27: Intro to Big Data

Intro to Hadoop


Page 28: Intro to Big Data

Apache Hadoop

• Open-source project run by Apache (since 2006).

• Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure.

• Apache Hadoop has been the driving force behind the growth of the big data industry.

Page 29: Intro to Big Data

Hadoop Creation History

Page 30: Intro to Big Data

Key points

• An open-source framework that uses a simple programming model to enable distributed processing of large data sets on clusters of computers.
• The complete technology stack includes:
  • common utilities
  • a distributed file system
  • analytics and data storage platforms
  • an application layer that manages distributed processing, parallel computation, workflow, and configuration management
• More cost-effective for handling large unstructured data sets than conventional approaches, and it offers massive scalability and speed.

Page 31: Intro to Big Data

Why use Hadoop?

• Scalability: near-linear performance up to 1000s of nodes
• Cost: leverages commodity HW & open source SW
• Flexibility: versatility with data, analytics & operation

Page 32: Intro to Big Data

Really, Why use Hadoop?

• Need to process multi-petabyte datasets
• Expensive to build reliability into each application
• Nodes fail every day:
  • Failure is expected, rather than exceptional
  • The number of nodes in a cluster is not constant
• Need a common infrastructure:
  • Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but:
  • Workloads are IO bound, not CPU bound

Page 33: Intro to Big Data

Hadoop Benefits

• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to leverage parallelism
• Designed to scale
• Flexible development platform
• Solution ecosystem

Page 34: Intro to Big Data

Hadoop Limitations

• Hadoop is scalable but not fast
• Some assembly required
• Batteries not included
• Instrumentation not included either
• DIY mindset (remember Linux/MySQL?)
• On the larger scale, Hadoop is not cheap (but still cheaper than older solutions)

Page 35: Intro to Big Data

Example Comparison: RDBMS vs. Hadoop

                      Typical Traditional RDBMS   Hadoop
Data Size             Gigabytes                   Petabytes
Access                Interactive and batch       Batch - NOT interactive
Updates               Read/write many times       Write once, read many times
Structure             Static schema               Dynamic schema
Scaling               Nonlinear                   Linear
Query Response Time   Can be near immediate       Has latency (due to batch processing)

Page 36: Intro to Big Data

Hadoop and Relational Database

Relational Database, best used for:
• Interactive OLAP analytics (<1 sec)
• Multistep transactions
• 100% SQL compliance

Hadoop, best used for:
• Structured or not (flexibility)
• Scalability of storage/compute
• Complex data processing
• Cheaper compared to RDBMS

Best when used together.

Page 37: Intro to Big Data

Hadoop Components


Page 38: Intro to Big Data

Hadoop Main Components

• HDFS: Hadoop Distributed File System, a distributed file system that runs in a clustered environment.

• MapReduce: a programming paradigm for running processes over clustered environments.

Page 39: Intro to Big Data

HDFS is...

• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System


Page 40: Intro to Big Data

HDFS Node Types

HDFS has three types of nodes:
• NameNode (master node)
  • Distributes files in the cluster
  • Responsible for replication between the DataNodes and for file block locations
• DataNodes
  • Responsible for the actual file storage
  • Serve file data to clients
• BackupNode (version 0.23 and up)
  • A backup of the NameNode

Page 41: Intro to Big Data

Typical implementation

• Nodes are commodity PCs
• 30-40 nodes per rack
• Uplink from racks is 3-4 gigabit
• Rack-internal is 1 gigabit

Page 42: Intro to Big Data

MapReduce is...

• A programming model for expressing distributed computations at a massive scale

• An execution framework for organizing and performing such computations

• An open-source implementation called Hadoop


Page 43: Intro to Big Data

MapReduce

Example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

• Runs programs (jobs) across many computers
• Protects against single-server failure by re-running failed steps
• MR jobs can be written in Java, C, Python, Ruby, etc.
• Users only write Map and Reduce functions
  • MAP: takes a large problem and divides it into sub-problems; performs the same function on all sub-problems
  • REDUCE: combines the output from all sub-problems
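Because streaming accepts any executable, the /bin/cat and /bin/wc placeholders above could be replaced by a Python word-count pair along these lines (a minimal sketch; the file names mapper.py and reducer.py are illustrative, and the reducer relies on Hadoop sorting the mapper output by key):

# mapper.py: read raw lines from stdin, emit one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: input arrives sorted by key, so equal words are adjacent
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    total = sum(int(count) for _, count in group)
    print(f"{word}\t{total}")

These would be wired in with something like -mapper mapper.py -reducer reducer.py, shipping the scripts to the cluster (e.g., with the generic -files option).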

Page 44: Intro to Big Data

Typical large-data problem

• Iterate over a large number of records
• Map: extract something of interest from each
• Shuffle and sort intermediate results
• Reduce: aggregate intermediate results
• Generate final output

(Dean and Ghemawat, OSDI 2004)

Page 45: Intro to Big Data

MapReduce paradigm

• Implement two functions:
  • Map(k1, v1) -> list(k2, v2)
  • Reduce(k2, list(v2)) -> list(v3)
• The framework handles everything else*
  • Values with the same key go to the same reducer

Page 46: Intro to Big Data


Divide and Conquer

Page 47: Intro to Big Data

MapReduce - word count example

function map(String name, String document):
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    totalCount = 0
    for each count in partialCounts:
        totalCount += count
    emit(word, totalCount)
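This pseudocode translates almost line for line into runnable Python; the sketch below adds a tiny in-memory shuffle to stand in for the framework (the two sample documents are made up for illustration):

from collections import defaultdict

def map_fn(name, document):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, partial_counts):
    """Reduce(k2, list(v2)) -> v3: sum the partial counts."""
    return word, sum(partial_counts)

documents = {"doc1": "big data is big", "doc2": "data is data"}

# Shuffle: the framework groups intermediate values by key, so values
# with the same key end up at the same reducer.
shuffled = defaultdict(list)
for name, doc in documents.items():
    for word, count in map_fn(name, doc):
        shuffled[word].append(count)

results = [reduce_fn(word, counts) for word, counts in shuffled.items()]
print(sorted(results))   # [('big', 2), ('data', 3), ('is', 2)]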


Page 48: Intro to Big Data

MapReduce Word Count Process


Page 49: Intro to Big Data

MapReduce is good for...

• Embarrassingly parallel algorithms

• Summing, grouping, filtering, joining

• Off-line batch jobs on massive data sets

• Analyzing an entire large dataset


Page 50: Intro to Big Data

MapReduce is ok for...

• Iterative jobs (e.g., graph algorithms)

• Each iteration must read/write data to disk

• IO and latency cost of an iteration is high


Page 51: Intro to Big Data

MapReduce is NOT good for...

• Jobs that need shared state/coordination• Tasks are shared-nothing• Shared-state requires scalable state store

• Low-latency jobs• Jobs on small datasets• Finding individual records


Page 52: Intro to Big Data

Improving Hadoop


Page 53: Intro to Big Data

Improving Hadoop

Core Hadoop is complicated, so tools were created to make things easier.

Improving programmability:
• Pig: a programming language that simplifies common Hadoop actions: loading, transforming, and sorting data
• Hive: enables Hadoop to operate as a data warehouse using SQL-like syntax

Page 54: Intro to Big Data

Pig

• Data flow processing
• Uses the Pig Latin query language
• Highly parallel, in order to distribute data processing across many servers
• Combines multiple data sources (files, HBase, Hive)

Page 55: Intro to Big Data

Hive

• Built on the MapReduce framework, so it generates MR jobs behind the scenes
• Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS/HBase
• Has partitioning and partition swapping
• Good for random sampling
• Example:

CREATE EXTERNAL TABLE vs_hdfs (
  site_id string,
  session_id string,
  time_stamp bigint,
  visitor_id bigint,
  row_unit string,
  evts string,
  biz string,
  plne string,
  dims string)
PARTITIONED BY (site string, day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE
LOCATION '/home/data/';

SELECT session_id,
       get_json_object(concat(tttt, "}"), '$.BY'),
       get_json_object(concat(tttt, "}"), '$.TEXT')
FROM (SELECT session_id,
             concat("{", regexp_replace(event, "\\[\\{|\\}\\]", ""), "}") tttt
      FROM (SELECT session_id,
                   get_json_object(plne, '$.PLine.evts[*]') pln
            FROM vs_hdfs_v1
            WHERE site='6964264' AND day='20120201' AND plne!='{}'
            LIMIT 10) t
      LATERAL VIEW explode(split(pln, "\\},\\{")) adTable AS event) t2;

Page 56: Intro to Big Data

Hadoop Technology Stack

• HDFS: persistence (Yahoo)
• MapReduce: parallel processing (Google)
• Hive: SQL query (Facebook)
• Pig: scripting (Yahoo)

Page 57: Intro to Big Data

Improving Hadoop (cont.)

For improving access:

• HBase: a column-oriented database that runs on HDFS.

• Sqoop: a tool designed to import data from relational databases into Hadoop (HDFS or Hive).

Page 58: Intro to Big Data

HBase

What is HBase and why should you use it?
• Huge volumes of randomly accessed data.
• There are no restrictions on the number of columns per row; the schema is dynamic.
• Consider HBase when you're loading data by key, searching data by key (or range), serving data by key, querying data by key, or when storing data by row that doesn't conform well to a schema.

HBase don'ts:
• It doesn't talk SQL, have an optimizer, or support transactions or joins. If you don't use any of these in your database application, then HBase could very well be the perfect fit.

Example:
create 'blogposts', 'post', 'image'                        --- create table
put 'blogposts', 'id1', 'post:title', 'Hello World'        --- insert value
put 'blogposts', 'id1', 'post:body', 'This is a blog post' --- insert value
put 'blogposts', 'id1', 'image:header', 'image1.jpg'       --- insert value
get 'blogposts', 'id1'                                     --- select records

Page 59: Intro to Big Data

Sqoop

What is Sqoop?
• A command-line tool for moving data between HDFS and relational database systems.
• You can download drivers for Sqoop from Microsoft to:
  • Import data/query results from SQL Server to Hadoop.
  • Export data from Hadoop to SQL Server.
• It's like BCP.

Example:
$ bin/sqoop import --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --hive-import

$ bin/sqoop export --connect 'jdbc:sqlserver://10.80.181.127;username=dbuser;password=dbpasswd;database=tpch' \
    --table lineitem --export-dir /data/lineitemData

Page 60: Intro to Big Data

Improving Hadoop (cont.)

• For improving coordination: Zookeeper

• For improving scheduling/orchestration: Oozie

• For improving UI: Hue

• Machine learning: Mahout

Page 61: Intro to Big Data

Hadoop Technology Ecosystem

Page 62: Intro to Big Data

Hadoop Tools

Page 63: Intro to Big Data


Hadoop Cluster

Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)

Page 64: Intro to Big Data

Hadoop In The Real World


Page 65: Intro to Big Data

Who uses Hadoop?

Page 66: Intro to Big Data

Big Data Market Survey

• Three major groups for rolling your own Big Data:
  • Integrated Hadoop providers
  • Analytical databases with Hadoop connectivity
  • Hadoop-centered companies
• Big Data on the cloud

Page 67: Intro to Big Data

Integrated Hadoop Providers

• IBM InfoSphere
  • Database: DB2
  • Deployment options: Software (Enterprise Linux), Cloud
  • Hadoop: Bundled distribution (InfoSphere BigInsights); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene
  • NoSQL: HBase

Page 68: Intro to Big Data

Integrated Hadoop Providers

• Microsoft
  • Database: SQL Server
  • Deployment options: Software (Windows Server), Cloud (Windows Azure)
  • Hadoop: Bundled distribution (Big Data Solution); Hive, Pig
  • NoSQL: None

Page 69: Intro to Big Data

Integrated Hadoop Providers

• Oracle
  • Database: None
  • Deployment options: Appliance (Oracle Big Data Appliance)
  • Hadoop: Bundled distribution (Cloudera's Distribution including Apache Hadoop); Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr
  • NoSQL: Oracle NoSQL Database

Page 70: Intro to Big Data

Integrated Hadoop Providers

• Pivotal Greenplum
  • Database: Greenplum Database
  • Deployment options: Appliance (Modular Data Computing appliance), Software (Enterprise Linux), Cloud (Cloud Foundry)
  • Hadoop: Bundled distribution (Pivotal HD); Hive, Pig, Zookeeper, HBase
  • NoSQL: HBase

Page 71: Intro to Big Data

Hadoop Centered Companies

• Cloudera: the longest-established Hadoop distribution.

• Hortonworks: a major contributor to the Hadoop code and core components.

• MapR.

Page 72: Intro to Big Data

Big Data and Cloud

• Some Big Data solutions can be provided using IaaS (Infrastructure as a Service).

• Private clouds can be constructed using Hadoop orchestration tools.

• Public clouds provided by Rackspace or Amazon EC2 can be used to start a Hadoop cluster.

Page 73: Intro to Big Data

Big Data and Cloud (cont.)

• PaaS (Platform as a Service) can be used to remove the need to configure or scale things yourself.

• The major PaaS Providers are Amazon, Google and Microsoft.

Page 74: Intro to Big Data

PaaS Services: Amazon

• Amazon:
  • Elastic MapReduce (EMR): MapReduce programs submitted to a cluster managed by Amazon. Good for EC2/S3 combinations.
  • DynamoDB: a NoSQL database provided by Amazon to replace HBase.

Page 75: Intro to Big Data

PaaS Services: Google

• Google:
  • BigQuery: an analytical database suitable for interactive analysis over datasets on the order of 1 TB.
  • Prediction API: a machine learning platform for classification and sentiment analysis, done with Google's tools on customer data.

Page 76: Intro to Big Data

PaaS Services: Microsoft

• Microsoft:
  • Windows Azure: a cloud computing platform and infrastructure that can be used as PaaS and as IaaS.

Page 77: Intro to Big Data

Who Handles Big Data… and how?


Page 78: Intro to Big Data

Big Data Readiness

• The R&D prototype stage
• Skills needed:
  • Distributed data deployment (e.g., Hadoop)
  • Python or Java programming with MapReduce
  • Statistical analysis (e.g., R)
  • Data integration
  • Ability to formulate business hypotheses
  • Ability to convey the business value of Big Data

Page 79: Intro to Big Data

Data Science

• A discipline that combines math, statistics, programming and scientific instinct with the goal of extracting meaning from data.

• Data scientists combine technical expertise, curiosity, storytelling, and cleverness to find and deliver the signal in the noise.

Page 80: Intro to Big Data

The Rise of the Data Scientist

• Data scientists are responsible for:
  • modeling complex business problems
  • discovering business insights
  • identifying opportunities

• Demand is high for people who can help make sense of the massive streams of digital information pouring into organizations.

Page 81: Intro to Big Data

New Roles and Skills

Big Data Scientist:
• Industry Expertise
• Analytics Skills

Big Data Engineers:
• Hadoop/Java
• Non-Relational DB

Both roles share agility and a focus on value.

Page 82: Intro to Big Data
Page 83: Intro to Big Data

Predictive Analytics

• Predictive analytics looks into the future to provide insight into what will happen, and includes what-if scenarios and risk assessment. It can be used for:
  • forecasting
  • hypothesis testing
  • risk modeling
  • propensity modeling

Page 84: Intro to Big Data

Prescriptive analytics

• Prescriptive analytics is focused on understanding what would happen based on different alternatives and scenarios, then choosing the best options and optimizing what's ahead. Use cases include:
  • customer cross-channel optimization
  • best-action-related offers
  • portfolio and business optimization
  • risk management

Page 85: Intro to Big Data

How Predictive Analytics Works

• Traditional BI tools use a deductive approach to data, which assumes some understanding of existing patterns and relationships.

• An analytics model approaches the data based on this knowledge.

• For obvious reasons, deductive methods work well with structured data.

Page 86: Intro to Big Data

Inductive approach

• An inductive approach makes no presumptions of patterns or relationships and is more about data discovery. Predictive analytics applies inductive reasoning to big data using sophisticated quantitative methods such as:
  • machine learning
  • neural networks
  • robotics
  • computational mathematics
  • artificial intelligence

• The goal is to explore all the data and discover the interrelationships and patterns within it.

Page 87: Intro to Big Data

Inductive approach – Cont.

• Inductive methods use algorithms to perform complex calculations specifically designed to run against highly varied or large volumes of data.

• The result of applying these techniques to a real-world business problem is a predictive model.

• The ability to know which algorithms and data to use to test and create the predictive model is part of the science and art of predictive analytics.

Page 88: Intro to Big Data

Share Nothing vs. Share Everything

Share Nothing                   Share Everything
Many processing engines         Many servers
Data is spread on many nodes    Data is located on a single storage
Joins are problematic           Efficient joins
Very scalable                   Limited scalability

Page 89: Intro to Big Data

Big Data and NoSQL


Page 90: Intro to Big Data

The Challenge

• We want scalable, durable, high-volume, high-velocity, distributed data storage that can handle non-structured data and that will fit our specific needs.

• RDBMS is too generic and doesn't cut it anymore: it can do the job, but it is not cost-effective for our usage.


Page 91: Intro to Big Data

The Solution: NoSQL

• Let's take some parts of the standard RDBMS out and design the solution for our specific uses.

• NoSQL databases have been around for ages under different names/solutions.


Page 92: Intro to Big Data

The NOSQL Movement

• NOSQL is not a technology; it's a concept.
• We need high performance, scale-out abilities, or an agile structure.
• We are now willing to sacrifice our sacred cows: consistency and transactions.
• There are over 150 different brands and solutions (http://nosql-database.org/).

Page 93: Intro to Big Data

NoSQL or NOSQL

• NoSQL is not "No to SQL"
• NoSQL is not "Never SQL"
• NOSQL = Not Only SQL

Page 94: Intro to Big Data

Why NoSQL?

• Some applications need very few database features, but need high scale.

• Desire to avoid data/schema pre-design altogether for simple applications.

• Need for a low-latency, low-overhead API to access data.
• Simplicity: no need for fancy indexing, just fast lookup by primary key.

Page 95: Intro to Big Data

Why NoSQL? (cont.)

• Developer friendly; DBAs not needed (?)
• Schema-less
• Agile: non-structured (or semi-structured)
• In memory
• No (or loose) transactions
• No joins

Page 96: Intro to Big Data
Page 97: Intro to Big Data

Is NoSQL an RDBMS Replacement?

NO

Well... sometimes it is...

Page 98: Intro to Big Data

RDBMS vs. NoSQL

Rationale for choosing a persistent store:

Relational Architecture                   NoSQL Architecture
High-value, high-density, complex data    Low-value, low-density, simple data
Complex data relationships                Very simple relationships
Schema-centric                            Schema-free, unstructured or semi-structured data
Designed to scale up & out                Distributed storage and processing
Lots of general-purpose features          Stripped-down, special-purpose data store
High overhead ($ per operation)           Low overhead ($ per operation)

Page 99: Intro to Big Data

Scalability and Consistency


Page 100: Intro to Big Data

Scalability

• NoSQL is sometimes very easy to scale out.

• Most solutions have dynamic data partitioning and easy data distribution.

• But distributed systems always come with a price: the CAP theorem and its impact on ACID transactions.


Page 101: Intro to Big Data

ACID Transactions

Most DBMSs are built with ACID transactions in mind:
• Atomicity: all or nothing; write operations are performed as a single transaction.
• Consistency: any transaction takes the DB from one consistent state to another, with no broken constraints; ensures replicas are identical on different nodes.
• Isolation: other operations cannot access data that has been modified during a transaction that has not yet completed.
• Durability: the ability to recover committed transaction updates from any kind of system failure (transaction log).


Page 102: Intro to Big Data

ACID Transactions (cont.)

• ACID is usually implemented by a locking mechanism/manager.
• In distributed systems, central locking can be a bottleneck.
• Most NoSQL stores do not use (or they limit) ACID transactions, and replace them with something else...


Page 103: Intro to Big Data

CAP Theorem

• The CAP theorem states that in a distributed/partitioned application, you can only pick two of the following three characteristics:

• Consistency.

• Availability.

• Partition Tolerance.

Page 104: Intro to Big Data

CAP in Practice


Page 105: Intro to Big Data

NoSQL BASE

• NoSQL usually provides BASE characteristics instead of ACID. BASE stands for:
  • Basically Available
  • Soft State
  • Eventual Consistency

• This means that when an update is made in one place, the other partitions will see it over time; there may be an inconsistency window (see the sketch below).

• Read and write operations complete more quickly, lowering latency.
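The inconsistency window can be illustrated with a toy Python model (an assumption-laden sketch: one replica takes writes and the others converge only when an explicit sync runs; real stores propagate updates continuously):

class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy BASE store: fast local writes, replicas converge on sync()."""
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []   # updates not yet propagated to all replicas

    def write(self, key, value):
        self.replicas[0].data[key] = value   # acknowledged immediately
        self.pending.append((key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        """Anti-entropy pass: push pending updates to every replica."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("user:1", "zohar")
print(store.read(1, "user:1"))   # None: inside the inconsistency window
store.sync()
print(store.read(1, "user:1"))   # 'zohar': the replicas have converged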

Page 106: Intro to Big Data

Eventual Consistency

Page 107: Intro to Big Data

Types of NoSQL


Page 108: Intro to Big Data

NoSQL Taxonomy

The main store types, with examples on the slides that follow:
• Key-Value Store
• Document Store
• Column Store
• Graph Store

Page 109: Intro to Big Data

NoSQL Map

(Figure: data stores positioned by data size and complexity/performance; the typical RDBMS occupies the "SQL comfort zone", with key-value stores, column stores, document databases, and graph databases spread beyond it.)

Page 110: Intro to Big Data

Key Value Store

• Distributed hash tables (sketched below).
• Very fast to get a single value.
• Examples:
  • Amazon DynamoDB
  • Berkeley DB
  • Redis
  • Riak
  • Cassandra
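The distributed-hash-table idea behind these stores can be sketched in a few lines of Python (a toy model with made-up node names; real systems add consistent hashing, replication, and rebalancing):

import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical cluster members

def owner(key: str) -> str:
    """Hash the key to deterministically pick the node that owns it."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Every client computes the same owner, so a single-key lookup touches
# exactly one node; this is why gets by key are so fast.
for key in ["user:1", "user:2", "session:9"]:
    print(key, "->", owner(key))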

Page 111: Intro to Big Data

Document Store

• Similar to key-value, but the value is a document.
• JSON or something similar; flexible schema.
• Agile technology.
• Examples:
  • MongoDB
  • CouchDB
  • Couchbase

Page 112: Intro to Big Data

Column Store

• One key, multiple attributes.
• Hybrid row/column.
• Examples:
  • Google BigTable
  • HBase
  • Amazon SimpleDB
  • Cassandra

Page 113: Intro to Big Data

How Are Records Organized?

• This is a logical table in RDBMS systems.
• Its physical organization is just like the logical one: column by column, row by row.

(Figure: a logical table of rows 1-4 by columns 1-4.)


Page 114: Intro to Big Data

Query Data

• When we query data, records are read in the order in which they are organized in the physical structure.
• Even when we query a single column, we still need to read the entire table and extract the column.

(Figure: SELECT Col2 FROM MyTable vs. SELECT * FROM MyTable over the same 4x4 table.)


Page 115: Intro to Big Data

How Does a Column Store Save Data?

(Figure: the same table's organization in a row store vs. in a column store.)
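The difference can be sketched in Python (a toy model of the same logical table in both layouts; the sample rows are made up): scanning one column touches a small contiguous slice in the columnar layout, but strides across every record in the row layout.

rows = [
    (1, "alice", 30, "IL"),
    (2, "bob",   40, "US"),
    (3, "carol", 25, "UK"),
    (4, "dave",  35, "FR"),
]
columns = ["id", "name", "age", "country"]

# Row store: all values of one record are adjacent.
row_store = [value for row in rows for value in row]

# Column store: all values of one column are adjacent.
column_store = {col: [row[i] for row in rows] for i, col in enumerate(columns)}

# "SELECT age FROM t" must step over every record in the row layout...
ages_from_rows = row_store[2::len(columns)]
# ...but reads one contiguous list in the columnar layout.
ages_from_columns = column_store["age"]
assert ages_from_rows == ages_from_columns == [30, 40, 25, 35]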

Page 116: Intro to Big Data

Graph Store

• Inspired by graph theory.
• Data model: nodes, relationships, and properties on both.
• Relational databases have a very hard time representing a graph in the database.
• Examples:
  • Neo4j
  • InfiniteGraph
  • RDF

Page 117: Intro to Big Data

What is a Graph?

• An abstract representation of a set of objects where some pairs are connected by links.
• Object (vertex, node): can have attributes like name and value.
• Link (edge, arc, relationship): can have attributes like type and name or date.
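A minimal Python sketch of this data model (the node and link names are illustrative, echoing the examples on the following slides):

# Nodes: each carries a property dict.
nodes = {
    "alice": {"type": "person", "age": 40},
    "nosql": {"type": "group"},
}

# Links: each connects two nodes and carries its own properties.
links = [
    ("alice", "nosql", {"type": "member", "since": 2012}),
]

def neighbors(node):
    """All nodes reachable from `node` in one hop, with link properties."""
    return [(dst, props) for src, dst, props in links if src == node]

print(neighbors("alice"))   # [('nosql', {'type': 'member', 'since': 2012})]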

Page 118: Intro to Big Data

Graph Types

• Undirected graph
• Directed graph
• Pseudo graph
• Multi graph

Page 119: Intro to Big Data

More Graph Types

• Weighted graph (e.g., an edge carrying the weight 10)
• Labeled graph (e.g., an edge labeled "Like")
• Property graph (e.g., nodes with properties Name: yosi, Age: 40 and Name: ami, Age: 30, linked by an edge "friend, date 2013")

Page 120: Intro to Big Data

Relationships

(Figure: a property graph with nodes such as {ID:1, TYPE:F, NAME:alice}, {ID:2, TYPE:M, NAME:bob}, {ID:1, TYPE:G, NAME:NoSQL}, and {ID:1, TYPE:F, NAME:dafna}, connected by relationships such as {TYPE: member, Since: 2012}.)

Page 121: Intro to Big Data
Page 122: Intro to Big Data

Conclusion

• Big Data is one of the hottest buzzwords of the last few years; we should all know what it's about.

• DBAs are often called upon to solve big data problems; today, DBAs need to know what to ask in order to provide good solutions, even if it's not a database-related issue.

• NoSQL doesn't have to be a Big Data solution, but Big Data often uses NoSQL solutions.


Page 123: Intro to Big Data

Thank You

Zohar Elkayam
Brillix

[email protected]
