5 Fundamental Strategies for Building a Data-centered Data Center

June 5, 2014
Ken Krupa, Chief Field Architect
Pete Aven, Principal Sales Engineer

© COPYRIGHT 2014 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED.


SLIDE: 2

Last generation

Diagram labels: OLTP; Warehouse; Data Marts; Archives; “Unstructured”; Video; Audio; Signals, Logs, Streams; Social; Documents, Messages; Metadata; Search; Reference Data

SLIDE: 3

Summary – The Data-centered Data Center

Elastic: flexible, shared-nothing, scale-out architecture

Cost competitive: low-cost commodity hardware, lower TCO

Converged: single data layer for operational and analytical workloads

Managing data life-cycle in real-time: prioritize your data storage

Governed, not renegade: customizable, transparent, secure

SLIDE: 4

ELASTIC

SLIDE: 5

Data organization in MarkLogic

• Data inserted into stands
• One stand is in-memory
• Many other stands are on disk
• A collection of stands is a forest
• Each forest is an atomic unit and can be managed and moved
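The stand/forest relationship above can be pictured with a toy model. This is only a sketch: the `Forest` class, the `max_in_memory` threshold, and the rule that a full in-memory stand is written out as a new on-disk stand are assumptions made for illustration, not MarkLogic internals or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Forest:
    """Toy model: one in-memory stand plus any number of on-disk stands."""

    max_in_memory: int = 2  # hypothetical flush threshold, for illustration only
    in_memory_stand: list = field(default_factory=list)
    on_disk_stands: list = field(default_factory=list)

    def insert(self, doc: str) -> None:
        # New data always lands in the single in-memory stand first.
        self.in_memory_stand.append(doc)
        if len(self.in_memory_stand) >= self.max_in_memory:
            # Assumption: a full in-memory stand becomes a new on-disk stand.
            self.on_disk_stands.append(tuple(self.in_memory_stand))
            self.in_memory_stand = []

    def stand_count(self) -> int:
        # The forest is the collection of all its stands.
        return 1 + len(self.on_disk_stands)


forest = Forest()
for name in ["a.xml", "b.xml", "c.xml"]:
    forest.insert(name)
```

After the three inserts, the forest holds one on-disk stand with two documents and an in-memory stand with the third.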

SLIDE: 6

Servers have Multiple Forests

SLIDE: 7

Scale out

SLIDE: 8

Clustering

SLIDE: 9

Clustering

SLIDE: 10

Migration

1. Two forests on one node
2. Bring a second node online
3. Replicate a forest
4. Disable the forest on the original node
5. Original forest on original node fails over
6. Enable the original forest as a replica

SLIDE: 11

Migration in one step

$ cat forest-migrate.json
{
  "operation": "forest-migrate",
  "forest": ["forest-in-database", "another-forest-in-database"],
  "host": "destination-host"
}

$ curl --anyauth --user user:password -X PUT -d @./forest-migrate.json \
    -i -H "Content-type: application/json" \
    http://anyhostinthecluster:8002/manage/v2/forests
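The same call can be prepared from a script. The sketch below only builds the request, using the payload, host name, port, and path shown on the slide. The helper name `build_forest_migrate_request` is hypothetical, and nothing is sent over the network.

```python
import json
import urllib.request


def build_forest_migrate_request(base_url, forests, destination_host):
    """Build (but do not send) the PUT request shown on the slide."""
    payload = {
        "operation": "forest-migrate",
        "forest": list(forests),
        "host": destination_host,
    }
    return urllib.request.Request(
        url=f"{base_url}/manage/v2/forests",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


# Placeholder host and forest names, copied from the slide.
request = build_forest_migrate_request(
    "http://anyhostinthecluster:8002",
    ["forest-in-database", "another-forest-in-database"],
    "destination-host",
)
```

Sending it would also need credentials, for example through an opener built with `urllib.request.HTTPDigestAuthHandler` or `HTTPBasicAuthHandler`, depending on how the server is configured.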

SLIDE: 12

Cluster topologies

Diagram labels: XA; RDBMS

SLIDE: 13

Knowing where you’re going - and where you’ve been

Business context
• What are your SLAs?
• How many requests per second does the application have to support?
• How will the business grow?
• What will drive growth - and how fast will it go?

As-Built Capacities
• How does your system perform under different usage profiles (e.g., QPS tests)?
• How often do you hit the cache?
• What is your peak storage I/O?
• What is end-to-end recovery objective/capability?

SLIDE: 14

Performance History

SLIDE: 15

SLIDE: 16

Performance History

To handle more requests:
• Fix configuration
• Add disk I/O via volumes or nodes
• Add RAM to decrease disk I/O
• Rewrite query

SLIDE: 17

Scaling out: questions to consider

Adding a node - or RAM

How do you know when you need to add a node?
• CPU/Memory/IO: when you get close to hardware limits, time to grow
• High Performance: SLAs may drive forest sizes; more docs, time to grow
• High Capacity: running low on storage, time to grow

Easy (temporary?) fix: add RAM
• Cheaper alternative
• Increases cache hits for better performance

Migrating a forest

• Fewer than three hosts, local forests MUST move across hosts
• Use forest migrate to move forests from one host to another
• Faster than backup/restore
• Follow distribution pattern: don’t just swap masters/replicas on two hosts; if one goes down, load is not split evenly across the cluster
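One way to picture the distribution-pattern advice is to rotate each host's replicas over all the other hosts, so that a failed host's forests fail over to several survivors rather than one. The sketch below is illustrative only: `place_replicas`, `failover_load`, and the host names are hypothetical and are not part of any MarkLogic tooling.

```python
from collections import Counter


def place_replicas(hosts, forests_per_host):
    """Return {(master_host, forest_index): replica_host}, rotating over the other hosts."""
    n = len(hosts)
    if n < 2:
        raise ValueError("need at least two hosts to place replicas")
    placement = {}
    for i, master in enumerate(hosts):
        for j in range(forests_per_host):
            # Offset is always in 1..n-1, so a replica never lands on its own master host.
            offset = 1 + (j % (n - 1))
            placement[(master, j)] = hosts[(i + offset) % n]
    return placement


def failover_load(placement, failed_host):
    """Count how many forests each surviving host takes over when failed_host goes down."""
    return Counter(
        replica for (master, _), replica in placement.items() if master == failed_host
    )


placement = place_replicas(["host-a", "host-b", "host-c"], forests_per_host=2)
load = failover_load(placement, "host-a")
```

With three hosts and two forests each, losing host-a hands one forest to host-b and one to host-c. A plain master/replica swap between two hosts would put the whole load on a single survivor.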

SLIDE: 18

LOWERING TCO THROUGH COMMODITY HARDWARE

SLIDE: 19

Kryder’s Law: The density of hard drives increases by a factor of 1,000 every 10.5 years (doubling every 13 months).
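The two figures in this statement are the same rate expressed two ways. As a quick check, the sketch below converts "a factor of 1,000 every 10.5 years" into a doubling period. The function name is an illustrative choice.

```python
import math


def doubling_period_months(growth_factor: float, years: float) -> float:
    """Months per doubling for a quantity that grows by growth_factor over the given years."""
    doublings = math.log2(growth_factor)  # a factor of 1,000 is just under 10 doublings
    return years * 12 / doublings


kryder_months = doubling_period_months(1000, 10.5)
```

The result is roughly 12.6 months, which the slide rounds to 13.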

SLIDE: 20

Moore’s Law: The density of transistors on integrated circuits doubles every 18 months.

SLIDE: 21

The laws in action

• At the end of a 3-year life cycle, one new server can do the job of four old servers.
• At 1.5 years, you can add 100% more capacity for 50% of original spend.
• For the cost of storing 1TB in 1996, we will be able to store 1PB in 2016.
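The first two claims follow from an 18-month doubling period. The third implies a halving period for storage cost that can be worked out the same way. This is only an arithmetic sketch: it assumes capacity per dollar tracks the doubling rate, and the variable names are illustrative.

```python
import math


def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """How many times over a quantity has grown after elapsed_months."""
    return 2 ** (elapsed_months / doubling_months)


# 3-year life cycle with an 18-month doubling: two doublings.
servers_replaced = growth_factor(36, 18)

# After one doubling, the original capacity costs half the original spend,
# so adding 100% more capacity costs 50% of what the first purchase did.
capacity_per_dollar = growth_factor(18, 18)
spend_fraction_for_same_capacity = 1 / capacity_per_dollar

# 1 TB (1996) to 1 PB (2016) at equal cost: a factor of 1,000 in 20 years.
implied_cost_halving_months = 20 * 12 / math.log2(1000)
```

Here `servers_replaced` is 4 and `spend_fraction_for_same_capacity` is 0.5. The 1996-to-2016 claim corresponds to storage cost halving about every 24 months.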

SLIDE: 22

Commodity hardware will reduce costs

SLIDE: 23

Hardware/sizing recommendations

2U 25 SFF Chassis 2 Socket

8 Core/2.8GHz

128GB – 256GB RAM

10GB Network

2 2GB RAID Cards

22 10K 900-1200GB Data Drives

SLIDE: 24

VMware NetApp recommendations (preliminary)

1U 8SFF Chassis 2 Socket

8 Core/2.8GHz

128GB – 256GB RAM

10GB Network

1 10GB iSCSI

12-16 Spindles per Server, 10K SAS

SLIDE: 25

Storage Economics

Option         | Cost            | What the budget buys           | IOPS      | Throughput
SAN/Scale-up   | $2 - $10/GB     | $1M gets 0.5 PB                | 200,000   | 1 GB/sec
NAS Filers     | $1 - $5/GB      | $1M gets 1 PB                  | 400,000   | 2 GB/sec
Local Compute  | $0.20/GB        | $1M gets 10 PB                 | 5,000,000 | 40 GB/sec
Public cloud   | $0.04/GB/month  | $100K/month gets 1.25 PB (HA)  | 1,500,000 | 150 GB/sec

Categories: SAN (Scale Up), Commodity (Scale Out), Cloud
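The capacity figures can be checked against the per-gigabyte prices with simple division. The sketch below assumes decimal units (1 PB = 1,000,000 GB) and assumes that "(HA)" means two full copies. Neither assumption is stated on the slide.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units


def petabytes_for_budget(budget_dollars: float, dollars_per_gb: float, copies: int = 1) -> float:
    """Usable petabytes a budget buys at a given price, divided across full copies."""
    return budget_dollars / dollars_per_gb / copies / GB_PER_PB


san_best_case = petabytes_for_budget(1_000_000, 2.00)    # low end of $2 - $10/GB
nas_best_case = petabytes_for_budget(1_000_000, 1.00)    # low end of $1 - $5/GB
local_compute = petabytes_for_budget(1_000_000, 0.20)
cloud_ha = petabytes_for_budget(100_000, 0.04, copies=2)  # per month, two copies assumed
```

The SAN, NAS, and cloud results match the slide (0.5 PB, 1 PB, and 1.25 PB). At the listed $0.20/GB the local-compute row computes to 5 PB rather than the slide's 10 PB. The slide's figure would correspond to $0.10/GB, so one of the two numbers presumably rests on an assumption not shown here.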

SLIDE: 26

Signs of war: cloud prices have dropped recently

Google Cloud:
- $0.04 GB-month for 1000GB

Amazon EBS:
- $0.055 GB-month (standard)
- $0.138 GB-month (provisioned)

SLIDE: 27

Leveraging Scale-out Economics

Run on existing Infrastructure today

Leverage Scale-Out Commodity Hardware as you grow

Leverage Cloud today or tomorrow

SAME DATABASE, SAME CODE

SLIDE: 28

DATA LAYER CONVERGENCE

FEWER MOVING PARTS = MORE AGILITY

SLIDE: 29

SLIDE: 30

SLIDE: 31

SLIDE: 32

Last generation

Diagram labels: OLTP; Warehouse; Data Marts; Archives; “Unstructured”; Video; Audio; Signals, Logs, Streams; Social; Documents, Messages; Metadata; Search; Reference Data

SLIDE: 33

RDBMS: One Tool, Many Contortions

• OLTP: 3rd normal form, updates, simple query
• Reporting DB: because the OLTP app slowed down during heavy query use
• Enterprise Data Warehouse: because we needed a unified view of the enterprise (star schema enters the picture)
• Data Marts: because the EDW didn’t have everything (also star schema)
• Federated: because it took too long to agree on a standard model
• Hybrid: because Federated is too slow

SLIDE: 34

The old consensus: mixing is bad

If I run analytics in my OLTP DB then...
• Won’t meet my SLAs
• Too expensive
• No common data model
• Cache won’t ever be right
• Too expensive to keep around context necessary for analytics

If I run transactions in my Analytical DB...
• Transaction locks will block aggregate reads
• Too expensive
• Why constrain ad-hoc query? We need to investigate

SLIDE: 35

The new wisdom: mixing is good

Operational with Analytics
• Risk calculations
• Underwriting
• Compliance
• Content Discovery
• Fraud

Analytics with Operations
• Operational BI
• Archival/E-Discovery
• Personalization
• Situational Awareness

SINGLE SOURCE OF TRUTH

SLIDE: 36

Mixing workloads in MarkLogic - how it works

ML as an analytic database - examples and possibilities

• Range indexes: in-memory columnar
• Query load separation
• Tiered storage and real-time replication
• Hadoop MapReduce and HDFS
• Transactions and ACID help manage and prioritize data - better performance, lower TCO

Operations and analytics in MarkLogic

COPIES, NOT ETL
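The "range indexes: in-memory columnar" point can be pictured as a sorted, in-memory list of values kept beside the documents, so that range and aggregate questions are answered from the index alone. The sketch below is a conceptual toy, not MarkLogic's implementation. The `RangeIndex` class and the trade documents are hypothetical.

```python
import bisect


class RangeIndex:
    """Toy value index: values kept sorted in memory, with doc ids held in parallel."""

    def __init__(self):
        self._values = []
        self._doc_ids = []

    def add(self, value, doc_id):
        # Insert at the sorted position so range lookups stay a binary search.
        position = bisect.bisect_right(self._values, value)
        self._values.insert(position, value)
        self._doc_ids.insert(position, doc_id)

    def between(self, low, high):
        """Return (value, doc_id) pairs with low <= value <= high, without opening documents."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return list(zip(self._values[start:end], self._doc_ids[start:end]))


index = RangeIndex()
for doc_id, amount in [("trade-1", 250), ("trade-2", 40), ("trade-3", 900), ("trade-4", 120)]:
    index.add(amount, doc_id)

matches = index.between(100, 500)
total = sum(value for value, _ in matches)
```

The aggregate (`total`) is computed from the index alone, which is the sense in which an operational store can also serve analytic queries.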

SLIDE: 37

INFORMATION LIFE-CYCLE MANAGEMENT (FOR REAL)

SLIDE: 38

Understanding the life cycle

The older your database, the more data you have

The older the data, the less likely you will reuse it

Storage requirements increase, but much of what is stored will go untouched

SLIDE: 39

Data life cycle management, in three easy steps

1. Move data off active system to cheaper system.

2. Keep track of what you moved.

3. Provide facility for getting it back.
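The three steps can be sketched as a minimal in-memory model. Everything here is hypothetical: the `ArchivingStore` class, the year-based cutoff, and the dictionaries standing in for the active and cheaper systems are chosen only to make the steps concrete.

```python
class ArchivingStore:
    """Toy model of the three steps: move, keep track, get it back."""

    def __init__(self):
        self.active = {}      # key -> (year, payload) on the active system
        self.archive = {}     # stands in for the cheaper system
        self.catalog = set()  # record of what was moved

    def archive_older_than(self, cutoff_year):
        old_keys = [key for key, (year, _) in self.active.items() if year < cutoff_year]
        for key in old_keys:
            self.archive[key] = self.active.pop(key)  # step 1: move off the active system
            self.catalog.add(key)                     # step 2: keep track of what moved

    def restore(self, key):
        if key not in self.catalog:
            raise KeyError(key)
        self.active[key] = self.archive.pop(key)      # step 3: get it back
        self.catalog.remove(key)


store = ArchivingStore()
store.active = {"order-1": (2009, "payload-1"), "order-2": (2014, "payload-2")}
store.archive_older_than(2012)
```

After the call, `order-1` lives only in the archive and is listed in the catalog. `store.restore("order-1")` would bring it back.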

SLIDE: 40

CERN: implementation is hard in the RDBMS world

DBAs / database developers cannot easily implement these policies by themselves.

Database admins, application developers, and application owners must work together to:
• Reduce amount of data produced
• Allow for database structure that can facilitate archiving
• Define data availability requirements for online data and archive
• Identify how to leverage database features

SLIDE: 41

CERN: archiving RDBMS data is also difficult

The DBA removes old partitions from the production database and moves them to the archive.
• One option: use partition exchange to table
• Post-move jobs can implement compression, drop indexes

Sticking points:
• Set of data must be consistent
• Must build support in the application
• Have to validate access to archived data
• Archived data must remain readable in future versions

SLIDE: 42

Tiered Storage

With Tiered Storage, you can…
• Define data tiers based on a range index
• Have content balanced into forests by tier
• Move an entire tier to different storage
• Attach a tier to a different database
• Query one database on one tier…
• …or the other database on the other tier…
• …or both at once

All with no downtime, and 100% consistency
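Defining tiers on a range index amounts to mapping each document's indexed value to the tier whose range contains it. The sketch below shows that mapping for a date value. The tier names, boundary dates, and the `tier_for` function are hypothetical, and this is not the MarkLogic configuration syntax.

```python
import bisect
from datetime import date

# Hypothetical tiers: each one starts at its lower bound and runs to the next bound.
TIER_LOWER_BOUNDS = [date(2000, 1, 1), date(2012, 1, 1), date(2014, 1, 1)]
TIER_NAMES = ["archive", "warm", "hot"]


def tier_for(document_date: date) -> str:
    """Pick the tier whose date range contains document_date."""
    position = bisect.bisect_right(TIER_LOWER_BOUNDS, document_date) - 1
    if position < 0:
        raise ValueError("date precedes the oldest tier")
    return TIER_NAMES[position]


assignments = {
    "doc-2005": tier_for(date(2005, 7, 1)),
    "doc-2013": tier_for(date(2013, 3, 1)),
    "doc-2014": tier_for(date(2014, 6, 5)),
}
```

Because the assignment depends only on the indexed value, moving a tier's boundary changes where content belongs without touching the documents themselves.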

SLIDE: 43

Effective Cost/TB for Production Storage (all copies)

Chart, Tier-1: Tier 1 SAN, Exadata, ML using DAS (axis 0 to 60,000)
Chart, Tier-2: FlexPod/VCE, NetApp, ML using DAS (axis 0 to 9,000)

SLIDE: 44

GOVERNANCE + PROVENANCE

SLIDE: 45-50

Data Governance Considerations

• Security
• Privacy
• Provenance
• Retention
• Continuity
• Compliance

SLIDE: 51

Last Generation

Diagram labels: OLTP; Warehouse; Data Marts; Archives; “Unstructured”; Video; Audio; Signals, Logs, Streams; Social; Documents, Messages; Metadata; Search; Reference Data

SLIDE: 52

New Generation

Application

SLIDE: 54

Summary

Elastic systems let you respond rapidly to changing loads - and let you keep costs in line with usage.

Scale-out systems on commodity hardware are much less expensive and more powerful than scale-up systems.

Converging transactional and analytical workloads into a single data layer is not only possible - it is often a great idea. A single data layer can increase agility.

Managing information throughout its life cycle means more than choosing the cheapest storage possible - it means being able to manage and query data in real time.

Proper data governance is simpler in an enterprise NoSQL system.


Q&A