London Ceph Day Keynote: Building Tomorrow's Ceph

TRANSCRIPT

Building Tomorrow's Ceph

Sage Weil

Research beginnings

UCSC research grant

Petascale object storage

DOE: LANL, LLNL, Sandia

Scalability

Reliability

Performance

Raw IO bandwidth, metadata ops/sec

HPC file system workloads

Thousands of clients writing to the same file or directory

Distributed metadata management

Innovative design

Subtree-based partitioning for locality, efficiency

Dynamically adapt to current workload

Embedded inodes

Prototype simulator in Java (2004)

First line of Ceph code

Summer internship at LLNL

High security national lab environment

Could write anything, as long as it was OSS

The rest of Ceph

RADOS distributed object storage cluster (2005)

EBOFS local object storage (2004/2006)

CRUSH hashing for the real world (2005)

Paxos monitors cluster consensus (2006)

Emphasis on consistent, reliable storage

Scale by pushing intelligence to the edges

A different but compelling architecture

Industry black hole

Many large storage vendors

Proprietary solutions that don't scale well

Few open source alternatives (2006)

Very limited scale, or

Limited community and architecture (Lustre)

No enterprise feature sets (snapshots, quotas)

PhD grads all built interesting systems...

...and then went to work for NetApp, DDN, EMC, Veritas.

They want you, not your project

A different path

Change the world with open source

Do what Linux did to Solaris, IRIX, Ultrix, etc.

What could go wrong?

License

GPL, BSD...

LGPL: share changes, okay to link to proprietary code

Avoid unsavory practices

Dual licensing

Copyright assignment

Incubation

DreamHost!

Move back to LA, continue hacking

Hired a few developers

Pure development

No deliverables

Ambitious feature set

Native Linux kernel client (2007-)

Per-directory snapshots (2008)

Recursive accounting (2008)

Object classes (2009)

librados (2009; usage sketch below)

radosgw (2009)

strong authentication (2009)

RBD: rados block device (2010)
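Since librados shows up repeatedly in this story, here is a minimal usage sketch with the Python rados bindings; the config path and the pool name "data" are assumptions for illustration, not details from the talk.

```python
# Minimal librados sketch (Python rados bindings). Assumes a running cluster,
# a readable /etc/ceph/ceph.conf, and an existing pool named "data".
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()                    # authenticate and contact the monitors

ioctx = cluster.open_ioctx('data')   # I/O context bound to one pool
try:
    ioctx.write_full('hello-object', b'Hello, RADOS!')  # store an object
    print(ioctx.read('hello-object'))                   # read it back
finally:
    ioctx.close()
    cluster.shutdown()
```

radosgw and RBD from the list above are layered on top of this same librados object interface.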

The kernel client

ceph-fuse was limited, not very fast

Build native Linux kernel implementation

Began attending Linux file system developer events (LSF)

Early words of encouragement from ex-Lustre devs

Engage Linux fs developer community as peer

Initial merge attempts rejected by Linus

Not sufficient evidence of user demand

A few fans and would-be users chimed in...

Eventually merged for v2.6.34 (early 2010)

Part of a larger ecosystem

Ceph need not solve all problems as monolithic stack

Replaced EBOFS object file system with btrfs

Same design goals

Avoid reinventing the wheel

Robust, well-supported, well optimized

Kernel-level cache management

Copy-on-write, checksumming, other goodness

Contributed some early functionality

Cloning files

Async snapshots

Budding community

#ceph on irc.oftc.net, the ceph-devel mailing list

Many interested users

A few developers

Many fans

Too unstable for any real deployments

Still mostly focused on right architecture and technical solutions

Road to product

DreamHost decides to build an S3-compatible object storage service with Ceph

Stability

Focus on core RADOS, RBD, radosgw

Paying back some technical debt

Build testing automation

Code review!

Expand engineering team

The reality

Growing incoming commercial interest

Early attempts from organizations large and small

Difficult to engage with a web hosting company

No means to support commercial deployments

Project needed a company to back it

Fund the engineering effort

Build and test a product

Support users

Bryan built a framework to spin out of DreamHost

Launch

Do it right

How do we build a strong open source company?

How do we build a strong open source community?

Models?

Red Hat, Cloudera, MySQL, Canonical, ...

Initial funding from DreamHost, Mark Shuttleworth

Goals

A stable Ceph release for production deployment

DreamObjects

Lay foundation for widespread adoption

Platform support (Ubuntu, Red Hat, SUSE)

Documentation

Build and test infrastructure

Build a sales and support organization

Expand engineering organization

Branding

Early decision to engage professional agency

MetaDesign

Terms like

Brand core

Design system

Project vs. Company

Shared / Separate / Shared core

Inktank != Ceph

Aspirational messaging: The Future of Storage

Slick graphics

Broken PowerPoint template

Today: adoption

Traction

Too many production deployments to count

We don't know about most of them!

Too many customers (for me) to count

Growing partner list

Lots of inbound

Lots of press and buzz

Quality

Increased adoption means increased demands on robust testing

Across multiple platforms

Include platforms we don't like

Upgrades

Rolling upgrades

Inter-version compatibility

Expanding user community + less noise about bugs = a good sign

Developer community

Significant external contributors

First-class feature contributions from external contributors

Non-Inktank participants in daily Inktank stand-ups

External access to build/test lab infrastructure

Common toolset

GitHub

Email (kernel.org)

IRC (oftc.net)

Linux distros

CDS: Ceph Developer Summit

Community process for building project roadmap

100% online

Google Hangouts

Wikis

Etherpad

First was this Spring, second is next week

Great feedback, growing participation

Indoctrinating our own developers into an open development model

The Future

Governance

How do we strengthen the project community?

2014 is the year

Might formally acknowledge my role as BDL

Recognized project leads (RBD, RGW, RADOS, CephFS)

Formalize processes around CDS, community roadmap

External foundation?

Technical roadmap

How do we reach new use-cases and users?

How do we better satisfy existing users?

How do we ensure Ceph can succeed in enough markets for Inktank to thrive?

Enough breadth to expand and grow the community

Enough focus to do well

Tiering

Client-side caches are great, but only buy so much.

Can we separate hot and cold data onto different storage devices?

Cache pools: promote hot objects from an existing pool into a fast (e.g., FusionIO) pool

Cold pools: demote cold data to a slow, archival pool (e.g., erasure coding)

How do you identify what is hot and cold? (toy sketch below)

Common in enterprise solutions; not found in open source scale-out systems

Key topic at CDS next week
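As a purely illustrative aside on the "what is hot and cold?" question above, here is a toy Python sketch of a sliding-window promotion policy. The names, thresholds, and structure are made up for illustration; this is not Ceph's cache-pool mechanism.

```python
# Toy illustration of a hot/cold classifier for a cache tier.
# NOT Ceph's implementation; object names and thresholds are hypothetical.
import time
from collections import defaultdict, deque

WINDOW_SECS = 300    # how far back we look for "recent" accesses
PROMOTE_HITS = 3     # accesses within the window that make an object "hot"

class HotColdTracker:
    def __init__(self):
        self.accesses = defaultdict(deque)   # object name -> recent access times

    def record_access(self, obj):
        now = time.time()
        q = self.accesses[obj]
        q.append(now)
        # Drop accesses that fell out of the sliding window.
        while q and now - q[0] > WINDOW_SECS:
            q.popleft()

    def is_hot(self, obj):
        return len(self.accesses[obj]) >= PROMOTE_HITS

tracker = HotColdTracker()
for _ in range(3):
    tracker.record_access("object-A")   # accessed repeatedly: hot
tracker.record_access("object-B")       # accessed once: cold
print(tracker.is_hot("object-A"), tracker.is_hot("object-B"))  # True False
```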

Erasure coding

Replication for redundancy is flexible and fast

For larger clusters, it can be expensive

Erasure coded data is hard to modify, but ideal for cold or read-only objects

Cold storage tiering

Will be used directly by radosgw

                  Storage overhead   Repair traffic   MTTDL (days)
3x replication    3x                 1x               2.3 E10
RS (10, 4)        1.4x               10x              3.3 E13
LRC (10, 6, 5)    1.6x               5x               1.2 E15
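The replication and Reed-Solomon rows of that table follow from simple arithmetic; the short sketch below reproduces the storage-overhead and repair-traffic columns using the standard formulas (back-of-the-envelope math, not Ceph code). The LRC row depends on the locality parameters and is omitted here.

```python
# Back-of-the-envelope check of the replication vs. Reed-Solomon rows above.
# Standard textbook formulas, not taken from Ceph.

def replication(copies):
    # Overhead: raw bytes stored per logical byte; repair: data read per lost chunk.
    return {"overhead": copies, "repair_traffic": 1}

def reed_solomon(k, m):
    # k data chunks plus m coding chunks; any chunk is rebuilt from k survivors.
    return {"overhead": (k + m) / k, "repair_traffic": k}

print(replication(3))        # {'overhead': 3, 'repair_traffic': 1}
print(reed_solomon(10, 4))   # {'overhead': 1.4, 'repair_traffic': 10}
```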

Multi-datacenter, geo-replication

Ceph was originally designed for single DC clusters

Synchronous replication

Strong consistency

Growing demand

Enterprise: disaster recovery

ISPs: replicating data across sites for locality

Two strategies:

Use-case specific: radosgw, RBD

low-level capability in RADOS

RGW: Multi-site and async replication

Multi-site, multi-cluster

Regions: east coast, west coast, etc.

Zones: radosgw sub-cluster(s) within a region

Can federate across same or multiple Ceph clusters

Sync user and bucket metadata across regions

Global bucket/user namespace, like S3

Synchronize objects across zones

Within the same region

Across regions

Admin control over which zones are master/slave

RBD: simple DR via snapshots

Simple backup capability

Based on block device snapshots

Efficiently mirror changes between consecutive snapshots across clusters (workflow sketch below)

Now supported/orchestrated by OpenStack

Good for coarse synchronization (e.g., hours)

Not real-time
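As a sketch of what this workflow can look like in practice, the snippet below drives rbd snapshot and export-diff/import-diff commands from Python. The cluster config paths, pool/image names, and snapshot names are placeholders; the talk does not prescribe this exact script.

```python
# Sketch of coarse-grained RBD mirroring between two clusters using
# incremental snapshot diffs. Paths and names are hypothetical placeholders.
import subprocess

SRC_CONF = "/etc/ceph/site-a.conf"   # source cluster config (assumed path)
DST_CONF = "/etc/ceph/site-b.conf"   # destination cluster config (assumed path)
IMAGE = "rbd/myimage"

def run(*args):
    subprocess.run(args, check=True)

# 1. Take a new snapshot on the source cluster.
run("rbd", "--conf", SRC_CONF, "snap", "create", f"{IMAGE}@snap2")

# 2. Export only the changes since the previous snapshot.
run("rbd", "--conf", SRC_CONF, "export-diff", "--from-snap", "snap1",
    f"{IMAGE}@snap2", "delta-snap1-snap2")

# 3. Apply the diff to the destination image (which already has snap1).
run("rbd", "--conf", DST_CONF, "import-diff", "delta-snap1-snap2", IMAGE)
```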

Async replication in RADOS

One implementation to capture multiple use-cases

RBD, CephFS, RGW, RADOS

A harder problem

Scalable: 1000s of OSDs

Point-in-time consistency

Three challenges

Infer a partial ordering of events in the cluster

Maintain a stable timeline to stream from

Either checkpoints or an event stream

Coordinated roll-forward at the destination

Do not apply any update until we know we have everything that happened before it (toy sketch below)
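To illustrate just that third challenge, here is a toy Python sketch of coordinated roll-forward at a destination: updates may arrive out of order, and nothing is applied until every earlier sequence number has arrived. This is an illustration of the idea, not a design from the talk; sequence numbers stand in for whatever partial ordering the cluster would actually provide.

```python
# Toy illustration of coordinated roll-forward: buffer out-of-order updates
# and apply them only once everything earlier has arrived.

class RollForwardBuffer:
    def __init__(self, apply_fn):
        self.apply_fn = apply_fn
        self.next_seq = 1      # next sequence number we are allowed to apply
        self.pending = {}      # seq -> update, held until it is safe

    def receive(self, seq, update):
        self.pending[seq] = update
        # Drain every contiguous update starting at next_seq.
        while self.next_seq in self.pending:
            self.apply_fn(self.pending.pop(self.next_seq))
            self.next_seq += 1

buf = RollForwardBuffer(lambda u: print("applied", u))
buf.receive(2, "write B")   # buffered: update 1 has not arrived yet
buf.receive(3, "write C")   # still buffered
buf.receive(1, "write A")   # now 1, 2, 3 all apply, in order
```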

CephFS

This is where it all started; let's get there

Today

QA coverage and bug squashing continues

NFS and CIFS now largely complete and robust

Need

Multi-MDS

Directory fragmentation

Snapshots

QA investment

Amazing community effort

The larger ecosystem

Big data

When will we stop talking about MapReduce?

Why is big data built on such a lame storage model?

Move computation to the data

Evangelize RADOS classes

librados case studies and proof points

Build a general purpose compute and storage platform

The enterprise

How do we pay for all our toys?

Support legacy and transitional interfaces

iSCSI, NFS, pNFS, CIFS

VMware, Hyper-V

Identify the beachhead use-cases

Only takes one use-case to get in the door

Earn others later

Single platform shared storage resource

Bottom-up: earn respect of engineers and admins

Top-down: strong brand and compelling product

Why we can beat the old guard

It is hard to compete with free and open source software

Unbeatable value proposition

Ultimately a more efficient development model

It is hard to manufacture community

Strong foundational architecture

Native protocols, Linux kernel support

Unencumbered by legacy protocols like NFS

Move beyond traditional client/server model

Ongoing paradigm shift

Software-defined infrastructure, data center

Thank you, and Welcome!
