
Email Storage with Ceph

Deutsche Telekom AG
Danny Al-Gaaf

Telekom Mail Platform

Telekom Mail

DT's mail platform for customers
Dovecot
Network-Attached Storage (NAS), NFS (sharded)
~1.3 petabyte net storage
~39 million accounts

NFS Operations

NFS Traffic

NFS Statistics

~42% usable raw space

NFS IOPS

max: ~835,000, avg: ~390,000

relevant IO:

WRITE: 107,700 / 50,000; READ: 65,700 / 30,900

Email Statistics

6.7 billion emails

1.2 petabyte net (compression)

1.2 billion index/cache/metadata files

avg: 24 KiB, max: ~600 MiB

Email Distribution

How are emails stored?

Emails are written once, read many (WORM)

Usage depends on:

protocol (IMAP vs. POP3)
user frontend (mailer vs. webmailer)

usually separated metadata, caches and indexes

loss of metadata/indexes is critical

without attachments, easy to compress

Where to store emails?

Filesystem

maildir, mailbox

Database

SQL

Object store

S3, Swift

Ceph

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby and PHP

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

[Diagram: APP, APP, HOST/VM and CLIENT on top of LIBRADOS, RADOSGW, RBD and CEPH FS]

Motivation

Scale-out vs. scale-up
Fast self healing
Commodity hardware
Prevent vendor lock-in
Open source where feasible
Reduce Total Cost of Ownership (TCO)

Ceph Options

Filesystem

CephFS
NFS Gateway via RGW
Any filesystem on RBD

Object store

S3/Swift via RGW
RADOS

Where to store in Ceph?

CephFS

same issues as NFS
mail storage on POSIX layer adds complexity
no option for emails
usable for metadata/caches/indexes

[Diagram: Linux host with kernel module mounting CephFS from the RADOS cluster; metadata and data paths handled separately]

Where to store in Ceph?

RBD

needs sharding and large RBDs
needs account migration
needs RBD/fs extend scenarios
no sharing between clients
impracticable

[Diagram: VM on a hypervisor using librbd to access the RADOS cluster]

Where to store in Ceph?

RadosGW

can store emails as objects
extra network hops
potential bottleneck
very likely not fast enough

[Diagram: application talks REST to RADOSGW, which uses librados over a socket to reach the RADOS cluster]

Where to store in Ceph?

Librados

direct access to RADOS
parallel I/O (see the sketch below)
not optimized for emails
how to handle metadata/caches/indexes?

[Diagram: applications link librados directly and talk to the RADOS cluster in parallel]
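To make the librados option concrete, here is a minimal C++ sketch of writing one mail object asynchronously; the pool name "mail_storage", the object ID and the message text are illustrative assumptions, not details from this talk.

// Minimal librados sketch (assumed pool name and object ID).
#include <rados/librados.hpp>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                // connect as client.admin
  cluster.conf_read_file(nullptr);      // read the default ceph.conf
  if (cluster.connect() < 0)
    return 1;

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);

  std::string mail = "From: user@example.org\r\n\r\nHello\r\n";
  librados::bufferlist bl;
  bl.append(mail);

  // Asynchronous write: many of these can be in flight in parallel.
  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  io.aio_write_full("mail-guid-0001", c, bl);
  c->wait_for_complete();
  int r = c->get_return_value();
  c->release();

  cluster.shutdown();
  return r == 0 ? 0 : 1;
}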

Dovecot and Ceph

Dovecot

Open source project (LGPL 2.1, MIT)

72% market share (openemailsurvey.org, 02/2017)

Object store plugin available (obox)

supports only REST APIs like S3/Swift
not open source
requires Dovecot Pro
large impact on TCO

Dovecot Pro obox Plugin

[Diagram: IMAP4/POP3/LMTP process -> Storage API -> dovecot obox backend (metacache, RFC 5322 mails) -> fs API -> fscache backend and object store backend; RFC 5322 objects and index & cache bundles live in the object store, while the mail cache and local index & cache are kept on local storage and synced with the object store]

DT's approach

no open source solution on the market
closed source is no option
develop/sponsor a solution and open source it
partner with:
Wido den Hollander (42on.com) for consulting
Tallence AG for development
SUSE for Ceph

Ceph plugin for Dovecot

First step: hybrid approach

Emails

Store in RADOS cluster

Metadata and indexes

Store in CephFS

Be as generic as possible

Split out code into libraries
Integrate into corresponding upstream projects

[Diagram: Mail User Agent connects via IMAP/POP to Dovecot on the Ceph client; the rbox storage plugin uses librmb/librados for mails and CephFS via the Linux kernel for indexes, both backed by the RADOS cluster]

Librados mailbox (librmb)

Generic email abstraction on top of librados

Out of scope:

User data and credential storage: targets are huge installations where solutions are usually already in place

Full text indexes: solutions are already available and working outside email storage

Librados mailbox (librmb)

[Diagram: IMAP4/POP3/LMTP process -> Storage API -> dovecot rbox backend; librmb/librados stores RFC 5322 objects in the RADOS cluster, dovecot lib-index keeps index & metadata on CephFS via the Linux kernel client, with RFC 5322 mails cached on local storage]

librmb - Mail Object Format

Mails are immutable regarding the RFC 5322 content

RFC 5322 content stored in RADOS directly

Immutable attributes used by Dovecot stored in RADOS xattr:

rbox format version
GUID
Received and saved date
POP3 UIDL and POP3 order
Mailbox GUID
Physical and virtual size
Mail UID

Writable attributes are stored in Dovecot index files
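As a rough illustration of this layout, the following C++ sketch writes one RFC 5322 message and attaches two immutable attributes as RADOS xattrs in a single atomic operation; the single-letter keys mirror the rmb listing on the next slide, but treat the exact key names and encoding as assumptions rather than librmb's actual format.

// Sketch: mail body plus immutable attributes in one atomic RADOS operation.
// Key names ("M", "R") and the string encoding are illustrative assumptions.
#include <rados/librados.hpp>
#include <ctime>
#include <string>

// Assumes 'io' is a librados::IoCtx already opened on the mail pool.
int store_mail(librados::IoCtx &io,
               const std::string &oid,          // mail GUID used as object ID
               const std::string &rfc5322,      // full RFC 5322 message
               const std::string &mailbox_guid,
               time_t received) {
  librados::ObjectWriteOperation op;

  librados::bufferlist mail_bl;
  mail_bl.append(rfc5322);
  op.write_full(mail_bl);                       // immutable message content

  librados::bufferlist mbox_bl, recv_bl;
  mbox_bl.append(mailbox_guid);
  recv_bl.append(std::to_string(received));
  op.setxattr("M", mbox_bl);                    // mailbox GUID
  op.setxattr("R", recv_bl);                    // receive time

  return io.operate(oid, &op);                  // applied atomically
}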

Dump email details from RADOS

$> rmb -p mail_storage -N t1 ls M=ad54230e65b49a59381100009c60b9f7

mailbox_count: 1

MAILBOX: M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7
mail_total=2, mails_displayed=2
mailbox_size=5539 bytes

MAIL: U(uid)=4
oid=a2d69f2868b49a596a1d00009c60b9f7
R(receive_time)=Tue Jan 14 00:18:11 2003
S(save_time)=Mon Aug 21 12:22:32 2017
Z(phy_size)=2919 V(v_size)=2919 stat_size=2919
M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7
G(mail_guid)=a3d69f2868b49a596a1d00009c60b9f7
I(rbox_version): 0.1
[..]

RADOS Dictionary Plugin

Makes use of the Ceph omap key/value store

RADOS namespaces:

shared/<key>
priv/<key>

Used by Dovecot to store metadata, quota, ...
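A rough sketch of what this omap-based dictionary access could look like via librados; the object name, namespace handling and key layout are illustrative assumptions, not the plugin's actual on-disk format.

// Sketch: dictionary entry stored in a RADOS object's omap, isolated by
// a per-user RADOS namespace. Object name and keys are assumptions.
#include <rados/librados.hpp>
#include <map>
#include <string>

// Assumes 'io' is a librados::IoCtx opened on the dictionary pool.
int set_dict_entry(librados::IoCtx &io,
                   const std::string &user,
                   const std::string &key,      // e.g. "priv/quota/storage"
                   const std::string &value) {
  io.set_namespace(user);                       // per-user namespace

  librados::bufferlist v;
  v.append(value);
  std::map<std::string, librados::bufferlist> kv;
  kv[key] = v;

  librados::ObjectWriteOperation op;
  op.omap_set(kv);                              // key/value into omap
  return io.operate("dovecot-dict", &op);       // hypothetical object name
}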

It's open source!

License: LGPL v2.1

Language: C++

Location: github.com/ceph-dovecot/

Supported Dovecot versions:

2.2 (>= 2.2.21), 2.3

Ceph Requirements

Performance

Write performance for emails is critical
metadata/index read/write performance

Cost

Erasure Coding (EC) for emails
Replication for CephFS

Reliability

MUST survive failure of disk, server, rack and even fire compartments

Which Ceph Release?

Required features:

BlueStore: should be at least 2x faster than FileStore

CephFS: stable release, multi-MDS

Erasure coding
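For context, setting up an erasure-coded pool for the mail objects and replicated pools for CephFS on a Luminous-based cluster could look roughly like the commands below; the pool names, placement group counts and the k=4/m=2 profile are assumptions for illustration, not the values used in this deployment.

# Sketch only: names, PG counts and the EC profile are assumptions.
ceph osd erasure-code-profile set mail-ec k=4 m=2 crush-failure-domain=host
ceph osd pool create mail_storage 1024 1024 erasure mail-ec

# CephFS metadata and data stay on replicated pools
ceph osd pool create cephfs_metadata 128 128 replicated
ceph osd pool create cephfs_data 256 256 replicated
ceph fs new mail_fs cephfs_metadata cephfs_data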

SUSE products to use

SLES 12-SP3 and SES 5

Hardware

Hardware

Commodity x86_64 server

HPE ProLiant DL380 Gen9, dual socket

Intel® Xeon® E5 V4
2x Intel® X710-DA2 dual-port 10G
2x boot SSDs, SATA, HBA, no separate RAID controller

CephFS, RADOS, MDS and MON nodes

Storage Nodes

CephFS SSD Nodes

CPU: Intel Xeon E5 v4, 6 cores, turbo 3.7 GHz
RAM: 256 GByte, DDR4, ECC
SSD: 8x 1.6 TB SSD, 3 DWPD, SAS, RR/RW 125k/92k IOPS

RADOS HDD Nodes

CPU: Intel Xeon E5 v4, 10 cores, turbo 3.4 GHz
RAM: 128 GByte, DDR4, ECC
SSD: 2x 400 GByte, 3 DWPD, SAS, RR/RW 108k/49k IOPS
(for BlueStore database etc.)
HDD: 10x 4 TByte, 7.2K, 128 MB cache, SAS

Compute Nodes

MDS

CPU: Intel Xeon E5 v4, 6 cores, turbo 3.7 GHz
RAM: 256 GByte, DDR4, ECC

MON / SUSE admin

CPU: Intel Xeon E5 v4, 10 cores, turbo 3.4 GHz
RAM: 64 GByte, DDR4, ECC

Why this specific HW?

Community recommendations?

OSD: 1x 64-bit AMD-64, 1 GB RAM / 1 TB of storage, 2x 1 GBit NICs
MDS: 1x 64-bit AMD-64 quad-core, 1 GB RAM minimum per MDS, 2x 1 GBit NICs
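(By that rule of thumb, a 10x 4 TByte HDD node would only need around 40 GB of RAM and a modest CPU, well below the configurations listed above.)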

NUMA, high clocked CPUs and large RAM overkill?

Vendor did not offer single CPU nodes for the number of drives
MDS performance is mostly CPU clock bound and partly single threaded

High clocked CPUs for fast single threaded performance
Large RAM: better caching!

Placement

Issues

Datacenter

usually two independent fire compartments (FCs)
possibly additional virtual FCs

Requirements

Loss of customer data MUST be prevented
Any server, switch or rack can fail
One FC can fail
Data replication at least 3 times (or equivalent)

Issues

Questions

How to place 3 copies in two FCs? (one possible CRUSH rule is sketched below)
How independent and reliable are the virtual FCs?
Network architecture?
Network bandwidth?
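One possible answer to the first question, sketched as a CRUSH rule; it assumes the fire compartments are modelled as CRUSH buckets of type "room", so that with pool size 3 two copies land in one FC and the third in the other.

# Sketch: 3 replicas across two fire compartments (FCs modelled as "room").
rule mail_two_fc {
        id 1
        type replicated
        min_size 2
        max_size 3
        step take default
        step choose firstn 2 type room
        step chooseleaf firstn 2 type host
        step emit
}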

[Rack diagram]
Fire Compartment A: switches, 1x MON, 1x MDS, 7x HDD node, 3x SSD node
Fire Compartment B: switches, 1x MON, 1x MDS, 7x HDD node, 3x SSD node, SUSE admin
Fire Compartment C (third room): switches, 1x MON, 1x MDS, 3x SSD node

Network

10G network

2 NICs / 4 ports per node, SFP+ DAC

Multi-chassis Link Aggregation (MC-LAG / M-LAG)

For aggregation and fail-over

Spine-leaf architecture

Interconnect need not match the theoretical rack/FC bandwidth
L2: terminated in rack
L3: TOR <-> spine / spine <-> spine
Border Gateway Protocol (BGP)

[Network diagram: spine switches per DC room, interconnected by 2x 40G QSFP L3 crosslinks; 40G QSFP LAG (L3, BGP) towards FC1, FC2 and vFC; N x 10G SFP+ MC-LAG within each rack, L2 terminated]

Status and Next Steps

Status

Dovecot Ceph Plugin

Open sourced on GitHub

still under development
still includes librmb

planned: move to Ceph project

Testing

Functional testing

Set up small 5-node cluster
SLES 12-SP3 GMC, SES 5 Beta
Run Dovecot functional tests against Ceph

Proof-of-Concept

Hardware

9 SSD nodes for CephFS
12 HDD nodes
3 MDS / 3 MON

2 FCs + 1 vFC

Testing

run load tests
run failure scenarios against Ceph
improve and tune Ceph setup
verify and optimize hardware

Further Development

Goal: pure RADOS backend, store metadata/index in Ceph omap

[Diagram: IMAP4/POP3/LMTP process -> Storage API -> dovecot rbox backend -> librmb/librados; RFC 5322 objects and index live together in the RADOS cluster, with RFC 5322 mails cached on local storage]
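For the pure RADOS backend sketched above, index and metadata entries kept in omap could be read back roughly as follows; the object naming scheme and the page size are illustrative assumptions.

// Sketch: load per-mailbox index entries from a RADOS object's omap.
// Object name ("idx.<mailbox_guid>") and page size are assumptions.
#include <rados/librados.hpp>
#include <map>
#include <string>

// Assumes 'io' is a librados::IoCtx on the mail pool, namespace already set.
int load_index(librados::IoCtx &io,
               const std::string &mailbox_guid,
               std::map<std::string, librados::bufferlist> *entries) {
  return io.omap_get_vals("idx." + mailbox_guid, "", 10000, entries);
}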

Next Steps

Production

verify whether all requirements are fulfilled
integrate into production
migrate users step by step
extend to final size

128 HDD nodes, 1,200 OSDs, 4.7 PiB
15 SSD nodes, 120 OSDs, 175 TiB

Conclusion

Summary and conclusions

Ceph can replace NFS

mails in RADOS
metadata/indexes in CephFS
BlueStore, EC

librmb and dovecot rbox

Open source, LGPL v2.1, no license costs
librmb can be used in non-Dovecot systems
still under development

PoC with Dovecot in progress

Performance optimization

You are invited to participate!

Try it, test it, give feedback and report bugs! Contribute!

Thank you.

github.com/ceph-dovecot/

Questions?