Espresso - a Feasibility Study of a Scalable, Performant ODBMS

Dirk Duellmann
CERN IT/DB and RD45

Aim of this Study
Architectural Overview
Espresso Components
Prototype Status & Plans

Espresso Overview Dirk.Duellmann@cern.ch

Why Espresso?

RD45 Risk Analysis Milestone
  – Understand the effort needed to develop an ODBMS suitable as a fallback solution for LHC data stores
Testbed that allows us to test novel solutions for remaining problems
  – e.g. VLDB issues, asynchronous I/O, user schema & data, modern C++ binding, ...
NO plans to stop the Objectivity production service!

Could a home grown ODBMS be feasible?

Most database kernels have been developed in “C” in the late 80s and before
  – Today all main design choices are extensively studied in the computer science literature
  – C++ language and library provide a much better development platform than C
Our specific requirements are better understood
  – We know much better what we need (and do not need).
  – We could reuse HEP developments in many areas like mass storage interface, security
Building an ODBMS for HEP is an engineering and not a research task
  – We don’t need to spend the O(150) person years which went into the first ODBMS!

System Requirements

Scalability
  – in data volume and number of client connections
Navigational Access
  – with performance close to network and disk limits
Heterogeneous Access
  – from multiple platforms and languages
Transactional Safety & Crash Recovery
  – automatic consistency after soft/hardware failures

A Clean Sheet Approach - What should/could be done differently?

No need for big architectural changes
  – Objectivity/DB largely fulfils our functional requirements
  – Migration would be easier if the access model is similar (e.g. ODMG-like)
Focus on remaining problems
  – Improved Scalability & Concurrency of the Storage Hierarchy
      – Larger address space (VLDB)
      – Segmented and more scalable schema & catalogue
  – Improved Support for HEP environment
      – parallel development - concept of user/developer sandbox within the store needed
  – Simplify Partial Distribution of the Data Store
      – import/export consistent subsets of the store

Flexible Storage Hierarchy

File - Group of physically clustered objects
  – Smallest possible Espresso store
  – Contains data and optionally schema
  – Fast navigation within the file using physical OIDs
Domain - Group of files with tightly coupled objects
  – Contains domain catalogue, data and additional schema
  – Navigation between all objects within the domain using physical OIDs
Federation - Group of weakly coupled domains
  – Domain catalogue (very few updates!)
  – Shared schema (very few updates!)

[Figure: example federation layout. Labels recoverable from the diagram:
  FD Catalogue, Production Schema
  User 1 “sandbox”: User 1 Domain, Catalogue, Schema myTrack, User 1 Tags, User 1 Histos, User 1 MyTracks
  Production Server: Period 1 Domain ... Period N Domain, each with Catalogue, RAW, AOD, REC
  Read-Only Domain (no locking required)
  Calib Server: Calib Domain, Catalogue, Calib TPC, Calib HCAL, Calib ECAL]

Espresso OID Layout

Federation
  – set of weakly coupled domains
Domain# 32bit
  – set of tightly coupled objects
  – e.g. a run or run period, an end-user workspace
File# 16bit
  – a single file within a domain
Page# 32bit
  – a single logical page in the file
Object# 16bit
  – a single data record on a page, e.g. an object or varray

[Figure: nesting of Federation, Domain, File, Page, Object]
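The field widths above add up to 32 + 16 + 32 + 16 = 96 bits. A minimal sketch of how such an OID might be held in memory and serialised into 12 bytes follows. The names Oid, PackedOid, pack and unpack are hypothetical, and the most-significant-byte-first order is an assumption made for the sketch; the slides state only the field widths.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical in-memory OID; only the field widths come from the slide.
struct Oid {
    std::uint32_t domain;  // Domain#: set of tightly coupled objects
    std::uint16_t file;    // File#:   a single file within the domain
    std::uint32_t page;    // Page#:   a single logical page in the file
    std::uint16_t object;  // Object#: a single data record on the page
};

using PackedOid = std::array<std::uint8_t, 12>;  // 96 bits

// Serialise most significant byte first, so the stored form does not
// depend on the byte order of the host (an assumption of this sketch).
inline PackedOid pack(const Oid& o) {
    PackedOid b{};
    std::size_t i = 0;
    for (int s = 24; s >= 0; s -= 8) b[i++] = static_cast<std::uint8_t>(o.domain >> s);
    for (int s = 8; s >= 0; s -= 8)  b[i++] = static_cast<std::uint8_t>(o.file >> s);
    for (int s = 24; s >= 0; s -= 8) b[i++] = static_cast<std::uint8_t>(o.page >> s);
    for (int s = 8; s >= 0; s -= 8)  b[i++] = static_cast<std::uint8_t>(o.object >> s);
    return b;
}

// Inverse of pack(): rebuild the four fields from the 12 stored bytes.
inline Oid unpack(const PackedOid& b) {
    Oid o{};
    std::size_t i = 0;
    for (int n = 0; n < 4; ++n) o.domain = (o.domain << 8) | b[i++];
    for (int n = 0; n < 2; ++n) o.file = static_cast<std::uint16_t>((o.file << 8) | b[i++]);
    for (int n = 0; n < 4; ++n) o.page = (o.page << 8) | b[i++];
    for (int n = 0; n < 2; ++n) o.object = static_cast<std::uint16_t>((o.object << 8) | b[i++]);
    return o;
}
```

Calling unpack(pack(o)) returns the original four fields on any host, which is the property a platform independent storage format needs.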

Prototype Implementation

Espresso is implemented in standard C++
  – no other dependencies
  – (for now we use portable network I/O from ObjectSpace)
Expect a full C++ compiler
  – STL containers
      – in fact all containers in the current implementation are STL containers
  – Exceptions
      – C++ binding uses exceptions to signal error conditions (conforming to ODMG standard)
  – Namespaces
      – All of the implementation is contained in namespace “espresso”
      – C++ binding is in namespace “odmg”
Development Platform: RedHat Linux & g++

Component Approach

Espresso is split into a small set of replaceable components with well defined
  – task
  – interface
  – dependency on other components

Common Services
Storage Manager
Schema Manager
Catalogue Manager
Data Server
Lock Server
C++ & Python Binding, (JAVA)

Toplevel Components

[Figure: component dependency diagram with “depends on” arrows.
  Layer labels: User API, Tool Interface, Storage Level Interface, OS & Network Abstraction, Distribution
  Component boxes: C++ Binding, JAVA Binding, Python Binding, TransMgr, CatalogMgr, SchemaMgr, StorageMgr, Page I/O, PageServer, LockServer, Locktable, Net I/O, File I/O]

Components: Physical Model

Each top-level component corresponds to one shared library and namespace
  – shared lib dependencies follow category diagram
  – components are isolated in their namespace
      – from other components
      – from user classes
Each shared lib provides IComponent interface
  – Factory for main provided interfaces
  – Version and configuration control on component level
      – implementation version, date and compiler version
      – boolean flags for optimised, debug, profiling
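As an illustration of what such a component interface could look like, here is a minimal sketch. Only the names IComponent and IStorageMgr appear in the slides; VersionInfo, IInterface, create(), pageSize(), ToyStorageMgr and StorageComponent are hypothetical, and the real Espresso signatures are not known from this material.

```cpp
#include <memory>
#include <string>

// Version and configuration record of one component (hypothetical layout).
struct VersionInfo {
    std::string implementationVersion;
    std::string buildDate;
    std::string compilerVersion;
    bool optimised = false;
    bool debug = false;
    bool profiling = false;
};

// Common base of everything the factory can hand out (assumption).
struct IInterface {
    virtual ~IInterface() = default;
};

// One instance of this is exported per shared library.
class IComponent {
public:
    virtual ~IComponent() = default;
    virtual std::string name() const = 0;
    virtual VersionInfo version() const = 0;
    // Factory: returns nullptr if the named interface is not provided.
    virtual std::unique_ptr<IInterface> create(const std::string& interfaceName) = 0;
};

// Toy interface and implementation, only to exercise the factory.
struct IStorageMgr : IInterface {
    virtual int pageSize() const = 0;
};

class ToyStorageMgr : public IStorageMgr {
public:
    int pageSize() const override { return 8192; }
};

class StorageComponent : public IComponent {
public:
    std::string name() const override { return "StorageMgr"; }

    VersionInfo version() const override {
        VersionInfo v;
        v.implementationVersion = "0.1";
        v.buildDate = __DATE__;          // standard predefined macro
#ifdef __VERSION__
        v.compilerVersion = __VERSION__;  // provided by g++ and compatible compilers
#endif
#ifdef __OPTIMIZE__
        v.optimised = true;               // set by g++ when optimising
#endif
#ifndef NDEBUG
        v.debug = true;
#endif
        return v;
    }

    std::unique_ptr<IInterface> create(const std::string& interfaceName) override {
        if (interfaceName == "IStorageMgr") return std::make_unique<ToyStorageMgr>();
        return nullptr;
    }
};
```

A client would ask the component for "IStorageMgr" and dynamic_cast the result, so it never names the implementing class and the shared library can be replaced without recompiling the client.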

Client Side Components

Storage Manager
  – store and retrieve variable length opaque data objects
      – maintains OIDs for data objects
      – implements transactional safety
      – language and platform independent
  – current implementation uses “shadow-paging” to implement transactions
Schema Manager
  – describe the layout of data types
      – data member position, size and type, byte ordering for primitive types
  – used for:
      – Platform Conversion, Generic Browsing, Schema Consistency
  – current implementation extracts schema from the debug information provided directly by the compiler
  – no schema pre-processor required
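To make the platform conversion use case concrete: once the schema records position and size of every primitive data member, a record written on a machine with the opposite byte order can be converted field by field. The sketch below assumes a hypothetical FieldDesc record and swapByteOrder function; it is not the Espresso schema format.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical schema entry for one primitive data member.
struct FieldDesc {
    std::size_t offset;  // position of the member inside the object
    std::size_t size;    // size of the member in bytes
};

// Reverse the bytes of every described field in place. This converts a
// record between big-endian and little-endian representation.
inline void swapByteOrder(std::vector<unsigned char>& record,
                          const std::vector<FieldDesc>& layout) {
    for (const FieldDesc& f : layout) {
        if (f.offset + f.size > record.size()) continue;  // ignore fields outside the record
        auto first = record.begin() + static_cast<std::ptrdiff_t>(f.offset);
        auto last = first + static_cast<std::ptrdiff_t>(f.size);
        std::reverse(first, last);
    }
}
```

Bytes not covered by a field description (padding, opaque data) are left untouched, which is why the conversion needs the schema and cannot simply reverse the whole record.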

Server Side Components

Data Server
  – transfer data pages from persistent storage (disk/tape) to memory
      – file system like interface
  – trivial implementation for local I/O
  – multi-threaded server daemon for remote I/O
Lock Server
  – keep a central table of resource locks
      – getLock (oid)
  – implements lock waiting and upgrading
  – very similar approach to most DBMS
      – Hash Table of resource locks (resource specified as OID)
      – Queue of waiters per locked resource
  – moderate complexity: storage manager implements “real” transaction logic
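The lock table described above (hash table keyed by resource, queue of waiters per resource, waiting and upgrading) can be sketched in a few lines. This is a single-threaded toy under stated assumptions: ResourceId is a 64-bit stand-in for an OID, read locks are shared and write locks exclusive, and there is no networking, time-out or deadlock detection. Only the name getLock is taken from the slide; everything else is hypothetical.

```cpp
#include <cstdint>
#include <deque>
#include <set>
#include <unordered_map>
#include <vector>

using ResourceId = std::uint64_t;  // stand-in for an OID (assumption)
using ClientId = int;
enum class Mode { Read, Write };
enum class Result { Granted, Queued };

class LockTable {
public:
    // Grant the lock if compatible, otherwise append the client to the queue.
    Result getLock(ResourceId r, ClientId c, Mode m) {
        Entry& e = table_[r];
        const bool soleHolder = e.holders.size() == 1 && e.holders.count(c) == 1;
        if (e.holders.empty()) {
            e.holders.insert(c);
            e.mode = m;
            return Result::Granted;
        }
        if (soleHolder) {  // re-request or upgrade by the only holder
            if (m == Mode::Write) e.mode = Mode::Write;
            return Result::Granted;
        }
        if (e.mode == Mode::Read && m == Mode::Read &&
            (e.waiters.empty() || e.holders.count(c) == 1)) {
            e.holders.insert(c);  // shared read lock
            return Result::Granted;
        }
        e.waiters.push_back({c, m});
        return Result::Queued;
    }

    // Release a lock and return the clients that were woken up, in order.
    std::vector<ClientId> release(ResourceId r, ClientId c) {
        std::vector<ClientId> granted;
        auto it = table_.find(r);
        if (it == table_.end()) return granted;
        Entry& e = it->second;
        e.holders.erase(c);
        while (!e.waiters.empty()) {
            const Waiter w = e.waiters.front();
            const bool upgrade = e.holders.size() == 1 && e.holders.count(w.client) == 1;
            if (e.holders.empty()) {
                e.mode = w.mode;
            } else if (upgrade && w.mode == Mode::Write) {
                e.mode = Mode::Write;
            } else if (!(w.mode == Mode::Read && e.mode == Mode::Read)) {
                break;  // head of the queue is incompatible: keep waiting
            }
            e.holders.insert(w.client);
            granted.push_back(w.client);
            e.waiters.pop_front();
        }
        if (e.holders.empty() && e.waiters.empty()) table_.erase(it);
        return granted;
    }

private:
    struct Waiter { ClientId client; Mode mode; };
    struct Entry {
        std::set<ClientId> holders;
        Mode mode = Mode::Read;
        std::deque<Waiter> waiters;
    };
    std::unordered_map<ResourceId, Entry> table_;
};
```

The table holds no transaction state beyond who holds and who waits, which matches the remark that the “real” transaction logic lives in the storage manager.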

C++ Language Binding

Support all main language features
  – Including polymorphic access and templates
  – No language extensions, no generated code
ODMG 2.0 compliant C++ Binding
  – Ref templates can be sub-classed to extend their behavior
      – e.g. d_Ref could be extended to monitor object access counts
  – large fraction of the binding has already been implemented
      – smart pointers can point to transient objects
      – persistent capable classes may be embedded into other persistent classes
      – d_activate and d_deactivate are implemented
  – design supports multiple DB contexts per process
      – e.g. for multi-threaded applications and multiple federations
Work in progress:
  – B-Tree indices, bi-directional links, installable adapters for persistent objects
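The access counting example mentioned above could look roughly as follows. The sketch does not use the real ODMG d_Ref: sketch::Ref is a minimal stand-in that points at a transient object, and CountingRef, accessCount() and Track are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>

namespace sketch {

// Minimal stand-in for an ODMG-style smart reference. NOT the real d_Ref.
template <class T>
class Ref {
public:
    explicit Ref(T* p = nullptr) : ptr_(p) {}
    T* operator->() const { return ptr_; }
    T& operator*() const { return *ptr_; }
    bool is_null() const { return ptr_ == nullptr; }
protected:
    T* ptr_;
};

// Sub-class that counts every dereference before forwarding to the base.
template <class T>
class CountingRef : public Ref<T> {
public:
    explicit CountingRef(T* p = nullptr) : Ref<T>(p) {}
    T* operator->() const { ++count_; return Ref<T>::operator->(); }
    T& operator*() const { ++count_; return Ref<T>::operator*(); }
    std::size_t accessCount() const { return count_; }
private:
    mutable std::size_t count_ = 0;  // mutable: dereferencing is logically const
};

}  // namespace sketch

// Toy user class, only to have something to point at.
struct Track {
    double pt;
};
```

Because the dereference operators are not virtual here, the counting only happens when the object is accessed through a CountingRef, not through a reference to its base class.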

First Scalability & Performance Tests

Page Server
  – up to 70 concurrent clients
Lock Server
  – up to 150 concurrent clients, up to 3000 locks
Storage Manager
  – Files up to 2 GB (ext2 file system limit under LINUX)
  – 100 million objects per file
      – stress tested with “random” bit-patterns
  – Objects up to 10 MB size
  – Write Performance: > 40MB/s at 30% CPU
      – 450MHz dual PIII with 4 stripe RAID 0 on RedHat 6.1
C++ Binding and Schema Handling
  – successfully ported several non-trivial applications
  – HTL histogram examples, simple object browser using python
  – tagDb and naming examples from HepODBMS

Next Steps

Start detailed requirement discussion with experiments and other interested institutes
Continue Scalability & Performance Test
  – Storage Manager: larger files (>100GB)
  – Page Server: connections > 500
  – Lock Server: number of locks > 20k
  – C++ Binding & Schema Manager: port Geant4 persistency examples and Conditions-DB
By summer this year
  – Written Architectural Overview of the Prototype
  – Development Plan with detailed estimate of required manpower
  – Single user toy-system

Summary & Conclusions

We identified solutions for most critical components of a scalable and performant ODBMS
  – Prototype implementation shows promising performance and scalability
  – Using a strict component approach allows the effort to be split into independently developed, replaceable modules.
The development of an Open Source ODBMS seems possible within the HEP or general science community
A collaborative effort of the order of 15 person years seems sufficient to produce such a system with production quality

The End

Exploit Read-Only Data

Most of our data volume follows the pattern
  – (private) write-once,
  – share read-only
  – e.g. raw data is never updated, reconstructed data is not updated but replaced
Current ODBMS implementations do not really take advantage of this fact
  – read-only files
      – no need to obtain any locks for this data
      – no need to ever update cache content
      – simple backup strategy
Using the concept of read-only files
  – e.g. in the catalogue
  – should significantly reduce the locking overhead and improve the scalability of the system with many concurrent clients

Transactions and Recovery

Shadow Paging
  – Physical pages on disk are accessed indirectly through a translation table (page map).
  – Copy-on-Write: page modifications are always written to a new, free physical page
  – Changed physical pages are made visible to other transactions by updating the page map at commit time.

[Figure: file layout: Master, PageMap 1, Data 2, Data 3, Data 4, Data 5, PageMap 2]
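The three shadow paging rules above can be shown with an in-memory toy. This is a sketch under stated assumptions, not the Espresso storage manager: pages are strings, the “file” is a vector, there is a single writer, and pages of abandoned transactions are never reclaimed. ShadowPagedFile, Transaction and their members are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

class ShadowPagedFile {
public:
    using PageMap = std::map<std::size_t, std::size_t>;  // logical page -> physical page

    class Transaction {
    public:
        // Start from a private copy of the committed page map.
        explicit Transaction(ShadowPagedFile& f) : file_(f), map_(f.committed_) {}

        // Copy-on-write: always write to a new physical page and record it
        // in the private page map only. Committed pages are never touched.
        void write(std::size_t logical, const std::string& data) {
            file_.physical_.push_back(data);
            map_[logical] = file_.physical_.size() - 1;
        }

        // Read through the private page map, so own changes are visible.
        std::string read(std::size_t logical) const {
            auto it = map_.find(logical);
            return it == map_.end() ? std::string() : file_.physical_[it->second];
        }

        // Commit: replacing the page map makes all changes visible at once.
        void commit() { file_.committed_ = map_; }

    private:
        ShadowPagedFile& file_;
        PageMap map_;
    };

    // Read through the committed page map (what other transactions see).
    std::string read(std::size_t logical) const {
        auto it = committed_.find(logical);
        return it == committed_.end() ? std::string() : physical_[it->second];
    }

    std::size_t physicalPages() const { return physical_.size(); }

private:
    std::vector<std::string> physical_;  // stand-in for physical pages on disk
    PageMap committed_;                  // stand-in for the current page map
};
```

Dropping a Transaction without calling commit() is an abort: the committed page map still points at the old pages, which is also why a crash before commit leaves the store consistent.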

Advantages of this Approach

Single files or complete domains can be used stand-alone without modification
  – e.g. set of user files containing tags and histograms
Local OIDs could be stored in a more compact form
  – transparent expansion into a full OID as they are read into memory
“Attaching” or direct sharing of files or complete domains does not need any special treatment
  – no OID translation needed
  – read-only files/domains can directly be shared by multiple federations
Domains allow the store to be segmented into “coherent regions” of associated objects
  – Efficient distribution, backup and replication of subsets of the data (e.g. a run period, a set of user tracks)
  – Consistency checks can be constrained to a single domain
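The compact local OID idea can be sketched as follows, assuming the 32/16/32/16 bit layout of the OID slide. A reference that stays inside one file stores only page and object number (48 bits); domain and file number are filled in from the file being read. FullOid, LocalOid, FileContext, expand and tryCompact are hypothetical names, and the slides do not say which compact forms Espresso would actually use.

```cpp
#include <cstdint>

// Full OID as kept in memory (field widths from the OID layout slide).
struct FullOid {
    std::uint32_t domain;
    std::uint16_t file;
    std::uint32_t page;
    std::uint16_t object;
};

// Compact form for references that stay within one file: 48 bits.
struct LocalOid {
    std::uint32_t page;
    std::uint16_t object;
};

// Where the file being read is currently attached.
struct FileContext {
    std::uint32_t domain;
    std::uint16_t file;
};

// Expand a stored local OID into a full OID as it is read into memory.
constexpr FullOid expand(const LocalOid& l, const FileContext& ctx) {
    return FullOid{ctx.domain, ctx.file, l.page, l.object};
}

// Compact a reference if its target lives in the same file; returns false
// when the target is elsewhere and the full OID has to be stored instead.
inline bool tryCompact(const FullOid& target, const FileContext& ctx, LocalOid& out) {
    if (target.domain != ctx.domain || target.file != ctx.file) return false;
    out = LocalOid{target.page, target.object};
    return true;
}
```

Since the stored form carries no domain or file number, the same file expands correctly under whatever context it is attached to, which is consistent with the “no OID translation needed” point above.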

Common Services

Services and Interfaces of global visibility
  – OID, IStorageMgr, IPageServer, ILockServer, ISchemaMgr
  – Platform & OS abstraction
      – fixed range types, I/O primitives, process control
  – component interface
      – version & configuration control
      – component factory
  – extendible diagnostics
      – named counters, timers to instrument the code
      – each component may have a sub-tree of diagnostic items
  – error & debug message handler
      – syslog like: component, level, message
  – exception base class

Espresso Schema Extraction

Currently implemented
  – extraction based on the “stabs” standard format for debugging information (used by egcs and Sun CC)
  – based on GNU “BFD” library and “objdump” source code
Prototype provides full runtime reflection for C++ data
  – describes classes and structs with their fields and inheritance
  – supports namespaces, typedefs, enums and templates
  – location and value of virtual function and virtual base class pointers
  – sufficient to allow runtime field by field consistency check against persistent schema
Starting from a modified egcs front-end as schema extractor would be an alternative
