Post on 19-Jan-2016
Espresso - a Feasibility Study of a Scalable, Performant ODBMS
Dirk Duellmann
CERN IT/DB and RD45
  Aim of this Study
  Architectural Overview
  Espresso Components
  Prototype Status & Plans
Espresso Overview Dirk.Duellmann@cern.ch
Why Espresso?
  RD45 Risk Analysis Milestone
    – Understand the effort needed to develop an ODBMS suitable as a fallback solution for LHC data stores
  Testbed that allows us to test novel solutions for remaining problems
    – e.g. VLDB issues, asynchronous I/O, user schema & data, modern C++ binding, ...
  NO plans to stop the Objectivity production service!
Could a home-grown ODBMS be feasible?
  Most database kernels were developed in “C” in the late 80s and before
    – Today all main design choices are extensively studied in the computer science literature
    – The C++ language and library provide a much better development platform than C
  Our specific requirements are better understood
    – We know much better what we need (and what we do not need).
    – We could reuse HEP developments in many areas like mass storage interface, security
  Building an ODBMS for HEP is an engineering task and not a research task
    – We don’t need to spend the O(150) person years which went into the first ODBMS!
System Requirements
  Scalability
    – in data volume and number of client connections
  Navigational Access
    – with performance close to network and disk limits
  Heterogeneous Access
    – from multiple platforms and languages
  Transactional Safety & Crash Recovery
    – automatic consistency after soft/hardware failures
A Clean Sheet Approach - What should/could be done differently?
  No need for big architectural changes
    – Objectivity/DB largely fulfils our functional requirements
    – Migration would be easier if the access model is similar (e.g. ODMG-like)
  Focus on remaining problems
    – Improved Scalability & Concurrency of the Storage Hierarchy
        Larger address space (VLDB)
        Segmented and more scalable schema & catalogue
    – Improved Support for HEP environment
        parallel development - concept of user/developer sandbox within the store needed
    – Simplify Partial Distribution of the Data Store
        import/export consistent subsets of the store
Flexible Storage Hierarchy
  File - Group of physically clustered objects
    – Smallest possible Espresso store
    – Contains data and optionally schema
    – Fast navigation within the file using physical OIDs
  Domain - Group of files with tightly coupled objects
    – Contains domain catalogue, data and additional schema
    – Navigation between all objects within the domain using physical OIDs
  Federation - Group of weakly coupled domains
    – Domain catalogue (very few updates!)
    – Shared schema (very few updates!)
  [Figure: example federation layout. Labels: FD Catalogue; Production Schema; User 1 “sandbox” (User 1 Domain with Catalogue, Schema myTrack, and User 1 Tags, Histos, MyTracks); Production Server (Period 1 ... Period N Domains, each with Catalogue and RAW, AOD, REC); Read-Only Domain (no locking required); Calib Server (Calib Domain with Catalogue and Calib TPC, HCAL, ECAL)]
Espresso OID Layout
  Federation
    – set of weakly coupled domains
  Domain# (32 bit)
    – set of tightly coupled objects
    – e.g. a run or run period, an end-user workspace
  File# (16 bit)
    – a single file within a domain
  Page# (32 bit)
    – a single logical page in the file
  Object# (16 bit)
    – a single data record on a page, e.g. an object or varray
  [Figure: nesting Federation > Domain > File > Page > Object]
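The OID layout above can be sketched as a plain C++ struct with one field per level. This is only an illustration under stated assumptions: the names `Oid`, `pack` and `unpack` and the big-endian 12-byte external form are hypothetical and not taken from the Espresso sources; only the field widths (32/16/32/16 bit) come from the slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace oid_sketch {

// Field widths follow the slide: Domain# 32 bit, File# 16 bit,
// Page# 32 bit, Object# 16 bit, i.e. 96 bits per full OID.
struct Oid {
    std::uint32_t domain;
    std::uint16_t file;
    std::uint32_t page;
    std::uint16_t object;
};

inline bool operator==(const Oid& a, const Oid& b) {
    return a.domain == b.domain && a.file == b.file &&
           a.page == b.page && a.object == b.object;
}

constexpr std::size_t kOidBytes = 12;  // 4 + 2 + 4 + 2
using OidBytes = std::array<std::uint8_t, kOidBytes>;

// Write the low n bytes of v at position pos, most significant byte first.
// The byte order is an assumption made for this sketch.
inline void putBigEndian(OidBytes& out, std::size_t pos, std::uint32_t v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[pos + i] = static_cast<std::uint8_t>(v >> (8 * (n - 1 - i)));
    }
}

// Read n bytes starting at pos as a big-endian unsigned value.
inline std::uint32_t getBigEndian(const OidBytes& in, std::size_t pos, std::size_t n) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v = (v << 8) | in[pos + i];
    }
    return v;
}

// Platform-independent external form of an OID.
inline OidBytes pack(const Oid& o) {
    OidBytes b{};
    putBigEndian(b, 0, o.domain, 4);
    putBigEndian(b, 4, o.file, 2);
    putBigEndian(b, 6, o.page, 4);
    putBigEndian(b, 10, o.object, 2);
    return b;
}

inline Oid unpack(const OidBytes& b) {
    Oid o{};
    o.domain = getBigEndian(b, 0, 4);
    o.file = static_cast<std::uint16_t>(getBigEndian(b, 4, 2));
    o.page = getBigEndian(b, 6, 4);
    o.object = static_cast<std::uint16_t>(getBigEndian(b, 10, 2));
    return o;
}

}  // namespace oid_sketch
```

Packing field by field rather than copying the struct avoids any dependence on compiler padding or host byte order, which matters for the heterogeneous access requirement.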
Prototype Implementation
  Espresso is implemented in standard C++
    – no other dependencies
    – (for now we use portable network I/O from ObjectSpace)
  Expect a full C++ compiler
    – STL containers
        in fact all containers in the current implementation are STL containers
    – Exceptions
        C++ binding uses exceptions to signal error conditions (conforming to ODMG standard)
    – Namespaces
        All of the implementation is contained in namespace “espresso”
        C++ binding is in namespace “odmg”
  Development Platform: RedHat Linux & g++
Component Approach
  Espresso is split into a small set of replaceable components with well defined
    – task
    – interface
    – dependency on other components
  Common Services
  Storage Manager
  Schema Manager
  Catalogue Manager
  Data Server
  Lock Server
  C++ & Python Binding, (JAVA)
Toplevel Components
  [Figure: component category diagram with “depends on” arrows. Layer labels: User API; Tool Interface; Storage Level Interface; Distribution; OS & Network Abstraction. Boxes: C++ Binding, JAVA Binding, Python Binding, StorageMgr, TransMgr, CatalogMgr, SchemaMgr, PageServer, LockServer, Locktable, Page I/O, Net I/O, File I/O]
Components: Physical Model
  Each top-level component corresponds to one shared library and namespace
    – shared lib dependencies follow category diagram
    – components are isolated in their namespace
        from other components
        from user classes
  Each shared lib provides IComponent interface
    – Factory for main provided interfaces
    – Version and configuration control on component level
        implementation version, date and compiler version
        boolean flags for optimised, debug, profiling
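One way to picture the IComponent idea is an abstract interface that reports build information and acts as a factory for the interface its library provides. The sketch below is an assumption about the shape of such an interface: only the names `IComponent` and `IStorageMgr` appear in the slides, while `ComponentInfo`, `info()`, `createStorageMgr()` and the toy classes are hypothetical. It uses post-2011 C++ for brevity.

```cpp
#include <cassert>
#include <memory>
#include <string>

namespace component_sketch {

// Version and configuration data listed on the slide.
struct ComponentInfo {
    std::string implVersion;
    std::string buildDate;
    std::string compilerVersion;
    bool optimised = false;
    bool debug = false;
    bool profiling = false;
};

// Stand-in for a main provided interface (members are hypothetical).
class IStorageMgr {
public:
    virtual ~IStorageMgr() = default;
    virtual std::string name() const = 0;
};

// Hypothetical shape of the per-library component interface.
class IComponent {
public:
    virtual ~IComponent() = default;
    virtual ComponentInfo info() const = 0;
    virtual std::unique_ptr<IStorageMgr> createStorageMgr() const = 0;
};

class ToyStorageMgr : public IStorageMgr {
public:
    std::string name() const override { return "toy-storage"; }
};

class ToyStorageComponent : public IComponent {
public:
    ComponentInfo info() const override {
        ComponentInfo i;
        i.implVersion = "0.1";
        i.buildDate = __DATE__;  // standard predefined macro
#ifdef __VERSION__               // compiler-specific (GCC, Clang)
        i.compilerVersion = __VERSION__;
#else
        i.compilerVersion = "unknown";
#endif
#ifdef __OPTIMIZE__              // compiler-specific (GCC, Clang)
        i.optimised = true;
#endif
#ifndef NDEBUG
        i.debug = true;
#endif
        return i;
    }
    std::unique_ptr<IStorageMgr> createStorageMgr() const override {
        return std::unique_ptr<IStorageMgr>(new ToyStorageMgr());
    }
};

}  // namespace component_sketch
```

A client that only sees `IComponent` can check the reported version and flags before asking the factory for an implementation, which is what makes a component replaceable at the shared-library level.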
Client Side Components
  Storage Manager
    – store and retrieve variable length opaque data objects
        maintains OIDs for data objects
        implements transactional safety
        language and platform independent
    – current implementation uses “shadow-paging” to implement transactions
  Schema Manager
    – describe the layout of data types
        data member position, size and type, byte ordering for primitive types
    – used for:
        Platform Conversion, Generic Browsing, Schema Consistency
    – current implementation extracts schema from the debug information provided directly by the compiler
    – no schema pre-processor required
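To illustrate how a layout description can drive platform conversion, here is a minimal sketch: a class description lists each primitive field with its position and size, and a generic routine reverses the byte order of every field in a raw record. The types `FieldDesc` and `ClassDesc` and the function `swapByteOrder` are hypothetical names for this example, not the Espresso schema manager API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace schema_sketch {

enum class PrimType { Int32, Float64 };

// Position, size and type of one data member (as listed on the slide).
struct FieldDesc {
    std::string name;
    std::size_t offset;
    std::size_t size;
    PrimType type;
};

struct ClassDesc {
    std::string name;
    std::size_t size;
    std::vector<FieldDesc> fields;
};

// Convert a raw record between little- and big-endian form by
// reversing the bytes of each primitive field in place.
inline void swapByteOrder(const ClassDesc& cls, std::vector<std::uint8_t>& record) {
    for (const FieldDesc& f : cls.fields) {
        auto first = record.begin() + static_cast<std::ptrdiff_t>(f.offset);
        std::reverse(first, first + static_cast<std::ptrdiff_t>(f.size));
    }
}

}  // namespace schema_sketch
```

The same description could serve a generic browser (iterate over `fields` and print by `type`) or a consistency check (compare two `ClassDesc` values field by field); those uses are not shown here.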
Server Side Components
  Data Server
    – transfer data pages from persistent storage (disk/tape) to memory
        file system like interface
    – trivial implementation for local I/O
    – multi-threaded server daemon for remote I/O
  Lock Server
    – keep a central table of resource locks
        getLock (oid)
    – implements lock waiting and upgrading
    – very similar approach to most DBMS
        Hash Table of resource locks (resource specified as OID)
        Queue of waiters per locked resource
    – moderate complexity: storage manager implements “real” transaction logic
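The lock table described above (hash table keyed by resource, one waiter queue per locked resource, waiting and upgrading) can be sketched in a few dozen lines. This is a single-threaded illustration under assumptions: the class `LockTable`, the shared/exclusive modes, the `Granted`/`Queued` result and the use of a 64-bit integer in place of a full OID are all hypothetical; only the name `getLock` comes from the slide, and a real server would add networking, blocking and deadlock handling.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <set>
#include <unordered_map>

namespace lock_sketch {

using ResourceId = std::uint64_t;  // stand-in for a full OID
using ClientId = int;

enum class Mode { Shared, Exclusive };
enum class Result { Granted, Queued };

class LockTable {
public:
    // Grant the lock if compatible, otherwise append to the waiter queue.
    Result getLock(ResourceId res, ClientId client, Mode mode) {
        Entry& e = table_[res];
        if (canGrant(e, client, mode, false)) {
            grant(e, client, mode);
            return Result::Granted;
        }
        e.waiters.push_back(Waiter{client, mode});
        return Result::Queued;
    }

    // Drop a client's lock and hand the resource to waiters in FIFO order.
    void release(ResourceId res, ClientId client) {
        auto it = table_.find(res);
        if (it == table_.end()) return;
        Entry& e = it->second;
        e.holders.erase(client);
        while (!e.waiters.empty()) {
            Waiter w = e.waiters.front();
            if (!canGrant(e, w.client, w.mode, true)) break;
            e.waiters.pop_front();
            grant(e, w.client, w.mode);
        }
        if (e.holders.empty() && e.waiters.empty()) table_.erase(it);
    }

    bool holds(ResourceId res, ClientId client) const {
        auto it = table_.find(res);
        return it != table_.end() && it->second.holders.count(client) == 1;
    }

private:
    struct Waiter { ClientId client; Mode mode; };
    struct Entry {
        Mode mode = Mode::Shared;
        std::set<ClientId> holders;
        std::deque<Waiter> waiters;
    };

    static bool canGrant(const Entry& e, ClientId client, Mode mode, bool fromQueue) {
        if (e.holders.empty()) return true;
        // Sole holder asking again: re-request or upgrade to exclusive.
        if (e.holders.size() == 1 && e.holders.count(client) == 1) return true;
        // New readers join readers, but do not overtake queued waiters.
        return mode == Mode::Shared && e.mode == Mode::Shared &&
               (fromQueue || e.waiters.empty());
    }

    static void grant(Entry& e, ClientId client, Mode mode) {
        if (e.holders.empty() || mode == Mode::Exclusive) e.mode = mode;
        e.holders.insert(client);
    }

    std::unordered_map<ResourceId, Entry> table_;
};

}  // namespace lock_sketch
```

In this sketch two readers share a resource, a writer queues behind them, and a later reader queues behind the writer rather than starving it.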
C++ Language Binding
  Support all main language features
    – Including polymorphic access and templates
    – No language extensions, no generated code
  ODMG 2.0 compliant C++ Binding
    – Ref templates can be sub-classed to extend their behavior
        e.g. d_Ref could be extended to monitor object access counts
    – large fraction of the binding has already been implemented
        smart pointers can point to transient objects
        persistent capable classes may be embedded into other persistent classes
        d_activate and d_deactivate are implemented
    – design supports multiple DB contexts per process
        e.g. for multi-threaded applications and multiple federations
  Work in progress:
    – B-Tree indices, bi-directional links, installable adapters for persistent objects
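The idea of sub-classing a Ref template to count object accesses can be sketched with a stand-in smart pointer. The `Ref<T>` below is deliberately NOT the ODMG `d_Ref`: it is a minimal hypothetical template that points at a transient object, and `CountingRef` and `Track` are likewise invented for the example.

```cpp
#include <cassert>
#include <cstddef>

namespace ref_sketch {

// Minimal stand-in for a reference template (not the ODMG d_Ref API).
template <class T>
class Ref {
public:
    explicit Ref(T* target = nullptr) : target_(target) {}
    T* operator->() const { return target_; }
    T& operator*() const { return *target_; }
    bool is_null() const { return target_ == nullptr; }

protected:
    T* target_;
};

// Sub-class that counts every dereference of the referenced object.
template <class T>
class CountingRef : public Ref<T> {
public:
    explicit CountingRef(T* target = nullptr) : Ref<T>(target) {}
    T* operator->() const {
        ++accesses_;
        return Ref<T>::operator->();
    }
    T& operator*() const {
        ++accesses_;
        return Ref<T>::operator*();
    }
    std::size_t accesses() const { return accesses_; }

private:
    mutable std::size_t accesses_ = 0;  // bookkeeping only, so mutable
};

// Example user class; no base class or generated code is needed here.
struct Track {
    double pt;
};

}  // namespace ref_sketch
```

Because the dereference operators are not virtual, the counting only happens when the object is used through the `CountingRef` type itself; that is a property of this sketch, not a statement about the Espresso binding.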
First Scalability & Performance Tests
  Page Server
    – up to 70 concurrent clients
  Lock Server
    – up to 150 concurrent clients, up to 3000 locks
  Storage Manager
    – Files up to 2 GB (ext2 file system limit under Linux)
    – 100 million objects per file
        stress tested with “random” bit-patterns
    – Objects up to 10 MB size
    – Write Performance: > 40 MB/s at 30% CPU
        450 MHz dual PIII with 4 stripe RAID 0 on RedHat 6.1
  C++ Binding and Schema Handling
    – successfully ported several non-trivial applications
    – HTL histogram examples, simple object browser using Python
    – tagDb and naming examples from HepODBMS
Next Steps
  Start detailed requirement discussion with experiments and other interested institutes
  Continue Scalability & Performance Tests
    – Storage Manager: larger files (>100GB)
    – Page Server: connections > 500
    – Lock Server: number of locks > 20k
    – C++ Binding & Schema Manager: port Geant4 persistency examples and Conditions-DB
  By summer this year
    – Written Architectural Overview of the Prototype
    – Development Plan with detailed estimate of required manpower
    – Single user toy-system
Summary & Conclusions
  We identified solutions for most critical components of a scalable and performant ODBMS
    – Prototype implementation shows promising performance and scalability
    – Using a strict component approach allows the effort to be split into independently developed, replaceable modules.
  The development of an Open Source ODBMS seems possible within the HEP or general science community
  A collaborative effort of the order of 15 person years seems sufficient to produce such a system with production quality
The End
Exploit Read-Only Data
  Most of our data volume follows the pattern
    – (private) write-once,
    – share read-only
    – e.g. raw data is never updated, reconstructed data is not updated but replaced
  Current ODBMS implementations do not really take advantage of this fact
    – read-only files
        no need to obtain any locks for this data
        no need to ever update cache content
        simple backup strategy
  Using the concept of read-only files
    – e.g. in the catalogue
    – should significantly reduce the locking overhead and improve the scalability of the system with many concurrent clients
Transactions and Recovery
  Shadow Paging
    – Physical pages on disk are accessed indirectly through a translation table (page map).
    – Copy-on-Write: page modifications are always written to a new, free physical page
    – Changed physical pages are made visible to other transactions by updating the page map at commit time.
  [Figure: sequence of physical pages: Master, PageMap 1, Data 2, Data 3, Data 4, Data 5, PageMap 2; step labels 1-3 and 6-8]
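The three shadow-paging rules above can be sketched with an in-memory model: logical pages are resolved through a page map, each write goes to a fresh physical page, and commit publishes the transaction's private page map. The names `ShadowStore` and `Transaction` are hypothetical, pages are modelled as strings, and the sketch ignores disk I/O, free-page reuse and concurrent committers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

namespace shadow_sketch {

using LogicalPage = std::size_t;
using PhysicalPage = std::size_t;
using PageMap = std::map<LogicalPage, PhysicalPage>;

class ShadowStore {
public:
    class Transaction {
    public:
        // Start from a private copy of the committed page map.
        explicit Transaction(ShadowStore& s) : store_(s), map_(s.committed_) {}

        // Copy-on-write: never overwrite, always use a new physical page.
        void write(LogicalPage p, const std::string& data) {
            store_.physical_.push_back(data);
            map_[p] = store_.physical_.size() - 1;
        }

        // Reads inside the transaction see its own uncommitted writes.
        std::string read(LogicalPage p) const {
            return store_.physical_.at(map_.at(p));
        }

        // Commit publishes the new page map in one step.
        void commit() { store_.committed_ = map_; }

    private:
        ShadowStore& store_;
        PageMap map_;
    };

    // Readers outside a transaction go through the committed page map.
    std::string read(LogicalPage p) const {
        return physical_.at(committed_.at(p));
    }

    std::size_t physicalPageCount() const { return physical_.size(); }

private:
    std::vector<std::string> physical_;
    PageMap committed_;
};

}  // namespace shadow_sketch
```

Aborting a transaction in this model simply means dropping the `Transaction` object without calling `commit()`; the committed page map still points at the old physical pages, which is also what makes recovery after a crash straightforward.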
Advantages of this Approach
  Single files or complete domains can be used stand-alone without modification
    – e.g. set of user files containing tags and histograms
  Local OIDs could be stored in a more compact form
    – transparent expansion into a full OID as they are read into memory
  “Attaching” or direct sharing of files or complete domains does not need any special treatment
    – no OID translation needed
    – read-only files/domains can directly be shared by multiple federations
  Domains allow the store to be segmented into “coherent regions” of associated objects
    – Efficient distribution, backup and replication of subsets of the data (e.g. a run period, a set of user tracks)
    – Consistency checks can be constrained to a single domain
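The compact form of local OIDs could look like the following sketch: a reference to an object in the same file stores only Page# and Object# (48 bits), and the Domain# and File# of the containing file are filled in when the reference is read into memory. The slides do not specify the compact encoding, so `LocalRef`, `expand` and `tryCompact` are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

namespace compact_sketch {

// Full OID with the field widths from the OID layout slide.
struct Oid {
    std::uint32_t domain;
    std::uint16_t file;
    std::uint32_t page;
    std::uint16_t object;
};

// Hypothetical on-disk form of a reference within the same file.
struct LocalRef {
    std::uint32_t page;
    std::uint16_t object;
};

// Expand a stored local reference using the location of the file it was read from.
inline Oid expand(const LocalRef& r, std::uint32_t domain, std::uint16_t file) {
    return Oid{domain, file, r.page, r.object};
}

// Compact a full OID if it points into the given file; otherwise report failure
// so the caller can store the full 96-bit form instead.
inline bool tryCompact(const Oid& target, std::uint32_t domain, std::uint16_t file,
                       LocalRef& out) {
    if (target.domain != domain || target.file != file) return false;
    out = LocalRef{target.page, target.object};
    return true;
}

}  // namespace compact_sketch
```

Since a local reference carries no Domain# or File#, a file holding only such references can be attached to another domain or federation without rewriting them, which matches the "no OID translation needed" point above.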
Common Services
  Services and Interfaces of global visibility
    – OID, IStorageMgr, IPageServer, ILockServer, ISchemaMgr
    – Platform & OS abstraction
        fixed range types, I/O primitives, process control
    – component interface
        version & configuration control
        component factory
    – extendible diagnostics
        named counters, timers to instrument the code
        each component may have a sub-tree of diagnostic items
    – error & debug message handler
        syslog like: component, level, message
    – exception base class
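The extendible diagnostics item (named counters arranged in a sub-tree per component) might be modelled as below. This is a sketch under assumptions: `DiagNode` and its methods are hypothetical names, only counters are shown (timers are omitted), and the counter and component names in the usage are made up.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>

namespace diag_sketch {

// One node of the diagnostics tree: named counters plus named child nodes.
class DiagNode {
public:
    // Return the child with this name, creating it on first use.
    DiagNode& child(const std::string& name) {
        std::unique_ptr<DiagNode>& slot = children_[name];
        if (!slot) slot.reset(new DiagNode());
        return *slot;
    }

    // Increment a named counter on this node.
    void count(const std::string& counter, std::uint64_t n = 1) {
        counters_[counter] += n;
    }

    // Value of a counter on this node only (0 if never incremented).
    std::uint64_t value(const std::string& counter) const {
        auto it = counters_.find(counter);
        return it == counters_.end() ? 0 : it->second;
    }

    // Sum of a counter over this node and its whole sub-tree.
    std::uint64_t total(const std::string& counter) const {
        std::uint64_t sum = value(counter);
        for (const auto& kv : children_) sum += kv.second->total(counter);
        return sum;
    }

private:
    std::map<std::string, std::uint64_t> counters_;
    std::map<std::string, std::unique_ptr<DiagNode>> children_;
};

}  // namespace diag_sketch
```

Each component would own one sub-tree (for example `root.child("StorageMgr")`) and a monitoring tool could walk the tree or ask for totals without knowing the components in advance.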
Espresso Schema Extraction
  Currently implemented
      extraction based on the “stabs” standard format for debugging information (used by egcs and Sun CC)
      based on GNU “BFD” library and “objdump” source code
  Prototype provides full runtime reflection for C++ data
      describes classes and structs with their fields and inheritance
      supports namespaces, typedefs, enums and templates
      location and value of virtual function and virtual base class pointers
      sufficient to allow runtime field by field consistency check against persistent schema
  Starting from a modified egcs front-end as schema extractor would be an alternative