Pond: The OceanStore Prototype
Introduction
- Problem: Rising cost of storage management
- Observations:
  - Universal connectivity via Internet
  - $100 terabyte storage within three years
- Solution: OceanStore
OceanStore
- Internet-scale
- Cooperative file system
- High durability
- Universal availability
- Two-tier storage system
  - Upper tier: powerful servers
  - Lower tier: less powerful hosts
OceanStore
More on OceanStore
- Unit of storage: data object
- Applications: email, UNIX file system
- Requirements for the object interface
  - Information universally accessible
  - Balance between privacy and sharing
  - Simple and usable consistency model
  - Data integrity
OceanStore Assumptions
- Infrastructure untrusted except in aggregate
  - Most nodes are not faulty or malicious
- Infrastructure constantly changing
- Resources enter and exit the network without prior warning
- Self-organizing, self-repairing, self-tuning
OceanStore Challenges
- Expressive storage interface
- High durability on untrusted and changing base
Data Model
The view of the system that is presented to client applications
Storage Organization
- OceanStore data object ~= file
- Ordered sequence of read-only versions
  - Every version of every object kept forever
  - Can be used as backup
- An object contains metadata, data, and references to previous versions
Storage Organization
- A stream of objects identified by AGUID
  - Active globally-unique identifier
  - Cryptographically-secure hash of an application-specific name and the owner’s public key
  - Prevents namespace collisions
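A minimal sketch of how such an identifier could be derived. The hash function (SHA-256), the length-prefixed encoding, and the names `aguid`, `alice_key`, and `bob_key` are assumptions made for illustration; the slides only say that the AGUID is a secure hash of the name and the owner's public key.

```python
import hashlib


def aguid(app_name: str, owner_public_key: bytes) -> str:
    """Hash an application-specific name together with the owner's public key."""
    name_bytes = app_name.encode("utf-8")
    h = hashlib.sha256()  # assumption: any cryptographically secure hash would do
    # Length-prefix the name so the (name, key) boundary is unambiguous.
    h.update(len(name_bytes).to_bytes(4, "big"))
    h.update(name_bytes)
    h.update(owner_public_key)
    return h.hexdigest()


# Placeholder byte strings standing in for real public keys.
alice_key = b"alice-public-key-bytes"
bob_key = b"bob-public-key-bytes"

# Two owners can pick the same name without colliding in the global namespace.
same_name_differs = aguid("/home/notes.txt", alice_key) != aguid("/home/notes.txt", bob_key)
print(same_name_differs)
```

Because the owner's key is part of the hash input, two owners who choose the same application-level name still end up with different AGUIDs.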
Storage Organization
- Each version of a data object is stored in a B-tree-like data structure
  - Each block has a BGUID
    - Cryptographically-secure hash of the block content
  - Each version has a VGUID
  - Two versions may share blocks
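The block sharing described here can be sketched with a content-addressed store. This is a simplified illustration, not Pond's layout: it uses a single flat index block instead of a B-tree, assumes SHA-256, assumes the VGUID is the hash of that index block, and the names `store`, `put_block`, and `put_version` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow (assumption)
store = {}      # BGUID -> block bytes; stands in for the distributed block store


def put_block(data: bytes) -> str:
    """Store a read-only block under the hash of its content and return its BGUID."""
    bguid = hashlib.sha256(data).hexdigest()
    store[bguid] = data  # identical content always maps to the same key
    return bguid


def put_version(content: bytes) -> tuple[str, list[str]]:
    """Store one version; return (VGUID, BGUIDs of its data blocks)."""
    leaves = [put_block(content[i:i + BLOCK_SIZE])
              for i in range(0, len(content), BLOCK_SIZE)]
    index_block = "\n".join(leaves).encode()  # flat index instead of a real B-tree
    return put_block(index_block), leaves


v1, blocks1 = put_version(b"aaaabbbbcccc")
v2, blocks2 = put_version(b"aaaaXXXXcccc")  # only the middle block changed

shared = set(blocks1) & set(blocks2)
print(len(shared), len(store))
```

Only the changed data block and a new index block are added for the second version; the unchanged blocks are stored once and referenced by both versions.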
Storage Organization
Application-Specific Consistency
- An update is the operation of adding a new version to the head of a version stream
- Updates are applied atomically
  - Represented as an array of potential actions
  - Each guarded by a predicate
Application-Specific Consistency
- Example actions
  - Replacing some bytes
  - Appending new data to an object
  - Truncating an object
- Example predicates
  - Check for the latest version number
  - Compare bytes
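A rough sketch of a predicate-guarded update, using the example actions and predicates listed above. The types and helper names (`Version`, `version_is`, `bytes_equal`, `replace`, `append`, `truncate`, `apply_update`) are hypothetical, and the rule that the first clause whose predicate holds is the one applied is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Version:
    number: int
    data: bytes


# Predicates inspect the current head version.
def version_is(n: int) -> Callable[[Version], bool]:
    return lambda v: v.number == n


def bytes_equal(offset: int, expected: bytes) -> Callable[[Version], bool]:
    return lambda v: v.data[offset:offset + len(expected)] == expected


# Actions map old contents to new contents.
def replace(offset: int, new: bytes) -> Callable[[bytes], bytes]:
    return lambda d: d[:offset] + new + d[offset + len(new):]


def append(new: bytes) -> Callable[[bytes], bytes]:
    return lambda d: d + new


def truncate(length: int) -> Callable[[bytes], bytes]:
    return lambda d: d[:length]


def apply_update(stream: list, update: list) -> Optional[Version]:
    """Apply the first clause whose predicate holds; return the new head or None."""
    head = stream[-1]
    for predicate, actions in update:
        if predicate(head):
            data = head.data
            for action in actions:
                data = action(data)
            new_head = Version(head.number + 1, data)
            stream.append(new_head)  # one append: all actions become visible together
            return new_head
    return None  # no predicate held, so nothing changes


stream = [Version(1, b"hello world")]
update = [(version_is(1), [replace(0, b"HELLO"), append(b"!")])]

first = apply_update(stream, update)
second = apply_update(stream, update)  # same update again: head is now version 2
print(first, second)
```

The second attempt fails its version check and leaves the stream untouched, which is how a version-number predicate can stand in for optimistic concurrency control without locks.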
Application-Specific Consistency
- To implement ACID semantics
  - Check for readers
  - If none, update
- Append to a mailbox
  - No checking
- No explicit locks or leases
Application-Specific Consistency
- Predicate for reads
  - Examples
    - Can’t read something older than 30 seconds
    - Can only read data from a specific time frame
System Architecture
- Unit of synchronization: data object
- Changes to different objects are independent
Virtualization through Tapestry
- Resources are virtual and not tied to particular hardware
- A virtual resource has a GUID (globally unique identifier)
- Use Tapestry, a decentralized object location and routing system
  - Scalable overlay network, built on TCP/IP
Virtualization through Tapestry
- Use GUIDs to address hosts and resources
- Hosts publish the GUIDs of their resources in Tapestry
- Hosts can also unpublish GUIDs and leave the network
Replication and Consistency
- A data object is a sequence of read-only versions, consisting of read-only blocks, named by BGUIDs
- No issues for replication
- The mapping from AGUID to the latest VGUID may change
- Use primary-copy replication
Replication and Consistency
- The primary copy
  - Enforces access control
  - Serializes concurrent updates
Archival Storage
- Replication: 2x storage to tolerate one failure
- Erasure code is much better
  - A block is divided into m fragments
  - The m fragments are encoded into n > m fragments
  - Any m fragments can restore the original block
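To make the "any m of n" property concrete, here is a deliberately simple erasure code: m data fragments plus one XOR parity fragment, so n = m + 1. This is not the code Pond uses and it survives only a single lost fragment; it is a stand-in chosen because it fits in a few lines. The names `encode` and `decode` are hypothetical.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block: bytes, m: int) -> list[bytes]:
    """Split a block into m equal fragments and add one XOR parity fragment."""
    if len(block) % m != 0:
        raise ValueError("block length must be a multiple of m in this sketch")
    size = len(block) // m
    data = [block[i * size:(i + 1) * size] for i in range(m)]
    return data + [reduce(xor, data)]  # fragment index m is the parity


def decode(fragments: dict[int, bytes], m: int) -> bytes:
    """Rebuild the block from any m of the m + 1 fragments, keyed by index."""
    if len(fragments) < m:
        raise ValueError("need at least m fragments")
    data = dict(fragments)
    missing = [i for i in range(m) if i not in fragments]
    if missing:
        # With m fragments present, at most one data fragment is missing, and
        # XOR-ing everything that survived (parity included) reproduces it.
        data[missing[0]] = reduce(xor, fragments.values())
    return b"".join(data[i] for i in range(m))


block = b"oceanstore!!"
frags = encode(block, m=3)  # n = 4 fragments of 4 bytes each

# Drop each fragment in turn and rebuild from the remaining three.
recovered = [decode({i: f for i, f in enumerate(frags) if i != lost}, m=3)
             for lost in range(len(frags))]
print(all(r == block for r in recovered))
```

With m = 3 this code stores 4/3 of the original data and tolerates one lost fragment, compared with 2x for plain replication; codes with larger n - m tolerate more losses at the cost of more storage.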
Caching of Data Objects
- Reconstructing a block from erasure code is an expensive process
- Need to locate m fragments from m machines
- Use whole-block caching for frequently-read objects
Caching of Data Objects
- To read a block, look for the block first
- If not available
  - Find block fragments
  - Decode fragments
  - Publish that the host now caches the block
  - Amortize the cost of erasure encoding/decoding
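The read path above might look roughly like the following. Everything here is a stand-in: `whole_blocks` plays the role of blocks that some host has published as cached, `fragment_store` plays the role of the archival fragments, and `toy_decode` simply joins fragments rather than running a real erasure decoder.

```python
def toy_decode(fragments: list[bytes]) -> bytes:
    """Placeholder for the expensive erasure decoding step."""
    return b"".join(fragments)


def read_block(bguid: str, whole_blocks: dict, fragment_store: dict, stats: dict) -> bytes:
    # 1. Look for the whole block first.
    block = whole_blocks.get(bguid)
    if block is not None:
        return block
    # 2. Otherwise find the fragments and decode them.
    block = toy_decode(fragment_store[bguid])
    stats["decodes"] += 1
    # 3. Publish that this host now caches the block, so later reads skip step 2.
    whole_blocks[bguid] = block
    return block


whole_blocks = {}
fragment_store = {"bguid-1": [b"ocean", b"store"]}
stats = {"decodes": 0}

first = read_block("bguid-1", whole_blocks, fragment_store, stats)
second = read_block("bguid-1", whole_blocks, fragment_store, stats)
print(first, second, stats["decodes"])
```

The decode cost is paid once and then shared across later reads of the same block, which is the amortization the slide refers to.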
Caching of Data Objects
Updates are pushed to secondary replicas via an application-level multicast tree
The Full Update Path
- Serialized updates are disseminated via the multicast tree for an object
- At the same time, updates are encoded and fragmented for long-term storage
The Full Update Path
The Primary Replica
- Primary servers run a Byzantine agreement protocol
  - Need more than 2/3 nonfaulty participants
  - The number of messages required grows quadratically with the number of participants
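The two constraints can be worked through numerically. The function names are hypothetical, and `all_to_all_messages` counts one round in which every participant messages every other participant; that is an assumption used to illustrate quadratic growth, not the exact message count of the protocol.

```python
def max_faulty(n: int) -> int:
    """Largest f such that more than 2/3 of n participants remain nonfaulty."""
    return (n - 1) // 3


def min_ring_size(f: int) -> int:
    """Smallest ring that tolerates f faulty participants."""
    return 3 * f + 1


def all_to_all_messages(n: int) -> int:
    """Messages in one round where each participant sends to every other one."""
    return n * (n - 1)


for n in (4, 7, 10):
    print(n, max_faulty(n), all_to_all_messages(n))
```

A four-node ring tolerates one faulty member; tolerating two requires seven nodes, and the per-round message count grows from 12 to 42.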
Public-Key Cryptography
- Too expensive
- Use symmetric-key message authentication codes (MACs)
  - Two to three orders of magnitude faster
  - Downside: can’t prove the authenticity of a message to a third party
  - Used only for the inner ring
- Public-key cryptography for outer ring
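A small example of a symmetric-key MAC using Python's standard `hmac` module. HMAC-SHA256 and the key and message values are assumptions for illustration; the slides do not say which MAC construction Pond uses.

```python
import hashlib
import hmac


def make_mac(shared_key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(shared_key, message, hashlib.sha256).digest()


def verify_mac(shared_key: bytes, message: bytes, tag: bytes) -> bool:
    """Constant-time check that the tag matches the message under the shared key."""
    return hmac.compare_digest(make_mac(shared_key, message), tag)


key = b"key shared by two inner-ring servers"  # placeholder key
msg = b"update #17 for object X"
tag = make_mac(key, msg)

print(verify_mac(key, msg, tag))
print(verify_mac(key, b"tampered update", tag))
```

Both endpoints hold the same key, so either one could have produced the tag. That is why a MAC convinces the receiver but cannot prove authorship to a third party, and why signatures are still needed outside the inner ring.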
Proactive Threshold Signatures
- Byzantine agreement guarantees correctness if fewer than 1/3 of the servers fail during the life of the system
- Not practical for a long-lived system
- Need to reboot servers at regular intervals
- Key holders are fixed
Proactive Threshold Signatures
- Proactive threshold signatures
- More flexibility in choosing the membership of the inner ring
  - A public key is paired with a number of private keys
  - Each server uses its key to generate a signature share
Proactive Threshold Signatures
- Any k shares may be combined to produce a full signature
- To change membership of an inner ring
  - Regenerate signature shares
  - No need to change the public key
  - Transparent to secondary hosts
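The "any k shares" and "regenerate shares without changing the key" ideas can be illustrated with Shamir secret sharing. This is a related but different technique: it shares a secret number rather than producing signature shares, and it is not Pond's scheme. The example also uses a trusted dealer and Python's non-cryptographic `random` module purely for brevity. The names `make_shares` and `combine` are hypothetical.

```python
import random

P = 2**127 - 1  # a Mersenne prime used as the field modulus for this toy example


def make_shares(secret: int, k: int, n: int, rng: random.Random) -> dict[int, int]:
    """Evaluate a random degree-(k-1) polynomial with f(0) = secret at x = 1..n."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(k - 1)]

    def f(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % P
        return acc

    return {x: f(x) for x in range(1, n + 1)}


def combine(shares: dict[int, int]) -> int:
    """Recover f(0) from k shares by Lagrange interpolation at x = 0."""
    secret = 0
    for xi, yi in shares.items():
        num, den = 1, 1
        for xj in shares:
            if xj != xi:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret


rng = random.Random(0)  # fixed seed for reproducibility; not secure
secret = 123456789
old = make_shares(secret, k=3, n=5, rng=rng)

# Any 3 of the 5 shares recover the same value.
from_first_three = combine({x: old[x] for x in (1, 2, 3)})
from_other_three = combine({x: old[x] for x in (2, 4, 5)})

# "Regenerating" shares: a fresh polynomial for the same secret.
new = make_shares(secret, k=3, n=5, rng=rng)
from_new = combine({x: new[x] for x in (1, 3, 5)})
print(from_first_three == from_other_three == from_new == secret)
```

The regenerated shares differ from the old ones but reconstruct the same value, which mirrors how the inner ring's membership can change while secondary hosts keep verifying against the same public key.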
The Responsible Party
- Who chooses the inner ring?
- Responsible party:
  - A server that publishes sets of failure-independent nodes
    - Through offline measurement and analysis
Software Architecture
- Java atop the Staged Event Driven Architecture (SEDA)
  - Each subsystem is implemented as a stage
  - With its own state and thread pool
  - Stages communicate through events
  - 50,000 semicolons by five graduate students and many undergrad interns
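The staged structure can be sketched as follows. Pond itself is written in Java on SEDA; this Python version only mirrors the shape (a stage owns an event queue and a thread pool, and stages interact only by enqueueing events). The `Stage` class and the `apply` and `archive` stage names are hypothetical.

```python
import queue
import threading


class Stage:
    """A stage with its own event queue and its own pool of worker threads."""

    def __init__(self, name, handler, workers=2):
        self.name = name
        self.handler = handler
        self.events = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, event):
        """The only way other stages interact with this one."""
        self.events.put(event)

    def _run(self):
        while True:
            event = self.events.get()
            try:
                self.handler(event)
            finally:
                self.events.task_done()

    def drain(self):
        """Block until every event enqueued so far has been handled."""
        self.events.join()


archived = []
archived_lock = threading.Lock()


def archive_handler(event):
    with archived_lock:  # stage-private state guarded against its own workers
        archived.append(event)


archive = Stage("archive", archive_handler)
apply_stage = Stage("apply", lambda e: archive.enqueue(e.upper()))

for word in ["a", "b", "c"]:
    apply_stage.enqueue(word)

apply_stage.drain()  # all events have been forwarded to the archive stage
archive.drain()
print(sorted(archived))
```

Each stage can be sized and tuned independently because the only coupling between stages is the event queue.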
Software Architecture
Language Choice
- Java: speed of development
  - Strongly typed
  - Garbage collected
  - Reduced debugging time
  - Support for events
  - Easy to port multithreaded code in Java
    - Ported to Windows 2000 in one week
Language Choice
- Problems with Java:
  - Unpredictability introduced by garbage collection
  - Every thread in the system is halted while the garbage collector runs
  - Any on-going process stalls for ~100 milliseconds
  - May add several seconds to requests that travel across machines
Experimental Setup
- Two test beds
- Local cluster of 42 machines at Berkeley
  - Each with two 1.0 GHz Pentium III CPUs
  - 1.5 GB PC133 SDRAM
  - Two 36 GB hard drives, RAID 0
  - Gigabit Ethernet adaptor
  - Linux 2.4.18 SMP
Experimental Setup
- PlanetLab, ~100 nodes across ~40 sites
  - 1.2 GHz Pentium III, 1 GB RAM
  - ~1000 virtual nodes
Storage Overhead
- For 32 choose 16 erasure encoding
  - 2.7x for data > 8KB
- For 64 choose 16 erasure encoding
  - 4.8x for data > 8KB
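For comparison, the ideal overhead of an erasure code is n/m. The short calculation below reads "32 choose 16" as n = 32 fragments of which any m = 16 suffice, which is an interpretation of the slide's wording; the reported figures are taken from the slide.

```python
def ideal_overhead(n: int, m: int) -> float:
    """Storage blow-up of an m-of-n erasure code, ignoring all metadata."""
    return n / m


reported = {(32, 16): 2.7, (64, 16): 4.8}  # figures from the slide

for (n, m), seen in reported.items():
    ideal = ideal_overhead(n, m)
    print(f"{n} choose {m}: ideal {ideal:.1f}x, reported {seen}x, extra {seen - ideal:.1f}x")
```

The reported overheads exceed the ideal 2x and 4x by 0.7x and 0.8x respectively.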
The Latency Benchmark
- A single client submits updates of various sizes to a four-node inner ring
- Metric: time from before the request is signed until the signature over the result is checked
- Update 40 MB of data over 1000 updates, with 100 ms between updates
The Latency Benchmark
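Update Latency (ms)

| Key Size | Update Size | 5% Time | Median Time | 95% Time |
|---|---|---|---|---|
| 512 bits | 4 kB | 39 | 40 | 41 |
| 512 bits | 2 MB | 1037 | 1086 | 1348 |
| 1024 bits | 4 kB | 98 | 99 | 100 |
| 1024 bits | 2 MB | 1098 | 1150 | 1448 |

Latency Breakdown

| Phase | Time (ms) |
|---|---|
| Check | 0.3 |
| Serialize | 6.1 |
| Apply | 1.5 |
| Archive | 4.5 |
| Sign | 77.8 |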
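A quick calculation over the breakdown figures shows how much of the time goes to each phase. The numbers are copied from the breakdown above; the slide does not state which update size or key size the breakdown corresponds to, so no such link is assumed here.

```python
phases_ms = {
    "Check": 0.3,
    "Serialize": 6.1,
    "Apply": 1.5,
    "Archive": 4.5,
    "Sign": 77.8,
}

total_ms = sum(phases_ms.values())
share = {name: ms / total_ms for name, ms in phases_ms.items()}

for name, fraction in share.items():
    print(f"{name:<10} {fraction:6.1%}")
print(f"total      {total_ms:.1f} ms")
```

Signing accounts for roughly 86% of the 90.2 ms total, so the cost of the signature dominates the listed phases.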
The Throughput Microbenchmark
- A number of clients submit updates of various sizes to disjoint objects, to a four-node inner ring
- The clients
  - Create their objects
  - Synchronize themselves
  - Update the object as many times as possible for 100 seconds
The Throughput Microbenchmark
Archive Retrieval Performance
- Populate the archive by submitting updates of various sizes to a four-node inner ring
- Delete all copies of the data in its reconstructed form
- A single client submits reads
Archive Retrieval Performance
- Throughput:
  - 1.19 MB/s (PlanetLab)
  - 2.59 MB/s (local cluster)
- Latency
  - ~30-70 milliseconds
The Stream Benchmark
- Ran 500 virtual nodes on PlanetLab
  - Inner ring in SF Bay Area
  - Replicas clustered in 7 largest PlanetLab sites
- Streams updates to all replicas
  - One writer (the content creator) repeatedly appends to the data object
  - Others read new versions as they arrive
  - Measure network resource consumption
The Stream Benchmark
The Tag Benchmark
- Measures the latency of token passing
- OceanStore 2.2 times slower than TCP/IP
The Andrew Benchmark
- File system benchmark
- 4.6x faster than NFS in read-intensive phases
- 7.3x slower in write-intensive phases