Iris: A Scalable Cloud File System
with Efficient Integrity Checks
Marten van Dijk (RSA Labs), Ari Juels (RSA Labs), Alina Oprea (RSA Labs)
Emil Stefanov (UC Berkeley)
Can you trust the cloud?

[Figure: enterprises and users store data with cloud providers such as Dropbox, Amazon S3/EBS, Windows Azure Storage, SkyDrive, EMC Atmos, Mozy, iCloud, and Google Storage]

• Infrastructure bugs
• Malware
• Disgruntled employees
Iris File System
• Integrity verification (on the fly)
  – Integrity: value read == value written
  – Freshness: value read == last value written
  – Covers both data and metadata
• Proofs of Retrievability (PoR/PDP)
  – Verify that ALL of the data is on the cloud or recoverable
  – More on this later
• High performance (low overhead)
  – Hundreds of MB/s data rates
  – Designed for enterprises
Iris Deployment Scenario
[Figure: enterprise clients connect through one to five lightweight, distributed portal appliances to heavyweight cloud storage holding TBs to PBs of data]
Overview: File System Tree
• Most file systems maintain a file-system tree
• It contains:
  – Directory structure
  – File names
  – Timestamps
  – Permissions
  – Other attributes
• Efficiently laid out on disk (e.g., using a B-tree)
Overview: Merkle Trees
• Parents contain the hash of their children:
  h_a = H(B || C),  h_b = H(D || E),  h_c = H(...)
• To verify that an element (e.g., "y") is in the tree, recompute the hashes along its root-to-leaf path (a sketch follows below).

[Figure: binary tree with root A, children B and C, B's children D and E, and leaves x and y; the nodes accessed on y's verification path are highlighted]
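As a concrete illustration, here is a minimal Python sketch of that check. It is not the Iris implementation: SHA-256 and the bottom-up (sibling, side) interface are assumptions made for the example.

```python
import hashlib

def H(data: bytes) -> bytes:
    """Collision-resistant hash; SHA-256 stands in for whatever Iris uses."""
    return hashlib.sha256(data).digest()

def verify_leaf(leaf: bytes, siblings: list, root: bytes) -> bool:
    """Recompute hashes from a leaf up to the trusted root.

    `siblings` lists, bottom-up, (sibling_hash, side) pairs, where `side`
    says whether the sibling sits to the 'left' or 'right' of the path.
    """
    h = H(leaf)
    for sib, side in siblings:
        h = H(sib + h) if side == "left" else H(h + sib)
    return h == root  # matches iff the leaf really is in the tree
```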
Iris: Unified File System + Merkle Tree
• The file-system tree is also a Merkle tree
• It is built from three layers:
  – Directory tree
  – File version trees
  – File blocks
• Free list: stores deleted subtrees

[Figure: a unified tree rooted at /u/ with subdirectory v/ and entries a, b, c, e, f, g; the directory tree sits on top, file version trees hang below it, file blocks form the leaves, and a free list holds deleted subtrees]
• Binary tree
  – Balancing nodes
• Directory tree
  – Root node: directory attributes
  – Leaves: subdirectories and files
• File version tree
  – Root node: file attributes
  – Leaves: file block version numbers
File Version Tree
• Each file has a version tree
• Version numbers increase when blocks are modified
• Version numbers propagate upwards to the version-tree root

[Figure: a version tree over blocks 0-7 (leaf ranges 0:1, 2:3, 4:5, 6:7; internal ranges 0:3, 4:7, 0:7); three snapshots show block versions advancing from v0 to v1 to v2, with each write propagating its new version number up to the root]
File Version Tree
• The process repeats for every write
• Version numbers are unique after each write
  – Helps ensure freshness (see the sketch below)
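A minimal sketch of this propagation, assuming a binary version tree in which each node covers a block-index range. The node layout and version-assignment policy here are simplifications for illustration, not Iris's actual structures.

```python
class VersionNode:
    """Version-tree node covering block indices [lo, hi]."""
    def __init__(self, lo: int, hi: int, left=None, right=None):
        self.lo, self.hi = lo, hi
        self.left, self.right = left, right
        self.version = 0  # every block starts at v0

def record_write(node: VersionNode, block: int, new_version: int) -> None:
    """On a block write, raise versions along the path covering `block`.

    Written top-down for simplicity; the effect is the same as on the
    slide: the new version number propagates up to the version-tree root.
    """
    node.version = max(node.version, new_version)
    if node.left is not None:  # descend toward the leaf covering `block`
        child = node.left if block <= node.left.hi else node.right
        record_write(child, block, new_version)
```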
Integrity Verification: MACs
• For each file, Iris generates a MAC file
  – Later used to verify the integrity of data blocks
  – 4 KB blocks
• Each MAC is computed over:
  – file id, block index, version number, block data

  m_i = MAC(fid, i, v_i, b_i)

[Figure: each 4 KB data block b_i in the data file maps to a 20-byte MAC m_i in the MAC file]
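The per-block MAC can be sketched as follows. HMAC-SHA1 is an assumption chosen because its 20-byte output matches the tag size on the slide; Iris's actual MAC and field encoding may differ. Verification recomputes the tag over the block just read, using the version number from the version tree, and compares.

```python
import hashlib
import hmac
import struct

def block_mac(key: bytes, fid: int, index: int, version: int,
              block: bytes) -> bytes:
    """m_i = MAC(fid, i, v_i, b_i), with a 20-byte HMAC-SHA1 tag.

    Fixed-width encoding of (fid, index, version) avoids any ambiguity
    between the MAC's inputs.
    """
    header = struct.pack(">QQQ", fid, index, version)
    return hmac.new(key, header + block, hashlib.sha1).digest()  # 20 bytes
```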
Merkle Tree Efficiency

• Many FS operations access paths in the tree
• Inefficient to access one path at a time
  – Paths share ancestor nodes, so the same nodes are accessed over and over
  – Unnecessary I/O
  – Redundant Merkle tree crypto
  – Latency bound
• Accessing paths in parallel?
  – Naive techniques can lead to corruption: the same ancestor node may be accessed in separate threads
  – Need a Merkle tree cache
    • A very important part of our system
Merkle Tree Cache Challenges

• Nodes depend on each other
  – Parents contain hashes of children
  – Cannot evict a parent before its child
• Asynchrony
  – One thread per node/path would be inefficient
• Avoid unnecessary hashing
  – Nodes near the root of the tree are reused often
• Efficient sequential file operations
  – Accessing a full path per block would incur logarithmic overhead
  – Adjacent nodes must stay "long enough" in the cache
Merkle Tree Cache

[State diagram: a node moves through the states reading → to verify → verifying → pinned → unpinned → compacting → updating hash → ready to write → writing]

Nodes are read into the tree in parallel (the states are summarized in the sketch below).
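The states in the diagram can be written down as a simple enumeration. The names are taken from the diagram; this is a descriptive sketch, not Iris's code, and the transitions between states are walked through on the following slides.

```python
from enum import Enum, auto

class NodeState(Enum):
    """Cache states for a Merkle-tree node, as labeled in the diagram."""
    READING = auto()         # being fetched from cloud storage, in parallel
    TO_VERIFY = auto()       # fetched; waiting for its sibling
    VERIFYING = auto()       # checked against the already-verified parent
    PINNED = auto()          # in use by at least one async FS operation
    UNPINNED = auto()        # idle; eligible for eviction
    COMPACTING = auto()      # eviction step 1: merge redundant nodes
    UPDATING_HASH = auto()   # eviction step 2: recompute hashes bottom-up
    READY_TO_WRITE = auto()  # eviction step 3: queued for write-back
    WRITING = auto()         # being written to cloud storage
```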
Reading a Path

Path: "/u/v/b"

[Figure: the lookup walks the directory tree (/u/, then v/), then b's file version tree, then the data file and its MAC file; siblings such as a, c, e, f, and g are fetched along the way to verify the path]
Merkle Tree Cache

[State diagram as above]

When both siblings arrive, they are verified.
Top-down verification: a parent is verified before its children.
Verification

[Figure: root A with children B and C, and B's children D and E; arrows show each pair of children being verified against the hash stored in their already-verified parent]
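In code, the top-down step is a single comparison against the hash stored in the already-verified parent (a sketch, again assuming SHA-256):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_siblings(parent_stored_hash: bytes,
                    left: bytes, right: bytes) -> bool:
    """Once a parent is verified, the hash it stores authenticates both
    of its children in one comparison."""
    return H(left + right) == parent_stored_hash
```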
Merkle Tree Cache

[State diagram as above]

Verified nodes enter the "pinned" state.
Pinned nodes are used by asynchronous file system operations.
While used by at least one operation, nodes remain pinned.
Pinned nodes cannot be evicted.
Merkle Tree Cache

[State diagram as above]

When a node is no longer used, it becomes "unpinned".
Unpinned nodes are eligible for eviction.
When the cache is 75% full, eviction begins.
Merkle Tree Cache

[State diagram as above]

Eviction step #1: adjacent nodes with identical version numbers are compacted.
Compacting

[Figure: a version tree over blocks 0-15 (leaf ranges 8:9, 10:11, 12:13, 14:15; internal ranges 0:3, 4:7, 8:11, 12:15) with mixed v1/v2 versions compacts into a much smaller tree in which each uniform subtree is represented by a single node]

• Keep a node:
  – if its version ≠ its parent's version
  – if it is needed for balancing
• Redundant information is stripped out
• Files are often written sequentially, so their version trees compact to a single node (see the sketch below).
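A sketch of the compaction rule, using a hypothetical node type; the balancing-node exception from the slide is omitted for brevity. Applied bottom-up, it collapses any subtree whose nodes all carry one version into its root, which is why sequential writes compact so well.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VNode:
    """Version-tree node; `left`/`right` are None for a leaf."""
    version: int
    left: Optional["VNode"] = None
    right: Optional["VNode"] = None

def compact(n: VNode) -> None:
    """Prune children whose versions equal the parent's: the parent alone
    then describes the whole block range (the 'version != parent version'
    keep-rule from the slide, inverted)."""
    if n.left is None or n.right is None:
        return
    compact(n.left)
    compact(n.right)
    if (n.left.left is None and n.right.left is None
            and n.left.version == n.version == n.right.version):
        n.left = n.right = None  # children are redundant; drop them
```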
Merkle Tree Cache

[State diagram as above]

Eviction step #2: hashes are then updated in bottom-up order.
Merkle Tree Cache

[State diagram as above]

Eviction step #3: nodes are written to cloud storage.
Merkle Tree Cache

[State diagram as above]

Note: a node can be pinned at any time during eviction; the path to the node then becomes "pinned" again.
Merkle Tree Cache: Crucial for Real-World Workloads

• Iris benefits from locality
• Only a very small cache is required to achieve high throughput
  – Cache size: 5 MB to 10 MB
Sequential Workloads
• Results
  – 250 to 300 MB/s
  – 100+ clients
• Cache
  – Minimal cache size (< 1 MB) to achieve high throughput
  – Reason: nodes get compacted
  – Usually network bound
Random Workloads
• Results
  – Bound by disk seeks
• Cache
  – Minimal cache size (< 1 MB) to achieve seek-bound throughput
  – Cache only used to achieve parallelism to combat latency
  – Reason: very little locality
Other Workloads

• Performance is highly workload dependent
• Specifically, it depends on the number of seeks
• Iris is designed to reduce Merkle tree seek overhead via:
  – Compacting
  – The Merkle tree cache
Proofs of Retrievability

• How can we be sure our data is still there?
• Iris continuously verifies that the cloud possesses all data
• First sublinear solution to the open problem of dynamic Proofs of Retrievability
Proofs of Retrievability

• Iris verifies that the cloud possesses 99.9% of the data (with high probability).
• The remaining 0.1% can be recovered using Iris's parity data structure.
• Custom-designed error-correcting code (ECC) and parity data structure
  – High throughput (300-550 MB/s)
ECC Challenges

• Update efficiency
  – Want a high-throughput file system
  – Encoding happens on the fly
  – The ECC should not be a bottleneck
  – Reed-Solomon codes are too slow
• Hiding the code structure
  – The adversary should not know which blocks to corrupt to make the ECC fail
  – Adversarially secure ECC
• Variable-length encoding
  – Handles blocks, file attributes, Merkle tree nodes, etc.
Iris Error-Correcting Code

[Figure: each block on the file system, addressed by (stripe, offset), maps to positions in the ECC parity stripes]

• Pseudorandom error-correcting code
• A keyed mapping takes each file-system position to its corresponding parities (see the sketch below).
• The cloud does not know the key, so it cannot determine which 0.1% subset of the data to corrupt to make the ECC fail.
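A sketch of such a keyed mapping, using HMAC as the pseudorandom function. The slide elides Iris's actual construction, so `parity_positions` and its parameters are purely illustrative; the point it demonstrates is that without `key` the cloud cannot tell which parities cover which blocks.

```python
import hashlib
import hmac
import struct

def parity_positions(key: bytes, block_addr: int,
                     num_stripes: int, stripe_len: int) -> tuple:
    """Keyed pseudorandom map from a file-system block address to the
    (stripe, offset) of the parity that the block updates."""
    digest = hmac.new(key, struct.pack(">Q", block_addr),
                      hashlib.sha256).digest()
    r = int.from_bytes(digest, "big")
    return (r % num_stripes, (r // num_stripes) % stripe_len)
```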
Iris Error-Correcting Code

[Figure: same file-system-to-parity-stripe mapping as above]

• Memory:
• Update time:
• Verification time:
  – Amortized cost
ECC Update Efficiency
• Very fast
  – 300-550 MB/s
• Not a bottleneck in Iris
Conclusion
• Presented the Iris file system
  – Integrity
  – Proofs of retrievability / data possession
  – On the fly
• Very practical
  – Overall system throughput: 250-300 MB/s per portal
  – Scales to enterprises