Iris: A Scalable Cloud File System
with Efficient Integrity Checks
Marten van Dijk (RSA Labs), Ari Juels (RSA Labs), Alina Oprea (RSA Labs)
Emil Stefanov (UC Berkeley)
Can you trust the cloud?

[Figure: enterprises and users store data with cloud providers such as Dropbox, Amazon S3/EBS, Windows Azure Storage, SkyDrive, EMC Atmos, Mozy, iCloud, and Google Storage]

• Infrastructure bugs
• Malware
• Disgruntled employees
Iris File System
• Integrity verification (on the fly)
  – Integrity: value read == value written
  – Freshness: value read == last value written
  – Covers both data and metadata
• Proofs of Retrievability (PoR/PDP)
  – Verify that ALL of the data is on the cloud or recoverable
  – More on this later
• High performance (low overhead)
  – Hundreds of MB/s data rates
  – Designed for enterprises
Iris Deployment Scenario
[Figure: enterprise clients connect through one to five lightweight, distributed portal appliances to heavyweight cloud storage holding TBs to PBs of data]
Overview: File System Tree
• Most file systems maintain a file-system tree
• It contains:
  – Directory structure
  – File names
  – Timestamps
  – Permissions
  – Other attributes
• Efficiently laid out on disk (e.g., using a B-tree)
Overview: Merkle Trees
• Parents contain the hash of their children:
  h_a = H(B || C),  h_b = H(D || E),  h_c = H(...)
• To verify that an element (e.g., "y") is in the tree, recompute the hashes along its root-to-leaf path (a sketch follows below).

[Figure: binary tree with root A, children B and C, B's children D and E, and leaves x and y; the nodes accessed on y's verification path are highlighted]
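As a concrete illustration, here is a minimal Python sketch of that check. It is not the Iris implementation: SHA-256 and the bottom-up (sibling, side) interface are assumptions made for the example.

```python
import hashlib

def H(data: bytes) -> bytes:
    """Collision-resistant hash; SHA-256 stands in for whatever Iris uses."""
    return hashlib.sha256(data).digest()

def verify_leaf(leaf: bytes, siblings: list, root: bytes) -> bool:
    """Recompute hashes from a leaf up to the trusted root.

    `siblings` lists, bottom-up, (sibling_hash, side) pairs, where `side`
    says whether the sibling sits to the 'left' or 'right' of the path.
    """
    h = H(leaf)
    for sib, side in siblings:
        h = H(sib + h) if side == "left" else H(h + sib)
    return h == root  # matches iff the leaf really is in the tree
```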
Iris: Unified File System + Merkle Tree
• The file-system tree is also a Merkle tree
• It is built from three layers:
  – Directory tree
  – File version trees
  – File blocks
• Free list: stores deleted subtrees

[Figure: a unified tree rooted at /u/ with subdirectory v/ and entries a, b, c, e, f, g; the directory tree sits on top, file version trees hang below it, file blocks form the leaves, and a free list holds deleted subtrees]
• Binary tree
  – Balancing nodes
• Directory tree
  – Root node: directory attributes
  – Leaves: subdirectories and files
• File version tree
  – Root node: file attributes
  – Leaves: file block version numbers
File Version Tree
• Each file has a version tree
• Version numbers increase when blocks are modified
• Version numbers propagate upwards to the version-tree root

[Figure: a version tree over blocks 0-7 (leaf ranges 0:1, 2:3, 4:5, 6:7; internal ranges 0:3, 4:7, 0:7); three snapshots show block versions advancing from v0 to v1 to v2, with each write propagating its new version number up to the root]
File Version Tree
• The process repeats for every write
• Version numbers are unique after each write
  – Helps ensure freshness (see the sketch below)
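A minimal sketch of this propagation, assuming a binary version tree in which each node covers a block-index range. The node layout and version-assignment policy here are simplifications for illustration, not Iris's actual structures.

```python
class VersionNode:
    """Version-tree node covering block indices [lo, hi]."""
    def __init__(self, lo: int, hi: int, left=None, right=None):
        self.lo, self.hi = lo, hi
        self.left, self.right = left, right
        self.version = 0  # every block starts at v0

def record_write(node: VersionNode, block: int, new_version: int) -> None:
    """On a block write, raise versions along the path covering `block`.

    Written top-down for simplicity; the effect is the same as on the
    slide: the new version number propagates up to the version-tree root.
    """
    node.version = max(node.version, new_version)
    if node.left is not None:  # descend toward the leaf covering `block`
        child = node.left if block <= node.left.hi else node.right
        record_write(child, block, new_version)
```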
Integrity Verification: MACs
• For each file, Iris generates a MAC file
  – Later used to verify the integrity of data blocks
  – 4 KB blocks
• Each MAC is computed over:
  – file id, block index, version number, block data

  m_i = MAC(fid, i, v_i, b_i)

[Figure: each 4 KB data block b_i in the data file maps to a 20-byte MAC m_i in the MAC file]
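The per-block MAC can be sketched as follows. HMAC-SHA1 is an assumption chosen because its 20-byte output matches the tag size on the slide; Iris's actual MAC and field encoding may differ. Verification recomputes the tag over the block just read, using the version number from the version tree, and compares.

```python
import hashlib
import hmac
import struct

def block_mac(key: bytes, fid: int, index: int, version: int,
              block: bytes) -> bytes:
    """m_i = MAC(fid, i, v_i, b_i), with a 20-byte HMAC-SHA1 tag.

    Fixed-width encoding of (fid, index, version) avoids any ambiguity
    between the MAC's inputs.
    """
    header = struct.pack(">QQQ", fid, index, version)
    return hmac.new(key, header + block, hashlib.sha1).digest()  # 20 bytes
```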
Merkle Tree Efficiency

• Many FS operations access paths in the tree
• Inefficient to access one path at a time
  – Paths share ancestor nodes, so the same nodes are accessed over and over
  – Unnecessary I/O
  – Redundant Merkle tree crypto
  – Latency bound
• Accessing paths in parallel?
  – Naive techniques can lead to corruption: the same ancestor node may be accessed in separate threads
  – Need a Merkle tree cache
    • A very important part of our system
Merkle Tree Cache Challenges

• Nodes depend on each other
  – Parents contain hashes of children
  – Cannot evict a parent before its child
• Asynchrony
  – One thread per node/path would be inefficient
• Avoid unnecessary hashing
  – Nodes near the root of the tree are reused often
• Efficient sequential file operations
  – Accessing a full path per block would incur logarithmic overhead
  – Adjacent nodes must stay "long enough" in the cache
Merkle Tree Cache

[State diagram: a node moves through the states reading → to verify → verifying → pinned → unpinned → compacting → updating hash → ready to write → writing]

Nodes are read into the tree in parallel (the states are summarized in the sketch below).
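The states in the diagram can be written down as a simple enumeration. The names are taken from the diagram; this is a descriptive sketch, not Iris's code, and the transitions between states are walked through on the following slides.

```python
from enum import Enum, auto

class NodeState(Enum):
    """Cache states for a Merkle-tree node, as labeled in the diagram."""
    READING = auto()         # being fetched from cloud storage, in parallel
    TO_VERIFY = auto()       # fetched; waiting for its sibling
    VERIFYING = auto()       # checked against the already-verified parent
    PINNED = auto()          # in use by at least one async FS operation
    UNPINNED = auto()        # idle; eligible for eviction
    COMPACTING = auto()      # eviction step 1: merge redundant nodes
    UPDATING_HASH = auto()   # eviction step 2: recompute hashes bottom-up
    READY_TO_WRITE = auto()  # eviction step 3: queued for write-back
    WRITING = auto()         # being written to cloud storage
```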
Reading a Path

Path: "/u/v/b"

[Figure: the lookup walks the directory tree (/u/, then v/), then b's file version tree, then the data file and its MAC file; siblings such as a, c, e, f, and g are fetched along the way to verify the path]
Merkle Tree Cache

[State diagram as above]

When both siblings arrive, they are verified.
Top-down verification: a parent is verified before its children.
Verification

[Figure: root A with children B and C, and B's children D and E; arrows show each pair of children being verified against the hash stored in their already-verified parent]
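In code, the top-down step is a single comparison against the hash stored in the already-verified parent (a sketch, again assuming SHA-256):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_siblings(parent_stored_hash: bytes,
                    left: bytes, right: bytes) -> bool:
    """Once a parent is verified, the hash it stores authenticates both
    of its children in one comparison."""
    return H(left + right) == parent_stored_hash
```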
Merkle Tree Cache

[State diagram as above]

Verified nodes enter the "pinned" state.
Pinned nodes are used by asynchronous file system operations.
While used by at least one operation, nodes remain pinned.
Pinned nodes cannot be evicted.
Merkle Tree Cache

[State diagram as above]

When a node is no longer used, it becomes "unpinned".
Unpinned nodes are eligible for eviction.
When the cache is 75% full, eviction begins.
Merkle Tree Cache

[State diagram as above]

Eviction step #1: adjacent nodes with identical version numbers are compacted.
Compacting

[Figure: a version tree over blocks 0-15 (leaf ranges 8:9, 10:11, 12:13, 14:15; internal ranges 0:3, 4:7, 8:11, 12:15) with mixed v1/v2 versions compacts into a much smaller tree in which each uniform subtree is represented by a single node]

• Keep a node:
  – if its version ≠ its parent's version
  – if it is needed for balancing
• Redundant information is stripped out
• Files are often written sequentially, so their version trees compact to a single node (see the sketch below).
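A sketch of the compaction rule, using a hypothetical node type; the balancing-node exception from the slide is omitted for brevity. Applied bottom-up, it collapses any subtree whose nodes all carry one version into its root, which is why sequential writes compact so well.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VNode:
    """Version-tree node; `left`/`right` are None for a leaf."""
    version: int
    left: Optional["VNode"] = None
    right: Optional["VNode"] = None

def compact(n: VNode) -> None:
    """Prune children whose versions equal the parent's: the parent alone
    then describes the whole block range (the 'version != parent version'
    keep-rule from the slide, inverted)."""
    if n.left is None or n.right is None:
        return
    compact(n.left)
    compact(n.right)
    if (n.left.left is None and n.right.left is None
            and n.left.version == n.version == n.right.version):
        n.left = n.right = None  # children are redundant; drop them
```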
Merkle Tree Cache

[State diagram as above]

Eviction step #2: hashes are then updated in bottom-up order.
Merkle Tree Cache

[State diagram as above]

Eviction step #3: nodes are written to cloud storage.
Merkle Tree Cache

[State diagram as above]

Note: a node can be pinned at any time during eviction; the path to the node then becomes "pinned" again.
Merkle Tree Cache: Crucial for Real-World Workloads

• Iris benefits from locality
• Only a very small cache is required to achieve high throughput
  – Cache size: 5 MB to 10 MB
Sequential Workloads
• Results
  – 250 to 300 MB/s
  – 100+ clients
• Cache
  – Minimal cache size (< 1 MB) to achieve high throughput
  – Reason: nodes get compacted
  – Usually network bound
Random Workloads
• Results
  – Bound by disk seeks
• Cache
  – Minimal cache size (< 1 MB) to achieve seek-bound throughput
  – Cache only used to achieve parallelism to combat latency
  – Reason: very little locality
Other Workloads

• Performance is highly workload dependent
• Specifically, it depends on the number of seeks
• Iris is designed to reduce Merkle tree seek overhead via:
  – Compacting
  – The Merkle tree cache
Proofs of Retrievability

• How can we be sure our data is still there?
• Iris continuously verifies that the cloud possesses all data
• First sublinear solution to the open problem of dynamic Proofs of Retrievability
Proofs of Retrievability

• Iris verifies that the cloud possesses 99.9% of the data (with high probability).
• The remaining 0.1% can be recovered using Iris's parity data structure.
• Custom-designed error-correcting code (ECC) and parity data structure
  – High throughput (300-550 MB/s)
ECC Challenges

• Update efficiency
  – Want a high-throughput file system
  – Encoding happens on the fly
  – The ECC should not be a bottleneck
  – Reed-Solomon codes are too slow
• Hiding the code structure
  – The adversary should not know which blocks to corrupt to make the ECC fail
  – Adversarially secure ECC
• Variable-length encoding
  – Handles blocks, file attributes, Merkle tree nodes, etc.
Iris Error-Correcting Code

[Figure: each block on the file system, addressed by (stripe, offset), maps to positions in the ECC parity stripes]

• Pseudorandom error-correcting code
• A keyed mapping takes each file-system position to its corresponding parities (see the sketch below).
• The cloud does not know the key, so it cannot determine which 0.1% subset of the data to corrupt to make the ECC fail.
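A sketch of such a keyed mapping, using HMAC as the pseudorandom function. The slide elides Iris's actual construction, so `parity_positions` and its parameters are purely illustrative; the point it demonstrates is that without `key` the cloud cannot tell which parities cover which blocks.

```python
import hashlib
import hmac
import struct

def parity_positions(key: bytes, block_addr: int,
                     num_stripes: int, stripe_len: int) -> tuple:
    """Keyed pseudorandom map from a file-system block address to the
    (stripe, offset) of the parity that the block updates."""
    digest = hmac.new(key, struct.pack(">Q", block_addr),
                      hashlib.sha256).digest()
    r = int.from_bytes(digest, "big")
    return (r % num_stripes, (r // num_stripes) % stripe_len)
```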
Iris Error-Correcting Code

[Figure: same file-system-to-parity-stripe mapping as above]

• Memory:
• Update time:
• Verification time:
  – Amortized cost
ECC Update Efficiency
• Very fast
  – 300-550 MB/s
• Not a bottleneck in Iris
Conclusion
• Presented the Iris file system
  – Integrity
  – Proofs of retrievability / data possession
  – On the fly
• Very practical
  – Overall system throughput: 250-300 MB/s per portal
  – Scales to enterprises