
  • 7/26/2019 RogerCummings Combining SNIA Cloud v7

    Combining SNIA Cloud, Tape and Container Format Technologies for the Long Term Retention of Big Data

    Roger Cummings, Antesignanus
    Co-Author: Simona Rabinovici-Cohen, IBM Research Haifa


    Combining SNIA Cloud, Tape and Container Format Technologies for the Long Term Retention of Big Data

    © 2013 Storage Networking Industry Association. All Rights Reserved.

    SNIA Legal Notice

    The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.

    Member companies and individual members may use this material in presentations and literature under the following conditions:

    Any slide or slides used must be reproduced in their entirety without modification

    The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

    This presentation is a project of the SNIA Education Committee.

    Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.

    The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.

    NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.


    Abstract

    Combining SNIA Cloud, Tape and Container Format Technologies for the Long Term Retention of Big Data

    Generating and collecting very large data sets is becoming a necessity in many domains that also need to keep that data for long periods. Examples include astronomy, atmospheric science, genomics, medical records, photographic archives, video archives, and large-scale e-commerce. While this presents significant opportunities, a key challenge is providing economically scalable storage systems to efficiently store and preserve the data, as well as to enable search, access, and analytics on that data in the far future.

    Both cloud and tape technologies are viable alternatives for storage of big data and SNIA supports their standardization. The SNIA Cloud Data Management Interface (CDMI) provides a standardized interface to create, retrieve, update, and delete objects in a cloud. The SNIA Linear Tape File System (LTFS) takes advantage of a new generation of tape hardware to provide efficient access to tape using standard, familiar system tools and interfaces. In addition, the SNIA Self-contained Information Retention Format (SIRF) defines a storage container for long term retention that will enable future applications to interpret stored data regardless of the application that originally produced it.

    This tutorial will present advantages and challenges in long term retention of big data, as well as initial work on how to combine SIRF with LTFS and SIRF with CDMI to address some of those challenges. SIRF with CDMI will also be examined in the European Union integrated research project ENSURE: Enabling kNowledge, Sustainability, Usability and Recovery for Economic value.


    Outline

    Introduction

    SNIA technologies
      Cloud Data Management Interface (CDMI)
      Linear Tape File System (LTFS)
      Self-contained Information Retention Format (SIRF)

    Combining SNIA technologies
      SIRF Serialization for CDMI
      SIRF Serialization for LTFS

    EU Enabling kNowledge, Sustainability, Usability and Recovery for Economic Value (ENSURE)

    Summary


    Big Data..

    Really is BIG..

    2.5 quintillion (10^18) bytes of new data created per day in 2012 (source: IBM)

    And the move to the Internet of Things is only going to increase this volume
      19.8 billion connected devices by 2020 (source: McKinsey)
      Only 4.2 billion smartphones and tablets, 3.4 billion PCs

    Data analytics is improving all the time
      Therefore historical information has significant value
      Apply new techniques and algorithms to gain new insights
      Need to ensure ALL necessary information is captured to extract full value

    Therefore Big Data has similarities to (long term) preservation


    The Need for Digital Preservation of Big Data

    Regulatory compliance and legal issues
      Sarbanes-Oxley, HIPAA, FRCP, intellectual property litigation

    Emerging web services and applications
      Email, photo sharing, web site archives, social networks, blogs

    Many other fixed-content repositories
      Scientific data, intelligence, libraries, movies, music

    Domains that have Big Data require preservation

    M&E: Film masters, outtakes, related artifacts (e.g., games). 100 years or more.

    Healthcare: X-rays are often stored for periods of 75 years. Records of minors are needed until 20 to 43 years of age.

    Scientific and Cultural: Satellite data is kept forever. We would like to keep digital art forever.


    SNIA Survey from 2007

    What does Long-Term Mean?

    Retention of 20 years or more is required by 70% of responses.

    Retention period    Share of responses
    >3-6 years          1.9%
    >7-10 years         12.3%
    >11-20 years        15.7%
    >21-50 years        13.1%
    >50-100 years       18.3%
    >100 years          38.8%

    Top External Factors Driving Long-Term Retention Requirements: Legal Risk, Compliance Regulations, Business Risk, Security Risk

    [Chart: percent of respondents (0% to 60%) citing each factor, grouped under Legal Risk, Compliance Requirements, Business Risk, Security Risk, and Other. Factors shown: concern with litigation protection; meeting regulatory requirements; protection from compliance or legal fines; retaining history for competitiveness or protection; protection of business or intellectual assets; protection of customer privacy; preservation of business history.]

    Source: SNIA 100 Year Archive Requirements Survey, January 2007.
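    The 70% figure can be cross-checked against the chart values. The sketch below assumes that the six percentages on the chart pair up, in order, with the six retention ranges; that pairing is inferred from the slide layout and is not stated in the source.

```python
# Share of responses per retention range, as read from the 2007 survey chart.
# Assumption: percentages and ranges pair up in the order they appear.
retention_share = {
    ">3-6 years": 1.9,
    ">7-10 years": 12.3,
    ">11-20 years": 15.7,
    ">21-50 years": 13.1,
    ">50-100 years": 18.3,
    ">100 years": 38.8,
}

# Ranges that lie entirely above 20 years.
over_20_years = [">21-50 years", ">50-100 years", ">100 years"]

share_over_20 = round(sum(retention_share[r] for r in over_20_years), 1)
total = round(sum(retention_share.values()), 1)

print(f"More than 20 years: {share_over_20}%")
print(f"All ranges: {total}%")
```

    Under that pairing the three longest ranges sum to 70.2%, which is consistent with the 70% stated on the slide. The ranges total 100.1%, presumably because the published figures are rounded.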


    Goals of Digital Preservation

    Digital assets stored now should remain
      Accessible
      Undamaged
      Usable

    For as long as desired, beyond the lifetime of
      Any particular storage system
      Any particular storage technology

    And at an affordable cost


    Real Life Example Problem

    2003
      To: [email protected]
      From: [email protected]
      Subject: Something or other

    2007
      To: [email protected]
      From: [email protected]
      Subject: Something else

    Same people? Could you PROVE it 20 years on?


    Cloud Data Management Interface (CDMI)

    Being developed by SNIA CDMI TWG

    The CDMI standard defines an interoperable format for moving data and associated metadata between cloud providers

    CDMI data objects can be accessed by standard browsers and internet tools (subject to the owner's access control lists)

    CDMI data objects may order data services from the cloud
      Secure Erasure, Encryption, Replication, Retention, Backup/Restore, Tiering, Hashing, Preservation, etc. (extensible)
      Done through Data System Metadata (key/value) on the Containers or Objects

    Has several implementations including OpenStack
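    Ordering data services through data system metadata might look like the following sketch, which only assembles a container-creation request and does not send it. The metadata names (cdmi_data_redundancy, cdmi_retention_period), the media type, and the version header follow our reading of the CDMI specification and should be checked against the version in use. The container name, the values, and the helper function are hypothetical.

```python
import json


def build_container_request(container_name, data_system_metadata):
    """Build (method, path, headers, body) for a CDMI container-creation request.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    headers = {
        "Content-Type": "application/cdmi-container",
        "Accept": "application/cdmi-container",
        "X-CDMI-Specification-Version": "1.0.2",  # assumed version
    }
    body = json.dumps({"metadata": data_system_metadata}, indent=2)
    return "PUT", f"/{container_name}/", headers, body


# Data system metadata: key/value pairs asking the cloud for data services.
# Names as we understand them from the CDMI specification; values are illustrative.
metadata = {
    "cdmi_data_redundancy": "3",  # desired number of copies
    "cdmi_retention_period": "2013-01-01T00:00:00Z/2033-01-01T00:00:00Z",
}

method, path, headers, body = build_container_request("PatientContainer", metadata)
print(method, path)
print(body)
```

    Keeping the service requests in metadata on the container means that objects created in it can share the same policy without any provider-specific calls.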


    Model for the CDMI Interface

    Resources accessed through RESTful interface [diagram not reproduced]


    Linear Tape File System (LTFS)

    A file system implemented on dual-partition linear tape
      Index Partition and Data Partition
      Index Partition is small (2 wraps, 37.5 GB out of 1.5 TB on LTO-5)
      Data Partition is the remainder of the tape

    File System module that implements a set of standard file system interfaces
      Implemented using FUSE on Linux and Mac OS X
      Windows implementation uses a FUSE-like framework

    Includes an on-tape structure used to track tape contents
      XML Index Schema

    Format becoming the standard for linear tape
      Formal standardization through SNIA LTFS TWG
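    Because LTFS presents the tape through standard file system interfaces, ordinary file APIs are enough to browse a mounted volume. The sketch below assumes the volume is already mounted as a directory; the function name is hypothetical, and a temporary directory stands in for the mount point so that the example runs without tape hardware.

```python
import tempfile
from pathlib import Path


def list_volume_files(mount_point):
    """Return (relative path, size in bytes) for every file under mount_point."""
    root = Path(mount_point)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


# Stand-in for a mounted LTFS volume; a real mount point would be used the same way.
with tempfile.TemporaryDirectory() as mount_point:
    (Path(mount_point) / "scans").mkdir()
    (Path(mount_point) / "scans" / "image1.dcm").write_bytes(b"\x00" * 16)
    (Path(mount_point) / "notes.txt").write_text("encounter notes")
    listing = list_volume_files(mount_point)

for name, size in listing:
    print(f"{size:>8}  {name}")
```

    Nothing in the function is tape-specific, which is the point of the slide: existing tools and scripts work unchanged once the volume is mounted.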


    Logical View of LTFS Volume

    [Diagram: the tape from BOT (beginning of tape) to EOT (end of tape), divided into an Index Partition and a Data Partition separated by Guard Wraps. Elements shown: the LTFS XML Index and files.]

    Check out SNIA Tutorial: Big Data Storage Options for Hadoop

    Check out SNIA Tutorial: Protecting Data in the "Big Data" World


    SIRF: Self-contained Information Retention Format

    An Analogy

    Standard physical archival box
      Archivists gather together a group of related items and place them in a physical box container
      The box is labeled with information about its content, e.g., name and reference number, date, contents description, destroy date

    SIRF is the digital equivalent
      Logical container for a set of (digital) preservation objects and a catalog
      The SIRF catalog contains metadata related to the entire contents of the container as well as to the individual objects
      SIRF standardizes the information in the catalog

    Photo courtesy Oregon State Archives

    Being developed by SNIA Long Term Retention (LTR) TWG


    SIRF Properties

    SIRF is a logical data format of a storage container appropriate for long term storage of digital information

    A storage container may comprise a logical or physical storage area considered as a unit.
      Examples: a file system, a tape, a block device, a stream device, an object store, a data bucket in a cloud storage

    Required Properties
      Self-describing: can be interpreted by different systems
      Self-contained: all data needed for the interpretation is in the container
      Extensible: so it can meet future needs


    SIRF Components

    A SIRF container includes:
      A magic object: identifies the SIRF container and its version
      Multiple preservation objects that are immutable
      A catalog that is
        Updatable
        Contains metadata to make the container and preservation objects portable into the future without external functions

    * Work-in-progress and less mature than CDMI and LTFS
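    The three components can be pictured as a small data model. This is only an illustrative sketch: the class and field names are hypothetical and are not taken from the SIRF specification. It shows the one behavioral distinction the slide makes, namely that preservation objects are immutable while the catalog can be updated.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MagicObject:
    """Identifies the container as SIRF and records its version."""
    specification: str
    sirf_level: int
    catalog_id: str


@dataclass(frozen=True)
class PreservationObject:
    """Immutable once stored: a frozen dataclass holding bytes."""
    object_id: str
    content: bytes


@dataclass
class SirfContainer:
    magic: MagicObject
    objects: dict = field(default_factory=dict)  # object_id -> PreservationObject
    catalog: dict = field(default_factory=dict)  # object_id -> metadata dict

    def add(self, po, metadata):
        """Store a new preservation object together with its catalog entry."""
        if po.object_id in self.objects:
            raise ValueError(f"{po.object_id} is already stored and cannot be replaced")
        self.objects[po.object_id] = po
        self.catalog[po.object_id] = dict(metadata)

    def update_metadata(self, object_id, **changes):
        """The catalog is updatable even though the objects are not."""
        self.catalog[object_id].update(changes)


container = SirfContainer(MagicObject("1111", 1, "sirfCatalog"))
container.add(PreservationObject("encounterJan2001", b"record"), {"format": "text"})
container.update_metadata("encounterJan2001", reviewed=True)
print(container.catalog)
```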


    Goals of SIRF Serialization for CDMI/LTFS

    The SIRF serializations for CDMI and LTFS specify how a CDMI container or LTFS tape can also become SIRF-compliant

    A SIRF-compliant CDMI container or LTFS tape enables a future CDMI/LTFS client to understand containers created by today's CDMI/LTFS client
      The properties of the future client are unknown to us today
      "Understand" means identifying the preservation objects in the container, the packaging format of each object, its fixity values, etc. (as defined in the SIRF catalog)


    SIRF Serialization for CDMI: Interface

    The CDMI API can be used to access the various preservation objects and the catalog object in a SIRF-compliant CDMI container

    Example
      Assume we have a cloud container named "PatientContainer" that is SIRF-compliant
        each encounter is a preservation object
        each image is a preservation object
        the container has a catalog object

      We can read the various preservation objects and the catalog object via the CDMI REST API as follows:
        GET /<PatientContainer>/encounterJan2001
        GET /<PatientContainer>/chestImage
        GET /<PatientContainer>/sirfCatalog

    [Diagram: PatientContainer holding three preservation objects (PO) and the catalog (cat)]
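    The three reads in this example could be prepared as follows. This is a sketch that only builds the requests: the endpoint URL and the helper function are hypothetical, and the Accept media type and version header follow our reading of the CDMI specification rather than anything stated on the slide.

```python
from urllib.parse import quote

BASE_URL = "https://cloud.example.com"  # hypothetical CDMI endpoint


def cdmi_get(container, object_name, base_url=BASE_URL):
    """Return (method, url, headers) for reading one object from a container."""
    url = f"{base_url}/{quote(container)}/{quote(object_name)}"
    headers = {
        "Accept": "application/cdmi-object",
        "X-CDMI-Specification-Version": "1.0.2",  # assumed version
    }
    return "GET", url, headers


# Two preservation objects and the catalog object, as named on the slide.
requests_to_send = [
    cdmi_get("PatientContainer", name)
    for name in ("encounterJan2001", "chestImage", "sirfCatalog")
]

for method, url, _headers in requests_to_send:
    print(method, url)
```

    The catalog is read with the same call as any preservation object, so a client needs no SIRF-specific transport, only the ability to interpret the catalog it gets back.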


    SIRF Serialization for CDMI

    PatientContainer

    SIRF magic object:
      specification = 1111
      SIRF level = 1
      Catalog object = sirfCatalog

    sirfCatalog
    {
      "encounterJan2001": [
        "IDs": [{ ...},]
        "Fixity": [{ ...},]
      ]
      "chestImage": [
        "IDs": [{ ...},]
        "Fixity": [{ ...},]
      ]
    }

    Simple PO: encounterJan2001

    Composite PO (chestImage): a chestImage manifest plus chestImage DICOM element objects (labeled dicom1 on the slide)
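    The catalog outline on this slide is schematic rather than valid JSON. One way it might be written as valid JSON is sketched below. Only the object names and the "IDs" and "Fixity" keys come from the slide; the fields inside them, the choice of SHA-256, and the helper functions are assumptions made for illustration.

```python
import hashlib
import json


def fixity_entry(content, algorithm="sha256"):
    """Hypothetical fixity record: digest algorithm plus hex digest of the content."""
    return {"algorithm": algorithm, "value": hashlib.new(algorithm, content).hexdigest()}


def catalog_entry(object_id, content):
    """Catalog entry with the two keys shown on the slide ("IDs", "Fixity")."""
    return {
        "IDs": [{"type": "name", "value": object_id}],  # inner fields are assumptions
        "Fixity": [fixity_entry(content)],
    }


# Placeholder contents; real preservation objects would be read from storage.
objects = {
    "encounterJan2001": b"encounter record",
    "chestImage": b"image bytes",
}

sirf_catalog = {name: catalog_entry(name, data) for name, data in objects.items()}
print(json.dumps(sirf_catalog, indent=2))
```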


    SIRF Serialization for CDMI: General

    A CDMI container also qualifies as a SIRF container when:
      The SIRF magic object is mapped to the CDMI container metadata and includes, for example, specification ID and version, SIRF level, SIRF catalog object ID
      The SIRF catalog is an object in the CDMI container, formatted in JSON
      A SIRF preservation object (PO) that is a simple object (contains one element) is mapped to a CDMI data object
        The simple object can be a tar/zip
      A SIRF PO that is a composite object (contains several elements) is mapped to a set of data objects (one for each element) and a manifest data object whose content includes the IDs and fixities of the element data objects
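    The composite-object mapping can be sketched as a function that turns one PO into a set of named data objects plus a manifest. The naming scheme, the manifest layout, the use of SHA-256, and the second element name (dicom2) are assumptions for illustration; the source only states that the manifest holds the IDs and fixities of the element objects.

```python
import hashlib
import json


def map_composite_po(po_name, elements):
    """Map a composite PO to data objects: one per element plus a manifest.

    elements: dict of element name -> bytes.
    Returns a dict of data object name -> bytes, including the manifest.
    """
    data_objects = {}
    manifest = []
    for element_name, content in elements.items():
        object_name = f"{po_name}.{element_name}"  # naming scheme is an assumption
        data_objects[object_name] = content
        manifest.append({
            "id": object_name,
            "fixity": {
                "algorithm": "sha256",
                "value": hashlib.sha256(content).hexdigest(),
            },
        })
    data_objects[f"{po_name}.manifest"] = json.dumps(manifest, indent=2).encode("utf-8")
    return data_objects


mapped = map_composite_po(
    "chestImage",
    {"dicom1": b"first image", "dicom2": b"second image"},  # hypothetical elements
)
for name in mapped:
    print(name, len(mapped[name]), "bytes")
```

    A simple PO needs none of this: it maps to a single data object, and its ID and fixity live in the catalog only.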


    SIRF Serialization for LTFS: General

    The SIRF catalog resides in the index partition
      An LTFS application has rules that indicate what to store in the index partition. These rules are used to place the SIRF catalog in the index partition.

    A SIRF preservation object (PO) that is a simple object (contains one element) is mapped to an LTFS file

    A SIRF PO that is a composite object (contains several elements) is mapped to a set of LTFS files (one for each element) and a manifest file whose content includes the IDs and fixities of the element data objects

    [Diagram: the index partition (IP) holds a Label Construct, a File Mark, the LTFS index, and the SIRF catalog; the data partition (DP) holds a Label Construct followed by the preservation objects]
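    Since each element of a composite PO is an ordinary LTFS file, its fixity can be re-checked with plain file reads. The sketch below assumes a JSON manifest that lists each file's ID and SHA-256 digest; that layout, the file names, and the function name are hypothetical. A temporary directory stands in for the mounted volume.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def verify_manifest(directory, manifest_name):
    """Recompute each element file's SHA-256 and compare it with the manifest.

    Returns a dict of file name -> True when the digest still matches.
    """
    root = Path(directory)
    manifest = json.loads((root / manifest_name).read_text())
    return {
        entry["id"]: hashlib.sha256((root / entry["id"]).read_bytes()).hexdigest()
        == entry["fixity"]["value"]
        for entry in manifest
    }


with tempfile.TemporaryDirectory() as volume:  # stand-in for a mounted LTFS volume
    root = Path(volume)
    elements = {"chestImage.dicom1": b"first image", "chestImage.dicom2": b"second image"}
    for name, content in elements.items():
        (root / name).write_bytes(content)
    manifest = [
        {"id": n, "fixity": {"algorithm": "sha256", "value": hashlib.sha256(c).hexdigest()}}
        for n, c in elements.items()
    ]
    (root / "chestImage.manifest").write_text(json.dumps(manifest))

    intact = verify_manifest(root, "chestImage.manifest")
    (root / "chestImage.dicom2").write_bytes(b"damaged")  # simulate corruption
    after_damage = verify_manifest(root, "chestImage.manifest")

print(intact)
print(after_damage)
```

    Placing the catalog in the index partition is a matter of LTFS configuration and is not shown here.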


    SIRF and LTFS: Label Construct

    The VOL1 Label includes, for example, volume identifier (6 bytes), implementation identifier (13 bytes), owner identifier (14 bytes).

    The LTFS Label includes, for example, creator, volume UUID, block size, compression, partition IDs.

    The SIRF Label is the magic object and includes, for example, specification ID and version, SIRF level.

    [Diagram: the Label Construct consists of the VOL1 Label (fixed-size, 80 bytes), the LTFS Label (XML), and the SIRF Label (XML)]
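    A fixed-size label like VOL1 can be split with a single struct format. The field widths for the volume, implementation, and owner identifiers come from the slide; the remaining fields and their order are our assumption, based on the conventional ANSI-style volume label layout, and should be verified against the LTFS format specification. The sample values are made up.

```python
import struct

# Assumed layout of the 80-byte VOL1 label:
# "VOL", "1", volume id (6), accessibility (1), reserved (13),
# implementation id (13), owner id (14), reserved (28), label version (1).
VOL1_FORMAT = "3s1s6s1s13s13s14s28s1s"
assert struct.calcsize(VOL1_FORMAT) == 80


def _text(raw):
    """Decode a space-padded ASCII field."""
    return raw.decode("ascii").rstrip()


def parse_vol1(label):
    """Split an 80-byte VOL1 label into the fields named on the slide."""
    if len(label) != 80:
        raise ValueError("VOL1 label must be exactly 80 bytes")
    (ident, number, volume_id, _access, _reserved1,
     implementation_id, owner_id, _reserved2, _version) = struct.unpack(VOL1_FORMAT, label)
    if ident + number != b"VOL1":
        raise ValueError("not a VOL1 label")
    return {
        "volume_identifier": _text(volume_id),
        "implementation_identifier": _text(implementation_id),
        "owner_identifier": _text(owner_id),
    }


# Build a sample label with made-up values, space-padded to the field widths.
sample = struct.pack(
    VOL1_FORMAT, b"VOL", b"1", b"A00001", b"L", b" " * 13,
    b"LTFS".ljust(13), b"ARCHIVE".ljust(14), b" " * 28, b"4",
)
parsed_label = parse_vol1(sample)
print(parsed_label)
```

    The LTFS and SIRF labels are XML, so they would be read with an XML parser rather than fixed offsets.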


    ENSURE is an FP7 EU project in the area of preservation
      Three-year Integrated Project (IP) started Feb. 1, 2011
      Consortium of 13 partners (industry and academic)

    ENSURE has a business/industry-oriented focus
      Drivers for preservation are both regulatory and business value
      Demonstrated with three use cases: Health Care, Clinical Trials and Finance

    Contributions to standards are a goal of the project


    PDS Cloud and SIRF in ENSURE

    Preservation DataStores (PDS) in the Cloud provides preservation-aware storage services for ENSURE based on OAIS

    The SIRF Handler component will implement SIRF Serialization for CDMI


    Summary

    Need to retain not only information of interest but ALL other information to make it fully usable in future
      Put it all in the SIRF digital box, preserve that as a unit

    No single technology will be usable over the timespans mandated by current digital preservation needs
      SNIA CDMI and LTFS technologies are among the best current choices
        Are good for perhaps 5-10 years
      SIRF provides a vehicle for collecting all of the information that will be needed to transition to new technologies in the future
        SIRF can be serialized for the future technologies as they come


    Other Tutorials and Labs

    Data Protection, Business Continuity, and Disaster Recovery - New Technologies

    Check out SNIA Tutorial: Deploying Public, Private, and Hybrid Cloud Storage

    Check out SNIA Tutorial: Massively Scalable File Storage

    Check out SNIA Tutorial: Object Storage Systems: The Underpinning of Cloud and Big Data Initiatives


    Attribution & Feedback

    Please send any questions or comments regarding this SNIA Tutorial to [email protected]

    The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial.

    Authorship History

    Authors (Spring 2013)
      Mary Baker
      Simona Rabinovici-Cohen
      Roger Cummings
      Sam Fineberg
      (incorporating materials from earlier tutorials dating back to 2008, and with particular thanks to the 100 Year Archive Task Force (2007))

    Additional Contributors
      Mark Carlson (& the Cloud TWG)
      David Pease
      Joseph White
      Alan Yoder


    For further information

    SIRF use cases and requirements document is released for public review
      http://www.snia.org/tech_activities/publicreview

    More information on SIRF (& other SNIA LTR activities) is available at
      http://www.snia.org/ltr

    More information on ENSURE is available at
      www.ensure-fp7.eu