EVA Architecture Introduction, Part 1

Upload: suman-reddy-t

Post on 04-Apr-2018


  • 7/30/2019 EVA ArchitectureIntroduction-part1

    1/39


    3-Dec-12 2

    Disclosure information

    The following information is HP Confidential and is intended only for a limited audience within HP who fulfill a need-to-know requirement. The information contained is to be handled in accordance with HP's policy for handling this classification of information.

    http://legalweb.corp.hp.com/legal/files/labels.asp

    This information may NOT be shared outside HP.


    On Line HP Confidential

    NSS 12/3/2012 10:14 PM

    EVA Features

    EVA

    Excellent random R/W performance

    Excellent cache read hit number

    Fault-tolerant, scalable virtualization mapping scheme (garbage collection free)

    Mirrored write cache

    Volatile read cache

    Metadata in volatile memory (Policy Memory)

    Backend disks provide non-volatile metadata store

    Replication features
        Snapclones, Snapshots (fully allocated, space efficient)
        CA disaster-tolerant remote replication

    RAID0, RAID1, RAID5

    Active mirroring between controllers through FC Mirror Port(s)

    GL: on-the-fly XOR

    XL: inline parity calculation

    Architectural Discussion Objects

    NSC (Network Storage Controller): Refers to a controller from a VCS perspective

    VCS: Virtual Controller Software (firmware)
        Becomes XCS for the XL family

    Physical Store: Unused drive (in the process of becoming a usable part of the system, but still needs to be incorporated into an RSS)

    Volume: A used disk drive; can accept customer data at this point

    Storage Cell: EVA controllers, shelves and disks that have been initialized by the firmware. Can be logically constructed into Disk Groups (LDADs), Logical Disks and Virtual Disks, and then used for customer data

    Disk Group (LDAD, Logical Disk Address Domain): A group of disks that function as a separate storage pool. A virtual disk is contained within a single disk group and cannot span disk groups. A disk group is made up of one or more redundant store sets. User data for a virtual disk is striped across the entire disk group.

    Architectural Discussion Objects

    Quorum: A set of disks that contains copies of the SCS database

    Logical Disk: Logical representation of a virtual disk. At the CS component level, the representation of a virtual disk

    Virtual Disk: A virtual representation of a logical disk, for external use by a host

    Presented Unit: The presentation of a virtual disk, i.e. it is mounted and usable by a host

    RSS (Redundant Store Set): A subset of disks within a disk group that represents a smaller fault domain than the disk group.
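    The containment rules among these objects can be sketched as a small data model. This is only an illustration: the class and field names below are hypothetical and are not taken from VCS.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RSS:
    """Redundant Store Set: a smaller fault domain inside a disk group."""
    disks: List[str] = field(default_factory=list)


@dataclass
class DiskGroup:
    """Disk group (LDAD): a separate storage pool made of one or more RSSs."""
    name: str
    rss_list: List[RSS] = field(default_factory=list)

    def disks(self) -> List[str]:
        # User data for a virtual disk is striped across every disk in the group.
        return [d for rss in self.rss_list for d in rss.disks]


@dataclass
class VirtualDisk:
    """A virtual disk lives in exactly one disk group and cannot span groups."""
    name: str
    group: DiskGroup
    presented: bool = False  # True once it is a presented unit usable by a host


# Hypothetical disk group built from two 8-disk RSSs.
group = DiskGroup("dg1", [
    RSS([f"a{i}" for i in range(8)]),
    RSS([f"b{i}" for i in range(8)]),
])
vdisk = VirtualDisk("vd1", group)
print(len(vdisk.group.disks()))
```

    Holding a single `group` reference on `VirtualDisk` is one simple way to express the rule that a virtual disk cannot span disk groups.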

    RSS Example

    2C12D: One disk group containing all 64 drives

    Eight RSSs: RSS 1, RSS 2, RSS 3, RSS 4, RSS 5, RSS 6, RSS 7, RSS 8

    [Figure: VCS firmware block diagram. Components shown: Host Port, Cache Manager, RAID Services, FC Services, DRM (Core, Log, FC, Copy), SCMI Services, SCS (Storage Cell State), CS (Container Services), Fault Manager, EXEC/RTOS, Code Highway, EMU (Environmental Monitor Unit), OCP (Operational Control Panel), HP Tachyons, Device Tachyons and Mirror Tachyon. Labels on the connections include HTB, EETB, XD, TDCB, DTD, SCSCB, CSIO, CSLD, SEST/ERQ/IMQ, FED, MFCD, TDSD, ELSD, EIP, EIRP/TEIRP, CNODE, CONFIG/STATE, COMMAND, ALLOC/DEALLOC and READY.]

    Architecture Component Overview

    Host Port: Front-end FC services, decodes and sequences instructions, controller responses to host, assigns work to code highway, passes commands to SCMI, supports SCSI interface, handles AAA logic (V4)

    (SCMI) Storage Cell Management Interface: Architected interface to allow external management agents (Command View/Bridge) to manage the EVA

    (SCS) Storage Cell State: Inoperative/operative unit handling, SCMI requests to add/remove objects from the system, return info about objects, unit presentation, pullover, failover, meltdowns, meltdown recovery, ILF disk management, system database, RSS management, add/remove devices, cell mastership, error reporting

    Cache Manager: Read/write cache management, full stripe writes, assigns work to RAID services, RAID5 write recovery/parity recovery

    (DRM) Data Replication Manager: Continuous Access remote disaster-tolerant replication

    (Container Services): Virtualization (map management), local replication (snapclone/snapshot), sparing, leveling

    RAID Services: Services supporting RAID0, RAID1 and RAID5

    FCS: Backend FCS, mirroring and DRM FCS support, disk drive handling

    FM (Fault Manager): Manages event logs, termination codes, etc.

    Architecture Component Overview

    Host Port

    (SCMI) Storage Cell Management Interface

    (SCS) Storage Cell State

    Cache Manager

    Container Services

    Data Replication Manager (DRM): Continuous Access

    FM: Fault Manager


    Host Port

    Host Port and EVA GL Operation

    EVA is made up of a controller pair
        2 host ports per controller module

    One controller is the master and the other the slave
        Actions affecting storage cell structures and database are restricted to the master controller
        Example is VDisk (LUN) creation

    EVA GL (VCS 3.XXX and earlier) is an asymmetrical virtual RAID controller
        Asymmetrical LUN access
        Unit is ready read/write on one controller while it is not ready on the other controller
        Simultaneous access to a LUN is only supported via ports on the same controller
        One queue per LUN, ordering based on command arrival

    Host ports only support Fabric connection
        1Gb, 2Gb switches supported
        Highest available link speed is auto-negotiated

    EVA GL and Host Port IDs

    EVA Controller Pair
        Defined as a single node
        Assigned a SCSI-3 WWID

    Two control units, each containing two host ports
        Each host port is defined by a unique port WWID

    Node and port identifiers are 64-bit IEEE registered numbers, with a portion assigned by a company ID and the rest by an HP-specific method to ensure uniqueness of the identifiers.
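    As a rough illustration of how such a 64-bit identifier breaks down, the sketch below assumes the standard IEEE Registered layout (NAA type 5): a 4-bit format type, a 24-bit IEEE company ID, and 36 bits assigned by the vendor. The slide does not say which layout the EVA actually uses, and the function name and the sample value are hypothetical.

```python
def split_wwid(wwid: int) -> dict:
    """Split a 64-bit IEEE Registered (NAA 5) identifier into its fields."""
    if not 0 <= wwid < 1 << 64:
        raise ValueError("WWID must fit in 64 bits")
    return {
        "naa": wwid >> 60,                      # 4-bit format type
        "company_id": (wwid >> 36) & 0xFFFFFF,  # 24-bit IEEE company ID
        "vendor_part": wwid & ((1 << 36) - 1),  # 36 bits assigned by the vendor
    }


# Made-up sample value, for illustration only.
fields = split_wwid(0x50001FE150012340)
print(hex(fields["naa"]), hex(fields["company_id"]), hex(fields["vendor_part"]))
```

    Under this assumption, the node WWID and the four port WWIDs of a controller pair would share the company ID and differ only in the vendor-assigned part.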

    Host Port Module

    Handles front-end FC services

    Decodes and sequences instructions

    Controller responses to host

    Assigns work to Code Highway

    Passes along SCMI commands to SCMI module

    Supports SCSI interface

    [Figure: VCS firmware block diagram showing Host Port, Cache Manager, RAID Services, FC Services, DRM, SCMI Services, SCS, Container Services, Fault Manager, EXEC/RTOS, Code Highway, EMU, OCP and the Tachyons]


    SCMI

    SCMI: Storage Cell Management Interface

    Architected interface used by external management agents (Command View/Bridge) to communicate with the EVA

    Communication via SCSI Send/Receive Diagnostics
        All SCMI commands are made through LUN 0
        Commands come in via an SCMI command packet
        Response via an SCMI response packet

    Original design limited the response to a single attribute

    To reduce message traffic, super SCMI commands were developed, which return a lot of information via a single response

    SCMI: Storage Cell Management Interface

    External management agent uses SCMIApi or RealSCMI to communicate with the EVA

    SCMI Server processes the command inside of VCS

    SEND DIAGNOSTIC command: uses page code 90 (vendor specific). Contains the SCMI command packet and command buffers (2). 64KB max buffer size.

    RECEIVE DIAGNOSTIC command: returns the result in the SCMI response packet and response buffers (2).

    Host Port layer handles matching of the send/receive pair and rejecting illegal combinations.

    Built-in security mechanism by establishing a password (encrypted password is transmitted).

    The agent (client) must log in using the correct password to be able to send SCMI commands for execution
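    The send/receive pairing described above can be pictured with a toy model. This is a sketch under stated assumptions, not the VCS implementation: the class name, the error handling and the exact definition of an "illegal combination" (a RECEIVE with no outstanding SEND, or a second SEND before the first was answered) are all hypothetical. Only the 64KB buffer limit comes from the slide.

```python
MAX_BUFFER = 64 * 1024  # 64KB maximum buffer size, from the slide


class ScmiPairing:
    """Hypothetical model of the host port layer pairing SEND with RECEIVE."""

    def __init__(self):
        self.pending = None  # response waiting for a RECEIVE DIAGNOSTIC

    def send_diagnostic(self, packet: bytes) -> None:
        if len(packet) > MAX_BUFFER:
            raise ValueError("buffer exceeds 64KB")
        if self.pending is not None:
            # Assumed rule: a second SEND before the RECEIVE is illegal.
            raise RuntimeError("illegal combination: previous SEND not yet received")
        # Stand-in for the SCMI server processing the command inside VCS.
        self.pending = b"response to " + packet

    def receive_diagnostic(self) -> bytes:
        if self.pending is None:
            # Assumed rule: a RECEIVE with no matching SEND is illegal.
            raise RuntimeError("illegal combination: RECEIVE without a SEND")
        response, self.pending = self.pending, None
        return response


port = ScmiPairing()
port.send_diagnostic(b"get-object-info")  # made-up command name
print(port.receive_diagnostic())
```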

    SCMI: Storage Cell Management Interface

    Limitations

    The system processes one send/receive diagnostic at a time

    This means that while the system is synchronously executing a command via send/receive diagnostic, the next management command is held up until that command completes

    When a management command is held up, the management agent loses manageability of the array for that time

    Asynchronous background delete example

    Designing commands that take a long time to execute

    See SCMI Spec sections 6.7, 5.2.5, 4.57.1
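    The effect of processing one management command at a time can be shown with simple arithmetic. The sketch below is illustrative only; the function name and the durations are made up.

```python
def completion_times(durations_ms):
    """Return when each command finishes if commands run strictly one at a time."""
    finished, clock = [], 0
    for duration in durations_ms:
        clock += duration  # the next command cannot start until this one ends
        finished.append(clock)
    return finished


# Hypothetical durations in milliseconds: one long synchronous command
# followed by two quick queries.
print(completion_times([30000, 100, 100]))  # [30000, 30100, 30200]
```

    The two quick queries each take 100 ms of work but do not complete until more than 30 seconds have passed, which is the loss of manageability the slide describes. Running the long operation as an asynchronous background task lets the command that starts it return quickly.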


    State (SCS)

    [Figure: VCS firmware block diagram showing Host Port, Cache Manager, RAID Services, FC Services, DRM, SCMI Services, SCS, Container Services, Fault Manager, EXEC/RTOS, Code Highway, EMU, OCP and the Tachyons]

    SCS Functionality

    Storage Cell State (SCSState)

    Inoperative/operative unit handling

    SCMI requests to add/remove objects from the system

    Return info about objects

    Unit presentation

    Pullover

    Failover

    Meltdowns

    Meltdown recovery

    ILF disk management

    State database (Object Store Management)

    RSS management

    Add/remove devices

    Cell mastership

    Error reporting

    SCS Components

    Cell State Manager (CSM)
        Makes all state decisions, controls state of EVA
        Active only on the master controller
        Manages quorum disks
        Owns SCS database
        SCMI command processing
        Cell realization
        Unit failover

    Cell Volume Manager (CVM)
        Volume transitions
        RSS membership
        Meltdown level

    Cell State Agent (CSA)
        Manipulates volatile data structures on behalf of CSM
        Device discovery

    SCS Components

    Quorum Disks
        RSS0 is a special RSS that tracks the quorum disks
            It is the only RSS that has disks from multiple disk groups
            It is the only RSS that has disks that are all members of other RSSs
        At least 5 disks mirrored, max 16, 1 per disk group, 1 per shelf
        Master owns; slave cannot access quorum drives
        Read one, write all (n-way write)
        User notified when all quorum disks are lost
        Special quorum disks called golden quorum, used in single-controller configuration
        Kept in sync using an incarnation number
        In the event of a crash, check all incarnation numbers

    SCS database resides on quorum disks
        SCS database keeps information about the current storage cell configuration
            Storage Cell, Disk Groups, VDisks, DR Groups
        Journals for metadata updates (can be a performance issue)
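    The "read one, write all" scheme with an incarnation number can be sketched as follows. This is a guess at the general idea, not the SCS code: the names are hypothetical, and the recovery rule (trust the copies with the highest incarnation number) is an assumption. The slide only says that all incarnation numbers are checked after a crash.

```python
from dataclasses import dataclass


@dataclass
class QuorumCopy:
    disk: str
    incarnation: int
    data: str


def write_all(copies, data):
    """N-way write: every copy gets the new data and the next incarnation number."""
    next_incarnation = max(c.incarnation for c in copies) + 1
    for c in copies:
        c.data, c.incarnation = data, next_incarnation


def current_copies(copies):
    """After a crash, keep only copies carrying the highest incarnation (assumed rule)."""
    newest = max(c.incarnation for c in copies)
    return [c for c in copies if c.incarnation == newest]


copies = [QuorumCopy(f"disk{i}", 7, "config-v7") for i in range(5)]
# Simulate a crash after only 3 of the 5 copies were updated.
write_all(copies[:3], "config-v8")
print([c.disk for c in current_copies(copies)])  # ['disk0', 'disk1', 'disk2']
```

    Once the stale copies are identified, any single current copy is enough to serve a read, which is what makes "read one" safe.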

    RSS Management

    RSS Membership
        A disk is not available for storage if it is not a member of an RSS
        When new drives are added to the system they must be added to existing RSSs or new RSSs must be created
        When drives are removed from the system it may require that RSSs are merged

    RSS Size
        RSSs are 6 to 12 drives
        When an RSS drops below 6 drives it will merge with another RSS to create a larger RSS
        When an RSS grows beyond 11 drives it will be split to create 2 RSSs
        A merge can force a split
        Optimal size targeted by the system is 8 drives

    RSS Management

    RSS Goals

    Size is important
        Optimal size targeted by the system is 8
        Must be greater than 5 and less than 12
        When an RSS goes to 5 or fewer it is merged with another RSS, if another RSS is available
        When an RSS grows to 12 or greater it is split into two smaller RSSs of size 6 or greater

    Every member has a mirror partner
        Talk about VA R1 geometry vs EVA R1 geometry

    Mirror partners should be on different shelves

    RSS members should be on different shelves

    Mirror partners same size

    RSS members same size
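    The size rules above amount to a small decision function. The sketch below encodes the thresholds from the slide; the function names are hypothetical, and splitting into two near-equal halves is an assumption, since the slide only requires that each half have 6 or more members.

```python
TARGET_SIZE = 8  # optimal size targeted by the system
MIN_SIZE = 6     # 5 or fewer members triggers a merge
SPLIT_AT = 12    # 12 or more members triggers a split


def rss_action(size: int) -> str:
    """Return what the size rules call for at a given member count."""
    if size < MIN_SIZE:
        return "merge"
    if size >= SPLIT_AT:
        return "split"
    return "ok"


def split_sizes(size: int):
    """Split into two near-equal halves (assumed); each is >= 6 when size >= 12."""
    return (size + 1) // 2, size // 2


for n in (5, 8, 11, 12):
    print(n, rss_action(n))

# "A merge can force a split": a 5-member RSS merged into an 8-member one
# yields 13 members, which is over the limit and is split again.
merged = 5 + 8
print(rss_action(merged), split_sizes(merged))  # split (7, 6)
```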

    RSS Management

    Adding a Single Drive to an LDAD
        Add a single disk, then add it to the RSS with the smallest odd membership
        If there is more than 1 to choose from, select based on shelf numbers and disk sizes

    Adding Multiple Drives to an LDAD
        Try to mate all unpaired disks
        Try to make it so everyone has a partner on a different shelf
        If more than 5 disks, try to create as many new RSSs of size 8 as possible and a new smaller RSS with what's left

    Things Not Guaranteed
        Mirror partners will be on a different shelf
        All RSS members will be on a different shelf
        Don't tear apart good RSSs to make RSSs with drives on different shelves
        Don't make 4 6-member RSSs into 3 8-member RSSs
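    The single-drive rule can be sketched as a selection over RSS sizes. The function name and the sample sizes are hypothetical, and the tie-break on shelf numbers and disk sizes mentioned on the slide is not modeled here: on a tie this sketch simply returns the first candidate.

```python
from typing import Dict, Optional


def pick_rss_for_new_disk(rss_sizes: Dict[str, int]) -> Optional[str]:
    """Return the RSS with the smallest odd membership, or None if all are even."""
    odd = {name: size for name, size in rss_sizes.items() if size % 2 == 1}
    if not odd:
        return None
    # Ties are not resolved by shelf or disk size in this sketch.
    return min(odd, key=odd.get)


print(pick_rss_for_new_disk({"RSS 1": 8, "RSS 2": 7, "RSS 3": 9}))  # RSS 2
```

    An odd membership count means one member has no mirror partner, so choosing an odd-sized RSS presumably lets the new disk pair up with it.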


    Cache and Battery

    Cache and Battery State

    Cache Policy:

    The battery capacity (i.e., write cache holdup time) is a major input for determining what is called the Cache Policy

    Cache Policy determines whether or not a unit is presented to hosts, which controller it is presented through, and whether it operates in write-back or write-through mode


    Battery Holdup and Cache Policy

    The Storage Cell and Cache Policy

    Master battery system | Slave battery system Bad | Slave battery system Low | Slave battery system Good
    Bad  | No unit presentation except SACD | All units writethrough on Storage Cell Slave | All units writeback on Storage Cell Slave
    Low  | All units writethrough on Storage Cell Master | All units writethrough on both Storage Cell Master and Slave | All units writeback on Storage Cell Slave
    Good | All units writeback on Storage Cell Master | All units writeback on Storage Cell Master | All units writeback on both Storage Cell Master and Slave

    Adapted from VCS Battery Manager Overview by Bryan Walder (Aug 29, 02).

    When one controller's battery system is no longer good, units move to the other controller, if its battery state is better
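    The cache policy matrix on this slide can be read as a lookup keyed by the two battery states. The sketch below transcribes the nine cells. The dictionary layout, the function name and the short state strings are illustrative choices, not anything defined by VCS.

```python
# (master battery state, slave battery state) -> (controller presenting units, cache mode)
CACHE_POLICY = {
    ("bad", "bad"): ("none (SACD only)", None),
    ("bad", "low"): ("slave", "writethrough"),
    ("bad", "good"): ("slave", "writeback"),
    ("low", "bad"): ("master", "writethrough"),
    ("low", "low"): ("both", "writethrough"),
    ("low", "good"): ("slave", "writeback"),
    ("good", "bad"): ("master", "writeback"),
    ("good", "low"): ("master", "writeback"),
    ("good", "good"): ("both", "writeback"),
}


def cache_policy(master: str, slave: str):
    """Look up where units are presented and in which cache mode."""
    return CACHE_POLICY[(master, slave)]


print(cache_policy("low", "good"))  # ('slave', 'writeback')
```

    In every cell where the two states differ, the units sit on the controller with the better battery.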

    Battery Holdup Times

    GL
        Two batteries
        Low holdup time: 96 hours

    XL Lite (4000)
        One battery
        Low holdup time in write-through is about 96 hours
        Normal holdup time in write-back mode is up to 242 hours

    XL (6000, 8000)
        Two batteries
        Low holdup time in write-through is about 96 hours
        Normal holdup time in write-back mode is up to 244 hours

    Cache Management for Dummies

    Terminology:

    Dirty Data: Write cache data that has not been flushed to disk

    Write-back caching: Committing data when it reaches write cache and is mirrored on the other controller, to reduce write latencies

    Write-through caching: Disabling write cache and forcing a write to successfully write to disk before returning successful status

    Atomic Write: Guarantee that, for any write up to 128K that does not cross a 128K boundary, a read of the data will return either all old data or all new data
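    The condition for the atomic write guarantee can be checked with integer arithmetic. The sketch assumes that offsets and lengths are in bytes, that 128K means 128 * 1024 bytes, and that boundaries fall on multiples of 128K from the start of the unit. The function name is hypothetical.

```python
BOUNDARY = 128 * 1024  # 128K, assumed to mean 131072 bytes


def is_atomic(offset: int, length: int) -> bool:
    """True if the write is at most 128K and stays inside one 128K-aligned region."""
    if length <= 0 or length > BOUNDARY:
        return False
    first_region = offset // BOUNDARY
    last_region = (offset + length - 1) // BOUNDARY
    return first_region == last_region


print(is_atomic(0, 128 * 1024))         # True: exactly one aligned 128K region
print(is_atomic(4096, 128 * 1024))      # False: crosses the first 128K boundary
print(is_atomic(124 * 1024, 4 * 1024))  # True: ends exactly at the boundary
```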

    Cache Management for Dummies

    Terminology:

    Fail-over: Process of failing over a controller's write cache to the other controller

    Crash-over: The process of reconstructing local cache data structures following a controller power cycle

    Volatile Memory: Non-battery-backed memory, assumed not to survive a power cycle

    Non-volatile Memory: Battery-backed memory, assumed to survive a power cycle

    SACD (Storage Array Control Device)

    Cache Benefits

    Benefits of Caching:

    The cache acts as a holding point between front and back end operations for a given piece of data

    Reduced host port command latency (disk vs. electronic speed):
        Read hits to already cached data
        Write-back for absorbing bursty write data at electronic speed: can achieve electronic speed for absorbing new host writes as long as the cache doesn't fill up and, over time, the average host write data rate is less than the rate at which the media can absorb the data.
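    The two conditions for write-back absorption (the cache does not fill up, and the average write rate stays below the media rate) can be illustrated with a toy simulation. Every number and name below is made up for the illustration; it is not a model of the actual cache manager.

```python
def full_seconds(cache_mb, drain_mb_s, writes_mb_s):
    """Track dirty data second by second; return the seconds in which the cache filled."""
    dirty, filled = 0, []
    for second, incoming in enumerate(writes_mb_s):
        # Host writes add dirty data while the disks drain it at a fixed rate.
        dirty = max(0, dirty + incoming - drain_mb_s)
        if dirty > cache_mb:
            filled.append(second)  # from here, host writes wait on the disks
            dirty = cache_mb
    return filled


# A short burst above the drain rate is absorbed without filling the cache.
print(full_seconds(256, 100, [200, 200, 0, 0]))  # []
# A sustained rate above the drain rate eventually fills it.
print(full_seconds(256, 100, [200, 200, 200, 200]))  # [2, 3]
```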

    Cache Buffers

    Cache Buffers:

    Block = 512 bytes

    GL Buffer = 2048 bytes (populated with 1 to 4 blocks of user data)

    XL Buffer = 8192 bytes (populated with 1 to 16 blocks of user data)

    Cache Page = 128 kilobytes
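    The relationships between these sizes follow from simple division. The sketch assumes that a kilobyte here means 1024 bytes; the constant names are illustrative, and whether a cache page is actually organized as whole buffers is not stated on the slide.

```python
BLOCK = 512              # bytes
GL_BUFFER = 2048         # bytes
XL_BUFFER = 8192         # bytes
CACHE_PAGE = 128 * 1024  # bytes, assuming 1 kilobyte = 1024 bytes

print(GL_BUFFER // BLOCK)       # 4 blocks per GL buffer
print(XL_BUFFER // BLOCK)       # 16 blocks per XL buffer
print(CACHE_PAGE // GL_BUFFER)  # 64 GL-sized buffers fit in one cache page
print(CACHE_PAGE // XL_BUFFER)  # 16 XL-sized buffers fit in one cache page
print(CACHE_PAGE // BLOCK)      # 256 blocks fit in one cache page
```

    The first two results match the "1 to 4 blocks" and "1 to 16 blocks" figures given for the GL and XL buffers.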

    Cache Layout: GL and XL (4000)

    Cache-A | Cache-B
    A Write Primary, 256MB, non-volatile | B Write Primary, 256MB, non-volatile
    B Write Mirror, 256MB, non-volatile | A Write Mirror, 256MB, non-volatile
    A Read, 512MB, volatile | B Read, 512MB, volatile

    Cache Layout: XL (6000, 8000)

    Cache-A | Cache-B
    A Write Primary, 512MB, non-volatile | B Write Primary, 512MB, non-volatile
    B Write Mirror, 512MB, non-volatile | A Write Mirror, 512MB, non-volatile
    A Read, 1024MB, volatile | B Read, 1024MB, volatile

    [Figure: VCS firmware block diagram showing Host Port, Cache Manager, RAID Services, FC Services, DRM, SCMI Services, SCS, Container Services, Fault Manager, EXEC/RTOS, Code Highway, EMU, OCP and the Tachyons]

    Cache Manager Operations

    Host Port Reads/Writes (HP Interface)

    Mirroring write data to other controller

    Cooperation with DRM for order preservation

    Full stripe write aggregation for RAID5 to avoid RMW penalty

    R5 parity recovery

    World Peace