EVA Architecture Introduction - Part 1
TRANSCRIPT
-
7/30/2019 EVA ArchitectureIntroduction-part1
1/39
-
Disclosure information
The following information is HP Confidential and is intended only for a limited audience within HP who fulfill a need-to-know requirement. The information contained is to be handled in accordance with HP's policy for handling this classification of information.
http://legalweb.corp.hp.com/legal/files/labels.asp
This information may NOT be shared outside HP.
-
On Line HP Confidential
NSS 12/3/2012 10:14 PM
EVA:
Excellent random R/W performance
Excellent cache read-hit numbers
Fault-tolerant, scalable virtualization mapping scheme (garbage-collection free)
Mirrored write cache
Volatile read cache
Metadata in volatile memory (Policy Memory)
Backend disks provide a non-volatile metadata store
Replication features
Snapclones, Snapshots (fully allocated, space efficient)
CA disaster-tolerant remote replication
RAID0, RAID1, RAID5
Active mirroring between controllers through FC Mirror Port(s)
GL: on-the-fly XOR
XL: inline parity calculation
EVA Features
-
NSC (Network Storage Controller): Refers to a controller from a VCS perspective
VCS: Virtual Controller Software (firmware); becomes XCS for the XL family
Physical Store: An unused drive (in the process of becoming a usable part of the system, but needs to be incorporated into an RSS)
Volume: A used disk drive; can accept customer data at this point
Storage Cell: EVA controllers, shelves and disks that have been initialized by the firmware. Can be logically constructed into Disk Groups (LDADs), Logical Disks, Virtual Disks and then used for customer data
Disk Group (LDAD, Logical Disk Address Domain): A group of disks that functions as a separate storage pool. A virtual disk is contained within a single disk group and cannot span disk groups. A disk group is made up of one or more redundant store sets. User data for a virtual disk is striped across the entire disk group.
Architectural Discussion Objects
-
Quorum: A set of disks that contains copies of the SCS database
Logical Disk: The representation of a virtual disk at the CS component level
Virtual Disk: A virtual representation of a logical disk, for external use by a host
Presented Unit: The presentation of a virtual disk, i.e., it is mounted and usable by a host
RSS (Redundant Store Set): A subset of disks within a disk group that represents a smaller fault domain than the disk group.
Architectural Discussion Objects
-
2C12D: one disk group containing all 64 drives
Eight RSSs: RSS 1 through RSS 8 (64 drives / 8 RSSs = 8 drives per RSS, the optimal size)
RSS Example
-
[Block diagram: VCS component architecture. Host Port (HP Tachyons), SCMI Services, Cache Manager, RAID Services, Container Services (CS), Storage Cell State (SCS), DRM (Core, Log, FC, Copy), FC Services (Device Tachyons, Mirror Tachyon), Fault Manager, RTOS, EMU (Environmental Monitor Unit) and OCP (Operational Control Panel) communicate over the Code Highway, exchanging CONFIG/STATE messages and descriptors such as XD, HTB, EETB, TDCB, DTD, FED, MFCD, CSIO, CSLD and the SEST/ERQ/IMQ queues.]
-
Host Port: Front-end FC services; decodes and sequences instructions; controller responses to the host; assigns work to the Code Highway; passes commands to SCMI; supports the SCSI interface; handles AAA logic (V4)
(SCMI) Storage Cell Management Interface: Architected interface to allow external management agents (Command View/Bridge) to manage the EVA
(SCS) Storage Cell State: Inoperative/operative unit handling; SCMI requests to add/remove objects from the system; returning info about objects; unit presentation; pullover; failover; meltdowns; meltdown recovery; ILF disk management; system database; RSS management; add/remove devices; cell mastership; error reporting
Cache Manager: Read/write cache management; full-stripe writes; assigns work to RAID Services; RAID5 write recovery/parity recovery
(DRM) Data Replication Manager: Continuous Access remote disaster-tolerant replication
Container Services: Virtualization (map management); local replication (snapclone/snapshot); sparing; leveling
RAID Services: Services supporting RAID0, RAID1 and RAID5
FCS: Back-end FC services; mirroring and DRM FC support; disk drive handling
FM (Fault Manager): Manages event logs, termination codes, etc.
Architecture Component Overview
-
Host Port
(SCMI) Storage Cell Management Interface
(SCS) Storage Cell State
Cache Manager
Container Services
Data Replication Manager (DRM) / Continuous Access
FM (Fault Manager)
Architecture Component Overview
-
Host Port
-
The EVA is made up of a controller pair; 2 host ports per controller module
One controller is the master and the other the slave
Actions affecting storage cell structures and the database are restricted to the master controller; an example is VDisk (LUN) creation
EVA GL (VCS 3.XXX and earlier) is an asymmetrical virtual RAID controller
Asymmetrical LUN access: a unit is ready read/write on one controller while it is not ready on the other controller
Simultaneous access to a LUN is only supported via ports on the same controller
One queue per LUN, ordering based on command arrival
Host ports only support fabric connection; 1Gb and 2Gb switches supported
Highest available link speed is auto-negotiated
Host Port and EVA GL Operation
-
EVA Controller Pair
Defined as a single node
Assigned a SCSI-3 WWID
Two control units, each containing two host ports; each host port is defined by a unique port WWID
Node and Port identifiers are 64-bit IEEE registered numbers, with a portion assigned by a company ID and the rest by an HP-specific method to ensure uniqueness of the identifiers.
EVA GL and Host Port IDs
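The 64-bit identifiers described above commonly follow the IEEE Registered (NAA type 5) layout: a 4-bit NAA field, a 24-bit company ID (OUI), and 36 vendor-assigned bits. A minimal decomposition sketch, assuming that layout (the WWN value below is illustrative, not a real EVA identifier):

```python
# Decompose a 64-bit IEEE Registered (NAA type 5) identifier into its
# fields. The WWN value is illustrative only.
wwn = 0x50001FE15000ABCD

naa    = (wwn >> 60) & 0xF          # NAA type: 5 = IEEE Registered
oui    = (wwn >> 36) & 0xFFFFFF     # 24-bit company ID (OUI)
vendor = wwn & 0xFFFFFFFFF          # 36 bits assigned by the vendor

print(f"NAA type : {naa:x}")        # 5
print(f"OUI      : {oui:06x}")
print(f"vendor ID: {vendor:09x}")
```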
-
Handles front end FC services
Decodes and sequences instructions
Controller responses to Host
Assigns work to Code Highway
Passes along SCMI commands to SCMI module
Supports SCSI interface
Host Port Module
-
[VCS component block diagram repeated.]
-
SCMI
-
Architected interface used by external management agents (Command View/Bridge) to communicate with the EVA
Communication via SCSI SEND/RECEIVE DIAGNOSTICS; all SCMI commands are made through LUN 0
Commands come in via an SCMI command packet
Response via an SCMI response packet
The original design limited a response to a single attribute
In order to reduce message traffic, super SCMI commands were developed which return a lot of information via a single response
SCMI (Storage Cell Management Interface)
-
An external management agent uses SCMIApi or RealSCMI to communicate with the EVA
The SCMI Server processes the command inside of VCS
SEND DIAGNOSTIC command: uses page code 90 (vendor specific); contains the SCMI command packet and command buffers (2); 64KB max buffer size
RECEIVE DIAGNOSTIC command: returns the result in an SCMI response packet and response buffers (2)
The Host Port layer handles matching of the send/receive pair and rejecting illegal combinations
Built-in security mechanism via an established password (the encrypted password is transmitted)
The agent (client) must log in using the correct password to be able to send SCMI commands for execution
SCMI (Storage Cell Management Interface)
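The send/receive pairing the Host Port layer enforces can be sketched as follows. This is a behavioral model only, not a real SCSI interface; class and method names are hypothetical, and page code 90 is assumed to be hex 0x90:

```python
# Sketch: SEND/RECEIVE DIAGNOSTIC pairing for SCMI commands.
# One outstanding send/receive pair at a time; unmatched RECEIVEs and
# oversized buffers are rejected, as the Host Port layer does.
class ScmiSession:
    MAX_BUFFER = 64 * 1024  # 64KB max buffer size per the text

    def __init__(self):
        self.pending = None  # outstanding SEND awaiting its RECEIVE

    def send_diagnostic(self, page_code, command_packet):
        if page_code != 0x90:          # vendor-specific SCMI page (assumed hex)
            raise ValueError("not an SCMI page")
        if self.pending is not None:   # one command processed at a time
            raise RuntimeError("previous command not yet received")
        if len(command_packet) > self.MAX_BUFFER:
            raise ValueError("command buffer too large")
        self.pending = command_packet

    def receive_diagnostic(self):
        if self.pending is None:       # RECEIVE without a matching SEND is illegal
            raise RuntimeError("no matching SEND DIAGNOSTIC")
        packet, self.pending = self.pending, None
        return b"response-for-" + packet   # placeholder response packet
```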
-
Limitations
The system processes one send/receive diagnostic at a time
This means that while the system is synchronously executing a command via send/receive diagnostic, the next management command is held up until that command completes
When a management command is held up, the management agent loses manageability of the array for that time
Example: asynchronous background delete
Consider this when designing commands that take a long time to execute
See SCMI Spec sections 6.7, 5.2.5, 4.57.1
SCMI (Storage Cell Management Interface)
-
State (SCS)
-
[VCS component block diagram repeated.]
-
Storage Cell State (SCS)
Inoperative/Operative unit handling
SCMI requests to add/remove objects from the system
Return info about objects
Unit presentation
Pullover
Failover
Meltdowns
Meltdown recovery
ILF disk management
State database (Object Store Management)
RSS management
Add/remove devices
Cell mastership
Error reporting
SCS Functionality
-
Cell State Manager (CSM)
Makes all state decisions, controls the state of the EVA
Active only on the master controller
Manages quorum disks
Owns the SCS database
SCMI command processing
Cell realization
Unit failover
Cell Volume Manager (CVM)
Volume transitions
RSS membership
Meltdown level
Cell State Agent (CSA)
Manipulates volatile data structures on behalf of CSM
Device discovery
SCS Components
-
Quorum Disks
RSS0 is a special RSS that tracks the quorum disks
It is the only RSS that has disks from multiple disk groups
It is the only RSS whose disks are all members of other RSSs
At least 5 disks mirrored, max 16; 1 per disk group, 1 per shelf
The master owns the quorum drives; the slave cannot access them
Read one, write all (n-way write)
The user is notified when all quorum disks are lost
Special quorum disks called golden quorum are used in a single-controller configuration
Kept in sync using an incarnation number
In the event of a crash, check all incarnation numbers
The SCS database resides on the quorum disks
The SCS database keeps information about the current storage cell configuration: Storage Cell, Disk Groups, VDisks, DR Groups
Journals for metadata updates (can be a performance issue)
SCS Components
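The read-one/write-all scheme with incarnation numbers can be sketched as follows. Names and structure are hypothetical; the real on-disk layout is defined by VCS:

```python
# Sketch: n-way quorum writes stamped with an incarnation number, so a
# crash-recovery pass can detect disks that missed the last update.
class QuorumSet:
    def __init__(self, n_disks):
        self.disks = [{"incarnation": 0, "db": None} for _ in range(n_disks)]

    def write_all(self, db_record):
        # "Read one, write all": every quorum disk gets the new copy,
        # stamped with the next incarnation number.
        next_inc = max(d["incarnation"] for d in self.disks) + 1
        for d in self.disks:
            d["incarnation"] = next_inc
            d["db"] = db_record

    def read_one(self):
        return self.disks[0]["db"]   # any up-to-date copy will do

    def crash_check(self):
        # After a crash, compare incarnation numbers; stale disks are
        # those below the highest number seen.
        newest = max(d["incarnation"] for d in self.disks)
        return [i for i, d in enumerate(self.disks)
                if d["incarnation"] < newest]
```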
-
RSS Membership
A disk is not available for storage if it is not a member of an RSS
When new drives are added to the system they must be added to existing RSSs, or new RSSs must be created
When drives are removed from the system it may require that RSSs are merged
RSS Size
RSSs are 6 to 12 drives
When an RSS drops below 6 drives it will merge with another RSS to create a larger RSS
When an RSS grows beyond 11 drives it will be split to create 2 RSSs
A merge can force a split
Optimal size targeted by the system is 8 drives
RSS Management
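The sizing rules above can be sketched as a small decision function. Function names are hypothetical; only the thresholds come from the text:

```python
# Sketch of the RSS sizing rules: RSSs hold 6-12 drives, target size 8;
# at 5 or fewer drives an RSS merges, at 12 or more it splits.
def rss_action(size):
    if size <= 5:
        return "merge"   # join with another RSS, if one is available
    if size >= 12:
        return "split"   # split into two RSSs of size >= 6
    return "ok"          # 6..11 drives: leave alone

def split_sizes(size):
    # Split as evenly as possible; both halves must be >= 6.
    a = size // 2
    b = size - a
    assert a >= 6 and b >= 6
    return a, b
```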
-
RSS Goals
Size is important
Optimal size targeted by the system is 8
Must be greater than 5 and less than 12
When an RSS goes to 5 or less it is merged with another RSS, if another RSS is available
When an RSS grows to 12 or greater it is split into two smaller RSSs of size 6 or greater
Every member has a mirror partner
Talk about VA R1 geometry vs EVA R1 geometry
Mirror partners should be on different shelves
RSS members should be on different shelves
Mirror partners should be the same size
RSS members should be the same size
RSS Management
-
Adding a Single Drive to an LDAD
Add a single disk, then add it to the RSS with the smallest odd membership
If more than 1 to choose from, select based on shelf numbers and disk sizes
Adding Multiple Drives to an LDAD
Try to mate all unpaired disks
Try to make it so everyone has a partner on a different shelf
If more than 5 disks, try to create as many new RSSs of size 8 as possible and a new smaller RSS with what's left
Things Not Guaranteed
Mirror partners will be on a different shelf
All RSS members will be on a different shelf
Don't tear apart good RSSs to make RSSs with drives on different shelves
Don't make four 6-member RSSs into three 8-member RSSs
RSS Management
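The single-drive rule above can be sketched as follows (the function name is hypothetical, and the shelf/size tie-breaking is noted but not modeled):

```python
# Sketch: place a new disk into the RSS with the smallest odd
# membership (an odd-sized RSS has an unpaired drive to mate with).
def choose_rss(rss_sizes):
    odd = [i for i, n in enumerate(rss_sizes) if n % 2 == 1]
    if odd:
        return min(odd, key=lambda i: rss_sizes[i])
    return None   # no odd RSS: shelf-number / disk-size rules apply
```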
-
Cache and Battery
-
Cache and Battery State
Cache Policy: The battery capacity (i.e., write cache holdup time) is a major input for determining what is called the Cache Policy.
The Cache Policy determines whether or not a unit is presented to hosts, which controller it is presented through, and whether it operates in write-back or write-through mode.
-
Battery Holdup and Cache Policy
-
The Storage Cell and Cache Policy
Master battery Bad / Slave battery Bad: no unit presentation except SACD
Master battery Bad / Slave battery Low: all units write-through on Storagecell Slave
Master battery Bad / Slave battery Good: all units write-back on Storagecell Slave
Master battery Low / Slave battery Bad: all units write-through on Storagecell Master
Master battery Low / Slave battery Low: all units write-through on both Storagecell Master and Slave
Master battery Low / Slave battery Good: all units write-back on Storagecell Slave
Master battery Good / Slave battery Bad: all units write-back on Storagecell Master
Master battery Good / Slave battery Low: all units write-back on Storagecell Master
Master battery Good / Slave battery Good: all units write-back on both Storagecell Master and Slave
Adapted from VCS Battery Manager Overview by Bryan Walder (Aug 29, 02).
When one controller's battery system is no longer good, units move to the other controller, if its battery state is better
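The battery-state-to-policy mapping above can be expressed as a lookup table (a sketch of the slide's matrix; state names are simplified to bad/low/good):

```python
# Sketch of the cache-policy matrix as a (master, slave) lookup.
# "SACD" is the Storage Array Control Device.
POLICY = {
    ("bad",  "bad"):  "no unit presentation except SACD",
    ("bad",  "low"):  "write-through on slave",
    ("bad",  "good"): "write-back on slave",
    ("low",  "bad"):  "write-through on master",
    ("low",  "low"):  "write-through on both controllers",
    ("low",  "good"): "write-back on slave",
    ("good", "bad"):  "write-back on master",
    ("good", "low"):  "write-back on master",
    ("good", "good"): "write-back on both controllers",
}

def cache_policy(master_battery, slave_battery):
    return POLICY[(master_battery, slave_battery)]
```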
-
Battery Holdup Times
GL: Two batteries; low holdup time: 96 hours
XL Lite (4000): One battery; low holdup time in Write-Through is about 96 hours; normal holdup time in Write-Back mode is up to 242 hours
XL (6000, 8000): Two batteries; low holdup time in Write-Through is about 96 hours; normal holdup time in Write-Back mode is up to 244 hours
-
Cache Management for Dummies
Terminology:
Dirty Data: Write cache data that has not been flushed to disk
Write-back caching: Committing data when it reaches write cache and is mirrored on the other controller, to reduce write latencies
Write-through caching: Disabling write cache and forcing a write to complete successfully to disk before returning successful status
Atomic Write: Guarantee that for any write up to 128K that does not cross a 128K boundary, a read of the data will return either all old data or all new data
-
Cache Management for Dummies
Terminology:
Fail-over: Process of failing over a controller's write cache to the other controller
Crash-over: The process of reconstructing local cache data structures following a controller power cycle
Volatile Memory: Non-battery-backed memory, assumed not to survive a power cycle
Non-volatile Memory: Battery-backed memory, assumed to survive a power cycle
SACD: Storage Array Control Device
-
Cache Benefits
Benefits of caching: The cache acts as a holding point between front-end and back-end operations for a given piece of data
Reduced host port command latency (disk vs. electronic speed):
Read hits to already cached data
Write-back for absorbing bursty write data at electronic speed: new host writes can be absorbed at electronic speed as long as the cache doesn't fill up and, over time, the average host write data rate is less than the rate at which the media can absorb the data.
-
Cache Buffers
Block = 512 bytes
GL buffer = 2048 bytes (populated with 1 to 4 blocks of user data)
XL buffer = 8192 bytes (populated with 1 to 16 blocks of user data)
Cache page = 128 kilobytes
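The buffer geometry above in numbers. All sizes come from the slide; the buffers-per-page figures are derived arithmetic, not stated in the source:

```python
# Cache buffer geometry from the slide, plus derived ratios.
BLOCK = 512                      # bytes per block
GL_BUFFER = 2048                 # holds 1 to 4 blocks of user data
XL_BUFFER = 8192                 # holds 1 to 16 blocks of user data
PAGE = 128 * 1024                # 128 KB cache page

assert GL_BUFFER // BLOCK == 4   # max blocks per GL buffer
assert XL_BUFFER // BLOCK == 16  # max blocks per XL buffer
print(PAGE // GL_BUFFER)         # GL buffers per cache page: 64
print(PAGE // XL_BUFFER)         # XL buffers per cache page: 16
```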
-
Cache Layout: GL and XL (4000)
Cache-A: A Write Primary (256MB, non-volatile); B Write Mirror (256MB, non-volatile); A Read (512MB, volatile)
Cache-B: B Write Primary (256MB, non-volatile); A Write Mirror (256MB, non-volatile); B Read (512MB, volatile)
-
XL (6000, 8000)
Cache-A: A Write Primary (512MB, non-volatile); B Write Mirror (512MB, non-volatile); A Read (1024MB, volatile)
Cache-B: B Write Primary (512MB, non-volatile); A Write Mirror (512MB, non-volatile); B Read (1024MB, volatile)
-
[VCS component block diagram repeated.]
-
Cache Manager Operations
Host port reads/writes (HP interface)
Mirroring write data to the other controller
Cooperation with DRM for order preservation
Full-stripe write aggregation for RAID5 to avoid the RMW penalty
R5 parity recovery
World Peace
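The read-modify-write penalty that full-stripe aggregation avoids can be counted with standard RAID5 accounting (this is generic RAID5 arithmetic, not EVA-specific internals):

```python
# Disk I/Os needed to update k data chunks in a RAID5 stripe of
# n data disks + 1 parity disk (standard RAID5 accounting).
def rmw_ios(k):
    # Partial-stripe read-modify-write: read old data + old parity,
    # then write new data + new parity = 2*k + 2 I/Os.
    return 2 * k + 2

def full_stripe_ios(n):
    # A full-stripe write needs no reads: write n data chunks + parity.
    return n + 1

print(rmw_ios(1))          # small write: 4 I/Os for 1 chunk of data
print(full_stripe_ios(4))  # full 4+1 stripe: 5 I/Os for 4 chunks
```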