data processing architectures


Upload: fritz-moran

Post on 15-Mar-2016



Page 1: Data  Processing  Architectures

Data Structure and Storage

The modern world has a false sense of superiority because it relies on the mass of knowledge that it can use, but what is important is the extent to which knowledge is organized and mastered
Goethe, 1810

Page 2: Data  Processing  Architectures

Bytes

Abbreviation  Prefix  Factor
k             kilo    10^3
M             mega    10^6
G             giga    10^9
T             tera    10^12
P             peta    10^15
E             exa     10^18
Z             zetta   10^21
Y             yotta   10^24

Page 3: Data  Processing  Architectures

Market
2012
  Digital universe estimated at 2.8 ZB
  Doubling every two years
2020
  5.2 TB per person

Page 4: Data  Processing  Architectures

Data Structures
The goal is to minimize disk accesses
Disks are relatively slow compared to main memory
  Writing a letter compared to a telephone call
Disks are a bottleneck
Appropriate data structures can reduce disk accesses

Page 5: Data  Processing  Architectures

Database access

Page 6: Data  Processing  Architectures

Disks
Data stored on tracks on a surface
A disk drive can have multiple surfaces
Rotational delay
  Waiting for the physical storage location of the data to appear under the read/write head
  Around 4 msec for a magnetic disk
  Set by the manufacturer
Access arm delay
  Moving the read/write head to the track on which the storage location can be found
  Around 9 msec for a magnetic disk

Page 7: Data  Processing  Architectures

Disks

Page 8: Data  Processing  Architectures

Minimizing data access times
Rotational delay is fixed by the manufacturer
Access arm delay can be reduced by storing files on
  The same track
  The same track on each surface
    • A cylinder

Page 9: Data  Processing  Architectures

Clustering
Records that are often retrieved together should be stored together
Intra-file clustering
  Records within the one file
    • A sequential file
Inter-file clustering
  Records in different files
    • A nation and its stocks

Page 10: Data  Processing  Architectures

Disk manager
Manages physical I/O
Sees the disk as a collection of pages
Has a directory of each page on a disk
Retrieves, replaces, and manages free pages

Page 11: Data  Processing  Architectures

File manager
Manages the storage of files
Sees the disk as a collection of stored files
Each file has a unique identifier
Each record within a file has a unique record identifier

Page 12: Data  Processing  Architectures

File manager's tasks
Create a file
Delete a file
Retrieve a record from a file
Update a record in a file
Add a new record to a file
Delete a record from a file

Page 13: Data  Processing  Architectures

Sequential retrieval
Consider a file of 10,000 records each occupying 1 page
Queries that require processing all records will require 10,000 accesses
  e.g., Find all items of type 'E'
Many disk accesses are wasted if few records meet the condition
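As a rough worked example, the cost of such a scan can be estimated from the delays quoted on the Disks slide (about 4 msec rotational delay and about 9 msec access arm delay). The sketch below assumes, pessimistically, that every page access pays both delays in full; the function name and the figure of 50 matching records are illustrative assumptions, not part of the original slides.

```python
ROTATIONAL_DELAY_MS = 4   # approximate figure quoted for a magnetic disk
ACCESS_ARM_DELAY_MS = 9   # approximate figure quoted for a magnetic disk


def scan_time_seconds(page_accesses: int) -> float:
    """Estimate elapsed time, assuming each access pays both delays (an assumption)."""
    per_access_ms = ROTATIONAL_DELAY_MS + ACCESS_ARM_DELAY_MS
    return page_accesses * per_access_ms / 1000


full_scan = scan_time_seconds(10_000)  # every record is read
few_hits = scan_time_seconds(50)       # hypothetical: only 50 records are needed
print(full_scan, few_hits)
```

Under these assumptions the full scan takes roughly two minutes, while reading only the needed pages takes well under a second.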

Page 14: Data  Processing  Architectures

Indexing
An index is a small file that has data for one field of a file
Indexes reduce disk accesses

Page 15: Data  Processing  Architectures

Querying with an index
Read the index into memory
Search the index to find records meeting the condition
Access only those records containing required data
Disk accesses are substantially reduced when the query involves few records
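The steps above can be sketched in a few lines. This is a minimal illustration, not a description of any particular database: the item file, its field names, and the helper functions are hypothetical, and a list position stands in for a page address.

```python
# Hypothetical item file: one record per page, so the list position
# stands in for the page (disk) address.
item_file = [
    {"itemcode": 1001, "type": "N"},
    {"itemcode": 1002, "type": "E"},
    {"itemcode": 1003, "type": "A"},
    {"itemcode": 1004, "type": "E"},
    {"itemcode": 1005, "type": "C"},
]


def build_index(records, field):
    """Map each value of `field` to the pages holding records with that value."""
    index = {}
    for page, record in enumerate(records):
        index.setdefault(record[field], []).append(page)
    return index


def query_with_index(records, index, value):
    """Return the matching records and the number of data pages accessed."""
    pages = index.get(value, [])
    return [records[page] for page in pages], len(pages)


type_index = build_index(item_file, "type")
matches, accesses = query_with_index(item_file, type_index, "E")
print(matches, accesses)
```

Here only two data pages are touched for type 'E', whereas a sequential scan would touch all five.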

Page 16: Data  Processing  Architectures

Maintaining an index
Adding a record requires at least two disk accesses
  Update the file
  Update the index
Trade-off
  Faster queries
  Slower maintenance

Page 17: Data  Processing  Architectures

Using indexes
Sequential processing of a portion of a file
  Find all items with a type code in the range 'E' to 'K'
Direct processing
  Find all items with a type code of 'E' or 'N'
Existence testing
  Determining whether a record meeting the criteria exists without having to retrieve it
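The three uses can be illustrated with a sorted index held in memory. This is a sketch under stated assumptions: the index contents and the function names are hypothetical, and Python's standard `bisect` module stands in for whatever search the database actually performs.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index on type code, kept in key order: (type code, page address).
index = [("A", "d7"), ("C", "d2"), ("E", "d1"), ("E", "d5"),
         ("G", "d3"), ("K", "d4"), ("N", "d6"), ("P", "d8")]
keys = [key for key, _ in index]


def range_lookup(low, high):
    """Sequential processing: addresses for keys in the range low..high inclusive."""
    start = bisect_left(keys, low)
    stop = bisect_right(keys, high)
    return [address for _, address in index[start:stop]]


def direct_lookup(*values):
    """Direct processing: addresses for each of the listed key values."""
    return [address for value in values for address in range_lookup(value, value)]


def exists(value):
    """Existence testing: answered from the index alone, no record is retrieved."""
    position = bisect_left(keys, value)
    return position < len(keys) and keys[position] == value


print(range_lookup("E", "K"))
print(direct_lookup("E", "N"))
print(exists("C"), exists("B"))
```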

Page 18: Data  Processing  Architectures

Multiple indexes
Find red items of type 'C'
  Both indexes can be searched to identify records to retrieve
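One way to picture searching both indexes is as a set intersection of page addresses. The index contents below are invented for illustration only.

```python
# Hypothetical indexes: each maps a field value to a set of page addresses.
color_index = {"Red": {"d2", "d3", "d6"}, "Blue": {"d1", "d5"}, "Green": {"d4"}}
type_index = {"A": {"d2", "d4"}, "C": {"d3", "d5", "d6"}, "N": {"d1"}}

# Only pages that appear in both lists need to be retrieved from disk.
red_type_c = color_index["Red"] & type_index["C"]
print(sorted(red_type_c))
```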

Page 19: Data  Processing  Architectures

Multiple indexes
Indexes are also called inverted lists
  A file of record locations rather than data
Trade-off
  Faster retrieval
  Slower maintenance

Page 20: Data  Processing  Architectures

B-tree
A form of inverted list
Frequently used for relational systems
Basis of IBM’s VSAM underlying DB2
Supports sequential and direct accessing
Has two parts
  Sequence set
  Index set

Page 21: Data  Processing  Architectures

B-tree

Sequence set is a single level index with pointers to records
Index set is a tree-structured index to the sequence set

Page 22: Data  Processing  Architectures

B+ tree
The combination of index set (the B-tree) and the sequence set is called a B+ tree
The number of data values and pointers in any given node is restricted
Free space is set aside to permit rapid expansion of a file
Tradeoffs
  Fast retrieval when pages are packed with data values and pointers
  Slow updates when pages are packed with data values and pointers

Page 23: Data  Processing  Architectures

Hash for internal memory
Hash maps are available in most programming languages
  Also known as lookup tables
  A key-value pair

International dialing codes

Key             Value
Afghanistan     93
Albania         355
Algeria         213
American Samoa  1684
…               …
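In Python, for instance, the built-in `dict` is a hash map. The sketch below loads the four entries shown in the table; the variable names and the nonexistent key "Atlantis" are illustrative.

```python
# The four entries from the dialing-code table, keyed by country name.
dialing_codes = {
    "Afghanistan": 93,
    "Albania": 355,
    "Algeria": 213,
    "American Samoa": 1684,
}

albania = dialing_codes["Albania"]           # direct lookup by key
unknown = dialing_codes.get("Atlantis")      # a missing key yields None with .get()
print(albania, unknown)
```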

Page 24: Data  Processing  Architectures

Bit map indexes
Uses a single bit, rather than multiple bytes, to indicate the specific value of a field
  Color can have only three values, so use three bits

Itemcode  Color              Code   Disk address
          Red  Green  Blue   A  N
1001      0    0      1      0  1   d1
1002      1    0      0      1  0   d2
1003      1    0      0      1  0   d3
1004      0    1      0      1  0   d4

Page 25: Data  Processing  Architectures

Bit map indexes
A bit map index saves space and time compared to a standard index

Itemcode  Color (CHAR(8))  Code (CHAR(1))  Disk address
1001      Blue             N               d1
1002      Red              A               d2
1003      Red              A               d3
1004      Green            A               d4
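A minimal sketch of the idea, using the four rows from the table: each distinct value gets one bitmap, held here in a Python integer with one bit per row. The helper names are hypothetical, and real systems store and compress bitmaps quite differently.

```python
# The four rows from the table: (itemcode, color, code, disk address).
rows = [
    (1001, "Blue", "N", "d1"),
    (1002, "Red", "A", "d2"),
    (1003, "Red", "A", "d3"),
    (1004, "Green", "A", "d4"),
]


def build_bitmaps(table, column):
    """Return {value: bitmap}, where bit i is set when row i holds that value."""
    bitmaps = {}
    for position, row in enumerate(table):
        bitmaps[row[column]] = bitmaps.get(row[column], 0) | (1 << position)
    return bitmaps


def addresses(bitmap):
    """Translate the set bits of a bitmap back into disk addresses."""
    return [row[3] for position, row in enumerate(rows) if (bitmap >> position) & 1]


color_bits = build_bitmaps(rows, 1)
code_bits = build_bitmaps(rows, 2)

# "Red items with code A" becomes a single bitwise AND of two bitmaps.
red_and_a = color_bits["Red"] & code_bits["A"]
print(addresses(red_and_a))
```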

Page 26: Data  Processing  Architectures

Join indexes
Speed up joins by creating an index for the primary key and foreign key pair

nation index             stock index
natcode  Disk address    natcode  Disk address
UK       d1              UK       d101
USA      d2              UK       d102
                         UK       d103
                         USA      d104
                         USA      d105

join index
nation disk address  stock disk address
d1                   d101
d1                   d102
d1                   d103
d2                   d104
d2                   d105
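The join index in the table can be derived from the two single-table indexes. The sketch below does so for the values shown; the variable names are illustrative.

```python
# Values copied from the nation index and stock index tables.
nation_index = {"UK": "d1", "USA": "d2"}
stock_index = [("UK", "d101"), ("UK", "d102"), ("UK", "d103"),
               ("USA", "d104"), ("USA", "d105")]

# Pair each stock row's address with the address of its parent nation row,
# so a later join can follow addresses instead of comparing key values.
join_index = [(nation_index[natcode], address) for natcode, address in stock_index]
print(join_index)
```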

Page 27: Data  Processing  Architectures

Data coding standards
ASCII
UNICODE

Page 28: Data  Processing  Architectures

ASCII
Each alphabetic, numeric, or special character is represented by a 7-bit code
128 possible characters
ASCII code usually occupies one byte

Page 29: Data  Processing  Architectures

UNICODE
A unique binary code for every character, no matter what the platform, program, or language
Currently contains 34,168 distinct characters derived from 24 supported language scripts
Covers the principal written languages

Page 30: Data  Processing  Architectures

UNICODE
Two encoding forms
  A default 16-bit form
  An 8-bit form called UTF-8 for ease of use with existing ASCII-based systems
The default encoding of HTML and XML
The basis of global software
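The compatibility between UTF-8 and ASCII can be checked directly. The sketch below uses Python's built-in string encoding; the three sample characters are chosen only for illustration.

```python
# Sample characters: Latin capital A, e with acute accent, euro sign.
samples = ["A", "\u00e9", "\u20ac"]

utf8_lengths = {}
utf16_lengths = {}
for ch in samples:
    utf8_lengths[ch] = len(ch.encode("utf-8"))       # variable length, 8-bit units
    utf16_lengths[ch] = len(ch.encode("utf-16-le"))  # 16-bit units, no byte-order mark
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_lengths[ch]} byte(s), UTF-16 {utf16_lengths[ch]} byte(s)")

# An ASCII character has the same single byte in ASCII and in UTF-8.
same_byte = "A".encode("ascii") == "A".encode("utf-8")
```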

Page 31: Data  Processing  Architectures

Comma-separated values (CSV)
A text file
Records separated by line breaks
  Typically, all records have the same set of fields in the same sequence
  First record can be a header
Each record consists of fields separated by some other character or string
  Usually a comma or tab
Strings usually enclosed in quotes
Can import into and export from MySQL

Page 32: Data  Processing  Architectures

CSV
Header
"shrcode","shrfirm","shrprice","shrqty","shrdiv","shrpe"
Data
"AR","Abyssinian Ruby",31.82,22010,1.32,13
"BE","Burmese Elephant",0.07,154713,0.01,3
"BS","Bolivian Sheep",12.75,231678,1.78,11
"CS","Canadian Sugar",52.78,4716,2.50,15
"FC","Freedonia Copper",27.50,10529,1.84,16
"ILZ","Indian Lead & Zinc",37.75,6390,3.00,12
"NG","Nigerian Geese",35.00,12323,1.68,10
"PT","Patagonian Tea",55.25,12635,2.50,10
"ROF","Royal Ostrich Farms",33.75,1234923,3.00,6
"SLG","Sri Lankan Gold",50.37,32868,2.68,16
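A file in this layout can be read with Python's standard `csv` module. The sketch below parses the header and three of the data records from an in-memory string; the variable names are illustrative, and the numeric conversion is done by hand because `csv.DictReader` returns every field as text by default.

```python
import csv
import io

# Header plus three of the records shown on the slide.
csv_text = '''"shrcode","shrfirm","shrprice","shrqty","shrdiv","shrpe"
"AR","Abyssinian Ruby",31.82,22010,1.32,13
"BE","Burmese Elephant",0.07,154713,0.01,3
"ILZ","Indian Lead & Zinc",37.75,6390,3.00,12
'''

reader = csv.DictReader(io.StringIO(csv_text))  # first record supplies the field names
shares = []
for row in reader:
    row["shrprice"] = float(row["shrprice"])    # convert text to numbers explicitly
    row["shrqty"] = int(row["shrqty"])
    shares.append(row)

print(reader.fieldnames)
print(shares[0])
```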

Page 33: Data  Processing  Architectures

JavaScript object notation (JSON)
A language independent data exchange format
A collection of name/value pairs
An ordered list of values
Parsers available for most common languages
Extensions available to import to and export from MySQL

Page 34: Data  Processing  Architectures

JSON data types
Number
  Double precision floating-point
String
  A sequence of zero or more Unicode characters in double quotes, with backslash escaping of special characters
Object
Array
Null
  Empty

Page 35: Data  Processing  Architectures

JSON object
An unordered set of name/value pairs
  Each name is separated from its value by :
  Enclosed in curly braces

Page 36: Data  Processing  Architectures

JSON array
An ordered collection of values
  Enclosed in square brackets

Page 37: Data  Processing  Architectures

JSON
{
  "shares": [
    {
      "shrcode": "FC",
      "shrdiv": 1.84,
      "shrfirm": "Freedonia Copper",
      "shrpe": 16,
      "shrprice": 27.5,
      "shrqty": 10529
    },
    {
      "shrcode": "PT",
      "shrdiv": 2.5,
      "shrfirm": "Patagonian Tea",
      "shrpe": 10,
      "shrprice": 55.25,
      "shrqty": 12635
    }
  ]
}

Callouts: Array (the square brackets), Object (each pair of curly braces), Value (the item after each colon)
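To show how a parser maps these structures, the sketch below reads a trimmed version of the share data with Python's standard `json` module. The variable names are illustrative, and only four of the six fields are kept for brevity.

```python
import json

text = '''
{ "shares": [
    { "shrcode": "FC", "shrfirm": "Freedonia Copper", "shrprice": 27.5, "shrqty": 10529 },
    { "shrcode": "PT", "shrfirm": "Patagonian Tea", "shrprice": 55.25, "shrqty": 12635 }
] }
'''

data = json.loads(text)      # the outer JSON object becomes a Python dict
shares = data["shares"]      # the JSON array becomes a Python list
first = shares[0]            # each JSON object in the array becomes a dict
print(first["shrfirm"], first["shrprice"])

round_trip = json.dumps(data, sort_keys=True)  # serialize back to JSON text
```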

Page 39: Data  Processing  Architectures

Data storage devices
What data storage device will be used for
  On-line data
    • Access speed
    • Capacity
  Back-up files
    • Security against data loss
  Archival data
    • Long-term storage

Page 40: Data  Processing  Architectures

Key variables

Data volume
Data volatility
Access speed
Storage cost
Medium reliability
Legal standing of stored data

Page 41: Data  Processing  Architectures

Magnetic technology
The major form of data storage
A mature and widely used technology
Strong magnetic fields can erase data
Magnetization decays with time

Page 42: Data  Processing  Architectures

Hard disk drive (HDD)
Sealed, permanently mounted
Highly reliable
Access times of 4-10 msec
Transfer rates as high as 1,300 Mbytes per second
Capacities in Tbytes

Page 43: Data  Processing  Architectures

Hard disk drive (HDD)
HDD unit shipments and sales revenues are declining, though production (exabytes per year) is growing

Page 44: Data  Processing  Architectures

A disk storage unit

Page 45: Data  Processing  Architectures

RAID
Redundant arrays of inexpensive or independent drives
Exploits economies of scale of disk manufacturing for the personal computer market
Can also give greater security
Increases a system's fault tolerance
Not a replacement for regular backup

Page 46: Data  Processing  Architectures

Parity
A bit added to the end of a binary code that indicates whether the number of bits in the string with the value one is even or odd
Parity is used for detecting and correcting errors

Data     Number of one bits  Even parity  Odd parity
0001100  2                   00011000     00011001
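The table row can be reproduced with a short sketch. The function names are hypothetical; bit strings are used rather than integers so the output lines up with the table.

```python
def with_parity(bits: str, even: bool = True) -> str:
    """Append a parity bit so the count of one bits is even (or odd)."""
    ones = bits.count("1")
    parity = ones % 2 if even else (ones + 1) % 2
    return bits + str(parity)


def is_valid(code: str, even: bool = True) -> bool:
    """Check a code, parity bit included, against the chosen parity scheme."""
    return code.count("1") % 2 == (0 if even else 1)


even_code = with_parity("0001100", even=True)
odd_code = with_parity("0001100", even=False)
print(even_code, odd_code)

corrupted = "1" + even_code[1:]   # flip the first bit to simulate an error
print(is_valid(even_code), is_valid(corrupted))
```

A single parity bit by itself only reveals that an odd number of bits changed; recovering the lost data additionally requires knowing which position (or which drive) failed.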

Page 47: Data  Processing  Architectures

Mirroring

Page 48: Data  Processing  Architectures

Mirroring
Write
  Identical copies of a file are written to each drive in an array
Read
  Alternate pages are read simultaneously from each drive
  Pages put together in memory
  Access time is reduced by a factor of approximately the number of disks in the array
Read error
  Read required page from another drive
Tradeoffs
  Reduced access time
  Greater security
  More disk space

Page 49: Data  Processing  Architectures

Striping

Page 50: Data  Processing  Architectures

Striping
Three drive model
Write
  Half of file to first drive
  Half of file to second drive
  Parity bits to third drive
Read
  Portions from each drive are put together in memory
Read error
  Lost bits are reconstructed from third drive’s parity data
Tradeoffs
  Increased data security
  Requires less storage capacity than mirroring
  Not as fast as mirroring
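The reconstruction step can be sketched with exclusive-or (XOR) parity, a common way to implement it: the parity drive holds the XOR of the two data drives, so XOR-ing the surviving drive with the parity restores the lost half. The byte strings and function names below are illustrative assumptions, not a description of any specific RAID controller.

```python
def split_halves(data: bytes):
    """Split a file into two equal halves, padding with a zero byte if needed."""
    if len(data) % 2:
        data += b"\x00"
    middle = len(data) // 2
    return data[:middle], data[middle:]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# Write: half to drive 1, half to drive 2, parity to drive 3.
drive1, drive2 = split_halves(b"ABCDEFGH")
drive3 = xor_bytes(drive1, drive2)

# Read error: drive 2 is lost, so rebuild it from drive 1 and the parity drive.
recovered = xor_bytes(drive1, drive3)
print(recovered)
```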

Page 51: Data  Processing  Architectures

RAID levels

All levels, except 0, have common features
  The operating system sees a set of physical drives as one logical drive
  Data are distributed across physical drives
  Parity is used for data recovery

Page 52: Data  Processing  Architectures

RAID levels
Level 0
  Data spread across multiple drives
  No data recovery when a drive fails
Level 1
  Mirroring
  Critical non-stop applications
Level 3
  Striping
Level 5
  A variation of striping
  Parity data is spread across drives
  Requires less capacity than level 1
  Higher I/O rates than level 3

Page 53: Data  Processing  Architectures

RAID 5

Page 54: Data  Processing  Architectures

Magnetic technology

Removable magnetic disk
Magnetic tape
Magnetic tape cartridge
Mass storage

Page 55: Data  Processing  Architectures

Solid State
Arrays of memory chips
  ~30 cents per Gbyte
  Magnetic disk is ~5 cents per Gbyte
Prices for SSD are decreasing much faster than HDD prices
Faster
Less energy
More reliable
Handhelds and laptops

Page 56: Data  Processing  Architectures

Flash drive
Small
Removable
Solid state
USB connector
Up to 1 Tbyte capacity
Around 25 cents per Gbyte for smaller capacity drives
About 50 cents per Gbyte for larger capacity drives

Page 57: Data  Processing  Architectures

Optical technology
A more recent development than magnetic
Use a laser for reading and writing data
High storage densities
Low cost
Direct access
Long storage life
Not susceptible to head crashes

Page 58: Data  Processing  Architectures

Digital Versatile Disc (DVD)
DVD drives have transfer rates of around 2.76 Mbytes/sec and access times of 150 msec
Read-only versions
  DVD-Video (movies)
  DVD-ROM (software)
  DVD-Audio (songs)
DVD-R
  Recordable (write once, read many)
DVD-RAM
  Erasable (write many, read many)

Page 59: Data  Processing  Architectures

Blu-ray Disc
Capacity of 25 to 50 Gbytes
20 layer version can store 500 Gbytes
Versions for BD-ROM, BD-R, BD-RE

Page 60: Data  Processing  Architectures

Storage life

Page 61: Data  Processing  Architectures

Merit of data storage devices

Device          Access speed  Volume  Volatility  Cost per megabyte  Reliability  Legal standing
Solid state     ***           *       ***         *                  **           *
Fixed disk      ***           ***     ***         **                 **           *
RAID            ***           ***     ***         **                 ***          *
Removable disk  **            **      ***         **                 **           *
Tape            *             **      *           ***                **           *
Cartridge       **            ***     *           ***                **           *
Mass storage    **            ***     *           ***                **           *
SAN             ***           ***     ***         **                 ***          *
Optical-ROM     *             ***     *           ***                ***          ***
Optical-R       *             ***     *           ***                ***          **
Optical-RW      *             ***     **          ***                ***          *

Page 62: Data  Processing  Architectures

Data compression
Encoding digital data so it requires less storage space and thus less network bandwidth
Lossless
  File can be restored to original state
Lossy
  File cannot be restored to original state
  Used for graphics, video, and audio files
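Lossless compression can be demonstrated with Python's standard `zlib` module: the decompressed bytes are identical to the original. The sample data is an invented, highly repetitive string chosen so that it compresses well.

```python
import zlib

# Illustrative, repetitive data; real files usually compress less dramatically.
original = b"AR,Abyssinian Ruby;" * 200

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed), restored == original)
```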

Page 63: Data  Processing  Architectures

Recent developments
Declining cost of main-memory
Multicore processors
Can get massive performance improvements for business analytics
SAP HANA is a product of these recent developments
  Overcomes the disk bottleneck

Page 64: Data  Processing  Architectures

Rethinking metrics
Traditional
  Cost per TB
New
  Cost per TB per second
Main memory is roughly 10 times more expensive but 1000 times faster
Total cost of ownership lower as well
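A small worked example of the new metric, using only the two ratios quoted above. The figures are normalized so that disk costs 1 unit per TB and has a relative speed of 1; the interpretation of "cost per TB per second" as cost divided by relative speed is an assumption made for this sketch.

```python
# Normalized, illustrative figures built from the ratios on the slide.
disk = {"cost_per_tb": 1.0, "relative_speed": 1.0}
memory = {"cost_per_tb": 10.0, "relative_speed": 1000.0}


def cost_per_tb_per_second(device):
    """Cost of a TB scaled by how quickly that TB can be processed (assumed metric)."""
    return device["cost_per_tb"] / device["relative_speed"]


disk_metric = cost_per_tb_per_second(disk)
memory_metric = cost_per_tb_per_second(memory)
advantage = disk_metric / memory_metric
print(disk_metric, memory_metric, advantage)
```

On the traditional metric main memory looks 10 times worse; on the speed-adjusted metric, under these assumptions, it comes out roughly 100 times better.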

Page 65: Data  Processing  Architectures

Eliminating disk-based database

Faster and cheaper architecture
Some firms will have a need for disk for some years because of database size
Transition as main memory becomes cheaper

Page 66: Data  Processing  Architectures

Columnar and row-based data storage
A table can be stored as a series of rows or columns
Row-storage typically good for transactions
Column-storage typically good for business analytics
In-memory facilitates either approach
  And so can disk
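The two layouts can be contrasted with a toy table. The sketch below stores three share records both ways; the structures are illustrative and say nothing about how any particular product lays out memory.

```python
names = ("shrcode", "shrfirm", "shrprice", "shrqty")

# Row storage: each record is kept together, which suits reading or
# updating one whole record at a time (a typical transaction).
rows = [
    ("AR", "Abyssinian Ruby", 31.82, 22010),
    ("BE", "Burmese Elephant", 0.07, 154713),
    ("BS", "Bolivian Sheep", 12.75, 231678),
]

# Column storage: each field is kept together, which suits scanning
# one field across all records (a typical analytic query).
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

one_share = rows[1]                    # transaction-style access: one full record
total_qty = sum(columns["shrqty"])     # analytics-style access: one full column
print(one_share, total_qty)
```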

Page 67: Data  Processing  Architectures

In-memory systems
In-memory can achieve compression ratios of 5 to 10
  Helps reduce cost of the transition
SQL with some extensions
SAP has showcased a 250 TB system with 250 nodes

Page 69: Data  Processing  Architectures

Key points
Disk drives are relatively slow compared to main memory
Storage devices vary on several parameters
SSD gradually replacing HDD
Select a storage device based on storage and retrieval goals
In-memory database is a recent and growing development