CISC 662 Graduate Computer Architecture
Lecture 1 - Introduction
Michela Taufer
http://www.cis.udel.edu/~taufer/courses
PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from:
Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)
2
Course Overview
3
CISC 662: Information
Instructor: Michela Taufer
Office: 406 Smith Hall
Office Hours: TR 3:00 - 4:00 or by appt.
TA: James Atlas - [email protected]
Lectures: TTR 12:30 - 1:45
Text: Computer Architecture: A Quantitative Approach, Fourth Edition (2006)
Web page: http://www.cis.udel.edu/~taufer/courses
Lectures available on the course webpage 24 hours before class
Mailing list: [email protected] or [email protected] (0072)
4
CISC 662 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
[Figure: forces on computer architecture: Technology, Programming Languages, Operating Systems, History, Applications, Compilers, Interface Design (ISA), Measurement & Evaluation, Parallelism]
Computer Architecture:
• Instruction Set Design
• Organization
• Hardware/Software Boundary
5
Tentative Topics Coverage
Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Ed., 2006
Tentative Schedule:
• 2.5 weeks: Fundamentals of Computer Architecture, Instruction Set Architecture
• 1.5 weeks: Pipelining
• 3.0 weeks: Instruction-Level Parallelism
• 1.5 weeks: Multiprocessors and Thread-Level Parallelism
• 3.0 weeks: Memory and Memory Hierarchy
6
Lecture Style
• ~10 min: Review / Quiz
• ~25 min: Lecture / Discussion
• ~5 min: Admin / Announcements
• ~25 min: Lecture / Review work in groups
• ~10 min: Questions / Comments
[Figure: audience attention vs. time, with a 20 min. break and the "In Conclusion, ..." slide]
7
Grading
• Grade based on:
  – Homework assignments
  – Midterm exam
  – Final exam
  – Reading assignments and quizzes
8
Participation
• Complete reading and homework assignments on time
• Print and review slides before coming to class
  – Slides will be available 24 hours before the class starts
9
Getting Help
• Course webpage at http://cis.udel.edu/~taufer/courses
  – Copies of lectures and project assignments
  – Clarifications to assignments, deadlines
  – Syllabus and class schedule
  – User: cisc662student
  – Password: Study4Fun!
• Discussions through mailing list
  – Clarifications to assignments, general discussion
  – Send e-mail to all with: [email protected]
• Personal help
  – Benefit from office hours
10
Cheating
• What is cheating?
  – Sharing code: either by copying, retyping, looking at, or supplying a copy of a file.
• What is NOT cheating?
  – Helping others use systems or tools.
  – Helping others with high-level design issues.
  – Helping others debug their code.
• Penalty for cheating:
  – Removal from course with failing grade.
11
Concepts in Architecture (I)
12
What’s Inside a Computer?
[Figure: block diagram of a computer: the CPU (ALU, Instruction Decoder, Clock), the memory hierarchy (cache, main memory, disk), and the Input/Output units behind an I/O controller]
13
What Does Each Unit Do?
[Figure: for a program such as "a+b=c; print c", the CPU (ALU, Instruction Decoder, Clock) fetches operands a and b from the memory hierarchy (cache, main memory, disk), stores the result c back, and performs "print c" through the I/O controller and I/O units]
14
What is "Computer Architecture"?
[Figure: levels of abstraction: Applications, Operating System, Compiler, Firmware, Instruction Set Architecture, Instruction Set Processor and I/O system, Datapath & Control, Digital Design, Circuit Design, Layout & Fab, Semiconductor Materials]
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
15
Technology constantly on the move!
• All major manufacturers have announced and/or are shipping multi-core processor chips
• Intel talking about 80 cores in the not-too-distant future
• 3-dimensional chip technology
  – Sandwiches of silicon
  – "Through-vias" for communication
• Number of transistors per die keeps increasing
  – Intel Core 2: 65 nm, 291 million transistors!
  – Intel Pentium D 900: 65 nm, 376 million transistors!
[Photo: Intel Core Duo]
16
Dramatic Technology Advance
• Prehistory: Generations
  – 1st: Tubes
  – 2nd: Transistors
  – 3rd: Integrated Circuits
  – 4th: VLSI ...
  – 5th: Nanotubes? Optical? Quantum?
• Discrete advances in each generation
  – Faster, smaller, more reliable, easier to utilize
• Modern computing: Moore's Law
  – Continuous advance, fairly homogeneous technology
17
Moore’s Law
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• The number of transistors on a cost-effective integrated circuit doubles roughly every 18 months (see the worked figure below)
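A rough way to quantify the trend (a worked sketch of the usual formulation, not taken from the slide): with a doubling period of 1.5 years, transistor count grows as

```latex
N(t) \approx N_0 \cdot 2^{t/1.5}
\qquad\Rightarrow\qquad
\frac{N(10\ \text{yr})}{N_0} = 2^{10/1.5} \approx 100
```

so a decade of Moore's Law buys roughly two orders of magnitude in transistor count.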
18
Computer Architecture's Changing Definition
• 1950s to 1960s: Computer Architecture Course: Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISAs appropriate for compilers
• 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks
• 2000s: Multi-core design, on-chip networking, parallel programming paradigms, power reduction
• 2010s: Computer Architecture Course: Self-adapting systems? Self-organizing structures? DNA systems/Quantum computing?
19
The Instruction Set: a Critical Interface
[Figure: the instruction set is the interface between software (above) and hardware (below)]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
20
Instruction Set Architecture
... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.
  – Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
21
Computer Architecture is an Integrated Approach
• What really matters is the functioning of the complete system
  – hardware, runtime system, compiler, operating system, and application
  – In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors, individual instructions, or particular implementations
  – E.g., original RISC projects replaced complex instructions with a compiler + simple instructions
• It is very important to think across all hardware/software boundaries
  – New technology ⇒ New Capabilities ⇒ New Architectures ⇒ New Tradeoffs
  – Delicate balance between backward compatibility and efficiency
22
Elements of an ISA
• Set of machine-recognized data types
  – bytes, words, integers, floating point, strings, ...
• Operations performed on those data types
  – Add, sub, mul, div, xor, move, ...
• Programmable storage
  – regs, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
  – Literal, reg., absolute, relative, reg + offset, ...
• Format (encoding) of the instructions
  – Op code, operand fields, ...
23
Example: MIPS R3000
[Figure: register file r0, r1, ..., r31, plus PC, lo, hi]
Programmable storage:
  – 2^32 bytes of memory
  – 31 x 32-bit GPRs (R0 = 0)
  – 32 x 32-bit FP regs (paired for DP)
  – HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundary
24
ISA vs. Computer Architecture
• Old definition of computer architecture = instruction set design
  – Other aspects of computer design called implementation
  – Insinuates implementation is uninteresting or less challenging
• Our view is computer architecture >> ISA
• Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
• Since instruction set design is not where the action is, some conclude computer architecture (using the old definition) is not where the action is
  – We disagree on the conclusion
  – Agree that ISA is not where the action is (ISA in CA:AQA 4/e appendix)
25
Computer Architecture Topics
[Figure: topics for a single processor]
• Instruction Set Architecture
• Pipelining and Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• Memory Hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies; VLSI; addressing, protection, exception handling
• Input/Output and Storage: disks, WORM, tape; RAID
• Network communication with other processors
26
Computer Architecture Topics
[Figure: a multiprocessor as processor (P), memory (M) pairs connected through an interconnection network (S): the processor-memory-switch level of design]
• Multiprocessors, Networks and Interconnections
• Topologies, routing, bandwidth, latency, reliability
• Network interfaces
• Shared memory, message passing, data parallelism
27
Concepts in Architecture (II)
28
Fundamental Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction (a toy version of this loop is sketched below)
[Figure: processor (registers, functional units) connected to memory holding the program and data; this processor-memory path is the von Neumann bottleneck]
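As a rough illustration (not from the slides), the cycle above can be written as the main loop of a toy accumulator machine; the opcodes and memory layout below are invented for the example.

```c
/* Toy illustration of the fetch-decode-execute cycle for a made-up
 * accumulator machine: opcodes and memory layout are hypothetical. */
#include <stdio.h>

enum { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };

int main(void) {
    /* Unified program/data memory (von Neumann): each instruction is
     * encoded as opcode*100 + address. Data lives at addresses 10..12. */
    int mem[16] = {
        LOAD  * 100 + 10,   /* acc = mem[10]   */
        ADD   * 100 + 11,   /* acc += mem[11]  */
        STORE * 100 + 12,   /* mem[12] = acc   */
        HALT  * 100,
        [10] = 4, [11] = 38, [12] = 0
    };
    int pc = 0, acc = 0;

    for (;;) {
        int inst   = mem[pc];          /* Instruction Fetch  */
        int opcode = inst / 100;       /* Instruction Decode */
        int addr   = inst % 100;
        if (opcode == HALT) break;
        int operand = mem[addr];       /* Operand Fetch      */
        switch (opcode) {              /* Execute            */
        case LOAD:  acc = operand;       break;
        case ADD:   acc = acc + operand; break;
        case STORE: mem[addr] = acc;     break;   /* Result Store */
        }
        pc = pc + 1;                   /* Next Instruction   */
    }
    printf("mem[12] = %d\n", mem[12]); /* prints 42 */
    return 0;
}
```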
29
What’s a Clock Cycle?
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays
  – clock propagation, wire lengths, drivers
[Figure: a clock cycle covers the delay from one latch or register through combinational logic to the next register]
30
Pipelined Instruction Execution
Time (clock cycles) →
Instr. order ↓   Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5   Cycle 6   Cycle 7
i                Ifetch    Reg       ALU       DMem      Reg
i+1                        Ifetch    Reg       ALU       DMem      Reg
i+2                                  Ifetch    Reg       ALU       DMem      Reg
i+3                                            Ifetch    Reg       ALU       DMem
Each instruction still takes five stages, but a new instruction starts every cycle.
31
Limits to pipelining
• Maintain the von Neumann "illusion" of one-instruction-at-a-time execution
• Hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: attempt to use the same hardware to do two different things at once
  – Data hazards: instruction depends on the result of a prior instruction still in the pipeline (see the example after this list)
  – Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
• Power: too many things happening at once ⇒ melt your chip!
  – Must disable parts of the system that are not being used
  – Clock gating, asynchronous design, low voltage swings, ...
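For a concrete, illustrative case of a data hazard (a sketch, not from the slides): the second statement below consumes the result of the first, so once the two are overlapped in a pipeline, the second reaches its operand-fetch stage before the first has written its result, and the hardware must forward the value or stall.

```c
/* Minimal illustration of a data dependence that becomes a data
 * hazard once instructions overlap in a pipeline. */
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, e = 3;
    int c = a + b;   /* instruction i:   produces c                 */
    int d = c + e;   /* instruction i+1: needs c immediately;       */
                     /* a pipelined CPU must forward c or stall     */
    printf("c=%d d=%d\n", c, d);
    return 0;
}
```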
32
Progression of ILP
• 1st generation RISC - pipelined
  – Full 32-bit processor fit on a chip ⇒ issue almost 1 IPC
    » Need to access memory 1+x times per cycle
  – Floating-point unit on another chip
  – Cache controller a third, off-chip cache
  – 1 board per processor in multiprocessor systems
• 2nd generation: superscalar
  – Processor and floating-point unit on chip (and some cache)
  – Issuing only one instruction per cycle uses at most half
  – Fetch multiple instructions, issue a couple
    » Grows from 2 to 4 to 8 ...
  – How to manage dependencies among all these instructions?
  – Where does the parallelism come from?
• VLIW (Very Long Instruction Word)
  – Expose some of the ILP to the compiler, allow it to schedule instructions to reduce dependences
33
Modern ILP
• Dynamically scheduled, out-of-order execution
  – Current microprocessors fetch 10s of instructions per cycle
  – Pipelines are 10s of cycles deep ⇒ many 10s of instructions in execution at once
• What happens:
  – Grab a bunch of instructions, determine all their dependences, eliminate dependences wherever possible, throw them all into the execution unit, let each one move forward as its dependences are resolved
  – Appears as if executed sequentially
  – On a trap or interrupt, capture the state of the machine between instructions perfectly
• Huge complexity
  – Complexity of many components scales as n² (issue width)
  – Power consumption is a big problem
34
Have we reached the end of ILP?
• Multiple processors easily fit on a chip
• Every major microprocessor vendor has gone to multithreading
  – Thread: locus of control, execution context
  – Fetch instructions from multiple threads at once, throw them all into the execution unit
  – Intel: hyperthreading
  – Concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)
• Vector processing
  – Each instruction processes many distinct data
  – Ex: MMX
• Raise the level of architecture: many processors per chip
[Photo: Tensilica configurable processor]
35
The Memory Abstraction
• Association of <name, value> pairs
  – names are typically byte addresses
  – values often aligned on multiples of their size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of an address returns the most recently written value bound to that address (see the sketch below)
[Figure: memory interface: address (name), command (R/W), data (W), data (R), done]
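A minimal sketch of this abstraction (illustrative only): memory modeled as an array of bytes, where a read observes the last write to the same address.

```c
/* Sketch of the memory abstraction: write binds a value to an
 * address, read returns the most recently written value. */
#include <stdint.h>
#include <stdio.h>

#define MEM_SIZE 1024
static uint8_t mem[MEM_SIZE];            /* <address, value> pairs */

static void mem_write(uint32_t addr, uint8_t value) { mem[addr] = value; }
static uint8_t mem_read(uint32_t addr)               { return mem[addr]; }

int main(void) {
    mem_write(0x10, 7);
    mem_write(0x10, 42);                 /* later write wins        */
    printf("%u\n", mem_read(0x10));      /* prints 42               */
    return 0;
}
```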
36
Processor-DRAM Memory Gap (latency)
[Figure: relative performance vs. time, 1980-2000, log scale from 1 to 1000. µProc performance improves 60%/yr (2x every 1.5 years); DRAM improves 9%/yr (2x every 10 years); the processor-memory performance gap grows about 50% per year (checked below).]
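A quick check of the 50%/yr figure (worked out here, not on the slide): if processor performance improves 60% per year and DRAM 9% per year, the ratio between them grows each year by

```latex
\frac{1.60}{1.09} \approx 1.47,\ \text{i.e. roughly } 50\%\ \text{per year}
```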
37
Levels of the Memory Hierarchy (circa 1995 numbers)
• Registers: 100s of bytes, << 1 ns; staging/transfer unit: instruction operands (1-8 bytes), managed by the program/compiler
• Cache: 10s-100s of KBytes, ~1 ns, $1s/MByte; staging/transfer unit: blocks (8-128 bytes), managed by the cache controller
• Main memory: MBytes, 100-300 ns, < $1/MByte; staging/transfer unit: pages (512 bytes-4 KBytes), managed by the OS
• Disk: 10s of GBytes, 10 ms (10,000,000 ns), $0.001/MByte; staging/transfer unit: files (MBytes), managed by the user/operator
• Tape: infinite capacity, sec-min access time, $0.0014/MByte
Upper levels are faster; lower levels are larger.
38
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access; see the sketch below)
• For the last 30 years, HW has relied on locality for speed
[Figure: processor (P), cache ($), and memory (MEM)]
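As an illustration (a sketch, not from the slides), the two traversal orders below touch the same data but differ in spatial locality: the row-major loop walks memory sequentially, while the column-major loop strides across it, so the first typically hits in the cache far more often.

```c
/* Illustration of spatial locality: C arrays are row-major, so the
 * row-by-row loop touches consecutive addresses, while the
 * column-by-column loop strides N elements between accesses. */
#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: consecutive addresses, whole cache
     * blocks are used before being evicted. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: a stride of N doubles (8*N bytes)
     * between successive accesses. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```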
39
The Cache Design Space
• Several interacting dimensions (a worked geometry example follows below)
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
• The optimal choice is a compromise
  – depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins
[Figure: the design space spanned by cache size, block size, and associativity; moving along one factor (Factor A vs. Factor B) can change a design from good to bad]
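As a worked example (parameters chosen for illustration, not from the slide): for a cache of size C bytes with block size B and associativity A, the number of sets is C / (B x A). The sketch below computes the offset, index, and tag bits of a 32-bit address for a hypothetical 32 KB, 4-way, 64-byte-block cache.

```c
/* Worked example of cache geometry for hypothetical parameters:
 * sets = size / (block_size * associativity). */
#include <stdio.h>

int main(void) {
    unsigned size  = 32 * 1024;  /* 32 KB cache (assumed)     */
    unsigned block = 64;         /* 64-byte blocks (assumed)  */
    unsigned assoc = 4;          /* 4-way set associative     */

    unsigned sets = size / (block * assoc);       /* 128 sets  */

    /* Bits of a 32-bit address used by the cache. */
    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned b = block; b > 1; b >>= 1) offset_bits++;   /* 6  */
    for (unsigned s = sets;  s > 1; s >>= 1) index_bits++;    /* 7  */
    unsigned tag_bits = 32 - index_bits - offset_bits;        /* 19 */

    printf("sets=%u offset=%u index=%u tag=%u\n",
           sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```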
40
Memory Abstraction and Parallelism
• Maintaining the illusion of sequential access to memory across a distributed system
• What happens when multiple processors access the same memory at once?
  – Do they see a consistent picture?
• Processing and processors embedded in the memory?
[Figure: two shared-memory organizations: processors P1...Pn, each with a cache ($), connected by an interconnection network to memories (Mem), either centralized or distributed with the processors]
41
Is it all about communication?
[Figure: Pentium IV chipset: processor, caches, and busses connect to memory and, through I/O controllers and adapters, to I/O devices (disks, displays, keyboards) and networks]
42
Work in Groups
43
Work in Groups
• Team up in groups of two
• Select one of these fallacies or misconceptions:
  – The cost of the processor dominates the cost of the system
  – The rated mean time to failure of disks is 1,200,000 hours, or almost 140 years, so disks practically never fail (the arithmetic is worked out below)
• Read the explanation in the book (copies of the paragraph will be provided)
• Rephrase in your own words the concept presented in the book
• Prepare a short presentation (up to 4 minutes) to present to the rest of the class (a sheet will be provided for your notes; write the name of the team and the members)
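The arithmetic behind the disk figure (worked out here as a check, not taken from the slide):

```latex
\frac{1{,}200{,}000\ \text{h}}{24 \times 365\ \text{h/yr}} \approx 137\ \text{years}
```

Yet for a large installation with, say, 1000 disks, the expected number of failures is about 1000 x 8760 / 1,200,000 ≈ 7.3 per year, so disk failures are in fact routine at scale.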