CISC 662 Graduate Computer Architecture
Lecture 1 - Introduction
Michela Taufer
http://www.cis.udel.edu/~taufer/courses
PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from:
Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)
2
Course Overview
3
CISC 662: Information
Instructor: Michela Taufer
Office: 406 Smith Hall
Office Hours: TR 3:00 - 4:00 or by appt.
TA: James Atlas - [email protected]
Lectures: TTR 12:30 - 1:45
Text: Computer Architecture: A Quantitative Approach, Fourth Edition (2006)
Web page: http://www.cis.udel.edu/~taufer/courses
Lectures available on the course webpage 24 hours before class
Mailing list: [email protected] or [email protected] (0072)
4
CISC 662 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
[Figure: forces on computer architecture: Technology, Programming Languages, Operating Systems, History, Applications, Compilers, Interface Design (ISA), Measurement & Evaluation, Parallelism]
Computer Architecture:
• Instruction Set Design
• Organization
• Hardware/Software Boundary
5
Tentative Topics Coverage
Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th Ed., 2006
Tentative Schedule:
• 2.5 weeks: Fundamentals of Computer Architecture, Instruction Set Architecture
• 1.5 weeks: Pipelining
• 3.0 weeks: Instruction-Level Parallelism
• 1.5 weeks: Multiprocessors and Thread-Level Parallelism
• 3.0 weeks: Memory and Memory Hierarchy
6
Lecture Style
• ~10 min: Review / Quiz
• ~25 min: Lecture / Discussion
• ~5 min: Admin / Announcements
• ~25 min: Lecture / Review work in groups
• ~10 min: Questions / Comments
[Figure: audience attention vs. time, with a 20 min. break and the "In Conclusion, ..." slide]
7
Grading
• Grade based on:
  – Homework assignments
  – Midterm exam
  – Final exam
  – Reading assignments and quizzes
8
Participation
• Complete reading and homework assignments on time
• Print and review slides before coming to class
  – Slides will be available 24 hours before the class starts
9
Getting Help
• Course webpage at http://cis.udel.edu/~taufer/courses
  – Copies of lectures and project assignments
  – Clarifications to assignments, deadlines
  – Syllabus and class schedule
  – User: cisc662student
  – Password: Study4Fun!
• Discussions through mailing list
  – Clarifications to assignments, general discussion
  – Send e-mail to all with: [email protected]
• Personal help
  – Benefit from office hours
10
Cheating
• What is cheating?
  – Sharing code: either by copying, retyping, looking at, or supplying a copy of a file.
• What is NOT cheating?
  – Helping others use systems or tools.
  – Helping others with high-level design issues.
  – Helping others debug their code.
• Penalty for cheating:
  – Removal from course with failing grade.
11
Concepts in Architecture (I)
12
What’s Inside a Computer?
[Figure: block diagram of a computer: the CPU (ALU, Instruction Decoder, Clock), the memory hierarchy (cache, main memory, disk), and the Input/Output units behind an I/O controller]
13
What Does Each Unit Do?
[Figure: for a program such as "a+b=c; print c", the CPU (ALU, Instruction Decoder, Clock) fetches operands a and b from the memory hierarchy (cache, main memory, disk), stores the result c back, and performs "print c" through the I/O controller and I/O units]
14
What is "Computer Architecture"?
[Figure: levels of abstraction: Applications, Operating System, Compiler, Firmware, Instruction Set Architecture, Instruction Set Processor and I/O system, Datapath & Control, Digital Design, Circuit Design, Layout & Fab, Semiconductor Materials]
• Coordination of many levels of abstraction
• Under a rapidly changing set of forces
• Design, Measurement, and Evaluation
15
Technology constantly on the move!
• All major manufacturers have announced and/or are shipping multi-core processor chips
• Intel talking about 80 cores in the not-too-distant future
• 3-dimensional chip technology
  – Sandwiches of silicon
  – "Through-vias" for communication
• Number of transistors per die keeps increasing
  – Intel Core 2: 65 nm, 291 million transistors!
  – Intel Pentium D 900: 65 nm, 376 million transistors!
[Photo: Intel Core Duo]
16
Dramatic Technology Advance
• Prehistory: Generations
  – 1st: Tubes
  – 2nd: Transistors
  – 3rd: Integrated Circuits
  – 4th: VLSI ...
  – 5th: Nanotubes? Optical? Quantum?
• Discrete advances in each generation
  – Faster, smaller, more reliable, easier to utilize
• Modern computing: Moore's Law
  – Continuous advance, fairly homogeneous technology
17
Moore’s Law
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• The number of transistors on a cost-effective integrated circuit doubles roughly every 18 months (see the worked figure below)
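A rough way to quantify the trend (a worked sketch of the usual formulation, not taken from the slide): with a doubling period of 1.5 years, transistor count grows as

```latex
N(t) \approx N_0 \cdot 2^{t/1.5}
\qquad\Rightarrow\qquad
\frac{N(10\ \text{yr})}{N_0} = 2^{10/1.5} \approx 100
```

so a decade of Moore's Law buys roughly two orders of magnitude in transistor count.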
18
Computer Architecture's Changing Definition
• 1950s to 1960s: Computer Architecture Course: Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course: Instruction Set Design, especially ISAs appropriate for compilers
• 1990s: Computer Architecture Course: Design of CPU, memory system, I/O system, Multiprocessors, Networks
• 2000s: Multi-core design, on-chip networking, parallel programming paradigms, power reduction
• 2010s: Computer Architecture Course: Self-adapting systems? Self-organizing structures? DNA systems/Quantum computing?
19
The Instruction Set: a Critical Interface
[Figure: the instruction set is the interface between software (above) and hardware (below)]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
20
Instruction Set Architecture
... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.
  – Amdahl, Blaauw, and Brooks, 1964
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
21
Computer Architecture is an Integrated Approach
• What really matters is the functioning of the complete system
  – hardware, runtime system, compiler, operating system, and application
  – In networking, this is called the "End to End argument"
• Computer architecture is not just about transistors, individual instructions, or particular implementations
  – E.g., original RISC projects replaced complex instructions with a compiler + simple instructions
• It is very important to think across all hardware/software boundaries
  – New technology ⇒ New Capabilities ⇒ New Architectures ⇒ New Tradeoffs
  – Delicate balance between backward compatibility and efficiency
22
Elements of an ISA
• Set of machine-recognized data types
  – bytes, words, integers, floating point, strings, ...
• Operations performed on those data types
  – Add, sub, mul, div, xor, move, ...
• Programmable storage
  – regs, PC, memory
• Methods of identifying and obtaining data referenced by instructions (addressing modes)
  – Literal, reg., absolute, relative, reg + offset, ...
• Format (encoding) of the instructions
  – Op code, operand fields, ...
23
Example: MIPS R3000
[Figure: register file r0, r1, ..., r31, plus PC, lo, hi]
Programmable storage:
  – 2^32 bytes of memory
  – 31 x 32-bit GPRs (R0 = 0)
  – 32 x 32-bit FP regs (paired for DP)
  – HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic/logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundary
24
ISA vs. Computer Architecture
• Old definition of computer architecture = instruction set design
  – Other aspects of computer design called implementation
  – Insinuates implementation is uninteresting or less challenging
• Our view is computer architecture >> ISA
• Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design
• Since instruction set design is not where the action is, some conclude computer architecture (using the old definition) is not where the action is
  – We disagree on the conclusion
  – Agree that ISA is not where the action is (ISA in CA:AQA 4/e appendix)
25
Computer Architecture Topics
[Figure: topics for a single processor]
• Instruction Set Architecture
• Pipelining and Instruction-Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• Memory Hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies; VLSI; addressing, protection, exception handling
• Input/Output and Storage: disks, WORM, tape; RAID
• Network communication with other processors
26
Computer Architecture Topics
[Figure: a multiprocessor as processor (P), memory (M) pairs connected through an interconnection network (S): the processor-memory-switch level of design]
• Multiprocessors, Networks and Interconnections
• Topologies, routing, bandwidth, latency, reliability
• Network interfaces
• Shared memory, message passing, data parallelism
27
Concepts in Architecture (II)
28
Fundamental Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction (a toy version of this loop is sketched below)
[Figure: processor (registers, functional units) connected to memory holding the program and data; this processor-memory path is the von Neumann bottleneck]
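As a rough illustration (not from the slides), the cycle above can be written as the main loop of a toy accumulator machine; the opcodes and memory layout below are invented for the example.

```c
/* Toy illustration of the fetch-decode-execute cycle for a made-up
 * accumulator machine: opcodes and memory layout are hypothetical. */
#include <stdio.h>

enum { LOAD = 0, ADD = 1, STORE = 2, HALT = 3 };

int main(void) {
    /* Unified program/data memory (von Neumann): each instruction is
     * encoded as opcode*100 + address. Data lives at addresses 10..12. */
    int mem[16] = {
        LOAD  * 100 + 10,   /* acc = mem[10]   */
        ADD   * 100 + 11,   /* acc += mem[11]  */
        STORE * 100 + 12,   /* mem[12] = acc   */
        HALT  * 100,
        [10] = 4, [11] = 38, [12] = 0
    };
    int pc = 0, acc = 0;

    for (;;) {
        int inst   = mem[pc];          /* Instruction Fetch  */
        int opcode = inst / 100;       /* Instruction Decode */
        int addr   = inst % 100;
        if (opcode == HALT) break;
        int operand = mem[addr];       /* Operand Fetch      */
        switch (opcode) {              /* Execute            */
        case LOAD:  acc = operand;       break;
        case ADD:   acc = acc + operand; break;
        case STORE: mem[addr] = acc;     break;   /* Result Store */
        }
        pc = pc + 1;                   /* Next Instruction   */
    }
    printf("mem[12] = %d\n", mem[12]); /* prints 42 */
    return 0;
}
```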
29
What’s a Clock Cycle?
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight issues + gate delays
  – clock propagation, wire lengths, drivers
[Figure: a clock cycle covers the delay from one latch or register through combinational logic to the next register]
30
Pipelined Instruction Execution
Time (clock cycles) →
Instr. order ↓   Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5   Cycle 6   Cycle 7
i                Ifetch    Reg       ALU       DMem      Reg
i+1                        Ifetch    Reg       ALU       DMem      Reg
i+2                                  Ifetch    Reg       ALU       DMem      Reg
i+3                                            Ifetch    Reg       ALU       DMem
Each instruction still takes five stages, but a new instruction starts every cycle.
31
Limits to pipelining
• Maintain the von Neumann "illusion" of one-instruction-at-a-time execution
• Hazards prevent the next instruction from executing during its designated clock cycle
  – Structural hazards: attempt to use the same hardware to do two different things at once
  – Data hazards: instruction depends on the result of a prior instruction still in the pipeline (see the example after this list)
  – Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
• Power: too many things happening at once ⇒ melt your chip!
  – Must disable parts of the system that are not being used
  – Clock gating, asynchronous design, low voltage swings, ...
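For a concrete, illustrative case of a data hazard (a sketch, not from the slides): the second statement below consumes the result of the first, so once the two are overlapped in a pipeline, the second reaches its operand-fetch stage before the first has written its result, and the hardware must forward the value or stall.

```c
/* Minimal illustration of a data dependence that becomes a data
 * hazard once instructions overlap in a pipeline. */
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, e = 3;
    int c = a + b;   /* instruction i:   produces c                 */
    int d = c + e;   /* instruction i+1: needs c immediately;       */
                     /* a pipelined CPU must forward c or stall     */
    printf("c=%d d=%d\n", c, d);
    return 0;
}
```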
32
Progression of ILP
• 1st generation RISC - pipelined
  – Full 32-bit processor fit on a chip ⇒ issue almost 1 IPC
    » Need to access memory 1+x times per cycle
  – Floating-point unit on another chip
  – Cache controller a third, off-chip cache
  – 1 board per processor in multiprocessor systems
• 2nd generation: superscalar
  – Processor and floating-point unit on chip (and some cache)
  – Issuing only one instruction per cycle uses at most half
  – Fetch multiple instructions, issue a couple
    » Grows from 2 to 4 to 8 ...
  – How to manage dependencies among all these instructions?
  – Where does the parallelism come from?
• VLIW (Very Long Instruction Word)
  – Expose some of the ILP to the compiler, allow it to schedule instructions to reduce dependences
33
Modern ILP
• Dynamically scheduled, out-of-order execution
  – Current microprocessors fetch 10s of instructions per cycle
  – Pipelines are 10s of cycles deep ⇒ many 10s of instructions in execution at once
• What happens:
  – Grab a bunch of instructions, determine all their dependences, eliminate dependences wherever possible, throw them all into the execution unit, let each one move forward as its dependences are resolved
  – Appears as if executed sequentially
  – On a trap or interrupt, capture the state of the machine between instructions perfectly
• Huge complexity
  – Complexity of many components scales as n² (issue width)
  – Power consumption is a big problem
34
Have we reached the end of ILP?
• Multiple processors easily fit on a chip
• Every major microprocessor vendor has gone to multithreading
  – Thread: locus of control, execution context
  – Fetch instructions from multiple threads at once, throw them all into the execution unit
  – Intel: hyperthreading
  – Concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)
• Vector processing
  – Each instruction processes many distinct data
  – Ex: MMX
• Raise the level of architecture: many processors per chip
[Photo: Tensilica configurable processor]
35
The Memory Abstraction
• Association of <name, value> pairs
  – names are typically byte addresses
  – values often aligned on multiples of their size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of an address returns the most recently written value bound to that address (see the sketch below)
[Figure: memory interface: address (name), command (R/W), data (W), data (R), done]
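A minimal sketch of this abstraction (illustrative only): memory modeled as an array of bytes, where a read observes the last write to the same address.

```c
/* Sketch of the memory abstraction: write binds a value to an
 * address, read returns the most recently written value. */
#include <stdint.h>
#include <stdio.h>

#define MEM_SIZE 1024
static uint8_t mem[MEM_SIZE];            /* <address, value> pairs */

static void mem_write(uint32_t addr, uint8_t value) { mem[addr] = value; }
static uint8_t mem_read(uint32_t addr)               { return mem[addr]; }

int main(void) {
    mem_write(0x10, 7);
    mem_write(0x10, 42);                 /* later write wins        */
    printf("%u\n", mem_read(0x10));      /* prints 42               */
    return 0;
}
```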
36
Processor-DRAM Memory Gap (latency)
[Figure: relative performance vs. time, 1980-2000, log scale from 1 to 1000. µProc performance improves 60%/yr (2x every 1.5 years); DRAM improves 9%/yr (2x every 10 years); the processor-memory performance gap grows about 50% per year (checked below).]
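A quick check of the 50%/yr figure (worked out here, not on the slide): if processor performance improves 60% per year and DRAM 9% per year, the ratio between them grows each year by

```latex
\frac{1.60}{1.09} \approx 1.47,\ \text{i.e. roughly } 50\%\ \text{per year}
```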
37
Levels of the Memory Hierarchy (circa 1995 numbers)
• Registers: 100s of bytes, << 1 ns; staging/transfer unit: instruction operands (1-8 bytes), managed by the program/compiler
• Cache: 10s-100s of KBytes, ~1 ns, $1s/MByte; staging/transfer unit: blocks (8-128 bytes), managed by the cache controller
• Main memory: MBytes, 100-300 ns, < $1/MByte; staging/transfer unit: pages (512 bytes-4 KBytes), managed by the OS
• Disk: 10s of GBytes, 10 ms (10,000,000 ns), $0.001/MByte; staging/transfer unit: files (MBytes), managed by the user/operator
• Tape: infinite capacity, sec-min access time, $0.0014/MByte
Upper levels are faster; lower levels are larger.
38
The Principle of Locality
• The Principle of Locality:
  – Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
  – Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  – Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access; see the sketch below)
• For the last 30 years, HW has relied on locality for speed
[Figure: processor (P), cache ($), and memory (MEM)]
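As an illustration (a sketch, not from the slides), the two traversal orders below touch the same data but differ in spatial locality: the row-major loop walks memory sequentially, while the column-major loop strides across it, so the first typically hits in the cache far more often.

```c
/* Illustration of spatial locality: C arrays are row-major, so the
 * row-by-row loop touches consecutive addresses, while the
 * column-by-column loop strides N elements between accesses. */
#include <stdio.h>

#define N 1024
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: consecutive addresses, whole cache
     * blocks are used before being evicted. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: a stride of N doubles (8*N bytes)
     * between successive accesses. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```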
39
The Cache Design Space
• Several interacting dimensions (a worked geometry example follows below)
  – cache size
  – block size
  – associativity
  – replacement policy
  – write-through vs. write-back
• The optimal choice is a compromise
  – depends on access characteristics
    » workload
    » use (I-cache, D-cache, TLB)
  – depends on technology / cost
• Simplicity often wins
[Figure: the design space spanned by cache size, block size, and associativity; moving along one factor (Factor A vs. Factor B) can change a design from good to bad]
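As a worked example (parameters chosen for illustration, not from the slide): for a cache of size C bytes with block size B and associativity A, the number of sets is C / (B x A). The sketch below computes the offset, index, and tag bits of a 32-bit address for a hypothetical 32 KB, 4-way, 64-byte-block cache.

```c
/* Worked example of cache geometry for hypothetical parameters:
 * sets = size / (block_size * associativity). */
#include <stdio.h>

int main(void) {
    unsigned size  = 32 * 1024;  /* 32 KB cache (assumed)     */
    unsigned block = 64;         /* 64-byte blocks (assumed)  */
    unsigned assoc = 4;          /* 4-way set associative     */

    unsigned sets = size / (block * assoc);       /* 128 sets  */

    /* Bits of a 32-bit address used by the cache. */
    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned b = block; b > 1; b >>= 1) offset_bits++;   /* 6  */
    for (unsigned s = sets;  s > 1; s >>= 1) index_bits++;    /* 7  */
    unsigned tag_bits = 32 - index_bits - offset_bits;        /* 19 */

    printf("sets=%u offset=%u index=%u tag=%u\n",
           sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```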
40
Memory Abstraction and Parallelism
• Maintaining the illusion of sequential access to memory across a distributed system
• What happens when multiple processors access the same memory at once?
  – Do they see a consistent picture?
• Processing and processors embedded in the memory?
[Figure: two shared-memory organizations: processors P1...Pn, each with a cache ($), connected by an interconnection network to memories (Mem), either centralized or distributed with the processors]
41
Is it all about communication?
[Figure: Pentium IV chipset: processor, caches, and busses connect to memory and, through I/O controllers and adapters, to I/O devices (disks, displays, keyboards) and networks]
42
Work in Groups
43
Work in Groups
• Team up in groups of two
• Select one of these fallacies or misconceptions:
  – The cost of the processor dominates the cost of the system
  – The rated mean time to failure of disks is 1,200,000 hours, or almost 140 years, so disks practically never fail (the arithmetic is worked out below)
• Read the explanation in the book (copies of the paragraph will be provided)
• Rephrase in your own words the concept presented in the book
• Prepare a short presentation (up to 4 minutes) to present to the rest of the class (a sheet will be provided for your notes; write the name of the team and the members)
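The arithmetic behind the disk figure (worked out here as a check, not taken from the slide):

```latex
\frac{1{,}200{,}000\ \text{h}}{24 \times 365\ \text{h/yr}} \approx 137\ \text{years}
```

Yet for a large installation with, say, 1000 disks, the expected number of failures is about 1000 x 8760 / 1,200,000 ≈ 7.3 per year, so disk failures are in fact routine at scale.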