Computer Architecture Evolution


Upload: fathir-dcyberking

Post on 27-Mar-2015


Page 1: ARSITEKTUR EVOLUSI KOMPUTER

A Geographic Information System (GIS; Indonesian: Sistem Informasi Geografis, SIG) is a specialized information system that manages data carrying spatial (georeferenced) information. In a narrower sense, it is a computer system capable of building, storing, managing, and displaying geographically referenced information, such as data identified by its location, in a database. Practitioners also count the people who build and operate the system, and the data itself, as parts of the system.

GIS technology can be used for scientific investigation, resource management, development planning, cartography, and route planning. For example, a GIS can help planners quickly calculate emergency response times during a natural disaster, or it can be used to find wetlands that need protection from pollution.

(http://id.wikipedia.org/wiki/Sistem_informasi_geografis)


Input/Output Interface System between Digital and Analog Systems

Computers today are no longer limited to processing and manipulating data; they are also used to control all kinds of equipment, such as telephone pulse counters, automatic light switches, and so on. Used this way, the computer acts as if it were a person who can be programmed to carry out whatever its programmer wants.

Between the digital system (the controller) and the analog system (the equipment being controlled) there must be a bridge that connects the two. This bridge is referred to from here on as the I/O interface system.

A digital control system therefore always consists of three parts: the digital system, the I/O interface system, and the analog system. The digital system is the brain of the whole. It reads the state of the analog system through the I/O interface system and controls the analog system through that same interface.

Digital control replaces manual control, which relies on mechanical switches that are themselves operated by hand. With digital control, the state of the controlled analog system can also be monitored. The analog system is the part of the analog equipment whose activity is controlled by the digital system through the I/O interface system. It can be a 220 V incandescent lamp, an AC motor, or even industrial equipment that draws large currents.

This shows how important the I/O interface system is: it interfaces a digital system that only knows the ‘H’ state, equivalent to a voltage of 4.5 V to 5 V, and the ‘L’ state, equivalent to a voltage below 1.2 V, with an analog system that runs at 220 VAC and draws at least 1 A.

Given these conditions, the digital and analog parts must be connected through an interface system whose sections are electrically isolated from each other. There are several I/O interfacing techniques, and each has its strengths in particular applications.


Example Application

A single PC is expected to control 10 lamp points that switch on and off at set times. A PPI card (built around the 8255 PPI chip) can control 24 loads. The PPI outputs are at TTL level, while the lamps are ordinary fluorescent (TL) lamps. To interface the PPI (digital system) with the lamps (analog system), 5 V relays are used.

This application is one example of using relays as the interface between a digital system and an analog system.
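
The lamp example can be sketched in C as the step that decides which output bits to set at a given hour. This is only an illustration under assumptions: the `lamp_schedule` type, the hour-based schedule, and the mapping of lamps 0..7 to port A and lamps 8..9 to port B are hypothetical and do not come from the original design. The actual port write is left out because it depends on the card's I/O address.

```c
#include <stdint.h>

#define LAMP_COUNT 10

/* 8255 mode-set control word: mode 0, ports A, B and C all outputs.
 * It would be written once to the control register before use. */
#define PPI_CTRL_ALL_OUTPUT 0x80u

/* Hypothetical schedule entry: the lamp is on from on_hour (inclusive)
 * to off_hour (exclusive), hours 0..23. If on_hour > off_hour the
 * interval wraps past midnight. Equal values mean "never on". */
typedef struct {
    uint8_t on_hour;
    uint8_t off_hour;
} lamp_schedule;

static int lamp_is_on(const lamp_schedule *s, uint8_t hour)
{
    if (s->on_hour == s->off_hour)
        return 0;
    if (s->on_hour < s->off_hour)
        return hour >= s->on_hour && hour < s->off_hour;
    return hour >= s->on_hour || hour < s->off_hour;
}

/* Build the bytes for port A (lamps 0..7) and port B (lamps 8..9).
 * A set bit means "energize the relay for this lamp". */
void lamp_port_bytes(const lamp_schedule sched[LAMP_COUNT], uint8_t hour,
                     uint8_t *port_a, uint8_t *port_b)
{
    uint16_t bits = 0;
    for (int i = 0; i < LAMP_COUNT; i++)
        if (lamp_is_on(&sched[i], hour))
            bits |= (uint16_t)(1u << i);
    *port_a = (uint8_t)(bits & 0xFFu);
    *port_b = (uint8_t)(bits >> 8);
}
```

A control loop would call `lamp_port_bytes` once per hour (or minute, with a finer schedule) and write the two bytes to the PPI ports, which then drive the relays.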

I/O Interface System

The best I/O interface system is one in which the digital system and the analog system are isolated from each other. This is usually done with a relay or an optocoupler. A relay is easier to use but causes problems more often, because it can inject noise into the digital system when it changes state. A relay also needs more power than an optoisolator.

A good interface system generally uses an optoisolator, better known as an optocoupler, such as the 4N31 or 4N35. An optocoupler needs less drive current, on the order of 10 mA to 15 mA.
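
The 10 mA to 15 mA drive current translates into a series resistor for the optocoupler LED through Ohm's law, R = (Vcc - Vf) / If. The sketch below works in millivolts and microamps to stay in integer arithmetic. The function name and the 1.2 V forward voltage used in the usage note are assumptions for illustration; the real forward voltage should be taken from the datasheet of the part in use.

```c
/* Series resistor (in ohms, rounded down) for an LED driven from
 * vcc_mv millivolts, with forward drop vf_mv millivolts, at a target
 * current of i_ua microamps. Returns -1 for inputs that make no sense. */
long led_series_resistor_ohms(long vcc_mv, long vf_mv, long i_ua)
{
    if (i_ua <= 0 || vcc_mv <= vf_mv)
        return -1;
    /* (mV * 1000) / uA = ohms */
    return (vcc_mv - vf_mv) * 1000L / i_ua;
}
```

With a 5 V supply and an assumed 1.2 V forward drop, 10 mA calls for about 380 ohms and 15 mA for about 253 ohms, so a standard value between those two would sit in the range the text mentions.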

Figure 1. Block Diagram

An optocoupler such as the 4N35 is preferred over driving a relay directly.

Optoisolator


An optoisolator is a component used for I/O control of equipment that operates on DC or AC voltage. An optocoupler consists of a GaAs LED and a silicon NPN phototransistor. Circuits that use the optoisolator are shown in Figures 3a and 3b.

In Figure 3a the optoisolator receives a TTL square-wave input, so its output is also a square wave, but with the voltage level shifted to the range 0 V to +24 V.

Figure 2. Optoisolator

Figure 3. Using the Optoisolator

In Figure 3b the optoisolator is used with a modulated input: Vin is isolated from the modulated Vout, whose peak voltage is +12 V.

The most important factor in an I/O interface, especially for loads that run on AC voltage, is isolation, and it must be considered in the design. The digital system uses a +5 V level while the load uses 220 VAC. This difference is large enough to damage the digital controller, a PC for example, if one of its ports picks up voltage induced from the 220 VAC load.

Figure 4. Optoisolator Application

With the schematic in Figure 4, the optoisolator is connected to 115 VAC, but the current passed through it is only 8 mA, which is enough to turn the phototransistor on so that the inverter sees a logic ‘low’. The circuit produces a periodic pulse train at the same frequency as the mains (PLN) voltage, 50/60 Hz, but as a square wave. As long as pulses appear at Pulse Out, voltage is still present on the mains; once the pulses stop, the mains voltage is 0 VAC.
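
On the digital side, the "pulses present means mains present" rule can be sketched as an edge counter over samples of the Pulse Out line. The function names, the sampled-array representation, and the `min_edges` threshold are assumptions made for illustration; a real design would more likely use a timer or interrupt input.

```c
#include <stddef.h>

/* Count rising edges (0 -> 1) in a window of samples of Pulse Out. */
unsigned count_rising_edges(const unsigned char *samples, size_t n)
{
    unsigned edges = 0;
    for (size_t i = 1; i < n; i++)
        if (!samples[i - 1] && samples[i])
            edges++;
    return edges;
}

/* Mains is treated as present when at least min_edges pulses were seen
 * in the window; a flat line (no edges) means the mains is down. */
int mains_present(const unsigned char *samples, size_t n, unsigned min_edges)
{
    return count_rising_edges(samples, n) >= min_edges;
}
```

If the window length in seconds is known, dividing the edge count by it also gives a rough estimate of the mains frequency.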

The drawback of the optocoupler is its switching speed. This is caused by the light-sensitive area and by the capacitance that arises at its junction. If fairly high switching speed is needed, the optoisolator must be configured so that it is used as a photodiode, as shown in Figure 5.


Figure 5. Diode-Diode Optocoupler

Another way to isolate a high-voltage circuit from a low-voltage circuit is to use a relay. The weaknesses of a relay are that one with a large current rating is fairly expensive, it is physically large so the PCB grows with it, it generates noise, and it responds slowly. An optocoupler, by contrast, is small, so the PCB and ultimately the whole device become smaller, and it responds faster.

Using a Solid State Relay (SSR)

As discussed above, a relay can still be used, but today the solid state relay is preferred for two reasons: it generates less noise, and it is relatively cheaper than a mechanical relay of the same quality.

Figure 6. Solid State Relay Equivalent Circuit

One more factor matters when controlling AC loads: the timing of activation. Because AC voltage varies continuously, the solid state relay should be activated when the AC voltage is close to zero volts. The aim is to extend the life of the solid state relay itself. If the SSR is activated while the AC voltage is at 220 VAC, for example, a ‘surge current’ arises that can become very large and eventually destroy the solid state relay.

To deal with this, a solid state relay must be accompanied by a zero crossing detector circuit. The zero crossing detector senses when the AC voltage is at zero volts. With that information, the moment to activate the solid state relay can be chosen and the solid state relay can work properly.

Figure 7. Zero Crossing Circuit (Isolated)

Figure 7 shows a zero crossing detector that is isolated by means of a step-down transformer. This technique is the safest, but it costs relatively more to build because it still needs a transformer.
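
The idea of waiting for a zero crossing before switching can also be illustrated in software. The sketch below is a software analogue, not the hardware circuit of Figure 7: it assumes the isolated, stepped-down AC waveform has been sampled into an array of signed millivolt values, and the function name is hypothetical.

```c
#include <stddef.h>

/* Return the index of the first sample at or after `start` where the
 * sampled AC voltage is exactly zero or has changed sign relative to
 * the previous sample. Returns -1 when no crossing is found. */
long next_zero_crossing(const int *mv, size_t n, size_t start)
{
    for (size_t i = start; i < n; i++) {
        if (mv[i] == 0)
            return (long)i;
        if (i > 0 &&
            ((mv[i - 1] < 0 && mv[i] > 0) || (mv[i - 1] > 0 && mv[i] < 0)))
            return (long)i;
    }
    return -1;
}
```

A controller that wants to turn a load on would record the request, call `next_zero_crossing` from the current sample position, and only then assert the SSR control line.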

With an interface circuit between the high-voltage and low-voltage sides in place, the ports of the microcontroller or PC should not be damaged by high voltage induced from applications such as AC motors.


COMPUTER ARCHITECTURE

The Intel i486 (often called the 486 or 80486) is a family of Intel 32-bit scalar CISC microprocessors that belongs to Intel's x86 processor family. The i486 is the successor to the Intel 80386. The first 486 microprocessor was introduced in 1989. The i486 is usually named without the 80 prefix because a court ruling barred numbers (such as 80486) from being trademarked. Number-based processor names were then dropped altogether when the i486's successor, the Pentium, went on sale.

From the software point of view, the instruction set of the i486 family is very similar to that of its predecessor, the Intel 80386, with only a few additional instructions.

From the hardware point of view, the i486 architecture is a major advance. The processor has a unified instruction and data cache on the chip, an on-chip floating-point unit (FPU) (DX models only), and an improved bus interface unit. In addition, under optimal conditions the processor core can sustain an execution rate of one instruction per clock cycle. These improvements roughly doubled performance over the Intel 80386 at the same clock rate. Even so, some i486 models turned out to be slower than the fastest 386 processors, notably the i486 'SX'.

Differences between the 80386 and the 80486

Data/Instruction Cache - an 8192-byte (8 kB) SRAM built into the processor core, designed to hold frequently used instructions. The 386 supported an off-chip cache, but it was much slower.

Pipelining - this allows the processor to perform a locate, fetch, and execute in every clock cycle. The pipeline takes over the flow of instruction information that previously needed two clock cycles: each locate must feed the next fetch, and each fetch must feed the next execute. The 386 had to carry out these steps separately for each instruction.

Improved MMU performance.

Integrated FPU - (DX models only) added math functions.

The 486 has a 32-bit data bus and a 32-bit address bus. This calls for 30-pin SIMMs installed in matched sets of four, or a single 72-pin SIMM. A 32-bit address bus limits addressable RAM to 4 GB.
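
The 4 GB limit is just the number of distinct byte addresses a 32-bit bus can express, 2 to the power 32. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <stdint.h>

/* Number of distinct byte addresses reachable with `bits` address
 * lines. Valid for bits in 0..63. */
uint64_t addressable_bytes(unsigned bits)
{
    return (uint64_t)1 << bits;
}
```

`addressable_bytes(32)` is 4294967296, which is 4 times 1024 cubed, the 4 GB figure quoted for the address bus.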

The project lead for the 80486 was Patrick Gelsinger.

In May 2006 Intel announced that production of the 80486 would stop at the end of September 2007.[1] Although the chip had long been obsolete for PC use, Intel had continued to produce it for use in embedded systems.


POWER PC

Sounding Off

It has been said many times before, but cannot be repeated often enough. I want you to be reading this material for the right reasons. First, I too am a die-hard 68K assembly programmer. Perhaps 60% of my personal toolbox is still in assembly. Not long ago, I saw red when I learned that neither CodeWarrior nor Think C would support inline PPC assembly, and that industry pundits routinely babbled nonsense about how we humans could never hope to do as good a job of optimization as a compiler. How can that be, I asked, when just yesterday the quality of their 68K code generation was lousy: not recognizing predecrement addressing, reloading registers far too often, not making use of implicit CCR updates,... Suddenly these guys know it all? Hah!

Well, it's not easy to adjust, but I'm working very hard to put my skepticism on hold and understand that something is fundamentally different about reduced instruction set computer (RISC) systems. We'll look in more detail at how it works in a minute, but briefly, RISC represents a partnership between hardware and compiler designers to share the job of optimization. In fact the burden has shifted dramatically toward the compiler. The machine is designed to simplify timing and dependency analysis so that instructions can be reordered, interleaved and scheduled to maximum advantage by the compiler. Now, I'm not going to forget about optimization hereafter. Rather, I'm going to yell and complain and poke sharp sticks at the responsible parties if they don't eventually get it together and make good on this promise. Nevertheless, I will give it time. The kinds of things that have to be managed to get superior performance are numerous and complex, but more amenable to analysis than they were with complex instruction set computer (CISC) systems. It looks exactly like a job to be automated - it looks exactly like what a compiler should do.

I no longer endorse assembly language programming. It is not and should not be your personal responsibility - it doesn't make sense anymore. That is the wrong way to use the material here. There are many things you can do, as always, to dramatically improve performance. Learn to profile your code to find out where you need bother in the first place. Then consider: better algorithms, more efficient data structures, better organization of information, better data base keys, moving indices instead of data, better default settings, presorting things at startup, using idle time, using locally buffered I/O, using asynchronous I/O, doing work offscreen, updating smaller areas... Retest, to make sure you've changed the right things. Also, remember that speed is in the eye of the beholder. Make use of progress bars, watch cursors and other busy box gizmos to amuse and delight. These are all better ways. They are stable, portable, maintainable and reusable.

The other reasons one formerly needed assembly were things like setting up access to global variables, and gluing things together. The people at Apple are not blind. They have done a lot to design the run-time model for PPC so that these problems are eased if not eliminated. There ought not to be a genuine reason to muck around from now on. Today we have to support multiple platforms and an explosion of new technologies just to reach acceptability, let alone competitiveness. I want you to be happy and successful. I urge you to apply your time, money and effort where it will make the biggest difference. Now more than ever, do the right thing.


The Right Reason

Debugging remains the real reason you must have familiarity with the machine and assembly language. Traditional places where low level debugging has proved very useful are non-application code such as INITs and code resources, spotting the consequences of nil pointers and overwriting array bounds, hunting memory leaks and many other things. These will always be with us. Even working in a high level language, bad things can happen due to inexperience with a new tool, poor documentation, or just being interrupted at some critical moment and losing one's thread. Sometimes high level language errors can themselves be quite insidious. Here is my favorite example of how it can all go wrong, admittedly stinging me more than once. Everyone knows that, theoretically, multiplying two 16-bit quantities gives a 32-bit result. Which of these actually does that?

long c;
short a, b;

1) c = a * b;
2) c = a, c *= b;
3) c = (long)a * b;
4) c = (long)a * (long)b;
5) c = (long)(a * b);

Congratulations on recognizing that either (3) or (4) does the right thing by using muls.w (68K) to create a long that's stored in c. (1) and (5) give a "wrong" result. They use muls.w to create the product, but move a sign extension of only the low word into c. (2) works, but uses software emulation (%%mul in Think C) to do the requested long * long multiplication - not something you want in a loop. This is the kind of slip that can lead to days of reevaluating your whole algorithm, or career as a programmer. It's most likely to be caught at a low level where you're watching instructions as well as results. I think you know how it goes. You will have to debug something stupid you did, your coworker did, the compiler did, a third party product did, Apple did... Everyone needs debugging skills. That's life in the Big City. Now that we understand each other, let's get on with it.
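
On most current compilers `int` is 32 bits, so cases (1) and (5) no longer misbehave and the pitfall is hard to reproduce directly. The sketch below emulates the 16-bit `int` behavior the text describes, using fixed-width types. The function names are made up for illustration, and the second function is a model of the truncation, not the code a 68K compiler would emit.

```c
#include <stdint.h>

/* Cases (3) and (4): widen before multiplying, so the full 32-bit
 * product is kept. */
int32_t mul_widened(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Cases (1) and (5) with a 16-bit int: only the low word of the
 * product survives, and that word is then sign-extended to 32 bits. */
int32_t mul_low_word_sign_extended(int16_t a, int16_t b)
{
    uint32_t full = (uint32_t)((int32_t)a * (int32_t)b);
    uint32_t low = full & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 65536 : (int32_t)low;
}
```

For a = b = 300 the widened product is 90000, while the low-word version yields 24464, which is the kind of silently wrong value the paragraph warns about.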

The Machine

The main theme characterizing RISC computing is keeping the CPU as busy as possible so that cycles are not wasted. This is achieved in two principal ways, superscalar design and pipelining. The term superscalar refers to the CPU being a collection of semi-independent execution units operating in parallel, so that instructions can be issued to these units in parallel (and possibly out of order, as long as they are not interdependent). The figure illustrates a simplified view of the communication paths among the execution units (IU, BPU, FPU) and the supporting memory managers and cache. We'll take a quick look at the operation of each, concentrating particularly on features contributing to machine performance.

Instruction and Branch Units

Instructions march through the instruction queue (IQ) from Q7 toward Q0 as vacancies are created. New instructions are requested as soon as possible. If the cache is hit, as many as eight instructions (the whole IQ, or, a cache sector) can be prefetched in one cycle. Otherwise, further bus cycles will be needed, but this is normally simultaneous with currently executing instructions. Instructions can be issued from any of the lower four elements of the IQ to either the branch processor (BPU) or floating-point unit (FPU), as long as the decode stage of the target unit is vacant. The integer unit (IU) is fed only through Q0, which doubles as the IU decode stage. Instruction fetch is normally sequential, unless the BPU decides on a change of execution path.

The BPU "owns" two registers holding branch target addresses, the link (LR) and count (CTR) registers, allowing relative independence of the BPU. The LR also gets the return address following a branch, if any. The condition register (CR) provides the information necessary to resolve conditional branches. One thing a good compiler should do is schedule the instruction that updates the CR well ahead of its dependent branch instruction to allow resolution as early as possible. The BPU can examine up to one branch instruction at a time. Unconditional branches are simply removed from the instruction stream, with fetching directed along the new path. Conditional branches have a predictor bit, indicating the more likely of branch taken/not taken. If a conditional branch is encountered, instructions continue to execute along the predicted path, but not as far as the writeback stage, where registers are updated. When the condition is resolved, if the prediction was correct, writeback is enabled and execution continues as if no branch occurred. If incorrect, the instruction unit backs up by flushing everything since the branch, and fetching a new cache sector of instructions. This process of effectively removing branches from code is termed branch folding.

This buys a great deal of speed. Recall that on the 68K, branches are among the most costly instructions. One used to employ loop unrolling to cut down on branches, making source code ugly, larger and potentially confusing. Another popular trick is to recode if-blocks as follows:

/* 1 */
if( condition )
    x = A;
else
    x = B;

recoded becomes

x = B;
if( condition )
    x = A;

Hopefully the need for such awkward constructions will be eliminated soon. I look forward to source code being an expression of ideas, abstracted away from machine dependent trickery.

IU

The integer unit does what you expect: arithmetic, logical, and bit-field operations on integers. It contains the general-purpose register file (GPR), and the integer exception register (XER). There are thirty-two GPRs, each 32 bits wide on the 601. Each handles either data manipulation or address calculation. They are dual-ported (as are the FPRs) to allow two independent accesses at once. The XER holds result flags such as carry and overflow from arithmetic operations. The IU handles address calculation for all execution units. All load and store requests (even for floating-point operands) are processed by the IU and passed from there to the memory management unit (MMU). The IU implements feed-forwarding, simultaneously making available the result of an integer execute stage to both the register writeback bus, and the execute stage of any follow-on integer instruction waiting for that result.

FPU


The floating-point unit contains the floating-point register file (FPR), and the status and control register (FPSCR). The thirty-two 64-bit registers handle either single or double precision operations. Only a subset of operations are handled in hardware, as with the 68040, the others must be emulated. Of interest, though, is the combined multiply-add instruction, which directly supports the vector and matrix algebra needed in common graphics transformations. It is also well suited to series expansions, speeding software emulation of transcendental functions. The FPSCR holds calculation result flags such as overflow, NaN, INF, etc., and environment controls, such as rounding direction. At this time, the FPU does not support feed-forwarding.
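
The fit between multiply-add and series expansions shows up in Horner's rule, where every step has the shape "accumulator times x plus next coefficient". The sketch below is plain C with a hypothetical function name; whether a compiler maps each step to a single multiply-add instruction depends on the compiler and its settings.

```c
#include <stddef.h>

/* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule.
 * Each loop iteration is one multiply followed by one add, the shape
 * a combined multiply-add instruction can execute directly. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i > 0; i--)
        acc = acc * x + c[i - 1];
    return acc;
}
```

A truncated Taylor series for a transcendental function can be evaluated this way by passing its coefficients in ascending order of power.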

MMU

The memory management unit handles the translation of logical to physical addresses, access privileges, memory protection and virtual memory paging. Performance is enhanced in this unit by the incorporation of several on-chip tables of recently used addresses, so that translation can be bypassed whenever possible. Of course load and store requests look to the cache first. Misses are queued in the memory unit for servicing. The MMU can address up to 4E9 (4 Gigabytes) of physical memory, and 4E15 (4 Petabytes) of virtual memory. Where to store all that is a separate issue (4 Petabytes = 6.2 million CD ROMs).

Memory Unit

This unit buffers data transfers between the cache and memory. It contains a two entry read and three entry write queue. Each entry is actually capable of holding eight consecutive words (a cache sector). Writing to memory is primarily performed to make room in the cache for new entries. The least recently used (LRU) entry in the cache is moved to the write queue, where it waits its turn for use of the system bus. Reads are performed mainly to load the cache. If not interdependent, waiting reads and writes may be performed out of order, according to priority. However, special instructions are available to strictly enforce program order of reads and writes when needed.


Cache

A 32-kByte (write-back) cache is provided to minimize time waiting for off-chip accesses. In the 601 it is a unified cache, holding both instructions and data. In the future, it will likely have a Harvard architecture, keeping instructions and data separate. That would simplify and speed up cache searching, and allow concurrent data and instruction accesses. An advantage of the unified design is that those nasty programs that modify their own code are given a chance of running successfully - making the PPC look even safer sometimes than a 68040. Of course, nobody you or I know writes self-modifying code, right? The cache is subdivided into 8 pages. Each page contains 64 cache lines. Each line is subdivided into 2 sectors (or blocks) of 8 32-bit words each. The block is the smallest cacheable unit. Cache sectors are read and written in a special burst mode. To further reduce processing delays when a load or fetch misses in the cache, the specifically requested words are feed-forwarded to the waiting execution unit before the remainder of the sector read completes.
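
The cache geometry given above multiplies out to the stated 32 kBytes, which is a useful sanity check. The constants below are taken from the paragraph; the names are chosen here for illustration.

```c
#include <stdint.h>

/* Cache geometry as described in the text for the 601. */
enum {
    CACHE_PAGES      = 8,
    LINES_PER_PAGE   = 64,
    SECTORS_PER_LINE = 2,
    WORDS_PER_SECTOR = 8,
    BYTES_PER_WORD   = 4   /* 32-bit words */
};

/* Size of one sector (block), the smallest cacheable unit. */
uint32_t sector_bytes(void)
{
    return WORDS_PER_SECTOR * BYTES_PER_WORD;
}

/* Size of one cache line (two sectors). */
uint32_t line_bytes(void)
{
    return SECTORS_PER_LINE * sector_bytes();
}

/* Total cache capacity in bytes. */
uint32_t cache_bytes(void)
{
    return CACHE_PAGES * LINES_PER_PAGE * line_bytes();
}
```

A sector comes to 32 bytes, a line to 64 bytes, and the whole cache to 8 x 64 x 64 = 32768 bytes.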

Pipelining

The second major factor in keeping the CPU busy is pipelining, which refers to the scheme of dividing the processing of an instruction into several independent serial stages - similar to a factory assembly line. Each execution unit is pipelined, but has a different number of stages. More stages allow breaking a process into simpler steps. The BPU, whose task is already simple, has a combined decode/execute stage. The FPU, having the most complicated operations to perform, has two execute stages. The term superpipelined is often used to characterize pipelines having more than four or five stages. For simplicity, we focus our discussion on the IU, whose four stages typify basic pipeline design. These stages are:

fetch/dispatch - get the instruction from the stream into the execution unit's decoder,
decode - figure out what the instruction does and initiate requests for any needed operands,
execute - apply indicated operation,
writeback - update target register(s) with result.

Any stage can work on only one instruction at a time. However, the IU as a whole can process four instructions at the same time - as long as its pipeline is kept full. Looked at another way, instructions complete at the rate of one per cycle with a full pipeline - that's fast. Sometimes it doesn't work out because of poorly separated dependencies. An example might be an executing instruction needing an operand, as yet unavailable from some previous instruction. The resulting inactivity is called a stall. During a stall, the stage remains occupied by the waiting instruction, but is essentially idle. When things pick up again, that stage will again be active. Now some silly nit-picking. Looking back at the history of what happened, we see that a certain stage was doing something, then stalled, then doing something again. A stall in this context is officially called a bubble (I would suggest we call poor code with too many bubbles, foamy - you heard it here first).
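
The "one instruction per cycle with a full pipeline" claim can be put into numbers with a simple idealized model: the first instruction needs one cycle per stage, every later one completes a cycle after its predecessor, and each bubble adds a cycle. The functions below are a sketch under those assumptions (one-cycle stages, stalls given as a total), not a model of the actual 601 timing.

```c
/* Cycles to complete n instructions on a pipeline with `stages`
 * one-cycle stages and `stall_cycles` bubbles in total. */
unsigned long pipelined_cycles(unsigned long n, unsigned stages,
                               unsigned long stall_cycles)
{
    if (n == 0)
        return 0;
    return stages + (n - 1) + stall_cycles;
}

/* Cycles when each instruction must finish all stages before the
 * next one starts. */
unsigned long unpipelined_cycles(unsigned long n, unsigned stages)
{
    return n * stages;
}
```

For 100 instructions on the four-stage IU this gives 103 cycles pipelined against 400 unpipelined, and every bubble pushes the pipelined figure up by one.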

There is still more to the speed story. What has not yet been discussed is perhaps the single most important design feature of RISC - that instructions are a uniform size, and wherever possible, spend a uniform amount of time in each pipeline stage. I think you can easily convince yourself that if one or more stages took longer than the others, the slow ones would be bottlenecks - the pipeline could not march along in lock-step even at its theoretical best.


The constant size of instruction words is 32 bits for all PPC implementations. This affords uniform fetch and decode times - in fact, one cycle each. Writeback, another data movement operation, is also one cycle. Remember that we are writing to GPRs and FPRs at writeback, not memory. (Writing to memory, really the cache, only occurs via explicit store instructions as we'll soon discuss).

Execute is a little trickier since there are so many different things one might like to do to the data. How can you ensure that every operation takes the one cycle we seem to have established? Aha! Therein lies the power of the reduced instruction set. Most integer operations do take one cycle to execute - the principal exceptions are multiplies and divides. That's because the instructions are simpler than on typical CISC machines. In fairness, the bit-field operations owe their speediness to a TurboShift unit (I'm guessing about the name). Complex operations have to be constructed from a sequence of the available simple ones, but that's O.K., because performance benefits overall. About those multiply and divide operations - avoid them if you can. They are inherently expensive (36 cycle executes for divides), but that's nature's way as far as we know. Then again, stalling is not the end of the world either - it merely means you're not operating at absolute maximum throughput. If your task requires divisions that cannot be accommodated by right-shifting (properly a compiler responsibility), then you have license to divide away.
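
What "accommodated by right-shifting" means can be sketched for unsigned values: when the divisor is a power of two, the quotient is the dividend shifted right by the base-2 logarithm of the divisor. The helper names below are hypothetical, and an optimizing compiler normally performs this substitution for constant divisors on its own.

```c
#include <stdint.h>

/* Returns 1 and stores log2(d) in *shift when d is a power of two,
 * otherwise returns 0 and leaves *shift untouched. */
int power_of_two_shift(uint32_t d, unsigned *shift)
{
    if (d == 0 || (d & (d - 1)) != 0)
        return 0;
    unsigned s = 0;
    while ((d >> s) != 1u)
        s++;
    *shift = s;
    return 1;
}

/* Unsigned divide that uses a right shift when the divisor allows it.
 * The caller must make sure d is not zero. */
uint32_t divide_u32(uint32_t x, uint32_t d)
{
    unsigned s;
    if (power_of_two_shift(d, &s))
        return x >> s;
    return x / d;
}
```

The equivalence holds for unsigned operands; for negative signed values a right shift and a division round differently, so the substitution needs extra care there.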

We've now seen some of the mechanisms that contribute to the goal of ever filled pipelines: fixed-time pipeline stages, parallel instruction issue to multiple pipelines, branch folding, feed-forwarding and generous use of caches, queues, buffers and large register files. I remind you, the compiler is supposed to be aware of all this, bearing the awesome responsibility of reducing dependencies by reordering instructions and making a clever mix of floating-point, integer, branch and load/store operations - while not disrupting program logic. In light of this, if your compiler is doing its job to the utmost, you should be getting headaches figuring out low level code generated with optimizers on. Conversely, you will probably want to turn off optimizations for debugging.

Note that it's important to know which execution units are involved with each instruction to do optimization effectively. This is made much more difficult considering that the 603, for example, already breaks the IU up into three new units: a smaller IU, a dedicated load/store unit and a special-purpose register access unit. As the PowerPC family matures, the design will continue to evolve. Keeping up with it will be very difficult. On the down side, it will no doubt require a certain amount of lowest-common-denominator compromising to stay compatible with as many PowerPC family members as possible.

Finally, we have seen that an individual execution unit, working at maximum throughput can process instructions at a rate of one per cycle. In conjunction with the parallelism and other performance enhancements throughout the system, the achievable rate for application code as a whole is potentially faster than that.

It's incredibly cool to Geeks (informed folks) like you and me. So why isn't everybody rushing to buy a Power Mac? I often play dummy first-time buyer at superstores to see what salespeople will try to sell me. They're not pushing Power Macs - not even to first-time buyers. At the time of this writing, there are only fifty or so native applications. Even though the emulator can run existing Mac software at speeds fully adequate for most users, magazine reviewers and computer salespeople are insisting that the user won't be happy without the native versions. Nobody knows, when asked, when these applications can be expected. In spite of the work Apple did on compatibility, on emulator performance and on keeping it Macintosh all the way, the message is entirely garbled and misunderstood when it gets to the marketplace. To the user, it is not a machine for today with an even greater future - it is just a machine without software. We've all got to get on the stick and produce some of that anxiously awaited software, so that 1994 won't be like 1984.

User Programming Model

The foregoing material on the inner workings of the machine is, I hope, informative and interesting, but now we will get reacquainted with the CPU from the point of view of actual programmers. What one has to know for this purpose is far simpler. Again, we are focusing on integer programming. The figure illustrates the subset of the hardware with which you interact directly - your interface to the machine.

While there are many more registers than shown, these are all that are necessary for most purposes. In particular, we acknowledge, here only, the existence of the MQ (Multiply/Quotient) register, which is used by many 601 instructions to hold intermediate or extended results. The 601 is a transitional chip - the first PowerPC implementation, and a bridge from IBM's POWER architecture, from which PowerPC is largely derived. The MQ register, and all instructions which depend upon it, were retained from POWER in the 601 to give IBM developers time to make the transition to PowerPC while maintaining a high degree of compatibility with existing POWER software. These things are not part of the PowerPC definition, and will likely be dropped from subsequent implementations. We will speak of MQ no more.

The GPRs are 32 bits. Subsequent models may have 64-bit registers. Bit labeling, in general, is just the opposite of 68K conventions. The most significant bit is number 0, the least significant is 31. Low (least) is depicted as being to the right in both worlds. Left and right come into play in connection with shift operations. As ever, right shifts divide.
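The reversed bit labeling is easy to trip over when reading masks, so a small sketch may help. The helper names below (ppc_bit_mask, flip_bit_number) are made up for illustration and are not part of any Apple or IBM interface.

```c
#include <stdint.h>

/* Mask for a bit given in PowerPC numbering:
   bit 0 is the most significant bit, bit 31 the least significant. */
uint32_t ppc_bit_mask(unsigned ppc_bit)
{
    return UINT32_C(0x80000000) >> ppc_bit;
}

/* Convert a 32-bit register bit number between PowerPC and 68K
   conventions. The mapping is its own inverse. */
unsigned flip_bit_number(unsigned bit)
{
    return 31u - bit;
}
```

So PowerPC bit 0 corresponds to 68K bit 31, and the mask for PowerPC bit 31 is simply 1.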

The 32 general-purpose registers all function identically. Each can be used for data or address calculation. Everything interesting happens here. Interaction with memory happens only through explicit load and store operations. This is common on RISC machines. They are said to have a load/store architecture.

We'll cover more about branching later, but for now, the link register typically holds the target address of a branch (and then an optional return address). The count register is dual purpose, holding a value to be decremented for conditional branching, similarly to the DBcc (68K) construction. It can also serve as an alternate branch target address.

The exception register (XER) is divided into several fields. Of interest for us are just the high 3 bits shown. The CA bit records carries out of bit 0 of a result during arithmetic operations. It is also the bit used for extended (multi-word) arithmetic. OV records arithmetic overflow, i.e., the result was too large to be represented with so many bits. SO records 'summary overflow' - it is set whenever OV is set, but unlike OV it is sticky: once set, it stays set until explicitly cleared, so it can report that an overflow occurred somewhere in a sequence of operations. In any case, the SO bit of the XER is the one copied to the CR to reflect overflow there.
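To make the roles of CA and OV concrete, here is a rough C model of a 32-bit add that reports both conditions, plus a two-word add that chains the carry as extended arithmetic does. The type and function names (AddResult, add_with_flags, add64) are illustrative assumptions; this models the arithmetic only, not the actual 601 instructions or their encodings.

```c
#include <stdint.h>

typedef struct {
    uint32_t sum;
    int ca;   /* carry out of the most significant bit (PowerPC bit 0) */
    int ov;   /* signed overflow: result does not fit in 32 bits */
} AddResult;

AddResult add_with_flags(uint32_t a, uint32_t b, int carry_in)
{
    AddResult r;
    uint64_t wide = (uint64_t)a + b + (carry_in ? 1u : 0u);

    r.sum = (uint32_t)wide;
    r.ca = (int)(wide >> 32);
    /* Overflow occurs when both operands have the same sign
       and the sign of the result differs from it. */
    r.ov = (int)((~(a ^ b) & (a ^ r.sum)) >> 31);
    return r;
}

/* 64-bit add built from two 32-bit adds; index 0 is the high word. */
void add64(const uint32_t a[2], const uint32_t b[2], uint32_t out[2])
{
    AddResult low = add_with_flags(a[1], b[1], 0);
    AddResult high = add_with_flags(a[0], b[0], low.ca);

    out[1] = low.sum;
    out[0] = high.sum;
}
```

Adding 0xFFFFFFFF and 1 produces a carry but no signed overflow, while adding 0x7FFFFFFF and 1 produces signed overflow but no carry, which is why the two conditions are recorded separately.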

The CR is actually a set of eight functionally identical condition fields (cr0,...cr7), each 4 bits wide. Only cr0 is shown in the figure. Each can be individually targeted to hold the result of explicit compare instructions (you will rarely see any but cr0 used by your compiler). The cr0 and cr1 fields are different, however, in that when instructions other than compares are encoded for 'CR update,' cr0 is the implied target for integer operations, and cr1 is implied for floating-point (otherwise they work like the others). Bit 0 (or 4, 8, etc.) records the effect of comparing a result against zero: it is set if the result is negative (high bit is 1). Similar logic applies for bits 1 and 2, which are set for positive and zero results, respectively. Bit 3 holds SO, again copied from the XER. These bits are enough to characterize any signed or unsigned arithmetic or logical result.
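As a sketch of how an integer result maps onto one of these 4-bit fields, the following C function builds the field value for a signed result compared against zero. The constant and function names are assumptions chosen for this example; bit 0 of the field is treated as its most significant bit, in keeping with PowerPC numbering.

```c
#include <stdint.h>

/* Bit 0 of a CR field is its most significant bit, so it has value 8. */
enum { CR_LT = 8, CR_GT = 4, CR_EQ = 2, CR_SO = 1 };

/* Build the 4-bit field for a signed result compared against zero.
   'so' stands in for the summary overflow bit copied from the XER. */
unsigned cr_field_for_result(int32_t result, int so)
{
    unsigned field;

    if (result < 0)
        field = CR_LT;
    else if (result > 0)
        field = CR_GT;
    else
        field = CR_EQ;

    if (so)
        field |= CR_SO;
    return field;
}
```

Exactly one of the first three bits is set for any result, which is what lets a conditional branch test a single bit.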

Data Types and Alignment

The table lists the intrinsic addressable data types on the 601, and how their names differ from 68K conventions.

Size (bytes)   Last 4 bits of address   601 Type Name   68K Type Name
1              xxxx                     byte            byte
2              xxx0                     half-word       word
4              xx00                     word            long-word
8              x000                     double-word     NA
16             0000                     quad-word       NA

An extensive set of bit-field operations is available as well, acting on GPRs only, not memory.

Proper data alignment is something to be conscious of and design for on all PPC implementations (it speeds up 68030 and 68040 code as well). Each type has a natural address at which it should reside. This address is an integral multiple of the type's size. As the table shows, for example, words should reside at addresses divisible by 4, and so have two zeros for the last two bits of their addresses. The reason is mainly that the machine can access aligned data faster. Misaligned data may require extra work to calculate and execute a sequence of bus cycles that will map onto data crossing their natural boundaries.
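The "integral multiple of the type's size" rule reduces to a mask test on the low address bits. A minimal sketch, assuming the size is one of the powers of two from the table (the function name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if 'address' is a natural address for an object of 'size'
   bytes. 'size' must be a power of two (1, 2, 4, 8 or 16). */
int is_naturally_aligned(uintptr_t address, size_t size)
{
    return (address & (size - 1)) == 0;
}
```

For example, address 0x1006 is a natural address for a half-word but not for a word, because its last two bits are not both zero.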

Aligning your data should be done in three places: global data (together with static data, which are stored as if global), local data, and structure definitions. For global data, definition is where storage is allocated, not where data are merely declared external. Achieving alignment is easy - just a matter of discipline. You would apply the same rules in any of the three areas. As an example, let's consider structures. Start with the assumption that the top of the structure is given to reside at a 4-byte boundary (xx00), such as x000. Lay out its fields such that each starts at a natural address for the field's type, as given in the table. Add padding if necessary. Consider a structure containing 2 longs, 1 short, and 1 char.

    struct Misaligned {
        long  L1;
        char  C;
        short S;
        long  L2;
    };

    struct Aligned {
        long  L1;
        char  C;
        char  reserved;  // padding
        short S;
        long  L2;
    };

Both structure examples start at address x000. Note that the misaligned structure's S is at address x101, and L2 is at x111 (addresses are given by their last four bits, as in the table). All that was needed to correct this, and get all fields onto proper boundaries, was a padding byte, as indicated.
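The field addresses quoted above can be checked with a small layout calculator. This is only a sketch: it assumes the article's sizes (4-byte long, 2-byte short, 1-byte char) and, in packed mode, a compiler that places fields back to back with no padding of its own. The function name layout_fields is invented for this example, and many current compilers insert such padding automatically.

```c
#include <stddef.h>

/* Compute the offset of each field from the start of a structure.
   sizes[]  : size in bytes of each field, in declaration order
   offsets[]: receives the offset of each field
   natural_alignment: if nonzero, each field is first moved up to a
                      multiple of its own size; if zero, fields are
                      packed back to back.
   Returns the total size occupied by the fields. */
size_t layout_fields(const size_t *sizes, size_t count,
                     size_t *offsets, int natural_alignment)
{
    size_t offset = 0;

    for (size_t i = 0; i < count; i++) {
        if (natural_alignment && sizes[i] > 0) {
            size_t rem = offset % sizes[i];
            if (rem != 0)
                offset += sizes[i] - rem;
        }
        offsets[i] = offset;
        offset += sizes[i];
    }
    return offset;
}
```

Packing the sizes 4, 1, 2, 4 puts S at offset 5 (binary 0101) and L2 at offset 7 (binary 0111). With the reserved byte added (4, 1, 1, 2, 4), packing alone already yields offsets 0, 4, 5, 6 and 8, and a total of 12 bytes.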

You'll make things easiest for yourself by aligning each structure as an isolated entity. Make sure each structure has aligned fields, but is also a multiple of 4 bytes in total length. This strategy ensures that alignment is preserved when an array of such structures is allocated, or if the structure becomes an embedded substructure, or is defined globally or locally. You're better off designing so that you can ignore how the structure will be used by you or others. Make them good citizens.

For local data, one does the same thing, assuming that the first variable defined in any function starts at xx00.

Global data is most easily considered on a per-file basis - like the per-structure rule. Again, assume address xx00 for the top of each file.

Something has been overlooked. What about the validity of the assumption of 4-byte starting addresses? In the case of dynamically allocated memory (pointers or handles), the Mac will always hand you aligned blocks (this has been true since the 68030). For data local to a function, the stack frame is always set up so that your locals start on a 4-byte boundary (PPC run-time model specification). For globals, you have to be more careful. All of the global (and static) data for an executable module (application, INIT, etc.) are collected together (across all files) and stored as one giant block. In fact, it is loaded into memory as a heap block, so has an aligned starting address. However, to maintain alignment throughout the interior of this block, you have to be diligent about keeping the data for each file nicely aligned, and a multiple of four - similar to what was said for structures. Designing each file to have properly aligned globals when considered in isolation ensures portability and reusability of files for other applications.

To wrap up for this month, I'll briefly mention the ordering of bytes in memory. The 601 is capable of operating in either big-endian or little-endian memory mode (register function is independent of these). I have debated with myself about how much to say on this complicated topic. The real reason to say anything at all is to assure those who are wondering, that Apple has chosen big-endian ordering for the run-time model's default mode. That means that, in memory, as you read a long for example, you find its most significant byte at the lowest address, and its least significant at the highest - just the same as on 68K machines. That's what you need to know, that there's nothing to worry about here.
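For readers who like to see it spelled out, big-endian ordering can be expressed in C without depending on the byte order of the machine running the code. The helper names load_be32 and store_be32 are illustrative only.

```c
#include <stdint.h>

/* Read a 32-bit value stored big-endian:
   most significant byte at the lowest address. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Write a 32-bit value in big-endian order. */
void store_be32(unsigned char *p, uint32_t value)
{
    p[0] = (unsigned char)(value >> 24);
    p[1] = (unsigned char)(value >> 16);
    p[2] = (unsigned char)(value >> 8);
    p[3] = (unsigned char)value;
}
```

Storing 0x12345678 this way leaves the bytes 12 34 56 78 at ascending addresses, which is what both the 68K and the default PowerPC run-time model expect.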

If you're interested though, true little-endian storage is what's used on Intel machines. Let's compare the same document on floppies, as produced by a Mac version of some application and by its faster-selling DOS counterpart. The mapping difference is simple to explain. Strings look the same on either disk - they start at the same addresses, and the bytes are in the same order. However, all integers larger than 1 byte look like their bytes are reversed on the Intel disk - they have little-endian ordering. Basically, to transfer a floppy from one machine to the other, one has to byte-reverse all the numbers. Floating-point formats are just too dissimilar to worry about, so forget that. Now, the 601 can operate in a pseudo little-endian format. On disk, it looks neither like true big nor true little-endian. Why? Without going into too much detail, the 601 can make memory appear to the processor as true little-endian by playing with the addresses of load/stores, but without reversing any bytes. The result is a fast, simulated little-endian world, but it's not true little-endian in memory - numbers do not have reversed bytes, but their starting addresses are changed. It's not the case that the in-memory data are instantly ready for exchange with real PCs. However, this scheme helps make the 601 ready to speedily emulate a PC. Getting full data compatibility still requires moving fields and explicit byte-reversal during I/O - already slow, so less noticeable.
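The explicit byte-reversal mentioned above, needed when exchanging integers with a true little-endian machine, can be sketched as follows. The function names are assumptions for this example, and the code deliberately covers only the reversal of 16-bit and 32-bit integers, not the 601's address-modification scheme.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit integer. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) |
           ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) |
           (v << 24);
}

/* Reverse the byte order of a 16-bit integer. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Applying either function twice returns the original value, so the same routine serves for both directions of the transfer. Strings need no such treatment, since their bytes are already in the same order on both machines.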

What's Next?

OK, you know all about the environment and operands of the 601. Now we'll learn its lingo. Next month we'll go into the details of reading and understanding the assembly language, and get you really prepared to do some in-depth debugging.

In the meantime, no matter how deep you plan to go, you should read the PowerPC 601 User's Manual. It's an essential resource, and it's available from APDA with their Macintosh with PowerPC Starter Kit. This package is reasonably priced at $39.95. It also includes the New Inside Macintosh volume PowerPC System Software, which explains the run-time model, the Mixed Mode Manager and the Code Fragment Manager - all crucial for high level language PowerPC development, as well as really understanding what's going on in your debugger.