Mapping SMAC Algorithm onto GPU
Student: Zhengjie Lu
Supervisor: Dr. Ir. Bart Mesman
Ir. Yifan He
Prof. Dr. Ir. Richard Kleihorst
Contents
1. Background
   1.1 SMAC algorithm
   1.2 GPU programming
2. Implementation
   2.1 General Structure
   2.2 SMAC on CPU
   2.3 SMAC on GPU
3. Experiment
   3.1 Experiment Environment
   3.2 Experiment Setup
   3.3 Experiment Result
      3.3.1 GPU improvement
      3.3.2 Linear execution-time model
4. Roofline Model Analysis
   4.1 Roofline Model
   4.2 Application
5. Conclusion
Acknowledgement
Appendix
1. Background
1.1 SMAC algorithm
SMAC is short for the "Simplified Method for Atmospheric Correction", which is used to compute the atmospheric correction of satellite measurements in the solar spectrum. It is popular in remote sensing applications because it is several hundred times faster than more detailed radiative transfer models such as 5S [3]. Figure 1.1 shows a black-box model of SMAC: 9 input parameters are taken in, and a single output is generated.
[Figure: SMAC black-box model. Inputs: sza, sva, vza, vva, taup550, uh2o, uo3, airPressure, r_toa (all float). Output: r_surfRecycle (float).]
Fig1.1 SMAC black box model
SMAC is computationally fast among its peers, but it still takes a considerable amount of time to process the large data sets common in remote sensing applications. Figure 1.2 gives the profile of SMAC when processing 231240 bytes of data on the CPU. The file I/O time is dominant in this case (about 75%), while the CPU computation time is also significant (about 25%). Since the file I/O performance can be improved by introducing faster hard disks, the CPU computation will eventually become the bottleneck. This motivates us to map SMAC onto a commercial GPU (i.e. an NVIDIA graphics card) and see how much computational performance improvement we can achieve.
Fig1.2 Profiles of the original SMAC program
1.2 GPU programming
GPU programming was introduced into the field of scientific computation after its success in accelerating computer graphics processing. The hardware advantage of many parallel processing cores per GPU (normally at least 32) makes it capable of dealing with massive data. The disadvantages of GPU programming are that: (1) programmers are required to have knowledge of the hardware (especially the memory access pattern) in order to use the GPU efficiently, and (2) the pipeline penalty is dramatic when branch predictions miss. Normally, an efficient program is organized such that the GPU is responsible for the massive mathematical computation while the CPU takes charge of logical operations and control.
NVIDIA develops a GPU programming technology named "Compute Unified Device Architecture" (CUDA) for its own graphics card products, and it is the most popular technology in state-of-the-art GPU programming. The CUDA-capable GPUs are listed on NVIDIA's official website [1]; they cover the latest NVIDIA products with more than 100 cores per GPU. In this report, we will refer to CUDA programming as GPU programming for NVIDIA graphics cards.
It is essential to explain the NVIDIA GPU hardware architecture before we discuss CUDA programming, because CUDA programming is in essence a collection of rules for operating the hardware in the most efficient way. Figure 1.3 shows an overview of the NVIDIA GeForce 8800GT GPU. Every 8 stream processors (SPs) are organized into a stream multiprocessor (SM), and 14 SMs constitute the main body of the GPU. Inside each SM there are 8192 registers and a shared memory of 16384 bytes. The shared memory is used for local communication among the 8 SPs. A global memory is connected to all SMs for global communication. It should be pointed out that access to the global memory is rather slow while access to the shared memory is fast. This indicates that we should use the shared memory much more than the global memory to achieve better performance.
Fig1.3 NVIDIA 8800GT architecture
The basic concept in CUDA programming is single-instruction-multiple-threads (SIMT), which means that all active threads perform an identical instruction at each execution step [2]. Each active thread is assigned to a unique SP so that physically parallel threading is achieved. Each thread also has its own registers to keep its status, and threads can communicate with each other through the shared memory.
The second concept in CUDA programming is the thread and block. Each block consists of multiple threads, as shown in figure 1.4, and the number of threads per block is limited by the physically available number of registers and the shared memory size. Each block is assigned to an SM, inside which 8 SPs are integrated. A single block can only be assigned to a single SM, while a single SM can hold many blocks.
Fig1.4 Block and threads Fig1.5 Stream Execution
The third concept in CUDA programming is the warp, the basic thread-scheduling unit. Every 32 threads in a block are organized as a warp and then assigned together to the block's corresponding SM by the scheduler. If the number of threads in a block is not a multiple of 32, dummy threads are appended to the block to round its thread count up to a multiple of 32.
The fourth concept in CUDA programming is concurrent copy and execution, the so-called "stream" execution. The input data is broken down into several segments of the same length. These segments (so-called data streams) are then transferred from the CPU memory to the GPU memory one by one. A data stream can be processed by the GPU kernel as soon as it has been transferred completely, without waiting for the completion of the other data-stream transfers. An explanation of the stream execution is given in figure 1.5.
The fifth concept in CUDA programming is memory-access coalescence: accesses to the global memory should be organized in terms of aligned segments of 16 or 32 words, and the addressing pattern must likewise be aligned to 16- or 32-word boundaries.
In short, the program has to be mapped onto the NVIDIA GPU hardware according to these CUDA rules.
2. Implementation
2.1 General Structure
The SMAC algorithm consists of 14 steps, as shown in figure 2.1. The first step, a "data filter", is a conditional branch through which only valid data is passed on to the later computations. Each computation relies only on the results of the previous ones, as shown in the data dependency graph in figure 2.2. All computations in SMAC are arithmetic calculations, including trigonometric functions, exponential functions, etc. Several computations also contain if-else conditions, and these branches are replaced with equivalent logical expressions when they are mapped onto the GPU. The implementation of SMAC on the CPU is programmed in C++, while the one on the GPU is programmed in C++ and C.
[Figure: flow chart of the SMAC kernel: Start -> parameters setup -> "valid vector?" check; valid vectors pass through calculation steps 1-12, invalid ones are discarded.]
Fig.2.1 Overview of SMAC kernel
[Figure: data dependency graph of SMAC, connecting the parameters setup and calculation steps 1 through 12.]
Fig.2.2 Data dependency Graph of SMAC
2.2 SMAC on CPU
A single thread is employed as the execution model of SMAC on the CPU, in which the SMAC kernel reads through all the input and then generates the final results. One input vector consisting of 9 floating-point numbers produces exactly one output consisting of 1 floating-point number. No data dependencies exist between different input vectors, nor between the outputs. The complete execution model is shown in figure 2.3.
The data flow in this case is quite simple, as shown in figure 2.4. Both the input data and the coefficients are read from files on the hard disk into the CPU memory. The CPU then takes them into its registers and writes the final results back to the CPU memory.
[Figure: execution model of SMAC on CPU: the single SMAC kernel consumes input vectors vector[0]..vector[n-1], each holding the 9 input floats (sza, sva, vza, vva, taup550, uh2o, uo3, airPressure, r_toa), and produces the outputs r_surfRecycle[0]..r_surfRecycle[n-1].]
Fig.2.3 Execution model of SMAC on CPU
Fig.2.4 Data flow of SMAC on CPU
The original SMAC program employs 3 classes: (1) the SmacAlgorithm class, (2) the Coefficients class and (3) the CoefficientsFile class. The SmacAlgorithm class functions as the kernel in which the SMAC algorithm is fully implemented, while the other two manage access to the coefficients file. An additional SimData class is now included to manage the input and output data. The relations among the classes are shown in figure 2.5.
Because the validity of the input data can be identified as soon as it is read in, the "data filter" of the SMAC kernel can instead be performed immediately in the SimData class. This both saves memory and reduces the processing time in the kernel. It also lets the GPU get rid of the conditional branch when SMAC is mapped onto it.
The flow chart in figure 2.6 explains the implementation procedure. The coefficients and the input data are read first and then passed to the SmacAlgorithm instance for computation. The computational results are collected by the SimData instance, which is the output parameter of the SmacAlgorithm instance.
[Figure: program structure of SMAC on CPU: the SmacAlgorithm, Coefficients, CoefficientsFile and SimData classes, covering algorithm execution, file operation and the I/O interface to the satellite-sensor coefficients file, the satellite data file and the earth-surface reflectance file.]
[Figure: flow chart of SMAC on CPU: ENTRY -> Coefficients::setCoefficients() -> SimData::readData() -> SmacAlgthm::SmacAlgthm() -> SmacAlgthm::run() -> EXIT.]
Fig.2.5 Program structure of SMAC on CPU Fig.2.6 Flow chart of SMAC on CPU
2.3 SMAC on GPU
The execution model of SMAC on the GPU benefits from multiple threads. Each GPU thread is an instance of the SMAC kernel, so multiple input vectors can be processed simultaneously, as shown in figure 2.7. This is the main benefit of employing the GPU.
[Figure: execution model of SMAC on GPU: SMAC kernel instances 0, 1, 2, ... run in parallel on the GPU, each consuming one input vector of 9 floats and producing one element of the r_surfRecycle output.]
Fig.2.7 Execution model of SMAC on GPU
The data flow of SMAC on the GPU is quite different from the one on the CPU, as shown in figure 2.8. The input data needs to be transferred from the CPU memory to the GPU memory, and the results are copied back in the reverse direction. The constant memory of the GPU is employed as a cache for the frequently used coefficients, accessible from all SMs.
[Figure: data flow of SMAC on GPU: coefficients[] and input_data[] are read from the hard disk into the CPU memory; a second copy of input_data[] and the output_data[] live in the GPU global memory, a second copy of coefficients[] in the GPU constant memory, and the GPU computes in its registers.]
Fig.2.8 Data flow of SMAC on GPU
The program structure of SMAC on the GPU is based on the one on the CPU, with the SMAC kernel mapped directly onto the GPU. Two more GPU-related modules are introduced into the program structure, as shown in figure 2.9. The module "GPU_kernel.cu" implements the SMAC kernel that is executed on the GPU, while "GPU.cu" controls the GPU memory operations and the kernel execution.
Fig.2.9 Program structure of SMAC on GPU
The flow chart of SMAC on the GPU is also similar to the one on the CPU, except for the GPU-related function calls, as shown in figure 2.10. An obvious change is that the input data is transferred from the CPU memory to the GPU memory before the SMAC kernel is executed on the GPU, and the output data is copied back from the GPU memory to the CPU memory afterwards. Besides that, the input data has to be re-organized into a sequence that allows coalesced access to the GPU memory.
[Figure: flow chart of SMAC on GPU: the CPU flow of figure 2.6 extended with reorganizeInput(), cudaMemcpyAsync() transfers to and from the GPU, the GPU_kernel<<<>>>() launch, and reorganizeOutput().]
Fig.2.10 Flow chart of SMAC on GPU
As introduced in section 1.2, a few more aspects of GPU programming need attention: threads, blocks and streams. The profiling tool CUDAPROF tells us that 59 registers are needed per SMAC kernel thread. With this register requirement, the optimization tool CUDACALCULATOR indicates that a maximum of 192 threads per block can be achieved.
Once a block with 192 threads is executing on an SM, no other block can be assigned to the same SM before the working block finishes all its executions. Since there are 4 SMs inside the GPU in our experiment, 4 blocks are enough to fully utilize the GPU. Employing more blocks would probably introduce extra overhead from block switching, while employing fewer blocks would simply waste hardware resources.
The number of streams will be explored later in our experiment.
3. Experiment
3.1 Experiment Environment
We run the SMAC program on a laptop workstation equipped with both an Intel dual-core CPU and an NVIDIA GPU. The GPU is mounted on the laptop motherboard through the PCI express interface. The operating system on this machine is 32-bit Windows Vista Enterprise, with CUDA 2.2 support. The load on both the CPU and the GPU is low before our experiment. The detailed description of the experiment environment is listed below:
HARDWARE
CPU: Intel(R) Core(TM)2 Duo CPU T9300, 2.5 GHz x 2
GPU: NVIDIA Quadro FX570M, 0.95 GHz x 32
Main memory: 4 GB
Motherboard interface: PCI express 1.0 x 16
SOFTWARE
Operating system: Windows Vista Enterprise 32-bit
CUDA version: CUDA 2.2
GPU maximum registers per thread: 60
GPU thread number: 192 x 4 (#threads per block x #blocks)
CPU thread number: 1
Table 3.1 Experiment environment
[Figure: the flow chart of figure 2.6 with a "start timer" and a "stop timer" bracketing the algorithm instance, between SimData::readData()/SmacAlgthm::SmacAlgthm() and EXIT.]
Fig.3.1 Time profiling method
To profile the execution time of SMAC on either the CPU or the GPU, timers are attached at the two ends of the algorithm instance, as shown in figure 3.1. In our experiment we are concerned with the SMAC kernel performance, not the application performance, because the latter is dominated by the hard disk I/O speed, as already shown in figure 1.2, and can be overcome by employing a hard disk with a higher I/O speed.
3.2 Experiment Setup
As indicated in figure 3.1, the timing profile of the SMAC kernel is defined as the difference between the "start timer" and the "stop timer":

CPU time = CPU stop timer - CPU start timer
GPU time = GPU stop timer - GPU start timer

Now we define the performance improvement as:

Improvement = CPU time / GPU time

in which the linear execution-time model is employed:

CPU time = CPU overhead + Bytes x CPU speed

GPU time = GPU memory time + GPU run time
         = (GPU memory overhead + Bytes x GPU memory speed) + (GPU kernel overhead + Bytes x GPU kernel speed)
         = (GPU memory overhead + GPU kernel overhead) + Bytes x (GPU memory speed + GPU kernel speed)
         = GPU overhead + Bytes x GPU speed

The performance improvement can then be expressed as:

Improvement = (CPU overhead + Bytes x CPU speed) / (GPU overhead + Bytes x GPU speed)
            ≈ (Bytes x CPU speed) / (Bytes x GPU speed)
            = CPU speed / GPU speed

It should be pointed out that the last step only holds when the data size is dramatically large. From this formula it can be seen that the ultimate improvement depends only on the CPU and GPU speeds and has nothing to do with the data size. We will apply the linear execution-time model to predict the GPU performance later.
3.3 Experiment Result
3.3.1 GPU improvement
Table 3.2 and figure 3.2 record the performance improvement in our experiment. When the data size is small, the improvement curves in figure 3.2 decline as the number of streams increases: the data size is too small to cover the overhead gap. As we increase the data size, the slopes of the curves become larger and larger, because the growing data size amortizes the overhead. Eventually all the curves behave similarly, or even overlap, once the data size exceeds a certain threshold; increasing the data size or the stream number further does not help much.
Table 3.2 Performance improvement: CPU time/GPU time
Fig.3.2 GPU performance improvement: CPU time/GPU time
3.3.2 Linear execution-time model
In our earlier tests, the overhead and processing speed were obtained; they are recorded in table 3.3. The GPU overhead is relatively large, while the CPU overhead is too small to be measured. The reason for the significant GPU overhead is that the data needs to be re-organized and then transferred both before and after the GPU kernel execution. It is also obvious that the GPU speed is at least 10 times faster than the CPU one.

CPU time (ms) = CPU overhead (ms) + data size (byte) x CPU speed (ms/byte)
CPU overhead: 0
CPU speed: 5.39 x 10^-5

GPU time (ms) = GPU overhead (ms) + data size (byte) x GPU speed (ms/byte)
GPU overhead, 1-stream: 1.67
GPU speed, 1-stream: 2.41 x 10^-6
GPU overhead, 8-stream: 4.45
GPU speed, 8-stream: 2.01 x 10^-6
Table 3.3 Parameters of linear execution-time model
Both the predicted and the experimental improvement of the 1-stream GPU performance are plotted in figure 3.3. Some recognizable variance occurs between the two curves when the data size is small; when the data size is large the curves are almost identical. Finally, the curves reach a "ceiling" regardless of the data size.
Fig.3.3 Single stream GPU performance improvement
A comparison of the 8-stream GPU performance improvement is shown in figure 3.4, and it tells us a similar story. The performance improvement ultimately stays constant; no better improvement can be achieved even when we increase the data size.
Fig.3.4 8-stream GPU performance improvement
1-stream GPU
Data size (byte) | Experimental improvement | Predicted improvement
116640  |  5.71 |  3.218
233280  |  8.60 |  5.626
466560  | 12.83 |  8.989
933120  | 14.79 | 12.822
1866240 | 19.28 | 16.296
3732480 | 20.54 | 18.849

8-stream GPU
Data size (byte) | Experimental improvement | Predicted improvement
116640  |  1.99 |  1.343
233280  |  3.49 |  2.558
466560  |  6.18 |  4.671
933120  |  9.95 |  7.959
1866240 | 14.75 | 12.280
3732480 | 19.24 | 16.856
4. Roofline Model Analysis
4.1 Roofline Model
The Roofline model gives us an approximate insight into the performance bottleneck [4]. When the instruction density, measured in flops/byte, is low enough that the attainable performance lies below the peak performance, as shown in figure 4.1, the bottleneck is the data transfer and we should bring in more data per byte transferred. Otherwise the bottleneck is likely the computation itself, and we should reconsider the computational approach.
Fig.4.1 Example of Roofline model
4.2 Application
To apply the Roofline model to our case, the GPU hardware specifications and the profile of the SMAC kernel (note: not the SMAC application) have to be obtained before the analysis. They are listed in table 4.1.

Hardware: NVIDIA Quadro FX570M
PCI express bandwidth: 4 GB/sec
Peak performance: 91.2 GFlops/sec
Peak performance without FMAU: 30.4 GFlops/sec

Software: SMAC kernel on GPU
Data size: 59719680 Bytes
Issued instruction number: 4189335552 Flops
Execution time: 79.2 ms
Instruction density: 70.15 Flops/Byte
Instruction throughput: 52.8 GFlops/sec
Table 4.1 Parameters of Roofline model
The performance of SMAC on the GPU is now placed on the Roofline model, as the blue marker in figure 4.2 shows. It can be seen that the performance bottleneck is the computation, if we only consider the kernel execution on the GPU. To be more precise, it results from the imbalance between floating-point multiplications and additions. That is partially because SMAC is mapped directly onto the GPU and the data dependencies inside SMAC still exist; these dependencies limit the full utilization of the FMAUs (floating-point multiply-add units) in the GPU.
The other reason is that only 192 threads are employed per block, to satisfy the register budget. Introducing more threads would force the temporary operands to be stored in the local memory, which is quite slow to access. Keeping the thread number at 192 avoids using the local memory, but it also leaves some functional units idle.
The situation becomes worse if the I/O operations are included. The navy blue sloping line stands for the hard disk I/O bandwidth bottleneck, and it lies well below the GPU memory bandwidth limit. The instruction throughput in this case is below 1 GFlops/sec, so it is not plotted in figure 4.2. To summarize, the SMAC application is limited by the hard disk I/O.
Fig.4.2 Roofline model of SMAC on GPU
5. Conclusion and Future Work
SMAC, a popular algorithm in the remote sensing field, has been successfully mapped onto a commercial programmable GPU with the help of the CUDA technology. Since SMAC operates on large data streams, it can employ the "stream execution" technique to achieve a performance improvement. Our experimental results show that a performance speedup of up to about 20 times can be achieved by the GPU compared with the CPU. The linear execution-time model also proves useful in analyzing the GPU's stream execution.
Besides that, the Roofline model is employed to identify the bottleneck of SMAC on the GPU. The SMAC kernel on the GPU is dominated by the computational bottleneck, while the complete application is limited by the hard disk I/O bandwidth; we are only interested in the former in this report. Two main reasons cause the computational bottleneck of the SMAC kernel on the GPU: (1) the imbalance between floating-point multiplications and additions due to the data dependencies, and (2) the register pressure caused by the register requirement per thread. The first can possibly be relieved by decoupling the data dependencies inside the SMAC kernel, while the second can be addressed by turning to finer-grained threads that require fewer registers.
Fig.5.1 Diagram of GPU power measurement Fig.5.2 Physical layout of GPU power measurement
The power consumption of SMAC on the GPU also interests us for future work. Since no commercial power-measurement PCI-E cards are available on the market, a customized approach has to be used. Figure 5.1 shows the measurement principle, in which the 5 V and 12 V power supply lines in the 4-pin PCI-E interface are measured separately. A 0.03 Ω resistor with a 20 W rating is connected in the 5 V power supply line, and the current flow is calculated as the voltage across the resistor divided by its resistance. The power supplied through the 5 V line is then 5 V times the current through the resistor, and the same holds for the 12 V line. Finally the two contributions are summed to obtain the GPU power consumption. Figure 5.2 shows the physical connections of our planned experiment.
Acknowledgement
This assignment is my 3-month traineeship at the Technische Universiteit Eindhoven (TU/e), with a topic from VITO-TAP, the Flemish Institute for Technological Research NV. I have received a lot of support from the people around me. Dr. Ir. Bart Mesman, my supervisor at TU/e, spent quite a lot of time on my weekly reports and on verifying my ideas. Ir. Yifan He, a PhD candidate at TU/e and also my supervisor in this traineeship, guided me in the research methodology. Prof. Dr. Ir. Richard Kleihorst, my supervisor at VITO-TAP, kindly arranged the daily issues and the working environment at VITO-TAP. Prof. Dr. Ir. Henk Corporaal, my academic mentor at TU/e, gave me strong support during the 3 months. I should also thank Ir. Zhengyu Ye and Ir. Gert Jam, who gave me valuable advice on GPU programming and the Roofline model.
Appendix
[1] NVIDIA CUDA official website, http://www.nvidia.com/object/cuda_home.html, retrieved on July 20, 2009
[2] NVIDIA CUDA documentation, Chapter 4, "NVIDIA CUDA Programming Guide 2.2", February 4, 2009
[3] H. Rahman, G. Dedieu, "SMAC: A Simplified Method for the Atmospheric Correction of Satellite", December 5, 1993
[4] S. Williams, A. Waterman, D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures", April 2009
[5] S. Williams, D. Patterson, "The Roofline Model: A pedagogical tool for program analysis and optimization", retrieved on September 10, 2009
[6] Zhengyu Ye, "Design Space Exploration for GPU-Based Architecture", August 2009
[7] Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, John A. Stratton, and Wen-mei W. Hwu, "Program Optimization Space Pruning for a Multithreaded GPU", ACM 2008
[8] NVIDIA CUDA documentation, "NVIDIA_CUDA_BestPracticesGuide_2.3", retrieved on September 12, 2009
[9] Rob Farber, "CUDA, Supercomputing for the Masses", September 19, 2008
[10] X. Ma, M. Dong, L. Zhong, Z. Deng, "Statistical Power Consumption Analysis and Modeling for GPU-based Computing", retrieved on September, 2009
[11] S. Collange, D. Defour, A. Tisserand, "Power Consumption of GPUs from a Software Perspective", Proceedings of the 9th International Conference on Computational Science, 2009
[12] Analog Devices documentation, "Measuring temperatures on computer chips with speed and accuracy", April 1999
[13] Green Grid, "The Green Data Center: Energy-Efficient Computing in the 21st Century", retrieved on July 16, 2009
[14] Google, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines", retrieved on July 16, 2009
[15] Google, "The Case for Energy-Proportional Computing", retrieved on July 16, 2009
[16] http://en.wikipedia.org/wiki/PowerNow!, retrieved on July 16, 2009
[17] http://en.wikipedia.org/wiki/SpeedStep, retrieved on July 16, 2009
[18] SUN, "Sun's Throughput Servers: Paradigm Shift in Processor Design Drives Improved Business Value", retrieved on July 16, 2009
[19] IBM, "Storage Modeling for Power Estimation", retrieved on July 16, 2009
[20] http://en.wikipedia.org/wiki/Green_computing, retrieved on July 16, 2009
[21] Seagate, "2.5-Inch Enterprise Disc Drives: Key to Cutting Data Center Costs", retrieved on July 16, 2009
[22] Google, "Power-Aware Micro-architecture: Design and Modeling Challenges for Next-Generation Microprocessors", retrieved on July 16, 2009
[23] Google, "MapReduce: Simplified Data Processing on Large Clusters", retrieved on July 16, 2009
[24] Google, "Power Provisioning for a Warehouse-sized Computer", retrieved on July 16, 2009
[25] IBM, "IBM BladeCenter HS22 Technical Introduction", retrieved on July 16, 2009
[26] IBM, "IBM BladeCenter Products and Technology", retrieved on July 16, 2009
[27] http://www-03.ibm.com/systems/virtualization/, retrieved on July 16, 2009
[28] Green Grid, "Five Ways to Reduce Data Center Server Power Consumption", retrieved on July 16, 2009