Center for Information Services and High Performance Computing – TU Dresden

ICCS 2011 - Third Workshop on Emerging Parallel Architectures

Coarse Grained Parallelized Scientific Applications on a Cost Efficient Intel Atom Based Cluster

Robin Geyer, Andy Georgi, Wolfgang E. Nagel

2nd June 2011

INF 1038, Nöthnitzer Straße 46, 01189 Dresden

Phone: +49 0351 - 463 38781
E-Mail: [email protected]




Content of this Presentation I

1 Intro and Idea

2 Costs and Software

3 Measurements and Conclusion

4 Sources


ZIH @ TU Dresden

ZIH = Zentrum für Informationsdienste und Hochleistungsrechnen (Center for Information Services and High Performance Computing)

- Central scientific unit of the Technische Universität Dresden
- In-house service provider (campus network, data storage, backup, PC pools, web servers, email, ...)
- Research “institute” (HPC, parallel programming, computer architecture, energy efficiency)

Professorship in computer architecture at the faculty of computer science

Teaching, see Linux Cluster in Theory and Practice [lctp11]


Starting Point

We needed a very cheap cluster for a hands-on-lecture

- Maximum damage by screwdriver “impact”: $ 100
- Not much space needed
- Energy efficient
- For coarse-grained software (the lecture is basically about administration and farming software)
- Cheap scale-up (just needs more Ethernet ports)

These requirements overlap with the needs of (some) people doing practical research

Is it possible to build an Intel Atom based cluster that is suitable for real-world workloads?


Cluster Layout

[Figure, left panel: cluster layout. A head node (HDD, Intel Atom 330 / D510 with Core 1 and Core 2, eth0) and compute nodes 1 - 4 (each with HDD, Intel Atom 330 / D510 with Core 1 and Core 2, eth0) are connected through a switched 1GBaseT network. Labels in the diagram also include eth1, Storage, Processors and Energy Measured.]

Running Services:

- Parallel File System
- Network Services
  - DNS Server
  - DHCP Server
  - Time Server
- Batch System
- Monitoring Server

[Figure, right panel: the two Atom platforms. Diamondville: two Intel Atom cores, FSB, GMA 950 with memory controller, ICH7M. Pineview: two Intel Atom cores, FSB, GMA 3150 with memory controller, NM10 Express.]

Figure: Left: the cluster layout, a common design with a master node and the workers connected via Ethernet. Right: the two different types of the Atom platform used; except for the different north- and south-bridge chips they are identical.


Technical Details of Systems under Test

                        Atom 330     Atom D510    Sun Fire X4170 (Nehalem X5560)

Number of Cores         2            2            8
Simult. Multithread.    2/Core       2/Core       2/Core

L1 Cache per Core       24/32 KiB    24/32 KiB    32/32 KiB
L2 Cache per Core       512 KiB      512 KiB      256 KiB∗

TDP                     8 W          13 W         95 W
Clock Speed             1.6 GHz      1.6 GHz      2.8/3.2 GHz∗
Lithography             45 nm        45 nm        45 nm∗

Available Memory        2 GiB        2 GiB        12 GiB
                        DDR2-667     DDR2-667     DDR3-1333

Nodes under Test        4            4            1

Table: Comparison of the systems under test per node/board. “∗” denotes a special feature: cache at core cycle speed, Turbo Boost mode, and advanced lithography.


Content

2 Costs and Software


Costs

Part                   ≈ Price in % of overall cost   ≈ Price we paid

Atom ITX board         35                             $ 70
250 GB HDD             20                             $ 30
Cables, etc.           10                             $ 10
2 GB 667 MHz memory    20                             $ 30
Barebone housing       15                             $ 25

Sun Fire X4170                                        $ 3500

Table: Rough estimate of the cost of one cluster node for the Atom clusters. Altogether, the 4 nodes of the cluster cost around $ 700 without the network switch. For a professional, large-quantity purchase this would be about $ 130 per node (according to our purchasing agent).
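The figures in the cost table can be cross-checked with a few lines of Python. This is only a sketch: the prices are the approximate "price we paid" values from the table, and all names (`PRICE_PAID_USD`, `node_cost`, `cluster_cost`, `cost_shares`) are illustrative, not part of the original work.

```python
# Approximate price paid per part for one Atom node, in US dollars
# (values taken from the cost table; the dictionary keys are illustrative labels).
PRICE_PAID_USD = {
    "Atom ITX board": 70,
    "250 GB HDD": 30,
    "Cables, etc.": 10,
    "2 GB memory": 30,
    "Barebone housing": 25,
}
NODES = 4  # number of compute nodes in one Atom cluster


def node_cost(prices):
    """Total price of one node: the sum over all parts."""
    return sum(prices.values())


def cluster_cost(prices, nodes):
    """Price of the whole cluster, without the network switch."""
    return nodes * node_cost(prices)


def cost_shares(prices):
    """Share of each part in the node price, in percent."""
    total = node_cost(prices)
    return {part: 100.0 * price / total for part, price in prices.items()}


if __name__ == "__main__":
    print("per node:", node_cost(PRICE_PAID_USD), "USD")
    print("cluster :", cluster_cost(PRICE_PAID_USD, NODES), "USD")
    for part, share in cost_shares(PRICE_PAID_USD).items():
        print(f"{part:18s} {share:5.1f} %")
```

The per-node sum is $ 165 and the four-node total is $ 660, which is consistent with the "around $ 700" quoted in the caption.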


Software

Real applications from our users (of the ZIH systems)

MpCCD (2.2010)

- Asteroid position determination: not parallel, but multiple input files
- 19 independent jobs (images) with runtimes between 20 sec and 20 min
- Overall runtime ≈ 40 min

RAxML (7.2.2)

- Phylogenetic trees: parallel, but not used for small datasets
- 50 seeds for an amino acid alignment and 50 seeds for a DNA alignment
- Overall runtime ≈ 25 min

POV-Ray (3.6.1)

- 3D rendering: not parallel, but multiple frames per movie
- 2 tests: Cantor with 300 frames, Menger with 100 frames
- Overall runtime ≈ 40 min or 300 min


Content

3 Measurements and Conclusion


Synthetic Measurements (on 4 Nodes)

[Plot: AeApBxC benchmark, comparison Atom D510 vs. Sun X4170. x-axis: accessed memory in bytes (10 KB to 10 MB); y-axis: bandwidth in byte/sec (0.1 GB/s to 100 GB/s). Series: X4170 with HT, X4170 without HT, Atom D510 cluster without HT, Atom D510 cluster with HT.]

Figure: The Atom cluster is well balanced: bandwidth drops by a factor of 2.5 from cache to memory. The Sun Fire X4170 drops by a factor of 23 or 4. The measurement tool was [BenchIT].


Effect of Hyper-Threading on Applications (POV-Ray)

[Plots: power consumption in watt (100 to 170) over runtime in sec (0 to 30k), one panel for the Atom 330 and one for the Atom D510. Series in each panel: gcc / HT, icc / HT, gcc / 1 on 1, icc / 1 on 1. The idle power is marked in both panels.]

Figure: The POV-Ray Menger test shows the power consumption when idle and under full load. It also shows the effect of Hyper-Threading: a performance gain of nearly 35%.


Measurement Example: RAxML

Figure: The results for RAxML are representative of the other applications too. With HT, the runtime and the energy consumption drop by ≈ 30% compared to a single thread per core. While RAxML is memory bound, the Atom D510 is ≈ 2-3% faster in this case.


Sun Fire X4170 vs. Atom D510 Cluster

Sun Fire clearly outperforms the Atom cluster

- On average, a factor of 5.87 longer runtime on the Atom cluster (min: 2.96; max: 8.20)
- On average, a factor of 2.00 more energy consumption on the Atom cluster (without HDD) (min: 1.81; max: 4.22)

But consider!

- Sun Fire is around $ 3500, Atom cluster only $ 500 (Atoms for the same price should be much faster)
- Most of the energy is consumed by the chipset
- Many nodes with HDDs provide more I/O performance
- The Atom cluster can deliver many (free) batch system slots simultaneously
- Atom is still energy efficient (see for example [FAWN])
- Atom has a peak power consumption of 124 watt, Sun Fire 390 watt (factor 2.9)
- Idle cluster nodes can be shut down
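The remark that Atoms for the same price should be much faster can be made concrete with a rough back-of-the-envelope calculation. The sketch below uses only the prices and the average runtime factor quoted on this slide. It assumes, as a simplification that the slides do not state, that independent farm jobs scale perfectly across several identical Atom clusters. All names are illustrative.

```python
# Figures quoted on the slide.
SUN_FIRE_PRICE_USD = 3500
ATOM_CLUSTER_PRICE_USD = 500
RUNTIME_FACTOR = 5.87  # average runtime on one Atom cluster / runtime on the Sun Fire


def clusters_for_same_price(reference_price, cluster_price):
    """How many complete Atom clusters fit into the reference budget."""
    return reference_price // cluster_price


def relative_throughput(n_clusters, runtime_factor):
    """Job throughput of n Atom clusters relative to one Sun Fire.

    ASSUMPTION: jobs are independent (coarse grained), so throughput
    scales linearly with the number of clusters.
    """
    return n_clusters / runtime_factor


if __name__ == "__main__":
    n = clusters_for_same_price(SUN_FIRE_PRICE_USD, ATOM_CLUSTER_PRICE_USD)
    print("Atom clusters for the price of one Sun Fire:", n)
    print(f"relative throughput at equal price: {relative_throughput(n, RUNTIME_FACTOR):.2f}")
```

Under these assumptions, seven Atom clusters cost as much as one Sun Fire and deliver roughly 1.2 times its job throughput. Individual jobs would still take longer, which matches the condition that the runtime of single jobs must not be critical.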


Conclusions

Under the assumption that cost-efficiency does not equal energy-efficiency, we conclude that:

The Atom platform is suitable for building cost-efficient clusters, if:

- The applications use the cluster as a farm of computers, not as a parallel computer (coarse-grained parallel applications)
- The purchase cost has to be as low as possible
- The runtime of single jobs is not critical
- A relatively energy-efficient system is wanted

Another use case is higher education, where performance does not matter, but price and node count do

- See “Linux Cluster in Theory and Practice” here at ICCS
- LittleFe [LittleFe]


Thank you, Questions and Slides

Thank you for your attention!

Any questions?

Availability of these slides

Website: http://goo.gl/opAqv

or the same in long form:

http://wwwpub.zih.tu-dresden.de/~rgeyer/


Sources I

[lctp11] Linux Cluster in Theory and Practice: A Novel Approach in Teaching Cluster Computing Based on the Intel Atom Platform. Proceedings of the International Conference on Computational Science, ICCS 2011. doi:10.1016/j.procs.2011.04.209

[atom11] Coarse Grained Parallelized Scientific Applications on a Cost Efficient Intel Atom Based Cluster. Proceedings of the International Conference on Computational Science, ICCS 2011. doi:10.1016/j.procs.2011.04.216

[BenchIT] Guido Juckeland, Stefan Börner et al.: BenchIT - Performance Measurement and Comparison for Scientific Applications. PARCO2003 Proceedings


Sources II

[seamicro] Seamicro SM 10000 System. Whitepapers on SM10000. http://seamicro.com/

[FAWN] FAWN: A fast array of wimpy nodes. Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. http://doi.acm.org/10.1145/1629575.1629577

[LittleFe] LittleFe: Computational Science on the Move. http://littlefe.net/


Comparison Triad Atom 330 vs Atom D510 vs Xeon 5560

Figure: Atom 330 and Atom D510 vs. Xeon 5560 Memory Triad


Green 500 and MFLOP/Watt

Figure: Comparison to the Green500 list. The values for Atom are derived and not optimized, so they may be significantly higher.


Synthetic Results

Benchmark                   Atom 330 Cluster   Atom D510 Cluster   Performance Factor

CGV                         1.4 GFLOP          2.1 GFLOP           1.5
Total Comm. Bandwidth       1.7 GB/s           2.1 GB/s            1.24
Network Bandwidth           ≈ 87 MB/s          ≈ 87 MB/s           1.0
Network Latency             150 µs             150 µs              1.0
Double Mat-Vec Multiply     4.1 GFLOP          5.2 GFLOP           1.27
Integer Mat-Vec Multiply    6.2 GIOP           8.7 GIOP            1.4

Table: Comparison of generic BenchIT measurements. The average performance gain is a factor of 1.23.
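The performance factors and their average can be recomputed from the two measurement columns. The following sketch assumes each factor is simply the D510 value divided by the 330 value, which reproduces the numbers in the table. The dictionary layout and function names are illustrative.

```python
# (Atom 330 cluster, Atom D510 cluster) values from the table above.
BENCHMARKS = {
    "CGV [GFLOP]": (1.4, 2.1),
    "Total comm. bandwidth [GB/s]": (1.7, 2.1),
    "Network bandwidth [MB/s]": (87.0, 87.0),
    # Latency is "lower is better"; both values are equal, so the factor is 1.0 either way.
    "Network latency [us]": (150.0, 150.0),
    "Double mat-vec multiply [GFLOP]": (4.1, 5.2),
    "Integer mat-vec multiply [GIOP]": (6.2, 8.7),
}


def performance_factors(benchmarks):
    """Ratio D510 / 330 for every benchmark."""
    return {name: d510 / a330 for name, (a330, d510) in benchmarks.items()}


def mean_factor(benchmarks):
    """Arithmetic mean of all performance factors."""
    factors = performance_factors(benchmarks)
    return sum(factors.values()) / len(factors)


if __name__ == "__main__":
    for name, factor in performance_factors(BENCHMARKS).items():
        print(f"{name:34s} {factor:.2f}")
    print(f"average factor: {mean_factor(BENCHMARKS):.2f}")
```

The arithmetic mean of the six ratios comes out at about 1.23, matching the average stated in the caption.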


MpCCD Results

Figure: The results for MpCCD. Hyper-Threading does not bring any advantage in this particular case, but the Atom D510 is a factor of 1.3 faster than the 330.


POV-Ray Results

Figure: The results for POV-Ray. With HT, the runtime and the energy consumption drop by ≈ 30-38% compared to a single thread per core. While POV-Ray is integer bound, the runtime is decreased by 12% on the newer Atom D510.


MpCCD

Compile Version / Cluster    Runtime [s]    Energy Consumption [Wh]

gcc / Atom 330               3449 (2230)    150.78
gcc HT / Atom 330            4069           177.36

gcc / Atom D510              3500 (2250)    121.14
gcc HT / Atom D510           3450           116.01

Table: Results of the MpCCD measurements. With manual scheduling it is possible to reduce the runtime at nearly the same power consumption; these results are given in parentheses.
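The slides do not describe how the manual schedule was constructed. As an illustration of why job ordering matters for a farm of independent jobs with very different runtimes, the sketch below uses a common heuristic, longest-processing-time-first (LPT): sort the jobs by descending runtime and always give the next job to the least-loaded slot. The runtimes are hypothetical values in the stated range of 20 sec to 20 min, not measured data, and the slot count of 8 (4 nodes with 2 cores each) is an assumption.

```python
import heapq

# HYPOTHETICAL runtimes in seconds for 19 independent jobs (not measured data).
HYPOTHETICAL_RUNTIMES_S = [
    1200, 1100, 900, 750, 600, 480, 400, 300, 240, 200,
    150, 120, 90, 60, 45, 40, 30, 25, 20,
]
SLOTS = 8  # ASSUMPTION: 4 nodes x 2 cores, one job per core


def lpt_schedule(runtimes, slots):
    """Longest-processing-time-first list scheduling.

    Returns (assignment, makespan), where assignment maps each slot index
    to the list of job indices it runs, and makespan is the largest slot load.
    """
    # Min-heap of (current load, slot index); the least-loaded slot is on top.
    heap = [(0.0, slot) for slot in range(slots)]
    heapq.heapify(heap)
    assignment = {slot: [] for slot in range(slots)}

    # Visit jobs from longest to shortest.
    jobs = sorted(enumerate(runtimes), key=lambda item: item[1], reverse=True)
    for job, runtime in jobs:
        load, slot = heapq.heappop(heap)
        assignment[slot].append(job)
        heapq.heappush(heap, (load + runtime, slot))

    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    assignment, makespan = lpt_schedule(HYPOTHETICAL_RUNTIMES_S, SLOTS)
    for slot, jobs in assignment.items():
        print(f"slot {slot}: jobs {jobs}")
    print(f"makespan: {makespan:.0f} s")
```

With a batch system that starts jobs in submission order, a long job submitted last can dominate the overall runtime. Starting the longest jobs first avoids that, which is one plausible reason a hand-made schedule can finish earlier.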


RAxML

Compile Version / Cluster               Runtime [s]    Energy Cons. [Wh]

AA single / icc / Atom 330              1573           70.50
AA pthreads HT / icc / Atom 330         1231           56.58
DNA single / icc / Atom 330             1515           67.82
DNA pthreads HT / icc / Atom 330        1046           48.35

AA single / icc / Atom D510             1720           58.20
AA pthreads HT / icc / Atom D510        1192           42.12
DNA single / icc / Atom D510            1412           47.85
DNA pthreads HT / icc / Atom D510       971            34.01

Table: Results of the RAxML Measurements
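Runtime and energy together give the mean power drawn during a run, since energy is power times time. The sketch below converts the RAxML table rows into mean power in watts. The values are copied from the table, while the function name and the data layout are illustrative.

```python
# (runtime in s, energy in Wh) per configuration, copied from the RAxML table.
RAXML_RESULTS = {
    "AA single / Atom 330": (1573, 70.50),
    "AA pthreads HT / Atom 330": (1231, 56.58),
    "DNA single / Atom 330": (1515, 67.82),
    "DNA pthreads HT / Atom 330": (1046, 48.35),
    "AA single / Atom D510": (1720, 58.20),
    "AA pthreads HT / Atom D510": (1192, 42.12),
    "DNA single / Atom D510": (1412, 47.85),
    "DNA pthreads HT / Atom D510": (971, 34.01),
}


def mean_power_watt(runtime_s, energy_wh):
    """Mean power in watts: energy [Wh] * 3600 [s/h] / runtime [s]."""
    return energy_wh * 3600.0 / runtime_s


if __name__ == "__main__":
    for name, (runtime_s, energy_wh) in RAXML_RESULTS.items():
        print(f"{name:30s} {mean_power_watt(runtime_s, energy_wh):6.1f} W")
```

For the Atom 330 rows this gives roughly 160 W, and for the Atom D510 rows clearly less. On these numbers the D510 cluster's lower energy consumption comes mainly from lower power draw rather than from shorter runtimes.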


POV-Ray

Compiler Version / Cluster        Runtime [s]    Energy Consumption [Wh]

menger / icc / Atom 330           23665          1024.24
menger HT / icc / Atom 330        15247          679.52
cantor / icc / Atom 330           3425           126.32
cantor HT / icc / Atom 330        2194           83.26

menger / icc / Atom D510          27145          893.11
menger HT / icc / Atom D510       17290          591.36
cantor / icc / Atom D510          3495           114.31
cantor HT / icc / Atom D510       2175           74.07

Table: Results of the POV-Ray Measurements
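The effect of Hyper-Threading can be read directly off this table as the relative reduction in runtime and energy between the single-thread-per-core run and the HT run. The sketch below computes both for each test and cluster. The numbers are copied from the table, while the names and the data layout are illustrative.

```python
# (runtime in s, energy in Wh) without HT and with HT, copied from the POV-Ray table.
POVRAY_RESULTS = {
    "menger / Atom 330": {"single": (23665, 1024.24), "ht": (15247, 679.52)},
    "cantor / Atom 330": {"single": (3425, 126.32), "ht": (2194, 83.26)},
    "menger / Atom D510": {"single": (27145, 893.11), "ht": (17290, 591.36)},
    "cantor / Atom D510": {"single": (3495, 114.31), "ht": (2175, 74.07)},
}


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def ht_reductions(results):
    """Runtime and energy reduction through HT for every test, in percent."""
    out = {}
    for name, runs in results.items():
        (t_single, e_single), (t_ht, e_ht) = runs["single"], runs["ht"]
        out[name] = (reduction_percent(t_single, t_ht),
                     reduction_percent(e_single, e_ht))
    return out


if __name__ == "__main__":
    for name, (runtime_red, energy_red) in ht_reductions(POVRAY_RESULTS).items():
        print(f"{name:20s} runtime -{runtime_red:.1f} %   energy -{energy_red:.1f} %")
```

All eight reductions fall between 30% and 38%, in line with the range quoted for POV-Ray in the results figure.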


Compiler Flags

ICC:                -pipe -O3 -xSSE3_ATOM --minstruction=movbe

GCC:                -O3 -fomit-frame-pointer -funroll-loops -mtune=[core2|atom]

GCC for POV-Ray:    -O3 -funroll-all-loops -fpeel-loops -funswitch-loops -funit-at-a-time -ffast-math -funroll-loops -finline-functions -mtune=[core2|atom]

Table: The compiler flags used for the building of the testing software.
