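Parallel Preconditioning Methods and Domain Decomposition: Strategies for the Multicore Era

Cross-disciplinary workshop "Integration and Advancement of Computational Science through Algorithms", April 22-23, 2009, Center for Computational Sciences, University of Tsukuba

Kengo Nakajima
Information Technology Center, The University of Tokyo
Earth Simulator Center, Japan Agency for Marine-Earth Science and Technology
Japan Science and Technology Agency, Core Research for Evolutional Science and Technology (CREST)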

Algorithm09 2

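The essence of scientific computing: SMASH

Science, Modeling, Algorithm, Software, Hardware

• Broad range of fields
• Balance
• Collaboration across fields
• Collaboration does not progress unless each side knows at least a little about the other categories.
• Applications
  – Solving one's own problem is enough
  – No useful libraries are readily available (especially for sparse matrices)
• Algorithms
  – Aim to be general-purpose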

Large-scale simulation with the parallel finite element method

• Unstructured meshes
• Element-wise local operations ⇒ large sparse matrices
• Ill-conditioned problems
• Parallel preconditioned iterative methods

[Figures: Magnetic Field of the Earth (MHD code); Complicated Plate Model around Japan Islands; Simulation of Earthquake Generation Cycle in Southwestern Japan; TSUNAMI !!; Transportation by Groundwater Flow through Heterogeneous Porous Media (h=5.00 and h=1.25, T=100 to T=500)]

Outline of the talk

• Parallel iterative methods and domain decomposition (SMASH)
  – Selective overlapping [KN 2007]
  – Hierarchical Interface Decomposition [Henon & Saad 2007]
• Proposal of an extended HID method (SMASH)
  – Domain decomposition for ill-conditioned problems
• Hybrid parallel programming model (SMASH)
• HID and parallel multigrid, if time permits (SMASH)

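Technical issues of "parallel" preconditioning methods (a partial list)

• Domain decomposition
• Easy problems are solved easily, and good efficiency is readily obtained
• Difficult problems remain difficult
  – Block Jacobi type localized preconditioning ⇒ the iteration count grows as the number of domains increases
    • Effects from outside the domain are (basically) ignored
    • Pronounced for ill-conditioned problems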

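Ill-conditioned problems treated in this talk

• There are many kinds of ill-conditioned problems
  – Ordinary engineering problems are mostly "ill-conditioned"
  – Eigenvalue distribution and condition number of the coefficient matrix
• Here the focus is on problems in three-dimensional solid mechanics
  – Contact conditions
  – Heterogeneity
  – Distortion, etc.
  – Block ILU type preconditioning
    • 3 degrees of freedom per node (3 displacement components): efficiency and robustness

[Figure: nodes coupled by contact constraints form one selective block, e.g. 3 nodes form 1 selective block.
  Nodes 0, 1, 2: 2 u_x0 = u_x1 + u_x2, 2 u_y0 = u_y1 + u_y2, 2 u_z0 = u_z1 + u_z2
  Nodes 0, 1: u_x0 = u_x1, u_y0 = u_y1, u_z0 = u_z1]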



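Examples of ill-conditioned problems: Heterogeneous Fields, Distorted Meshes

[Figure]

Contact problems in earthquake generation cycle simulation

• Quasi-static stress accumulation process at plate boundaries
• The nonlinear contact problem is solved by the Newton-Raphson method
• Constraints are imposed by ALM (Augmented Lagrangian Method): penalty number
• Parallel finite element method with domain decomposition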

Inter-domain boundary = contact surface: convergence is worst

[Figure: X-Y view of blocks with edge length 1.00 separated by gaps of 0.10]

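Convergence deteriorates when domain boundaries fall on groups of "stiff" elements

[Figure: 3D Solid Mechanics, E: Young's Modulus; regions with E=10^0 and E=10^3]

Heterogeneous elasticity problem: 20^3 elements

[Figure: ■ marks a 4×4×20 region]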

Heterogeneous elasticity problem: 20^3 elements, BILU(0)-GPBiCG, number of iterations

• Both materials (■■): Poisson's ratio = 0.25
• Base material (■): E=1.00

Young's modulus of the ■ region     1 processor     8 processors (no overlap)
E=10^-3                             34              53 (×1.56)
E=10^0                              31              52 (×1.68)
E=10^+3                             84              158 (×1.88)

[Figure: boundary conditions. Uz=0 @ z=Zmin; Ux=0 @ x=Xmin; Uy=0 @ y=Ymin; Uniform Distributed Force in z-direction @ z=Zmax; Nx-1, Ny-1, Nz-1 elements in the x, y, z directions]

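Dealing with ill-conditioned problems

• When ill-conditioned problems are solved with preconditioned iterative methods, convergence can deteriorate markedly in parallel computation, so research on preconditioners and domain decomposition that give robust convergence is important
  – Domain decomposition: an area not covered by conventional numerical libraries
• Remedies
  – Multilevel analysis, coarse grid methods
  – Deep inter-domain overlap
    • Computation and communication cost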

Extension of inter-domain overlap

[Figure: a 25-node mesh distributed over PE#0 to PE#3, shown with global numbering and with the local numbering on each PE, before and after extending the overlap. ●: Internal Nodes, ●: External Nodes, ■: Overlapped Elements]

Cost for computation and communication may increase

Heterogeneous elasticity problem: 20^3 elements, ILU(0)-GPBiCG, 8 domains, number of iterations

Effect of extending the overlapped zone

Overlap depth      0      1      2      3      4    single domain
E=10^-3           53     34     32     30     31     34
E=10^0            52     33     32     32     31     31
E=10^+3          158    103    100     97     82     84

Contact problem: effect of overlap extension

• [KN 2005]
  – BILU(0,1,2)
  – for "consistent" node number cases
  – IBM SP3 in NERSC/LBNL

Preconditioning   Partitioning (overlap)   PE#   Iterations   Set-up + solve (sec.)   Parallel speed-up
SB-BILU(0)        special [3], 1-layer      16      386             506.2                  16.0
                                           128      410              63.9                 126.7
BILU(1)           special [3], 1-layer      16      225             563.2                  16.0
                                           128      247              95.0                  94.8
BILU(1)           regular, 1-layer          16      444            1033.2                  16.0
                                           128      529             191.0                  86.6
BILU(1)           regular, 2-layers         16      405            1063.3                  16.0
                                           128      430             204.6                  83.2
SPAI              regular, 2-layers         16      891             626.3                  16.0
                                           128      888             105.1                  95.4
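
The speed-up column in the table above appears to be normalized so that the 16-PE run counts as 16.0. Under that assumption (it is not stated on the slide), the 128-PE values can be reproduced from the elapsed times. The following is a minimal sketch; the function name `speed_up` is illustrative, and small differences from the tabulated values come from the rounding of the printed times.

```python
# Elapsed times (set-up + solve, seconds) copied from the table: (16 PEs, 128 PEs).
TIMES = {
    "SB-BILU(0), special 1-layer": (506.2, 63.9),
    "BILU(1), special 1-layer": (563.2, 95.0),
    "BILU(1), regular 1-layer": (1033.2, 191.0),
    "BILU(1), regular 2-layers": (1063.3, 204.6),
    "SPAI, regular 2-layers": (626.3, 105.1),
}


def speed_up(t_base, t_target, base_pes=16):
    """Speed-up of the target run, assuming the base run counts as `base_pes`."""
    return base_pes * t_base / t_target


if __name__ == "__main__":
    for name, (t16, t128) in TIMES.items():
        print(f"{name:30s} {speed_up(t16, t128):6.1f}")
```

Read this way, SB-BILU(0) needs more iterations than BILU(1) but scales best, because its elapsed time drops by almost the full factor of 8 between 16 and 128 PEs.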

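Recent work

• Development of robust and efficient parallel preconditioners and domain decomposition methods
  – Make maximum use of the characteristics of the application
    • Robustness and efficiency: blocking
  – Application ⇒ specialized preconditioner ⇒ generalization
• Establish methods that automatically select the best combination of preconditioner, domain decomposition, and parameters for a given problem
    • Use of global and local information
  – Global information: condition number of the coefficient matrix, etc.
  – Local information: local information of each element
  – Properties of the application
• Benchmarks that reflect the characteristics of real problems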

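Benchmarks that reflect the characteristics of real problems

• Validation of sparse solvers and preconditioners
  – Matrix Market, matrices from real applications
  – Many restrictions ⇒ in the end they represent only very limited conditions
  – They can be hard to obtain
    • Long simulations, large matrix data
    • Nonlinear problems: different phases ⇒ matrices with different properties
• Benchmarks that reflect the characteristics of real problems
  – Generate coefficient matrices with properties similar to those of the original problem
    • Even if the original problem is nonlinear, it is presumably linearized in some form
  – The "benchmark" may therefore be a linear problem
  – Parameters, geometry, and problem size can be changed freely
  – One should at least understand how the coefficient matrix was derived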

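Domain decomposition methods

• Selective overlapping [KN 2007]
  – Preconditioning and domain decomposition methods that exploit the characteristics of finite element applications (element attributes, material properties) so that a variety of conditions can be handled
    • Adaptive adjustment of the overlap depth
    • Selective fill-in: preconditioning
• Hierarchical Interface Decomposition (HID) [Henon & Saad 2007]
  – Efficiently suppresses the growth in iteration count caused by an increasing number of domains
    • PHIDAL (Parallel Hierarchical Interface Decomposition Algorithm)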

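Selective overlapping, selective fill-in

• Contact problems in assembly structures
• Exploit the characteristics of the application
  – Selective fill-in: preconditioning
  – Selective overlapping: domain decomposition

Example of an assembly structure: jet engine

[Figure]

Matching and non-matching nodes on contact surfaces

[Figure: matching and non-matching contact surfaces]

• Assembly structures: each part is meshed separately, so non-matching nodes can occur
• Earthquakes: large slip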

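Example: contact problem with non-matching nodes, modeling an assembly structure

• Each block is divided into cubic elements with edge length 1.0 (elastic material, Young's modulus = 1.00, Poisson's ratio = 0.30).
• The blocks are separated by gaps of 0.10 and connected by crossing truss elements (elastic).
• Constraint conditions on the contact surfaces are modeled by setting the Young's modulus of the truss elements to 10^3 times that of the blocks.
• Displacement in the z direction is fixed at z=0, symmetry conditions are applied at x=0 and y=0, and a uniform distributed load in the z direction is applied on the z=zmax surface.

[Figure: stacked blocks in X-Y-Z view and X-Y detail; block edge 1.00, gap 0.10]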

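Selective fill-ins

• Apply fill-in one level higher only to the nodes connected to truss elements
• For this problem: BILU(1+)
  – Block ILU (three-dimensional elasticity, 3 degrees of freedom per node)
  – BILU(2): nodes connected to truss elements
  – BILU(1): all other nodes
• The computational cost is comparable to BILU(1), while preconditioning quality close to BILU(2) is expected.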
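
The selective fill-in idea can be illustrated with a level-of-fill symbolic factorization in which the admissible fill level differs from node to node. The sketch below works on the node graph (one entry per 3×3 block) and assumes a row-wise rule: a fill entry in row i is kept when its level does not exceed the level assigned to node i. That rule, and the names `selective_levels`, `symbolic_ilu`, and `fill_in`, are assumptions made for illustration; the slides do not spell out the exact criterion.

```python
def selective_levels(n, base, special):
    """Fill level per node: `base` everywhere, `base + 1` on the special nodes."""
    return {i: base + (1 if i in special else 0) for i in range(n)}


def symbolic_ilu(adj, max_level):
    """Level-of-fill symbolic factorization on a node graph.

    adj:       dict node -> set of neighbouring nodes (nodes numbered 0..n-1)
    max_level: dict node -> highest fill level kept in that node's row
    Returns dict node -> {column: level}; original entries have level 0.
    """
    rows = {}
    for i in range(len(adj)):
        lev = {j: 0 for j in adj[i]}
        lev[i] = 0
        done = -1
        while True:
            # Next pivot column k < i not yet processed (fill may add new ones).
            pending = [k for k in lev if done < k < i]
            if not pending:
                break
            k = min(pending)
            done = k
            for j, level_kj in rows[k].items():
                if j <= k:
                    continue
                new = lev[k] + level_kj + 1
                if new <= max_level[i] and new < lev.get(j, new + 1):
                    lev[j] = new
        rows[i] = lev
    return rows


def fill_in(rows, adj):
    """Entries present in the factor pattern but not in the original matrix."""
    return {(i, j) for i, lev in rows.items()
            for j in lev if j != i and j not in adj[i]}


if __name__ == "__main__":
    # Node 0 is coupled to nodes 1, 2, 3; eliminating it creates level-1 fill.
    star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
    pattern = symbolic_ilu(star, selective_levels(4, 0, {2}))
    print(sorted(fill_in(pattern, star)))
```

With base level 0 and node 2 marked as special, only row 2 receives fill, which mirrors how BILU(1+) keeps the pattern close to BILU(1) while adding entries only around the truss-connected nodes.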

Idea of "Selective fill-ins": ILU(1+)

[Figure]
● 2nd order fill-in's are considered for these nodes
● 2nd order fill-in's are NOT considered for these nodes

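Selective Overlapping

• Extend the idea of "selective fill-ins" to inter-domain overlap
• For ordinary nodes, the extension of the inter-domain overlap is "delayed"
• The increase in computation and communication caused by extending the overlap can be suppressed.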
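
A node-based sketch of the "delayed" extension follows. The slides define the rule on elements (elements that contain no truss-connected node are extended one step later); the version here simplifies this to a node graph and admits a node of the extra layer only if it is itself special or touches a special node already in the region. This simplification and the names `neighbours` and `overlap_region` are assumptions for illustration.

```python
def neighbours(adj, region):
    """Nodes adjacent to `region` that are not yet part of it."""
    return {w for v in region for w in adj[v]} - region


def overlap_region(adj, internal, depth, special=frozenset(), plus=False):
    """Internal nodes plus `depth` overlap layers; with plus=True, depth 'd+'.

    The extra '+' layer is restricted to nodes that are special (for example,
    connected to truss elements) or adjacent to a special node in the region.
    """
    region = set(internal)
    for _ in range(depth):
        region |= neighbours(adj, region)
    if plus:
        candidates = neighbours(adj, region)
        region |= {v for v in candidates
                   if v in special
                   or any(w in special and w in region for w in adj[v])}
    return region


if __name__ == "__main__":
    # Path graph 0-1-2-3-4-5; nodes 2 and 3 play the role of truss-connected nodes.
    path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
    print(sorted(overlap_region(path, {0, 1}, 1)))
    print(sorted(overlap_region(path, {0, 1}, 1, special={2, 3}, plus=True)))
```

On a real mesh the extra layer would be small compared with a full additional layer, which is where the saving in computation and communication comes from.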

Extension of Overlapped Zones

[Figure sequence. ● Internal Nodes, ● External Nodes, ■ Overlapped Elements]
– Internal Nodes for Partitioning, with the Domain Boundary
– One-Layer Overlapping (d=0/1): This is the general configuration of the local data set for parallel FEM (one layer of overlapping).
– Extension of Overlapped Zones (2-layers: d=2)
– Extension of Overlapped Zones (d=2 and d=1+): Selective Overlapping (d=1+), "delayed" extension for elements which do not include nodes connected to truss-type elements
– Extension of Overlapped Zones (d=3 and d=2+): Selective Overlapping (d=2+), reduced cost for computations and communications

BILU with selective fill-in/overlapping

• BILU(p)-(d)
  – p: level of fill-ins (0, 1, 1+, 2, 2+, …)
  – d: depth of overlapping (0, 1, 1+, 2, 2+, …)

Effect of Overlapping

Results: 64 cores, 3,090,903 DOF, truss stiffness ratio 10^3, convergence tolerance 10^-8

[Figure: ITERATIONS, Elapsed Time (sec.), and number of off-diagonal non-zero components of [M], plotted against overlap depth d = 0, 1, 1+, 2, 2+, 3. ● BILU(1)-(d), ■ BILU(1+)-(d), ▲ BILU(2)-(d)]

[Figure: enlarged view of ITERATIONS and Elapsed Time (sec.) for d = 1+, 2, 2+, 3, same legend]

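HID: Hierarchical Interface Decomposition [Henon & Saad]

• Hierarchical decomposition of the inter-domain interfaces
• In a couple of words:
  – Hierarchical domain decomposition
    • Nested Dissection
  – Within each level the domains are not directly connected ⇒ parallelism within a level
• Computational cost is roughly between (d=0) and (d=1): low cost
  – d: overlap depth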

Parallel ILU for each Connector at each LEVEL

• The sub-domains are numbered 0, 1, 2, 3
• The numbers in the figure are the numbers of the adjacent sub-domains
• Nodes are renumbered in ascending order of level
• Incomplete LU(0) factorization can be carried out in parallel within each level

[Figure: mesh nodes labeled with the sub-domains they touch. Connectors: 0, 1, 2, 3 (Level-1); 0,1 / 0,2 / 1,3 / 2,3 (Level-2); 0,1,2,3 (Level-4)]

Communication at each level

• Computation on nodes that belong to "higher"-level connectors
  – Uses the results computed on "lower"-level connectors
    • Already computed
  – When the adjacent "lower"-level connector nodes belong to another domain
    • Communication is required

[Figure: connectors 0, 1, 2, 3 (Level-1); 0,1 / 0,2 / 1,3 / 2,3 (Level-2); 0,1,2,3 (Level-4)]

Forward Substitution

do lev= 1, LEVELtot
  do i= LEVindex(lev-1)+1, LEVindex(lev)
    SW1= WW(3*i-2,R); SW2= WW(3*i-1,R); SW3= WW(3*i  ,R)
    isL= INL(i-1)+1; ieL= INL(i)
    do j= isL, ieL
      k= IAL(j)
      X1= WW(3*k-2,R); X2= WW(3*k-1,R); X3= WW(3*k  ,R)
      SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
      SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
      SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
    enddo
    X1= SW1; X2= SW2; X3= SW3
    X2= X2 - ALU(9*i-5)*X1
    X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
    X3= ALU(9*i  )* X3
    X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
    X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
    WW(3*i-2,R)= X1; WW(3*i-1,R)= X2; WW(3*i  ,R)= X3
  enddo

  ! Communications using Hierarchical Comm. Tables.
  call SOLVER_SEND_RECV_3_LEV(lev,…)
enddo

Additional communication occurs at every level
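
The structure of the loop above (rows processed level by level, with one exchange after each level) can be sketched in scalar form. The following is a simplified illustration, not the author's routine: it uses one unknown per node instead of 3×3 blocks, divides by the diagonal instead of applying a stored block inverse, and replaces the MPI exchange by a hypothetical `exchange` callback.

```python
def forward_substitution_by_level(level_index, row_ptr, col_idx, low, diag, rhs,
                                  exchange=None):
    """Solve L x = rhs with rows numbered level by level.

    level_index: level_index[lev] is one past the last row of level `lev`
                 (level_index[0] == 0), like LEVindex in the Fortran code.
    row_ptr, col_idx, low: strictly lower triangle in compressed row storage.
    diag: diagonal entries of L.
    exchange: optional callback(lev, x) standing in for the per-level
              communication of updated boundary values.
    """
    x = list(rhs)
    for lev in range(1, len(level_index)):
        for i in range(level_index[lev - 1], level_index[lev]):
            s = x[i]
            for p in range(row_ptr[i], row_ptr[i + 1]):
                s -= low[p] * x[col_idx[p]]
            x[i] = s / diag[i]
        if exchange is not None:
            exchange(lev, x)
    return x


if __name__ == "__main__":
    # L = [[2, 0, 0], [1, 2, 0], [0, 1, 4]]; rows 0-1 form level 1, row 2 level 2.
    x = forward_substitution_by_level(
        [0, 2, 3], [0, 0, 1, 2], [0, 1], [1.0, 1.0],
        [2.0, 2.0, 4.0], [2.0, 3.0, 6.0])
    print(x)
```

The callback is invoked once per level, which makes the extra communication visible: a solve with LEVELtot levels needs LEVELtot exchanges instead of a single one.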

Results (64 cores): contact problem, 3,090,903 DOF

[Figure: elapsed time (sec.) and ITERATIONS for BILU(1), BILU(1+), and BILU(2) with GPBiCG. ■ BILU(p)-(0): Block Jacobi, ■ BILU(p)-(1), ■ BILU(p)-(1+), ■ BILU(p)-HID]

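HID vs. Selective Overlapping

• HID and selective overlapping perform about the same, with the latter slightly better
  – Especially for badly conditioned problems
  – Problems in which BILU needs high-order fill-in
• With the original HID, when the order of fill-in becomes high, the effect of fill-in from nodes outside the domain (at the same level) cannot be taken into account
  – As ILU(0) it is complete, however ...

[Figure: connectors 0, 1, 2, 3 (Level-1); 0,1 / 0,2 / 2,3 / 1,3 (Level-2); 0,1,2,3 (Level-4)]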


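Outline (current part: proposal of an extended HID method)

• Parallel iterative methods and domain decomposition
  – Selective overlapping [KN 2007]
  – Hierarchical Interface Decomposition [Henon & Saad 2007]
• Proposal of an extended HID method
  – Domain decomposition for ill-conditioned problems
• Hybrid parallel programming model
• HID and parallel multigrid (if time permits)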

Problem with distorted elements (1/3)

• 3D linear elastic problem with locally distorted elements

[Figure: boundary conditions. Uz=0 @ z=Zmin; Ux=0 @ x=Xmin; Uy=0 @ y=Ymin; Uniform Distributed Force in z-direction @ z=Zmax; (Nx-1), (Ny-1), (Nz-1) elements, i.e. Nx, Ny, Nz nodes, in the x, y, z directions]

Problem with distorted elements (2/3)

• 3D linear elastic problem with locally distorted elements
• Cubic mesh
  – Rotated around the Z axis
• Local heterogeneity
  – Represents the local degree of distortion
  – sequential Gauss algorithm [Deutsch & Journel 1998]

Problem with distorted elements (3/3)

• 3D linear elastic problem with locally distorted elements

[Figure]

Problem with distorted elements (heterogeneous distribution): BILU(p,)-(d,), 3,090,903 DOF, 64 cores, MAX distortion: 150-deg.

[Figure: ITERATIONS and elapsed time (sec.) versus Depth of Overlapping: (0), (1), (1+, 120), (1+, 90), (1+, 60), (2). ■ BILU(1)-(d,) GPBiCG, ■ BILU(1+,120°)-(d,), ■ BILU(1+, 60°)-(d,), ■ BILU(1+, 30°)-(d,), ■ BILU(2)-(d,)]

Problem with distorted elements (heterogeneous distribution): BILU(p,)-(d,), 3,090,903 DOF, 64 cores, MAX distortion: 225-deg.

[Figure: ITERATIONS and elapsed time (sec.) versus depth of overlapping: (1+, 90), (1+, 45), (2), (2+, 135), (2+, 90), (3). Same legend]

Attempting to improve HID: Extended Version of HID

• Extension of the overlapped zones
• Make the separators between levels "thicker"
  – Thicker Separators

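Original Local Data Set

[Figure: nodes of sub-domains 2 and 3 on either side of the separator, with neighbouring nodes B and A; level-1 ●, level-2 ●]

• Original HID
  – Case of overlap depth = 0 or 1
  – The contribution of fill-in from nodes outside the domain at the same level cannot be taken into account
    • With BILU(2), the effect of node B cannot be considered at node A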

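Remedy 1: extension of the overlapped zones

[Figure: nodes of sub-domains 2 and 3 with nodes B and A, two layers of overlap]

• Extension of the overlapped zones
  – Two layers of overlap
  – The contribution of fill-in from nodes outside the domain at the same level can now be taken into account
    • With BILU(2), the effect of node B can be considered at node A
  – However, the method is still a localized block Jacobi approach
    • The value at B is not the most recent one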

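Remedy 2: Thicker Separator

[Figure: nodes of sub-domains 2 and 3 with nodes B and A, thicker separator]

• Thicker Separator
  – HIDnew
  – The contribution of fill-in from nodes outside the domain at the same level can now be taken into account
    • With BILU(2), the effect of node B can be considered at node A
  – A global approach
  – Appears more effective than Remedy 1
  – Load balancing is difficult
  – The thickness of the separator is determined by the fill-in depth of the preconditioner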

Remedy 2: Thicker Separator

• Related work
  – Takeshi Iwashita and Masaaki Shimasaki; "Block Red-Black Ordering: A New Ordering Strategy for Parallelization of ICCG Method", International Journal of Parallel Programming, Vol. 31, No. 1, (2003), pp.55-75
    • Finite-difference grids
    • The two levels, Red and Black, are solved alternately

Environment and program overview

• T2K Open Supercomputer (University of Tokyo)
• MPI + FORTRAN90 (Hitachi compiler)
  – Flat MPI
• Problem with distorted elements
  – FEM, tri-linear hexahedral elements
• GPBiCG [Zhang 1997]
• Preconditioner
  – Block ILU(2): 2nd order of fill-ins

Domain decomposition

• BILU(2,d)
  – Localized block Jacobi + overlap extension
    • BILU(2,2), BILU(2,3)
• BILU(2,HID-d)
  – HID + overlap extension
    • BILU(2,HID-1), BILU(2,HID-2)
  – BILU(2,HID-1): BILU(2) with original HID
• BILU(2,HIDnew-d)
  – HID + overlap extension + Thicker Separators
    • BILU(2,HIDnew-1), BILU(2,HIDnew-2)
    • Level-2 separators are three layers thick
    • No special measures for load balancing

Strategies for Domain Decomposition

[Figure: node labels give the sub-domains each node touches. HID: level-1 ●, level-2 ●, level-4 ○. HIDnew, with thicker level-2 separators: level-1 ●, level-2 ●, level-3 ●, level-4 ○]

Type-I: 64 cores, 100^3 elements

[Figure: Iterations and elapsed time (sec.) versus MAX Distortion Angle (150-deg., 200-deg., 250-deg.) for BILU(2,2), BILU(2,3), BILU(2,HID-1), BILU(2,HID-2), BILU(2,HIDnew-1), BILU(2,HIDnew-2)]

• HID is already better than localized block Jacobi (+ overlap extension)
• The benefit of HIDnew is not yet clear

Type-II: Strong Scaling, 128^3 elements, MAX: 200 deg.

• BILU(2,HID-1) degrades as the number of cores increases
• BILU(2,HID-2) is relatively stable
• BILU(2,HIDnew-d) is stable

[Figure: Iterations versus core# (32 to 512). Left panel: BILU(2,2), BILU(2,3). Right panel: BILU(2,HID-1), BILU(2,HID-2), BILU(2,HIDnew-1), BILU(2,HIDnew-2)]

Type-II: Strong Scaling, 128^3 elements, MAX: 200 deg., Scalability

[Figure: Speed-Up versus core# (0 to 600)]

Type-II: Strong Scaling, 128^3 elements, MAX: 200 deg., Relative Performance
Normalized by BILU(2,HID-1) at each number of cores

[Figure: Relative Performance versus core# (32 to 512). Left panel: BILU(2,2), BILU(2,3). Right panel: HID and HIDnew variants]


Factors impeding parallelization

• Communication in HID/HIDnew
  – Ratio to the computation time of the linear solver
• Load balance on 512 cores
  – Standard deviation
    • BILU(2,d): 85
    • BILU(2,HID-d): 155
    • BILU(2,HIDnew-d): 289

[Figure: Additional Communication (%) versus core# (32 to 512)]
[Figure: Internal Node # (min., max.) for each Partitioning Method: BILU(2,d), BILU(2,HID-d), BILU(2,HIDnew-d)]

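Summary and future work

• Improved HID
  – Extension of inter-domain overlap
  – Thicker Separator: this was particularly effective
    • Devices for incorporating the effect of fill-in from outside the domain
• Open issues
  – Application to complex geometries
    • Especially Thicker Separator
    • Nodes other than level 2
  – Load balancing
• Intelligent Partitioner
  – Required when HID is applied to complex geometries
• Application to a variety of problems
  – Selective Overlapping vs. (or +) HIDnew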

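Outline (current part: Hybrid parallel programming model)

• Proposal of an extended HID method
  – Domain decomposition for ill-conditioned problems
• Hybrid parallel programming model
• HID and parallel multigrid (if time permits)

Topics

• Parallel preconditioning methods for ill-conditioned problems
  – Finite element method, sparse matrices
• T2K Open Supercomputer (University of Tokyo)
• Flat MPI vs. Hybrid
  – Optimization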

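T2K (University of Tokyo) (1/2)

• T2K Open Supercomputer specification
  – http://www.open-supercomputer.org/
  – University of Tsukuba, University of Tokyo, Kyoto University
• T2K Open Supercomputer (University of Tokyo)
  – Hitachi HA8000 cluster system
  – In operation since June 2008
  – 952 nodes (15,232 cores), 141 TFLOPS peak
    • Quad-core Opteron (Barcelona)
  – 27th in the TOP500 list (Nov. 2008)
    • (ranked 1st in Japan at that time)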

T2K (University of Tokyo) (2/2)

• AMD Quad-core Opteron (Barcelona) 2.3GHz
  – 4 "sockets" per node
  – 16 cores/node
• Multicore, multi-socket
• cc-NUMA (cache coherent Non-Uniform Memory Access)
  – Use data in local memory as much as possible
  – Explicit command line switches
  – NUMA control

[Figure: one node with four sockets; each socket has four cores with private L1 and L2 caches, a shared L3 cache, and its own local memory]

Flat MPI vs. Hybrid

• Hybrid: Hierarchical Structure
• Flat-MPI: Each PE -> Independent

[Figure: Hybrid, groups of cores sharing one memory; Flat MPI, every core paired with its own memory]

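Flat MPI vs. Hybrid

• Performance is determined by the combination of many parameters
• Hardware
  – Architecture of the cores and CPUs
  – Peak performance
  – Memory performance (bandwidth, latency)
  – Communication performance (bandwidth, latency)
  – The balance among them
• Application
  – Characteristics: memory bound, communication bound
  – Problem size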

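Sparse matrix solvers: FEM, FDM

• Memory-bound (heavy load on memory)
  – Indirect addressing
  – Hybrid (OpenMP) is even more memory-bound
• Latency-bound (in parallel computation)
  – Communication occurs only at domain boundaries
  – Message sizes are also small
• Exascale systems
  – Number of cores: O(10^8)
  – 10^8-way MPI: how large is the overhead due to MPI latency?
  – Expectations for Hybrid
    • At the very least, the number of MPI processes can be reduced (to 1/16 on T2K)

for (i=0; i<N; i++) {
  for (k=Index[i]; k<Index[i+1]; k++) {
    Y[i]= Y[i] + A[k]*X[Item[k]];
  }
}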
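
The loop above is a sparse matrix-vector product in compressed row storage. A self-contained sketch in Python makes the indirect access explicit; the array names follow the slide (`index`, `item`, `a`), while the function name `spmv_crs` is illustrative.

```python
def spmv_crs(index, item, a, x):
    """y = A x for a matrix in compressed row storage (CRS).

    index: row pointers, length n + 1 (row i owns entries index[i]..index[i+1]-1)
    item:  column number of each stored entry
    a:     value of each stored entry
    """
    n = len(index) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(index[i], index[i + 1]):
            # x is read through item[k]: this indirect access is what makes
            # the kernel memory-bound rather than compute-bound.
            y[i] += a[k] * x[item[k]]
    return y


if __name__ == "__main__":
    # Tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
    index = [0, 2, 5, 7]
    item = [0, 1, 0, 1, 2, 1, 2]
    a = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
    print(spmv_crs(index, item, a, [1.0, 2.0, 3.0]))
```

Each stored entry costs one multiply-add but requires loading a value, a column index, and an element of x from a data-dependent address, so the achievable speed is set by memory bandwidth and latency.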

Flat MPI vs. Hybrid for sparse matrix solvers: GeoFEM Benchmarks [KN 2003]

[Figure: TFLOPS versus PE# (0 to 1280). Large: ● Flat MPI, ● Hybrid. Small: ▲ Flat MPI, ▲ Hybrid]

• MPI latency matters
• Especially pronounced on the Earth Simulator
  – Sustained performance and communication bandwidth are high, but latency is only average (6 to 8 μsec.)
  – Hybrid generally gains the advantage as the number of nodes increases
  – Especially when the problem size per node is small

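Target problem

• Three-dimensional static elastic analysis (solid mechanics), heterogeneous material properties
  – Emax=10^3, Emin=10^-3, Poisson's ratio = 0.25
    • Generated by a geostatistical method [Deutsch & Journel, 1998]
  – 128^3 hexahedral elements, 6,291,456 DOF
    • Strong Scaling
• Iterative solver: (SGS+CG)
  – Symmetric Gauss-Seidel
  – HID
• T2K (University of Tokyo)
  – 512 cores (32 nodes)
• FORTRAN90 (Hitachi) + MPI
  – Flat MPI, Hybrid (4x4, 8x2, 16x1)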

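OpenMP parallelization of preconditioned iterative methods on SMP/multicore systems

• DAXPY, SMVP, Dot Products
  – Easy
• Preconditioning: ILU-type factorization, forward and backward substitution
  – Global dependency
  – Parallelism is extracted by reordering
    • Multicolor Ordering (MC), Reverse Cuthill-McKee (RCM)
    • Elements within the same color are independent ⇒ they can be processed in parallel
  – Optimization for the Earth Simulator [KN 2002, 2003]
    • Parallel and vector performance
  – CM-RCM, which offers high parallelism and robust convergence, is adopted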

Ordering Methods

[Figure: numbering of an 8×8 grid under three orderings. RCM: Reverse Cuthill-McKee. MC (Color#=4): Multicoloring. CM-RCM (Color#=4): Cyclic MC + RCM]

Effect of Ordering Methods on Convergence

[Figure: Iterations (50 to 90) versus color # (1 to 1000, logarithmic axis). ▲ MC, ● CM-RCM]

Reordering by CM-RCM: 5 colors, 8 threads

[Figure: Initial Vector ⇒ Coloring (5 colors) + Ordering ⇒ color=1 to color=5, with the entries of each color divided among threads 1 to 8]

Elements that belong to the same "color" are independent and can therefore be computed in parallel ⇒ thread parallelization (OpenMP, etc.) is possible
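
The independence property can be checked on a small example. The sketch below uses a plain greedy multicoloring of a 5-point grid graph; it is not the CM-RCM ordering used in the talk, and the helper names (`grid_graph`, `greedy_multicolor`, `is_independent`, `color_sets`) are illustrative.

```python
def grid_graph(nx, ny):
    """Adjacency of an nx-by-ny grid with a 5-point stencil, node id = i + nx*j."""
    adj = {}
    for j in range(ny):
        for i in range(nx):
            nbrs = set()
            if i > 0:
                nbrs.add((i - 1) + nx * j)
            if i < nx - 1:
                nbrs.add((i + 1) + nx * j)
            if j > 0:
                nbrs.add(i + nx * (j - 1))
            if j < ny - 1:
                nbrs.add(i + nx * (j + 1))
            adj[i + nx * j] = nbrs
    return adj


def greedy_multicolor(adj):
    """Give every node the smallest color not used by an already colored neighbour."""
    color = {}
    for v in sorted(adj):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def is_independent(adj, color):
    """True if no two connected nodes share a color."""
    return all(color[v] != color[w] for v in adj for w in adj[v])


def color_sets(color):
    """Nodes grouped by color; each group can be processed by parallel threads."""
    groups = {}
    for v, c in color.items():
        groups.setdefault(c, []).append(v)
    return [sorted(groups[c]) for c in sorted(groups)]


if __name__ == "__main__":
    grid = grid_graph(4, 4)
    colors = greedy_multicolor(grid)
    print(is_independent(grid, colors), color_sets(colors))
```

Because no node depends on another node of the same color, the forward and backward substitutions can loop over colors sequentially and distribute the nodes of each color across threads.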

Flat MPI, Hybrid (4x4, 8x2, 16x1)

[Figure: mapping of processes and threads onto sockets 0 to 3 of one node for Flat MPI, Hybrid 4x4, Hybrid 8x2, and Hybrid 16x1]

Cases examined

• CASE-1
  – Initial Case (CM-RCM)
  – Mainly examines the effect of NUMA control
• CASE-2 (Hybrid only)
  – First-Touch
• CASE-3 (Hybrid only)
  – Further Data Reordering + First-Touch
• NUMA Control (Policy 0~5) is applied in each case

Results: CASE-1, 32 nodes/512 cores, performance of the linear solver

Policy ID   Command line switches
0           no command line switches
1           --cpunodebind=$SOCKET --interleave=all
2           --cpunodebind=$SOCKET --interleave=$SOCKET
3           --cpunodebind=$SOCKET --membind=$SOCKET
4           --cpunodebind=$SOCKET --localalloc
5           --localalloc

e.g. mpirun -np 64 -cpunodebind 0,1,2,3 a.out

CASE-1
Method      Iterations   Best Policy
Flat MPI    1264         2
HB 4x4      1261         2
HB 8x2      1216         2
HB 16x1     1244         2

[Figure: Relative Performance of the Parallel Programming Models (Flat MPI, HB 4x4, HB 8x2, HB 16x1) for policy 0 and for the best policy (policy 2), normalized by Flat MPI (Policy 0)]

First Touch Data Placement

Memory pages of an array are allocated in the local memory of the core that touches them first, so the arrays are initialized in the same order as the computation.

do lev= 1, LEVELtot
  do ic= 1, COLORtot(lev)
!$omp parallel do private(ip,i,j,isL,ieL,isU,ieU)
    do ip= 1, PEsmpTOT
      do i = STACKmc(ip,ic-1,lev)+1, STACKmc(ip,ic,lev)
        RHS(i)= 0.d0; X(i)= 0.d0; D(i)= 0.d0

        isL= indexL(i-1)+1
        ieL= indexL(i)
        do j= isL, ieL
          itemL(j)= 0; AL(j)= 0.d0
        enddo

        isU= indexU(i-1)+1
        ieU= indexU(i)
        do j= isU, ieU
          itemU(j)= 0; AU(j)= 0.d0
        enddo
      enddo
    enddo
!$omp end parallel do
  enddo
enddo

Further reordering so that memory access on each thread becomes contiguous: 5 colors, 8 threads

[Figure: Initial Vector ⇒ Coloring (5 colors) + Ordering: color=1 to color=5, each divided among threads 1 to 8. Memory access on each thread is non-contiguous (entries are numbered in color order) ⇒ after further reordering the entries are grouped by thread (1 1 1 1 1, 2 2 2 2 2, ..., 8 8 8 8 8) and numbered contiguously within each thread]
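
The renumbering can be written as a permutation from the color-major numbering to a thread-major numbering. The following sketch assumes that the number of unknowns handled by each thread in each color is known; the input layout `chunk_sizes[color][thread]` and the function name are hypothetical.

```python
def thread_contiguous_numbering(chunk_sizes):
    """Map color-major numbering to thread-major numbering.

    chunk_sizes[c][t]: number of unknowns of color c handled by thread t.
    Old numbering: color 0 (thread 0, thread 1, ...), then color 1, ...
    New numbering: thread 0 (color 0, color 1, ...), then thread 1, ...
    Returns a list new_id with new_id[old] = new.
    """
    n_colors = len(chunk_sizes)
    n_threads = len(chunk_sizes[0])

    # Start of every (color, thread) chunk in the old numbering.
    old_start = {}
    pos = 0
    for c in range(n_colors):
        for t in range(n_threads):
            old_start[c, t] = pos
            pos += chunk_sizes[c][t]

    new_id = [0] * pos
    nxt = 0
    for t in range(n_threads):
        for c in range(n_colors):
            for k in range(chunk_sizes[c][t]):
                new_id[old_start[c, t] + k] = nxt
                nxt += 1
    return new_id


if __name__ == "__main__":
    # 2 colors, 2 threads: color 0 holds 2 + 1 unknowns, color 1 holds 1 + 2.
    print(thread_contiguous_numbering([[2, 1], [1, 2]]))
```

After this permutation every thread works on one contiguous index range, which combines well with first-touch placement because the pages of that range end up in the memory local to the thread.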

Performance improvement: CASE-1 ⇒ CASE-3
Normalized by the best case of Flat MPI

32 nodes, 512 cores, 196,608 DOF/node

CASE-1: NUMA control
CASE-2: + F.T.
CASE-3: + Further Reordering

[Figure: Relative Performance (0.00 to 1.50) of the Parallel Programming Models for Initial, CASE-1, CASE-2, and CASE-3]



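Relative performance at each core count: 32~512 cores

[Figure: Relative Performance (0.00 to 1.25) versus core# (32, 64, 128, 192, 256, 384, 512) for HB 4x4, HB 8x2, HB 16x1]

Normalized by the performance of the best Flat MPI case at each core count.
Hybrid is effective when "the number of cores is large" and "the problem size per core is small".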

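Summary

• Implementation of HID-based parallel preconditioning on T2K (University of Tokyo)
  – Hybrid/Flat MPI, CM-RCM reordering
• Hybrid 4x4 performs the same as, or slightly better than, Flat MPI
• Data locality and contiguous memory access obtained through (reordering + F.T.) greatly improve the performance of Hybrid 8x2 and 16x1
• Hybrid is particularly effective when "the number of cores is large" and "the problem size per core is small"
  – It is a parallel programming model well suited to T2K
  – With even more nodes, the advantage of HB 16x1 may increase
• Future work
  – Higher-level fill-in: BILU(p)
  – Intra-node ordering methods
  – Performance model: none exists yet for Hybrid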

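Outline (current part: HID and parallel multigrid)

• Proposal of an extended HID method
  – Domain decomposition for ill-conditioned problems
• Hybrid parallel programming model
• HID and parallel multigrid (if time permits)

Motivation

• Application of Hierarchical Interface Decomposition (HID) [Henon & Saad, 2007] to parallel multigrid
  – Might it suppress the growth in iteration count even when the smoother is Gauss-Seidel or IC/ILU?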

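Target problem

• Three-dimensional groundwater flow with spatially distributed hydraulic conductivity
  – Poisson equation
  – The hydraulic conductivity is determined by a geostatistical method [Deutsch & Journel, 1998]
• Finite volume method on a regular cubic voxel mesh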

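Governing equations: Darcy flow

$u = -\lambda \dfrac{\partial \phi}{\partial x}, \quad v = -\lambda \dfrac{\partial \phi}{\partial y}, \quad w = -\lambda \dfrac{\partial \phi}{\partial z}$

$\dfrac{\partial u}{\partial x} + \dfrac{\partial v}{\partial y} + \dfrac{\partial w}{\partial z} + q = 0 \;\Rightarrow\; \dfrac{\partial}{\partial x}\left(\lambda \dfrac{\partial \phi}{\partial x}\right) + \dfrac{\partial}{\partial y}\left(\lambda \dfrac{\partial \phi}{\partial y}\right) + \dfrac{\partial}{\partial z}\left(\lambda \dfrac{\partial \phi}{\partial z}\right) = q$

$\lambda$: hydraulic conductivity

$\phi = 0$ at $z = z_{\max}$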

Groundwater flow in a heterogeneous field

[Figure: Homogeneous ⇒ Uniform Flow Field; Heterogeneous ⇒ Random Flow Field]


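Solution method

• Preconditioned CG method
• Multigrid with a Gauss-Seidel based smoother
• Multigrid
  – One coarse mesh cell is generated from 8 fine mesh cells
  – V-cycle
    • Coarsening continues until each domain holds a single mesh cell ⇒ the last level is computed on one process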

Parallel MG: local data structure

[Figure: Fine Level and Coarse Level, each with Internal Meshes and External Meshes]

HID for parallel MG: local data structure, example of domain decomposition

16×16 mesh ⇒ 4×4 domains; the numbers are the HID levels

[Figure: every domain is a 4×4 block of cells. Cells inside a domain are level 1, cells on a separator line between two domains are level 2, and cells where two separator lines cross are level 4:

  2 2 2 4
  1 1 1 2
  1 1 1 2
  1 1 1 2

Domains on the outer boundary of the mesh have no separator on that side, so their last row and/or column stays at level 1.]
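
The level pattern of this example can be generated with a few lines of code. The sketch below assumes, as the figure suggests, that the separator is the last line of cells of each domain in each direction (except for the last domain, which has no neighbour on that side) and that the level of a cell is 1, 2, or 4 depending on how many separator lines pass through it. The function name `hid_levels` is illustrative.

```python
def hid_levels(n_mesh, n_dom):
    """HID level of every cell of an n_mesh x n_mesh mesh cut into n_dom x n_dom domains."""
    size = n_mesh // n_dom

    def on_separator(idx):
        # Last line of a domain, unless the domain lies on the outer boundary.
        return idx % size == size - 1 and idx // size < n_dom - 1

    levels = []
    for j in range(n_mesh):
        row = []
        for i in range(n_mesh):
            hits = on_separator(i) + on_separator(j)  # 0, 1, or 2 separator lines
            row.append((1, 2, 4)[hits])
        levels.append(row)
    return levels


def order_by_level(levels):
    """Cell coordinates (i, j) sorted by ascending level, as used for renumbering."""
    cells = [(lev, j, i) for j, row in enumerate(levels) for i, lev in enumerate(row)]
    return [(i, j) for lev, j, i in sorted(cells)]


if __name__ == "__main__":
    for row in hid_levels(16, 4):
        print(" ".join(str(v) for v in row))
```

For the 16×16 case this gives 169 level-1 cells, 78 level-2 cells, and 9 level-4 cells, so most of the work stays in the fully parallel first level.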

HID for parallel MG: local data structure

[Figure: HID levels at the Fine Level (4×4 cells per domain, rows 2 2 2 4 / 1 1 1 2 / 1 1 1 2 / 1 1 1 2) and at the Coarse Level (2×2 cells per domain, rows 2 4 / 1 2)]

Inter-domain communication: Gauss-Seidel

Original

do iterC= 1, ITERtotC
  do icel= levMGcelINDEX(lev-1)+1, levMGcelINDEX(lev)
    RF= Bmg(icel)
    do j= 1, INLmg(icel)
      RF= RF - ALmg(j,icel)*Xmg(IALmg(j,icel))
    enddo
    Xmg(icel)= RF*DDmg(icel)
  enddo
  call SOLVER-SEND-RECV          ! MPI-ISEND/IRECV
enddo

HID

do iterC= 1, ITERtotC
  do levh= 1, levHIDtot
    do i0= hLEVEL_index(levh-1,lev)+1, hLEVEL_index(levh,lev)
      icel= MIDtoNEW(i0)
      RF= Bmg(icel)
      do j= 1, INLmg(icel)
        RF= RF - ALmg(j,icel)*Xmg(IALmg(j,icel))
      enddo
      Xmg(icel)= RF*DDmg(icel)
    enddo
    call SOLVER-SEND-RECV-HID-LEV  ! MPI-ISEND/IRECV
  enddo
enddo

With reordering: OLD, NEW, MID
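
The difference between the two loops is only where the exchange happens: once per sweep in the original version, once per HID level in the HID version. A compact sketch of the HID variant is given below. It stores the off-diagonal entries per row and the inverse diagonal, as the Fortran code does, and replaces the MPI exchange with a hypothetical `exchange` callback; the function name `gs_sweeps` is illustrative.

```python
def gs_sweeps(order_by_level, offdiag, inv_diag, b, x, sweeps, exchange=None):
    """Gauss-Seidel sweeps that visit the cells level by level.

    order_by_level: list of lists, the cells of HID level 1, 2, ... in visiting order
    offdiag:        dict cell -> list of (column, value) off-diagonal entries
    inv_diag:       inverse of the diagonal entry of every cell
    exchange:       optional callback(level, x) standing in for the per-level
                    exchange of boundary values
    """
    for _ in range(sweeps):
        for lev, cells in enumerate(order_by_level, start=1):
            for i in cells:
                r = b[i]
                for j, a_ij in offdiag[i]:
                    r -= a_ij * x[j]
                x[i] = r * inv_diag[i]
            if exchange is not None:
                exchange(lev, x)
    return x


if __name__ == "__main__":
    # A = [[4, -1, 0], [-1, 4, -1], [0, -1, 4]], exact solution (1, 1, 1).
    offdiag = {0: [(1, -1.0)], 1: [(0, -1.0), (2, -1.0)], 2: [(1, -1.0)]}
    x = gs_sweeps([[0, 2], [1]], offdiag, [0.25] * 3, [3.0, 2.0, 3.0],
                  [0.0] * 3, 30)
    print(x)
```

In the example, cells 0 and 2 play the role of domain interiors (level 1) and cell 1 the role of a separator (level 2), so the separator always sees the freshly updated interior values.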

Results: IBM SP3 and SR11k, up to 256 cores, MGCG

[Figure: elapsed time (sec.) versus DOF (0 to 8×10^7) on IBM SP3 and on Hitachi SR11000/J2. (max/min=10^+6) ▲ MGCG/ORG, △ MGCG/HID. (max/min=10^+8) ■ MGCG/ORG, □ MGCG/HID]

Results: IBM SP3, up to 256 cores, MGCG

IC(0) is the better smoother; moreover, with IC(0) there is no difference between ORG and HID.

[Figure: elapsed time (sec.) versus DOF (0 to 8×10^7). Panels: Gauss Seidel 4/4 and IC(0) 2/2. (max/min=10^+6) ▲ MGCG/ORG, △ MGCG/HID. (max/min=10^+8) ■ MGCG/ORG, □ MGCG/HID]

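Summary

• Choice of smoother
  – At present IC(0)/ILU(0) works best
  – ILUT, ILU(p)
    • AMG library from Fraunhofer
• Effect of HID
  – Pronounced with GS, especially for ill-conditioned problems
  – Almost no effect with IC(0)
    • The iteration count decreases, but because of communication overhead the computation time hardly changes
    • It may be useful for even more ill-conditioned problems
• Multigrid
  – Coarse Grid: global information can be taken into account to some extent (?)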

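Closing remarks

• Preconditioning and domain decomposition that exploit the characteristics of the problem
• Benchmarks that reflect the characteristics of applications
  – Closer to SMASH than matrices alone
• Various technical issues
  – HID (Hierarchical Interface Decomposition)
    • Coupled incompressible NS (with constraints, not positive definite)
  – Intra-node parallelization methods
  – Per-core optimization

SMASH: Science, Modeling, Algorithm, Software, Hardware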
