
VLSI DESIGN 2001, Vol. 13, Nos. 1-4, pp. 51-56
Reprints available directly from the publisher
Photocopying permitted by license only

(C) 2001 OPA (Overseas Publishers Association) N.V.
Published by license under the Gordon and Breach Science Publishers imprint, member of the Taylor & Francis Group.

Cluster-based Parallel 3-D Monte Carlo Device Simulation

ASIM KEPKEP, UMBERTO RAVAIOLI* and BRIAN WINSTEAD

Beckman Institute and Dept. of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

The recent improvements in the performance of commodity computers have created very favorable conditions for building high-performance parallel machines from computer clusters. These are very attractive for 3-D device simulation, which is necessary to model properly carrier-carrier interaction and granular doping effects in deeply scaled silicon devices. We have developed a parallel 3-D Monte Carlo simulation environment customized for clusters using the Message Passing Interface (MPI). The code has been tested on the supercluster of NCSA at the University of Illinois. We present here test results for an n-i-n diode structure, along with an analysis of performance for two different domain decomposition schemes.

Keywords: Monte Carlo; Parallel computation; Device simulation; Semiconductors

1. INTRODUCTION

In recent years there has been a continuous increase in the power of consumer computer products, with a very large reduction in cost. This trend has eliminated most of the difference in performance between specially designed high-performance processors and commodity microprocessors. Also, over the same period networking technology has matured, making robust and high-performance networks available to end users without having to design special-purpose hardware. As a result, clusters of computers using commodity hardware have become a cost-effective way to build very high-performance parallel computers. It has been demonstrated that a computer built this way can reach the performance levels of custom-built supercomputers [1]. The other development that helped promote cluster-based supercomputers is the relatively recent standardization of communication libraries and of their application programming interfaces with high-level languages such as C/C++ and Fortran. Leading among these standards, MPI has become available on virtually all platforms, allowing one to port the same code to other platforms with minimal changes [2].

* Address for correspondence: 3255 Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Mathews Avenue, Urbana, IL 61801, USA. Tel.: +1 217 244 5765, Fax: +1 217 244 4333, e-mail: [email protected]


The pressure to integrate more functionality on a single chip and the demand for faster devices have caused device dimensions to shrink continuously. Device technology has therefore progressed to the point where 3-D geometrical effects, discrete dopant ion distributions, and carrier-carrier interaction must be dealt with. Also, there is an interest in exploring 3-D geometries, where the functionality of the device can be enhanced by the availability of an additional degree of freedom [3, 4]. Although in many cases conventional simulation approaches (the drift-diffusion or hydrodynamic methods) are still useful, Monte Carlo transport simulation is becoming computationally more attractive, since the increase in dimensionality does not change the nature of the approach.

This paper describes the development of a Monte Carlo transport simulation environment for semiconductor devices, targeting implementation on cluster-based supercomputers. The code has been developed around a "message passing" model, appropriate for platforms with relatively high communication costs.

2. DESIGN CONSIDERATIONS

Flexibility for future development has been the main consideration in the design of the code structure, since a range of investigations, device structures, and physical effects need to be supported. As a consequence, any restrictive or ad hoc decision may easily create difficulties in the applicability of the code to other situations. At the heart of the simulation environment is a complete 3-D Monte Carlo transport simulator for silicon, which has evolved from an earlier 2-D full-band Si-MOSFET simulator [5]. The 3-D code has been structured to handle rectangular brick-like domains. The inter-node boundaries are chosen to be planes; without sacrificing generality, this choice helps reduce the computational overhead associated with detecting boundary crossings for the simulated particles. The transport simulator is essentially a meshless design, and it does not depend on a specific choice of force calculation scheme.
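With axis-aligned planar boundaries, the crossing test reduces to a handful of scalar comparisons per particle. A minimal sketch in C (the brick representation and names are illustrative, not the simulator's actual data structures):

```c
/* Minimal sketch (not the simulator's actual data structures):
 * with axis-aligned planar inter-node boundaries, detecting whether
 * a drifted particle has left its sub-domain costs six comparisons. */
#include <stdbool.h>

typedef struct {
    double lo[3];   /* lower corner of this node's brick */
    double hi[3];   /* upper corner */
} Subdomain;

/* True if position p lies outside the brick, i.e., the particle
 * crossed an inter-node boundary during the time-step. */
static bool crossed_boundary(const Subdomain *d, const double p[3])
{
    for (int i = 0; i < 3; ++i)
        if (p[i] < d->lo[i] || p[i] >= d->hi[i])
            return true;
    return false;
}
```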

There are several opportunities for parallelism in the Monte Carlo scheme. Since the dynamics of particles in distant regions are relatively independent of each other, it is possible to decompose the geometry and assign parts of the device to individual processes (or "nodes"), as sketched below. Each of these sub-domains can be treated independently from a computational point of view, with periodic recalculations of the forces, which provide the physical and the computational coupling between different regions. Another inter-domain interaction occurs when a particle crosses a boundary. As a result, one would like to find a partitioning of the device that minimizes inter-domain boundary crossings, to reduce not only the computation but also the communication costs associated with node-to-node particle data transfers. A second source of parallelism is the possibility of spreading the ensemble of particles within a certain domain over a number of processors. This mode of parallelism is not utilized in this particular implementation, since one can split a single domain into multiple domains to spread the computational load. However, this source of parallelism can be advantageous, for instance in symmetric multiprocessor (SMP) systems, which are better suited to a shared-memory programming paradigm, where parallelism across the particles can be exploited in a much more favorable manner than with a true message-passing paradigm. As a result, a mixed-mode parallel code, with node-internal parallelism across the ensemble utilizing shared memory, along with message passing among the nodes, might be an optimal solution. A third source of parallelism lies in the fact that electrons and holes can be treated independently during a time-step, so a two-fold parallelism is available if one uses a bipolar-mode simulation.
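For a regular slab decomposition, finding the process that owns a migrating particle is a direct computation rather than a search over sub-domains. A minimal sketch, assuming uniform slabs along one axis (names hypothetical):

```c
/* Minimal sketch, assuming uniform slabs along x (hypothetical
 * names): the rank owning a particle follows directly from its
 * coordinate, with no search over sub-domains. */
static int owner_rank(double x, double x_min, double x_max, int nprocs)
{
    double slab = (x_max - x_min) / nprocs;  /* slab thickness */
    int rank = (int)((x - x_min) / slab);    /* candidate owner */
    if (rank < 0)        rank = 0;           /* clamp at the contacts */
    if (rank >= nprocs)  rank = nprocs - 1;
    return rank;
}
```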

The bandwidth and latency of the Myrinet network used in the cluster employed for this study are good enough to handle random message exchanges relatively easily. However, in order not to hinder the runtime performance on clusters without such a high-performance network, keeping the communication patterns simple and structured has been an important design consideration.


Therefore, the data structures containing the external device boundaries have been replicated on all nodes. While costing more memory, this allows one to evaluate the dynamics of a particle over a given time-step entirely locally, on the node where the particle was located at the beginning of the time-step. At the end of each time-step a set of messages is exchanged, containing the new state variables for the particles crossing decomposition boundaries.
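A minimal sketch of such a structured end-of-time-step exchange, assuming a 1-D slab decomposition where each node talks only to its two neighbors and particle states are packed into flat arrays of doubles (the packing and function names are illustrative, not the paper's actual code):

```c
#include <mpi.h>

#define STATE_DOUBLES 7   /* e.g., position (3), momentum (3), energy */

/* Exchange crossing particles with one neighbor rank. Counts are
 * exchanged first, so the receiver knows how much data follows; the
 * caller must provide an `in` buffer large enough for the worst case. */
static void exchange_with_neighbor(double *out, int n_out,
                                   double *in, int *n_in,
                                   int neighbor, MPI_Comm comm)
{
    *n_in = 0;  /* stays 0 when neighbor is MPI_PROC_NULL (device edge) */
    MPI_Sendrecv(&n_out, 1, MPI_INT, neighbor, 0,
                 n_in,   1, MPI_INT, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(out, n_out * STATE_DOUBLES, MPI_DOUBLE, neighbor, 1,
                 in, *n_in * STATE_DOUBLES, MPI_DOUBLE, neighbor, 1,
                 comm, MPI_STATUS_IGNORE);
}
```

Because every rank performs this pairwise exchange with its left and right neighbors at the same point in the time-step, the communication pattern stays fixed and deterministic, which is exactly the property that matters on clusters without a high-performance network.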

3. IMPLEMENTATION

The Monte Carlo simulator has been implemented using the Message Passing Interface (MPI) as the underlying parallel communication library. MPI is an industry-standard communications library for parallel programming. There are several implementations available, some of which are system specific, since they are optimized to yield maximum performance on a given architecture.

FIGURE 1 Flow chart for the parallel Monte Carlo simulator. Each node instance reads and parses the input files and allocates memory, initializes the MPI environment, and generates the initial particle state and initial forces. Then, in the main time-step loop, each node evaluates carrier dynamics, scatterings, and boundary interactions and gathers local statistics; exchanges the particles crossing node boundaries with the other node instances (communication); injects or removes particles at the contact boundaries; maps charge onto the grid; and performs the force evaluation by solving Poisson's equation (communication).
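Rendered as a control-flow sketch in C, the per-node program of Figure 1 looks as follows (every function here is a hypothetical placeholder; the actual core is FORTRAN 77 with ANSI C layers, as described below):

```c
/* Figure 1 as a C control-flow sketch; all functions are
 * hypothetical placeholders, stubbed out for compilation. */
static void read_input_and_tables(void)  { /* input files, band structure */ }
static void generate_initial_state(void) { /* particles, initial forces   */ }
static void advance_ensemble(void)       { /* dynamics, scattering,
                                              boundaries, local statistics */ }
static void exchange_crossers(void)      { /* node boundaries (comm.)     */ }
static void apply_contacts(void)         { /* inject/remove at contacts   */ }
static void map_charge_to_grid(void)     { }
static void update_forces(void)          { /* Poisson solve (comm.)       */ }

void run_node(int nsteps)
{
    read_input_and_tables();
    generate_initial_state();
    for (int step = 0; step < nsteps; ++step) {
        advance_ensemble();
        exchange_crossers();
        apply_contacts();
        map_charge_to_grid();
        update_forces();
    }
}
```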


The advantage of using MPI is that the functionality of the code is always retained when the code is re-targeted for a new platform, although the performance may not be optimal. Also, it is easier to deal with heterogeneous clusters, allowing computers with different architectures to be used, so that portions of the code where processor speed is indeed the bottleneck can be run on a higher-performance computer, yielding better overall performance.

The schematic flow chart of the program is shown in Figure 1. After initialization of the individual threads of execution, each branch reads in the input files and the band-structure and post-scattering data tables, followed by initialization of the MPI environment. Each task applies to a sub-domain with rectangular symmetry, defined by the user. This is followed by state initialization for the particles, for instance a charge-neutral equilibrium or the result of a previous simulation. In order to ensure the independence of the random number streams used by different tasks, the random number generator associated with each task is seeded with an offset from a global seed according to the task number. This saves some CPU time over a "leap-frog" random generator, which would increase the CPU time spent in the random number generator linearly with the number of threads. To check the randomness, tests with up to 64 random streams were generated, and the resulting streams were analyzed, showing sufficient independence.
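A minimal sketch of the seeding scheme just described, using the C library generator purely for illustration (the simulator's actual generator is not specified here):

```c
#include <mpi.h>
#include <stdlib.h>

/* Seed each task's generator with a rank-dependent offset from a
 * global seed, so the streams start from different states at O(1)
 * cost, rather than the O(nprocs) cost of a leap-frog generator. */
void seed_task_rng(unsigned long global_seed, MPI_Comm comm)
{
    int task;                        /* task number, 0 .. nprocs-1 */
    MPI_Comm_rank(comm, &task);
    srand((unsigned)(global_seed + (unsigned long)task));
}
```

Simple offsets do not guarantee stream independence in general, which is why the resulting streams were explicitly tested, as noted above.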

After initial validation of the approach on a small cluster of DEC Alpha computers, the code was re-targeted for the NT supercluster at the National Center for Supercomputing Applications (NCSA). This cluster-based multicomputer is built from 64 Hewlett-Packard Kayak XU dual Pentium II processor computers. The high-performance Myrinet network gives users access to unidirectional bandwidths of about 1.2 Gb/s with a small message latency of about 20 μs. A special implementation of MPI based on HPVM, specifically designed for Myrinet, has been used on this platform. MPI-FM is a complete implementation of the MPI 1.0 standard developed for the Fast Messages interface, based on the public-domain MPICH code base.

The core node code is written in FORTRAN 77, with higher-level layers written in ANSI C to enable dynamic memory management as well as easy access to system management functions. This strategy allows one to achieve good performance on a variety of platforms without resorting to intensive platform-specific optimizations.

4. RESULTS AND CONCLUSIONS

An n-i-n structure has been simulated to test the 3-D Monte Carlo parallel implementation. The structure consists of a semiconductor slab, 0.5 μm long with a 0.2 × 0.2 μm cross-section. The contacts located at both ends of the device have an n-doping of 10¹⁸ cm⁻³ and are 0.1 μm long. The mid-section of the device is undoped. The bias in this example is 0.25 V. Simulations have been conducted for two different decomposition schemes to see the impact of the time spent during message exchanges. In the first case, the device is decomposed into slabs which are parallel to the longitudinal axis of the structure (Fig. 2). This configuration minimizes the message exchanges, since the particle velocity is favorably aligned.

FIGURE 2 Simulated structure with (a) sub-domains oriented along the longitudinal axis of the n-i-n structure and (b) sub-domains oriented in the perpendicular direction; Vbias is applied across the end contacts.


This also results in a well-balanced load among the sub-domains. In the second case, the device has been partitioned into sub-domains with boundaries perpendicular to the longitudinal axis of the original structure. In order to keep the load balanced, an attempt is made to keep the number of particles in a given node approximately the same. An ensemble of 40,000 particles was simulated for each case. In Figure 3, we show the speed-up obtained for the two test cases. The performance for the first partitioning scheme (small boundary-crossing frequency) shows an almost linear behavior up to four processors and then saturates. Of course, the saturation point will correspond to a much larger number of processors as the problem size is scaled up to include many more particles and a larger grid. Note that here we are focusing on the parallelism of the transport procedure, and the Poisson equation is solved in scalar form on a single processor. The related cost is only about 20%, and it accounts for the slight deviation from the linear speed-up curve (a quantitative estimate is sketched after this paragraph). In future large-scale simulations of complete MOSFET structures, a parallel version of the Poisson solver will be used, since the related computational cost will increase considerably with respect to the transport part of the code. This is a very important issue when charge-charge interaction and granular doping effects are to be explored. The second decomposition scheme shows markedly degraded performance when more than two processors are used. This is simply a consequence of the large number of boundary crossings along the conduction path, resulting in an enormous communication load. In this case, one should also expect a general improvement if the problem is scaled up, but, as the number of sub-domains increases with the number of processors, the performance should again start to degrade past a certain limit.
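The observed saturation is consistent with a simple Amdahl's-law estimate, treating the scalar Poisson solve as a fixed serial fraction s of the single-processor runtime (a rough model, using the roughly 20% cost quoted above):

```latex
S(p) = \frac{1}{\,s + (1-s)/p\,},
\qquad
\lim_{p\to\infty} S(p) = \frac{1}{s} \approx \frac{1}{0.2} = 5 .
```

Under this model, no matter how many processors are added, the speed-up of the first scheme is bounded near five until the Poisson solver itself is parallelized, which motivates the parallel solver planned for the larger simulations.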

The partitioning scheme for the geometry of the structure is probably the most important factor in achieving good performance. Physical intuition and a good knowledge of the transport conditions can help considerably in the design of a successful scheme. Our conclusion is that the domain decomposition should have the following properties: (i) the volume-to-surface ratio of every sub-domain with a sizable population should be maximized; (ii) the topology of the sub-domains should minimize boundary crossings in the direction of the current flow. For complex structures with asymmetries in the current flow, adaptive domain decomposition schemes would be necessary to achieve high performance, since the current flow patterns may not be easy to guess for the purpose of load balancing.

FIGURE 3 Speed-up as a function of the number of processors for the two partitioning schemes, compared against the ideal linear speed-up.

Acknowledgements

This work was partially supported by the Semiconductor Research Corporation, contract 99-NJ-726, by the NSF Distributed Center for Advanced Electronics Simulation, grant ECS 98-02730, and by NCSA at the University of Illinois.

References

[1] Chien, A. et al. (1999). "Design and Evaluation of an HPVM-Based Windows NT Supercomputer", Int. J. High Performance Computing Applications, 13(3), 201-219.

[2] Gropp, W., Lusk, E. and Skjellum, A. (1994). Using MPI: Portable Parallel Programming with the Message-Passing Interface (MIT Press).


[3] Lorenzini, M. et al. (1999). "Three-dimensional Modeling of the Erasing Operation in Submicron Flash-EEPROM Memory Cells", IEEE Trans. on Electron Dev., 46(5), 975-983.

[4] Tanaka, J. (1996). "Simulation of a High Performance MOSFET with a Quantum Wire Structure Incorporating a Periodically Bent Si-SiO2 Interface", IEEE Trans. on Electron Dev., 43(12), 2185-2198.

[5] Duncan, A., Ravaioli, U. and Jakumeit, J. (1998). "Full-band Monte Carlo Investigation of Hot Carrier Trends in the Scaling of Metal-Oxide-Semiconductor Field-Effect Transistors", IEEE Trans. on Electron Dev., 45(4), 867-876.
