Disconnected Diagrams, Multi-grid, Nvidia & all that†

Upload: hanh

Post on 08-Jan-2016

DESCRIPTION

Disconnected Diagrams, Multi-grid, Nvidia & all that†. Richard Brower (Boston University), James Brannick (Penn), Ron Babich (BU), Kipton Barros (BU), Mike Clark (BU), George Fleming (Yale), James Osborn (Argonne), Claudio Rebbi (BU). QCDNA 2008, Regensburg, Sept 5, 2008.

TRANSCRIPT

  • Disconnected Diagrams, Multi-grid, Nvidia & all that†

    Richard Brower (Boston University), James Brannick (Penn), Ron Babich (BU), Kipton Barros (BU), Mike Clark (BU), George Fleming (Yale), James Osborn (Argonne), Claudio Rebbi (BU)

    QCDNA 2008, Regensburg, Sept 5, 2008

    † WARNING: Much here is a FUTURE plan NOT proven results but .....

  • What is QCD? Acronym: Definition

    QCD: Qualified Charitable Distribution (IRS)

    QCD: Quality, Cost, Delivery

    QCD: Quantum Chromodynamics

    QCD: QuarkCopyDesk (file extension)

    QCD: Quasi-Cyclic Dyadic

    QCD: Quick Change Directory

    QCD: Quick Claim Deed (real estate)

    QCD: Quintessential CD (PC media player)

    QCD: Quit Claim Deed (real estate)

    QCD: Quality Control Department

  • What do these mean? QCD From Wikipedia, the free encyclopedia

    Quintessential Player, formerly known as Quintessential CD

    Quality, Cost, Delivery, a three-letter acronym used in lean manufacturing

    Quad City DJ's, Southern rap group

    Quick Control Dial, a control on many DSLR cameras, like the Canon EOS 40D

    Quote-Comma-Delimited known also as Comma-separated values

    Quantum chromodynamics, the theory describing the Strong Interaction

  • Outline

    Physics: (How strange is the proton?)

    Algorithms: (Multi-grid to the rescue?)

    Hardware: (GPU propagator farm?)

  • Physics: Disconnected Diagrams

    Connected vs. Disconnected

    Want matrix element:

  • How strange† is the proton? Who cares?

    Violation of Standard Model:

    Dark matter (neutralino scattering):

    NuTeV anomaly:

    Nucleon physics (include u/d + s quarks): iso-scalar form factors, nucleon structure function, spin crisis for the proton, matrix element, etc.

    † See Lattice 2008: http://conferences.jlab.org/lattice2008/parallel-bytopic-struct.html

    S. Collins, G. Bali, A. Schafer: Hunting for the strangeness ... nucleon

    Takumi Doi et al.: Strangeness and glue in the nucleon from lattice QCD

    Ron Babich et al.: Strange quark content of the nucleon

  • Direct detection of dark matter

    In SUSY, the neutralino scatters from a nucleon via Higgs exchange:

    The strange scalar matrix element is a major uncertainty:

    Uncertainty in $f_{Ts}$ gives up to a factor of 4 uncertainty in the cross-section! (Bottino et al., hep-ph/0111229; Ellis et al., hep-ph/0502001)

  • Nuclear Experiment

    Pate et al., arXiv:0805.2889 [hep-ex]

    J. Liu et al., arXiv:0706.0226 [nucl-ex] (see also Young et al., nucl-ex/0605010)

    Parity-violating electron scattering (SAMPLE, HAPPEx, PVA4, G0)

    PVES + BNL E734 (νp scattering)

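  • Algorithm

    Monte Carlo update (long autocorrelation times)

    Global heat bath, aka stochastic estimator (zero autocorrelations):

    Find $\phi = D^{-1}\eta$ for Gaussian or gauge or $Z_2$ noise $\eta$ (zero autocorrelations!)

    With $\langle \eta^\dagger_y \eta_x \rangle = \delta_{yx}$, so that noise averages give the matrix elements $A_{xy}$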
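    The stochastic estimator on this slide can be sketched in a few lines. The example below is a minimal illustration, not the authors' code: `toy_operator` is a hypothetical stand-in for the Dirac operator D (a small 1-d periodic Laplacian plus a mass term), the noise is real $Z_2$ so that $\eta^\dagger = \eta^T$, and the dense `solve` replaces the iterative solver a real lattice calculation would use. It compares the noise average of $\eta^T D^{-1} \eta$ with the exact trace of $D^{-1}$.

```python
import random

def solve(D, b):
    """Solve D x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    A = [list(row) + [b[i]] for i, row in enumerate(D)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (A[r][n] - s) / A[r][r]
    return x

def toy_operator(n, mass=0.5):
    """Hypothetical stand-in for D: 1-d periodic Laplacian plus a mass term."""
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        D[i][i] = 2.0 + mass
        D[i][(i + 1) % n] -= 1.0
        D[i][(i - 1) % n] -= 1.0
    return D

def exact_trace_inverse(D):
    """Tr D^{-1} from one point-source solve per site (the expensive way)."""
    n = len(D)
    return sum(solve(D, [1.0 if j == i else 0.0 for j in range(n)])[i]
               for i in range(n))

def z2_trace_estimate(D, n_noise, rng):
    """Average eta^T D^{-1} eta over n_noise independent Z2 noise vectors."""
    n = len(D)
    total = 0.0
    for _ in range(n_noise):
        eta = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        phi = solve(D, eta)                       # phi = D^{-1} eta
        total += sum(e * p for e, p in zip(eta, phi))
    return total / n_noise

rng = random.Random(1234)
D = toy_operator(8)
exact = exact_trace_inverse(D)
estimate = z2_trace_estimate(D, 2000, rng)
print(exact, estimate)
```

    Because each noise vector is drawn independently, successive estimates are uncorrelated; the error falls only as the inverse square root of the number of noise vectors, which is what motivates the variance-reduction ideas on the following slides.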

  • Improving Stochastic Estimate

    Variance reduction:

    Dilution vs hopping parameter† (short distance)

    Multi-grid vs deflation/truncation† (long distance)

    Curing volume divergence

    Trace versus gauge fluctuations

    Better and more sources (all to all?)

    Full multi-grid O(N log N) trace?

    † S. Collins, G. Bali, A. Schafer: Hunting for the strangeness ... nucleon

  • Trace estimation

    Two sources of error: gauge noise and error in trace. In this calculation, we largely eliminate the second source by calculating a nearly exact trace on four time-slices.

    864 sources (×12 for color/spin). A given source is nonzero on 4 sites on each of 4 time-slices.

    Minimal spatial separation between sites is . Small residual contamination is gauge-variant and averages to zero.

    Equivalent to using a single stochastic source with extreme dilution.

    $4 \times 6^3 = 864$
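    The remark that well-separated sources behave like a single stochastic source with extreme dilution can be illustrated on a toy problem. In the sketch below everything is hypothetical: `toy_propagator` is a made-up matrix standing in for $D^{-1}$ that decays with distance on a 16-site periodic line, and one $Z_2$ noise vector is split into interleaved sources. The only contamination comes from pairs of sites inside the same source, so it shrinks as the sites in a source move apart and vanishes when every source holds a single site. In the toy model the contamination averages to zero over the noise, whereas the slide relies on the gauge average.

```python
import random

def toy_propagator(n, decay=0.5):
    """Hypothetical stand-in for D^{-1}: falls off with periodic distance."""
    return [[decay ** min(abs(i - j), n - abs(i - j)) for j in range(n)]
            for i in range(n)]

def diluted_trace_estimate(A, n_groups, rng):
    """Trace estimate from one Z2 noise vector split into n_groups sources."""
    n = len(A)
    eta = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    total = 0.0
    for g in range(n_groups):
        sites = range(g, n, n_groups)   # sites on which source g is nonzero
        # eta_g^T A eta_g, summing only over the nonzero entries of source g
        total += sum(eta[i] * A[i][j] * eta[j] for i in sites for j in sites)
    return total

rng = random.Random(7)
A = toy_propagator(16)
exact = sum(A[i][i] for i in range(16))
# 1 group = no dilution, 4 groups = sites 4 apart, 16 groups = one site each
errors = {k: abs(diluted_trace_estimate(A, k, rng) - exact) for k in (1, 4, 16)}
print(errors)
```

    With 16 groups the estimate is the exact trace, at the cost of one solve per site; intermediate dilution trades the number of solves against the size of the residual contamination.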

  • Preliminary Methods

    Configurations were provided by the LHPC Spectrum Collaboration

    anisotropic lattice with 2 dynamical flavors, Wilson fermion and gauge actions

    863 configurations

    64 (×12) inversions per configuration at the light quark mass, for the nucleon correlators

    864 (×12) inversions per configuration at the strange mass, for the trace

  • Strange scalar form factor

  • Ratio approach

    Conventionally, one extracts the (e.g. zero-momentum) form factor from the large-t behavior of the ratio

    (or from a similar expression integrated over time).

    Instead, we fit the numerator directly, since this allows us

    to avoid contamination from backward-propagating states, which are problematic due to the short temporal extent of our lattice ( ).

    to explicitly take into account the contribution of (forward-propagating) excited states.

    In the following, we always treat the system symmetrically with

  • Direct fit

    First, we perform a fit to the nucleon two-point function, of the form

    The coefficients and masses are very well-determined, since we are required to calculate correlators from all initial times (a total of 863 × 64 = 55,232).

    Next, we perform a fit to the three-point function,

    Here $j_1$ and $j_2$ are the form factors for the proton and its first excited state, and $j_{12}$ is a transition matrix element between them. In practice, we expect $j_2$ and $j_{12}$ to absorb the contribution of still higher states, and trust only $j_1$ to be reliable.
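    The fit forms themselves are not reproduced in this transcript. A standard two-state parametrization that matches the description would look like the following; this is an assumption for illustration, not necessarily the authors' exact form. Here $c_i$ and $m_i$ are the coefficients and masses from the two-point fit, $t$ is the source-sink separation, and $t'$ is the time of the operator insertion (all three names are chosen here for the sketch).

```latex
\[
C_2(t) = c_1\, e^{-m_1 t} + c_2\, e^{-m_2 t}
\]
\[
C_3(t, t') = j_1\, c_1\, e^{-m_1 t} + j_2\, c_2\, e^{-m_2 t}
  + j_{12}\, \sqrt{c_1 c_2}\,
    \left[ e^{-m_1 t'} e^{-m_2 (t - t')} + e^{-m_2 t'} e^{-m_1 (t - t')} \right]
\]
```

    In a form like this the $c_i$ and $m_i$ are fixed by the two-point fit, so the three-point fit only has to determine $j_1$, $j_2$ and $j_{12}$.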

  • Strange scalar form factor

    For the renormalization-invariant quantity $f_{Ts}$, we estimate

    where we have inserted the physical nucleon mass. The first error is statistical, the second error is the uncertainty in relating this mass to the lattice scale, and no other systematics are included.

    Note that the matrix element in the numerator was calculated for a world with a 400 MeV pion. If we work consistently in such a world by inserting our calculated nucleon mass, the scale dependence drops out, and we find

  • Momentum dependence of $G_S(q^2)$

    PRELIMINARY

  • Strange axial form factor

    Results have not been renormalized.

    Calculated value is distinct from zero at the 3σ level.

    PRELIMINARY

  • Error = $O(L^{3/2}) \to \infty$ as $L^3 \to \infty$

    For an exact trace in a connected correlator,

  • Most Important New Trick: Multi-grid Variance Reduction

    The signal and variance of the first term are down by 1 to 2 orders of magnitude because $D_c \approx D$

    The coarse-level trace for $D_c^{-1}$ is as cheap to calculate as the level-down operator inverse.

    This can of course be done recursively, giving (I think) an O(N log N) trace calculation to fixed tolerance.
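    The decomposition behind this trick is not shown in the transcript. One way to write a two-level split consistent with the text is given below, as an assumption for illustration: $P$ is a hypothetical prolongator from the coarse to the fine lattice and $D_c$ is taken to be the Galerkin coarse operator.

```latex
\[
\operatorname{Tr} D^{-1}
  = \operatorname{Tr}\!\left[ D^{-1} - P\, D_c^{-1} P^\dagger \right]
  + \operatorname{Tr}\!\left[ D_c^{-1} P^\dagger P \right],
\qquad D_c = P^\dagger D\, P
\]
```

    The identity follows from adding and subtracting $P D_c^{-1} P^\dagger$ and using cyclicity of the trace. If $P D_c^{-1} P^\dagger$ captures the long-distance part of $D^{-1}$, the first term is small and can be estimated stochastically with little noise, while the second term lives entirely on the coarse lattice, where the same split can be applied again.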

  • HARDWARE

    Graphics hardware is well suited to highly parallel numerical tasks. Hardware vendors provide development tools to support high performance computing. NVIDIA's CUDA offers direct access to graphics hardware through a programming language similar to C.

    The Dirac-Wilson operator runs at an effective 68 Gigaflops on the Tesla C870 GPU. On the recently released GTX 280 GPU it runs at 92 Gigaflops, and we expect improvement pending code optimization. (Now 98 Gigaflops; hope to get O(150) Gigaflops.)

  • Nvidia GPU architecture

  • Two Generations: Consumer vs HPC GPUs

    Consumer cards ⇒ High Performance (HPC) GPUs

    I. 8800 GTX ⇒ Tesla C870 (16 multi-processors with 8 cores each)

    II. GTX 280 ⇒ Tesla C1060 (30 multi-processors with 8 cores each)

  • C870 code using 60% of the memory bandwidth.

  • http://www.scala-lang.org/

  • Future software plans

    Need to find out why we are only saturating 60% of memory bandwidth

    Further reduce memory traffic: 8 real numbers per SU(3) matrix (2/3 of the 12 used now); share spinors in $4^3$ blocks (5/9 of that used now)

    Generalize to clover Wilson & Domain Wall operator (slightly better flops/mem ratio).

    DMA between GPU on Quad system and network for cluster

    Start to design SciDAC API for many-core technologies.
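    The 12-number storage mentioned in the plans above relies on a standard property of SU(3): the third row of the matrix is the complex conjugate of the cross product of the first two rows, so only two rows (6 complex = 12 real numbers) need to be read from memory. The sketch below assumes that this is the layout in use; `example_su3` is a hypothetical helper that only builds a test matrix, and the 8-number parametrization is not shown.

```python
import cmath
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def example_su3(theta, phi, alpha, beta):
    """Hypothetical test matrix: real rotations around a unit-determinant phase matrix."""
    c1, s1 = math.cos(theta), math.sin(theta)
    c2, s2 = math.cos(phi), math.sin(phi)
    rot12 = [[c1, s1, 0.0], [-s1, c1, 0.0], [0.0, 0.0, 1.0]]
    rot23 = [[1.0, 0.0, 0.0], [0.0, c2, s2], [0.0, -s2, c2]]
    phases = [[cmath.exp(1j * alpha), 0.0, 0.0],
              [0.0, cmath.exp(1j * beta), 0.0],
              [0.0, 0.0, cmath.exp(-1j * (alpha + beta))]]
    return matmul(matmul(rot23, phases), rot12)

def reconstruct_third_row(row1, row2):
    """Third row of an SU(3) matrix: complex conjugate of row1 x row2."""
    a, b = row1, row2
    return [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]

u = example_su3(0.3, 1.1, 0.7, -0.4)
stored = u[0] + u[1]                 # 6 complex numbers = 12 reals kept in memory
third = reconstruct_third_row(stored[:3], stored[3:])
max_error = max(abs(x - y) for x, y in zip(third, u[2]))
print(max_error)
```

    The reconstruction costs a few extra multiplications per link, which is a good trade on hardware where the kernel is limited by memory bandwidth rather than arithmetic.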

  • Tesla 10-Series: What's the Big Deal?

  • Consumer Chip GTX 280 ⇒ Tesla C1060

  • 1 U Quad S1070 System $8K

  • CUDA 2.0 (Compute Unified Device Architecture)

    Can compile CUDA code into highly efficient SSE-based multi-threaded C code

  • Need a GPU Dirac Propagator Farm

    The Clark-Kennedy RHMC Paradox: (The faster you go, the harder it is to keep up)

    Analysis is now the Achilles' heel

    Solution: Dedicated Analysis farm.

    GPU can deliver O(10) to O(100) gain in flops/$

    Two quad Tesla ⇒ 1 sustained Teraflop!

    Two quad Tesla @ 25K vs. one BG/L rack @ 2,000K

  • Commercial Break: BOSTON POST DOC IN SEPT 2009

    PetaApps/SciDAC fellow (QCDNA in Boston Fall 2009?)