Weather and Climate


Among the fields that use supercomputers, one of the best known in Korea is the Korea Meteorological Administration. Weather prediction weaves together many disciplines: not only meteorology and atmospheric science, but also fluid dynamics, thermodynamics, and related fields. Recently, a great deal of research has gone into accelerating weather prediction systems with CUDA.

 

Contents

  • NCAR success story
  • Overview of weather software (from Wikipedia)
  • Introduction to the WRF CUDA port

 


 

The Challenge


From powerful hurricanes to global climate change, weather is the uncontrollable factor with the greatest influence on our daily lives and, at times, our survival. Early weather forecasts, however, are becoming faster and more accurate, giving people time to prepare. Scientists at NCAR (the National Center for Atmospheric Research) have developed sophisticated forecast models for immediate, short-range, and long-range weather conditions.

The WRF (Weather Research & Forecasting) model is used widely around the world, chiefly by national weather services, air force weather units, foreign meteorological agencies, and commercial forecasting companies. NCAR's climate and weather models are evolving from terascale (one trillion FLOPS) to petascale applications, which exceeds the limits of conventional computing clusters: adding more CPUs no longer improves speed. This is a particular problem when forecasts carry real-time requirements or hard deadlines.


The Solution


Searching for a way to improve overall forecast speed and accuracy, NCAR engineers and researchers at the University of Colorado at Boulder decided to use NVIDIA GPU computing. After applying NVIDIA® CUDA™, the microphysics computations, the most computationally intensive part of WRF, ran 10 times faster. Although microphysics accounts for only about 1% of the model's source code, porting that code to CUDA speeds up the entire model by roughly 20%.

"기존 클러스터상에서 시간이 촉박한 예보를 계산하는데 더 이상 방법이 없었던 상황에서 이와 같은 결과는 매우 고무적이라고 하겠습니다." 라고 NCAR의 WRF 수석 소프트워어 개발자 존 미카레이크는 말했습니다. "엔비디아 GPU 컴퓨팅 테크놀로지를 WRF에 더 많이 적용해서 예보에 걸리는 시간을 두 배로 단축하는 것이 목표입니다. 이 기술이 날씨 및 기후 모델링에 큰 영향을 미칠 것으로 기대합니다."


The Impact


NCAR and the many organizations around the world that rely on WRF can now forecast real-time weather events and their evolution more quickly, helping people get out of harm's way.

 

 

The following is material from Wikipedia on weather forecasting. Of the various weather prediction models, only WRF is currently being ported to CUDA.

 

 

Atmospheric model

An atmospheric model is a mathematical model constructed around the full set of primitive dynamical equations which govern atmospheric motions. It can supplement these equations with parameterizations for turbulent diffusion, radiation, moist processes (clouds and precipitation), heat exchange, soil, vegetation, surface water, the kinematic effects of terrain, and convection. Most atmospheric models are numerical, i.e. they discretize equations of motion. They can predict microscale phenomena such as tornadoes and boundary layer eddies, sub-microscale turbulent flow over buildings, as well as synoptic and global flows. The horizontal domain of a model is either global, covering the entire Earth, or regional (limited-area), covering only part of the Earth.

 

 

Global models

Some of the better known global numerical models, such as IFS, GEM, GFS, NOGAPS, UM, GME, and ARPEGE, are listed in the model summary near the end of this section.

Regional models

Some of the better known regional numerical models are:

  • WRF The Weather Research and Forecasting Model was developed cooperatively by NCEP and the meteorological research community. WRF has several configurations, including:

      • WRF-NMM The WRF Nonhydrostatic Mesoscale Model is the primary short-term weather forecast model for the U.S., replacing the Eta model.

      • AR-WRF Advanced Research WRF, developed primarily at the U.S. National Center for Atmospheric Research (NCAR). WRF Source Code

  • NAM The term North American Mesoscale model refers to whatever regional model NCEP operates over the North American domain. NCEP began using this designation system in January 2005. Between January 2005 and May 2006 the Eta model (begun in Yugoslavia (now Serbia) during the 1970s by Zaviša Janjić and Fedor Mesinger) used this designation. Beginning in May 2006, NCEP began to use the WRF-NMM as the operational NAM.
  • RAMS the Regional Atmospheric Modeling System, developed at Colorado State University for numerical simulations of atmospheric meteorology and other environmental phenomena on scales from meters to hundreds of kilometers; now supported in the public domain, with the RAMS source code available under the GNU General Public License
  • MM5 the Fifth Generation Penn State/NCAR Mesoscale Model MM5 Source Code download
  • ARPS the Advanced Regional Prediction System, developed at the University of Oklahoma, is a comprehensive multi-scale nonhydrostatic simulation and prediction system that can be used for regional-scale weather prediction down to tornado-scale simulation and prediction. Advanced radar data assimilation for thunderstorm prediction is a key part of the system. The source code of ARPS is freely available.
  • HIRLAM High Resolution Limited Area Model
  • GEM-LAM Global Environmental Multiscale Limited Area Model, the high resolution (2.5 km) GEM by the Meteorological Service of Canada (MSC)
  • ALADIN The high-resolution limited-area hydrostatic and non-hydrostatic model developed and operated by several European and North African countries under the leadership of Météo-France (ALADIN Community web pages)
  • COSMO The COSMO Model, formerly known as LM, aLMo or LAMI, is a limited-area non-hydrostatic model developed within the framework of the Consortium for Small-Scale Modeling (Germany, Switzerland, Italy, Poland and Greece).

Climate models

Climate models use quantitative methods to simulate the interactions of the atmosphere, oceans, land surface, and ice. They are used for a variety of purposes from study of the dynamics of the climate system to projections of future climate.

All climate models take account of incoming energy as short wave electromagnetic radiation, chiefly visible and short-wave (near) infrared, as well as outgoing energy as long wave (far) infrared electromagnetic radiation from the earth. Any imbalance results in a change in the average temperature of the earth.

The most talked-about models of recent years have been those relating temperature to emissions of carbon dioxide (see greenhouse gas). These models project an upward trend in the surface temperature record, as well as a more rapid increase in temperature at higher altitudes.

Models can range from relatively simple to quite complex:

  • A simple radiant heat transfer model that treats the earth as a single point and averages outgoing energy (a worked example follows this list)
  • This can be expanded vertically (radiative-convective models) or horizontally
  • Finally, (coupled) atmosphere–ocean–sea ice global climate models discretise and solve the full equations for mass and energy transfer and radiant exchange.
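As a worked illustration of the first, simplest case in the list above, a zero-dimensional energy-balance model equates absorbed shortwave and emitted longwave radiation for an Earth treated as a single point (the numbers are standard textbook values, not taken from this page):

(1-\alpha)\,\frac{S}{4} \;=\; \epsilon\,\sigma\,T^{4}
\qquad\Longrightarrow\qquad
T \;=\; \left[\frac{(1-\alpha)\,S}{4\,\epsilon\,\sigma}\right]^{1/4}

With a solar constant S ≈ 1361 W m⁻², planetary albedo α ≈ 0.3, emissivity ε = 1, and σ = 5.67 × 10⁻⁸ W m⁻² K⁻⁴, this gives T ≈ 255 K, about 33 K colder than the observed mean surface temperature; resolving that difference (the greenhouse effect) is what the more complex models in this list do explicitly.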

This is not a full list; for example "box models" can be written to treat flows across and within ocean basins. Furthermore, other types of modelling can be interlinked, such as land use, allowing researchers to predict the interaction between climate and ecosystems.

 

Model summary

Climate: IGCM · HadCM3 · HadGEM1 · GFDL CM2.X · CGCM · CCSM · ECHAM

Global weather: IFS · GEM · GFS · NOGAPS · UM · JMA · GME · ARPEGE

Regional and mesoscale atmospheric: MM5 · NAM · RUC · RAMS · WRF · RAQMS · HIRLAM · LAPS

Regional and mesoscale oceanographic: HyCOM · ROMS

Atmospheric dispersion: NAME · MEMO · ADMS 3 · ATSTEP · AUSTAL2000 · CALPUFF · DISPERSION21 · ISC3 · MERCURE · PUFF-PLUME · RIMPUFF · SAFE AIR · AERMOD

Chemical transport: CLaMS · MOZART

Land surface parameterisation: JULES · CLASS · ISBA

Discontinued: AVN / MRF / GSM · Eta · LFM · NGM

 

 

The following CUDA-related research material is from NCAR's WRF GPU acceleration page at http://www.mmm.ucar.edu/wrf/WG2/GPU/.

 

GPU Acceleration of WSM5 Microphysics

 

John Michalakes, National Center for Atmospheric Research

Manish Vachharajani, University of Colorado at Boulder

 

Introduction

This page contains most recent results updating the work published in the proceedings of the workshop on Large Scale Parallel Processing (LSPP) within the IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2008, Miami, Florida, (see References, below).

Recent results               

The figure below is an update of the figure that appears in the LSPP proceedings and the PPL paper. It shows the near doubling of performance over the results presented in the LSPP and PPL papers for the NVIDIA Quadro 5600 GPU using a code optimization, explained below, and also results for the newer NVIDIA GeForce GTX 280 GPU on one of the nodes of the University of Illinois’ GPU cluster, qp.ncsa.uiuc.edu. Performance of the WSM5 microphysics kernel on the GPU and on several conventional CPUs is shown in the figure below.  The first four bars from the left show performance of the kernel running on a single core. The next two bars show performance on all 4 cores; in other words, socket performance.  The last two bars show performance on the GPU itself, and then effective performance once CPU-GPU transfer cost is accounted for.  The Harpertown and Nehalem (workstation version) results were contributed by Roman Dubtsov of Intel Corp. in December 2008.

The GTX 280 achieves nearly 65 Gf/s on the optimized WSM5 microphysics kernel, but it is also clear that a much larger fraction of performance is being lost to data transfer overhead. Much of this should be amortizable as more weather model kernels are adapted to run on the GPU and reuse the model state data there without moving it back and forth from the CPU. The newest results (since December 2008) for this kernel show that conventional multi-core processors are closing the gap with GPU performance (or have already closed it, if transfer costs are included).

Two optimizations to the WSM5 microphysics kernel provided significant performance improvements since the first results presented in April 2008 at the LSPP workshop. The first optimization was to use the -use_fast_math option to the nvcc compiler, which causes higher-level mathematical intrinsic operations such as square root, log, and exponent to be computed in hardware, with modestly reduced precision, in the Special Function Units (SFUs) on the GPU. The WSM5 kernel, which uses these intrinsics heavily, ran about 1.25 times faster with -use_fast_math on the Quadro 5600 GPU results presented earlier.
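As a rough, generic illustration of what that flag does (this is not the WSM5 code; the kernel and constants below are invented), the single-precision expf/logf/sqrtf calls in a kernel like the one below are lowered to SFU intrinsics such as __expf and __logf when nvcc is given -use_fast_math:

// fastmath_demo.cu
// Build normally:        nvcc -O3 fastmath_demo.cu -o fastmath_demo
// Build with fast math:  nvcc -O3 -use_fast_math fastmath_demo.cu -o fastmath_demo
// With -use_fast_math, the single-precision expf/logf/sqrtf calls below are
// computed by the GPU's Special Function Units with modestly reduced precision.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void transcendental_mix(const float *t, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Contrived mix of the intrinsics the text says WSM5 uses heavily.
        float x = t[i];
        out[i] = expf(17.67f * x / (x + 243.5f)) + logf(1.0f + x) + sqrtf(x);
    }
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n], *d_in, *d_out;
    for (int i = 0; i < n; ++i) h[i] = 0.01f * (i % 100) + 1.0f;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);
    transcendental_mix<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", h[0]);
    cudaFree(d_in); cudaFree(d_out); delete[] h;
    return 0;
}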

The second optimization involved rewriting the kernel to eliminate temporary arrays used to store results between successive loops over the vertical dimension of the WRF domain. Since there were no loop-carried dependencies over these vertical-dimension loops, they could be jammed into a single large loop. The results that had been stored in these temporary arrays in the GPU's slow DRAM memory could then be kept in registers instead. A side effect of this restructuring was increased per-thread register usage, which limits the number of threads per block. Performance on the older Quadro 5600 improved by a factor of 1.67, while the newer GTX 280, which has twice as many registers as the older GPU, achieved a significantly greater increase from this optimization.
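The general shape of that restructuring, sketched on an invented thread-per-column kernel rather than the real WSM5 code, looks roughly like this; the "before" form is kept as a comment:

// loop_jamming_sketch.cu -- illustrative only; not the actual WSM5 kernel.
// One thread handles one vertical column (thread-per-column decomposition).
#include <cuda_runtime.h>

#define KMAX 35   // number of vertical levels (hypothetical)

// BEFORE (schematic): two separate k-loops communicate through a temporary
// array that lives in slow local/DRAM memory.
//
//   float tmp[KMAX];
//   for (int k = 0; k < KMAX; k++) tmp[k] = a[col*KMAX + k] * b[col*KMAX + k];
//   for (int k = 0; k < KMAX; k++) out[col*KMAX + k] = tmp[k] + c[col*KMAX + k];

// AFTER: because there are no loop-carried dependencies in k, the loops are
// jammed into one, and the intermediate value stays in a register.  The price,
// as noted above, is higher per-thread register pressure.
__global__ void column_kernel(const float *a, const float *b, const float *c,
                              float *out, int ncols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncols) return;
    for (int k = 0; k < KMAX; k++) {
        int idx = col * KMAX + k;
        float tmp = a[idx] * b[idx];   // held in a register, never touches DRAM
        out[idx] = tmp + c[idx];
    }
}

int main()
{
    const int ncols = 1024;
    size_t bytes = (size_t)ncols * KMAX * sizeof(float);
    float *a, *b, *c, *out;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes); cudaMalloc(&out, bytes);
    cudaMemset(a, 0, bytes); cudaMemset(b, 0, bytes); cudaMemset(c, 0, bytes);
    column_kernel<<<(ncols + 127) / 128, 128>>>(a, b, c, out, ncols);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}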

GPU WSM5 Code for Standalone and in WRF model

The code and data for the CUDA implementation of the WSM5 microphysics is available in a standalone test configuration that can be downloaded from this link. There are README and SAMPLE_SESSION files included.

The GPU WSM5 microphysics that comes with the standalone configuration may also be used when compiling the full WRF model (version 3.0.1 and later) for GPU accelerated WSM5 microphysics. Copy wsm5.cu.o and wsm5_gpu.cu.o into the phys directory of WRF, then:

  1. Edit the makefile in the standalone driver source directory and set MKX to the correct number of levels in the WRF configuration you will be running.  Also set XXX and YYY for the correct number of threads per block for your GPU (see comment in makefile).
  2. Type ‘make’ in the standalone driver directory. Copy the resulting object file wsm5_gpu.cu.o into the phys directory of the WRF model.
  3. In the top-level WRF directory, run the configure script and generate a configure.wrf file for your system and configuration (must be Linux with Intel or Gnu C, C++, and Fortran and you must have CUDA 1.1 or higher installed).
  4. Add -DRUN_ON_GPU to ARCH_LOCAL in configure.wrf
  5. Add ../phys/wsm5.cu.o and ../phys/wsm5_gpu.cu.o to LIB_LOCAL in configure.wrf (define LIB_LOCAL if it does not already exist)
  6. If configuring wrf for serial or sm, you need to also include a definition for rsl_internal_microclock. See attached note.
  7. Add -L/usr/local/cuda/lib -lcuda -lcudart to LIB_LOCAL (or the appropriate path for the CUDA libraries on your system)
  8. Compile and run WRF (see www.mmm.ucar.edu/wrf/users for other information)

To run, make sure that WSM5 physics is selected in the namelist.input file (mp_physics option #4) and then run on a GPU-enabled node or nodes. You can use the GPU WSM5 option with WRF when it is configured serial or dmpar (distributed memory parallel). The code assumes that there is one GPU device available per MPI task if you have compiled WRF with the dmpar option. The GPU WSM5 microphysics option is not currently supported for smpar or dm+sm (OpenMP or hybrid OpenMP & MPI) configurations of WRF.

 

GPU Acceleration of Scalar Advection

 

John Michalakes, National Center for Atmospheric Research

Manish Vachharajani, University of Colorado at Boulder

 

Introduction

Tracer advection models the transport of atmospheric constituent fields called tracers, species, or scalars (meaning they are single-valued, not a vector field like wind velocity) under the forcing of the wind fields forecast by the WRF model's Runge-Kutta dynamics (Skamarock et al. 2008). Examples of tracers in WRF include concentrations of various types of moisture found in the atmosphere (vapor, cloud, rain, etc.), chemical concentrations (e.g. pollutants, reactive species in a wildfire, etc.), or other 3-dimensional fields.

Tracer advection for this benchmark is the WRF 5th-order positive-definite advection scheme (Wicker and Skamarock, 2005). Advection is called once for each Runge-Kutta iteration in the WRF solver. It is a split scheme, advecting each dimension of the domain separately and using finite-difference approximations to compute adjustments, or "tendencies", that are later applied as updates to each grid cell of each field. Positive definiteness, which prevents negative concentrations, is enforced conservatively and monotonically as an adjustment at the end of the advection scheme (Skamarock and Weisman, 2008). The tracer fields and their tendencies in WRF are organized in 4-dimensional tracer arrays, the outer index of which is over the tracers, allowing them to be advected and diffused en masse in a loop over this outer index.
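A deliberately simplified sketch of that structure is shown below. It substitutes a first-order upwind difference for WRF's actual 5th-order positive-definite scheme and invents the grid dimensions, but it shows the per-dimension tendency computation and the outer loop over the tracer index:

// advect_sketch.cu -- illustrative stand-in for the advection structure, not WRF code.
#include <cuda_runtime.h>

#define NX 134   // hypothetical grid dimensions (loosely following the 134x110x35 case below)
#define NY 110
#define NZ 35

// First-order upwind tendency in x for a single tracer (WRF really uses a
// 5th-order positive-definite scheme; this only illustrates the structure).
__global__ void x_tendency(const float *q, const float *u, float *tend, float dx)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (j,k) line in x
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (j >= NY || k >= NZ) return;
    for (int i = 1; i < NX - 1; i++) {
        int idx = (k * NY + j) * NX + i;
        float vel  = u[idx];
        float dqdx = (vel > 0.0f) ? (q[idx] - q[idx - 1]) / dx
                                  : (q[idx + 1] - q[idx]) / dx;
        tend[idx] += -vel * dqdx;    // accumulate tendency, applied to the field later
    }
}

int main()
{
    const int ntracers = 81;                        // as in the WRF-Chem case described below
    size_t bytes = (size_t)NX * NY * NZ * sizeof(float);
    float *q, *u, *tend;
    cudaMalloc(&q, bytes); cudaMalloc(&u, bytes); cudaMalloc(&tend, bytes);
    cudaMemset(q, 0, bytes); cudaMemset(u, 0, bytes);
    dim3 block(16, 4), grid((NY + 15) / 16, (NZ + 3) / 4);
    // Loop over the outer tracer index, advecting each field in turn.
    for (int t = 0; t < ntracers; t++) {
        cudaMemset(tend, 0, bytes);
        x_tendency<<<grid, block>>>(q, u, tend, 27000.0f);   // 27 km grid spacing
        // ... y and z passes and the positive-definite adjustment would follow here ...
    }
    cudaDeviceSynchronize();
    cudaFree(q); cudaFree(u); cudaFree(tend);
    return 0;
}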

Cost of tracer advection scales linearly with the number of tracers. Advection of five moisture tracers in a typical WRF run accounts for about ten percent of total run time on a single processor. However, with atmospheric chemistry turned on, the number of tracers is an order of magnitude larger: 81 fields for the test case in this benchmark. Because of this large number of tracers, advection dominates the WRF meteorology's contribution to the cost of a WRF-Chem simulation; however, the additional cost of the chemistry module is a factor of 6 greater than the cost of the weather model that drives it (see the WRF-Chem GPU benchmark page).

Methodology

WRF-Chem was configured to run a large case: 27km resolution 134x110x35 grid over the Eastern U.S. that was part of a large air-quality field campaign in 2006.  The input to the RK_SCALAR_TEND routine in the WRF ARW solver was written out at each time step, generating a series of sample input files for the standalone benchmark.  The inputs include current and previous RK step values for the 81 chemical tracers in the case, wind velocities, density, and other 3- and 2-dimensional inputs.  The call tree rooted at RK_SCALAR_TEND was then extracted from the WRF code and called within a small driver, producing a standalone benchmark driver for the original Fortran, or “vanilla” version of the scalar advection module.  Once this was completed, the RK_SCALAR_TEND call tree was converted to C, then CUDA, then optimized for the GPU, and included in the benchmark distribution (below) as the “chocolate” version of the code.

Details of CUDA implementation

Tracer advection has several characteristics that make getting good performance more difficult than with the cloud microphysics kernel:

  • Computational intensity is much lower: only 0.76 floating point operations per memory access compared with 2.2 for microphysics,
  • Unlike microphysics, there are data-dependencies between threads in a thread-per-column decomposition, because of the 5th order finite difference stencil width, and
  • The 3-dimensional tracer fields must be transferred between the CPU and GPU.

On the plus side, there are potentially many tracers to be advected, and ample opportunity for overlapping CPU-GPU data transfers with computation using double buffering and the asynchronous forms of CUDA copy calls and kernel invocations. The final CUDA implementation of scalar advection uses textures, a mechanism from graphics that can be used to cache normally high-latency device memory accesses in the shape of the stencil. In CUDA version 1, these textures could only be 2-dimensional, and the first versions of this scalar advection scheme had to slice and double-buffer the domain into 2-D slabs. With CUDA version 2 there is now also 3-dimensional texturing, simplifying the code and reducing the number of kernel invocations and data transfer operations by a factor of n.
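A minimal sketch of a 3-D texture read of that kind, using the legacy texture-reference API of the CUDA 2.x era (since removed in CUDA 12) and invented field and dimension names rather than the benchmark's code:

// tex3d_sketch.cu -- legacy texture-reference API (CUDA 2.x era; removed in CUDA 12).
#include <cuda_runtime.h>

#define NX 134
#define NY 110
#define NZ 35

// Texture references are declared at file scope and fixed at compile time.
texture<float, 3, cudaReadModeElementType> texQ;

// Read a 5-point stencil in x through the texture cache.
__global__ void stencil_x(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 2 || i >= NX - 2 || j >= NY) return;
    for (int k = 0; k < NZ; k++) {
        float s = 0.0f;
        for (int d = -2; d <= 2; d++)
            s += tex3D(texQ, i + d + 0.5f, j + 0.5f, k + 0.5f);
        out[(k * NY + j) * NX + i] = s / 5.0f;
    }
}

int main()
{
    // Stage the tracer field in a 3-D cudaArray and bind the texture to it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaExtent extent = make_cudaExtent(NX, NY, NZ);
    cudaArray *arr;
    cudaMalloc3DArray(&arr, &desc, extent);

    float *h = new float[NX * NY * NZ]();            // zero-initialized host field
    cudaMemcpy3DParms p = {};
    p.srcPtr   = make_cudaPitchedPtr(h, NX * sizeof(float), NX, NY);
    p.dstArray = arr;
    p.extent   = extent;
    p.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);
    cudaBindTextureToArray(texQ, arr, desc);

    float *d_out;
    cudaMalloc(&d_out, NX * NY * NZ * sizeof(float));
    dim3 block(32, 4), grid((NX + 31) / 32, (NY + 3) / 4);
    stencil_x<<<grid, block>>>(d_out);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texQ);
    cudaFreeArray(arr); cudaFree(d_out); delete[] h;
    return 0;
}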

A final improvement in performance was made by using ‘pinned’ memory on the host side, which increases the speed of CPU-GPU transfers and improves the CUDA implementation of scalar advection by a factor of 1.25. Pinning can be enabled or disabled in the benchmark code by defining or not defining -DPINNING in the makefile [note: turning off pinning doesn't work in the current benchmark code]. For more information on pinning in CUDA, see here and the CUDA manuals.
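The pinning pattern itself is generic CUDA; a sketch with invented buffer names and sizes is below. Page-locked host memory allocated with cudaHostAlloc (or cudaMallocHost) both speeds up CPU-GPU transfers and is what allows cudaMemcpyAsync to overlap copies with kernel execution in a stream:

// pinned_transfer_sketch.cu -- generic pinned-memory / asynchronous-copy pattern.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);

    float *h_pinned, *d_buf;
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);   // page-locked host buffer
    cudaMalloc(&d_buf, bytes);
    for (int i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copies need pinned host memory to actually overlap with kernels.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, 2.0f);
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("h_pinned[0] = %f\n", h_pinned[0]);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}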

The diagram below shows the steps involved in the GPU implementation of tracer advection. Light aquamarine blocks are executed on the CPU; the lavender blocks are kernel invocations on the GPU. Since it is necessary to synchronize between steps, the full positive-definite tracer advection involves several kernels.

Two sets of textures are needed for double buffering. Curiously and somewhat inconveniently, because textures are defined at compile time in CUDA, it is not possible to switch between them at run time. This meant that two identical sets of kernel routines (indicated with superscripts in the diagram) had to be coded, differing only in name and in which set of textures, A or B, they accessed. This was handled in the benchmark code using macros to autogenerate the two routines from templates.
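The macro trick can be sketched as the fragment below (invented names, building on the texture-binding setup shown earlier; this is not the benchmark's actual template): one preprocessor macro stamps out two otherwise identical kernels, one reading texture set A and one reading set B.

// twin_texture_kernels_sketch.cu (fragment) -- macro-generated twin kernels.
#include <cuda_runtime.h>

texture<float, 3, cudaReadModeElementType> texA;   // double-buffer set A
texture<float, 3, cudaReadModeElementType> texB;   // double-buffer set B

// One "template" body, instantiated once per texture reference, because texture
// references cannot be passed as arguments or switched at run time in this API.
#define DEFINE_ADVECT_KERNEL(SUFFIX, TEX)                                        \
__global__ void advect_##SUFFIX(float *out, int nx, int ny, int nz)              \
{                                                                                \
    int i = blockIdx.x * blockDim.x + threadIdx.x;                               \
    int j = blockIdx.y * blockDim.y + threadIdx.y;                               \
    if (i >= nx || j >= ny) return;                                              \
    for (int k = 0; k < nz; k++)                                                 \
        out[(k * ny + j) * nx + i] = tex3D(TEX, i + 0.5f, j + 0.5f, k + 0.5f);   \
}

DEFINE_ADVECT_KERNEL(A, texA)   // advect_A reads texture set A
DEFINE_ADVECT_KERNEL(B, texB)   // advect_B reads texture set B

// Host code then alternates between the two as the double buffer flips, e.g.:
//   if (buffer == 0) advect_A<<<grid, block>>>(d_out, nx, ny, nz);
//   else             advect_B<<<grid, block>>>(d_out, nx, ny, nz);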

 

 

Results

The benchmark code was tested for the 81-tracer WRF-Chem case on a workstation with the following characteristics:

  • 2.83 GHz Intel Xeon E5440 quad-core
  • Tesla C1060 GPU, 1.296 GHz
  • PCIe-2
  • Intel Fortran 10.1 (using -O3 -fpe0 optimization)
  • CUDA compilation tools, release 2.1, V0.2.1221 (Nov 25th build)

The original “vanilla” Fortran version of the scalar advection ran in 4 seconds.  The CUDA version of the code ran in 0.6 seconds, a 6.7x improvement.

GPU Performance Tuning.  As with other kernels tested, performance of the scalar advection code appears to depend most closely on warp occupancy  (the number of thread-warps that are actively running on the GPU, which can be measured using the GPU hardware counters available under the CUDA profiling tools).  As shown in the figure below, for a particular domain, warp occupancy and ultimately performance are sensitive to GPU-specific tuning parameters such as thread-block size and the number of registers.  Optimizing benchmark codes involves sweeps over these controllable parameters.

Output Verification. The output from the two codes (both running single precision) agrees to within 0.1 percent, according to the output verification test included with the benchmark. The test is calculated as follows: for each 3-dimensional field, a mean over the cell values in each horizontal (x and y) level is computed and output at the end of the run. Once both the CPU and GPU versions of the code have run, relative differences of the form |(mean_cpu - mean_gpu)/mean_cpu| are computed for each pair of means. The output of the test is the arithmetic mean of these differences.
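A host-side sketch of that metric, with invented field names and sizes (the real test operates on the benchmark's output files):

// verify_sketch.cu -- host-only sketch of the verification metric described above.
#include <cmath>
#include <cstdio>
#include <vector>

// Mean over one horizontal (x-y) level k of a field stored [k][j][i].
static double level_mean(const std::vector<float> &f, int nx, int ny, int k)
{
    double s = 0.0;
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++)
            s += f[(size_t)(k * ny + j) * nx + i];
    return s / ((double)nx * ny);
}

// Arithmetic mean of |(mean_cpu - mean_gpu)/mean_cpu| over all levels.
static double verify(const std::vector<float> &cpu, const std::vector<float> &gpu,
                     int nx, int ny, int nz)
{
    double acc = 0.0;
    for (int k = 0; k < nz; k++) {
        double mc = level_mean(cpu, nx, ny, k);
        double mg = level_mean(gpu, nx, ny, k);
        acc += std::fabs((mc - mg) / mc);
    }
    return acc / nz;
}

int main()
{
    const int nx = 134, ny = 110, nz = 35;
    std::vector<float> cpu((size_t)nx * ny * nz, 1.000f);
    std::vector<float> gpu((size_t)nx * ny * nz, 1.001f);   // pretend 0.1% difference
    printf("mean relative difference = %g\n", verify(cpu, gpu, nx, ny, nz));
    return 0;
}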

Benchmark code and instructions

The downloadable version of the code contains the driver, the original and CUDA versions of the code, and a README file.  Input dataset for the benchmark is very large (>200 MB gzipped) and is downloaded separately.  

 

GPU Acceleration of a Chemistry Kinetics Solver

 

John Michalakes, National Center for Atmospheric Research

John Linford, Virginia Polytechnic Institute and State University

Manish Vachharajani, University of Colorado at Boulder

Adrian Sandu, Virginia Polytechnic Institute and State University

Introduction

This page contains results from work to accelerate a chemistry kinetics solver from WRF-Chem, the atmospheric chemistry and air quality version of the Weather Research and Forecasting Model.

WRF-Chem supports a number of chemical kinetics solvers, many of which are automatically generated when WRF-Chem is compiled using the Kinetic PreProcessor (KPP) (Damian, Sandu, et al. 2002). Chemical kinetics is the computationally intensive numerical integration of the time evolution of chemical species and their reactions, which is dominated by the solution of the coupled, stiff equations arising from the chemical reactions. The particular solver investigated here is a KPP-generated Rosenbrock gas-phase chemistry solver from RADM2 (see references in the article cited above). Combined with the SORGAM aerosol scheme (see references in the WRF-Chem Users Guide), the RADM2SORG chemistry kinetics option in WRF-Chem involves 61 species in a network of 156 reactions and represents more than 90 percent of the cost of running WRF-Chem over a 60 km resolution domain that is only 40 by 40 cells in the horizontal with 20 vertical layers. The WRF-Chem time step for this case is 240 seconds. On this very coarse resolution and very small grid, the meteorological part of the simulation (the WRF weather model itself) is only 160 million floating point operations per time step, about 1/40th the cost of the full WRF-Chem with both chemistry kinetics and aerosols. Only the KPP-generated chemical kinetics solver is the subject of GPU acceleration here; the SORGAM aerosol kernel will be addressed in future work.

Methodology

WRF-Chem was configured, compiled, and run using the small test case described above. The input data to the RADM2 solver was written to files at the call site. The Fortran-90 source files for the RADM2 chemistry kinetics solver that were generated by KPP were isolated into a standalone program that reads in the input files and invokes the solver.  In addition, a number of KPP-generated tables of indices and coefficients used by the solver (described more fully below) were isolated and retained for use in the GPU version of the solver. The timings and output  from the original solver running under the standalone driver served as a baseline for developing the GPU implementation.

The eventual goal is to work with KPP developers and have KPP generate GPU code. However, since it was not yet clear what an optimal GPU implementation looked like, the objective here was to produce a GPU implementation of the RADM2 solver by hand, from the KPP generated source files. Several iterations were necessary.

The first version of the code was a direct Fortran-to-C-to-CUDA translation of the KPP-generated files. This was developed and validated against the original code under the CUDA 2.0 emulator running on an Intel Xeon with Linux; however, on attempting to compile under CUDA for the GPU, tool stability issues were uncovered and, after a bug report was filed, fixed by NVIDIA engineering for the 2.0 versions of the nvcc compiler and ptxas assembler. With the fix, and with the default register-per-thread limit raised (-maxrregcount=85), compilation requires five minutes or less, and NVIDIA engineering continues to improve the tool chain. The source of the problem was a number of routines in the solver that were generated in fully unrolled form by KPP. For example, the KPP-generated routine that evaluates the sparse Jacobian matrix contains not a single loop but rather a long succession of assignment statements; likewise the code that constructs and LU-decomposes the Jacobian, the equation-rates function for the RADM2SORG case, and so on, for a total of more than 5000 assignments, which appears to overwhelm the register allocation mechanism in the NVCC tool chain (personal communication with NVIDIA). The solution involved rewriting the chemistry code to “re-roll” these sections into loops, reducing them to about 1000 lines.

Details of the Chemical Solver

The code is a Rosenbrock algorithm for solving stiff systems of ODEs, implemented in KPP by Adrian Sandu at Virginia Tech and based on the book Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems, by E. Hairer and G. Wanner, Springer-Verlag, 1996. For chemistry, “stiff” means that the system comprises widely varying reaction rates and that it cannot be solved using explicit numerical methods. The Rosenbrock technique uses a succession of Newton iterations called stages, each of which is itself solved implicitly. The implementation takes advantage of sparsity, as well as trading exactness for efficiency when reasonable. An outline of the implementation generated by KPP that was then hand-converted to CUDA is shown below.
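For orientation, a generic s-stage Rosenbrock step for y' = f(y) can be written in the standard form below (after Hairer and Wanner; the KPP-generated solver fixes particular coefficients and exploits the sparsity of the Jacobian, so this is a schematic of the method, not the generated code):

\left(I - \gamma h J\right) k_i \;=\; h\, f\!\Big(y_n + \sum_{j<i}\alpha_{ij}\,k_j\Big) \;+\; h\,J \sum_{j<i}\gamma_{ij}\,k_j, \qquad i = 1,\dots,s

y_{n+1} \;=\; y_n + \sum_{i=1}^{s} b_i\,k_i

Here J = ∂f/∂y is the (sparse) Jacobian of the reaction system and h is the step size. Each stage is only linearly implicit: every k_i comes from a linear solve with the same matrix (I - γhJ) in the usual case where all diagonal coefficients equal γ, so the per-step work centers on evaluating the sparse Jacobian and solving with its LU decomposition, which is why those routines figure so prominently in the KPP-generated code discussed above.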

The data structures used for the computation at each cell are:

  1. Y(NVAR) – input vector of 59 active species (two fewer than the total of 61)
  2. Temporaries Ynew(NVAR), Yerr(NVAR), and K(NVAR*3)
  3. Fcn(NVAR) – vector of 59 reaction rates
  4. RCONST(NREACT) – array of 159 reaction rates. The variable name is a misnomer; many of these vary over time, and the set of rates is computed separately for each cell.
  5. Jac0(LU_NONZERO), Ghimj(LU_NONZERO) – store the 659 non-zero entries of the Jacobian

Analysis of the computational and memory access characteristics of the program was conducted mostly by hand. For this case, the chemistry kinetics solution per invocation at each cell of the WRF-Chem grid is computed independently and involves approximately 610 thousand floating point operations and 1 million memory accesses. This is for about 30 iterations of the chemistry solver, and the number of iterations is somewhat dependent on the time step. Although the operation count is very high (for comparison, the WRF meteorology is about 5 thousand operations per cell-step), computational intensity is low: roughly 0.60 operations per 8-byte memory access (610 thousand operations over 1 million accesses), or about 0.08 operations per byte.

None of these data structures is small enough to fit in shared memory without severely constraining the number of threads per block and thus limiting latency hiding. The above set of vectors for each cell in the domain was copied to and stored in device memory as a long 1-D array. Two layouts of the data in device memory were tried:

            A[ thread-id , vector-len ]

and

            A[ vector-len , thread-id ]

where vector-len is the number of elements per cell (NVAR, for example). The second layout is slightly more complicated to code, since it involves thread-address calculations at the points of the array references; however, it was considerably faster (by a factor of 2) since it allows neighboring threads to coalesce memory accesses. In other words, the thread index is stride-1 in the second layout.
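The two layouts, and why the second coalesces, can be sketched as follows (the array names and the trivial per-cell sum are invented; only the indexing pattern matters):

// layout_sketch.cu -- the two device-memory layouts discussed above.
#include <cuda_runtime.h>

#define NVAR 59            // per-cell vector length (the 59 active species)

// Layout 1: A[thread][element].  Thread tid reads A[tid*NVAR + v]; consecutive
// threads in a warp touch addresses NVAR*4 bytes apart, so loads do not coalesce.
__global__ void layout1(const float *A, float *out, int ncells)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= ncells) return;
    float s = 0.0f;
    for (int v = 0; v < NVAR; v++)
        s += A[tid * NVAR + v];
    out[tid] = s;
}

// Layout 2: A[element][thread].  Thread tid reads A[v*ncells + tid]; for a fixed v,
// consecutive threads read consecutive words, so each warp's loads coalesce.
// This is the layout measured above to be about 2x faster.
__global__ void layout2(const float *A, float *out, int ncells)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= ncells) return;
    float s = 0.0f;
    for (int v = 0; v < NVAR; v++)
        s += A[v * ncells + tid];
    out[tid] = s;
}

int main()
{
    const int ncells = 28889;               // cells in the large WRF-Chem case (see Results)
    float *A, *out;
    cudaMalloc(&A, (size_t)ncells * NVAR * sizeof(float));
    cudaMalloc(&out, ncells * sizeof(float));
    cudaMemset(A, 0, (size_t)ncells * NVAR * sizeof(float));
    int block = 192;                         // block size found best in the results below
    layout1<<<(ncells + block - 1) / block, block>>>(A, out, ncells);
    layout2<<<(ncells + block - 1) / block, block>>>(A, out, ncells);
    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(out);
    return 0;
}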

In addition to the above, “re-rolling” the large loops in the solver introduced a number of additional indirection arrays and arrays of coefficients. 

  1. LU_IROW, LU_ICOL – indirection arrays for the 659 non-zero entries of the Jacobian
  2. LU_CROW, LU_DIAG – indirection arrays for the 60 diagonal elements
  3. Stoich_Left – 9516 (NSPEC * NREACT) 4-byte float coefficients

These are constant and the same for every cell. Most of them fit in GPU constant memory; the last, STOICH_LEFT, is stored in device memory but accessed as a 2-D texture.
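A rough sketch of how such index tables and a re-rolled sparse loop might sit on the GPU is shown below; the sizes follow the lists above, but the kernel body is placeholder arithmetic, not the actual KPP-generated solver:

// rerolled_jacobian_sketch.cu -- illustration of constant-memory indirection arrays
// driving a re-rolled sparse loop (not the actual KPP-generated code).
#include <cuda_runtime.h>

#define LU_NONZERO 659      // non-zero Jacobian entries (from the list above)
#define NSPEC      61       // species count
#define NCELLS     28889    // cells in the large case (see Results)

// Constant and identical for every cell: read through the constant cache.
__constant__ int LU_IROW[LU_NONZERO];
__constant__ int LU_ICOL[LU_NONZERO];

// Tables too large for constant memory (e.g. Stoich_Left) would be read through
// a texture, as described above; omitted here for brevity.

// Re-rolled form: instead of 659 hand-unrolled assignments, a single loop walks
// the sparse structure via the indirection arrays.  The element-major layout
// (entry index outermost, cell index innermost) keeps accesses coalesced.
__global__ void jacobian_like(const float *rates, const float *var, float *jac)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= NCELLS) return;
    for (int n = 0; n < LU_NONZERO; n++) {
        int r = LU_IROW[n];
        int c = LU_ICOL[n];
        // Placeholder arithmetic standing in for the real rate expressions.
        jac[(size_t)n * NCELLS + cell] =
            rates[(size_t)r * NCELLS + cell] * var[(size_t)c * NCELLS + cell];
    }
}

int main()
{
    int h_irow[LU_NONZERO] = {0}, h_icol[LU_NONZERO] = {0};   // real tables come from KPP
    cudaMemcpyToSymbol(LU_IROW, h_irow, sizeof(h_irow));
    cudaMemcpyToSymbol(LU_ICOL, h_icol, sizeof(h_icol));

    float *rates, *var, *jac;
    cudaMalloc(&rates, sizeof(float) * NSPEC * NCELLS);
    cudaMalloc(&var,   sizeof(float) * NSPEC * NCELLS);
    cudaMalloc(&jac,   sizeof(float) * LU_NONZERO * NCELLS);
    cudaMemset(rates, 0, sizeof(float) * NSPEC * NCELLS);
    cudaMemset(var,   0, sizeof(float) * NSPEC * NCELLS);
    jacobian_like<<<(NCELLS + 191) / 192, 192>>>(rates, var, jac);
    cudaDeviceSynchronize();
    cudaFree(rates); cudaFree(var); cudaFree(jac);
    return 0;
}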

GPU and Cell Chemistry Kinetics Standalone Codes

There are several implementations of the kernel that may be downloaded from this site.

Codes:

The original "vanilla" Fortran version with an option to run multi-core using OpenMP threading is available here:

            http://www.mmm.ucar.edu/wrf/WG2/GPU/Chem_benchmark/chem_bench_omp.tar.gz

The most recent GPU version is here:

            http://www.mmm.ucar.edu/wrf/WG2/GPU/Chem_benchmark/radm2_cuda_masked.tar.gz

Linford and Sandu’s Cell implementation of the kernel may be downloaded from:

 http://www.mmm.ucar.edu/wrf/WG2/GPU/Chem_benchmark/cell.tar.gz (0.42 MB).

Input datasets:

The dataset for the small workload is here:

            http://www.mmm.ucar.edu/wrf/WG2/GPU/Chem_benchmark/radm2sorg_in_001.gz

The large workload is here:

            http://www.mmm.ucar.edu/wrf/WG2/GPU/Chem_benchmark/radm2sorg_in_010.gz

 

Instructions:

These instructions are out of date.  Need to be updated.  To compile, edit the makefile and then type make.  To run the benchmark, cd to the run directory and type ./runit .  There is a SAMPLE_SESSION file in the top-level directory to compare against. If both the original and GPU versions of the code run successfully, the script computes RMS error (5.53492e-12) for one of the output chemical species (Ozone). Your result should be close to this (same order of magnitude or less).

Results (Updated 9/7/2009)

The GPU version of the chemistry kernel is invoked for a chunk of cells in the large array of cell states, but the chunk size and block size (the number of threads per block) are variable and specified when the standalone chemistry code is invoked. Parameter sweeps over chunk size and block size revealed that a single chunk of all 28,889 cells in the domain and a block size of 192 threads per block was most efficient. Block sizes larger than 192 threads fail to run, presumably because each thread uses 85 registers and larger blocks exceed the 16K registers available per multiprocessor (192 × 85 = 16,320, just under the 16,384-register limit, while 256 × 85 would exceed it).

The test system was a GPU cluster at NCSA, ac.ncsa.uiuc.edu:

  • 2.4 GHz Quad Core Opteron nodes
  • Tesla C1060 GPU, 1.296 GHz
  • PCIe-2
  • CUDA 2.2

The best time achieved on a C1060 Tesla GPU for this double precision workload was 6.9 seconds, including host-GPU data transfer time.  Running single precision, the best time was 3.3 seconds. 

The plot below shows results running the original “vanilla” Fortran version of the benchmark, modified to run multi-threaded using OpenMP, on newer Intel Nehalem processors. (The modified code is here.) The blue line shows seconds to compute the workload on a Mac Pro (joshua.mmm.ucar.edu) with two quad-core 2.26 GHz Nehalem processors. The x-axis is the increasing number of OpenMP threads, up to 64 threads (always only 8 physical cores). The notch at 16 threads is for 2 threads per core. The code was compiled using Intel Fortran 11.0 with -fast optimization. For comparison, earlier results for 8 threads on a 3 GHz dual quad-core Xeon 5400 (John Linford at Virginia Tech) are plotted; this was with a C version of the OpenMP code. Also for comparison, the best times achieved for the CUDA implementation of the benchmark on a Tesla C1060 GPU are indicated with the dashed green lines, for double and single floating point precision as labeled.