A blog post on Sun BestPerf by Paul Kinney reports results for the Weather Research and Forecasting (WRF) code running on twelve Sun Blade X6275 server modules, housed in a Sun Blade 6048 chassis, using the 2.5 km CONUS benchmark dataset. According to Kinney, the Sun Blade X6275 cluster achieved 373 GFLOP/s on this dataset. The results also demonstrate a 91% speedup efficiency, or roughly an 11x speedup, from 1 to 12 blades. The current results were run with turbo mode on, he adds.
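The two scaling figures can be related with a line of arithmetic. The sketch below assumes the common definition of parallel efficiency (measured speedup divided by ideal linear speedup); the function name is illustrative and does not come from Kinney's post.

```python
def parallel_efficiency(speedup: float, base_units: int, scaled_units: int) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    ideal = scaled_units / base_units
    return speedup / ideal


# Figures quoted in the post: roughly 11x when going from 1 to 12 blades.
eff = parallel_efficiency(11.0, base_units=1, scaled_units=12)
print(f"{eff:.1%}")
```

With the rounded 11x figure this comes out just under 92%, which is in line with the reported 91%.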
Performance is expressed in terms of "simulation speedup," which is the ratio of the simulated time step per iteration to the average wall clock time required to compute it. A larger number implies better performance, Kinney explains.
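As a minimal sketch of that definition, the ratio can be written as a small function. The wall clock values used here are hypothetical and chosen only to show that a shorter wall clock time gives a larger speedup.

```python
def simulation_speedup(time_step_s: float, wall_clock_s: float) -> float:
    """Simulated seconds advanced per second of wall clock time."""
    if wall_clock_s <= 0:
        raise ValueError("wall clock time must be positive")
    return time_step_s / wall_clock_s


# Hypothetical timings for a 15 s model time step.
slow = simulation_speedup(15.0, 5.0)   # 5 s of wall clock per step
fast = simulation_speedup(15.0, 3.0)   # 3 s of wall clock per step
print(slow, fast)
```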
The hardware and software configurations for this benchmark were as follows:
- Sun Blade 6048 Modular System
- 12 x Sun Blade X6275 Server Modules, each with 4 x 2.93 GHz Intel QC X5570 processors; 24 GB (6 x 4GB); QDR InfiniBand; HT disabled in BIOS; Turbo mode enabled in BIOS
- X6275 Server Modules: each X6275 contains two separate compute nodes, providing a total of 24 compute nodes in 12 Sun Blade server modules. The Sun Blade 6048 with 48 Server Blades can have up to 96 compute nodes.
- OS: SUSE Linux Enterprise Server 10 SP 2
- Compiler: PGI 7.2-5; MPI Library: Scali MPI v5.6.4; Benchmark: WRF 3.0.1.1; Support Library: netCDF 3.6.3
Kinney describes the WRF as a next-generation mesoscale numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. He writes that WRF is designed to be a flexible, state-of-the-art atmospheric simulation system that is portable and efficient on available parallel computing platforms. WRF features multiple dynamical cores, a 3-dimensional variational (3DVAR) data assimilation system, and a software architecture allowing for computational parallelism and system extensibility.
The dataset is a single-domain, large 2.5 km Continental US case (CONUS-2.5K) with a 1501x1201x35 cell volume: a 6 hr, 2.5 km resolution dataset from June 4, 2005. The benchmark itself is the final 3 hr of the simulation (hours 3-6), starting from a provided restart file. It may also be run for the full 6 hr from a cold start, although that variant is seldom reported. Each iteration advances the simulation by 15 s, and the computational cost of each time step is approximately 412 GFLOP.
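To give a sense of scale, the figures above can be combined into a rough workload estimate. This is a back-of-the-envelope sketch that assumes a fixed 15 s time step for the whole 3 hr benchmark window and treats the approximate 412 GFLOP per step as exact; the derived totals are not stated in the post.

```python
# Figures taken from the dataset description.
NX, NY, NZ = 1501, 1201, 35      # cell volume
TIME_STEP_S = 15                 # simulated seconds per iteration
BENCH_HOURS = 3                  # benchmark covers hours 3-6
GFLOP_PER_STEP = 412             # approximate cost of one time step

cells = NX * NY * NZ
steps = BENCH_HOURS * 3600 // TIME_STEP_S
total_gflop = steps * GFLOP_PER_STEP

print(f"grid cells:  {cells:,}")
print(f"time steps:  {steps}")
print(f"total work:  {total_gflop:,} GFLOP (~{total_gflop / 1000:.0f} TFLOP)")
```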
The key points and best practices that Kinney derives from this benchmark are as follows:
- Processes were bound to processors in round-robin fashion.
- Model simulation time is 15 seconds per iteration, as defined in the input job deck. An achieved speedup of 2.67X means that each model iteration of 15s of simulation time was executed in 5.6s of real wallclock time (on average).
- Computational requirements are 412 GFLOP per simulation time step, as measured empirically and documented on the UCAR web site for this data model.
- Model was run as a single MPI job.
- Benchmark was built and run as a pure-MPI variant. With larger process counts, building and running WRF as a hybrid OpenMP/MPI variant may be more efficient.
- Input and output (netCDF format) datasets can be very large for some WRF data models, and run times will generally benefit from using a scalable filesystem. Performance with very large datasets (>5GB) can benefit from enabling WRF quilting of I/O across designated processors/servers. The master thread (or rank-0) performs most of the I/O (unless quilting specifies otherwise), with all processes potentially generating some I/O.
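The timing figures in these points can be tied together with simple arithmetic. The sketch below assumes that sustained GFLOP/s is the per-step cost divided by the average wall clock time per step; the helper names are illustrative, and the derived values are not quoted in Kinney's post.

```python
GFLOP_PER_STEP = 412.0   # cost of one time step, per the UCAR figure
TIME_STEP_S = 15.0       # simulated seconds per iteration


def wall_time_per_step(speedup: float) -> float:
    """Average wall clock seconds per iteration for a given simulation speedup."""
    return TIME_STEP_S / speedup


def sustained_gflops(speedup: float) -> float:
    """Sustained GFLOP/s implied by a given simulation speedup."""
    return GFLOP_PER_STEP / wall_time_per_step(speedup)


def speedup_for_gflops(gflops: float) -> float:
    """Simulation speedup implied by a sustained GFLOP/s rate."""
    return TIME_STEP_S * gflops / GFLOP_PER_STEP


# The 2.67X example from the list: about 5.6 s of wall clock per 15 s step.
print(round(wall_time_per_step(2.67), 1))
print(round(sustained_gflops(2.67), 1))

# Simulation speedup implied by the 373 GFLOP/s headline figure.
print(round(speedup_for_gflops(373.0), 2))
```

Under these assumptions the 2.67X example corresponds to a far lower rate than the 373 GFLOP/s headline, so it reads as an illustration of the metric rather than the twelve-blade result.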
Another blog post on the same benchmarking exercise, Rich Brueckner's "Sun Blades: Outstanding Performance on WRF Weather Code," reports the same remarkable results, noting the 11x speedup from 1 to 12 blades.
More Information
Pathways to Petascale Computing - Updated White Paper
Sun Blade X6275 Server Module
Sun Blade 6048 Chassis