Sun has demonstrated the first large-scale grid validation of the SAS Grid Computing 9.2 benchmark, according to blogger Marcus Heckel. The test used large data sizes on the Sun Storage 7410 Unified Storage System and complex data processing that models real-world use.
He summarizes the results of the benchmarking exercise as follows:
- A combination of eight Sun Fire X2200 M2 servers (one configured as the grid manager, seven as the actual grid nodes) and a Sun Storage 7410 Unified Storage System showed continued performance improvement as the node count increased from 2 to 7 for the Grid Endurance Test. Each node had a 1GbE connection through a Brocade FastIron 1GbE/10GbE switch. The 7410 had a 10GbE connection to the switch and served as the back-end storage, providing the common shared file system that SAS Grid Computing requires across all nodes. Fishworks Analytics combined with the reliability of ZFS proved ideal for application environments as complex as those of SAS.
- Sun Storage 7410 Unified Storage System (exporting via NFS) satisfied performance needs, throughput peaking at over 900MB/s (near 10GbE line speed) in this multi-node environment.
- Solaris 10 Containers were used to create agile and flexible deployment environments. Container deployments were migrated trivially (within minutes) as hardware resources became available and the grid expanded.
- This result is the only large-scale grid validation for SAS Grid Computing 9.2, and the first and most timely qualification of OpenStorage for SAS. The tests show a delivered throughput of over 100MB/s through a client 1Gb connection.
The workload was a batch mixture. CPU bound workloads are numerically intensive tests, some using tables varying in row count from 9,000 to almost 200,000. The tables have up to 297 variables, and are processed with both stepwise linear regression and stepwise logistic regression. Other computational tests use GLM (General Linear Model). IO intensive jobs vary as well. One particular test reads raw data from multiple files, then generates 2 SAS data sets, one containing over 5 million records, the 2nd over 12 million. Another IO intensive job creates a 50 million record SAS data set, then subsequently does lookups against it and finally sorts it into a dimension table. Finally, other jobs are both compute and IO intensive. The SAS IO pattern for all these jobs is almost always sequential, for read, write, and mixed access.
Governing the batch is the SAS Grid Manager Scheduler, Platform LSF. It decides when to add a job to a node based on the number of open job slots (user defined) and a point-in-time sample of how busy the node actually is. From run to run, jobs end up scheduled randomly, so exactly repeatable runs are not possible. Inevitably, multiple IO intensive jobs get scheduled on the same node and saturate its 1Gb connection, creating a bottleneck while other nodes do little to no IO.
Often this is unavoidable due to the great variety in behavior a SAS program can go through during its lifecycle. For example, a program can start out as CPU intensive and be scheduled on a node processing an IO intensive job. This is the desired behavior and the correct decision based on that point in time. However, the initially CPU intensive job can then turn IO intensive as it proceeds through its lifecycle.
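The placement problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Platform LSF's actual algorithm: the `Node` class, the `pick_node` function, the load values, and the job names are all hypothetical. It shows how a decision based on open slots and a point-in-time load sample can put a job that starts out CPU bound onto a node that is already running an IO intensive job.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical grid node; not an LSF data structure."""
    name: str
    slots: int                  # user-defined number of job slots
    sampled_load: float = 0.0   # point-in-time CPU busyness sample, 0.0 to 1.0
    jobs: list = field(default_factory=list)


def pick_node(nodes):
    """Return the least busy node that still has an open slot, or None."""
    candidates = [n for n in nodes if len(n.jobs) < n.slots]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.sampled_load)


# node1 is busy with CPU work; node2 runs an IO intensive job and so
# looks nearly idle from a CPU point of view at sampling time.
nodes = [
    Node("node1", slots=2, sampled_load=0.9, jobs=["cpu_job"]),
    Node("node2", slots=2, sampled_load=0.2, jobs=["io_job"]),
]

chosen = pick_node(nodes)
chosen.jobs.append("cpu_then_io_job")
print(chosen.name, chosen.jobs)
```

In this model the new job lands on `node2`, which is the right call at that moment. If the job later turns IO intensive, two IO heavy jobs end up sharing one 1Gb link.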
While scaling is less than linear, Heckel explains that this is due to the nature of the workload. A further scaling difficulty comes from the lack of a steady state for the batch, as the Analytics snapshot shows, he adds.
Throughput of 763MB/s was achieved during the sample period, though that wasn't the top end of what the 7410 could provide, the blog notes. The 7 node run peaked at over 900MB/s through a single 10GbE connection. Clearly the 7410 can sustain a fairly high level of IO.
To examine the effect of running a CPU bound workload with minimal IO requirements, Heckel reports, a batch was run sans the IO intensive jobs. These jobs do require some IO, but tend to be restricted to 25MB/s or less per process and only for the purpose of initially reading a data set, or writing results.
- 3 nodes ran in 120 minutes
- 7 nodes ran in 58 minutes
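To put the two run times in perspective, the short calculation below compares the measured speedup from 3 to 7 nodes against the ideal linear speedup. The function name `scaling_efficiency` is illustrative only; the inputs are the node counts and run times listed above.

```python
def scaling_efficiency(base_nodes, base_minutes, nodes, minutes):
    """Return (measured speedup, ideal linear speedup, efficiency)."""
    speedup = base_minutes / minutes   # how much faster the larger run was
    ideal = nodes / base_nodes         # speedup if scaling were perfectly linear
    return speedup, ideal, speedup / ideal


speedup, ideal, efficiency = scaling_efficiency(3, 120, 7, 58)
print(f"speedup {speedup:.2f}x, ideal {ideal:.2f}x, efficiency {efficiency:.0%}")
# prints: speedup 2.07x, ideal 2.33x, efficiency 89%
```

By this measure the CPU bound batch reached roughly 89% of linear scaling between the 3 node and 7 node runs.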
With respect to tuning, the blog notes that after a RAID1 share was configured on the 7410, only one parameter made a significant difference to the achieved results. During the IO intensive periods, single 1Gb client throughput was observed at 120MB/s simplex and 180MB/s duplex, producing well over 100,000 interrupts a second. Jumbo frames were enabled on the 7410 and the clients, reducing interrupts by almost 75% and reducing IO intensive job run time by an average of 12%. Many other NFS, Solaris, and TCP/IP tunings were tried, but none produced a meaningful improvement in either microbenchmarks or the actual batch.
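A rough back-of-the-envelope estimate shows why jumbo frames cut the interrupt rate so sharply. The sketch below rests on several assumptions that are not from the blog: 1MB is taken as 1,000,000 bytes, every frame is assumed to carry a full MTU of payload, protocol headers and interrupt coalescing are ignored (one interrupt per frame), and the jumbo MTU is assumed to be 9000 bytes, a common value that the source does not specify.

```python
STANDARD_MTU = 1500   # bytes, default Ethernet MTU
JUMBO_MTU = 9000      # bytes, assumed jumbo frame size (not stated in the source)


def frames_per_second(throughput_mb_s, payload_bytes):
    """Estimate frames per second needed to carry the given throughput."""
    return throughput_mb_s * 1_000_000 / payload_bytes


duplex_mb_s = 180  # observed duplex throughput on one 1Gb client

standard_rate = frames_per_second(duplex_mb_s, STANDARD_MTU)
jumbo_rate = frames_per_second(duplex_mb_s, JUMBO_MTU)
reduction = 1 - jumbo_rate / standard_rate

print(f"standard: {standard_rate:,.0f} frames/s")
print(f"jumbo:    {jumbo_rate:,.0f} frames/s")
print(f"reduction: {reduction:.0%}")
```

Under these assumptions the estimate is about 120,000 frames a second at the standard MTU, which is in line with the "well over 100,000 interrupts a second" that was observed. The idealized reduction of about 83% is an upper bound; the measured figure of almost 75% is plausibly lower because real traffic includes small frames that do not benefit from a larger MTU.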
In conclusion, Heckel recommends knowing the workload, since in many cases the 7410 storage appliance can be a great fit at a relatively inexpensive price while providing the benefits described. He also notes that 10GbE client networking can help if the 1GbE IO pipeline is a bottleneck and there is a reasonable amount of free CPU headroom.
Sun Open Storage for SAS
What is SAS Grid Computing?
Sun Fire X2200 M2 Server
Sun Storage 7410 Unified Storage System