Infiniband IPoIB datalink partition and IP interface configuration is easy and painless using the same network BUI page or CLI contexts as ethernet, says Fishworks blogger Cindi in her update on Infiniband for Q3.2009. Port GUID information is available for configured partitions on the network page, she continues, which makes it easy to add Sun Storage 7000 Unified Storage System HCA ports to a partition table on a subnet manager. Once a port has been added to a partition on the subnet manager, the IPoIB device will automatically appear in the network configuration. At this point, the device may be used to configure partition datalinks and then interfaces. If desired, IP network multi-pathing (IPMP) can be employed by creating multi-pathing groups for the IPoIB interfaces.
The blogger points out that HCA and port GUID and status information may also be found at the hardware slot location: navigate to Maintenance->Hardware->Slots for the controller and click the slot information icon to see the firmware, GUIDs, status and speeds associated with the HCA and its ports.
She goes on to report on two performance tests of Infiniband with two simple workloads on a base Sun Storage 7410 that used a total of 8 clients with up to 5 threads each performing sequential reads from a 10GB file in a slightly modified version of Fishworks Engineer Brendan Gregg's seqread.pl script. The clients are evenly assigned to each of the HCA ports.
More than 5 threads per client did not yield any significant gain, with the CPU at its maximum. The experiment was run twice, once for NFSv4 over IPoIB and once for NFSv4/RDMA. As the blogger expected, IPoIB yielded better results with smaller block sizes. The surprise, she writes, was to see IPoIB outperform NFS/RDMA at 64K transfer block sizes and stay in the running at every size in between.
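For readers unfamiliar with this style of workload, a multi-threaded sequential read can be sketched roughly as below. This is not seqread.pl or the blogger's modified version of it; the function names and the choice of Python are assumptions made purely for illustration.

```python
import threading


def sequential_read(path, block_size, totals, index):
    """Read one file front to back in block_size chunks and record the bytes read."""
    count = 0
    # buffering=0 disables Python's own read buffering, so each read() maps
    # to one request of block_size bytes against the underlying file.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:  # an empty result means end of file
                break
            count += len(chunk)
    totals[index] = count


def run_readers(path, block_size, threads):
    """Run `threads` concurrent sequential readers; return bytes read by each."""
    totals = [0] * threads
    workers = [
        threading.Thread(target=sequential_read, args=(path, block_size, totals, i))
        for i in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return totals
```

Calling `run_readers(path, 64 * 1024, 5)` against a file on an NFS mount would approximate one client running 5 threads at a 64K block size; varying `block_size` between runs mirrors the block-size comparison described above.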
To determine the maximum limit for synchronous writes the blogger used a stepping approach to the single synchronous write experiment above. Looping through the 8 clients at one minute intervals, she added a 4K synchronous write thread every second until the number of IOPS leveled off. At about 10 threads per client, a maximum begins to appear in the number of IOPS. This time CPU utilization is below its maximum (35%) but latency turns into a lake-effect ice storm, and throughput tops out at 38961 write IOPS for the 80 client threads.
The blogger writes that she also captured the per-device network throughput with the assumption that accounting for the additional NFSv4 operations and packet overhead, 93.1MB/sec seems reasonable. She ran this experiment with NFS/RDMA and discovered a marked drop-off (30%) in IOPS when run for a long period. Until then, NFS/RDMA performed as well as IPoIB.
Cindi reports next on the performance of the Sun Storage 7410 array with two additional CPUs and twice the memory in a test titled "Infiniband Performance Limits: Take 1." The 7410 also now has 8 JBODs and logzillas to help with writes. The system was configured into a QDR fabric to help with overall throughput. There are two internal Sun DataCenter 3x24 Infiniband switches.
With the network and device software stacks bypassed and the number of data copies performed by the CPU reduced, the assumption is that there should be a reduction in CPU utilization and an increase in the amount of data that can be transferred between clients and NFS server.
Cindi writes that this test demonstrates the maximum read throughput achievable over NFSv3/RDMA. The test reads a 1GByte file cached entirely in DRAM from the SS7410 filer to 10 clients. Each client is running 10 threads that are each performing 128KB read accesses from the filer and dumping the data into their DRAM. This test is effectively the same test used to publish typical results for the IB transport.
Cindi found it impossible to achieve a workload confined strictly to disk reads. The problem, she found, was not with Sun Storage 7410 but rather the number of clients in the fabric. In order to obtain results for this test, she figured it would be necessary to increase the number or capabilities of the clients. In this instance, using the same 10 IB clients as in the read experiments, the blogger drove 2 streaming write threads per client. Each thread uses a 32KB block size to stream to a separate file residing on a separate share.
The result in this instance was that it was possible to break the 1 GByte/sec maximum Brendan saw with Ethernet. The 1 GByte/sec result is obtained by multiplying the NFS write IOPS by the write size, Cindi writes, adding that she was unable to sanity check this result against the network throughput in Analytics because the TCP/IP stack was being bypassed. It was possible, however, to confirm the throughput on the fabric subnet manager using the port counters exported by each HCA port. According to the port counters, the receive rate was roughly 1 GByte/second.
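The IOPS-times-write-size arithmetic can be sketched as follows. This is a minimal illustration rather than the blogger's tooling: the function names are hypothetical, and it assumes that the 32KB write size means 32 x 1024 bytes and that 1 GByte means 2^30 bytes.

```python
def throughput_bytes_per_sec(iops, io_size_bytes):
    """Throughput implied by an operation rate and a fixed I/O size."""
    return iops * io_size_bytes


def iops_for_throughput(target_bytes_per_sec, io_size_bytes):
    """Operation rate needed to sustain a target throughput at a fixed I/O size."""
    return target_bytes_per_sec / io_size_bytes


WRITE_SIZE = 32 * 1024   # assumed: 32KB = 32768 bytes
ONE_GBYTE = 1 << 30      # assumed: binary gigabyte

# Write operations per second needed to reach 1 GByte/sec at 32KB per write.
needed_iops = iops_for_throughput(ONE_GBYTE, WRITE_SIZE)
```

Under these assumptions, `needed_iops` works out to 32768 writes per second across all clients. With decimal units the figure would be somewhat lower.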
Using the port counters, she notes, is not precise, as the time it takes to collect the information varies and the counters (being 32 bits in length) can wrap. But the counters do provide a way to confirm the transport throughput in the absence of Analytics for RDMA. On the subnet manager, mlx4_0 (LID/Port 5/2) is attached to switch A and mthca0 (LID/Port 3/2) is attached to switch C in the IB fabric topology.
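One way to turn two samples of a wrapping 32-bit counter into a rate is sketched below. The function names are hypothetical and the code is not taken from the blog. It assumes the counter wraps at most once between samples, and that each count represents 4 bytes, which is the unit the InfiniBand port data counters use; adjust `bytes_per_count` if the tool in use already reports bytes.

```python
COUNTER_BITS = 32
COUNTER_MODULUS = 1 << COUNTER_BITS


def counter_delta(previous, current, modulus=COUNTER_MODULUS):
    """Difference between two samples, correct across a single wrap."""
    return (current - previous) % modulus


def receive_rate(previous, current, seconds, bytes_per_count=4):
    """Approximate bytes per second between two counter samples."""
    if seconds <= 0:
        raise ValueError("sampling interval must be positive")
    return counter_delta(previous, current) * bytes_per_count / seconds
```

At around 1 GByte/second a 32-bit counter in 4-byte units covers only about 16 seconds of traffic, so samples need to be taken more often than that for the single-wrap assumption to hold. A counter that holds at its maximum value instead of wrapping would need to be reset between samples, and this arithmetic would not apply.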
Next up, then, given the insufficient configuration of clients and their inability to push a maximum workload, it became necessary to increase the number of clients and repeat the test. The IPoIB protocol uses the TCP/IP network stack to transmit and receive network packets, she points out, adding that, unlike RDMA, which bypasses the network stack, IPoIB suffers from some of the performance implications inherent in the traditional TCP/IP software stack.
The results?
- NFSv3 Streaming DRAM Read, 3.18 GBytes/second RDMA; 2.24 GBytes/second IPoIB
- NFSv3 Streaming Disk Read (not available)
- NFSv3 Streaming Write, 1.00 GBytes/second RDMA; 753 MBytes/second IPoIB
These results dictate Cindi's next step, which was to build out and attach the 7410 to a QDR-based fabric with at least 20 clients, which should provide a client workload large enough to push the 7410 to its maximum potential.
In her report on this third instance, "Infiniband Performance Limits: Streaming Disk Read and New Summary," NFSv3 Streaming DRAM Read remained the same at 3.18 GBytes/second RDMA while IPoIB increased to 2.40 GBytes/second.
- NFSv3 Streaming Write remained at 1.00 GBytes/second RDMA and dropped to 752 MBytes/second IPoIB.
The third fabric configuration was as follows:
Filer: Sun Storage 7410, with the following config:
- 256 Gbytes DRAM
- 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
- 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
- 2 Sun DDR Dual Port Infiniband HCA
- 3 HBA cards
- noatime on shares, and database size left at 128 Kbytes
Clients: 10 blades, each:
- 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
- 3 Gbytes of DRAM
- 1 Sun DDR Dual Port Infiniband HCA Express Module
- mount options: read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload; write tests: default mount options
Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)
Subnet manager:
- Centos 5.2
- Sun HPC Software, Linux Edition
- 2 Sun DDR Dual Port Infiniband HCA
The 10 client fabric connected to the Sun Storage 7410 array was split equally between two subnets and connected to two separate HCA ports on the 7410. Each client has a separate share mounted. For the read from disk tests, the blogger used all 10 clients each running 10 threads to read 1 MB of data from its own 2GB file. The shares are mounted with rsize=128K.
In a test of this configuration -- to see whether it would push the 7410 to its limits -- the blogger ran a step test for the 4K maximum IOPS test. Here, it is clear that the stepwise function of adding two clients at a time, plus one at the end for a maximum of 9 clients, scales nicely: every two clients add roughly 42000 IOPS per step and the last client adds another 20000. The 7410's maximum is being approached. Cindi speculates that there is room for another 5 clients, for an IOPS maximum of 400K. That's next.
More Information
Sun Storage Unified Storage System Gets Five Stars in ITPro Review