Certain applications do not scale easily on Non-Uniform Memory Access (NUMA) architectures: adding CPU cores does not proportionately increase application performance. Rickey C. Weisner has written an article in which he presents his success at improving the performance of two different applications, one on a SPARC-based Sun Fire E6900 and the other on an AMD Opteron-based Sun Fire X4600 machine.
In the first case, the application is a component of a virtual telephone switch, and its business metric is expressed in calls processed per second or busy-hour calls, though the accompanying graphs report results in transactions per second. The application consists of numerous Java technology-based and C++ processes.
The issue is that CPU utilization increases nonlinearly with load: a six-system-board configuration processed only three times the load that a one-system-board configuration did, rather than the expected sixfold increase.
A DTrace analysis of system calls showed that neither timed events nor increased spinning on mutex locks was causing the slowdown. Weisner and his customer speculated that user instructions were actually taking longer to run, so the application was making fewer system calls per second than expected. The investigation then used the Solaris OS cpustat tool to determine whether loads and stores were taking longer because of off-board CPU migrations and L3 cache misses, in effect consuming more cycles per instruction.
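The cycles-per-instruction reasoning can be made concrete with a small, self-contained sketch. The counter values below are made up for illustration (cpustat reports such counts on Solaris, but these variable names and numbers are not from the article); the point is only that CPI is cycles divided by retired instructions, and a CPI that climbs under load points at memory stalls rather than extra work:

```shell
# Hypothetical hardware-counter samples over one interval
# (illustrative values only, not measurements from the article).
CYCLES=1200000000
INSTRUCTIONS=400000000

# CPI = cycles / retired instructions. A value this far above 1
# under load suggests the CPU is stalling on memory, not computing.
awk -v c="$CYCLES" -v i="$INSTRUCTIONS" \
    'BEGIN { printf "CPI = %.2f\n", c / i }'
# prints: CPI = 3.00
```

If the same workload shows a rising CPI as boards are added, the extra cycles are going to remote memory latency, which is exactly what the L3-miss investigation confirmed.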
Analysis revealed that 41 percent of the system's capacity was being spent servicing L3 misses, which Weisner attempted to remedy through processor affinity and lgroups. Processor binding introduced a problem of its own, which was reduced by running the bound threads in the Fixed Priority (FX) scheduling class at a high priority with a large time slice; this combination provided the final answer.
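The binding-plus-FX remedy can be sketched as Solaris command invocations. These are illustrative only: the process ID, CPU ID, priority, and quantum below are hypothetical, and the summary does not give the exact invocations Weisner used:

```shell
# Bind a worker process (hypothetical PID 1234) to processor 8,
# so its threads stop migrating away from their cached data.
pbind -b 8 1234

# Move the process into the Fixed Priority (FX) class at a high
# priority (60) with a large time quantum: -t 200 -r 1000 means
# 200 units of 1/1000 s, i.e. a 200 ms slice.
priocntl -s -c FX -m 60 -p 60 -t 200 -r 1000 -i pid 1234
```

With a long FX quantum at a fixed high priority, a bound thread keeps its processor, and therefore its warm caches, for long stretches instead of being bounced around by the default timeshare scheduler.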
In the case of the Sun Fire X4600, the goal was to reduce the execution time of a Java technology-based application from the 214 minutes it took on an X4200 server with one Java Virtual Machine (JVM) and four worker threads to a quarter of that time on the X4600 with a single JVM and 16 worker threads. On the X4600, however, execution time fell only to 195 minutes.
Here, Weisner partitioned the X4600 into four processor sets to make each resemble an X4200, split the single large Java process into four processes, and assigned each process to one of the processor sets, using FX scheduling and processor-set-aware memory allocation. This made it possible to complete in under 80 minutes a piece of work that had been taking 195 minutes: not the stated goal, but a definite improvement.
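The partitioning step can likewise be sketched with Solaris processor-set commands. Processor IDs, set IDs, and PIDs below are hypothetical for a 16-core machine, and the four processes stand in for the split-up JVM workers:

```shell
# Create four processor sets of four cores each
# (processor IDs hypothetical; psrset prints each new set's ID).
psrset -c 0 1 2 3
psrset -c 4 5 6 7
psrset -c 8 9 10 11
psrset -c 12 13 14 15

# Bind each of the four JVM processes (hypothetical PIDs) to its
# own set, so each process's threads and memory stay local.
psrset -b 1 1001
psrset -b 2 1002
psrset -b 3 1003
psrset -b 4 1004
```

Each set behaves like a small four-core machine, so each JVM allocates and touches memory near the cores it runs on, which is the effect that cut the runtime to under 80 minutes.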
Weisner concludes that while integrated memory management units give modern processors significant performance advantages when accessing local memory, this advantage becomes a liability when memory accesses become disproportionately remote. Further, the typically prescribed remedies -- giving individual threads an affinity for the processor on which each most recently ran; giving a thread an affinity for the processor closest to the memory it needs to access (lgroups in the Solaris OS); and using large L2 and L3 caches -- are sometimes not enough to ensure efficient and scalable execution.
And while intelligent policy-based scheduling would help, he continues, the operating system can only do so much. Instead, he looks to an experienced, competent analyst with sufficient knowledge of the workload to make the difference between a smooth-running, efficient system and an unscalable, erratically performing system.
To read Weisner's complete article, see Achieving Near-Linear Scalability Using Solaris OS on NUMA Architectures.