...
examines the utilisation of UltraSPARC T1 and UltraSPARC T2 cores and attempts to determine whether performance would benefit from assigning work to fewer or more virtual processors.
The principal question Gove examines is whether running fewer threads on a processor's core would improve application latency. He also considers the case of idle threads to determine whether the core itself, under those conditions, would have sufficient instruction issue capacity to accommodate additional threads.
Anyone with superuser privileges can easily determine the system-wide instruction issue rate using the cpustat command, Gove asserts, providing code samples for collecting instruction count data once every second for 10 seconds. Knowing the instruction count helps to determine whether a core is fully saturated, Gove explains, which in turn shows whether spare capacity exists for additional threads or whether reducing the number of active threads is an alternative route to improved latency.
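As a rough illustration of how per-second instruction counts translate into a saturation estimate, the Python sketch below divides the instructions issued in one interval by the issue slots the core had available. It is not Gove's script: the function name, the clock rate, and the sample counts are hypothetical, and the issue width of one instruction per cycle is an assumption to adjust for the processor being measured.

```python
def issue_utilisation(instr_counts, clock_hz, interval_s=1.0, issue_width=1):
    """Fraction of a core's issue slots used during one sampling interval.

    instr_counts: instructions issued by each hardware thread on the core
    clock_hz:     core clock frequency in cycles per second
    interval_s:   length of the sampling interval in seconds
    issue_width:  instructions the core can issue per cycle (assumed value)
    """
    slots = clock_hz * interval_s * issue_width
    return sum(instr_counts) / slots


# Hypothetical one-second sample: four threads on one core clocked at 1.2 GHz.
sample = [150_000_000, 140_000_000, 160_000_000, 150_000_000]
util = issue_utilisation(sample, clock_hz=1_200_000_000)
print(f"core issue utilisation: {util:.0%}")  # prints "core issue utilisation: 50%"
```

A value close to 100% would suggest a saturated core, while a low value would suggest room for more threads.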
Unlike on traditional processors, Gove points out, removing stall cycles from a single thread will not necessarily produce a direct improvement in throughput on either the UltraSPARC T1 or the UltraSPARC T2 processor, though thread latency may improve. Consequently, knowing what contributes to a stall is useful, as is being able to estimate each event's contribution as the product of the event count and the estimated cost per event. Gove presents instructions for estimating stall budget usage for both processors. He includes a simple script that runs an application multiple times, calculating the number of cycles that various events contribute to processor stall.
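The "event count times cost per event" estimate can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the script Gove provides: the event names, counts, per-event cycle costs, and total cycle count below are all made-up values.

```python
def stall_budget(event_counts, cost_per_event, total_cycles):
    """Estimate the share of total cycles attributable to each stall event.

    Each event's stall cycles are approximated as count * estimated cost.
    """
    budget = {}
    for event, count in event_counts.items():
        stall_cycles = count * cost_per_event[event]
        budget[event] = stall_cycles / total_cycles
    return budget


# Hypothetical measurements for one run of an application.
counts = {"l1_dcache_miss": 2_000_000, "l2_miss": 100_000}
costs = {"l1_dcache_miss": 20, "l2_miss": 100}  # assumed cycles per event
total = 100_000_000                              # cycles in the whole run

budget = stall_budget(counts, costs, total)
for event, share in budget.items():
    print(f"{event}: {share:.0%} of cycles")
```

With these invented figures the first-level misses account for 40% of the cycles and the second-level misses for 10%, which shows how a small number of expensive events can still claim a noticeable share of the budget.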
With the UltraSPARC T1 processor, Gove demonstrates that half the stall time results from loads that miss the on-chip cache (though the data is resident in the second-level cache). When a small number of loads also miss the second-level cache, so that data must be fetched from memory, they add significant further stall time. From the standpoint of instruction count, Gove finds that there is nearly a 100 percent improvement in performance if all the threads on a core are active. On a traditional processor, he concludes, memory stall time could be converted into performance, whereas on a CMT processor one encounters the upper limit imposed by sharing cycles between multiple threads.
In estimating stall budget usage for the UltraSPARC T2 processor, Gove finds that most of the time is spent in loads of data that is resident in the second-level cache, with the application getting about 16 percent of the total cycles, which is still less than the theoretical peak of 25 percent of the total cycles. As with the UltraSPARC T1 processor, reducing cache misses on the UltraSPARC T2 processor could improve performance further. One issue that needs to be taken into consideration is that the memory pipe is shared among eight threads, so the peak performance of the application depends on there being one load for every two instructions. Otherwise the application could be limited to issuing one instruction every eight cycles.
As a result of his research, Gove concludes that, on a CMT processor, reducing stall time leads to performance gains only up to the point at which the process consumes 25 percent of the instruction issue budget. Reductions in stall events beyond that point are unlikely to yield significant performance gains, so further efforts to reduce instruction stalls are wasted. Instead, he recommends the following three ways of improving performance on CMT processors:
Use more threads (the most effective approach), with each additional thread getting a new instruction issue budget, making it possible for two threads to potentially do twice the work of a single thread.
Reduce instruction count since, with the UltraSPARC T1 and UltraSPARC T2 processors, each thread gets to issue a single instruction at a time, meaning that the instruction count corresponds directly to the length of time it takes to complete the task.
Reduce stall time (the least effective method), which might not directly improve performance because stall time on one thread is an opportunity for another thread to do work. Gove notes that, when the core is issuing its peak instruction rate, there are no possible performance gains from reducing cycles spent on stall events. He concedes that reducing stall time on one of the threads might enable that thread to get its fair share of the instruction budget, making it potentially possible to reduce the latency of one of the threads. Even so, he concludes this will not have an impact on the throughput of the system.
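The relative effectiveness of these three approaches can be illustrated with a toy model in Python. It is a deliberate simplification and not something taken from Gove's article: it assumes each thread wants to issue in every cycle it is not stalled, that each thread is capped at a fixed 25 percent share of the pipeline (the figure quoted above), and that four threads share one pipeline. The function name and the stall fractions are hypothetical.

```python
def pipeline_throughput(stall_fractions, share_cap=0.25):
    """Toy model of instructions per cycle issued by one shared pipeline.

    stall_fractions: for each active thread, the fraction of time it is stalled
    share_cap:       assumed maximum share of issue slots a single thread can use
    """
    return sum(min(share_cap, 1.0 - stall) for stall in stall_fractions)


# One thread that is stalled 87.5% of the time.
one_thread = pipeline_throughput([0.875])

# Use more threads: a second identical thread doubles the work done.
two_threads = pipeline_throughput([0.875, 0.875])

# Reduce stall time: helps only until the thread reaches its 25% share.
less_stall = pipeline_throughput([0.5])
no_stall = pipeline_throughput([0.0])

# Saturated pipeline: four busy threads gain nothing from fewer stalls.
saturated = pipeline_throughput([0.5] * 4)
saturated_less_stall = pipeline_throughput([0.25] * 4)

print(one_thread, two_threads)          # prints "0.125 0.25"
print(less_stall, no_stall)             # prints "0.25 0.25"
print(saturated, saturated_less_stall)  # prints "1.0 1.0"
```

In this model, adding a thread always adds work until the pipeline is full, whereas removing stalls stops paying off once a thread reaches its share, which is consistent with the ordering of the recommendations above.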