System News
A Survey of Nine Years' Worth of Benchmarking Results
Reveals Room for Improvements
October 8, 2009
Volume 140, Issue 1

An accurate method of conveying a file or storage system's performance is by using at least one macro-benchmark or trace, as well as several micro-benchmarks.

If you have ever wondered about the validity of benchmarking findings, you will find your skepticism confirmed by the article "Notes on a Nine Year Study of File System and Storage Benchmarking" by Avishay Traeger of IBM Haifa Research Lab and Erez Zadok of Stony Brook University.

The authors, suspicious themselves, put it this way: "Our suspicions were confirmed. We found that most popular benchmarks are flawed, and many research papers used poor benchmarking practices and did not provide a clear indication of the system's true performance. We evaluated benchmarks qualitatively as well as quantitatively: we conducted a set of experiments to show how some widely used benchmarks can conceal or overemphasize overheads. Finally, we provided a set of guidelines that we hope will improve future performance evaluations."

The authors concede that benchmarking is no easy task and point out the further difficulty involved in benchmarking file and storage systems, where complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is rather difficult to analyze. Moreover, as they note, systems have different features and optimizations, so no single benchmark is always suitable. Lastly, the large variety of workloads that these systems experience in the real world also adds to this difficulty.

One of the chief difficulties in benchmarking is the reporting of results, which must be clear, detailed and verifiable, the authors contend. They report having surveyed how well the conference papers (which included some of their own) performed these tasks. They established minimum criteria for the design of reliable benchmarking exercises that include length of the test run, number of times the exercise is repeated, and the use of some metric of statistical dispersion (e.g., standard deviation, confidence intervals, quartiles).
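The dispersion metrics named above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `summarize`, the sample timings, and the choice of a 95% Student's t confidence interval are all assumptions made for illustration.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t distribution, keyed by
# degrees of freedom (n - 1). Standard table values; only a few run
# counts are covered to keep the sketch short.
T_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}


def summarize(samples):
    """Return mean, sample standard deviation, and 95% CI half-width.

    `samples` holds elapsed times from repeated runs of one benchmark.
    The name and return shape are hypothetical, not from the paper.
    """
    n = len(samples)
    if n - 1 not in T_95:
        raise ValueError("this sketch supports %s runs only"
                         % sorted(k + 1 for k in T_95))
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    half_width = T_95[n - 1] * stdev / math.sqrt(n)
    return {"n": n, "mean": mean, "stdev": stdev, "ci95": half_width}


# Made-up elapsed times (seconds) for ten runs of a hypothetical benchmark.
elapsed_seconds = [61.2, 60.8, 62.1, 61.5, 60.9, 61.7, 61.0, 61.4, 62.0, 61.3]
summary = summarize(elapsed_seconds)
print("n=%d mean=%.2fs stdev=%.2fs 95%% CI: +/-%.2fs"
      % (summary["n"], summary["mean"], summary["stdev"], summary["ci95"]))
```

Reporting the run count, the mean, and the interval together lets a reader judge whether a difference between two systems is larger than the run-to-run noise.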

The paper reports that, "In the surveyed papers, approximately 29% of benchmarks ran for less than one minute, which is generally too short to provide accurate results at steady state. Further, about half of the papers did not specify how many times they ran a benchmark, and less than 20% ran the benchmark more than five times. [The authors] recommend at least ten data points to provide a clear picture of the results. Finally, only about 45% of the surveyed papers included any mention of statistical dispersion. In terms of accurately portraying the behavior of the system, about 38% of papers used only one or two benchmarks in their performance analysis. This is generally not adequate for providing a complete picture."

The survey included descriptions and qualitative analyses of every macro-benchmark used in the surveyed papers, as well as other benchmarks that the authors deemed worthy of discussion. The authors also conducted experiments to perform a more quantitative analysis of two very popular benchmarks: a compile benchmark and Postmark (a mail server workload).

The authors modified the Linux ext2 file system to slow down certain operations, calling the new file system Slowfs. A compile benchmark measures the time taken to compile a piece of software. For OpenSSH compilations, the authors slowed down Slowfs's read operations (the most time-consuming operation for this benchmark) by up to 32 times, and yet the largest elapsed time overhead observed was only 4.5%.
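A back-of-the-envelope calculation suggests why the compile benchmark hides so much. The Python sketch below is not from the paper. It assumes a simple additive model in which elapsed time is read time plus everything else, so slowing reads by a factor k adds f * (k - 1) to the elapsed time, where f is the share of the original run spent in reads. It also assumes that the 4.5% maximum overhead was observed at the 32x slowdown. The function names are hypothetical.

```python
def predicted_overhead(read_fraction, slowdown):
    """Relative elapsed-time overhead under the additive model.

    If reads take `read_fraction` of the original run and are made
    `slowdown` times slower, the run grows by read_fraction * (slowdown - 1).
    """
    return read_fraction * (slowdown - 1)


def implied_read_fraction(overhead, slowdown):
    """Invert the model: share of time spent in reads, given the overhead."""
    return overhead / (slowdown - 1)


# Figures quoted in the article: reads slowed by up to 32 times,
# largest elapsed-time overhead 4.5% (assumed here to occur at 32x).
fraction = implied_read_fraction(0.045, 32)
print("implied share of elapsed time spent in reads: %.3f%%" % (fraction * 100))
```

Under these assumptions, reads would account for well under 1% of the compile's elapsed time, which is consistent with the authors' point that such a benchmark can conceal large file system overheads. Caching and overlap between CPU work and I/O would change the exact figure, so this is only a rough illustration.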

For the Postmark experiments, the authors used three different workload configurations derived from publications and ran each on both ext2 and Slowfs. This experiment produced two findings. First, Postmark's configuration parameters can cause large variations even on ext2 alone: run times varied from 2 to 214 seconds, with the 2-second configuration performing no I/O at all. This problem is aggravated by the fact that few papers report all parameters: in the authors' survey, only 5 out of 30 papers did so. Second, some configurations showed the effects of Slowfs more than others.

The paper recommends that, with the current set of available benchmarks, an accurate method of conveying a file or storage system's performance is to use at least one macro-benchmark or trace, as well as several micro-benchmarks. Macro-benchmarks and traces are intended to give an overall idea of how the system might perform under some workload, the authors write. If traces are used, then special care should be taken with regard to how they are captured and replayed, and how closely they resemble the intended real-world workload. In addition, they continue, micro-benchmarks can be used to help understand the system's performance, to test multiple operations to provide a sense of overall performance, or to highlight interesting features about the system (such as cases where the system performs particularly well or poorly).

The authors maintain that the current state of performance evaluations has much room for improvement. Because computer science is still a relatively young field, they argue, the experimental evaluations need to move further in the direction of precise science. One part of the solution is that standards clearly need to be raised and defined. This will have to be done both by reviewers putting more emphasis on a system's evaluation, and by researchers conducting experiments, they note. Another part of the solution is that this information needs to be better disseminated to all. The final aspect of the solution to this problem is creating standardized benchmarks, or benchmarking suites, based on open discussion among file system and storage researchers, the paper suggests.

While this article focused on benchmark results that are published in venues such as conferences and journals, the authors think that standardized industrial benchmarks could also be subjected to scrutiny to determine how effective these benchmarks are, and how the standards shape the products that are being sold today (for better or worse).

Since this article was published in May 2008, the authors have held a workshop on storage benchmarking at UCSC and presented a BoF session at the 2009 7th USENIX Conference on File and Storage Technologies (FAST). They have created a mailing list for information on future events, as well as discussions. More information can be found on their website.

More Information

A nine year study of file system and storage benchmarking, Avishay Traeger, Erez Zadok, N. Joukov, and C. P. Wright, ACM Transactions on Storage, 4(2), May 2008.

File and Storage System Benchmarking Portal
