What performance characteristics should we expect from ZFS Storage Appliance dedup in the wake of the Sun Storage 7000 2010.Q1 release, asks Roch "Mars" Bourbonnais rhetorically in his blog post Dedup Performance Considerations. By way of an answer, he reviews the basics of ZFS dedup operations and then examines their implications for performance.
Among the topics he considers briefly are the dedup code, block size, tar files and the backup of active databases. Concerning databases, he suggests that using an 8K block size in the dedup target dataset instead of the 128K default could conceivably yield up to a 10X better deduplication ratio.
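For readers who want to try this, the block size is controlled by the ZFS recordsize property of the target dataset. A minimal sketch of applying the advice might look like the following (the pool and dataset names tank/dbdedup are hypothetical):

    zfs create tank/dbdedup
    zfs set recordsize=8k tank/dbdedup     # match the database block size instead of the 128K default
    zfs set dedup=on tank/dbdedup          # enable deduplication on the target dataset
    zfs get recordsize,dedup tank/dbdedup  # verify the settings

Note that recordsize only affects blocks written after the property is set, so it should be configured before the database files are copied in.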
Roch suggests that checksum generation and validation (ZFS uses the cryptographically strong SHA256 checksum to validate disk I/O) is not a significant cost, since the time to compute a checksum is small compared to the 5-10 ms it typically takes to read or write the data on disk.
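As a point of reference, the checksum and dedup behavior are exposed as dataset properties; a hedged example of inspecting and tightening them (dataset name assumed from the earlier sketch):

    zfs get checksum,dedup tank/dbdedup   # dedup=on relies on the SHA256 checksum to match blocks
    zfs set dedup=verify tank/dbdedup     # optionally add a byte-for-byte verify on checksum matches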
Concerning the read code path, the author proposes that very little change should be observed. The fact that a read happens to hit a block which is part of the dedup table is not relevant to the main code path, he notes, adding that the biggest effect is the stronger checksum function that must be invoked after a read I/O, which adds at most an extra 1 ms to a 128K disk I/O. If a subsequent read is for a duplicate block that happens to be in the pool's ARC cache, however, then instead of waiting for a full disk I/O, only a much faster in-memory copy of the duplicate block is necessary, Roch points out.
In this case, each filesystem can then work independently on its own copy of the data in the ARC cache, just as it does without deduplication. Roch adds that synchronous writes are also unaffected in their interaction with the ZIL. The blocks written to the ZIL have a very short lifespan and are not subject to deduplication, so the synchronous write path is mostly unaffected unless the pool itself cannot absorb the sustained rate of incoming changes for tens of seconds, he observes.
Similarly, for asynchronous writes, which interact with the ARC caches, the dedup code has no effect unless the pool's transaction group itself becomes the limiting factor, he explains. The effect of dedup therefore shows up during the pool transaction group updates, which gather all modifications made in the last few seconds and commit them as a large transaction group (TXG).
While a TXG is running, applications are not directly affected except possibly by competition for CPU cycles; they mostly continue to read from disk, issue synchronous writes to the ZIL, and issue asynchronous writes to memory. The biggest effect comes when the incoming flow of work exceeds the TXG's capacity to commit data to disk. In that case, reads and writes will eventually be held up by the write-throttling code that prevents ZFS from consuming all of memory, Roch assures his readers.
Turning to the ZFS TXG, there are two operations of interest: the creation of a new data block and the removal (free) of a previously used block, the author continues. Because ZFS operates under a copy-on-write (COW) model, any modification to an existing block actually represents both the creation of a new data block and the free of a previously used one (unless a snapshot was taken, in which case there is no free).
For file shares, this concerns rewrites of existing files; for block LUNs (FC and iSCSI), this concerns most writes except the initial one (the very first write to a logical block address, or LBA, allocates the initial data; subsequent writes to the same LBA are handled using COW), Roch continues. To create a new application data block, ZFS computes the checksum of the block, as it does normally, and then looks in the dedup table for a match based on that checksum and a few other bits of information.
On a dedup table hit, only a reference count needs to be increased, and such changes to the dedup table will be stored on disk before the TXG completes, the author states. Many DDT entries are grouped in a single disk block, and compression is involved. The big win occurs when many entries in a block match writes during one TXG: a single 16K I/O can then replace tens of larger I/Os.
As for free operations, ZFS internally holds the referencing block pointer, which contains the checksum of the block being freed, so there is no need to read or recompute the checksum of the data being freed. With the checksum in hand, ZFS looks up the entry in the dedup table and decrements the reference counter. If the counter is still non-zero, nothing more is necessary (just the dedup table sync); if the block ends up without any reference, it is actually freed.
The dedup table (DDT) itself is an object managed by ZFS at the pool level. The table is considered metadata, and its elements are stored in the ARC cache; up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. When the dedup table actually fits in memory, enabling dedup is expected to have a rather small effect on performance, Roch suggests. But when the table is many times larger than the allotted memory, the lookups necessary to complete the TXG can cause write throttling to be invoked earlier than for the same workload running without dedup.
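One rough way to check whether metadata (including the DDT) is pressing against that limit on a Solaris-era system is to compare the ARC metadata kstats; a sketch, assuming the zfs:0:arcstats kstat names available on OpenSolaris:

    kstat -p zfs:0:arcstats:arc_meta_used    # bytes of ARC currently holding metadata
    kstat -p zfs:0:arcstats:arc_meta_limit   # the zfs_arc_meta_limit ceiling (~25% of memory)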
When an L2ARC is present, the DDT entries are prime candidates for the secondary cache. Note that independent of the size of the dedup table, read-intensive workloads in highly duplicated environments are expected to be serviced with fewer IOPS and at lower latency than without dedup. Roch adds that whole-filesystem removal and large-file truncation are operations that can free large quantities of data at once; when the dedup table exceeds the allotted memory, these operations, which are more complex with deduplication, can affect the amount of data going into each TXG and the write-throttling behavior.
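Adding an L2ARC is a single zpool operation; a minimal sketch (the pool name tank and device name c2t5d0 are placeholders):

    zpool add tank cache c2t5d0   # attach a read-optimized SSD as L2ARC for pool 'tank'
    zpool status tank             # the device appears under a 'cache' section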
Finally, Roch takes up the size of the dedup table, noting that the command zdb -DD run on a pool shows the size of DDT entries. He reports that in one of his experiments it showed about 200 bytes of core memory per table entry. If each unique block is associated with 200 bytes of memory, then 32GB of RAM could reference roughly 20TB of unique data stored in 128K records, or a bit more than 1TB of unique data in 8K records. If there is a need to store more unique data than these ratios allow, he advises allocating some large read-optimized SSD to hold the DDT; the DDT lookups are small random I/Os, which are handled very well by current-generation SSDs.
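The arithmetic behind those ratios is straightforward; the pool name below is a placeholder, and the 200-byte figure is Roch's measured estimate rather than a fixed constant:

    zdb -DD tank   # dumps the dedup table statistics, including entry counts and on-disk/in-core sizes
    # Back-of-the-envelope sizing with ~200 bytes of core memory per DDT entry:
    #   32 GB RAM / 200 B per entry  ~= 170 million unique blocks referenced
    #   170M entries x 128K records  ~= 20 TB of unique data
    #   170M entries x   8K records  ~= 1.3 TB of unique data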
The first prerequisite for enabling dedup is, of course, having duplicate data to begin with; where possible, the procedures that generate that duplication could be reconsidered. The use of ZFS clones is actually a much better way to present logically duplicate data to multiple users, since it does not require a dedup hash table, Roch recommends, adding that when operating conditions do not allow the use of ZFS clones and the data is highly duplicated, the ZFS deduplication capability is a great way to reduce the volume of stored data.
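A clone-based layout shares blocks through snapshots rather than through the DDT; a brief sketch (the dataset names tank/golden, tank/user1 and tank/user2 are hypothetical):

    zfs snapshot tank/golden@base           # freeze the master image
    zfs clone tank/golden@base tank/user1   # each clone shares unmodified blocks with the snapshot
    zfs clone tank/golden@base tank/user2   # no dedup hash table is needed for this sharing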