The recently released 2009.Q3 software includes a number of important changes that affect how these operations work, the blog reports.
By way of example, the blog uses a fairly typical cluster configuration: a single pool consisting of the disks and log devices in four J4400 JBODs, a pair of network interfaces through which clients access that storage via NFS, and a network interface private to each head for administration only. The heads are referred to simply as A and B, and the pool and service interfaces in the example are assigned to head A.
The resources of interest, the blog continues, must be acted upon in dependency order; for example, one must open the ZFS pool and mount and share all of its shares before safely bringing any network interfaces up that clients expect to use to access those shares. Otherwise clients could attempt to access the shares before they are available, receiving a response that would result in stale filehandle errors.
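The ordering requirement can be illustrated with a short sketch. It assumes a dependency map keyed by hypothetical resource names (the "class/name" labels and the two disksets are illustrative, not the appliance's actual identifiers) and uses Python's standard graphlib module rather than anything the blog describes.

```python
from graphlib import TopologicalSorter

# Hypothetical resource names. Each key maps to the set of resources
# that must be imported before it.
DEPENDS_ON = {
    "diskset/jbod-0": set(),
    "diskset/jbod-1": set(),
    "zfs/pool-0": {"diskset/jbod-0", "diskset/jbod-1"},
    "nas/pool-0": {"zfs/pool-0"},   # mount and share the pool's shares
    "net/nge0": {"nas/pool-0"},     # client-facing interface comes up last
    "net/nge1": {"nas/pool-0"},
}


def import_order(depends_on):
    """Return resources so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def export_order(depends_on):
    """Return the reverse of the import order, for giving resources up."""
    return list(reversed(import_order(depends_on)))
```

With this map, the pool always precedes its shares, and the shares always precede the interfaces that clients use to reach them.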
The blog explains the dynamics of takeover, which involves a time-dependent arbitration process to protect user data. It also describes the use of zone locks, which any head attempting to enter the OWNER state acquires and drops in a defined order; when the locks are not continually reacquired, the result is a reboot.
The blog continues with the explanation that, when head A resumes functioning, it will rejoin the cluster, meaning that the current list and state of all resources will be transferred from head B over the intracluster I/O subsystem. Head A will not, however, take control of any of the singleton resources or their symbiotes; it will import only its own private resource ak:/net/nge2 as it transitions into the STRIPPED state following rejoin. This behavior prevents ping-ponging and allows the administrator to verify that the restored head has had any hardware issues addressed before returning it to service.
In a takeover, the head taking over invokes the diskset class's import function for each of the disksets, then the zfs class's import function for pool-0, then the nas class's import function for pool-0, then the net class's import function for nge0, and so on until it has attempted to import all of these resources. Note that if a failure occurs, the head simply marks the resource faulted and proceeds. Since the peer is down, it must make a best effort.
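The best-effort behavior described above might be sketched as follows. The function name, the resource labels, and the "imported"/"faulted" state strings are all assumptions made for illustration; the blog summary does not describe the actual interfaces.

```python
def takeover_import(ordered_resources, import_fn):
    """Attempt to import every resource in the order given.

    A failure marks that resource faulted and the walk continues, since
    there is no peer left to hand the resource back to.
    """
    state = {}
    for name in ordered_resources:
        try:
            import_fn(name)              # hypothetical per-class import hook
            state[name] = "imported"
        except Exception:
            state[name] = "faulted"      # record the fault and keep going
    return state


# Hypothetical resource labels, already in dependency order.
TAKEOVER_ORDER = ["diskset/jbod-0", "zfs/pool-0", "nas/pool-0", "net/nge0"]
```

Calling `takeover_import(TAKEOVER_ORDER, some_import_fn)` returns a per-resource state map, so a failed import is visible without having stopped the rest of the walk.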
With head A rejoined, the blog continues, a failback can be initiated from head B. During failback, head B will walk the list of resources in reverse dependency order, invoking the resource's class's export function for each resource that is not owned by head B. If any of these functions fails, head B will generate an alert and reboot itself. This is done to ensure that the cluster is in a consistent, well-defined state: it would not be safe for head A to import a resource that is still under the control of head B, nor would it be possible for head A to enter a defined cluster state without importing all of the resources assigned to it.
Similarly, if head B attempted to re-import the resource that could not be exported, that operation or some other re-import required by it could (and likely would) fail as well, making matters worse. Instead, B's reboot triggers a takeover by A, and consistency is maintained. Assuming a successful export, head B will now perform an intracluster RPC to head A instructing it to begin importing its resources. In response, head A will walk the list of resources in dependency order, invoking each resource's class's import function for each resource assigned to it (but not any assigned to head B). If any of these functions fails, head A will generate an alert and reboot itself, again triggering takeover from head B and maintaining consistency, the blog notes.
The blog goes on to consider disksets, along with ZFS pool resources, NAS symbiotes, net class resources, shadow migration, and SMB.
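The two halves of failback can be sketched as a pair of functions. This is only an illustration under stated assumptions: the `RebootRequired` exception stands in for "generate an alert and reboot", and the ownership map, the resource labels (including net/nge3), and the function names are hypothetical rather than taken from the appliance software.

```python
class RebootRequired(Exception):
    """Stands in for 'generate an alert and reboot this head'."""


def failback_export(import_order, owner_of, local_head, export_fn):
    """Run on the head giving resources back (head B in the text)."""
    exported = []
    for name in reversed(import_order):      # reverse dependency order
        if owner_of[name] == local_head:     # keep what this head owns
            continue
        try:
            export_fn(name)
        except Exception as exc:
            raise RebootRequired(f"export of {name} failed") from exc
        exported.append(name)
    return exported


def failback_import(import_order, owner_of, local_head, import_fn):
    """Run on the head receiving resources (head A in the text)."""
    imported = []
    for name in import_order:                # dependency order
        if owner_of[name] != local_head:     # skip the peer's resources
            continue
        try:
            import_fn(name)
        except Exception as exc:
            raise RebootRequired(f"import of {name} failed") from exc
        imported.append(name)
    return imported


# Hypothetical resources in dependency order, with their assigned owners.
ORDER = ["zfs/pool-0", "nas/pool-0", "net/nge0", "net/nge3"]
OWNER = {"zfs/pool-0": "A", "nas/pool-0": "A", "net/nge0": "A", "net/nge3": "B"}
```

Unlike takeover, neither function continues past a failure: the exception stops the walk, and the peer's subsequent takeover is what restores a well-defined state.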
In conclusion, the blog notes how consideration of takeover and failback time will become increasingly relevant as overall performance improves. It recommends users conduct tests using their own real-world configurations to determine the duration dictated by their respective protocols and the parameters associated with particular shares.