IN THIS ARTICLE
Outlines how to use failover and failback with Replication in Qumulo Core 2.12.0 and above
REQUIREMENTS
- Admin privileges required
- Source and target clusters running the same version of Qumulo Core, 2.12.0 or above
If you are running a version of Qumulo Core below 2.12.0, check out the Replication: Failover and Failback with 2.11.4 and below article for details on Qumulo's legacy failover/failback UI.
Failover comes in two forms: planned and unplanned. While everyone would prefer to encounter only the former, Replication makes recovering from either scenario a breeze.
- Planned failover: If you're performing a planned failover, set the source directory on the primary cluster to read-only and wait for a new replication job to start and finish, so that no writes are lost. To confirm that a job has finished, first verify that any blackout windows are disabled, then check that the relationship's Recovery Point time is after the time you set the directory to read-only. This process ensures that the target directory is in a point-in-time consistent state matching the last source snapshot taken.
- If your relationship uses Snapshot Policy Replication mode, after you set the source directory to read-only, change the replication mode to Snapshot Policy with Continuous Replication. This change initiates a new replication job, which ensures that any changes made since Qumulo Core took the last policy snapshot replicate to the target cluster.
- After the job finishes (or after a failback), you can reconfigure the relationship to Snapshot Policy Replication mode.
- Unplanned failover: If the last replication job was incomplete when the primary cluster became unavailable, it may have left the target directory in an indeterminate state, with some files truncated or otherwise inconsistent.
In either case, you will need to make the target directory on the secondary cluster writable in order to continue accepting writes.
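The planned-failover readiness check described above reduces to a timestamp comparison. Below is a minimal Python sketch, assuming you have the relationship's Recovery Point time and the read-only cutover time as ISO 8601 strings (the function name and inputs are illustrative, not part of any Qumulo API):

```python
from datetime import datetime

def replication_caught_up(recovery_point: str, read_only_since: str) -> bool:
    """Return True when the relationship's Recovery Point is newer than the
    moment the source directory was set to read-only, meaning every write
    made before the cutover has been replicated and failover is safe."""
    rp = datetime.fromisoformat(recovery_point)
    ro = datetime.fromisoformat(read_only_since)
    return rp > ro

# Directory set read-only at 12:00 UTC; Recovery Point advanced to 12:05 UTC
print(replication_caught_up("2020-01-01T12:05:00+00:00",
                            "2020-01-01T12:00:00+00:00"))  # True: safe to fail over
```

If the check returns False, a replication job covering the final writes has not yet finished (or a blackout window is blocking it), and failing over would lose data.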
Failover to the Secondary Cluster
Follow the instructions below to perform a failover with Replication:
- In the secondary cluster's Web UI, hover over the Cluster menu and select Replication.
- Click the menu on the Replication Relationship listing.
- Select Make Target Writable.
- In the resulting dialog, click Yes, Make Target Writable.
- Wait for the directory to be reverted to the last recovery point. Click Details to monitor progress.
- Once the Status field indicates that the revert is complete, the target directory is consistent and writable.
- Migrate any of the following configuration to the secondary cluster if it doesn't already exist there:
- NFS exports
- SMB shares
- AD/LDAP server(s)
- Snapshot policies
- Remount all clients previously connected to the primary cluster that require access to the secondary cluster.
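The configuration-migration step above amounts to a gap analysis between the two clusters. The sketch below is illustrative only (the category names and entries are assumptions, not output from any Qumulo CLI): it reports which entries exist on the primary but are missing on the secondary.

```python
def missing_config(primary: dict, secondary: dict) -> dict:
    """For each configuration category (NFS exports, SMB shares, and so on),
    list the entries present on the primary but absent from the secondary.
    Inputs map category name -> list of entry names."""
    return {
        category: sorted(set(primary.get(category, [])) - set(secondary.get(category, [])))
        for category in primary
    }

primary = {"nfs_exports": ["/home", "/projects"], "smb_shares": ["home", "projects"]}
secondary = {"nfs_exports": ["/home"], "smb_shares": []}
print(missing_config(primary, secondary))
# {'nfs_exports': ['/projects'], 'smb_shares': ['home', 'projects']}
```

Anything reported as missing must be created on the secondary cluster before clients are remounted, or those clients will lose access paths they relied on.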
You've successfully performed a failover! You now have two failback options:
- Resume replication in the original direction: this option resumes replication with the original source and target clusters unchanged. This is a good choice if you did not make any writes to the secondary cluster during the failover interval. To use this path, follow the steps below in Failback: Re-Enable in the Original Direction.
- Resume replication in the reverse direction: this option resumes replication using the original target cluster as the new source. This allows you to retain any writes made to the secondary cluster during the failover. To use this path, follow the steps later on this page under Failback: Re-Enable with Reversed Sync.
Failback: Re-Enable in the Original Direction
If you didn't make any writes to the secondary cluster during the failover (or do not wish to retain any writes that were made), you can immediately re-establish Replication in the original direction and discard those writes.
You may want to do this if either of the following applies:
- The primary cluster came back online before client write traffic was redirected to the secondary cluster
- The writes to the secondary cluster were done as part of a failover test or disaster recovery readiness test and can therefore be discarded
To re-enable Replication in this case, follow the steps outlined below:
- Log into the secondary cluster's Web UI.
- On the Replication page, click the menu on the desired relationship listing and select Reconnect Relationship....
- A dialog will display the current and reconnected relationship diagram. To proceed with the reconnection, click Yes, Reconnect.
- Replication will take ownership of the directory and sync the current version of the source directory to the target, overwriting any changes on the target directory.
Note: If you are running Qumulo Core 2.12.4 or below, the initial replication job after reconnecting the relationship performs a full tree scan of the replication directories on both sides to bring the target up to sync. If you are running 2.12.5 and above, the initial replication job brings both sides up to sync without a full tree scan: files that were replicated successfully and haven't changed since the original replication are not re-sent.
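The 2.12.5+ reconnect behavior can be illustrated with a toy model: a file is re-sent only if it was never replicated or its content changed since the last successful replication. This is a sketch of the idea (using a content digest to stand in for change detection), not Qumulo's actual implementation:

```python
def files_to_resend(source_files: dict, replicated_files: dict) -> list:
    """Toy model of an incremental reconnect: inputs map path -> content
    digest. A path is re-sent only when the target has no copy of it, or
    the target's copy no longer matches the source."""
    return sorted(
        path for path, digest in source_files.items()
        if replicated_files.get(path) != digest
    )

source = {"/a.txt": "h1", "/b.txt": "h2", "/c.txt": "h3-new"}
target = {"/a.txt": "h1", "/b.txt": "h2", "/c.txt": "h3-old"}
print(files_to_resend(source, target))  # ['/c.txt']
```

In the full-tree-scan model (2.12.4 and below), every path on both sides is examined to compute this set; the incremental model arrives at the same answer without walking unchanged portions of the tree.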
Failback: Re-Enable with Reversed Sync
If you wish to retain any writes made to your secondary cluster during the failover, you can temporarily reverse the Replication Relationship. This action is initiated on the target cluster and results in the previous target directory becoming the new source, and the previous source directory becoming the new target. After the reversal completes, the relationship remains disconnected and replication does not resume. To resume replication after the reversal, reconnect the relationship from the new target cluster. Keep in mind that any changes made on the primary (now the target) after the last successful replication will be overwritten.
NOTE: Prior to Qumulo Core 2.13.0, this process reset the relationship’s blackout windows and the map_local_user_ids_to_nfs_ids flag, requiring users to reconfigure them manually after failback before resuming replication. In version 2.13.0 or higher, these options are restored to their original configuration after failback, so they do not require manual user intervention.
Optional: If you wish to save the data that was written to your primary cluster but not replicated before the primary went down, please take the following steps before you re-enable with reversed sync:
- Take a manual snapshot of the affected directories. See our Snapshots: Deep Dive article for more details.
- Use qq snapshot_diff to compare the manual snapshot with the last replicated snapshot. (You can find the last replicated snapshot by using qq snapshot_list_snapshots.)
- Move the files that were altered since the last fully-replicated snapshot to a different directory (outside of the Replication process).
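The optional save-your-writes procedure above hinges on identifying which paths changed since the last fully-replicated snapshot. Below is a hedged sketch that assumes diff entries shaped like {'path': ..., 'op': ...} — an illustrative shape for working with a diff listing, not the exact output format of qq snapshot_diff:

```python
def files_to_preserve(diff_entries: list) -> list:
    """Given diff entries between the last replicated snapshot and a later
    manual snapshot, return the paths that must be copied outside the
    replication directory before a reversed sync overwrites them.
    Deleted entries have no current content to save, so they are skipped."""
    return [e["path"] for e in diff_entries if e["op"] in ("CREATE", "MODIFY")]

diff = [
    {"path": "/projects/report.docx", "op": "MODIFY"},
    {"path": "/projects/new-data.csv", "op": "CREATE"},
    {"path": "/projects/old.tmp", "op": "DELETE"},
]
print(files_to_preserve(diff))  # ['/projects/report.docx', '/projects/new-data.csv']
```

Each returned path should be moved to a directory outside the Replication Relationship before you reconnect, since the reversed sync will overwrite the source tree with the secondary cluster's version.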
Once the data has been saved, follow the instructions below to perform a failback with Replication:
- Log into the secondary cluster's Web UI.
- On the target cluster, hover over the Cluster menu and select Replication.
- Click the menu in the Replication Relationship listing.
- Select Reverse Relationship....
- Enter the IP address and port number of the primary cluster in the Reverse Relationship form.
- Click the Reverse Relationship button.
- Switch to the primary cluster's Web UI. A relationship listing in the reverse direction should be present and show as disconnected.
- Still in the primary cluster's Web UI, click the menu and select Reconnect Relationship.
- Click Yes, Reconnect to begin replicating from the secondary cluster to the primary cluster.
Note: In Qumulo Core 2.12.4 or below, this initial replication job may take some time, varying with the size of your filesystem. If you are running 2.12.5 or above, the job's duration depends on the amount of changes made on your filesystem since the last successful replication. In either case, wait for the job to complete before proceeding to the next step so that subsequent replication jobs are shorter.
- Discontinue writes to the source directory on the secondary cluster.
Note: If you are running Qumulo Core 2.12.2 or above, you can do this by clicking the menu and selecting Set Source Directory Read-only....
- Start and complete a final replication job. The final job prior to the cutover should be as small as possible to minimize the window during which no I/O is allowed into the directory.
Now that the data has been copied back to the primary cluster, you can continue using this reversed replication relationship, or you can re-establish the original Replication direction.
To re-establish Replication:
- In the primary cluster's Web UI, hover over the Cluster menu and select Replication.
- Click the menu and select Make Target Writable to prepare for reversing the relationship again.
- Once the relationship is disconnected, click the menu and select Reverse Relationship to make the primary cluster the source again.
- Switch to the secondary cluster's Web UI and find the reversed relationship listing in the original direction.
- Re-configure your original blackout windows and Map Local User/Group IDs to associated NFS IDs configuration in the relationship (if used).
- Click the menu and select Reconnect Relationship.
- Remount all clients previously connected to the secondary cluster that require access to the primary cluster.
You should now be able to use failover and failback successfully with Replication in Qumulo Core 2.12.0 and above.