
Replication: Failover and Failback with 2.11.4 (and lower)

IN THIS ARTICLE 

Outlines how to use failover and failback with Continuous Replication in Qumulo Core releases 2.11.4 (and lower).

REQUIREMENTS

  • Admin privileges required
  • Source and target clusters running the same version of Qumulo Core

If you are running a version of Qumulo Core above 2.11.4, check out the Replication: Failover and Failback with 2.12.0 and above article for details on Qumulo's updated failover/failback UI.

DETAILS

This document has several sections corresponding to different releases of Qumulo Core. Refer to the section appropriate for the software version you are running.

Failover and Failback with Qumulo Core 2.11.2-2.11.4

Failover to the Secondary Cluster

To make the target directory on the secondary cluster available for writes, use the Make Target Writable option in the Actions menu on the target relationship listing. Once the process is complete, the target directory becomes read-write and the relationship can be deleted. If the relationship had incomplete replication data, the target directory is synchronized with the most recent recovery point.

Planned failover: You should disable Continuous Replication and wait for the last job to complete before deleting the relationship. This ensures that the target directory is in a point-in-time consistent state matching the last source snapshot taken.

As an extra precaution, you can make shares/exports read-only on the primary cluster to ensure that no writes are lost. You can also use the Make Target Writable action to ensure that the target is consistent with the latest replication, though there should be no difference.

Notes:

  • For a planned failover, set the source directory on the primary cluster to read-only and wait for the new replication job to begin and finish (so that no writes are lost).
  • If your relationship uses Snapshot Policy Replication mode, change the replication mode to Snapshot Policy with Continuous Replication after you set the source directory to read-only. This initiates a new replication job so that any changes made since Qumulo Core took the last policy snapshot replicate to the target cluster.
  • After the job finishes (or after a failback), you can reconfigure the relationship to use Snapshot Policy Replication mode.

Unplanned failover: If the last replication was incomplete before the primary cluster became unavailable, it may have left the target directory in an indeterminate state with some files truncated or otherwise inconsistent.

Follow the instructions below to perform a failover with Continuous Replication:

  1. Click the ellipsis (...) menu on the replication relationship listing.
  2. Select Make Target Writable.

    [Image: make_writable.png]

  3. Wait for the directory to be reverted to the last recovery point. Progress can be monitored by clicking Details.
  4. Once the relationship status displays Disconnected and target writable, click the ellipsis (...) menu and select Delete relationship.

    [Image: disconnected_writable.png]
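
The same failover can also be scripted with the qq CLI instead of the Web UI. The sketch below is illustrative only: the subcommand names (replication_list_target_relationship_statuses, replication_make_target_writable, replication_delete_target_relationship), their flags, and the cluster address are assumptions, so confirm the exact commands for your release with qq --help before relying on them.

# Assumes you have already authenticated with qq login; the host name is hypothetical.
# List target relationships on the secondary cluster and note the relationship ID.
qq --host secondary.example.com replication_list_target_relationship_statuses

# Make the target directory writable (reverts it to the latest recovery point).
qq --host secondary.example.com replication_make_target_writable --id <relationship-id>

# After the status shows Disconnected and target writable, delete the relationship.
qq --host secondary.example.com replication_delete_target_relationship --id <relationship-id>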

While the target directory is being made consistent, migrate any of the following configuration to the secondary cluster if it doesn't already exist there:

  • NFS exports
  • SMB shares
  • AD/LDAP server(s)
  • Snapshot policies
  • Quotas

Remount all clients that were previously connected to the primary cluster and now require access to the secondary cluster.
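
For NFS clients, the remount typically looks like the following; the cluster host name, export path, and mount point are placeholders, not values from your environment:

# Unmount the export that pointed at the primary cluster.
sudo umount /mnt/qumulo

# Mount the same path from the secondary cluster instead.
sudo mount -t nfs secondary-cluster.example.com:/replicated-dir /mnt/qumulo

SMB clients can simply be repointed at the equivalent share on the secondary cluster.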

Re-enable Continuous Replication after Failover

After you bring the primary cluster back online, you may want to re-enable Continuous Replication. Continuous Replication can be re-established immediately in the original direction (primary to secondary) if there is no need to keep the writes completed on the secondary cluster during the failover period.

You may want to do this if:

  • The primary cluster came back online before client write traffic was redirected to the secondary cluster
  • The writes to the secondary cluster were done as part of a test failover or DR readiness test and can therefore be discarded

To re-enable Continuous Replication in this case, you can simply re-create the original relationship. Replication will take ownership of the directory and sync the current version of the source directory to the target, overwriting any changes on the target directory.

Note that a warning dialog displays and asks you to confirm before proceeding:

[Image: authorize_relationship.png]

The initial replication job after re-creating the relationship will complete a full tree walk of the replication directories on both sides to bring the target into sync. Files that have been replicated successfully and haven't changed since the original replication will not be resent.

Failback to the Primary Cluster

To keep the writes made on the secondary cluster but continue using the original primary cluster, you can set up replication from the secondary to the primary cluster by recreating the relationship in that direction. Any changes made on the primary after the last successful replication will be overwritten. As stated above, the initial job performs a tree walk.

Follow the instructions below to perform a failback with Continuous Replication:

  1. Create a new replication relationship for each directory you would like to restore from the secondary cluster back to the primary cluster.

    [Image: create_relationship.png]

  2. Begin replicating from the secondary cluster to the primary cluster.
  3. Discontinue writes to the source directory (or directories) on the secondary cluster.
  4. Start and complete a final replication job. The final replication job prior to the cutover should be as small as possible to minimize the time during which no I/O is allowed into the directory.
  5. Delete the relationship for each directory failing back once the final replication job completes.
  6. Create the following data (if it doesn't already exist) on the primary cluster:
    • NFS Exports
    • SMB Shares
    • AD/LDAP Server(s)
    • Snapshot Policies
    • Quotas
  7. Remount all clients previously connected to the secondary cluster that require access to the primary cluster.
  8. Re-create all relationships (primary to secondary) to re-enable Continuous Replication.

Since the primary and secondary clusters were fully in sync when the relationship was deleted and the secondary cluster was in an inactive state, no data will be lost or overwritten on the secondary cluster.

Failover and Failback with Qumulo Core 2.11.1 and below

Failover to the Secondary Cluster

To use the secondary cluster for writes, users will need to delete the Replication relationship. Once the relationship is deleted, the target working directory is immediately in a read-write state.

Planned failover: Users should disable Continuous Replication and wait for the last job to complete before deleting the relationship. This ensures that the target directory is in a point-in-time consistent state (matching the last source snapshot taken). Users may optionally want to make shares/exports read-only on the primary cluster to ensure that no writes are lost.

Unplanned failover: If Replication was in progress before the primary cluster became unavailable, it may have left the target directory in an indeterminate state with some files truncated or otherwise corrupted. If this is unacceptable for the failover scenario, the best solution would be to manually rsync or robocopy the last replicated snapshot back to the working directory to match the last known point-in-time consistent copy of the data.

Follow the instructions below to perform a failover with Replication:

  1. Delete the Replication relationship for each directory you need to recover to put the secondary working directory in a read-write state:
    • Navigate to Cluster > Replication and click the Trash icon to delete.

      [Image: delete_replication.png]

  2. Retrieve the point-in-time-consistent version of the data from the most recent snapshot.
    • From the secondary cluster, find the most recent snapshot for each relationship by viewing the Cluster > Saved Snapshots section of the UI.
      TIP!
      Use the cluster name from the target directory to search the listings and copy the Snapshot ID from the Created/Snapshot Name column (see the example after the screenshot below).

      [Image: snapshot_id.png]
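
Because snapshots are exposed through a .snapshot directory on the target cluster, you can also list the available snapshot IDs from an NFS client that has the replicated path mounted. The mount point below is the same placeholder used in the rsync commands that follow:

ls /mnt/path/to/.snapshot/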

  3. Sync the snapshot data into the recovery working directory using one of the following methods, based on your permissions model:

POSIX Managed Permissions - Single Threaded

sudo rsync -aH /mnt/path/to/.snapshot/SNAPSHOTID/ /mnt/path/to/RecoveryWorkingDir/

The trailing slash on the snapshot path copies the snapshot's contents into the recovery working directory rather than nesting a SNAPSHOTID subdirectory inside it.

POSIX Managed Permissions - Multi Threaded

Navigate into the desired .snapshot/SNAPSHOTID directory to be recovered and run:

sudo find . -print0 | xargs -0 -I % -n 1 -P 20 rsync -aHR % /mnt/path/to/RecoveryWorkingDir/

The -print0/-0 pair passes NUL-delimited file names safely to xargs, and -P 20 runs up to 20 rsync processes in parallel.

Note: Add -vh --progress to the rsync command for verbose progress output if needed.
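
Once either rsync method finishes, an rsync dry run can confirm that the recovery working directory matches the snapshot. The paths are the same placeholders used above, and -n prevents any data from actually being copied:

sudo rsync -aHn --itemize-changes /mnt/path/to/.snapshot/SNAPSHOTID/ /mnt/path/to/RecoveryWorkingDir/

Little or no output means the two trees already match.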

NTFS Managed Permissions

  1. Open a new Admin CMD prompt.
  2. Map the replication target Qumulo cluster to a drive letter as user qumulo\admin (Q: in this example).
  3. Run robocopy to copy the snapshot into the recovery working directory (a combined example follows this list):
robocopy Q:\path\to\.snapshot\ID\ Q:\path\to\RecoveryWorkingDir\ /COPYALL /SL /MT:16 /R:5 /W:2 /LOG+:C:\path\to\logfile.txt
  4. Once the sync is complete, create the following data (if it doesn't already exist) on the secondary cluster:
    • NFS Exports
    • SMB Shares
    • AD/LDAP Server(s)
    • Snapshot Policies
    • Quotas
  5. Remount all clients that were previously connected to the primary cluster and now require access to the secondary cluster.
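
As a combined reference for the NTFS steps above, the drive mapping and copy look like the following from an administrative command prompt; the cluster name, share name, paths, and snapshot ID are placeholders:

net use Q: \\secondary-cluster.example.com\Files /user:qumulo\admin
robocopy Q:\path\to\.snapshot\ID\ Q:\path\to\RecoveryWorkingDir\ /COPYALL /SL /MT:16 /R:5 /W:2 /LOG+:C:\path\to\logfile.txt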

Re-enabling Replication after Failover

After you bring the primary cluster back online, you may want to re-enable Replication. Replication can be re-established immediately in the original direction (primary to secondary) if there is no need to keep the writes completed on the secondary cluster during the failover period.

You may want to do this if:

  • The primary cluster came back online before client write traffic was redirected to the secondary cluster
  • The writes to the secondary cluster were done as part of a test failover or DR readiness test and can therefore be discarded

To re-enable Replication in this case, you can simply re-create the original relationship. Replication will take ownership of the directory and sync the current version of the source directory to the target, overwriting any changes on the target directory.

Note that a warning dialog displays and asks you to confirm before proceeding:

[Image: authorize_relationship.png]

The initial Replication job after re-creating the relationship will complete a full tree walk of the Replication directories on both sides to bring the target into sync. Files that have been replicated successfully and haven't changed since the original Replication will not be resent.

Failback to the Primary Cluster

If you want to keep the writes made on the secondary cluster but continue using the original primary cluster, you can set up Replication from the secondary to the primary cluster by recreating the relationship in that direction. Any changes made on the primary after the last successful replication will be overwritten. As stated above, the initial job performs a tree walk.

Follow the instructions below to perform a failback with Replication:

  1. Create a new Replication relationship for each directory you would like to restore from the secondary cluster back to the primary cluster.

    [Image: create_relationship.png]

  2. Begin replicating from the secondary cluster to the primary cluster.
    Note: The final Replication job prior to the cutover should be as small as possible to minimize the time during which no I/O is allowed into the directory.
  3. Discontinue writes to the source directory (or directories) on the secondary cluster.
  4. Start and complete a final Replication job.
  5. Delete the relationship for each directory failing back once the final Replication job completes.
  6. Create the following data (if it doesn't already exist) on the primary cluster:
    • NFS Exports
    • SMB Shares
    • AD/LDAP Server(s)
    • Snapshot Policies
    • Quotas
  7. Remount all clients previously connected to the secondary cluster that require access to the primary cluster.
  8. Re-create all relationships (primary to secondary) to re-enable Continuous Replication.

Since the primary and secondary clusters were fully in sync when the relationship was deleted and the secondary cluster was in an inactive state, no data will be lost or overwritten on the secondary cluster.

ADDITIONAL RESOURCES

Replication: Failover and Failback with 2.12.0 and above

Replication: Continuous Replication with 2.11.1 and below

Replication: Continuous Replication with 2.11.2 and above

Replication: Make Target Writable

Replication: Version Requirements and Upgrade Recommendations

QQ CLI: Replication

 
