IN THIS ARTICLE
Provides an overview of Automatic EBS Volume Replacement with Qumulo Sidecar on your AWS cloud cluster.
REQUIREMENTS
- Cloud cluster with Qumulo Core 3.1.1 or above
- Cloud cluster with Qumulo Core 3.1.3 or above for SNS Topic Volume Replacement Notifications
- AWS Console access
- An SSH key pair for accessing the Qumulo instances
- IAM permissions for full access to EC2
- Command-line (qq CLI) tools installed via API & Tools in the Web UI
- Qumulo Sidecar configured
IAM PERMISSIONS
The table below lists the required IAM permissions for deploying the Qumulo Sidecar with Qumulo cloud clusters in AWS.
cloudformation:CreateStack | cloudformation:DeleteStack | ec2:DescribeNetworkInterfaces
ec2:DescribeSecurityGroups | ec2:DescribeSubnets | ec2:DescribeVpcs
events:DeleteRule | events:DescribeRule | events:PutRule
events:PutTargets | events:RemoveTargets | iam:AttachRolePolicy
iam:CreateRole | iam:DeleteRole | iam:DeleteRolePolicy
iam:DetachRolePolicy | iam:GetRole | iam:GetRolePolicy
iam:PassRole | iam:PutRolePolicy | lambda:AddPermission
lambda:GetFunction | lambda:CreateFunction | lambda:DeleteFunction
lambda:DeleteFunctionEventInvokeConfig | lambda:GetFunctionConfiguration | lambda:RemovePermission
lambda:PutFunctionEventInvokeConfig | lambda:PutFunctionConcurrency | s3:GetObject
secretsmanager:CreateSecret | secretsmanager:DeleteSecret | secretsmanager:TagResource
Detecting and repairing EBS volume failures with Qumulo Sidecar requires the following permissions:
ec2:AttachVolume | ec2:CreateTags | ec2:CreateVolume
ec2:DescribeImages | ec2:DescribeInstances | ec2:DescribeVolumes
ec2:DetachVolume | ec2:ModifyInstanceAttribute | sns:Publish
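To grant these permissions, you could attach an IAM policy built from the two lists above. The sketch below only assembles and prints a policy document; the statement IDs (`QumuloSidecarDeploy`, `QumuloSidecarVolumeReplacement`) are hypothetical, and `"Resource": "*"` is an assumption that you may want to scope down to your own resources.

```python
import json

# Permissions for deploying Qumulo Sidecar (first table above).
SIDECAR_DEPLOY_ACTIONS = [
    "cloudformation:CreateStack", "cloudformation:DeleteStack",
    "ec2:DescribeNetworkInterfaces", "ec2:DescribeSecurityGroups",
    "ec2:DescribeSubnets", "ec2:DescribeVpcs",
    "events:DeleteRule", "events:DescribeRule", "events:PutRule",
    "events:PutTargets", "events:RemoveTargets",
    "iam:AttachRolePolicy", "iam:CreateRole", "iam:DeleteRole",
    "iam:DeleteRolePolicy", "iam:DetachRolePolicy", "iam:GetRole",
    "iam:GetRolePolicy", "iam:PassRole", "iam:PutRolePolicy",
    "lambda:AddPermission", "lambda:GetFunction", "lambda:CreateFunction",
    "lambda:DeleteFunction", "lambda:DeleteFunctionEventInvokeConfig",
    "lambda:GetFunctionConfiguration", "lambda:RemovePermission",
    "lambda:PutFunctionEventInvokeConfig", "lambda:PutFunctionConcurrency",
    "s3:GetObject",
    "secretsmanager:CreateSecret", "secretsmanager:DeleteSecret",
    "secretsmanager:TagResource",
]

# Permissions for detecting and repairing EBS volume failures (second table).
VOLUME_REPLACEMENT_ACTIONS = [
    "ec2:AttachVolume", "ec2:CreateTags", "ec2:CreateVolume",
    "ec2:DescribeImages", "ec2:DescribeInstances", "ec2:DescribeVolumes",
    "ec2:DetachVolume", "ec2:ModifyInstanceAttribute", "sns:Publish",
]


def build_policy(statements):
    """Return an IAM policy document with one Allow statement per entry.

    `statements` maps a statement ID to a list of IAM actions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": sorted(set(actions)),
                "Resource": "*",  # Assumption: scope down for production use.
            }
            for sid, actions in statements.items()
        ],
    }


policy = build_policy({
    "QumuloSidecarDeploy": SIDECAR_DEPLOY_ACTIONS,
    "QumuloSidecarVolumeReplacement": VOLUME_REPLACEMENT_ACTIONS,
})
print(json.dumps(policy, indent=2))
```

Keeping the two tables as separate statements makes it easy to drop the deployment statement once the stack exists, if your process allows that.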
DETAILS
The EBS volume replacement process is useful in the rare event that an EBS volume becomes unusable by the Qumulo file system. This can happen for several reasons, such as a failure within AWS infrastructure or performance degrading to the point that the file system can no longer use the volume effectively. In these cases, the file system rejects the EBS volume and begins reprotecting its data onto the other healthy volumes in the cluster. Although the file system is designed to tolerate multiple simultaneous volume failures, it is recommended to replace a failed volume as quickly as possible to maintain protection against future failures.
The automatic volume replacement process will remove the unusable volume from the file system, create a new volume, and attach it to the Qumulo EC2 instance.
NOTE: The replacement process applies only to the EBS volumes attached for file system data. The root device is not queried or replaced by this process.
Automatic EBS Volume Replacement with Qumulo Sidecar
Once Qumulo Sidecar is deployed alongside your AWS Qumulo cluster, the service will perform the following to ensure the health of your cluster:
- Query the cluster for failed or missing EBS volumes.
- Tag each failed volume as unhealthy.
- Remove each failed volume.
- For each failed or missing EBS volume, create a new volume that matches its type and size, and tag it as a replacement volume.
- Attach the new replacement volume to the file system in place of the failed or missing volume.
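The steps above can be sketched as a small planning function. This is a minimal illustration, not Qumulo Sidecar's actual code: the slot model (`slots` and the `healthy`/`failed`/`missing` states) and the step names are hypothetical, and a real implementation would query the cluster and call EC2 rather than return a list. It highlights one detail implied by the list: a missing volume has nothing to tag or remove, so only the create and attach steps apply to it.

```python
def plan_replacements(slots):
    """Return an ordered list of (action, slot_id, volume_id) steps.

    `slots` maps a slot ID to {"state": ..., "volume_id": ...}, where state
    is "healthy", "failed", or "missing" (a hypothetical model).
    """
    steps = []
    for slot_id in sorted(slots):
        info = slots[slot_id]
        state = info["state"]
        if state == "healthy":
            continue
        if state == "failed":
            # A failed volume still exists, so tag it and remove it first.
            old = info["volume_id"]
            steps.append(("tag_unhealthy", slot_id, old))
            steps.append(("remove", slot_id, old))
        elif state != "missing":
            raise ValueError(f"unknown state {state!r} for slot {slot_id}")
        # Failed and missing slots both get a new, matching replacement volume.
        steps.append(("create_and_tag_replacement", slot_id, None))
        steps.append(("attach_replacement", slot_id, None))
    return steps


# Example with placeholder volume IDs.
example = {
    1: {"state": "healthy", "volume_id": "vol-aaa"},
    2: {"state": "failed", "volume_id": "vol-bbb"},
    3: {"state": "missing", "volume_id": None},
}
for step in plan_replacements(example):
    print(step)
```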
If the volume replacement process fails partway through, the next run looks for any replacement volumes already created for the failed volume and reuses them, so that unnecessary extra replacement volumes are not created.
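The resume behavior described above amounts to a find-or-create step keyed on a tag. The sketch below assumes a hypothetical tag key (`replacement-for`) whose value is the failed volume's ID, and volume records shaped like EC2 `DescribeVolumes` output (`VolumeId`, `Tags`); the tag names Qumulo Sidecar actually uses are not documented here. It also raises an error when more than one replacement is found, mirroring the behavior described under Considerations.

```python
class TooManyReplacementsError(Exception):
    """More than one replacement volume exists for a single failed volume."""


# Hypothetical tag key; its value is the ID of the volume being replaced.
REPLACEMENT_TAG = "replacement-for"


def tag_value(volume, key):
    """Return the value of tag `key` on a DescribeVolumes-style record, or None."""
    for tag in volume.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def find_or_create_replacement(failed_volume_id, volumes, create_volume):
    """Reuse an existing replacement for `failed_volume_id`, or create one.

    `create_volume` is a caller-supplied function that creates and tags a new
    volume for the given failed volume and returns the new volume ID.
    """
    matches = [
        v["VolumeId"]
        for v in volumes
        if tag_value(v, REPLACEMENT_TAG) == failed_volume_id
    ]
    if len(matches) > 1:
        # Overlapping runs may each have created a replacement; refuse to guess.
        raise TooManyReplacementsError(
            f"found {len(matches)} replacement volumes for {failed_volume_id}, "
            "expected at most 1"
        )
    if matches:
        return matches[0]  # Resume an interrupted run with its replacement.
    return create_volume(failed_volume_id)


# Example with placeholder IDs and a stand-in for the EC2 call.
created = []


def fake_create(failed_id):
    created.append(failed_id)
    return "vol-new"


existing = [{"VolumeId": "vol-r1", "Tags": [{"Key": REPLACEMENT_TAG, "Value": "vol-old"}]}]
print(find_or_create_replacement("vol-old", existing, fake_create))    # vol-r1
print(find_or_create_replacement("vol-other", existing, fake_create))  # vol-new
```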
By the end of the process, the failed volume is detached from the file system. Out of an abundance of caution, however, it is not deleted. Before deleting the failed volume, the administrator should check the health of the cluster to confirm both that the process is complete and that the failed volume is no longer needed.
SNS Topic Volume Replacement Notifications (3.1.3 or above)
You can configure EBS volume replacement notifications by including an SNS topic when launching or upgrading the Qumulo Sidecar service. Whenever the service encounters a volume failure, it notifies the SNS topic. In addition, whenever the service replaces one or more volumes, it sends a JSON-encoded message with information about the completed replacements to the SNS topic:
{
  "Description": "Successfully replaced disks for Qumulo cluster \"<cluster name>\"",
  "DiskReplacements": [
    {
      "InstanceId": "<instance id>",
      "SlotId": <disk slot number>,
      "NewVolumeId": "<ebs volume id of the replacement disk>",
      "OldVolumeId": "<ebs volume id of the failed disk>"
    }
  ]
}
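A subscriber to the SNS topic can decode this message. The sketch below is a hypothetical AWS Lambda handler subscribed to the topic; it relies on the standard SNS-to-Lambda event shape, in which each record carries the published text in `Records[].Sns.Message`. The format of volume-failure notifications is not documented here, so any message that does not parse as the replacement JSON above is passed through unchanged. Function names, the cluster name, and the IDs in the example are placeholders.

```python
import json


def summarize_replacements(message):
    """Turn one SNS message into human-readable lines."""
    try:
        body = json.loads(message)
    except json.JSONDecodeError:
        # Not JSON (e.g. a failure notification): pass it through as-is.
        return [message]
    if not isinstance(body, dict) or "DiskReplacements" not in body:
        return [message]
    lines = [body["Description"]]
    for r in body["DiskReplacements"]:
        lines.append(
            f"  {r['InstanceId']} slot {r['SlotId']}: "
            f"{r['OldVolumeId']} -> {r['NewVolumeId']}"
        )
    return lines


def lambda_handler(event, context):
    """Entry point for a Lambda function subscribed to the SNS topic."""
    summaries = []
    for record in event.get("Records", []):
        # SNS delivers the published message as a string in Sns.Message.
        summaries.extend(summarize_replacements(record["Sns"]["Message"]))
    for line in summaries:
        print(line)
    return summaries


# Example event with placeholder values.
sample = json.dumps({
    "Description": 'Successfully replaced disks for Qumulo cluster "demo"',
    "DiskReplacements": [
        {"InstanceId": "i-example", "SlotId": 3,
         "NewVolumeId": "vol-new", "OldVolumeId": "vol-old"},
    ],
})
result = lambda_handler({"Records": [{"Sns": {"Message": sample}}]}, None)
```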
Considerations
The replacement volume's configuration is based on the configuration of the EC2 instance’s root volume. The volume being replaced may or may not still exist, so it cannot be relied on as a source of volume configuration information.
The following configuration is set for a replacement EBS volume by inheriting the configuration for the instance’s root volume:
- Encryption status: whether the EBS volume is encrypted and, if so, which encryption key it is encrypted with.
- DeleteOnTermination: whether the EBS volume is deleted when the EC2 instance it is attached to is terminated.
If you manually change these settings on the attached EBS volumes so that they differ from the instance's root volume, replacement volumes will not retain those manual changes; they inherit the root volume's settings instead.
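The inheritance rule above can be sketched as two argument builders for boto3 calls. This is an assumption about how such logic could look, not Qumulo Sidecar's implementation: `replacement_volume_params` copies `Encrypted` and `KmsKeyId` from the root volume's `DescribeVolumes` record into `ec2.create_volume` arguments, and `delete_on_termination_params` copies the root device's `DeleteOnTermination` flag from a `DescribeInstances` record into `ec2.modify_instance_attribute` arguments (one plausible use of the `ec2:ModifyInstanceAttribute` permission listed earlier). The size, volume type, Availability Zone, device name, and `replacement-for` tag are hypothetical inputs, because the failed volume may not exist to copy them from.

```python
def replacement_volume_params(root_volume, availability_zone, size_gib,
                              volume_type, failed_volume_id):
    """Build ec2.create_volume(**params) arguments that inherit encryption
    settings from the instance's root volume (a DescribeVolumes record)."""
    params = {
        "AvailabilityZone": availability_zone,
        "Size": size_gib,
        "VolumeType": volume_type,
        "Encrypted": root_volume["Encrypted"],
        "TagSpecifications": [{
            "ResourceType": "volume",
            # Hypothetical tag marking this volume as a replacement.
            "Tags": [{"Key": "replacement-for", "Value": failed_volume_id}],
        }],
    }
    if root_volume["Encrypted"] and "KmsKeyId" in root_volume:
        # Encrypt with the same KMS key as the root volume.
        params["KmsKeyId"] = root_volume["KmsKeyId"]
    return params


def delete_on_termination_params(instance, new_device_name):
    """Build ec2.modify_instance_attribute(**params) arguments that copy the
    root device's DeleteOnTermination flag onto a newly attached device.
    `instance` is a DescribeInstances record."""
    root_flag = next(
        m["Ebs"]["DeleteOnTermination"]
        for m in instance["BlockDeviceMappings"]
        if m["DeviceName"] == instance["RootDeviceName"]
    )
    return {
        "InstanceId": instance["InstanceId"],
        "BlockDeviceMappings": [{
            "DeviceName": new_device_name,
            "Ebs": {"DeleteOnTermination": root_flag},
        }],
    }


# Example with placeholder values. The data volume's manually set False is
# ignored; only the root device's flag is copied.
root = {"VolumeId": "vol-root", "Encrypted": True, "KmsKeyId": "example-kms-key-id"}
instance = {
    "InstanceId": "i-example",
    "RootDeviceName": "/dev/xvda",
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"VolumeId": "vol-root", "DeleteOnTermination": True}},
        {"DeviceName": "/dev/xvdb", "Ebs": {"VolumeId": "vol-data", "DeleteOnTermination": False}},
    ],
}
create_args = replacement_volume_params(root, "us-west-2a", 1024, "gp3", "vol-old")
modify_args = delete_on_termination_params(instance, "/dev/xvdc")
# With boto3: ec2.create_volume(**create_args), attach the new volume,
# then ec2.modify_instance_attribute(**modify_args).
```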
If multiple runs of the process overlap, each run could create its own replacement volume. If the process finds more than one replacement volume for a failed or missing volume, it returns an error indicating that there are more replacement volumes than expected.
RESOLUTION
You should now have an overall understanding of Automatic EBS Volume Replacement with Qumulo Sidecar on your AWS cloud cluster.
ADDITIONAL RESOURCES
Qumulo in AWS: Automated Instance Recovery