IN THIS ARTICLE
Outlines how to configure AWS Automated Instance Recovery for your Qumulo cloud cluster.
REQUIREMENTS
- AWS account
- Account limits large enough for 20TB of EBS ST1 and 2TB of EBS GP2
- Permission to launch EC2 instances
- Four or more nodes running Qumulo in AWS version 3.1.0 or higher
IAM PERMISSIONS
The cloudwatch:PutMetricAlarm IAM permission is required to configure automated instance recovery for your Qumulo cloud cluster in AWS.
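For illustration, a minimal policy document granting this permission could be generated as follows. This is a sketch, not a Qumulo-provided policy: the function name write_alarm_policy, the default file name, and the wildcard Resource are assumptions, and your organization may require tighter scoping or additional permissions.

```shell
#!/bin/sh
# Sketch: write a minimal IAM policy document that allows cloudwatch:PutMetricAlarm.
# write_alarm_policy and the default file name are illustrative, not AWS or Qumulo names.

write_alarm_policy() {
    policy_file="${1:-put-metric-alarm-policy.json}"
    # The quoted 'EOF' delimiter keeps the JSON literal (no shell expansion).
    cat > "$policy_file" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricAlarm",
      "Resource": "*"
    }
  ]
}
EOF
    # Print the path so the caller knows where the document was written.
    echo "$policy_file"
}
```

The resulting file can then be attached to the user or role that creates the alarms, using whichever IAM workflow you normally follow.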
DETAILS
Deploying your virtual machines in the cloud provides many advantages, but it also takes physical control out of the hands of system administrators: if an issue renders one of your machines inaccessible, you cannot physically access and troubleshoot it to bring it back online. Qumulo cloud clusters in AWS address this with the Automated Instance Recovery feature, which monitors for and protects against the following failure cases:
- Loss of network connectivity
- Loss of system power
- Hardware issues on the physical host that impact network reachability
These issues can be detected by a CloudWatch alarm that monitors the instance's StatusCheckFailed_System metric and dispatches a recovery action to the AWS health service to "recover" the instance. When such an issue is detected, Automated Instance Recovery stops and restarts your instance, causing AWS to migrate it to a different physical host. When the instance has finished rebooting, it is identical to the original instance, including the instance ID, IP configuration, Elastic IPs, and metadata.
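The system status check behind this metric can also be inspected by hand. The sketch below assumes the AWS CLI is installed and configured with credentials; the function name show_system_status and the DRY_RUN switch are illustrative only.

```shell
#!/bin/sh
# Sketch: print the system status check result (for example "ok" or "impaired")
# for a single instance. show_system_status and DRY_RUN are illustrative names.

show_system_status() {
    region="$1"
    instance_id="$2"
    # With DRY_RUN set to a non-empty value, the command is printed instead of run.
    ${DRY_RUN:+echo} aws ec2 describe-instance-status \
        --region "$region" \
        --instance-ids "$instance_id" \
        --query 'InstanceStatuses[].SystemStatus.Status' \
        --output text
}
```

For example, show_system_status us-west-2 i-0123456789abcdef0 queries one instance; the region and instance ID shown here are placeholders.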
Clusters created using a CloudFormation template running Qumulo Core 3.1.0 or higher have this CloudWatch alarm enabled by default. If your cluster was created using an older template or did not use a template at all, follow the steps below to manually activate Automated Instance Recovery. You can also create an SNS topic and add its ARN to the alarm so that subscribers (such as email addresses and SMS phone numbers) are notified about your cluster’s health status.
NOTE: If you would like to create a CloudWatch alarm that will notify you when Automated Instance Recovery is triggered, see Qumulo in AWS: Configure CloudWatch Alarms.
Enable Automated Instance Recovery via the AWS Console
- Log in to the EC2 AWS Console.
- Click Instances in the left-hand navigation pane.
- Select the instance from the list and click the Status Checks tab.
- Click Create Status Check Alarm.
- Check Send a notification to: and make the desired selection if you wish to receive a notification when triggered (optional).
NOTE: See Qumulo in AWS: Configure CloudWatch Alarms for more details.
- Check Take the Action and click Recover this instance.
- Use the Whenever drop-down menu to select Status Check Failed (System).
- Set For at least to 2 consecutive periods of 1 Minute.
- Give the alarm a unique name (optional).
- Click Create Alarm.
Once the alarm is created, a window displays a link to it. Dismiss the window and repeat these steps for each remaining node until every instance in the cluster has an alarm.
Enable Automated Instance Recovery via the AWS CLI
Change $INSTANCE_ID to the ID for a Qumulo instance and $AWS_REGION to the current AWS region in the script below:
aws cloudwatch put-metric-alarm \
--alarm-name "auto-recover-$INSTANCE_ID" \
--alarm-description "Recover $INSTANCE_ID if it dies" \
--metric-name StatusCheckFailed_System \
--namespace AWS/EC2 \
--statistic Maximum \
--period 60 \
--threshold 1.0 \
--comparison-operator GreaterThanOrEqualToThreshold \
--dimensions "Name=InstanceId,Value=$INSTANCE_ID" \
--evaluation-periods 2 \
--datapoints-to-alarm 2 \
--alarm-actions "arn:aws:automate:$AWS_REGION:ec2:recover"
Run the script for each instance in the cluster to enable Automated Instance Recovery.
NOTE: The alarm-name and alarm-description values can be anything you like. The alarm-actions value must include the EC2 recovery ARN, and can also include an SNS topic ARN. To add one, type a space after the recovery ARN followed by the ARN of the SNS topic.
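Running the command once per node can be wrapped in a small loop. The following is a minimal sketch under stated assumptions rather than a Qumulo-supplied script: the function name enable_auto_recovery and the DRY_RUN and SNS_TOPIC_ARN variables are illustrative, and the global --region option is added so the alarm is created in the same region as the instance.

```shell
#!/bin/sh
# Sketch: create an auto-recovery alarm for every instance ID passed in.
# enable_auto_recovery, DRY_RUN, and SNS_TOPIC_ARN are illustrative names.

enable_auto_recovery() {
    region="$1"
    shift
    for instance_id in "$@"; do
        # DRY_RUN (if non-empty) prints the command instead of running it.
        # SNS_TOPIC_ARN (if non-empty) is appended as a second alarm action.
        ${DRY_RUN:+echo} aws cloudwatch put-metric-alarm \
            --region "$region" \
            --alarm-name "auto-recover-$instance_id" \
            --alarm-description "Recover $instance_id if it dies" \
            --metric-name StatusCheckFailed_System \
            --namespace AWS/EC2 \
            --statistic Maximum \
            --period 60 \
            --threshold 1.0 \
            --comparison-operator GreaterThanOrEqualToThreshold \
            --dimensions "Name=InstanceId,Value=$instance_id" \
            --evaluation-periods 2 \
            --datapoints-to-alarm 2 \
            --alarm-actions "arn:aws:automate:$region:ec2:recover" ${SNS_TOPIC_ARN:+"$SNS_TOPIC_ARN"} \
            || return 1
    done
}
```

For example, enable_auto_recovery us-west-2 i-0123456789abcdef0 i-0fedcba9876543210 would create two alarms; the region and instance IDs are placeholders. Setting DRY_RUN=1 beforehand prints each command so it can be reviewed before anything is created.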
Create CloudWatch Event Rule for AWS EC2 Automated Instance Recovery
While not required, you can configure your deployment to use the SNS topic you created to send a notification any time the EC2 Automated Instance Recovery service is triggered, whether the recovery succeeds or fails.
- Click Events > Rules in the left navigation pane on your AWS CloudWatch Dashboard to view all current rules.
- Click Create Rule.
- Confirm that the Event Source is set to Event Pattern.
- Select Health in the Service Name field.
- Select Specific Health events in the Event Type dropdown to display additional fields for specifying your service.
- Select Specific service(s) and enter EC2 in the field below.
- Select Specific event type category(s) and enter issue.
- Select Specific event type code(s) and select both:
  - AWS_EC2_INSTANCE_AUTO_RECOVERY_FAILURE
  - AWS_EC2_INSTANCE_AUTO_RECOVERY_SUCCESS
- Click Add Target under Targets in the right-hand portion of the screen.
- Click the drop-down to select SNS Topic and specify the topic you created earlier.
- Click Configure Details at the bottom of the screen to continue.
- Provide a Name and Description for your new rule.
- Click Create Rule.
The rule now sends a notification to the SNS topic whenever an AWS EC2 Automated Instance Recovery attempt succeeds or fails.
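The console steps above can also be approximated from the command line. The sketch below is an assumption-laden equivalent, not a Qumulo-provided script: the function names, the rule name you pass in, and the DRY_RUN switch are illustrative, and the event pattern is a hand-written rendering of the console selections. Unlike the console, the CLI does not adjust the SNS topic's access policy, so the topic may need to be updated separately to let the events service publish to it.

```shell
#!/bin/sh
# Sketch: create an event rule for EC2 auto-recovery results and point it at an SNS topic.
# recovery_event_pattern, create_recovery_rule, and DRY_RUN are illustrative names.

recovery_event_pattern() {
    # Matches AWS Health events for EC2 auto-recovery success and failure.
    cat <<'EOF'
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"],
  "detail": {
    "service": ["EC2"],
    "eventTypeCategory": ["issue"],
    "eventTypeCode": [
      "AWS_EC2_INSTANCE_AUTO_RECOVERY_FAILURE",
      "AWS_EC2_INSTANCE_AUTO_RECOVERY_SUCCESS"
    ]
  }
}
EOF
}

create_recovery_rule() {
    region="$1"
    rule_name="$2"
    sns_topic_arn="$3"
    # DRY_RUN (if non-empty) prints the commands instead of running them.
    ${DRY_RUN:+echo} aws events put-rule \
        --region "$region" \
        --name "$rule_name" \
        --event-pattern "$(recovery_event_pattern)" || return 1
    ${DRY_RUN:+echo} aws events put-targets \
        --region "$region" \
        --rule "$rule_name" \
        --targets "Id=1,Arn=$sns_topic_arn"
}
```

A call might look like create_recovery_rule us-west-2 qumulo-auto-recovery arn:aws:sns:us-west-2:111122223333:example, where every argument is a placeholder for your own region, rule name, and topic ARN.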
Considerations for Automated Instance Recovery
- No instance store volumes may be configured
- Recovery will fail if there is a temporary shortage of capacity in the Availability Zone (AZ)
- Recovery can fail if AWS is experiencing technical issues
- Recovery is limited to a maximum of 3 recovery attempts per 24 hours
RESOLUTION
You should now be able to successfully configure AWS Automated Instance Recovery for your Qumulo cloud cluster.
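One way to confirm that the alarms are in place is to list them from the command line. This sketch assumes the alarms use the auto-recover- name prefix from the CLI example in this article; alarms created in the console under other names would need a different filter. The function name list_recovery_alarms and the DRY_RUN switch are illustrative.

```shell
#!/bin/sh
# Sketch: list alarms whose names start with "auto-recover-" along with their state.
# list_recovery_alarms and DRY_RUN are illustrative names.

list_recovery_alarms() {
    region="$1"
    # DRY_RUN (if non-empty) prints the command instead of running it.
    ${DRY_RUN:+echo} aws cloudwatch describe-alarms \
        --region "$region" \
        --alarm-name-prefix "auto-recover-" \
        --query 'MetricAlarms[].[AlarmName,StateValue]' \
        --output text
}
```

Each node in the cluster should appear once in the output of list_recovery_alarms for your region.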
ADDITIONAL RESOURCES
Qumulo in AWS: Configure CloudWatch Alarms
Qumulo in AWS: Build a Multi-Instance Cluster with CloudFormation