
Qumulo's Behavior with Cluster Events

IN THIS ARTICLE 

Outlines the availability and expected behavior of a Qumulo cluster during specific cluster events

REQUIREMENTS

  • Cluster running Qumulo Core

QUMULO CLUSTER EVENTS

Below you'll find a list of cluster events, with details outlining the cluster's availability, the expected performance impact, the severity level of the incident, and whether Qumulo's cloud-based monitoring covers the event.

For additional details on severity levels and response times, reference the table below.

(Table: severity levels and response times)

Node Offline/Node Recusal

  • Cluster will continue accepting reads & writes
  • NFS clients connected to a floating IP will not be disconnected, since floating IPs are rebalanced across the available nodes
  • NFS clients connected to the node's persistent IP will be disconnected (see the client-side check after this list)
  • SMB clients connected to any IP on the node will be disconnected
  • Moderately degraded performance
  • Cluster will be at its protection threshold, meaning it cannot tolerate any additional drive losses at this time
  • If the workload has a high rate of incoming writes, an out-of-space (ENOSPC) condition is possible within hours
  • Treated as very high priority (Sev 1) by Customer Success
  • Customer Success alerted via cloud-based monitoring
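
To determine whether an NFS client is mounted against a floating or persistent IP, compare the server address reported on the client against the cluster's IP assignments. A minimal client-side check on Linux, using standard utilities only (no Qumulo-specific tooling assumed):

    # Show the server address and mount options for each NFS mount
    nfsstat -m

    # Alternatively, list mounted NFS filesystems directly
    mount -t nfs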

Node Core

  • Cluster will continue accepting reads & writes
  • SMB clients connected to the affected node will be disconnected during the incident
  • NFS clients connected to the affected node may not be disconnected during the incident, depending on their mount options (see the example mount commands after this list)
  • Minor performance impact
  • If the node comes back online without manual intervention, Customer Success will treat it as normal priority (Sev 2) and investigate the incident
  • If the node does not come back online or continues to core, we’ll treat it as a very high priority (Sev 1) incident
  • Customer Success alerted via cloud-based monitoring
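
Whether an NFS client rides out a node incident depends largely on whether the export was mounted with the hard or soft option. A sketch from a Linux client, with a hypothetical cluster hostname and export path:

    # 'hard' mounts retry indefinitely; the client blocks, then resumes
    # I/O once the node (or a failover IP) responds again
    mount -t nfs -o hard cluster.example.com:/files /mnt/qumulo

    # 'soft' mounts give up after 'retrans' retries and return an error
    # to the application, which can surface as a disconnect
    mount -t nfs -o soft,timeo=600,retrans=2 cluster.example.com:/files /mnt/qumulo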

HDD/SSD Failure

  • Cluster will continue accepting reads & writes
  • Moderately degraded performance; the greater the node count, the lower the impact
  • If the workload or other hardware issues cause recurring quorum events, the chances of entering a read-only state or returning an out-of-space (ENOSPC) error increase
  • Drive failures are treated by Customer Success as the following priority based on disk type:
    • Normal priority (Sev 2) for HDD failures
    • High priority (Sev 1) for SSD failures
    • High priority (Sev 1) for multiple HDD or SSD failures
  • Cluster will be at its protection threshold if multiple drives fail, meaning it cannot tolerate any additional drive losses at this time
  • Customer Success alerted via cloud-based monitoring

PSU Failure 

  • Cluster will continue accepting reads & writes
  • No performance impact with one PSU failure
  • A second power supply failure is treated the same as a node offline event
  • PSU failures are treated by Customer Success as the following priority:
    • Normal priority (Sev 3) for a single PSU failure
    • High priority (Sev 1) for two PSU failures
  • Customer Success alerted via cloud-based monitoring

Fan Failure 

  • Cluster will continue accepting reads & writes
  • No performance impact
  • Risk of overheating with multiple fan failures
  • Treated as normal priority (Sev 3) by Customer Success 
  • Customer Success alerted via cloud-based monitoring

DIMM Failure

  • Cluster will continue accepting reads & writes
  • No performance impact for correctable errors
  • Uncorrectable errors resulting in a node offline event are treated as very high priority (Sev 1) by Customer Success
  • Customer Success alerted via cloud-based monitoring in the event of a node offline from uncorrectable errors

NIC Port Failure

  • Cluster will continue accepting reads & writes
  • Minor performance impact for a single port failure
  • IP failover will occur if one port fails
  • The affected node will be offline if both ports on the NIC fail
  • NIC port failures are treated by Customer Success as the following priority:
    • Normal priority (Sev 2) for one NIC port failure
    • High priority (Sev 1) for dual NIC port failures
  • Customer Success alerted via cloud-based monitoring if both NIC ports fail, resulting in a node offline event

NOTE: Qumulo's all-flash platform has two NICs, one for front-end and one for back-end traffic, which may result in different behavior during port failures depending on switch configuration. Reference the Qumulo P-Series Networking article for additional details.

Boot Drive Failure

  • Results in a node offline event
  • Clients connected to the affected node will be disconnected
    • NFS clients connected to the persistent IP of the node and SMB clients connected to any IP on the node will be disconnected
    • NFS clients connected to a floating IP will not be disconnected, since floating IPs are rebalanced across the available nodes
  • Cluster will continue accepting reads & writes
  • Moderately degraded performance
  • Treated as very high priority (Sev 1) by Customer Success 
  • Customer Success alerted via cloud-based monitoring

Upgrade

  • Cluster will not accept reads and writes for up to a few minutes while the new version is installed on the cluster
  • Clients connected to the cluster will disconnect during the upgrade and reconnect after the installation is complete
  • Cluster reports version change via cloud-based monitoring

Adding a Node

  • Cluster will continue accepting reads & writes
  • Moderately degraded performance while the cluster is rebalancing data to the new node
  • Not included in cloud-based monitoring

Total Power Loss

  • Cluster will be offline until power is returned
  • Cluster does not have a battery-backed journal; there is no time limit for when power must be restored
  • Nodes do not need to be powered on in a specific order following the event
  • Customer Success alerted to the loss of communication via cloud-based monitoring

For information on multiple failures or incidents, reference the Qumulo Drive Failure Protection article for details on severity and data availability.

RESOLUTION 

You should now have an overall understanding of Qumulo's behavior during the cluster events outlined above.

ADDITIONAL RESOURCES

Qumulo's Cloud-Based Monitoring

IP Failover with Qumulo Core

Qumulo P-Series Networking

Qumulo Drive Failure Protection

Qumulo's Remote Support

 

 
