IN THIS ARTICLE
Outlines how to use Qumulo Shift to copy files from the Qumulo file system to a folder in an Amazon S3 bucket
Requirements
- A Qumulo cluster running Qumulo Core version 3.2.1 or later
- HTTPS connectivity from the cluster to s3.<region>.amazonaws.com (see AWS IP address ranges)
- An existing bucket in Amazon S3
- AWS credentials (an access key ID and secret access key) with sufficient permissions to read and write objects in the target bucket
How Qumulo Shift Works
Qumulo Shift for Amazon S3 allows you to perform a one-time copy of data from any Qumulo cluster, whether on-premises or already running in the cloud, to the Amazon Simple Storage Service cloud object store (Amazon S3), making it easy to take advantage of thousands of AWS cloud services and applications. Note that this copy is one-way: Shift cannot be used to copy the same data back into the Qumulo file system.
During the creation of a Shift relationship, Qumulo verifies that the specified source directory exists on the file system and that the S3 bucket exists and is accessible via the specified credentials. Once the relationship is created successfully, a job is started using one of the nodes in the cluster—when performing multiple Shift operations, multiple nodes will be used. This job takes a temporary snapshot (named replication_to_bucket_<bucket_name>) of the source directory to ensure that the copy is point-in-time-consistent. Shift then recursively traverses the directories and files in that snapshot, copying each file to a corresponding object in S3.
NOTE: File paths relative to the source directory are preserved in the keys of replicated objects, prefixed by the configured object folder. For example, the file /my-dir/my-project/file.txt will be uploaded as the object https://my-bucket.s3.us-west-2.amazonaws.com/my-folder/my-project/file.txt.
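As an illustration, the key mapping described above can be sketched in shell. The bucket, region, folder, and file paths below are the example values used later in this article:

```shell
# Illustrative only: compute the S3 object key and URL that Shift would
# produce for a file, given the example relationship settings.
bucket="my-bucket"
region="us-west-2"
object_folder="my-folder"
source_dir="/my-dir"
file_path="/my-dir/my-project/file.txt"

# Strip the source directory prefix, then prepend the object folder.
relative_path="${file_path#"$source_dir"/}"
object_key="$object_folder/$relative_path"
object_url="https://$bucket.s3.$region.amazonaws.com/$object_key"

echo "$object_key"   # my-folder/my-project/file.txt
echo "$object_url"
```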
The data is not encoded or transformed in any way, but only data in a regular file's primary stream is replicated (alternate data streams and file system metadata such as ACLs are not included). Any hard links to a file within the replication source directory are also replicated to S3 as a full copy of the object, with identical contents and metadata—however, this copy is performed using a server-side S3 copy operation to avoid transferring the data across the internet. See the table below for specifics on how entities in the Qumulo file system map to entities in an Amazon S3 bucket:
| In the Qumulo File System | Becomes in Amazon S3 |
| --- | --- |
| Regular file | S3 object (the object key is the file system path; the data is the file data) |
| Directory | Not copied (directory structure is preserved in the object keys of objects created for files) |
| Symbolic link | Not copied |
| UNIX device file | Not copied |
| Hard link to a regular file | Copy of the S3 object |
| Hard link to a non-regular file | Not copied |
| Access control lists | Not copied |
| SMB extended file attributes | Not copied |
| Alternate data streams | Not copied |
| Holes in sparse files | Zeroes (holes are expanded) |
NOTE: When copying, Shift checks whether a file was previously replicated to S3 by Shift. If the resulting object still exists in the target S3 bucket and neither the file nor the object has been modified since the last successful replication, the file's data is not re-transferred to S3. Shift never deletes files in the target folder on S3, even if they have been removed from the source directory since the last replication.
Once the job has completed, the temporary snapshot is deleted and the job ends successfully. The relationship remains on the Qumulo cluster so that you can monitor the completion status of the job; you can delete it manually once you are satisfied that the job has finished. Deleting the relationship does not affect data residing on Qumulo or S3. Relationships are one-time operations and cannot be made recurring or reused; to copy the same folder again, you must create a new relationship.
Copy Files to Amazon S3 using Qumulo Shift
Use the following QQ CLI command on your source cluster to create a Qumulo Shift relationship and start copying the files:
qq replication_create_object_relationship --source-directory-path <PATH> --object-store-address s3.<REGION>.amazonaws.com --object-folder <FOLDER> --bucket <BUCKET> --region <REGION> --access-key-id <KEY>
The command requires the following parameters:
- Path: the path of the directory to be copied on the source cluster
- Region: the AWS region of your S3 bucket
- Folder: the name of the existing folder on the target bucket
- Bucket: the name of the existing bucket on S3
- Key: the ID for the S3 access key to be used for the operation
The following example shows how to create a relationship between the directory /my-dir/ on the Qumulo file system and the bucket my-bucket and folder /my-folder/ in the us-west-2 region of Amazon S3:
qq replication_create_object_relationship --source-directory-path /my-dir/ --object-store-address s3.us-west-2.amazonaws.com --object-folder /my-folder/ --bucket my-bucket --region us-west-2 --access-key-id ABC
The command then prompts you to enter the secret access key associated with this access key ID.
Within a few seconds, the Qumulo cluster takes a snapshot of the source directory and begins copying files to S3.
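To confirm that objects are landing in the target folder, you can optionally list them from the S3 side. This sketch assumes the AWS CLI is installed and configured with credentials that can read the bucket; the bucket and folder names are the example values from above:

```shell
# Optional check: list the objects Shift has created under the target
# folder, with a total object count and size summary.
aws s3 ls "s3://my-bucket/my-folder/" --recursive --summarize
```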
View Current Relationships
Use the following QQ CLI command on your source cluster to view all current relationships:
qq replication_list_object_relationships
If you want to view details about a specific relationship, provide its ID to the following command:
qq replication_get_object_relationship <ID>
Monitor the Status of Relationships
Use the following QQ CLI command on your source cluster to check the status of all relationships:
qq replication_list_object_relationship_statuses
NOTE: To view the status of a specific relationship, use the replication_get_object_relationship_status command and supply the ID of the desired relationship.
A state of REPLICATION_RUNNING indicates the operation is in progress, and current_job shows the current progress. When the operation finishes, the state will become REPLICATION_NOT_RUNNING, current_job will be empty, and last_job will have details on the most recent job completed.
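For unattended workflows, the status check above can be wrapped in a simple polling loop. This is a hypothetical sketch, assuming the qq CLI is on the PATH and that you substitute a real relationship ID for the placeholder:

```shell
# Hypothetical polling sketch: wait for a specific relationship to finish.
# Replace the placeholder with a relationship ID obtained from
# replication_list_object_relationships.
id="<ID>"
while qq replication_get_object_relationship_status "$id" \
        | grep -q REPLICATION_RUNNING; do
    sleep 30
done
echo "Replication job finished; inspect last_job for results."
```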
Disconnect an Active Relationship
There is no mechanism to pause and resume a replication once it has started, nor to restart a failed or aborted replication. To interrupt an active replication, use the following command, substituting the ID of the desired relationship for <ID>:
qq replication_abort_object_relationship --id <ID>
Delete a Completed Relationship
Use the following command to delete a relationship that is not actively running, substituting the ID of the desired relationship for <ID>:
qq replication_delete_object_relationship --id <ID>
This removes the relationship but does not delete any data from the cluster or S3.
Any fatal error during the replication job causes the job to fail, leaving a partially copied set of files in the Amazon S3 target. On failure, the relationship continues to exist so that its status and the associated failure message can be reviewed, but it cannot be restarted; it can be deleted to release its associated snapshot. A new relationship can be created with the same source and target to effectively restart the replication; any files transferred successfully by the previous relationship will not be re-transferred to S3.
In case of such an error, a message describing it will be returned in the status API/CLI. The error field of last_job from the output of the replication_list_object_relationship_statuses command contains a failure message in cases where the operation does not finish successfully. Additional information may also be present in the qumulo-replication.log on the Qumulo cluster.
While not required, the following practices are highly recommended when using Qumulo Shift for Amazon S3:
- Configure a bucket lifecycle policy in S3 to abort any incomplete multipart uploads older than several days. This ensures that any storage consumed by incomplete parts of large objects left by failed or interrupted replication operations is cleaned up automatically.
- For best performance when using a Qumulo cluster in AWS, configure a VPC endpoint to S3. For on-premises Qumulo clusters, AWS Direct Connect or another high-bandwidth, low-latency connection to S3 is recommended.
- Specify a unique object folder or unique bucket in S3 for each replication relationship established from a Qumulo cluster to S3 to avoid collisions between different data sets.
- Enable object versioning in the S3 bucket to protect against unintended overwrites.
- Delete completed object relationships. While completed relationships are retained to allow their final status to be reviewed, they should be deleted when no longer needed to free up associated resources (including snapshots that may be retained after some failures).
- Use concurrent replication relationships to S3 to increase parallelism, especially across distinct datasets. However, limit the number of concurrent relationships if necessary, because a large number of concurrent operations may impact client I/O to the Qumulo cluster. While there is no hard limit, no more than 100 concurrently replicating relationships per cluster are recommended (including object replication relationships and Qumulo source replication relationships).
Considerations
- Buckets configured with S3 Object Lock and a default retention period cannot be used as a target for Qumulo Shift. If possible, either remove the default retention period and set retention periods explicitly on objects uploaded outside Shift, or use a different destination bucket that does not have Object Lock enabled.
- The size of any individual file cannot exceed 5 TiB, as this is the maximum single object size supported by Amazon S3. There is no limit on the total size of all files.
- File paths must be fewer than 1024 characters, including the configured object folder prefix but excluding the source directory path.
- Hard links are supported up to the maximum object size of Amazon S3, 5 TiB.
- Any object that already exists at a key targeted by a new replication relationship will be overwritten, whether or not replication has run previously, unless the object contains Qumulo-specific hash metadata that matches the file. Enable versioning on the bucket to ensure older versions of overwritten objects are retained.
- Qumulo Shift for AWS S3 does not support connecting through an HTTPS tunnelling proxy.
- Qumulo Shift only supports replication to Amazon S3—other S3-compatible cloud object stores and gateways have not been tested and may not function completely or at all.
- All connections are encrypted using HTTPS and verify the S3 server’s SSL certificate. HTTP is not supported.
- Anonymous access (to a public bucket) is not supported. Valid AWS credentials are required.
- All objects are stored under the default S3 standard storage class. Lifecycle policies may be configured in S3 to automatically move stored objects to other storage classes including Glacier.
- All objects are stored with the default binary/octet-stream content-type and consequently may be interpreted as binary data when downloaded from a browser. Content-type metadata may be separately attached to uploaded objects through the AWS console or other tools.
- Replication provides no throttling and may use all available bandwidth. Use Quality of Service rules on your network to throttle it if desired.
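The S3-side housekeeping recommended above (aborting stale multipart uploads, enabling versioning, and attaching content-type metadata) can be sketched with the AWS CLI. This is a sketch, assuming the AWS CLI is installed and configured; the bucket and key names are the example values used earlier, and a 7-day abort window is an illustrative choice:

```shell
# Create a lifecycle rule that aborts incomplete multipart uploads after
# 7 days, cleaning up parts left by failed or interrupted replications.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "abort-incomplete-multipart-uploads",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }
  ]
}
EOF

# Apply the lifecycle rule and enable versioning (run against a real bucket):
#   aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
#       --lifecycle-configuration file://lifecycle.json
#   aws s3api put-bucket-versioning --bucket my-bucket \
#       --versioning-configuration Status=Enabled
#
# Optionally set a content type on an uploaded object via a server-side
# copy onto itself (Shift uploads objects as binary/octet-stream):
#   aws s3api copy-object --bucket my-bucket \
#       --key my-folder/my-project/file.txt \
#       --copy-source my-bucket/my-folder/my-project/file.txt \
#       --metadata-directive REPLACE --content-type text/plain
```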
You should now understand how to use Qumulo Shift to copy files from the Qumulo file system to a folder in an Amazon S3 bucket.