IN THIS ARTICLE
Outlines how to use Qumulo Shift to copy files from the Qumulo file system to a folder in an Amazon S3 bucket.
REQUIREMENTS
- Qumulo cluster running Qumulo Core 3.2.1 or above for QQ CLI configuration
- Qumulo cluster running Qumulo Core 3.2.5 or above for Web UI configuration
- Cluster with HTTPS connectivity to s3.<region>.amazonaws.com (see AWS IP address ranges) via one of the following means:
  - Public Internet
  - VPC endpoint
  - AWS Direct Connect
- An existing bucket in Amazon S3
IAM PERMISSIONS
Qumulo Shift requires AWS credentials (e.g., the access key ID and secret access key) with the following permissions:
- s3:ListBucket
- s3:PutObject
- s3:GetObject
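The permissions listed above could be granted with an IAM policy along the following lines. This is a minimal sketch rather than an official Qumulo policy: the function name build_shift_policy and the bucket name my-bucket are illustrative assumptions, and your organization may need tighter resource scoping. It relies on the standard AWS convention that s3:ListBucket applies to the bucket ARN while s3:PutObject and s3:GetObject apply to the objects in the bucket.

```python
import json


def build_shift_policy(bucket: str) -> dict:
    """Return an IAM policy document granting the three permissions listed above.

    Hypothetical helper for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions target the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# "my-bucket" is a placeholder bucket name.
print(json.dumps(build_shift_policy("my-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM policy editor and attached to the user or role whose access key is supplied to Qumulo Shift.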
DETAILS
Qumulo Shift for Amazon S3 allows you to perform a one-time copy of data from any Qumulo cluster—whether on-premises or already running in the cloud—to Amazon’s Simple Storage Service cloud object store (AWS S3), making it easy to take advantage of thousands of AWS cloud services and applications. Note that this copy is one-way; Shift cannot be used to copy the same data back into the Qumulo file system.
During the creation of a Shift relationship, Qumulo verifies that the specified source directory exists on the file system and that the S3 bucket exists and is accessible via the specified credentials. Once the relationship is created successfully, a job is started using one of the nodes in the cluster—when performing multiple Shift operations, multiple nodes will be used. This job takes a temporary snapshot (named replication_to_bucket_<bucket_name>) of the source directory to ensure that the copy is point-in-time-consistent. Shift then recursively traverses the directories and files in that snapshot, copying each file to a corresponding object in S3.
NOTE: File paths in the source directory are preserved in the keys of replicated objects, i.e., the file /my-dir/my-project/file.txt will be uploaded as the object https://my-bucket.s3.us-west-2.amazonaws.com/my-folder/my-project/file.txt.
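The path-to-key mapping described in the note above can be sketched as follows. This illustrates the naming rule only and is not Qumulo's implementation: the helper names object_key and object_url are assumptions, and the URL is built without any URL-encoding of special characters.

```python
import posixpath


def object_key(source_dir: str, object_folder: str, file_path: str) -> str:
    """Map a file under source_dir to the S3 object key under object_folder."""
    # Path of the file relative to the replicated source directory.
    rel = posixpath.relpath(file_path, start=source_dir)
    if rel == ".." or rel.startswith("../"):
        raise ValueError(f"{file_path} is not inside {source_dir}")
    # The object folder acts as a key prefix; S3 keys have no leading slash.
    folder = object_folder.strip("/")
    return f"{folder}/{rel}" if folder else rel


def object_url(bucket: str, region: str, key: str) -> str:
    """Build the virtual-hosted-style URL shown in the note above."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


key = object_key("/my-dir/", "/my-folder/", "/my-dir/my-project/file.txt")
print(key)
print(object_url("my-bucket", "us-west-2", key))
```

With the values from the note, the key is my-folder/my-project/file.txt, which corresponds to the object URL given there.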
The data is not encoded or transformed in any way, but only data in a regular file's primary stream is replicated (alternate data streams and file system metadata such as ACLs are not included). Any hard links to a file within the replication source directory are also replicated to S3 as a full copy of the object, with identical contents and metadata—however, this copy is performed using a server-side S3 copy operation to avoid transferring the data across the internet. See the table below for specifics on how entities in the Qumulo file system map to entities in an Amazon S3 bucket:
In the Qumulo File System | Becomes in Amazon S3 |
Regular file | S3 object (object key is the file system path, data is the file data) |
Directory | Not copied (directory structure is preserved in the object key of objects created for files) |
Symbolic link | Not copied |
UNIX device file | Not copied |
Hard link to a regular file | Copy of the S3 object |
Hard link to a non-regular file | Not copied |
Timestamps (mtime/ctime/atime/btime) | Not copied |
Access control lists | Not copied |
SMB extended file attributes | Not copied |
Alternate data streams | Not copied |
Holes in sparse files | Zeroes (Holes are expanded) |
NOTE: When copying, Shift checks whether a file was previously replicated to S3 using Shift. If the resulting object still exists in the target S3 bucket and neither the file nor the object has been modified since the last successful replication, the file's data will not be re-transferred to S3. Shift will never delete files in the target folder on S3, even if they have been removed from the source directory since the last replication.
Once the job has completed, the temporary snapshot is deleted and the job ends successfully. The relationship remains on the Qumulo Cluster so you can monitor the completion status of the job; it can be manually deleted when you are satisfied the job has been finished—deleting the relationship will not affect data residing on Qumulo or S3. Relationships exist as a one-time operation and cannot be set to recurring or reused; if you wish to perform a copy of the same folder, a new relationship must be created.
Copy Files to Amazon S3 via the Web UI
- Log in to the Qumulo Core Web UI.
- Hover over Cluster and click Copy to S3.
- Click Create Copy.
- Fill in the required fields under the Source and Target sections.
Source Directory Path: the path of the directory to be copied on the source cluster
Target Region: the AWS region for your Amazon S3 bucket
Target Folder: the name of the existing folder in the target bucket on Amazon S3
Target Bucket Name: the name of the existing Amazon S3 bucket
Target Access Key ID and Secret Access Key: the access key ID and secret access key for Amazon S3
NOTE: The Source Directory Path and Target Folder fields will default to "/" if left blank.
- Select Advanced to access the optional advanced S3 server configuration settings.
- Click Create Copy.
To create a Qumulo Shift copy using the QQ CLI, replace the path, region, folder, bucket, and access key ID placeholders in the qq command below. The command then prompts for the secret access key:
qq replication_create_object_relationship --source-directory-path <PATH> --object-store-address s3.<REGION>.amazonaws.com --object-folder <FOLDER> --bucket <BUCKET> --region <REGION> --access-key-id <KEY>
EXAMPLE: The following example shows how to create a relationship between the directory /my-dir/ on the Qumulo file system and the bucket my-bucket and folder /my-folder/ in the us-west-2 region of Amazon S3:
qq replication_create_object_relationship --source-directory-path /my-dir/ --object-store-address s3.us-west-2.amazonaws.com --object-folder /my-folder/ --bucket my-bucket --region us-west-2 --access-key-id ABC
Enter the secret access key associated with this access key ID:
{
  "access_key_id": "ABC",
  "bucket": "my-bucket",
  "object_store_address": "s3.us-west-2.amazonaws.com",
  "id": "5c57b2ed-1c08-4f84-8e65-a7f8f0ceff95",
  "object_folder": "my-folder/",
  "port": 443,
  "ca_certificate": null,
  "region": "us-west-2",
  "source_directory_id": "3"
}
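If you create relationships from a script, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this article. The function name build_create_command is an assumption, and the secret access key is deliberately left out because the qq command prompts for it.

```python
import shlex
from typing import List


def build_create_command(path: str, region: str, folder: str,
                         bucket: str, access_key_id: str) -> List[str]:
    """Assemble the qq arguments for creating a Shift relationship."""
    return [
        "qq", "replication_create_object_relationship",
        "--source-directory-path", path,
        # The S3 endpoint address is derived from the region.
        "--object-store-address", f"s3.{region}.amazonaws.com",
        "--object-folder", folder,
        "--bucket", bucket,
        "--region", region,
        "--access-key-id", access_key_id,
    ]


argv = build_create_command("/my-dir/", "us-west-2", "/my-folder/", "my-bucket", "ABC")
# Print a copy-pasteable command line; shlex.join quotes arguments where needed.
print(shlex.join(argv))
```

Passing the arguments as a list (for example to subprocess.run) rather than as one concatenated string avoids quoting problems when a directory or folder name contains spaces.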
Newly created Qumulo Shift relationships will be added to the list on the Copy to S3 page. Within a few minutes, a snapshot will be captured of the source directory to begin copying files from the cluster to Amazon S3.
Review Relationship Details
As mentioned above, Qumulo Shift relationships are listed in the table featured on the Copy to S3 page so that you can review the status, start time, completed time, source, and target for each copy you've created.
For a deeper dive, select View Details from the Actions menu of any Shift relationship listing to see the throughput, run time, and data in transit stats at a granular level for that specific job.
If you are using the QQ CLI, you can see the full list of Shift relationships by running the following qq command on your source cluster:
qq replication_list_object_relationships
To view details about a specific relationship, supply the ID of that Shift relationship in the command below:
qq replication_get_object_relationship --id <ID>
Lastly, you can run the qq command featured in the example below on the source cluster to check the status of all Qumulo Shift relationships. To view the status of a specific relationship, use the replication_get_object_relationship_status command and supply the ID of the relationship.
qq replication_list_object_relationship_statuses
[
{
"access_key_id": "ABC",
"bucket": "my-bucket",
"object_store_address": "s3.us-west-2.amazonaws.com",
"id": "5c57b2ed-1c08-4f84-8e65-a7f8f0ceff95",
"object_folder": "my-folder/",
"port": 443,
"ca_certificate": null,
"region": "us-west-2",
"source_directory_id": "3",
"source_directory_path": "/my-dir/",
"state": "REPLICATION_RUNNING",
"current_job": {
"start_time": "2020-04-06T17:56:29.659309904Z",
"estimated_end_time": "2020-04-06T21:54:33.244095593Z",
"job_progress": {
"bytes_transferred": "178388608",
"bytes_unchanged": "0",
"bytes_remaining": "21660032",
"bytes_total": "200048640",
"files_transferred": "17",
"files_unchanged": "0",
"files_remaining": "4",
"files_total": "21",
"percent_complete": 0.890368314738253,
"throughput_current": "12330689",
"throughput_overall": "12330689"
}
},
"last_job": null
}
]
A "state" of REPLICATION_RUNNING indicates that the operation is in progress, and the current_job field shows the current progress. Once files are copied to Amazon S3, details for the most recent job that completed will be available in the last_job field. Additionally, the status will reflect REPLICATION_NOT_RUNNING and the output of current_job will be empty.
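The status output can also be summarized by a script. The following is a minimal sketch that assumes the JSON shape shown in the example above. The function name summarize is hypothetical, and SAMPLE_STATUSES is a trimmed copy of that example output. Note that the byte and file counters are JSON strings and need converting before use, whereas percent_complete is a number.

```python
import json

# Trimmed copy of the example status output shown above.
SAMPLE_STATUSES = """
[
  {
    "id": "5c57b2ed-1c08-4f84-8e65-a7f8f0ceff95",
    "state": "REPLICATION_RUNNING",
    "current_job": {
      "job_progress": {
        "files_transferred": "17",
        "files_total": "21",
        "percent_complete": 0.890368314738253
      }
    },
    "last_job": null
  }
]
"""


def summarize(statuses_json: str) -> list:
    """Return one human-readable line per Shift relationship."""
    lines = []
    for rel in json.loads(statuses_json):
        job = rel.get("current_job")
        if rel["state"] == "REPLICATION_RUNNING" and job:
            progress = job["job_progress"]
            # Counters arrive as strings; convert them before formatting.
            done = int(progress["files_transferred"])
            total = int(progress["files_total"])
            pct = progress["percent_complete"]
            lines.append(f"{rel['id']}: {done}/{total} files, {pct:.1%} complete")
        else:
            lines.append(f"{rel['id']}: not running")
    return lines


for line in summarize(SAMPLE_STATUSES):
    print(line)
```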
Abort a Copy in Progress
Whether you are using the Web UI or the QQ CLI, there is no mechanism to pause and resume replication once started, nor to restart a copy job that has failed or stopped.
To stop a copy of files to Amazon S3 that is currently in progress via the Web UI, click the Actions menu and select Abort.
Click Yes, Abort Copy to confirm.
Alternatively, you can run the following qq command to abort the copy in progress:
qq replication_abort_object_relationship --id <ID>
NOTE: Aborting a job will not clean up any files that have already been transferred to the Amazon S3 bucket.
Delete a Relationship
To delete a relationship that has completed the copy job to Amazon S3, click the Actions menu and select Delete on the relationship listing.
Click Yes, Delete Copy to remove.
To use the QQ CLI, run the following command to delete a relationship that is not actively running, substituting the ID of the desired relationship for <ID>:
qq replication_delete_object_relationship --id <ID>
The record of the replication job will be removed, leaving objects stored in Amazon S3 unchanged.
Troubleshooting Errors
Any fatal errors during the replication job will cause it to fail, leaving a partially copied set of files in the Amazon S3 target. On failure, the relationship also continues to exist to allow its status and the associated failure message to be reviewed, but it cannot be restarted and can be deleted to release its associated snapshot. A new relationship can be created with the same source and target to effectively restart the replication—any successfully transferred files from the previous relationship will not be retransferred to S3.
In case of such an error, a message describing it will be returned in the status API/CLI. The error field of last_job from the output of the replication_list_object_relationship_statuses command contains a failure message in cases where the operation does not finish successfully. Additional information may also be present in the qumulo-replication.log on the Qumulo cluster.
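A script can collect these failure messages from the status output. The sketch below assumes only what is stated above, namely that a failed job exposes its message in the error field of last_job. The function name failed_relationships, the relationship IDs, and the message text in the sample data are invented for illustration and are not real Qumulo output.

```python
import json

# Hypothetical status data: only the last_job.error field is described in
# this article; the IDs and the message text are made up for illustration.
SAMPLE = """
[
  {"id": "rel-1", "state": "REPLICATION_NOT_RUNNING", "current_job": null,
   "last_job": {"error": "example failure message"}},
  {"id": "rel-2", "state": "REPLICATION_RUNNING", "current_job": {},
   "last_job": null}
]
"""


def failed_relationships(statuses_json: str) -> dict:
    """Map relationship ID to failure message for jobs that did not succeed."""
    failures = {}
    for rel in json.loads(statuses_json):
        # last_job is null until a job has finished, so fall back to {}.
        last_job = rel.get("last_job") or {}
        error = last_job.get("error")
        if error:
            failures[rel["id"]] = error
    return failures


print(failed_relationships(SAMPLE))
```

Each ID returned can then be used to delete the failed relationship and create a new one with the same source and target.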
Best Practices
While not required, the following practices are highly recommended when using Qumulo Shift for AWS S3:
- Configure a bucket lifecycle policy in S3 to abort any incomplete multipart uploads older than several days to ensure that any storage consumed by incomplete parts of large objects left by failed or interrupted replication operations is cleaned up automatically. If a Qumulo Shift replication job to Amazon S3 is interrupted by a user or an unrecoverable error, the cluster will make a best effort attempt to clean up incomplete multipart uploads whether or not a bucket lifecycle policy is in place.
- For best performance when using a Qumulo cluster in AWS, configure a VPC endpoint to S3. For on-premises Qumulo clusters, AWS Direct Connect or another high-bandwidth, low-latency connection to S3 is recommended.
- Specify a unique object folder or unique bucket in S3 for each replication relationship established from a Qumulo cluster to S3 to avoid collisions between different data sets.
- Enable object versioning in the S3 bucket to protect against unintended overwrites.
- Delete completed object relationships. While completed relationships are retained to allow their final status to be reviewed, they should be deleted when no longer needed to free up associated resources (including snapshots that may be retained after some failures).
- Use concurrent replication relationships to S3 to increase parallelism, especially across distinct datasets. Limit their number where necessary, because a large number of concurrent operations may impact client I/O to the Qumulo cluster. While there is no hard limit, running more than 100 concurrent replication relationships on a cluster (including object replication relationships and Qumulo source replication relationships) is not recommended.
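As an illustration of the first recommendation above, the following sketch generates an S3 lifecycle configuration that aborts incomplete multipart uploads. The seven-day threshold, the rule ID, and the function name are assumptions chosen for the example; pick a threshold that suits your environment.

```python
import json


def abort_incomplete_uploads_config(days: int = 7) -> dict:
    """Build a lifecycle configuration that aborts stale multipart uploads.

    The default of 7 days is an arbitrary example value.
    """
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                # An empty prefix applies the rule to every key in the bucket.
                "Filter": {"Prefix": ""},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


print(json.dumps(abort_incomplete_uploads_config(), indent=2))
```

The printed JSON can be saved to a file, for example lifecycle.json, and applied with the AWS CLI command aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json, where my-bucket is a placeholder.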
Additional Considerations
- Buckets configured with S3 Object Lock and a default retention period cannot be used as a target for Qumulo Shift. If possible, either remove the default retention period and set retention periods explicitly on objects uploaded outside Shift, or use a different destination bucket on which Object Lock is not enabled.
- The size of any individual file cannot exceed 5TiB, as this is the maximum single object size supported by Amazon S3. There is no limit on the total size of all files.
- File paths must be fewer than 1024 characters, including the configured object folder prefix but excluding the source directory path.
- Hard links are supported up to the full supported object size of Amazon S3, 5TiB (Qumulo Core 3.2.3 or higher).
- Whether replication ran previously or not, any object existing under the same key being replicated by a new replication relationship will be overwritten unless it contains Qumulo-specific hash metadata that matches the file. Versioning can be enabled on the bucket to ensure older versions of overwritten objects are retained.
- All files are replicated with Amazon S3 server-side integrity verification during upload using a SHA256 checksum stored in the replicated object's metadata.
- Qumulo Shift only supports replication to Amazon S3—other S3-compatible cloud object stores and gateways have not been tested and may not function completely or at all.
- All connections are encrypted using HTTPS and verify the S3 server’s SSL certificate. HTTP is not supported.
- Anonymous access (to a public bucket) is not supported. Valid AWS credentials are required.
- All objects are stored under the default S3 standard storage class. Lifecycle policies may be configured in S3 to automatically move stored objects to other storage classes including Glacier.
- All objects are stored with the default binary/octet-stream content-type and consequently may be interpreted as binary data when downloaded from a browser. Content-type metadata may be separately attached to uploaded objects through the AWS console or other tools.
- Replication provides no throttling and may use all available bandwidth. Use Quality of Service rules on your network to throttle it if desired.
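The size and path-length limits listed above can be checked before starting a copy. This is a rough pre-flight sketch, not part of Qumulo Shift: the function name check_file is hypothetical, and it counts characters as the article states, whereas Amazon S3 itself measures key length in UTF-8 bytes.

```python
MAX_OBJECT_BYTES = 5 * 1024**4  # 5 TiB, the largest single S3 object
MAX_KEY_CHARS = 1024            # keys must be fewer than this many characters


def check_file(relative_path: str, size_bytes: int, object_folder: str = "") -> list:
    """Return a list of problems that would prevent this file from being copied.

    relative_path is the file's path relative to the source directory, since
    the source directory path does not count toward the length limit.
    """
    problems = []
    folder = object_folder.strip("/")
    rel = relative_path.lstrip("/")
    # The object folder prefix counts toward the key length.
    key = f"{folder}/{rel}" if folder else rel
    if len(key) >= MAX_KEY_CHARS:
        problems.append(f"key is {len(key)} characters; must be fewer than {MAX_KEY_CHARS}")
    if size_bytes > MAX_OBJECT_BYTES:
        problems.append(f"file is {size_bytes} bytes; the limit is {MAX_OBJECT_BYTES}")
    return problems


print(check_file("my-project/file.txt", 4096, "my-folder"))
```

An empty list means the file passes both checks. Walking the source directory and calling this function for each file gives an early warning before a relationship is created.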
RESOLUTION
You should now understand how to use Qumulo Shift to copy files from the Qumulo file system to a folder in an Amazon S3 bucket.
ADDITIONAL RESOURCES
Replication: Continuous Replication with 2.11.2 and above
Policies and permissions in Amazon S3
Aborting incomplete multipart uploads using a bucket lifecycle policy