Migrating and replicating data between storage clusters can be very time-consuming, especially when the data set is large or deeply nested. Replication tools can be slow to crawl deep folder hierarchies to determine which files have changed, and tools such as rsync can take many hours to synchronize terabytes of data between clusters.
The ‘qsplit’ sample on GitHub uses the Qumulo REST APIs, with some associated instructions, to drastically reduce the amount of time needed to migrate or replicate data from one Qumulo cluster to another. You can find the sample here:
Here’s a brief description of qsplit, including what it does and how it can be used.
Our source cluster is four QC26s, i.e. our “small model.” On the source cluster we used a source path with thousands of small to medium files in a fairly deep hierarchy, about 1.5 TB of data. Our target cluster is four QO626s.
In addition, four client machines were used with discrete NICs for the rsync work. We could have run separate rsync processes locally or spun up VMs instead for this purpose, but for the experiment our team wanted discrete physical NICs in order to do the transfer. On each client machine we mounted the source path from the source cluster and the destination path from the target cluster, as well as a share on the source cluster to access the generated bucket file needed by each process.
The primary use case for qsplit is to optimize migration of data from a Qumulo cluster by using Qumulo's analytics APIs.
Optimized Migration using dir aggregates/REST API
Divide a Qumulo cluster into N equal partitions (each a list of paths). The partitioning is based on the block count, which is obtained from:
- Feed each partition to an rsync client. As an example, you can run the command:
./qsplit.py --host music /music/ --buckets 4
- This will create four 'bucket files' (lists of file paths, named by convention) for host 'music' and path '/music/' on the client. See the example below, where 'n' is a number from 1 to the number of buckets specified (four in the command above):
If you do not specify the '--buckets' parameter, qsplit creates a single bucket containing all of the file paths for the specified host and path.
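As a rough illustration of the partitioning idea described above, the sketch below greedily assigns paths to buckets so that the total block count per bucket comes out similar. This is not qsplit's actual implementation: the function name `split_into_buckets`, the largest-first greedy strategy, and the sample paths and block counts are all assumptions made for illustration.

```python
import heapq


def split_into_buckets(entries, num_buckets):
    """Greedily assign (path, blocks) entries to buckets of similar total size.

    Hypothetical helper for illustration; qsplit itself may partition differently.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    buckets = [[] for _ in range(num_buckets)]
    # Min-heap of (total_blocks, bucket_index): the lightest bucket is on top.
    heap = [(0, i) for i in range(num_buckets)]
    heapq.heapify(heap)
    # Place the largest entries first so the smaller ones can even out totals.
    for path, blocks in sorted(entries, key=lambda e: e[1], reverse=True):
        total, index = heapq.heappop(heap)
        buckets[index].append(path)
        heapq.heappush(heap, (total + blocks, index))
    return buckets


if __name__ == "__main__":
    # Made-up relative paths and block counts, not real cluster data.
    sample = [
        ("a/track1.flac", 800),
        ("a/track2.flac", 600),
        ("b/cover.jpg", 50),
        ("b/notes.txt", 10),
        ("c/album.zip", 1400),
    ]
    for n, bucket in enumerate(split_into_buckets(sample, 2), start=1):
        print(f"bucket {n}: {bucket}")
```

Each resulting list corresponds to one bucket file, with one relative path per line.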
Once the files are created you can copy them to different machines/NICs to perform rsyncs in parallel. You could also run the rsyncs on a single machine with separate processes but you'd likely bury the machine NIC with traffic that way.
- Copy the text files produced by qsplit to somewhere the client machines can reach them
- SSH to [n] different client machines with separate NICs
- Mount the cluster [src] and [dest] on each machine
- On each machine run rsync in the following fashion:
rsync -av -r --files-from=qsync_[YYYYMMDDHHMM]_bucket[n].txt [src qumulo cluster mount] [target cluster mount]
NOTE that the file paths in the bucket text files are all relative to the path specified when running qsplit. So if you created file paths for '/music/' then that should be your [src cluster mount] point so that the relative file paths can resolve.
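The per-machine rsync step can also be scripted. The following is a minimal sketch, assuming Python 3.8 or later on the client. The helper names `build_rsync_command` and `run_bucket`, the mount points, and the timestamp in the bucket file name are hypothetical and not part of qsplit. The rsync flags are the ones used in the command above.

```python
import shlex
import subprocess


def build_rsync_command(bucket_file, src_mount, dest_mount):
    """Return the rsync argument list for one bucket file."""
    # -r is passed explicitly because, with --files-from, rsync's -a option
    # does not imply recursion.
    return ["rsync", "-av", "-r", f"--files-from={bucket_file}", src_mount, dest_mount]


def run_bucket(bucket_file, src_mount, dest_mount, dry_run=True):
    """Print the command (dry run) or execute it, returning the exit code."""
    cmd = build_rsync_command(bucket_file, src_mount, dest_mount)
    if dry_run:
        print(shlex.join(cmd))
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical paths: the source mount must correspond to the path that was
    # given to qsplit, so the relative paths in the bucket file resolve.
    run_bucket("qsync_201501011200_bucket1.txt", "/mnt/src/music/", "/mnt/dest/music/")
```

Leaving `dry_run=True` prints the command for review; set it to `False` on each client to start the transfer for that client's bucket.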
Using the above approach you should see a significant performance improvement over running rsync in the traditional way:
rsync -av -r [src] [dest]
The performance improves for two reasons:
- No file crawl is needed by rsync because we're passing a filespec in --files-from.
- We're running multiple instances of rsync in parallel. In addition, by running each instance on a different client machine we avoid burying the NIC of a single machine and keep things nice and busy.
We saw about a 10x improvement over ‘traditional’ rsync write throughput: 219 MB/s vs. 19.7 MB/s (about 50 MB/s of that is attributable to the --files-from option), which is quite a gain over traditional rsync src dest usage. Part of this is attributable, in my opinion, to using the --files-from option with rsync and discrete file paths, obviating the need for a file crawl. But most of it is due to parallelizing the transfers.
The total transfer time was less than 2 hours for 1.5 TB of data. In comparison, I ran rsync the ‘traditional’ way (rsync src dest) between the same clusters without any file specs or --files-from list. It ran overnight and had still not finished when I stopped it.
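As a sanity check on these figures, the short calculation below estimates how long 1.5 TB would take at each reported throughput. It assumes decimal units (1 TB = 1,000,000 MB) and a constant transfer rate, which a real transfer would not hold exactly. The function name `transfer_hours` is illustrative.

```python
def transfer_hours(total_mb, throughput_mb_per_s):
    """Hours needed to move total_mb at a constant throughput in MB/s."""
    return total_mb / throughput_mb_per_s / 3600


TOTAL_MB = 1.5 * 1_000_000  # 1.5 TB, assuming decimal units

parallel_hours = transfer_hours(TOTAL_MB, 219)   # four parallel rsync clients
single_hours = transfer_hours(TOTAL_MB, 19.7)    # single traditional rsync
speedup = 219 / 19.7

print(f"parallel: {parallel_hours:.1f} h")
print(f"single:   {single_hours:.1f} h")
print(f"speedup:  {speedup:.1f}x")
```

The estimate comes to roughly 1.9 hours for the parallel run and about 21 hours for the single rsync. That is consistent with the transfer finishing in under 2 hours and with the traditional run still being unfinished after a night.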
Using qsplit with tools such as rsync can enable parallelized migration and synchronization from Qumulo clusters to other clusters. We saw great improvement even by splitting the work up into four non-overlapping, equally-sized sets of files, and by avoiding tree crawls for things like change notification. We can also greatly improve the performance of replication scenarios thanks to the Qumulo REST APIs.