[UCI-Linux] parsync: parallelize and loadbalance rsync
harry.mangalam at uci.edu
Wed Apr 2 08:03:58 PDT 2014
We (at RCS) do lots of data transfer of large numbers and sizes of files and
I've even documented some of this here:
We've found that a single instance of rsync can be quite sub-optimal for
large-scale data movement, so I've written 'parsync', a wrapper for rsync that
allows for parallelization and loadbalancing (both for system load and
It recently rsync'ed 22M files in about 6 hrs over a 10Gb private net at a
measured transfer of up to 400MB/s (the limit of the slowest filesystem). Of
the total 27TB, ~7TB actually transferred; the rest were checksummed.
Here's the help. (output of parsync -h)
parsync version 1.0 (beta)
by Harry Mangalam <hjmangalam at gmail.com> || <harry.mangalam at uci.edu>
parsync is a Perl script that wraps Andrew Tridgell's miraculous 'rsync' to
provide some load balancing and parallel operation to increase the amount
of bandwidth it can use.
The only native rsync option that parsync uses is '-a' (archive). If you
need more, then it's up to you to provide them via '--rsync_opts'.
parsync checks to see if the current system load is too heavy and tries
to throttle the rsyncs during the run by monitoring and suspending
/ continuing them as needed.
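
The suspend / continue behavior described above can be sketched roughly as
follows. This is only an illustration, not parsync's actual Perl code: the
function names are hypothetical, and the use of SIGSTOP / SIGCONT is an
assumption about how a process would typically be suspended and continued.

```python
import os
import signal


def throttle_action(load, maxload, suspended):
    """Decide what one monitoring pass should do.

    Returns "suspend" if the load is above the cutoff and the rsyncs are
    running, "resume" if the load is at or below the cutoff and they are
    suspended, otherwise None.
    """
    if load > maxload and not suspended:
        return "suspend"
    if load <= maxload and suspended:
        return "resume"
    return None


def apply_action(pids, action):
    """Send SIGSTOP or SIGCONT to each rsync PID (assumed mechanism)."""
    sig = {"suspend": signal.SIGSTOP, "resume": signal.SIGCONT}[action]
    for pid in pids:
        os.kill(pid, sig)


def five_minute_load():
    """Return the 5-minute load average (second field of getloadavg)."""
    return os.getloadavg()[1]
```

A monitoring loop would call five_minute_load(), pass the result to
throttle_action() with the '--maxload' value, and sleep between passes.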
It uses the very efficient (also Perl-based) kdirstat-cache-writer
from kdirstat to generate lists of files which are summed and
crudely divided into NP jobs by size.
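
To give an idea of what dividing a file list into NP jobs by size can look
like, here is a minimal sketch. It uses a greedy "largest file to the
lightest chunk" rule, which is an assumption chosen for illustration;
parsync's own division is described only as crude and may work differently.
The function name and the (path, size) input format are hypothetical.

```python
import heapq


def split_by_size(files, n_jobs):
    """Split (path, size_bytes) pairs into n_jobs lists of paths.

    Files are taken largest first and each one is added to whichever
    chunk currently has the smallest total size.
    """
    chunks = [[] for _ in range(n_jobs)]
    # Heap of (total_bytes_so_far, chunk_index), lightest chunk on top.
    heap = [(0, i) for i in range(n_jobs)]
    heapq.heapify(heap)
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, i = heapq.heappop(heap)
        chunks[i].append(path)
        heapq.heappush(heap, (total + size, i))
    return chunks


files = [("a", 50), ("b", 30), ("c", 20), ("d", 10)]
print(split_by_size(files, 2))  # [['a', 'd'], ['b', 'c']]
```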
It appropriates rsync's bandwidth throttle mechanism, using '--maxbw'
as a passthru to rsync's 'bwlimit' option, but divides it by NP so
as to keep the total bw the same as the stated limit. It monitors and
shows network bandwidth, but can't change the bw allocation mid-job.
It can only suspend rsyncs until the load decreases below the cutoff.
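
The '--maxbw' arithmetic above amounts to something like the sketch below.
The function name is hypothetical. Flooring the per-process value at 1 is
an added assumption, made because rsync treats '--bwlimit=0' as "no limit".

```python
def rsync_bwlimit_args(maxbw_kbps, n_jobs):
    """Return the --bwlimit argument for ONE rsync process.

    maxbw_kbps is the total allowed bandwidth in KB/s (None = unlimited);
    it is divided evenly across n_jobs rsync processes.
    """
    if maxbw_kbps is None:
        return []  # no limit requested, pass nothing to rsync
    per_process = max(1, maxbw_kbps // n_jobs)
    return [f"--bwlimit={per_process}"]


print(rsync_bwlimit_args(10000, 4))  # ['--bwlimit=2500']
```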
Unless changed by '--interface', it assumes and monitors eth0. The
transfer will use whatever interface normal routing provides. This is
normally set by the name of the target host. It can also be used for
non-host-based transfers (between mounted filesystems) but the network
bandwidth continues to be (pointlessly) shown.
It only works on dirs and files that originate from the current dir (or
specified via "--rootdir"). You cannot include dirs and files from
discontinuous or higher-level dirs.
** the ~/.parsync files **
This dir contains the cache (*.gz), the chunk files (kds*), and the time-
stamped log files. The cache files can be re-used with '--reusecache' (nb:
that will re-use ALL the cache files); the chunk files are overwritten at
each execution; the log files are datestamped and are NOT overwritten.
** Odd characters in names **
parsync will refuse to transfer some oddly named files. Filenames with
embedded newlines, DOS EOLs, and some other odd chars will be recorded in
the logfiles in the ~/.parsync dir.
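
As a rough illustration of the filename screening described above, the
sketch below separates out names containing an embedded newline or a
carriage return (the DOS EOL character). The help text does not list the
"other odd chars", so they are not covered here. Both function names are
hypothetical.

```python
def has_odd_chars(name):
    """True if the filename contains a newline or a carriage return."""
    return "\n" in name or "\r" in name


def partition_names(names):
    """Split names into (transferable, refused) lists."""
    ok = [n for n in names if not has_odd_chars(n)]
    odd = [n for n in names if has_odd_chars(n)]
    return ok, odd


ok, odd = partition_names(["notes.txt", "bad\nname.txt", "dos\r.txt"])
print(ok)   # ['notes.txt']
print(odd)  # ['bad\nname.txt', 'dos\r.txt']
```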
[i] = integer number
[f] = floating point number
[s] = "quoted string"
( ) = the default if any
--NP [i] (2) .......................... number of rsync processes to start
--rootdir [s] (`pwd`) .............. the directory it works relative to
--maxbw [i] (unlimited) .......... in KB/s max bandwidth to use (--bwlimit
passthru to rsync). maxbw is the total BW to be used, NOT per rsync.
--maxload [f] (4) ................ max system load - if sysload > maxload,
sleeps an rsync proc for 10s
--rsync_opts [s] ... options passed to rsync as a quoted string (CAREFUL!)
this opt triggers a pause before executing to verify the command.
--interface [s] ........ network interface to monitor, not use (see above)
--reusecache .......... don't re-read the dirs; re-use the existing caches
--email [s] ..................... email address to send completion message
--barefiles . set to allow rsync of individual files, as opposed to dirs
--nowait ................ for scripting, sleep for a few s instead of wait
--version ................................. dumps version string and exits
--help ......................................................... this help
parsync --maxload=5.5 --NP=4 --rootdir='/home/hjm' dir1 dir2 dir3 \
hjm@remotehost:~/backups
= "--rootdir='/home/hjm'" sets the working dir of this operation to
'/home/hjm' and dir1 dir2 dir3 are subdirs from '/home/hjm'
= the target "hjm@remotehost:~/backups" is the same target rsync would use
= "--NP=4" forks 4 instances of rsync
= "--maxload=5.5" will start suspending rsync instances when the 5m system
load gets to 5.5 and then unsuspending them when it goes below it.
It uses 4 instances to rsync dir1 dir2 dir3 to hjm@remotehost:~/backups
parsync --reusecache --NP=3 --barefiles *.txt /mount/backups/txt
= "--reusecache" indicates that the filecache shouldn't be re-generated,
uses the previous filecache in ~/.parsync
= "--NP=3" for 3 copies of rsync (with no "--maxload", the default is 4)
= "--barefiles" indicates that it's OK to transfer barefiles instead of
recursing thru dirs.
= "/mount/backups/txt" is the target - a local disk mount instead of a
It uses 3 instances to rsync *.txt from the current dir to
"/mount/backups/txt"
You can get the perl script and required utilities here:
(it requires the kdirstat recurser and another util called stats)
It is currently very chatty about what it's doing. The verbosity will
probably be reduced as time goes on.
I'd be happy to get feedback on behavior, performance, and implementation.
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)