Migrating your data safely

From Centre for Bioinformatics and Computational Biology


Moving large data sets can be problematic. Drag-and-drop operations, or even the ‘mv’ command, are not reliable for this, because a drop in network connectivity (for example) can interrupt the move part-way through.

A much better and tried-and-tested workflow is as follows:

  • First, do a ‘cp’ with proper construction of the command line.
  • Then follow that with one or preferably two ‘rsync’ operations.
  • Only then can the source be deleted safely.

This works because ‘cp’ is usually much faster than ‘rsync’,
and it gets the bulk of the data across. But it can fail on certain files, for many reasons and often without you noticing,
and leave the copy operation half done.

This is where ‘rsync’ does a good job: by default it compares the modification time and size of each file at the source and destination,
and then copies only those files that are out of date or do not exist. When a second ‘rsync’ is run afterwards, it will report no files copied,
and this is when you know for sure that the source and target are identical. You can then safely proceed to delete the source.
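
Because ‘cp’ can fail on some files without it being obvious, it may help to keep its exit status and error messages. The following is only a minimal sketch, assuming GNU ‘cp’ and bash; the helper name copy_with_log and the throwaway paths are made up for illustration and are not part of the workflow described here.

```shell
#!/usr/bin/env bash
# Sketch: run the initial cp and keep a record of anything that failed.
# copy_with_log is a hypothetical helper name.
copy_with_log() {
    local src="$1" dst="$2" log="$3"
    # -r recursive, -v verbose, --preserve=all keeps attributes (GNU cp).
    # Verbose output goes to $log.out, error messages to $log.err.
    cp -r -v --preserve=all "$src" "$dst" >"$log.out" 2>"$log.err"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "cp exited with status $rc; see $log.err" >&2
    fi
    return "$rc"
}

# Demonstration on throwaway directories (paths are made up for this example).
demo=$(mktemp -d)
mkdir -p "$demo/src/sub" "$demo/dst"
echo "hello" >"$demo/src/sub/file.txt"

ok_status=0
copy_with_log "$demo/src" "$demo/dst" "$demo/cp" || ok_status=$?

bad_status=0
copy_with_log "$demo/does_not_exist" "$demo/dst" "$demo/cp_missing" || bad_status=$?

echo "good copy: $ok_status, bad copy: $bad_status"
```

A non-zero status, or a non-empty error file, is a hint that some files did not make it across. The rsync passes are still needed either way.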

Example:

First, log on to one of our general-use servers such as Zoidberg (please do not do this from our Wonko headnode) :-

  # ssh <username>@zoidberg.bi.up.ac.za

Please note that the command line switches need to be exactly as below :-

  # tmux
  # cp -r -v --preserve=all /home/<username>/some_directory /nlustre/users/<username>
  # rsync -raH --progress /home/<username>/some_directory/ /nlustre/users/<username>/some_directory/     ##### Note the trailing slashes (directory contents, including hidden files); path specs differ from the cp line

then again (just hit up-arrow) :-

  # rsync -raH --progress /home/<username>/some_directory/ /nlustre/users/<username>/some_directory/

Note: Always use full explicit paths when doing this kind of operation! Also - be careful when pasting code from browsers.

You are done when the second rsync run copies no further files. It won't say that explicitly, but you'll notice that no further files or copy stats are listed, and that the total sizes of source and destination are the same.
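
Since rsync does not state explicitly that nothing is left to copy, one way to check is a dry run that lists the remaining differences. This is a sketch only, assuming bash and an rsync that supports ‘-n’ (--dry-run) and ‘-i’ (--itemize-changes); the helper name pending_changes and the throwaway paths are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list what rsync would still transfer, without copying anything.
# pending_changes is a hypothetical helper name.
pending_changes() {
    local src="$1" dst="$2"
    # -n (--dry-run) makes no changes; -i (--itemize-changes) prints one
    # line per item that differs. Trailing slashes compare directory contents.
    rsync -aHn --itemize-changes "$src/" "$dst/"
}

# Demonstration on throwaway directories (paths are made up for this example).
demo=$(mktemp -d)
mkdir -p "$demo/src/sub" "$demo/dst"
echo "hello" >"$demo/src/sub/file.txt"
echo "hidden" >"$demo/src/.config"

rsync -aH "$demo/src/" "$demo/dst/"           # real copy
in_sync=$(pending_changes "$demo/src" "$demo/dst")

echo "new data" >"$demo/src/sub/extra.txt"    # source changes after the copy
out_of_sync=$(pending_changes "$demo/src" "$demo/dst")

if [ -z "$in_sync" ]; then echo "first check: nothing left to copy"; fi
if [ -n "$out_of_sync" ]; then echo "second check: differences remain"; fi
```

Empty output from the dry run means rsync found nothing to transfer. Any listed line is an item that still differs between source and destination.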

Any or all of these operations can take many hours, even days - hence the use of ‘tmux’ to preserve the session. If you are not familiar with tmux, a good tutorial can be found here: https://gist.github.com/MohamedAlaa/2961058

You do not need to learn all of tmux to use it - just the 3 or so basic commands will do.

Then, finally, you can remove your source directories:-

  # cd /home/<username>
  # rm -rf /home/<username>/some_directory   ##### PROCEED WITH EXTREME CAUTION! #####

Double-check your command line before you hit enter when using ‘rm -rf’. You absolutely need to use full paths. The effects of ‘rm -rf’ are immediate and irreversible, so mistakes will be costly.
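
As an extra precaution, the path can be sanity-checked before it is handed to ‘rm -rf’. The sketch below assumes bash; the helper name check_rm_target and the throwaway paths are made up for illustration. It deletes nothing itself and only reports whether the path is an absolute, existing directory that is neither ‘/’ nor your home directory.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a path before it is handed to 'rm -rf'.
# check_rm_target is a hypothetical helper; it deletes nothing itself.
check_rm_target() {
    local target="$1"
    case "$target" in
        /*) ;;                                    # absolute path: fine
        *)  echo "refusing: '$target' is not an absolute path" >&2; return 1 ;;
    esac
    if [ ! -d "$target" ]; then
        echo "refusing: '$target' is not an existing directory" >&2; return 1
    fi
    # Resolve symlinks and '..' so that '/' or the home directory itself
    # cannot slip through under another spelling.
    local resolved home
    resolved=$(cd "$target" && pwd -P) || return 1
    home=$(cd "${HOME:-/}" 2>/dev/null && pwd -P)
    if [ "$resolved" = "/" ] || [ "$resolved" = "$home" ]; then
        echo "refusing: '$target' resolves to '$resolved'" >&2; return 1
    fi
    echo "checked: $resolved"
}

# Demonstration on a throwaway directory (path is made up for this example).
demo=$(mktemp -d)
mkdir -p "$demo/some_directory"

check_rm_target "$demo/some_directory" && echo "would now run: rm -rf $demo/some_directory"
check_rm_target "some_directory" || echo "relative path rejected"
check_rm_target "/" || echo "root rejected"
```

A check like this does not replace reading the command line carefully; it only catches a few of the most damaging slips.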

If all went well, you’ll end up with an exact copy of your source files, with all attributes, permissions and time stamps preserved, at the new destination.