
Copying Terabytes of Data from Apple OS X HFS+ to GNU/Linux ext4/LUKS


Update 2015-10-20: Sorry, I wrote "Gigabytes" although I meant "Terabytes". Embarrassing.

I'm currently in the process of changing my home computer from an old Mac Mini 2009, still running the (heavily out-of-date and unsupported) OS X 10.5 (Leopard), to an Intel NUC machine running Debian GNU/Linux (Jessie).

Since my strategy was and will be "one small, energy-efficient, 24/7 computer for OS and applications with one large external hard-disk drive for data", I have to move Terabytes of data from OS X (using HFS+) to GNU/Linux (using ext4 with LUKS). GNU/Linux does not support Apple's proprietary HFS+ to a level I can trust, and OS X does not support ext4+LUKS. Therefore, I need to copy the data over the network.
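
For reference, the destination side is an external disk prepared as ext4 on top of LUKS. A minimal sketch of such a setup, assuming the disk shows up as /dev/sdb1 (a placeholder; double-check the device name, since the first command wipes it):

# WARNING: luksFormat destroys all data on the given partition
cryptsetup luksFormat /dev/sdb1        # create the LUKS container
cryptsetup luksOpen /dev/sdb1 data     # unlock it as /dev/mapper/data
mkfs.ext4 /dev/mapper/data             # create the ext4 file system
mkdir -p /home/vk/data                 # mount point (matches the data directory used below)
mount /dev/mapper/data /home/vk/data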

The thing with Terabytes of data is that it's still pretty time-consuming to copy them over the network. And there are different methods to do it, with different layers of abstraction and security (not my concern, I own the LAN) and thus: speed.

Migrating File Labels to Filetags

OS X uses HFS+ to store metadata in resource forks, and I used metadata like colored file labels on OS X. Here is a page that describes an rsync fork which deals with transferring those resource forks. In my situation, this was not appropriate: I had to convert the OS X file labels to my now preferred method of filetags, which works with any operating system and on any file system.

For this reason, I had to find all folders that contain files with such labels, using a (slow and inefficient) shell script. A second shell script converts the labels of files within a folder to filetags. Both scripts (and a recursive version) are hosted on GitHub; a rough sketch of the underlying idea follows after the list below.

My process was:

  1. invoke vkfindOSXfilelabelsinefficiently.sh and log to a text file
    • this took several days(!) due to a very inefficient method, which was fine for me because it ran in the background and I did not want to invest time in optimization
  2. visit every listed folder containing important file labels and invoke vkconvertOSXfilelabelstofiletags.sh or vkconvertOSXfilelabelstofiletagsrecursive.sh manually
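
The actual scripts are the ones hosted on GitHub; the following is only a minimal sketch of the underlying idea, assuming the Spotlight attribute kMDItemFSLabel carries the numeric Finder label and that a generic tag like "label2" is good enough as a first step (mapping the numbers to proper tag names is up to you):

# sketch only, not the real vk* scripts: turn OS X color labels into filetags
cd /path/to/folder   # placeholder
for f in *; do
    [ -f "$f" ] || continue
    # read the numeric Finder label (0 = no label) via Spotlight metadata
    label=$(mdls -name kMDItemFSLabel "$f" | awk '{print $3}')
    if [ -n "$label" ] && [ "$label" != "0" ] && [ "$label" != "(null)" ]; then
        case "$f" in
            *.*) mv "$f" "${f%.*} -- label${label}.${f##*.}" ;;   # "name -- tag.ext"
            *)   mv "$f" "${f} -- label${label}" ;;
        esac
    fi
done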

Researching Copy Benchmarks

In this GitHub Gist, Kartik Talwar describes an rsync approach which got him 40 MB/s.
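
I did not reproduce his exact command here; a typical "fast rsync" invocation of that kind looks roughly like the following sketch (not necessarily what the gist uses; the arcfour cipher has been removed from recent OpenSSH releases, so adjust the cipher to whatever both sides offer):

# sketch: rsync without compression and with a cheap SSH cipher,
# preserving hard links; adjust host, paths and flags to your setup
rsync -aH --numeric-ids --progress \
      -e "ssh -c arcfour -o Compression=no" \
      blanche:/Users/vk/testdatafolder ./rsync-fast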

My most important source of inspiration was this page, where Maxym Kharchenko compares five different approaches for transferring data over the network:

Method                         Time       Capacity Used   CPU    Rate
scp                            4min 50s   ~55 MB/s        ~5%    ~55 MB/s
bbcp                           2min 27s   ~108 MB/s       ~2%    ~108 MB/s
ncp/gzip                       2min 1s    ~10 MB/s        ~15%   ~132 MB/s
ncp/pigz                       30s        ~20 MB/s        ~50%   ~533 MB/s
ncp/pigz (parallel degree lt)  1min 15s   ~15 MB/s        ~20%   ~214 MB/s

Benchmarking Copy Methods

With Maxym's benchmark results from above in mind, I tried to reproduce a similar series of benchmarks with my own setup.

Quick note before beginning to benchmark network speed: please do make sure that you do not spend time trying to get your damn Linux box to enable Gigabit Ethernet without checking your network cable first. My old one was not specified for Cat 6 (or similar) and thus supported only 100 MBit/s connections. Switching to a decent cable enabled Gigabit Ethernet right away.
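
On the GNU/Linux side, the negotiated link speed can be checked quickly (assuming the interface is called eth0; adjust the name to your setup):

# show the negotiated Ethernet link speed
ethtool eth0 | grep -i speed
# "Speed: 1000Mb/s" means Gigabit; "Speed: 100Mb/s" means: check the cable first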

Unfortunately, I had no access to a crossover cable with which I could have connected both hosts directly. However, my new Gigabit switch should not be the bottleneck for the following operations anyhow.

After analyzing my data, I created a two-gigabyte test dataset containing a similar mix of video files, MP3 files, JPEG pictures, and PDF documents; those are my dominant file types.
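
A quick way to get such an overview of the dominant file types is a one-liner like the following (a sketch, run against the source data; the path is an example):

# count files per (lower-cased) extension, most frequent first
find /Users/vk -type f | awk -F. 'NF>1 {print tolower($NF)}' | sort | uniq -c | sort -rn | head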

This dataset was then copied from the OS X host blanche to the GNU/Linux host sherri using different methods:

  1. rsync-normal: time nohup rsync -avz blanche:/Users/vk/testdatafolder ./rsync-normal
  2. rsync-inplace: time nohup rsync -avz --inplace --partial --progress blanche:/Users/vk/testdatafolder ./
  3. scp-normal: time nohup scp -r blanche:/Users/vk/testdatafolder ./scp-normal
  4. scp-c: time nohup scp -r -C -c arcfour256 blanche:/Users/vk/testdatafolder ./scp-c
  5. tar, pigz, nc:
    1. blanche: tar -cf - ./testdatafolder | pigz | nc -l 8888
    2. sherri: time nc blanche 8888 | pigz -d | tar xf - -C .
  6. tar, nc:
    1. blanche: tar --recursion -cf - ./testdatafolder | nc -l 8888
    2. sherri: time nc blanche 8888 | tar xf - -C .
Method         Time [min]   Rate [MB/s]
rsync-normal   3.2          9.0
rsync-inplace  3.2          9.0
scp-normal     1.1          26.1
scp-c          3.4          8.4
tar/pigz/nc    2.0          14.4
tar/nc         0.5          57.6

CPU usage was almost negligible on my new Intel NUC (GNU/Linux). With the fastest method (tar/nc), approximately one core was occupied on my Mac Mini. I guess this is the main bottleneck here.

So my results differ a bit from Maxym's benchmark results above. I did not want to invest time in finding the reasons or in further optimization. Almost 60 MB/s seemed OK to me since I was under no pressure to migrate as fast as possible; I just wanted to get the transfer time roughly within 24 hours.

Testing the Copy Method

In order to avoid bad surprises after the data is copied, I checked some basic things first: symbolic links, hard links, special characters in file names, empty folders (git, I am looking at you), and some other minor details.
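
Checks of this kind can be done with a few find one-liners, run on the source and on the destination and compared (a sketch; the path is the test dataset from above):

# number of symbolic links
find ./testdatafolder -type l | wc -l
# number of files belonging to a hard-link group (link count > 1)
find ./testdatafolder -type f -links +1 | wc -l
# number of empty directories
find ./testdatafolder -type d -empty | wc -l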

Symbolic links were transferred fine.

Hard links were not preserved: the copy ended up with multiple independent copies of the same file. However, I use hard links more or less only to save space within my archived browser data, and this "find duplicate files and replace them with hard links" process can be run on the destination disk after the copying.
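
One possible tool for that later de-duplication pass is rdfind, which is packaged for Debian (just one option, not necessarily what I will end up using; the path is a placeholder for the archived browser data):

# dry run first: only report what would be linked
rdfind -dryrun true /home/vk/data/browser-archive
# then actually replace duplicate files with hard links
rdfind -makehardlinks true /home/vk/data/browser-archive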

Special characters and empty folders were transferred fine.

I had no reason to distrust the method so far.

Actually Copying Data

I bravely pressed the return key and started the processes on both machines. The network indicator maxed out at an average throughput of 42-45 MB/s. nc was using approximately 15 percent CPU on my GNU/Linux machine. On OS X, nc was using 65-75 percent CPU, tar 11 percent, and kernel_task 25 percent.

After roughly 23 hours, the nc process on OS X finished, but the process on GNU/Linux continued to run. There was no longer any network usage comparable to the >40 MB/s from before, and the GNU/Linux box was using the disk heavily due to recollindex, the workhorse of the desktop search engine Recoll which I am currently testing.

Approximately half an hour later, the GNU/Linux task also ended successfully:

vk@sherri ~/data % time nc blanche 8888 | tar xf - -C . ; date                                                                                                                  :(
nc blanche 8888  478.22s user 12688.88s system 15% cpu 22:56:19.01 total
tar xf - -C .  617.53s user 8765.91s system 11% cpu 23:15:03.23 total
Sat Oct  3 14:32:13 CEST 2015
vk@sherri ~/data %	  

Verifying the Data

After having copied Terabytes of data from one host to another, I suggest verifying the result. I used rsync for this.

After two attempts that ended with a broken pipe, I deactivated the sleep mode on OS X and added nohup to the command:

time nohup rsync -avz --dry-run --itemize-changes blanche:/Volumes/moe/data /home/vk | tee verify-data.log

This went fine, and an analysis of the log file showed that there were no files rsync would have copied: egrep '^[^.]' verify-data.log (show me all lines that do not start with a dot).
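
As a reminder of why this grep is sufficient: with --itemize-changes, rsync prints one line per item whose first character encodes the update type, and a leading dot means the item's content would not be updated:

# first character of each --itemize-changes line (selection):
#   >  a file would be transferred to the local host
#   c  a local change/creation would happen (e.g. a directory or symlink)
#   .  the item is not being updated (attributes might still differ)
egrep '^[^.]' verify-data.log    # everything rsync would actually have touched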

So the data was successfully copied from my old host to the new one, and I'm good to go for the further migration steps.

