3

I have two HDDs where I save my backups. Sometimes I backed something up to one drive and forgot to back it up to the other. As a result, each drive holds some data that the other lacks, and there are also many files that are on both. Now I want to bring the two into sync so that both have all the data and are exact copies of each other.

Also, how can I make sure that a backup HDD doesn't contain duplicate files, which waste space and time (when reviewing the backup)?

I've used rsync before, but not extensively and not for this part of my question. I like the tool and feel it can do the job. Can someone explain how to do this with rsync, or mention another tool if it is better?

Ravi
  • 3,823
  • rsync is the tool you want; don't bother with anything else. You must select one version of the data as the master copy. Once you have determined that, you use the rsync option to delete all files that do not match the master copy, then you rsync, for example, from the source disk to the two backup disks, and your data will be synced. rsync is worth learning; it's one of the best tools in the unix ecosystem in my opinion. – Lizardx Mar 08 '17 at 19:03

3 Answers

3

Unidirectional tools like rsync work great when you want to make B look like A, but are less useful when you want A and B to end up identical to each other without matching either original. When I need to sync directory trees, I like Unison. It has a nice graphical interface that lets you see the differences between the trees and makes suggestions based on time stamps as to which is newer (which isn't always the one you want to keep). It also has options to back up both copies of any file that differs so that nothing gets lost.

When syncing with rsync, you can tell it to keep the newest version of files and then sync SRC to DEST and DEST to SRC. The problem is rsync has no way of detecting conflicts where the file has changed in both SRC and DEST and you will simply get the newest version. Unison keeps track of what has changed. If the file has only changed in one place, you get the newest version, but if the file has changed in both places you get a warning about the conflict and then get a chance to manually deal with it.
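The two-pass rsync approach described above can be sketched roughly as follows. This is a minimal sketch, not a tested backup script: the function name sync_both_ways and the example paths are made up for illustration, and it relies only on rsync's -a (archive) and -u (--update, skip files that are newer on the receiving side) options. It deliberately omits --delete, since deleting would discard files that exist on only one drive, and it inherits the limitation mentioned above: a file changed on both sides is silently resolved to the newest copy.

```shell
# Hypothetical helper: merge two trees so each ends up with the union of files.
# Usage: sync_both_ways DIR_A DIR_B [extra rsync options, e.g. --dry-run]
sync_both_ways() (
    a=$1
    b=$2
    shift 2
    # Pass 1: copy A to B, skipping files that are newer in B (-u).
    # Pass 2: copy B to A the same way. No --delete, so nothing is removed.
    rsync -au "$@" "$a"/ "$b"/ &&
    rsync -au "$@" "$b"/ "$a"/
)

# Example with placeholder paths; preview first, then run for real:
# sync_both_ways /media/backup1/data /media/backup2/data --dry-run -v
# sync_both_ways /media/backup1/data /media/backup2/data
```

Any options after the two directories are passed straight to both rsync calls, which is why a preview with --dry-run -v is cheap to do before the real run.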

In terms of "duplicate" files, fslint is a nice utility for identifying files that are identical apart from the name and permissions. The graphical interface makes it easy to decide which duplicates you really want and which ones you do not.
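If you prefer the command line, the same idea (files with identical content but different names) can be approximated with checksums. This is only a sketch and not how fslint works internally; the function name find_duplicates is hypothetical, and it assumes GNU coreutils (sha256sum, and uniq with the -w and --all-repeated options). It only lists duplicates and deletes nothing.

```shell
# Hypothetical helper: print groups of files under DIR whose contents are identical.
# Each output line is "<sha256>  <path>"; groups are separated by a blank line.
# Assumes GNU coreutils; file names containing newlines are not handled.
find_duplicates() (
    dir=$1
    find "$dir" -type f -exec sha256sum {} + |
        sort |
        uniq -w64 --all-repeated=separate   # compare only the 64-char hash
)

# Example with a placeholder path:
# find_duplicates /media/backups/data
```

Hashing every file reads the whole disk once, so on a large backup this can take a while.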

StrongBad
  • 5,261
0

Kdiff3 is a good visual directory comparison program too, that can compare 2 or 3 different directory trees. It should be available separately from all the KDE packages too (kdiff3-qt in Debian). It appears to be updated every year or so too, so still "relatively active."


Besides FSlint, there are several other "find duplicate files" programs; just search for "linux find duplicate files".

Xen2050
  • 2,354
-1

This is an example that syncs your /home/user/data directory to a backup directory named data on a backup disk mounted at /media/backups. Note that the source ends with / while the destination does NOT: the trailing / on the source tells rsync to copy the contents of data rather than the directory itself. This deletes all files in the backup destination that do not exist in the master source data.

rsync -av --delete --delete-excluded /home/user/data/ /media/backups/data

Add the option below first to make sure the command is doing what you want. Always use --dry-run when debugging your backup arguments for the first time, to check that rsync does what you wanted/expected! Otherwise you could, for example, delete all the data in your source if you get the order of source and destination wrong. -v makes it verbose, which shows what is going where.

--dry-run
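As a rough illustration of the "dry run first" habit, here is a small wrapper that previews the mirror and only applies it after you confirm. The function name mirror_with_preview is made up for this sketch; the rsync options used are -a, -v, -n (--dry-run), -i (--itemize-changes) and --delete. It leaves out --delete-excluded because no excludes are given here.

```shell
# Hypothetical wrapper: preview a one-way mirror, then ask before applying it.
# Usage: mirror_with_preview SOURCE_DIR DEST_DIR
mirror_with_preview() (
    src=$1
    dest=$2
    # -n changes nothing; -i itemizes what would be copied or deleted.
    rsync -avni --delete "$src"/ "$dest" || exit 1
    printf 'Apply these changes? [y/N] '
    read -r answer
    case $answer in
        y|Y) rsync -av --delete "$src"/ "$dest" ;;
        *)   echo "Nothing changed." ;;
    esac
)

# Example with the paths from the command above:
# mirror_with_preview /home/user/data /media/backups/data
```

Lines beginning with *deleting in the preview are the ones to read carefully, since those files will be removed from the destination.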

rsync is very complicated to understand but very powerful. Once you have the backup scripted so you don't forget the details, it's a one-time effort; after that you just run the script.

I use rbxi to automate my rsync backups, but that's overkill for most applications.

rsync is one of the best unix-type tools I have ever seen. The author is a genius (also the creator of samba, if I remember correctly), and I'd put the chance of there being something technically better at fairly close to zero.

Note that if you have both backup drives mounted, you can simply rsync the main data to the first one, then rsync the first one to the second one, and you have perfectly matched data. Trying to unravel stuff with GUI tools... well, I wouldn't rely on anything like that for my backups, is all I can tell you. If they are good, they are probably using rsync as their engine in the first place; if they are not good, I wouldn't trust them.
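If you want to confirm that the two drives really do match after such a chain, one simple check is a recursive diff. This is a sketch with a made-up function name (backups_match) and placeholder paths; it compares file names and contents only, not ownership or timestamps, and it reads every file, so it is slow on large backups.

```shell
# Hypothetical check: report whether two directory trees have the same
# file names and the same file contents (metadata is not compared).
backups_match() (
    if diff -rq "$1" "$2"; then
        echo "Backups match."
    else
        echo "Backups differ (see the list above)."
        exit 1
    fi
)

# Example with placeholder mount points:
# backups_match /media/backup1/data /media/backup2/data
```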

Once you have your main data backed up with rsync, it usually takes only a few minutes to bring the backup up to date, since only changed chunks are transferred. As an example, my main backup covers about 1 million files, 400 gigabytes, give or take, and it takes only about 20 minutes or so to run through all that with rsync. Time spent learning this tool is time VERY well spent. As I indicated, I can think of few unix-type tools that are better designed and implemented than rsync, and it's learning time you'll never regret spending.

Lizardx
  • 3,058
  • 17
  • 18
  • Lizardx, I will go through the process you mentioned and later update. – Ravi Mar 08 '17 at 19:23
  • Rsync has no mechanism for detecting, never mind handling, conflicts where the file has changed on both drives. Also, if rsync crashes there is a chance you can lose data on the next sync. – StrongBad Mar 09 '17 at 03:37
  • There's no conflict in the user question. He wants the data on his two backup drives to be the same. Being backups that means that the data on the two should be the same as the source. I've done exactly this job for clients, have you? – Lizardx Mar 09 '17 at 05:17