5

I have a recovery from a drive, possibly from a different point in time, that not only potentially differs in which files are present, but has also suffered some corruption: a number of files are obviously corrupt.

We'll call the left folder "A" and the right folder "B".

It falls to me to merge these two images such that:

  1. any files that are present in B that are not in A should be moved to A, and
  2. any files that are present in both locations and are the same should be deleted from B, and finally
  3. any files that have differences in checksum should remain in B so that they can be manually compared between A and B; nothing else should be left in B except these files with differing checksums (i.e., actually differing contents).

NOTE: Dates are pretty much immaterial at this point, although it would be nice to preserve the older date in the metadata.

How can this be done cleanly? Unfortunately I have to do this for several tens of TB of data, so it will be a very lengthy process unless I can automate it. It appears that 90-95% of the contents ARE the same, so a "set it and forget it" method should work well to prepare for the manual comparison.

  • There is no way to give a worthwhile answer without seeing sample data, and from what it looks like, there are going to be many patterns in it. If this data is really important, then you're better off hiring someone to do it, as they'd be able to be more thorough and comprehensive and make sure that you are getting what you need. – Nasir Riley Jan 11 '21 at 20:44
  • @alecxs Even in the cases where it seems that it won't be an issue, it also states that some of the files are corrupt. That alone is going to throw it off because corrupt can also mean that they can't be read which would render any methods useless. – Nasir Riley Jan 11 '21 at 20:54
  • @alecxs Even then, there's no way to be certain in what should and shouldn't be recovered. If that's not important and he'll worry about that later, then there's enough. Otherwise, if he was able to have the data recovered, it would make more sense to go the other route like he did to recover it in the first place. Something like this is best done by someone who is there instead of being answered over the internet. – Nasir Riley Jan 11 '21 at 21:00
  • @alecxs I've considered rsync, but I'm afraid it might end up overwriting A files in this case without making sure that A and B are truly checksum diff'd. I also don't think rsync has the ability to rm B if it then syncs it to A... – ylluminate Jan 11 '21 at 21:12
  • @NasirRiley if these are disk images then corrupt files just means their content is not what it should be. If the read itself was going to fail, it would have done so when creating the image. As alecxs says, file content here is really irrelevant. – Philip Couling Jan 11 '21 at 21:53
  • I have recently seen a question where rsync even matches identical files moved to a different dir (based on inodes), so I am pretty sure someone will find a one-pass solution without scripting; rsync seems a powerful tool – alecxs Jan 11 '21 at 23:15
  • @alecxs that's a bold statement if you don't know how to do it yourself. This combination of requirements looks way outside of the intended use cases for rsync. – Philip Couling Jan 11 '21 at 23:51
  • @PhilipCouling right, I was wrong; I was remembering one of these comments: https://unix.stackexchange.com/q/628201 https://unix.stackexchange.com/q/330426 – alecxs Jan 12 '21 at 11:56
  • Do you really mean "different checksums", or do you really care about different content, even in the unlikely event of checksum collision? – Toby Speight Jan 12 '21 at 13:14

4 Answers

4

Steps 2 and 3 seem the most difficult, so let's start with those.

There is a tool called rdfind which finds duplicate files. You can decide what to do when a duplicate is detected: in your case you would want to delete it from B: rdfind -deleteduplicates true A B. If identical files exist in A and in B, the one from A is preserved. Other options are to replace the copy with a hard link or a softlink, or just to report the findings.
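If you want to see what it would do first, rdfind also has a dry-run mode; a cautious sequence might look like this (A and B standing in for your real paths):

rdfind -dryrun true -deleteduplicates true A B   # report only, removes nothing
rdfind -deleteduplicates true A B                # actually delete the copies in B

rdfind ranks the first directory given on the command line highest, which is why the copies in A are the ones preserved.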

Then the files that remain in B are either unique to B, or different in B than in A. To move the unique ones from B to A: mv -i B/* A/, answering no each time you are asked whether to overwrite. You can automate this with yes no | mv -i B/* A/. If you're using GNU mv, you can use mv --no-clobber B/* A/.
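One caveat: mv B/* A/ operates on top-level entries, so a directory that exists under both A and B will fail to move rather than merge its contents. If your trees are nested, a per-file pass avoids that; here is a rough, untested sketch (assuming GNU mv, with /path/to/A and /path/to/B as placeholders for your real paths):

cd /path/to/B
find . -type f -exec sh -c '
  # $0 carries the path to A; recreate each directory there, then
  # move without overwriting, so differing files stay behind in B
  for f; do
    mkdir -p "$0/${f%/*}"
    mv --no-clobber "$f" "$0/$f"
  done
' /path/to/A {} +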

Of course, you want to practice this first before doing it on your real data. You can easily make a tree of hard links to your files in A and B: mkdir training; cp -lr A training; cp -lr B training, and do your practice there.

  • This is interesting, thank you, however I have a concern: I tried this with a "small" subset in dryrun mode and received a concerning message (that clearly had not used the md5sum checking yet since it was quite fast): Removed 3865 files due to unique sizes from list.423279 files left. (here's the full output: GitHub Gist). How do I interpret this? – ylluminate Jan 12 '21 at 00:05
  • I assume that means 3865 files have a unique size and therefore cannot be duplicates so have been removed from an internal list of files that will be compared against each other. For example, if only one file is 32 bytes, it can't have a duplicate so can be ignored, but if two are 33 bytes it's possible they are duplicates and will be compared. – StuBez Jan 12 '21 at 12:19
  • There's a risk with rdfind if there are intentional duplicates within A; we'll end up removing something which should be kept. – Toby Speight Jan 12 '21 at 13:16
  • As I was looking at and working on this, I found a very interesting tool called rmlint and thought perhaps we should regroup in consideration of it instead of rdfind? Please see: https://unix.stackexchange.com/questions/628916/how-to-use-rmlint-to-merge-two-large-folders – ylluminate Jan 13 '21 at 06:28
2

Here's a method that's simple, but inefficient if a lot of files are missing in A. Just do each step in turn. I assume that there are only directories and regular files (comparing metadata for special files is doable with a bit more work). Warning: untested code.

First, copy files that are present in B but not in A to A. Preserve metadata (timestamps, permissions) as much as possible.

rsync -a --ignore-existing B/ A/

(The trailing slashes matter: without them rsync would create A/B rather than merging B's contents into A.)

Second, remove duplicates from B. Note that at this point, the files that were originally not present in A have just been copied there by step 1, so they are now identical in both trees and will be removed from B along with the original duplicates.

cd B
find . -type f -exec sh -c '
  # $0 holds the path to A; delete each file in B that is
  # byte-for-byte identical to its counterpart in A
  for x; do
    if cmp -s "$x" "$0/$x"; then rm "$x"; fi
  done
' /path/to/A {} +

Optionally, remove empty directories from B.

find B -depth -type d -exec rmdir {} + 2>/dev/null

This is inefficient because at step 2, every file that was already missing from A is now duplicated and will be compared then removed from B. If a lot of files are missing in A, it would be more efficient to make a single pass on B to both move files to A and remove duplicates. This is especially true if A and B are on the same filesystem, which allows moving files cheaply rather than by copying and then deleting the source.
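Such a single pass might look like the following, equally untested sketch (same caveats as above: regular files only, with /path/to/A as a placeholder):

cd B
find . -type f -exec sh -c '
  # $0 is the path to A: remove duplicates, move files that are
  # missing from A, and leave only the differing files behind in B
  for x; do
    if [ -e "$0/$x" ]; then
      cmp -s "$x" "$0/$x" && rm "$x"
    else
      mkdir -p "$0/${x%/*}"
      mv "$x" "$0/$x"
    fi
  done
' /path/to/A {} +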

  • this does not appear to be doing what is needed in his requirement #1 (move new files in B, to A)? Should be trivial to add? I'm never comfortable with using xargs with shell snippets so I will leave it to you :) –  Jan 12 '21 at 04:38
  • @sitaram How does rsync -a --ignore-existing B/ A/ not fulfill “move new files in B, to A”? – Gilles 'SO- stop being evil' Jan 12 '21 at 09:46
  • Can't you combine it with --remove-source-files? – alecxs Jan 12 '21 at 11:43
  • @Gilles'SO-stopbeingevil' -- apologies; I somehow missed that first step while concentrating on the more meaty stuff. Not sure if protocol is for me to delete the comment or not; leaving it there. –  Jan 13 '21 at 09:40
  • If you insert mv --no-clobber "$x" "$0/$x" before your if statement and add [ -f "$x" ] && before cmp, wouldn't that handle requirement #1 allowing you to skip the original call to rsync? – jrw32982 Jan 14 '21 at 17:10
1

I would start by challenging your requirements. You are trying to do everything in one step. You'd be much better off knowing what you want the restored system to look like before you start restoring files.

It's actually easier to get the diff first than you might think.

Step 1

Get the hash of every file on disk. You're going to have to do this anyway, so you might as well get it over and done with. The command below works well if there aren't too many hard links. Assume directories are named /media/A and /media/B.

cd /media/A
find . -type f -exec sha256sum {} + > ~/hashes.txt

This will create a hash of every regular file on the disk. If a file is hard linked it will appear under every name (and have been scanned once for each name).
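If hard links do turn out to be abundant, one possible refinement is to hash each inode only once by deduplicating on device and inode number first; a sketch assuming GNU find/awk/xargs and no newlines in file names:

find . -type f -printf '%D:%i %p\n' \
  | awk '!seen[$1]++ { sub(/^[^ ]+ /, ""); print }' \
  | xargs -d '\n' sha256sum > ~/hashes.txt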

Step 2

Identify changes with

cd /media/B
sha256sum -c ~/hashes.txt > ~/check.txt

check.txt will now contain three types of line:

  • good/file: OK
  • missing/file: FAILED open or read
  • changed/file: FAILED
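If you want a feel for the scale of the job, a quick tally of the three categories (the patterns match the line formats above):

grep -c ': OK$' ~/check.txt                   # unchanged
grep -c ': FAILED open or read$' ~/check.txt  # missing from B
grep -c ': FAILED$' ~/check.txt               # changed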

Step 3

As a shortcut, you can copy all files that are missing using:

rsync -a --ignore-existing /media/A/ /media/B/

Step 4

Then you only need to worry about changed files:

# 'FAILED$' matches changed files only; missing ones end in "FAILED open or read"
grep 'FAILED$' ~/check.txt | while IFS= read -r file ; do
    echo "${file%: FAILED}"
done > ~/changed.txt

This gives you changed.txt which will have one file name per line. Every one is a file on both systems that has changed.

It's now up to you to sort through changed.txt and identify which files to keep and which to overwrite in A with the version from B.
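As a starting point for that sorting, something like this shows the first few differing bytes of each pair (the paths in changed.txt are relative, since hashes.txt was built with find . from inside /media/A):

while IFS= read -r f; do
    echo "== $f"
    cmp -l "/media/A/$f" "/media/B/$f" | head -n 5
done < ~/changed.txt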

  • @alecxs not really, no. diff is a tool for performing line-by-line comparisons of files. That's really overkill and likely to perform very badly on large files, so I have no doubt you've seen rsync perform better. Steps 1/2 create a list of files in B with a note of whether or not they match in A. It makes no attempt to check why they don't match, just whether they don't match. – Philip Couling Jan 11 '21 at 23:42
1

Assuming you don't have any newlines in filenames, this should work:

cd B
# after the cd, paths are relative to B, so a sibling A is reached as ../A
find . -type f -print | while IFS= read -r f
do
    [[ -f "../A/$f" ]] || { echo mv "$f" "../A/$f" ; continue; }
    cmp -s "$f" "../A/$f" && echo rm "$f"
done

Run it, and if it looks good, remove the "echo" words to get the actual commands to run.
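If file names containing newlines might exist after all, a NUL-separated variant (GNU find plus bash) drops that assumption:

cd B
find . -type f -print0 | while IFS= read -r -d '' f
do
    [[ -f "../A/$f" ]] || { echo mv "$f" "../A/$f" ; continue; }
    cmp -s "$f" "../A/$f" && echo rm "$f"
done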

  • 1) On GNU find, while read loops can be NUL-separated; 2) instead of -type f use ! -type d; 3) mv isn't faster than cp between different file systems, so cp -a will work too, covering all files and symlinks (without traversing); 4) use the --parents flag (to preserve the tree). – alecxs Jan 12 '21 at 11:31