1

I have a very large file (200GB). Apparently when I transferred it over, it did not copy correctly: the SHA-1 hashes of the two copies are different. Is there a way I can divide the file up into blocks (like 1MB or 64MB) and output a hash for each block? Then compare/fix?

I might just write a quick app to do it.

3 Answers

2

That "quick app" already exists, and is relatively common: rsync. Of course, rsync will do a whole lot more than that, but what you want is fairly simple:

rsync -cvP --inplace user@source:src-path-to-file dest-path-to-file   # from the destination
rsync -cvP --inplace src-path-to-file user@dest:dest-path-to-file     # from the source

That will by default use ssh (or maybe rsh, on a really old version) to make the connection and transfer the data. Other methods are possible, too.

Options I passed are:

  • -c — skip based on checksums, not file size/mtime. By default rsync optimizes and skips transfers where the size & mtime match. -c forces it to compute the checksum (which is an expensive operation, in terms of I/O). Note this is a block-based checksum (unless you tell it to do whole files only), and it'll only transfer the corrupted blocks. The block size is automatically chosen, but can be overridden with -B (I doubt there is any reason to); see the sketch after this list.
  • -v — verbose, will give some details (which file it's working on)
  • -P — turns on both partial files (so if it gets halfway through, it won't throw out the work) and a progress bar.
  • --inplace — Update the existing file, not a temporary file (which would then replace the original file). Saves you from having a 200GB temporary file. Also implies partial files, so that -P is partially redundant.
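
If you do want to override the automatically chosen block size mentioned above, -B takes a size in bytes. A minimal sketch, run from the destination (the 131072-byte value and the paths are placeholders, not a recommendation):

rsync -cvP --inplace -B 131072 user@source:src-path-to-file dest-path-to-file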

BTW: I'm not sure how you did the original transfer, but if it was sftp/scp, then something is very wrong—those fully protect from any corruption on the network. You really ought to track down the cause. Defective RAM is a relatively common one.

derobert
  • So I'll have to get rsync for windows and linux. After looking at -c in the manual it says it will use checksum instead of time/length. However that's not what I want. I want to repair the file and use a checksum to check which parts are incorrect. The option seems like it will check and send the whole thing. I'm not sure how it got corrupted. I shut the drive down improperly a few times. But I imagined that the ext4 filesystem would write the data first then update the length, so I'm pretty confused. I transferred by using wget on linux and nginx as the server on windows. –  May 19 '17 at 07:52
  • @acidzombie24 The rsync checksum works on a block basis. It'll only re-transfer the blocks that are corrupted. BTW: I'm going to add --inplace to those options just to be sure it does an in-place update. – derobert May 19 '17 at 07:55
  • Ah. Ok. So I should set it up on linux+windows then... rsync -cvP --inplace file root@linux:/path/file. I'll try that. I'll assume rsync uses a good block size but something like 16MB blocks would be great –  May 19 '17 at 07:58
  • @acidzombie24 Yes, as long as the good copy is "file", not the one at root@linux. Good copy (source) comes first. – derobert May 19 '17 at 07:59
  • @acidzombie24 BTW: ext4 is somewhat weird in how it handles recovering from unclean shutdowns, it can leave a block full of 0s. But that'll only happen if you ripped the cord out shortly after writing the file—by default, it should finish flushing it to disk within a minute. Once it's flushed to disk, it should be pretty safe. One thing you can try is running md5sum (etc.) on the same file a few times—if you ever get a different result, you have a broken computer. – derobert May 19 '17 at 08:03
  • Why on earth would it do 0's? Is it only when it's overwriting part of the file? Many times the network cut out or I pulled the power during transfer and it was going several megabytes per second. I'm sure that's what happened, if it does pad files with 0s –  May 19 '17 at 10:12
  • One last question. How do I get the rsync at both ends to talk to each other? The CPU usage is VERY low using the commands above. I grabbed deltacopy for windows because it had a rsync binary for windows. Their ssh didn't work so I used the bin from git for windows. I synced a testfile without problem. Now for my 200GB file. I expected my main computer to use little CPU and my linux box to use 100% or at least >50%. It uses about 2% (as it does when idle). dstat -ds had many lines saying 0k disk being read and written. IDK what it's doing but it doesn't seem to be doing anything. Any idea? –  May 19 '17 at 11:03
  • @acidzombie24 I suspect that your Windows box is computing the hashes first (which should mainly be an I/O load, unless you have fast disks such as SSD), then it'll send those hashes to the Linux box, which will compare them (and then it'll transfer any corrupt blocks). BTW: you can add more -vs to increase the verbosity. – derobert May 19 '17 at 16:14
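
To act on the hardware check suggested in the comments above (hashing the same file several times to see whether reads are stable), a minimal sketch, with very_large_file standing in for your file:

for i in 1 2 3; do md5sum very_large_file; done

If the three hashes don't all match, the data is changing between reads and the problem is on that machine, not in the transfer.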
1

If you want to re-transfer the file to another machine via a network connection, use rsync.

If you want to get an idea of where the differences are, the easiest way would be to have the two versions on the same machine. If you don't want to do that because bandwidth is too expensive, here are ways you can checksum chunks of files.

This method relies on head -c leaving the file position just past the bytes it read, and pre-computes the number of chunks from the file size to know when to end the loop.

n=$(($(wc -c <very_large_file) / (64*1024*1024) + 1))
i=0
while [ $i -lt $n ]; do
    head -c 64M | sha256sum
    i=$((i+1))
done <very_large_file

This method relies on head -c leaving the file position just past the bytes it read, and uses the byte count reported by cksum to detect the end of the file (a zero-byte chunk means there is nothing left to read).

while true; do
    output=$(head -c 64M | cksum)
    size=${output#* }; size=${size%% *}
    if [ $size -eq 0 ]; then break; fi
    echo "$output"
done <very_large_file

This method calls dd to skip to the desired start position for each chunk.

n=$(($(wc -c <very_large_file) / (64*1024*1024) + 1))
i=0
while [ $i -lt $n ]; do
    dd if=very_large_file ibs=64M skip=$i count=1 2>/dev/null | sha256sum
    i=$((i+1))
done
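
Whichever method you use, run the same commands on both machines and save the output; comparing the two lists then shows which chunks differ. A minimal sketch, assuming the lists were saved as hashes-src.txt and hashes-dst.txt (hypothetical names) and that your shell supports process substitution (bash, zsh, ksh); nl numbers the lines, so the diff output tells you which chunk numbers to re-copy:

diff <(nl hashes-src.txt) <(nl hashes-dst.txt)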
0

You should probably look at split:

Here is the man page with examples:

https://ss64.com/bash/split.html
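
If you do use split, GNU split's --filter option lets you checksum fixed-size chunks without ever writing them to disk. A minimal sketch, assuming GNU coreutils (the 64M chunk size is arbitrary):

split -b 64M --filter='sha256sum' very_large_file

Each chunk is piped into sha256sum, so you get one hash line per 64MiB chunk, in order; run the same command on both copies and compare the output.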

VaTo