0

When I connect two external hard drives to my computer (running FreeBSD or another Unix system) and copy files from the first external hard drive to the second, are the files on the second hard drive the same as the files on the source (the first external hard drive)?

I know there is a hash (checksum). I read somewhere that copying from different volumes will result in different files (since they are two different volumes).

Only when I copy a file within the same volume can I guarantee it is the same file.

What is the recommendation for copying, and will my files stay the same?

johnf
  • 1
  • 2
    After the copy you have two different files, but the contents should be the same, apart from I/O errors. You can run a checksum on both to verify. – Gerard H. Pille May 21 '21 at 13:18
  • 4
    If I buy two copies of "War and Peace", are they the same book? Yes -- they have identical words, typeface, paper, bindings, etc. No -- I can burn one and keep the other. I can annotate one and never even open the covers of the other. "The same" is too broad a condition: ask a pair of twins for an answer. – Paul_Pedant May 21 '21 at 13:18
  • 1
    What do you mean by "the same"? If we are talking about the streams of bits they must be. Otherwise they are not the same. – Artem S. Tashkinov May 21 '21 at 13:20
  • What do you mean by "the same"? A perfect clone: there is no difference.

    – johnf May 21 '21 at 13:21
  • A perfect clone exists only in mathematics and logic. IRL even two electrons are not the same. There can be logical constructs IRL which are sort of the same, but they are not physical entities, e.g. the right angle. – Artem S. Tashkinov May 21 '21 at 13:23
  • @Gerard H. Pille I understand, but even though you say "verify", your answer seems to imply that checksum algorithms cannot guarantee it is the same file, and that at the hardware level there are clear differences. Can you say more? – johnf May 21 '21 at 13:24
  • @ArtemS.Tashkinov Ok, thanks for the help. In this case, does that mean that forensic methods could show that the file on my second hard drive is not the same file? Which method would that be? – johnf May 21 '21 at 13:38
  • Bitstreams must be the same - that's all that matters. You're not working with any physical properties, you're only interested in data. – Artem S. Tashkinov May 21 '21 at 13:48
  • @johnf, I can say more, but nothing that would be useful in regards to your question. – Gerard H. Pille May 21 '21 at 13:49
  • 2
    If they are not the same data content (even considering disk-to-tape and disk-to-optical copies, and download from repositories), then 50 years of nightly backups were all wasted, and every distro is unreliable. Of course there are observable differences: inode numbers, block numbers, even sparse blocks, but the logical contents are independent of the physical mechanisms. Can you identify your source which asserted "copying from different volumes will result in different files" ? – Paul_Pedant May 21 '21 at 13:58
  • @Paul_Pedant A developer told me once and I wanted to know more about it, but he stopped answering. The only thing he said was that what I see as a copy is not the same. That's all the unknown developer said. You can ask him a question yourself; maybe he will write you back more. His site is rixstep.com. @cas you read his answer here, and he said that it is technically possible with md5sum to have 2 different files despite having the same checksum. The principle should then also apply to other hash algorithms. – johnf May 21 '21 at 14:15
  • johnf, I think an equally interesting question would be, "why don't you think a copy would be the same as the original?" – Chris Davies May 21 '21 at 14:17
  • @johnf Yes, it is theoretically possible for even a 256-bit checksum to be the same for two or more files. If it were not so, then every possible file of any size could be uniquely compressed into 256 bits. But there are 10 to the 77th such values, so the chances of a clash are vanishingly small. Also, the algorithm is designed to make it very hard to construct a file with a given checksum in order to fake the result. Your "developer" probably stopped answering because he had dug himself into a deep enough hole by then. – Paul_Pedant May 21 '21 at 14:59
  • 1
    Because you have the "forensics" tag, I'll point out that you're copying the data, not the metadata. cp has some options to help, but the best forensic move is to start with a bit-for-bit copy of the whole volume, with dd or similar. – waltinator May 21 '21 at 16:36
  • 1
    @Paul_Pedant, in Flemish we have a saying: "one fool can ask more than ten wise men can answer". The "developer" may have gotten tired also. – Gerard H. Pille May 22 '21 at 08:43
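The bit-for-bit approach waltinator mentions above might look like this. A minimal sketch: the device names are placeholders, not from the question, so double-check them before running, since dd overwrites its target.

```shell
# Bit-for-bit clone of a whole volume (data AND filesystem metadata).
# /dev/da0 and /dev/da1 are placeholders: verify your real device names
# first (e.g. with lsblk, or geom disk list on FreeBSD). Writing to the
# wrong device destroys data. (Use bs=1m with FreeBSD's dd.)
dd if=/dev/da0 of=/dev/da1 bs=1M conv=noerror,sync

# Verify the clone byte-for-byte; cmp exits 0 only if identical.
cmp /dev/da0 /dev/da1
```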

2 Answers

2

When a copy of a file is made, the copy is exactly the same as the original (assuming that there were no errors during the copy). This is true whether the file is copied to another location on the same device or to a different device.

The copy might have a different filename, or maybe a different timestamp or different permissions, but the contents are identical.

This can be verified by running some sort of checksum or hashing algorithm (e.g. md5sum) on the original and the copy.

For example:

$ cp original /tmp/thecopy
$ md5sum original /tmp/thecopy 
93d9d61139ff5f1287764f1c1994cbe3  original
93d9d61139ff5f1287764f1c1994cbe3  /tmp/thecopy

Both files have the exact same md5sum. original was stored on an NVMe drive; /tmp/ is a ramdisk.

It is technically possible for two different files to have the same md5sum. The probability of that happening is extremely low. md5sum is mostly "good enough" for many simple purposes, but most people use & recommend stronger hashing methods these days to reduce the probability even further. Here's what sha512sum has to say about the same-ness of the files.

$ sha512sum original /tmp/thecopy 
5ba61d6f2a883c3afebc949b0f0d0a1c020498a1052771de98e6e1bbb42d438a0a53f49f381a2e1311c1bdf82a0cea9de646fc03c529fcb6fca0ab6476badf35  original
5ba61d6f2a883c3afebc949b0f0d0a1c020498a1052771de98e6e1bbb42d438a0a53f49f381a2e1311c1bdf82a0cea9de646fc03c529fcb6fca0ab6476badf35  /tmp/thecopy

again, identical.

cas
  • 78,579
  • 2
    You might want cp -p to maintain permissions, ownerships, and timestamps. – Chris Davies May 21 '21 at 13:39
  • @roaima I'm already giving the copy a different filename; differences in other metadata don't affect the fact that the original and the copy are identical. – cas May 21 '21 at 13:42
  • btw, original was just a copy of /bin/true to begin with :-) – cas May 21 '21 at 13:43
  • So it is a perfect clone after all? Why do other users comment here that it is not the same, that it is not a perfect clone and a perfect clone only exists in logic and mathematics? Can you clarify, or add a comment giving your opinion on the other users' comments? – johnf May 21 '21 at 13:48
  • @roaima cp -p does not keep all timestamps; you can check with stat, for example, and you will see that ctime and sometimes btime are not kept from the source. – johnf May 21 '21 at 13:51
  • 1
    @johnf yes, it is a perfect clone. You misunderstood Gerard H. Pille's comment. The others seem to be waxing philosophical on the nature of identity. – cas May 21 '21 at 13:54
  • @cas You said yourself that it is technically possible with md5sum for 2 different files to have the same checksum. Your recommendation is that the user should use another algorithm, but in principle the result is the same, because it is possible to crack the algorithm; it does not have to be with current hardware.

    You know a lot, do you know exactly what Gerard H. Pille means? Unfortunately, he does not want to say more.

    – johnf May 21 '21 at 14:07
  • 1. In theory, any hashing algorithm can have collisions. In practice, the chance is non-zero but vanishingly small, especially with modern algorithms. 2. You seem to have mis-read what GHP wrote, as if he was saying the opposite of what he actually wrote. Try reading it again. – cas May 21 '21 at 14:21
  • BTW, I don't "know a lot" about cryptography or related topics. I'm moderately well-read on the topic and functionally competent in using related tools. I'm not a mathematician; I could probably force myself to read and maybe even understand the math behind it, but it would not be easy or enjoyable for me. I trust the broad consensus of scientific research. – cas May 21 '21 at 14:29
  • @cas I know what he said. But his answer did not seem convincing to me; it does not guarantee it. There are slight fluctuations. I also read your answer, and you say several times that it is possible for algorithms to have collisions. It is vanishingly small, yes, but that does not automatically mean that it is not possible to crack it. A perfect clone is something else. That's all I wanted to know. What if in the future, let's say 3000 years later, there is hardware or technology that can crack everything, and algorithms are not really the solution to this problem? It is okay. – johnf May 21 '21 at 14:30
  • It is okay. Thanks anyway for the great clarification. – johnf May 21 '21 at 14:31
  • 1
    The algorithms are NOT the original file or the copy. They are tools for verifying that the files are identical. The tools are 99.99999999+ % accurate, not 100%. There is always a tiny chance that they get it wrong. You can use cmp to do a comparison of every single byte in the files - if it detects any difference, it will say so. There's still a tiny chance of a HW fault, or a cosmic ray hitting the wrong bit of non-ECC memory at the wrong time causing a false result. Nothing's 100% reliable or perfect. For all practical purposes, cmp or one of the SHA* algorithms is good enough. – cas May 21 '21 at 14:37
  • @cas In summary, there is still a lack of technologies that guarantee this; that is, technologies that are all-encompassing and cannot be influenced, even by HW faults, cosmic rays, and so on. Do you know if there are projects researching this? – johnf May 21 '21 at 14:50
  • No, that is not an accurate summary. I can see why your developer friend gave up. I'm following in his footsteps. – cas May 21 '21 at 15:38
  • @cas, with "waxing philosophical" you hit the nail on the head. – Gerard H. Pille May 22 '21 at 08:45
  • @john you can use par2: https://en.wikipedia.org/wiki/Parchive – Artem S. Tashkinov May 22 '21 at 13:21
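The cmp check cas mentions in the comments is a one-liner; a minimal sketch reusing the answer's filenames:

```shell
# cmp compares every byte of both files; -s suppresses output, so the
# exit status alone tells you whether they match.
if cmp -s original /tmp/thecopy; then
    echo "files are byte-for-byte identical"
else
    echo "files differ (or an I/O error occurred)"
fi
```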
0

What is the recommendation for copying, and will my files stay the same?

cp -a will copy the data, but it will not copy the creation time, and you may lose the last access time. If your source system uses SELinux contexts, cp on a destination system that doesn't know about SELinux will not copy them.
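The metadata that survives a cp -a can be inspected directly with stat. A sketch with illustrative filenames, using GNU stat syntax (FreeBSD's stat takes -f with a different format string):

```shell
# cp -a preserves mode, ownership, and mtime/atime, but the copy gets a
# new inode, so its ctime (and btime, where the filesystem tracks it)
# is the time of the copy, not of the original.
cp -a original thecopy
stat -c '%n  mtime=%y  ctime=%z' original thecopy
```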

As for the crtime, please refer to Copying or restoring crtime for files/directories on ext4fs filesystem

Lastly, running md5sum right after copying, especially for small files, is not advisable: the files may not yet have been physically written to the storage, and the kernel will instead read from dirty buffers held in RAM.

In order to be sure you've copied everything precisely you need to:

  • Copy all the files
  • Create their checksums (if your CPU is modern enough sha256sum is more reliable and faster than md5sum)
  • Drop caches: echo 3 | sudo tee /proc/sys/vm/drop_caches
  • Create checksums for newly copied files and compare them to the old ones.

Alternatively, for ext2/ext3/ext4 filesystems you could use e2image, which copies everything (all the timestamps, including crtime and last access times) except free space. I normally use it for copying partitions.


For NTFS I've used ntfsclone on multiple occasions - it's fast and reliable.


I presume there are similar utilities for other filesystems but I only use ext4 and NTFS on my devices.