
I came across a problem and have no idea how to find the OPTIMAL solution. I have a list of files that looks like this:

file1\0file2\0...fileX\0\0file(x+1)\0

Each file name is followed by a \0, and each group of files is separated by an extra \0. Every file within a group has the same hash code (I used md5sum to compute them). I need to find which files from each group are actually identical and print them.

For example, I have a group of 6 files (let's call them f1,f2,f3,f4,f5,f6). I used diff and found out that f1,f2,f3 are the same and f4,f5 are also the same (but different from f1,f2,f3). So I want to print files f1,f2,f3 and f4,f5, but not f6 (because I didn't find a duplicate of f6).

I use | while read -r -d $'\0' file to read the data. Can you guys help me find an optimal way to do it?
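
For reference, a minimal sketch of that read loop (assuming the \0-delimited list comes in on stdin; the empty record produced by the extra \0 marks a group boundary, and the group array name is just for illustration):

#!/bin/bash
# Read NUL-delimited file names from stdin; an empty record (the extra \0)
# ends the current group.
group=()
while IFS= read -r -d '' file; do
    if [[ -z "$file" ]]; then
        printf 'group: %s\n' "${group[*]}"   # end of a group: print it
        group=()
    else
        group+=("$file")                     # another file in the current group
    fi
done
# Flush the last group if the input does not end with an extra \0.
if ((${#group[@]} > 0)); then
    printf 'group: %s\n' "${group[*]}"
fi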

Edit: To simplify my problem: I have an array with n fields. I'm looking for an algorithm that is easy to implement in bash and not too slow, which finds the values that are equal and appends a number to each of them, so the result can be sorted later. Referring to my example, after this 'sorting' I want to print 'f1 1', 'f2 1', 'f3 1', 'f4 2', 'f5 2', 'f6 3'; then, using awk, I will turn it into a table.
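
For illustration, one (probably naive) way to get that numbering — a sketch only, assuming the files of a single hash group are passed as script arguments, and using cmp -s for the byte-wise comparison instead of diff (the reps array is just a placeholder name):

#!/bin/bash
# For each file, compare it (byte-wise, via cmp -s) against one representative
# of every content class seen so far; print "filename classnumber".
reps=()                          # one representative file per distinct content
for file in "$@"; do
    class=0
    for i in "${!reps[@]}"; do
        if cmp -s -- "${reps[$i]}" "$file"; then
            class=$((i + 1))     # matches an existing class
            break
        fi
    done
    if ((class == 0)); then
        reps+=("$file")          # new content: open a new class
        class=${#reps[@]}
    fi
    printf '%s %d\n' "$file" "$class"
done

Run on the six example files, this prints 'f1 1' through 'f6 3', ready to be piped through sort and awk.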

ogarogar
  • Every file within a group has the same hash code (I used md5sum to compute them)

    So basically, you have multiple MD5 collisions across the input files!? Is this expected, i.e. is this some kind of MD5 collision experiment data?

    – zeppelin Nov 07 '16 at 22:25
  • Also, have you tried to apply sha256 (or even sha512) to your files? (Most likely you will get unique hash codes to use for duplicate detection, even if there are some MD5 collisions among them.) – zeppelin Nov 07 '16 at 22:27
  • No no, every file has the same hash code. I have to check (using diff) whether those files really are the same, because I cannot assume they are without checking with diff, and then display the results. I have to use MD5 (it's a school assignment). – ogarogar Nov 07 '16 at 22:35
  • If you're worried about algorithm speed, you are 100% using the wrong language. Bash is not what you use when you care about speed, nor is it even truly a language. Nor can you implement a specific algorithm in Bash without absurd overhead. Read more here. And show this to your teacher. – Wildcard Nov 08 '16 at 00:17

1 Answer


Given that you must use MD5, and that your input files are said to have hash collisions (equal MD5 sums for binary-different files), one trick you can employ is to mix a random SALT (a short random string) into the hashing, so that the colliding checksums diverge again.

E.g.

#!/bin/bash
# Generate the salt as hex text so it is safe to keep in a shell variable
# (raw /dev/urandom bytes may contain NULs, which bash variables cannot store).
SALT=$(dd if=/dev/urandom status=none bs=1 count=128 | od -An -tx1 | tr -d ' \n')

FILES=("f1" "f2" "f3" "f4" "f5")
for file in "${FILES[@]}"
do
    # Prepend the salt to the file's contents, hash, and print "hash filename".
    printf '%s %s\n' "$(printf '%s' "$SALT" | cat - "$file" | md5sum --binary | cut -d' ' -f1)" "$file"
done

This will compute an MD5 hash for each file in FILES, mixing in a randomly generated SALT (here, 128 random bytes rendered as hex), resulting in output like this:

741eefc6c14d80ee38164a0961cfd200 f1
741eefc6c14d80ee38164a0961cfd200 f2
741eefc6c14d80ee38164a0961cfd200 f3
68441eb38393a75dee94ae089d528633 f4
68441eb38393a75dee94ae089d528633 f5

If you run this again you will get different checksums (as the SALT will be different), but they will nevertheless still match for the duplicate files:

bc2fdca1b765989b62e507711749c5b4 f1
bc2fdca1b765989b62e507711749c5b4 f2
bc2fdca1b765989b62e507711749c5b4 f3
a31019a6ace1f51b18920bb33d781c97 f4
a31019a6ace1f51b18920bb33d781c97 f5

Now you can just process this "MD5SUM file" list, to get the list of duplicates.
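
For instance, a sketch (assuming you have redirected the loop's output into a file, say hashes.txt, and can use the GNU uniq extensions -w and --all-repeated):

# Sort by hash, keep only lines whose first 32 characters (the MD5 digest)
# occur more than once, and print the file names, one blank-separated block
# per group of identical files.
sort hashes.txt | uniq -w32 --all-repeated=separate | cut -d' ' -f2-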

And you will also have to adapt this to handle the input data in your format: multiple \0\0-separated groups of \0-separated file names. (Not going to take all the fun out of your assignment.)

zeppelin
  • So after adding the same SALT to every file and using md5sum, will the collisions disappear 100%, so that only genuinely identical files have the same md5 hash code? It's really important that this be 100% true. – ogarogar Nov 08 '16 at 08:51
  • Well, the thing to understand here is that any hash function works in terms of probability, and even the most cryptographically strong and modern ones (e.g. SHA-3 or BLAKE2, etc.) won't give you a 100% guarantee, just a very low probability of collision.

    But that's OK: if you are doing things right, this probability is so low that it can be ignored for any imaginable practical task.

    – zeppelin Nov 08 '16 at 09:41
  • If that were not the case, a number of key technologies, cryptographic schemes and tools would fail to work (e.g. rsync, which uses MD5 for checksumming, or Git, which uses SHA-1 for commits, etc.)

    You can read more about MD5 collision probabilities here: http://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions

    (cite) To have a 50% chance of any hash colliding with any other hash you need 2^64 hashes. This means that to get a collision, on average, you'll need to hash 6 billion files per second for 100 years.

    – zeppelin Nov 08 '16 at 09:42
  • Another thing to note is that while MD5 is known to be outdated and vulnerable, and artificial collisions can be constructed by an attacker, it takes a special effort to do so. If you just attach a randomly chosen prefix to your existing (and possibly colliding) messages, the probability of accidentally creating a new collision is absolutely negligible. – zeppelin Nov 08 '16 at 09:58
  • So in the end I have to check with diff whether the files are the same... Because at the end of the script the printed groups of files have to be truly identical. – ogarogar Nov 08 '16 at 10:01
  • so in the end I have to check with diff whether the files are the same

    If you use diff, you will have to re-check all the possible combinations for all the files with equal checksums, like this: {f1,f2} {f1,f3} {f2,f3}, {f4,f5}

    Which is not really optimal and has no practical value, unless you have trillions of input files (in which case it will be prohibitively slow anyway)

    – zeppelin Nov 08 '16 at 10:18
  • I'm looking for duplicate files in a given directory. I know the chances of a hash collision are very, very low, but I have to list files that are duplicates with 100% certainty, and sadly, as you said, no hashing method can provide that 100%, and I don't know any other way to be 100% sure. – ogarogar Nov 08 '16 at 11:11