I came across a problem and have no idea how to find the OPTIMAL solution. I have a list of files that looks like this:
file1\0file2\0...fileX\0\0file(x+1)\0
Every file name is separated by a \0, and each group of files is separated by an extra \0. Every file within a group has the same hash code (I used md5sum to calculate them). I need to find which files in each group are actually identical and print them.
For example, say a group contains 6 files (call them f1, f2, f3, f4, f5, f6). Using diff I found out that f1, f2, f3 are identical and f4, f5 are also identical (but different from f1, f2, f3). So I want to print f1, f2, f3 and f4, f5, but not f6, because I didn't find a duplicate of f6.
I use | while read -r -d $'\0' file to read the data. Can you help me find an optimal way to do it?
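To show what I mean, here is a rough sketch of how I currently read the groups. Note that generate_list is only a placeholder for the command that actually produces the \0-delimited list, and I use process substitution instead of a plain pipe so the array survives outside the loop:

    # Rough sketch: 'generate_list' is only a placeholder for whatever command
    # actually produces the \0-delimited list described above.
    group=()                               # file names of the current group
    while read -r -d $'\0' file; do
        if [[ -z $file ]]; then
            # an empty record is the extra \0 that closes a group
            printf 'group: %s\n' "${group[*]}"
            group=()
        else
            group+=("$file")
        fi
    done < <(generate_list)
    # flush the last group if the list does not end with a double \0
    if ((${#group[@]})); then
        printf 'group: %s\n' "${group[*]}"
    fi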
Edit: To simplify my problem: I have an array with n fields. I'm looking for an algorithm that is easy to implement in bash and not too slow, which finds the values that are equal and appends a number to each of them, so that I can sort on it later. Referring to my example, after this 'sorting' I want to print 'f1 1', 'f2 1', 'f3 1', 'f4 2', 'f5 2', 'f6 3', and then use awk to turn that into a table.
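Here is a minimal sketch of the kind of numbering I mean, assuming the current group is already in an array. The file names f1..f6 are just my example data, and cmp -s is used instead of diff to test byte equality:

    # Sketch: number the files of one group by actual content, using cmp.
    # 'files' holds example names; in reality it would be one group from the list.
    files=(f1 f2 f3 f4 f5 f6)
    ids=()                                 # ids[i] = group number assigned to files[i]
    next=1
    for ((i = 0; i < ${#files[@]}; i++)); do
        [[ -n ${ids[i]} ]] && continue     # already matched an earlier file
        ids[i]=$next
        for ((j = i + 1; j < ${#files[@]}; j++)); do
            [[ -n ${ids[j]} ]] && continue
            # cmp -s exits 0 only when the two files are byte-for-byte identical
            if cmp -s "${files[i]}" "${files[j]}"; then
                ids[j]=$next
            fi
        done
        ((next++))
    done
    for ((i = 0; i < ${#files[@]}; i++)); do
        printf '%s %s\n' "${files[i]}" "${ids[i]}"
    done

This does a pairwise cmp, so it is quadratic in the group size, but since all candidates in a group already share an md5sum the groups should be small; awk can then reshape the 'name number' lines into a table.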
"Every file from each group has the same hash codes (I used md5sum to calculate them)"
So basically, you have multiple MD5 collisions across the input files!? Is this expected, i.e. is that some kind of MD5 collision experiment data?
– zeppelin Nov 07 '16 at 22:25