2

I'm using this command to rename files with random characters from sha1sum and move all files from subdirectories to the current directory:

for fname in `find . -type f`; do mv "$fname" $(echo "$fname" | sha1sum | cut -f1 -d' ').html; done
  1. But the question is: Does it create unique filenames? I'm worried the generated name from sha1sum may not be unique (generated twice or more).
  2. If I run the above command, and then run another one in another directory, will it generate a unique file name for each file?
psmears

3 Answers

2

sha1sum outputs will be unique as long as the inputs are unique (unless you are extraordinarily unlucky and hit a SHA-1 collision).

As for your use case: it's a good habit to use printf '%s' "$fname" instead of echo "$fname". The former also works when $fname is -n, -e, and so on. See also enzotib's remark, which I missed at first glance.

Also, I'm not sure exactly what your motivations are, but you may consider feeding sha1sum the file contents instead of the file names. This way, you would obtain a unique filename for each unique content.
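A minimal sketch of hashing contents rather than names, assuming sha1sum from GNU coreutils is available. The helper name content_hash is made up for illustration:

```shell
# Hypothetical helper: print the SHA-1 of a file's contents (not of its name).
# Reading from stdin makes sha1sum print "<hash>  -", so cut keeps only the hash.
content_hash() {
    sha1sum < "$1" | cut -d' ' -f1
}
```

It could then be used as mv -- "$fname" "$(content_hash "$fname").html". Two files with identical contents get the same name, so the second mv would overwrite the first.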

  • Note that using the sha1 of file contents would give more unique results but would also take far more time (depending on the size of files of course). – rozcietrzewiacz Aug 19 '11 at 12:06
  • If you hash the content, you'll end up with the same filename if the contents are the same. That may or may not be desirable (it isn't if you later want to modify one of the files). – Gilles 'SO- stop being evil' Aug 19 '11 at 13:19
2

First, a few shell matters:

  • Don't use for fname in `find …` as this will mangle file names and will fail (because the command line is too long) if there are too many files with too long names. Use find -exec instead. Since you need shell expansion in the command executed by find, invoke a shell.
  • You need double quotes around command substitutions as well as variable substitutions ("$fname", "$(echo …)").
  • echo mangles backslashes on a few shells (it also mangles a few arguments beginning with -, but that's not an issue here since all arguments will begin with ./). A way to print any string literally is printf "%s\n" "$fname", or printf "%s" "$fname" to avoid a final newline. Here I see no reason to take the hash of the filename plus a final newline as opposed to the hash of the filename.
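To illustrate the last point, here is a small sketch (hash_name is a hypothetical helper, and sha1sum is assumed to be the coreutils one) that hashes a name exactly as given, with no trailing newline:

```shell
# Hypothetical helper: SHA-1 of a name, hashed byte for byte as passed in.
# printf '%s' adds no newline and never interprets its argument as an option.
hash_name() {
    printf '%s' "$1" | sha1sum | cut -d' ' -f1
}

hash_name './some file.txt'   # spaces in the name are preserved
hash_name '-n'                # hashed literally, not swallowed as an option
```

Hashing the name with a final newline (printf '%s\n') gives a different, equally valid hash. What matters is picking one form and using it consistently.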

So we get this command:

find . -type f -exec sh -c 'mv "$0" "$(printf "%s" "$0" | sha1sum | cut -f1 -d" ").html' {} \;

It will be slightly faster to invoke a shell for a whole batch of names at once.

find . -type f -exec sh -c 'for fname; do mv "$fname" "$(printf "%s" "$fname" | sha1sum | cut -f1 -d" ").html; done' _ {} +

A problem with this method is that if mv starts to act before find has finished traversing the directory, files that have already been moved may be picked up again by find. This is not an issue with your original command, because the backquotes wait for find to finish before any file is moved. So put the renamed files in a different directory hierarchy. This also solves another problem, which your proposed command has as well: mv may overwrite an existing file that happens to be called <sha1sum>.html.

mkdir ../staging
find . -type f -exec sh -c 'for fname; do mv "$fname" ../staging/"$(printf "%s" "$fname" | sha1sum | cut -f1 -d" ").html; done' _ {} +
find . -depth \! -name "." -type d -exec rmdir {} +
mv ../staging/* .
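If you also want to guard against overwriting inside the staging directory, the move can be wrapped in a check. This is only a sketch: stage_file is a hypothetical helper, and the test-then-move is not atomic, so it assumes nothing else writes to the staging directory at the same time.

```shell
# Hypothetical helper: move file "$1" into directory "$2" under its hashed
# name, refusing to overwrite a file that already has that name.
stage_file() {
    src=$1
    dest_dir=$2
    target=$dest_dir/$(printf '%s' "$src" | sha1sum | cut -d' ' -f1).html
    if [ -e "$target" ]; then
        printf 'skipping %s: %s already exists\n' "$src" "$target" >&2
        return 1
    fi
    mv -- "$src" "$target"
}
```

Inside the find -exec sh -c '…' loop, the function body would replace the bare mv call.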

Now on to your main question: in practice, two files with different paths will map to two different SHA-1 hashes. Mathematically speaking, there exist distinct strings with identical SHA-1 hashes (that's obvious, since there are infinitely many strings but only finitely many hashes). However, you will not run into such a pair by accident. SHA-1 collisions have been constructed deliberately (the first was published in 2017), so your procedure is safe only against accidental collisions, not against malicious attackers who can choose the file names. If that matters to you, switch to whatever is considered a secure hash algorithm at the time, for example sha256sum.

As for your second question: the hash is fully determined by the string you hash. So if you have two files called tweedledum/staple and tweedledee/staple and you run that renaming procedure from each directory tweedledee and tweedledum in turn, then both directories will end up with a file called 1c0ee9c1eed005a476403c7651b739ae5bc7cf2a.html. If you want to have different names, you need to put some distinguishing content in the hashed text, such as the name of the directory.
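One way to do that, as a sketch: prefix the path with a directory label before hashing. hash_in_dir is a hypothetical helper, and the choice of label (here the directory name) is an assumption. Use the absolute path of the directory if names must be unique across the whole machine.

```shell
# Hypothetical helper: hash "<directory label>/<path>" instead of the path alone.
# "${2#./}" strips the leading "./" that find puts in front of each path.
hash_in_dir() {
    printf '%s/%s' "$1" "${2#./}" | sha1sum | cut -d' ' -f1
}
```

For example, hash_in_dir tweedledum ./staple and hash_in_dir tweedledee ./staple give different names, whereas hashing ./staple alone gives the same name in both directories.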

1

First of all, I suggest replacing

for fname in `find . -type f`; do

with

find . -type f | while IFS= read -r fname; do

Next, regarding sha1sum: it should be "virtually" unique, meaning that the probability of having two different files with the same checksum is so low that you can safely assume it is unique.
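As a rough estimate of how low, assume SHA-1 behaves like a uniformly random 160-bit function on non-adversarial inputs (the usual birthday-bound approximation, with n the number of files). Then even for a billion files:

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{160}}, \qquad
n = 10^{9} \;\Rightarrow\; P \approx \frac{10^{18}}{2^{161}} \approx 3 \times 10^{-31}
```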

enzotib