
I wrote a bash script today during my lunch break that finds extensionless files in a directory and appends a file extension to those files.

The script is relatively long because I added a bunch of flags and stuff like directory selection and whether to copy or overwrite the file, but the meat and potatoes of its functionality can be replicated simply with this:

#recursively find files in current directory that have no extension
for i in $(find . -type f ! -name "*.*"); do
    #guess that extension using file
    extfile=$(file --extension --brief $i)
    #select the first extension in the event file spits something weird (e.g. jpeg/jpe/jfif) 
    extawk=$(echo $extfile | awk -F/ '{print $1}')
    #copy the file to a file appended with the extension guessed from the former commands
    cp -av $i $i.$extawk
done

It's a bit tidier in my actual script—I just wanted to split commands up on here so I could comment why I was doing things.

My question: Using find in combination with file in the manner I have chosen is likely not the most fool-proof way to go about doing this—what is the best way to recursively guess and append extensions for a bulk group of diverse filetypes among several directories?

2 Answers


for x in $(find …) fails with file names containing whitespace (common) or wildcard characters (somewhat uncommon). Never parse the output of find. Use -exec.
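For comparison with the loop in the question, here is a minimal sketch of the -exec approach in plain sh/bash. The function name append_extensions is made up for illustration, and skipping files for which file suggests nothing (an empty result or ???) is an assumption about what you want, not something your script does.

```shell
# Hypothetical helper: copy each extensionless regular file under a
# directory (default: .) to "<name>.<ext>", using the first extension
# that file(1) suggests. find hands the names to sh as arguments, so
# whitespace and wildcard characters in names are preserved.
append_extensions() {
    find "${1:-.}" -type f ! -name '*.*' -exec sh -c '
        for f do
            # file may print a list such as jpeg/jpg/jpe/jfif
            ext=$(file --extension --brief -- "$f") || continue
            ext=${ext%%/*}    # keep only the first suggestion
            case $ext in
                ""|"???") continue ;;    # no usable guess: leave the file alone
            esac
            cp -av -- "$f" "$f.$ext"
        done
    ' sh {} +
}
```

Call it as append_extensions some/dir. Swapping cp -av for echo cp first gives a dry run.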

Zsh's zmv is convenient for mass renaming.

Let's construct a zmv command that does what you want. First, let's build the search pattern:

autoload zmv
zmv -C -o -a -n -Q '(*/)#^*.*(.)' …
  • -C causes files to be copied instead of moved.
  • -o -a passes -a to cp.
  • -n means don't act, just print what would be done. Remove it once you're happy. Replace it by -v if you want to act but also print what is being done.
  • -Q enables glob qualifiers in the pattern.
  • (*/)# matches zero or more directories. It uses the # glob operator (extended_glob is always enabled in zmv).
  • ^*.* uses the ^ glob operator to match files without a . in their name.
  • (.) is a glob qualifier that restricts the matches to regular files.
  • … is a placeholder for the replacement text, which is built below. The replacement text can use $f to refer to the original name.

zmv calculates all the replacement names before performing any replacement and will complain if any replacement name already exists or if there are clashes. Files for which the replacement name is identical to the original are skipped.

Now let's build the replacement text. We'll use a lot of parameter expansion features.

  1. Ask file for the extension: $(file --extension --brief -- $f)
  2. Prepend a ., in preparation for the replacement: $(echo -n .; file --extension --brief -- $f)
    (This could alternatively be done with parameter expansion: ${:-.$(…)}.)
  3. If there are multiple suggested extensions (separated by slashes), keep only the first one: ${$(echo -n .; file --extension --brief -- $f)%%/*}
  4. If the suggested extension is empty or ???, give up (replace . or .??? by an empty string): ${${$(echo -n .; file --extension --brief -- $f)%%/*}:#.(|\?\?\?)}
  5. Append the added extension to $f (the original name). If what we're appending is empty, the file will be left untouched.

The resulting command:

zmv -C -o -a -n -Q '(*/)#^*.*(.)' '$f${${$(echo -n .; file --extension --brief -- $f)%%/*}:#.(|\?\?\?)}'

This is a bit cryptic and you may prefer to put the code to generate the replacement in a function and use zmv … '$(add_extension $f)'.
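One way such a function could look is sketched below. Only the name add_extension comes from the command above; the body is an assumption that follows the same steps (take the first suggestion, give up on an empty result or ???). It sticks to syntax that both zsh and bash accept.

```shell
# Print the name given as $1 with a guessed extension appended,
# or the name unchanged when file(1) has no usable suggestion.
add_extension() {
    local ext
    ext=$(file --extension --brief -- "$1")
    ext=${ext%%/*}                        # keep only the first suggestion
    case $ext in
        ''|'???') printf '%s\n' "$1" ;;   # nothing to append
        *)        printf '%s.%s\n' "$1" "$ext" ;;
    esac
}
```

When the function prints the name unchanged, the replacement is identical to the original, so zmv skips that file.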

1

I think the most effective way is to look up the file's MIME type in the database located at /usr/share/mime/globs.

  • In this database, globs are file-name patterns, which in practice usually means file extensions. For example, here is some output from the globs file:
application/x-mswinurl:*.url
text/x-mrml:*.mrl
text/x-erlang:*.erl
audio/x-pn-audibleaudio:*.aa
application/x-bzip-compressed-tar:*.tbz2
application/x-netshow-channel:*.nsc
application/x-hdf:*.h4
application/pgp-keys:*.key
text/x-idl:*.idl
text/x-chdr:*.h
application/vnd.ms-powerpoint.presentation.macroEnabled.12:*.pptm
application/vnd.ms-powerpoint.presentation.macroEnabled.12:*.pptm
application/vnd.visio:*.vsd
application/x-hdf:*.h5
video/vnd.mpegurl:*.m4u
  • Each line gives a MIME type (for example text/x-erlang) followed by the glob associated with it, so text/x-erlang:*.erl means that files matching *.erl are treated as Erlang source.
  • You can add your own detection rules to the /etc/magic file, which the file command reads.

So, running this command on a file:

mimetype -bM file
  • -b (brief) prints only the MIME type, without the file name.

  • -M (magic only) makes mimetype identify the file from its contents (its magic bytes) rather than from its name, so the result reflects what the file really is and not what its name claims.

  • mimetype returns a single type rather than a slash-separated list such as jpeg/jpg/jpe, and the command is shorter than file --mime-type file.

Returns:

image/webp
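To turn that MIME type into an extension, the globs file can be searched for the matching line. This is a rough sketch: extension_for_mime is a made-up name, and taking the first *.ext glob listed for a type is an assumption, since that may not be the most common extension for the type.

```shell
# Print the first "*.ext"-style extension listed for a MIME type.
#   $1: MIME type, e.g. text/x-erlang
#   $2: globs database (assumed default: /usr/share/mime/globs)
extension_for_mime() {
    local mime=$1 globs=${2:-/usr/share/mime/globs}
    # Lines look like "text/x-erlang:*.erl". Globs that do not start
    # with "*." (literal names such as makefile) are skipped.
    awk -F: -v mime="$mime" '
        $1 == mime && index($2, "*.") == 1 { print substr($2, 3); exit }
    ' "$globs"
}
```

It could then be combined with the command above, for example ext=$(extension_for_mime "$(mimetype -bM "$f")"). An empty result means no extension was found.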

Final thoughts

mimetype works best with binary files such as PDFs, images and videos, because it can check their binary signatures. Plain text has no such signature, so identifying it is harder. That is why text editors need help from the user, or a language server for each programming language, to recognize different programming languages.

For recursion, I think tree is fine:

tree -FIi '*.*' | grep -v /$
  • -F appends a / to directory names, so folder is shown as folder/.
  • -I excludes names that match the pattern. *.* matches names that have an extension, so only names without an extension remain.
  • -i stops tree from printing the indentation lines.
  • grep -v inverts the match. Because -F marks directories with a trailing /, grep -v /$ removes the directories and leaves only files.

