I'm writing a quick tool to inspect the contents of a Node.js node_modules folder or a Python virtualenv for native dependencies. As a first approximation, I wrote the following command.

find . | xargs file | awk '/C source/ {print $1} /ELF/ {print $1}'

I'm okay with false positives but not false negatives (e.g. files literally containing the string ELF or C source can be marked suspicious). However, this script potentially breaks on long file names (because xargs will split them), on file names containing spaces (because awk splits on whitespace), and on file names containing newlines (because find separates paths with newlines).

Is there a way to filter the paths generated by find by seeing if the output of file {} (possibly with some additional options to remove the path entirely from the output of file) matches a particular regular expression?

Greg Nisbet
  • Use -exec cmd {} + instead of the nonstandard -print0. – schily Apr 06 '16 at 19:12
  • Your updated command looks better, but it is still susceptible to filenames with spaces or newlines. If you can get the list of files from find and iterate F over that list, then you can echo "$F" whenever the output of file $F | grep ... returns success. – Prem Apr 06 '16 at 19:22
  • OP: get rid of the pipe and do it with find only; use file with -b so that grep can use an anchor, e.g. find . -type f -exec sh -c 'file -b "$0" | grep -q "^ELF\|^C source" && printf %s\\n "$0"' {} \; – don_crissti Apr 06 '16 at 19:25
  • @don_crissti , excellent !! It solves all problems given by OP, and should be the answer here. Key point is that -exec file is not enough, but -exec sh ... allows a whole small script to be executed for each file, with find itself doing the iteration !! – Prem Apr 06 '16 at 19:32
  • @don_crissti Why does the {} argument to the inner script become $0 instead of $1 in the body of the script? sh -c "echo $0" hi, for instance, prints /bin/bash on my machine. – Greg Nisbet Apr 06 '16 at 19:39
  • Visit this page; also follow the links there... – don_crissti Apr 06 '16 at 19:44
  • @don_crissti Done. Perhaps this is more appropriate as a meta question, but a) you came up with the solution and b) the find -exec file {} + ... solution is still better than what I originally posted, but doesn't directly address the whitespace problem. What's the right thing to do in terms of packaging that as an answer? – Greg Nisbet Apr 06 '16 at 20:03
  • @don_crissti, if the only desired output is a human-readable list of filenames, there will never be a fully unambiguous way to distinguish them later, although the fact that every path reported by find will contain at least one slash will definitely help. But see my answer below; you can easily modify the -print flag in the first command to another -exec operator and thereby handle any special character filenames safely for whatever you need to do. – Wildcard Apr 06 '16 at 23:43
  • Also relevant: Bash script: check if a file is a text file. Suggestion: read everything Stéphane Chazelas wrote. – G-Man Says 'Reinstate Monica' Apr 07 '16 at 04:16
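
On the $0 versus $1 question raised in the comments: with sh -c 'script' a b, the first operand after the script string is assigned to $0 and the following ones to $1, $2, and so on. The echo test quoted above printed /bin/bash because the double quotes let the calling shell expand $0 before sh ever ran. A minimal sketch, where the argument strings are placeholders chosen for illustration:

```shell
# Single quotes keep the calling shell from expanding $0 and $1.
sh -c 'printf "0=%s 1=%s\n" "$0" "$1"' first second
# prints: 0=first 1=second

# Putting a placeholder name in the $0 slot moves the path to $1,
# which is also where a `for f do ... done` loop starts iterating.
sh -c 'printf "0=%s 1=%s\n" "$0" "$1"' sh "my file.so"
# prints: 0=sh 1=my file.so
```

Both conventions work with find; the second has the side benefit that error messages from the inner shell are labelled with sh rather than with a file name.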

2 Answers

The easiest approach is to run a small script on every file: it checks the brief-mode output of file (file -b) and prints the path if that output starts with ELF or C source. The path is passed to the script as $0.

find . -type f -exec sh -c \
    'file -b "$0" | grep -q "^ELF\|^C source" && printf %s\\n "$0"' {} \;

This solution has the following advantages over the original:

-type f filters out directories immediately instead of relying on the output of file.

Passing the path in as {} avoids issues with whitespace or newlines in the file name.
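
The whitespace claim can be checked without involving file at all. The following is only a sketch: the scratch directory and the file names are made up for the demonstration. Each path reaches the inner script as a single argument, even with a space or a newline in its name:

```shell
# Scratch tree with awkward names (hypothetical, for illustration only).
demo=$(mktemp -d)
touch "$demo/plain.so" "$demo/has space.so"
touch "$demo/has
newline.so"

# The brackets mark where each argument starts and ends;
# ${0##*/} strips the directory part so only the file name is shown.
find "$demo" -type f -exec sh -c 'printf "[%s]\n" "${0##*/}"' {} \;

rm -rf "$demo"
```

Three bracketed names come out, one per file, and the newline appears inside its brackets rather than splitting the name into two entries.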

Greg Nisbet

The key factor in reaching find enlightenment ;) is:

find's business is evaluating expressions -- not locating files. Yes, find certainly locates files; but that's really just a side effect.

--Unix Power Tools

There is an alternate approach to this question that is worth knowing about (also described in Unix Power Tools, in the section "Using -exec to Create Custom Tests"):

find . -type f -exec sh -c 'file -b "$1" | grep -iqE "^ELF|^C source"' sh {} \; -print

This filtering method is useful because it can do much more than print the name of the file: change the -print operator to any other operator you like (including another -exec operator) and do what you like with the matching files.
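
For instance, -print can be swapped for a second -exec that only ever sees the files the first one accepted. A sketch, assuming ls -l as a stand-in for whatever action is actually wanted; the function name is hypothetical and exists only to make the command reusable:

```shell
# Hypothetical helper name; the directory defaults to the current one.
list_native_details() {
    # First -exec acts as the test: find moves on to the next operator
    # only when the inner pipeline exits 0, i.e. when grep matched.
    # Second -exec is the action; it filters nothing, so + can batch it.
    find "${1:-.}" -type f \
        -exec sh -c 'file -b "$1" | grep -iqE "^ELF|^C source"' sh {} \; \
        -exec ls -l {} +
}

# Example call (the directory name is a placeholder):
# list_native_details node_modules
```

The test still has to use \; because find needs a separate exit status for each file; only the final action can be batched.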


This command has a performance drawback (shared with the other answer): because it uses \; rather than +, it spawns a shell for every single file. Using + to pass multiple files at once to the sh command and processing them with a for loop gives a noticeable performance advantage:

find . -exec sh -c 'for f do file -b "$f" | grep -qE "^ELF|^C source" && printf %s\\n "$f"; done' sh {} +

You can see the comparison for yourself by running both of the following commands and comparing the output of time:

time find . -exec sh -c 'for f do file -b "$f" | grep -qE "^ELF|^C source" && printf %s\\n "$f"; done' sh {} +
time find . -exec sh -c 'file -b "$1" | grep -qE "^ELF|^C source" && printf %s\\n "$1"' sh {} \;
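
If timing is too noisy on a small tree, counting invocations shows the same thing. This sketch uses a throwaway directory of five empty files (an assumption made purely for the demonstration) and has each spawned shell report how many paths it was given:

```shell
demo=$(mktemp -d)
touch "$demo/a" "$demo/b" "$demo/c" "$demo/d" "$demo/e"

# \; starts one sh per file, so this prints five lines.
find "$demo" -type f -exec sh -c 'echo "sh started with $# file(s)"' sh {} \;

# + hands all five paths to a single sh, so this prints one line.
find "$demo" -type f -exec sh -c 'echo "sh started with $# file(s)"' sh {} +

rm -rf "$demo"
```

On a very large tree, + may still split the paths across several invocations to stay under the system's argument-length limit, but the number of shells remains far below one per file.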

The real point, though, is:

Never run a shell for loop on a list of files that is output from find. Instead, either run the action you need to do on each file directly within find by using the -exec operator, or embed a shell for loop within a find command and do it that way.
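
A sketch of what goes wrong, using a made-up directory that holds one file with a space in its name. The unquoted command substitution is split on whitespace, so the outer loop sees two words where there is only one file, while the loop embedded in find receives the path intact:

```shell
demo=$(mktemp -d)
touch "$demo/two words.c"

# Fragile: word splitting breaks the single path into two items.
for f in $(find "$demo" -type f); do
    printf 'outer loop: [%s]\n' "$f"
done

# Robust: find passes each path to the inner loop as one argument.
find "$demo" -type f -exec sh -c '
    for f do
        printf "inner loop: [%s]\n" "$f"
    done' sh {} +

rm -rf "$demo"
```

The outer loop prints two bracketed fragments and the inner loop prints one complete path.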

Some additional reasons:

Wildcard
  • Does + in place of \; affect the character used to separate lines of output from find or whether a subprocess is spawned for each invocation? – Greg Nisbet Apr 06 '16 at 23:24
  • @GregoryNisbet, the latter—see POSIX specs for find (or just look at man find). + affects the behavior of the -exec operator and makes find run whatever command is specified after -exec with multiple files as arguments, rather than just one. Actually, I just realized your answer has no performance benefit as written; I will update my answer. – Wildcard Apr 06 '16 at 23:28
  • @StéphaneChazelas, I'm trying to find how for f; do is nonstandard and from the line about for_clause on this page it seems it should be standard...sequential_sep can be a ;, right? I'm not sure I've got the hang of the grammar depiction format, though.... – Wildcard Nov 08 '16 at 00:00
  • See http://austingroupbugs.net/view.php?id=581 – Stéphane Chazelas Nov 08 '16 at 11:59
  • So, with the 2016 edition now released, that's no longer true; I'll remove my comment. (You still need for f do for the Bourne shell, though.) – Stéphane Chazelas Nov 08 '16 at 12:38
  • Excuse the simplistic question. How do I then take this list of files and use them as input to a binary in the most efficient way? I tried a pipe | at the end but that didn't work. – Frak Oct 20 '21 at 13:51
  • @frakman1 see examples in https://unix.stackexchange.com/a/321753/135943 that have "somecommand" or "somecommandwith" in them. If that's not enough, please post a new question. – Wildcard Oct 20 '21 at 20:00