30

Say there are hundreds of *.txt files in a directory. I only want to find the first three *.txt files and then exit the searching process.

How can I achieve this using the find utility? I had a quick look through its man pages but saw no option for this.

user9101329
  • 1,004
mitnk
  • 551
  • 5
    You can use find . -name '*.txt' -print -quit to show only the first match and make find exit after that match. I do not know if it is possible to adapt it to the case "exit after finding n matches". – N.N. Mar 19 '13 at 09:49

4 Answers

31

You could pipe the output of find through head:

find . -name '*.txt' | head -n 3
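A quick way to see this termination behavior in isolation (illustrative, not part of the original answer): yes writes lines forever, yet the pipeline finishes immediately, because yes is killed by SIGPIPE as soon as head exits.

```shell
# head prints the first 3 lines and exits; the endless writer then
# dies on its next write, so the whole pipeline terminates at once.
yes | head -n 3
```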
don_crissti
  • 82,805
Chris Card
  • 2,292
  • 5
    I knew this, but I want to exit the searching process after finding the first three matched files. There can be a huge amount of matched files I don't care about. – mitnk Mar 19 '13 at 08:31
  • 3
    I think the find command does get terminated once head has printed the first 3 files – Chris Card Mar 19 '13 at 08:48
  • 2
    Yes, it's strange, but you are right. – mitnk Mar 19 '13 at 09:22
  • 25
    It's not at all strange - it's how pipes work in UNIX. head starts up and waits for input from the lefthand side of the pipe. Then find starts up and searches for files that match the criteria specified, sending its output through the pipe. When head has received and printed the number of lines requested, it terminates, closing the pipe. find notices the closed pipe and it also terminates. Simple, elegant and efficient. – D_Bye Mar 19 '13 at 09:37
  • Isn't that head -n3, though? I always use it that way, and the man page for GNU coreutils-8.5 head(1) doesn't seem to indicate that just using -3 is a supported syntax. – user Mar 19 '13 at 14:01
  • @MichaelKjörling: From the coreutils man page: "For compatibility head also supports an obsolete option syntax -countoptions, which is recognized only if it is specified first. count is a decimal number optionally followed by a size letter (‘b’, ‘k’, ‘m’) as in -c, or ‘l’ to mean count by lines, or other option letters (‘cqv’)." GNU calling this "obsolete" seems overly optimistic to me; as far as I know, this style is far more common. – Plutor Mar 19 '13 at 14:04
  • @Plutor I don't see that in the Debian Squeeze coreutils-8.5 man page for head. Though if it works, then that's well enough. I do think it could be added to the answer, though, particularly since the man page does state with the leading '-', print all but the last K lines of each file (for -n). – user Mar 19 '13 at 14:06
  • @MichaelKjörling That was quoted from the coreutils 8.21 manual on gnu.org. Not sure about Debian's manual. http://www.gnu.org/software/coreutils/manual/coreutils.html#head-invocation – Plutor Mar 19 '13 at 14:39
  • 3
    To summarize, -n 3 is POSIX compatible, and therefore likely to be more portable. – l0b0 Mar 19 '13 at 16:11
  • +1 to @D_Bye I feel I just learned a ton of Unix after reading It's not at all strange - it's how pipes...'s comment. – j-- Dec 16 '14 at 10:07
  • @D_Bye Your explanation isn't perfectly accurate. See my answer here. – Kamil Maciorowski Jun 24 '17 at 09:46
18

This other answer is somewhat flawed. The command is

find . -name '*.txt' | head -n 3

Then there's an explanation in one of the comments [emphasis mine]:

head starts up and waits for input from the lefthand side of the pipe. Then find starts up and searches for files that match the criteria specified, sending its output through the pipe. When head has received and printed the number of lines requested, it terminates, closing the pipe. find notices the closed pipe and it also terminates. Simple, elegant and efficient.

This is almost true.

The problem is that find notices the closed pipe only when it tries to write to it – in this case, when the 4th match is found. But if there is no 4th match, find will continue. Your shell will wait! If this happens in a script, the script will wait, even though we already know the pipe output is final and nothing more can be written to it. Not so efficient.

The effect is negligible if this particular find finishes fast by itself but with complex search in a large file tree the command may unnecessarily delay whatever you want to do next.
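A toy writer (standing in for find; the lines are made up) makes the delay visible:

```shell
# head exits after 3 lines, but the writer is only killed when it next
# tries to WRITE: it survives the sleep and dies on 'echo d', so the
# final echo is never reached.
( printf 'a\nb\nc\n'; sleep 1; echo d; echo unreached ) | head -n 3
```

The writer keeps running for the whole sleep even though head is long gone; with find in place of the toy writer, that wasted time can be arbitrarily long.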

The not-so-perfect solution is to run

( find … & ) | head -n 3

This way, when head exits, the shell continues immediately. The background find process can then be ignored (it will exit sooner or later) or targeted with pkill or similar.


To prove the concept, you may search for /. We expect exactly one match, yet find looks for it everywhere, which may take a long time.

find / -name / 2>/dev/null | head -n 1

Terminate it with Ctrl+C as soon as you see the issue. Now compare:

pidof find ; ( find / -name / 2>/dev/null & ) | head -n 1 ; pidof find

A better solution may be:

yes | head -n 2 \
| find … -print -exec sh -c '
   read dummy || kill -s PIPE "$PPID"
' find-sh \;

Notes:

  • Here we want 3 matched files, but we use head -n 2 (not head -n 3). After the third matching file, read finds no input on its stdin and then kill terminates find. If we used head -n 3, then kill would be triggered after the fourth file.

  • The signal is SIGPIPE. kill -s INT … should work as well. I deliberately chose SIGPIPE because it's the signal that terminates find in the simplest solution (find … | head -n 3).

  • Running one sh per matching file is negligible if you want 3 files. Remember, our goal is to avoid the find from the "not-so-perfect solution" running on in the background in vain; for the overall performance of the OS, a few short-lived shells are certainly better than an "abandoned" find that keeps traversing the filesystem. But if you want (at most) 1000 files, and chances are find may run out of files even earlier (so there may be no problem to avoid in the first place), then these shells are a burden.

    The following code spawns a reduced number of sh processes, but I think it's flawed:

    # flawed, DO NOT USE
    yes | head -n 999 \
    | find … -exec sh -c '
       for pathname do
          printf "%s\\n" "$pathname"
          read dummy || { kill -s PIPE "$PPID"; exit 0; }
       done
    ' find-sh {} +
    

    I had to replace -print (from the outside of the shell code) with printf … (inside the shell code). The reason is -print before -exec sh … {} + could (and probably would) print too many pathnames.

    A potential problem arises: if each printf created a separate process, then it would make this "optimization" pointless. Fortunately in almost(?) every sh printf is a builtin.

    But the real flaw is the fact that -exec sh … {} + waits for as many pathnames as possible before handing them over to sh. On one hand this is exactly what reduces the number of sh processes. On the other hand it's almost certain that when the 1000th match is enqueued, find will keep searching for the 1001st; and when the 1001st is found, probably for even more. Note that in this case the 1001st match is the one that would terminate find … | head -n 1000; so the flawed solution is even worse than the simplest solution. Do not use it.

  • The simplest solution (find … | head -n 3) will miscount if there's a newline character in one of the printed pathnames. If you want null-terminated strings then the simplest solution will become like find … -print0 | head -z -n 3, i.e. you will need head that supports this non-portable option -z. In our optimized solution you will need neither head -z nor find -print0; printf "%s\\0" "$pathname" in the shell code will be enough.

  • Counting is done inside sh by consuming lines from the stdin inherited from find. Usually you don't pipe anything to find, but in general you might want to for some purpose other than our counting; that purpose and our counting method would then be incompatible.

  • yes is not portable. For our purpose while :; do echo; done is a portable replacement.

  • find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?
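For concreteness, here is a throwaway instantiation of the pattern above (the sample directory and filenames are made up): find prints three of the five matching files and is then killed.

```shell
# Sample tree with five matches; after the 3rd match, read hits EOF
# (head -n 2 has already exited), so kill terminates find.
dir="$(mktemp -d)"
cd "$dir"
touch a.txt b.txt c.txt d.txt e.txt

yes | head -n 2 \
| find . -name '*.txt' -print -exec sh -c '
   read dummy || kill -s PIPE "$PPID"
' find-sh \;
```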


A fellow user asked for a shell function that implements the solution. Here it is:

findn () (
  n="$1"
  shift
  case "$n" in
    '' | *[!0123456789]*) echo 'not a valid number' >&2; exit 1;;
  esac
  [ "$n" -eq 0 ] && exit 0
  n="$((n-1))"
  while :; do echo; done | head -n "$n" \
  | find "$@" -exec sh -c '
     read dummy || kill -s PIPE "$PPID"
  ' find-sh \;
)

The first argument is the maximum number of matches you want; the rest is passed to find.

Example usage:

findn 2 / -name bin -print 2>/dev/null
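To try the function on throwaway files (the directory and names are illustrative; the function is repeated here so the snippet runs standalone):

```shell
# findn repeated from above so this snippet is self-contained.
findn () (
  n="$1"
  shift
  case "$n" in
    '' | *[!0123456789]*) echo 'not a valid number' >&2; exit 1;;
  esac
  [ "$n" -eq 0 ] && exit 0
  n="$((n-1))"
  while :; do echo; done | head -n "$n" \
  | find "$@" -exec sh -c '
     read dummy || kill -s PIPE "$PPID"
  ' find-sh \;
)

# Five sample files, but find is stopped after the third match.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
touch one.txt two.txt three.txt four.txt five.txt
findn 3 . -name '*.txt' -print
```

This should print three of the five pathnames and return promptly, rather than traversing the rest of the tree.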
  • Could you make your final snippet into a general-purpose script, or bash function at least? i.e. replace ... and 999 etc. with variables to be provided via the cmdline? – einpoklum Feb 18 '24 at 13:14
  • @einpoklum Now I think the final snippet (the one with -exec … {} +) is badly flawed. I will improve the answer. – Kamil Maciorowski Feb 18 '24 at 13:40
1

A solution without find that might work for many is to use fd instead, a find-like tool written in Rust ("a simple, fast and user-friendly alternative to find"):

fd --glob '*.txt' /path/to/search --max-results $n
Atemu
  • 689
0

With bash 4.4+ and GNU tools, to exit as soon as the 3rd file has been found, you can do:

n=3
readarray -td '' first_3_files < <(
  (
    echo "$BASHPID"
    LC_ALL=C exec stdbuf -o0 find . -name '*.txt' -type f -print0
  ) | {
    IFS= read -r pid
    head -zn "$n"
    kill -s PIPE "$pid"
  }
)

echo "The first $n files are:"
printf ' - %s\n' "${first_3_files[@]}"

stdbuf -o0 stops find buffering its output, and we send the SIGPIPE signal to find as soon as head -zn 3 returns, rather than letting find carry on searching and only receive the SIGPIPE when it finds and prints the 4th file path.

Or another GNU specific approach using GNU find's -quit predicate:

n=3
readarray -td '' first_3_files < <(
  seq "$((n - 1))" | LC_ALL=C find . -name '*.txt' -type f -print0 \
   ! -exec read iteration ';' -quit)

(If your system doesn't have a standalone read utility, use -exec sh -c 'read iteration' ';'; systems with a standalone read utility probably have it implemented as a shell script wrapper around the builtin read anyway.)

With zsh, you can just do:

first_3_files=( **/*.txt(ND.Y3) )

(The glob qualifiers: N makes an empty match list a non-error, D includes hidden files, . restricts to regular files, and Y3 stops globbing after the first 3 matches.)