
I have some failed experiment result files, and their contents are exactly a single \n (newline).

I would like to list them all (perhaps with something like find or grep), to know what the files are and later delete them.

FxMySz

6 Answers

33

Create a reference file outside of the search path (it will be . in the example):

echo >/tmp/reference

Now we have a known file identical to what you're looking for. Then compare all regular files under the search path (. here) to the reference file:

find . -type f -size 1c -exec cmp -s -- /tmp/reference {} \; -print

-size 1c is not necessary and can be omitted; it's only there to improve performance. It's a quick preliminary test that rejects files of the wrong size without spawning additional processes. Relatively costly cmp … processes will be created only for files of the right size.

-s makes cmp itself silent. We don't need its output, just the exit status.

-- is explained here: What does "--" (double-dash) mean? It's really not needed in our example, where the reference file is /tmp/reference and the search path is the current directory (.). I used -- in case someone carelessly chooses path(s) that would otherwise make cmp misbehave or fail; with -- it should just work.

-exec is used as a test, it will succeed if and only if cmp returns exit status zero; and for a tested file this will happen if the file is identical to /tmp/reference. This way, find will give you the pathnames of files that are identical to the reference file.

The method can be used to find files with any fixed content; you just need a reference file with the exact content (and don't forget to adjust -size … if you use it; -size "$(</tmp/reference wc -c)c" will be handy). In our specific case a simple echo was used to create the file because it prints one newline character, which is exactly the content you want to find.
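For instance, a sketch of applying the same method to an arbitrary (hypothetical) reference content, with -size computed from the reference file itself as suggested above:

```shell
# Hypothetical reference content; adjust to whatever you need to match.
printf 'done\n' > /tmp/reference

# The size test is derived from the reference file, so it stays in sync.
find . -type f -size "$(</tmp/reference wc -c)c" \
  -exec cmp -s -- /tmp/reference {} \; -print
```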

To make find attempt to delete each matching file, use -delete (or, alternatively, -exec rm -- {} +, but not both) after -print.
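Putting it together, a one-pass run that prints each matching file and then deletes it could look like:

```shell
echo >/tmp/reference

# -print comes before -delete, so you get a record of what was removed.
find . -type f -size 1c -exec cmp -s -- /tmp/reference {} \; -print -delete
```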

  • Adding a -size 1c before the exec would probably make the whole thing faster as it would not need to spawn cmp for every file (though of course it would need to be adapted if the target file has a different size). – jcaron Jan 05 '24 at 18:09
  • @jcaron: If a script wants to make this reusable for other files, you could wrap it up in a shell function that takes a path to a temp file, and a path to a directory. stat or a size=$(find "$temp" -printf ...) an arg for -size. Or if you take the contents as a string arg, then size="${#1}c" or something, and printf "%s" "$1" > "$tmp" from mktemp or something. – Peter Cordes Jan 06 '24 at 06:08
  • @jcaron Good idea. Answer improved. Thanks! – Kamil Maciorowski Jan 06 '24 at 08:37
  • I wonder whether -size 1c followed by a checksum would be faster? There would be slightly more computation involved, but it would roughly halve the number of syscalls. – Mark Morgan Lloyd Jan 07 '24 at 13:09
  • @MarkMorganLloyd: The number of one byte files on the system should be pretty trivial. Eliminating the huge number of files that aren't one byte is 99.9% of the work, further optimization isn't going to matter very much. Syscalls cost something, but so does loading a cold file, and the reference file should be pretty hot, so optimizing away reading the reference file when you still have to load the cold files isn't going to help that much. – ShadowRanger Jan 07 '24 at 21:21
9

Search for files that are a single byte, compare them to the known value, and print and/or delete the matches:

find /path/to/files -type f -size 1c -exec sh -c 'printf "\n" | cmp -s -- - "$1"' _  {} \; -print

Optionally append -delete to delete, and remove -print if you want a silent run.
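For example (with . standing in for /path/to/files), a silent delete-only run could be:

```shell
# Delete one-byte files whose single byte is a newline; prints nothing.
find . -type f -size 1c -exec sh -c 'printf "\n" | cmp -s -- - "$1"' _ {} \; -delete
```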

Chris Davies
5

With GNU grep, you can use -z to make grep use NUL as the line terminator; as long as your files don't actually contain a NUL byte (\0), this has the effect of treating the whole file as a single line. If we combine that with -l to just print the file name and -P for PCREs so we can use \n, we can search for "lines" that consist of a single \n and nothing else:

grep -lPz '^\n$' *

For example, given these three files:

printf 'foo\n' > good_file_1
printf '\n\n\n\n' > good_file_2
printf '\n' > bad_file

Running the grep above gives:

$ grep -lPz '^\n$' *
bad_file

You can also make it recursive, using the bash globstar option (from man bash):

globstar

If set, the pattern ** used in a pathname expansion context will match all files and zero or more directories and subdirectories. If the pattern is followed by a /, only directories and subdirectories match.

So, for example, in this situation:

$ mkdir -p ./some/long/path/here/
$ cp bad_file some/long/path/here/bad_file_2
$ tree
.
├── bad_file
├── good_file_1
├── good_file_2
└── some
    └── long
        └── path
            └── here
                └── bad_file_2

5 directories, 4 files

Enabling globstar and running grep on **/* will find both bad files (I am redirecting standard error because grep complains about being given directories to search instead of files; such errors are expected and can safely be ignored):

$ grep -lPz '^\n$' **/* 2>/dev/null 
bad_file
some/long/path/here/bad_file_2

Alternatively, use find to only search files:

$ find . -type f -exec grep -lPz '^\n$' {} +
./some/long/path/here/bad_file_2
./bad_file
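Since the goal is eventually to delete the files, GNU grep's -Z (--null) option pairs with xargs -0 to handle arbitrary file names; a possible follow-up (the -r flag skips running rm when there are no matches):

```shell
# NUL-delimited file names from grep, consumed safely by xargs.
find . -type f -exec grep -lPzZ '^\n$' {} + | xargs -r0 rm --
```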
terdon
3

With zsh:

zmodload zsh/mapfile
print -rC1 -- **/*(ND.L1e[$' [[ $mapfile[$REPLY] = "\n" ]] '])
  • print -rC1: prints raw on 1 Column
  • N: nullglob: don't complain if there's no match, and pass an empty list to print instead.
  • D: dotglob: don't skip hidden files
  • .: regular files only (like -type f in find or file/f in rawhide).
  • L1: of Length 1.
  • e[code] runs the code on the file to further determine if that's a match
  • $mapfile[$REPLY] expands to the contents of the file (whose path is in $REPLY).

POSIXly, and avoiding spawning one or more processes per file (assuming a sh implementation where read, [ and printf are builtins, which is usually the case):

find . -type f -size 1c -exec sh -c '
  for file do
    IFS= read -r line < "$file" && [ -z "$line" ] && printf "%s\n" "$file"
  done' sh {} +

(note that contrary to with zsh above, the list is not sorted).

With rawhide (list not sorted either):

rh -e 'file && size == 1 && "
".body' .

With grep implementations that can cope with non-text files (at least NUL bytes and non-delimited lines), such as GNU grep in the C locale, you can also do:

LC_ALL=C find . -type f -size 1c -exec grep -l '^$' {} +
2
find . -size 1c -exec sh -c '[ -z "$(cat -- "$1")" ]' sh '{}' ';' -print

Looks for files of size exactly one byte where the result of reading the file (in a shell command substitution) is empty; the shell strips trailing newlines from command substitutions. cat is used because the $(< file) shortcut is not available in every sh, and "$1" is quoted to survive unusual file names.
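With bash, the $(< file) form of command substitution reads the file without spawning another utility; a sketch of the same test:

```shell
# $(< file) is a bash/ksh/zsh feature, not POSIX sh.
find . -type f -size 1c -exec bash -c '[ -z "$(< "$1")" ]' bash {} ';' -print
```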

glenn jackman
1

Just to present a novel alternative, in FreeBSD, this could be done as:

find . -maxdepth 1 -size 1c \
  -exec sh -c 'md5 -q --check=68b329da9893e34099c7d8ad5cb9c940 "$1" >/dev/null' sh {} \; -print

However, an md5 hash, even of a small file, is likely somewhat more expensive than a simple cmp.

I tried to find a way to phrase the cmp method using bash's command substitution (and BSD find), but it's a bit klunky:

find . -maxdepth 1 -size 1c -exec bash -c 'cmp -s "$1" <(echo)' bash {} \; -print

Again, likely slightly more expensive to create the newline file multiple times than Kamil's method of creating the reference file once, and comparing against it repeatedly.

Jim L.
  • If md5 makes fewer system calls, it could be faster. Especially if it can check multiple files per invocation with find -exec md5 {} +. (BTW, GNU Coreutils md5sum doesn't have an option to supply a hash on the command line to check against. But you could get it to print the hashes for multiple files and grep that.) Hrm, a duplicate-file finder could probably be best, if there's one that lets you look for duplicates only between two sets, not within, and one set can be the reference file alone. Or perl could be fast at this, with good binary file support and no fork/exec. – Peter Cordes Jan 06 '24 at 06:16
  • Embedding {} in the shell code introduces an arbitrary command execution vulnerability and should not be done. Try after echo > '$(reboot)' for instance. – Stéphane Chazelas Jan 06 '24 at 14:11
  • That -exec md5 -q '--check=68b329da9893e34099c7d8ad5cb9c940 {} >/dev/null' \; doesn't make sense. You'd want -exec md5 -q --check=68b329da9893e34099c7d8ad5cb9c940 {} \;. You'd want to discard md5's output, but to do that without discarding that of find, you'd need to invoke a shell like: -exec sh -c 'exec md5sum --check=... "$1" > /dev/null' sh {} \; – Stéphane Chazelas Jan 06 '24 at 14:22