I have some failed experiment result files, and their contents are exactly a single \n (newline). I would like to list them all (perhaps with something like find or grep), to know what the files are and later delete them.
Create a reference file outside of the search path (the search path will be . in the example):

echo >/tmp/reference

Now we have a known file identical to what you're looking for. Then compare all regular files under the search path (. here) to the reference file:
find . -type f -size 1c -exec cmp -s -- /tmp/reference {} \; -print
-size 1c is not necessary and can be omitted; it's only there to improve performance. It's a quick preliminary test that rejects files of the wrong size without spawning additional processes, so relatively costly cmp processes are created only for files of the right size.
-s makes cmp itself silent. We don't need its output, just the exit status.
-- is explained here: What does "--" (double-dash) mean? It's not really needed in our example case, i.e. when the reference file is specified as /tmp/reference and the search path is . (dot). I used -- in case someone carelessly chooses path(s) that would otherwise make cmp misbehave or fail; with -- it should just work.
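As a hypothetical illustration (file names made up), a path starting with a dash would be taken for options without the separator:

echo > ./-reference            # a file whose name starts with a dash
cmp -s -reference somefile     # without --, "-reference" is parsed as options and cmp fails
cmp -s -- -reference somefile  # with --, it is treated as a file name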
-exec is used as a test; it will succeed if and only if cmp returns exit status zero, and for a tested file this will happen if the file is identical to /tmp/reference. This way, find will give you the pathnames of files that are identical to the reference file.
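For instance, cmp's exit status can be used directly on its own, independently of find (somefile is a made-up name):

cmp -s -- /tmp/reference somefile && echo 'somefile matches'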
The method can be used to find files with any fixed content; you just need a reference file with the exact content (and don't forget to adjust -size … if you use it; -size "$(</tmp/reference wc -c)c" will be handy). In our specific case a simple echo was used to create the file because it prints one newline character, which is exactly the content you want to find.
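For example, to find files whose content is exactly FAILED plus a newline (a hypothetical variation on the same method):

printf 'FAILED\n' >/tmp/reference
find . -type f -size "$(</tmp/reference wc -c)c" -exec cmp -s -- /tmp/reference {} \; -print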
To make find attempt to delete each matching file, use -delete (or -exec rm -- {} +, but not both) after -print.
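For example, to print and then delete every match in one pass:

find . -type f -size 1c -exec cmp -s -- /tmp/reference {} \; -print -delete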
Search for files that are a single byte, compare them to the known value, and print and/or delete if matched:
find /path/to/files -type f -size 1c -exec sh -c 'printf "\n" | cmp -s -- - "$1"' _ {} \; -print
Optionally append -delete to delete, and remove -print if you want a silent run.
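For example, a silent deleting run would look like:

find /path/to/files -type f -size 1c -exec sh -c 'printf "\n" | cmp -s -- - "$1"' _ {} \; -delete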
With GNU grep, you can use -z to treat the entire file as a single line (-z makes grep use NUL as the line terminator, so as long as your files don't actually contain NUL, \0, it has the effect of treating the whole file as a single line). If we combine that with -l to just print the file name, and -P for PCREs so we can use \n, we can search for "lines" that only have a single \n and nothing else:
grep -lPz '^\n$' *
For example, given these three files:
printf 'foo\n' > good_file_1
printf '\n\n\n\n' > good_file_2
printf '\n' > bad_file
Running the grep above gives:
$ grep -lPz '^\n$' *
bad_file
You can also make it recursive, using the bash globstar option (from man bash):
globstar
If set, the pattern ** used in a pathname expansion context will match all files and zero or more directories and subdirectories. If the pattern is followed by a /, only directories and subdirectories match.
So, for example, in this situation:
$ mkdir -p ./some/long/path/here/
$ cp bad_file some/long/path/here/bad_file_2
$ tree
.
├── bad_file
├── good_file_1
├── good_file_2
└── some
└── long
└── path
└── here
└── bad_file_2
5 directories, 4 files
Enabling globstar and running grep on **/* will find both bad files (I am redirecting standard error because grep complains about being given directories to search instead of files; such errors are expected and can safely be ignored).
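In bash, globstar is enabled with shopt:

$ shopt -s globstar

Then: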
$ grep -lPz '^\n$' **/* 2>/dev/null
bad_file
some/long/path/here/bad_file_2
Alternatively, use find to only search files:
$ find . -type f -exec grep -lPz '^\n$' {} +
./some/long/path/here/bad_file_2
./bad_file
With zsh:
zmodload zsh/mapfile
print -rC1 -- **/*(ND.L1e[$' [[ $mapfile[$REPLY] = "\n" ]] '])
print -rC1: prints raw on 1 Column.
N: nullglob: don't complain if there's no match, and pass an empty list to print instead.
D: dotglob: don't skip hidden files.
.: regular files only (like -type f in find or file/f in rawhide).
L1: of Length 1.
e[code]: runs the code on the file to further determine if that's a match; $mapfile[$REPLY] expands to the contents of the file (whose path is in $REPLY).

POSIXly, and avoiding spawning one or more processes per file (assuming a sh implementation where read, [ and printf are builtin, which is usually the case):
find . -type f -size 1c -exec sh -c '
for file do
IFS= read -r line < "$file" && [ -z "$line" ] && printf "%s\n" "$file"
done' sh {} +
(Note that, unlike with the zsh approach above, the list is not sorted.)
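If you want the output sorted (and your file paths contain no newline characters), you can pipe it through sort:

find . -type f -size 1c -exec sh -c '
for file do
  IFS= read -r line < "$file" && [ -z "$line" ] && printf "%s\n" "$file"
done' sh {} + | sort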
With rawhide (the list is not sorted either):
rh -e 'file && size == 1 && "
".body' .
With grep implementations that can cope with non-text files (NUL bytes and non-delimited lines at least), such as GNU grep in the C locale, you can also do:
LC_ALL=C find . -type f -size 1c -exec grep -l '^$' {} +
find . -size 1c -exec sh -c '[ -z "$(< $1)" ]' sh '{}' ';' -print
This looks for files of size exactly one byte where the result of reading the file (in a shell) is empty; the shell strips trailing newlines from command substitutions.
$(<...) is a ksh operator, not a sh operator. In ksh88, $1 should be quoted. – Stéphane Chazelas Jan 05 '24 at 09:27
Just to present a novel alternative, in FreeBSD, this could be done as:
find . -maxdepth 1 -size 1c \
-exec md5 -q '--check=68b329da9893e34099c7d8ad5cb9c940 {} >/dev/null' \; -print
However, an md5 hash, even of a small file, is likely somewhat more expensive than a simple cmp.
I tried to find a way to phrase the cmp method using bash's command substitution (and BSD find), but it's a bit clunky:
find . -maxdepth 1 -size 1c -exec bash -c 'cmp -s "{}" <(echo)' \; -print
Again, it's likely slightly more expensive to create the newline file multiple times than Kamil's method of creating the reference file once and comparing against it repeatedly.
Since md5 makes fewer system calls, it could be faster, especially if it can check multiple files per invocation with find -exec md5 {} +. (BTW, GNU Coreutils md5sum doesn't have an option to supply a hash on the command line to check against. But you could get it to print the hashes for multiple files and grep that.) Hrm, a duplicate-file finder could probably be best, if there's one that lets you look for duplicates only between two sets, not within, and one set can be the reference file alone. Or perl could be fast at this, with good binary file support and no fork/exec. – Peter Cordes Jan 06 '24 at 06:16
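A rough sketch of that perl idea (mine, not from the thread): slurp each file whole and compare its contents to a single newline:

find . -type f -size 1c -exec perl -e '
  local $/;                            # slurp mode: read each file whole
  for my $f (@ARGV) {
    open my $fh, "<", $f or next;      # skip unreadable files
    print "$f\n" if <$fh> eq "\n";     # content is exactly one newline
  }' {} +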
Note that embedding {} in the bash -c code makes it a command injection vulnerability; consider a file created with echo > '$(reboot)', for instance. – Stéphane Chazelas Jan 06 '24 at 14:11
-exec md5 -q '--check=68b329da9893e34099c7d8ad5cb9c940 {} >/dev/null' \; doesn't make sense. You'd want -exec md5 -q --check=68b329da9893e34099c7d8ad5cb9c940 {} \;. You'd want to discard md5's output, but to do that without discarding that of find, you'd need to invoke a shell like: -exec sh -c 'exec md5sum --check=... "$1" > /dev/null' sh {} \; – Stéphane Chazelas Jan 06 '24 at 14:22
Putting -size 1c before the exec would probably make the whole thing faster, as it would not need to spawn cmp for every file (though of course it would need to be adapted if the target file has a different size). – jcaron Jan 05 '24 at 18:09
You could use stat or a size=$(find "$temp" -printf ...) to build an arg for -size. Or if you take the contents as a string arg, then size="${#1}c" or something, and printf "%s" "$1" > "$tmp" from mktemp or something. – Peter Cordes Jan 06 '24 at 06:08
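Combining those suggestions, a hypothetical wrapper (the function name is made up) that derives -size from the reference file before spawning any cmp:

# list files under directory $2 whose contents equal those of reference file $1
same_content() {
  find "$2" -type f -size "$(< "$1" wc -c)c" -exec cmp -s -- "$1" {} \; -print
}

same_content /tmp/reference .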