5

I have a few million .jpg files and I want to generate a .jpg.webp version for each one (foo.jpg -> foo.jpg.webp). For this, I need to find all files ending in .jpg for which there’s no .jpg.webp version.

Right now, I do it like this:

find "$path" -type f -iname "*.jpg" |
  while read -r image_path; do
      if [ ! -f "$image_path.webp" ]; then
        echo "$image_path"
      fi
  done |
  # treat only 10000 files per run
  head -n 10000 |
  ...

However, because I’m using a pipe, this creates a subshell. I wonder if there is a more efficient way to do this, especially because the more WebP images I generate, the more time the script spends filtering paths to find candidates. Would there be some way to do this using find only?

I’m using Ubuntu 20.04. The files are spread across subdirectories.

bfontaine
  • It's only one subshell total, not one subshell per file. That doesn't mean it's efficient (bash's read operates one character at a time), but it's a whole lot less inefficient than more naive approaches would be. – Charles Duffy Aug 29 '23 at 17:00
  • (that said, you'd have correctness benefits from using -print0 on the find command, and making it while IFS= read -r -d '' image_path; do) – Charles Duffy Aug 29 '23 at 17:02
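
For reference, a minimal sketch of the question's loop with both of those fixes applied (head -z is GNU-specific, and whatever consumes the output would also need to handle NUL-delimited names):

find "$path" -type f -iname "*.jpg" -print0 |
  while IFS= read -r -d '' image_path; do
      # print candidates NUL-terminated so later stages can stay NUL-aware
      [ -f "$image_path.webp" ] || printf '%s\0' "$image_path"
  done |
  # treat only 10000 files per run
  head -z -n 10000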

6 Answers

8

I'd do the following:

  1. Find all suffixed (i.e., *.jpg.webp) files and put them in a sorted list with the suffix removed;
  2. Find all files without the suffix (i.e., *.jpg) and put them in a second sorted list;
  3. Compare the two lists, removing the entries that appear in the first list;
  4. Run your conversion on the "set difference" list that results.

So, something like

#!/bin/bash
# List the NUL-separated .jpg paths that have no .jpg.webp counterpart:
# comm -1 -3 keeps only the entries unique to the second (plain *.jpg) list.
comm -z -1 -3 \
   <(find -name '*.jpg.webp' -print0 | sed 's/\.webp\x0/\x0/g' | sort -z) \
   <(find -name '*.jpg'      -print0 | sort -z) \
| parallel -0 gm convert '{}' '{}.webp'

assuming you're using GraphicsMagick's gm for conversion (in my experience, much preferable to ImageMagick's convert in terms of both speed and reliability), and assuming you have GNU parallel installed (if not, xargs -0 might work just as well; see the sketch below).
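
For instance, an xargs-based variant might look like this (just a sketch; the -P "$(nproc)" parallelism level is an arbitrary choice):

comm -z -1 -3 \
   <(find -name '*.jpg.webp' -print0 | sed 's/\.webp\x0/\x0/g' | sort -z) \
   <(find -name '*.jpg'      -print0 | sort -z) \
| xargs -0 -P "$(nproc)" -I{} gm convert {} {}.webp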

terdon
7

Try something like this:

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || echo "$f" ; done' find-sh {} +

This executes sh as few times as possible (depending on how many .jpg files find finds), limited by ARG_MAX (around 2 million bytes on Linux), and avoids the need for an excruciatingly slow while read ... loop by passing all the filenames as command-line arguments. See Why is using a shell loop to process text considered bad practice? and Why is looping over find's output bad practice?

To efficiently process batches of these files, I'd redirect the output to a file and then split that into batches of 10,000 (or however many you need), e.g. with split -l 10000.
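
For example (a sketch; candidates.txt and the batch. prefix are just placeholder names):

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || echo "$f" ; done' find-sh {} + > candidates.txt

# writes batch.aa, batch.ab, ... with 10000 lines each
split -l 10000 candidates.txt batch.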

NOTE: If any of your .jpg filenames contain newlines then you will need to use NUL as the separator between them; otherwise newline is fine as the separator. To use NUL separators, replace echo "$f" with printf "%s\0" "$f". BTW, split supports NUL-separated input with -t '\0'.
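
The NUL-separated equivalent might look like this (again a sketch with placeholder names; split -t '\0' needs GNU coreutils):

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || printf "%s\0" "$f" ; done' find-sh {} + > candidates.nul

split -t '\0' -l 10000 candidates.nul batch.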

The script that processes the batches should read in the filenames and check again that the corresponding .jpg.webp file doesn't exist (in case one was generated after the list was created) before running whatever it needs to generate the .jpg.webp version.

If you had to use NUL as the filename separator, it would be easiest to use readarray (AKA mapfile) to read the entire batch's list into an array and iterate over the array of filenames. Or use awk or perl to process the filenames.

Actually, using an array would be better than a while-read loop, even with newlines as separators.
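
A minimal sketch of such a batch script, assuming NUL-separated input files like the ones above (mapfile -d '' needs bash 4.4 or newer; the gm convert call is only an example conversion command borrowed from another answer):

#!/bin/bash
# usage: ./process-batch.sh batch.aa
mapfile -d '' files < "$1"        # read the whole NUL-separated batch into an array

for f in "${files[@]}"; do
  # re-check, in case the .webp was generated after the list was created
  [ -e "$f.webp" ] && continue
  gm convert "$f" "$f.webp"
done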

cas
  • Ideally, they'd do any processing on the files in the in-line script that find calls, and not downstream in some pipeline. If that means the in-line script becomes too complicated, then write it as a separate script that you call with a batch of pathnames from find. – Kusalananda Aug 29 '23 at 11:28
  • i'd agree with that, but they said they had millions of jpeg files and wanted to process them in batches of 10000. The find one-liner in my answer is just to generate the initial list of all .jpg files without corresponding .jpg.webp files - they can then split up that list however they want and use it as required to generate the .jpg.webp files. – cas Aug 29 '23 at 11:33
  • The OP's original code only starts a single shell already. By starting a shell per batch of files this is strictly worse. – Charles Duffy Aug 29 '23 at 17:02
  • @CharlesDuffy huh? one shell per ARG_MAX limit is absolutely not worse than a shell while-read loop (which is probably the worst way possible of processing text). if you're talking about my comment pointing out that the OP wanted to process the files in batches of 10000, then improve your reading comprehension skills - the one-liner above doesn't do that, it generates a list of all matching files (.jpg files without a matching .jpg.webp) which the OP can split and post-process however they like. – cas Aug 29 '23 at 22:56
  • Ah. This remains worse in terms of number of shells run (in that anything equal-to-or-greater-than 1 is worse than having a guarantee of exactly 1), but you're right that read reading from a pipe is not good. I'm curious where the breakeven point is; personally, I'd rewrite to awk or Python or such. – Charles Duffy Aug 30 '23 at 00:27
  • 1. a one-liner that runs much faster and more efficiently than the original one-liner is in no way worse than the original, no matter how many shells are run. 2. if "number of shells run" is your metric, then that's a worse-than-useless metric. 3. personally, i'd use awk or perl (probably perl, something like find ... | perl -lne 'print unless -e "$_.webp"', optionally with -0 for NUL separators). python's pretty bad for jobs like this, and clumsy for one-liners. – cas Aug 30 '23 at 00:39
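
For reference, that perl one-liner with NUL separators might look like this (a sketch):

find "$path" -type f -iname '*.jpg' -print0 |
  perl -0lne 'print unless -e "$_.webp"'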