5

I have a few million .jpg files and I want to generate a .jpg.webp version for each one (foo.jpg -> foo.jpg.webp). For this, I need to find all files ending in .jpg for which there’s no .jpg.webp version.

Right now, I do it like this:

find "$path" -type f -iname "*.jpg" |
  while read -r image_path; do
      if [ ! -f "$image_path.webp" ]; then
        echo "$image_path"
      fi
  done |
  # treat only 10000 files per run
  head -n 10000 |
  ...

However, because I’m using a pipe, this creates a subshell. I wonder if there is a more efficient way to do this, especially because the more WebP images I generate, the more time the script spends filtering paths to find candidates. Would there be some way to do this using find only?

I’m using Ubuntu 20.04. The files are spread across subdirectories.

bfontaine
  • It's only one subshell total, not one subshell per file. That doesn't mean it's efficient (bash's read operates one character at a time), but it's a whole lot less inefficient than more naive approaches would be. – Charles Duffy Aug 29 '23 at 17:00
  • (that said, you'd have correctness benefits from using -print0 on the find command, and making it while IFS= read -r -d '' image_path; do) – Charles Duffy Aug 29 '23 at 17:02
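
For reference, a minimal sketch of the question's loop with both of those fixes applied (head -z is GNU-specific, and whatever consumes the output would also need to handle NUL-delimited names):

find "$path" -type f -iname "*.jpg" -print0 |
  while IFS= read -r -d '' image_path; do
      # print candidates NUL-terminated so later stages can stay NUL-aware
      [ -f "$image_path.webp" ] || printf '%s\0' "$image_path"
  done |
  # treat only 10000 files per run
  head -z -n 10000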

6 Answers

8

I'd do the following:

  1. Find all suffixed (i.e., *.jpg.webp) files and put them in a sorted list with the suffix removed;
  2. Find all files without the suffix (i.e., *.jpg) and put them in a second sorted list;
  3. Compare the two lists, removing the entries that appear in the first list;
  4. Run your conversion on the "set difference" list that results.

So, something like

#!/bin/bash
# List the NUL-separated .jpg paths that have no .jpg.webp counterpart:
# comm -1 -3 keeps only the entries unique to the second (plain *.jpg) list.
comm -z -1 -3 \
   <(find -name '*.jpg.webp' -print0 | sed 's/\.webp\x0/\x0/g' | sort -z) \
   <(find -name '*.jpg'      -print0 | sort -z) \
| parallel -0 gm convert '{}' '{}.webp'

assuming you're using GraphicsMagick's gm for conversion (in my experience, much preferable to ImageMagick's convert in terms of both speed and reliability), and assuming you have GNU parallel installed (if not, xargs -0 might work just as well; see the sketch below).
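
For instance, an xargs-based variant might look like this (just a sketch; the -P "$(nproc)" parallelism level is an arbitrary choice):

comm -z -1 -3 \
   <(find -name '*.jpg.webp' -print0 | sed 's/\.webp\x0/\x0/g' | sort -z) \
   <(find -name '*.jpg'      -print0 | sort -z) \
| xargs -0 -P "$(nproc)" -I{} gm convert {} {}.webp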

terdon
7

Try something like this:

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || echo "$f" ; done' find-sh {} +

This executes sh as few times as possible (depending on how many .jpg files find finds), limited by ARG_MAX (around 2 million bytes on Linux), and avoids the need for an excruciatingly slow while read ... loop by passing all the filenames as command-line arguments. See Why is using a shell loop to process text considered bad practice? and Why is looping over find's output bad practice?

To efficiently process batches of these files, I'd redirect the output to a file and then split that into batches of 10,000 (or however many you need), e.g. with split -l 10000.
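
For example (a sketch; candidates.txt and the batch. prefix are just placeholder names):

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || echo "$f" ; done' find-sh {} + > candidates.txt

# writes batch.aa, batch.ab, ... with 10000 lines each
split -l 10000 candidates.txt batch.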

NOTE: If any of your .jpg filenames contain newlines then you will need to use NUL as the separator between them; otherwise newline is fine as the separator. To use NUL separators, replace echo "$f" with printf "%s\0" "$f". BTW, split supports NUL-separated input with -t '\0'.
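
The NUL-separated equivalent might look like this (again a sketch with placeholder names; split -t '\0' needs GNU coreutils):

find "$path" -type f -iname "*.jpg" -exec \
  sh -c 'for f; do [ -e "$f.webp" ] || printf "%s\0" "$f" ; done' find-sh {} + > candidates.nul

split -t '\0' -l 10000 candidates.nul batch.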

The script that processes the batches should read in the filenames and check again that the corresponding .jpg.webp file doesn't exist (in case one was generated after the list was created) before running whatever it needs to generate the .jpg.webp version.

If you had to use NUL as the filename separator, it would be easiest to use readarray (AKA mapfile) to read the entire batch's list into an array and iterate over the array of filenames. Or use awk or perl to process the filenames.

Actually, using an array would be better than a while-read loop, even with newlines as separators.
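
A minimal sketch of such a batch script, assuming NUL-separated input files like the ones above (mapfile -d '' needs bash 4.4 or newer; the gm convert call is only an example conversion command borrowed from another answer):

#!/bin/bash
# usage: ./process-batch.sh batch.aa
mapfile -d '' files < "$1"        # read the whole NUL-separated batch into an array

for f in "${files[@]}"; do
  # re-check, in case the .webp was generated after the list was created
  [ -e "$f.webp" ] && continue
  gm convert "$f" "$f.webp"
done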

cas
  • Ideally, they'd do any processing on the files in the in-line script that find calls, and not downstream in some pipeline. If that means the in-line script becomes too complicated, then write it as a separate script that you call with a batch of pathnames from find. – Kusalananda Aug 29 '23 at 11:28
  • i'd agree with that, but they said they had millions of jpeg files and wanted to process them in batches of 10000. The find one-liner in my answer is just to generate the initial list of all .jpg files without corresponding .jpg.webp files - they can then split up that list however they want and use it as required to generate the .jpg.webp files. – cas Aug 29 '23 at 11:33
  • The OP's original code only starts a single shell already. By starting a shell per batch of files this is strictly worse. – Charles Duffy Aug 29 '23 at 17:02
  • @CharlesDuffy huh? one shell per ARG_MAX limit is absolutely not worse than a shell while-read loop (which is probably the worst way possible of processing text). if you're talking about my comment pointing out that the OP wanted to process the files in batches of 10000, then improve your reading comprehension skills - the one-liner above doesn't do that, it generates a list of all matching files (.jpg files without a matching .jpg.webp) which the OP can split and post-process however they like. – cas Aug 29 '23 at 22:56
  • Ah. This remains worse in terms of number of shells run (in that anything equal-to-or-greater-than 1 is worse than having a guarantee of exactly 1), but you're right that read reading from a pipe is not good. I'm curious where the breakeven point is; personally, I'd rewrite to awk or Python or such. – Charles Duffy Aug 30 '23 at 00:27
  • 1. a one-liner that runs much faster and more efficiently than the original one-liner is in no way worse than the original, no matter how many shells are run. 2. if "number of shells run" is your metric, then that's a worse-than-useless metric. 3. personally, i'd use awk or perl (probably perl, something like find ... | perl -lne 'print unless -e "$_.webp"', optionally with -0 for NUL separators). python's pretty bad for jobs like this, and clumsy for one-liners. – cas Aug 30 '23 at 00:39
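
For reference, that perl one-liner with NUL separators might look like this (a sketch):

find "$path" -type f -iname '*.jpg' -print0 |
  perl -0lne 'print unless -e "$_.webp"'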