In any case, you need to uncompress and recompress, but you don't need to write the intermediate results to disk. So:
set -o noclobber
for file in *.fastq.gz; do
  gzip -d < "$file" | dos2unix | pigz > "$file.new" &&
    mv -- "$file.new" "$file"
done
Here we use pigz to leverage more than one processor at a time for the compression.
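If you want to cap how many threads pigz uses (for instance to leave CPUs free for other work), its -p option takes a thread count. A minimal variant of the loop above, assuming pigz and a POSIX touch are available:

set -o noclobber
for file in *.fastq.gz; do
  # -p 4 limits pigz to 4 compression threads; touch -r copies the
  # original file's modification time onto the recompressed copy
  # before it replaces the original.
  gzip -d < "$file" | dos2unix | pigz -p 4 > "$file.new" &&
    touch -r "$file" -- "$file.new" &&
    mv -- "$file.new" "$file"
done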
You could also use GNU xargs' -P option to run several of those pipelines in parallel:
printf '%s\0' *.fastq.gz |
  xargs -r0 -P4 -n1 sh -o noclobber -c '
    gzip -d < "$1" | dos2unix | gzip > "$1.new" &&
      mv -- "$1.new" "$1"' sh
Here we run 4 in parallel. Note that gzip -d, dos2unix and gzip in each pipeline already run in parallel (with gzip likely being the pipeline's bottleneck). Running more pipelines in parallel than you have CPUs is likely to degrade performance. If you have fast CPUs and/or slow storage, you might also find that I/O is the bottleneck and that running more than one pipeline in parallel already degrades performance.
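If you'd rather match the number of parallel pipelines to the number of CPUs automatically, GNU coreutils' nproc can supply the count; a sketch assuming nproc is available (otherwise hard-code a number as above):

printf '%s\0' *.fastq.gz |
  xargs -r0 -P"$(nproc)" -n1 sh -o noclobber -c '
    gzip -d < "$1" | dos2unix | gzip > "$1.new" &&
      mv -- "$1.new" "$1"' sh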
While you're recompressing anyway, it may be a good opportunity to switch to a different compression algorithm that better suits your needs: one that compresses better or faster, lets you access the data randomly in independent blocks, or decompresses faster...
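For instance, zstd typically compresses faster and decompresses much faster than gzip at comparable ratios. A sketch assuming the zstd tool is installed (note the extension changes from .gz to .zst, so anything expecting .gz files would need adjusting):

set -o noclobber
for file in *.fastq.gz; do
  # -T0 lets zstd use all available cores; the original .gz is only
  # removed once the .zst has been written successfully.
  gzip -d < "$file" | dos2unix | zstd -T0 > "${file%.gz}.zst" &&
    rm -- "$file"
done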
gzip is compressed data, which means you'll have to uncompress it to get the data you want to modify, modify it, and recompress it. You may be able to do this without a tempfile by running all the commands in a pipeline, but there's no way around the decompression/recompression issue. – Shadur-don't-feed-the-AI Sep 15 '22 at 08:58

The ls part is redundant since *.fastq.gz would already glob to the full list, but in this case the optimization from removing it would be minimal. 600 gig of data is 600 gig of data, and because gzip functions as a stream there's very little you might be able to accomplish with multiple processes. – Shadur-don't-feed-the-AI Sep 15 '22 at 09:15