In any case, you need to uncompress and recompress, but you don't need to write the intermediate results to disk. So:
set -o noclobber
for file in *.fastq.gz; do
  gzip -d < "$file" | dos2unix | pigz > "$file.new" &&
    mv -- "$file.new" "$file"
done
Here we use pigz to leverage more than one processor at a time for the compression.
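If you want to cap how many threads pigz uses (for instance to leave CPUs free for other work), its -p option takes a thread count. A minimal variant of the loop above, assuming pigz and a POSIX touch are available:

set -o noclobber
for file in *.fastq.gz; do
  # -p 4 limits pigz to 4 compression threads; touch -r copies the
  # original file's modification time onto the recompressed copy
  # before it replaces the original.
  gzip -d < "$file" | dos2unix | pigz -p 4 > "$file.new" &&
    touch -r "$file" -- "$file.new" &&
    mv -- "$file.new" "$file"
done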
You could also use GNU xargs' -P option to run several of those pipelines in parallel:
printf '%s\0' *.fastq.gz |
  xargs -r0 -P4 -n1 sh -o noclobber -c '
    gzip -d < "$1" | dos2unix | gzip > "$1.new" &&
      mv -- "$1.new" "$1"' sh
Here we run 4 in parallel. Note that gzip -d, dos2unix and gzip in each pipeline already run in parallel (with gzip likely being the pipeline's bottleneck). Running more pipelines in parallel than you have CPUs is likely to degrade performance. If you have fast CPUs and/or slow storage, you might also find that I/O is the bottleneck and that running more than one pipeline in parallel already degrades performance.
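If you'd rather match the number of parallel pipelines to the number of CPUs automatically, GNU coreutils' nproc can supply the count; a sketch assuming nproc is available (otherwise hard-code a number as above):

printf '%s\0' *.fastq.gz |
  xargs -r0 -P"$(nproc)" -n1 sh -o noclobber -c '
    gzip -d < "$1" | dos2unix | gzip > "$1.new" &&
      mv -- "$1.new" "$1"' sh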
While you're recompressing anyway, it may be a good opportunity to switch to a different compression algorithm that better suits your needs: one that compresses better or faster, lets you access the data randomly in independent blocks, or decompresses faster...
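For instance, zstd typically compresses faster and decompresses much faster than gzip at comparable ratios. A sketch assuming the zstd tool is installed (note the extension changes from .gz to .zst, so anything expecting .gz files would need adjusting):

set -o noclobber
for file in *.fastq.gz; do
  # -T0 lets zstd use all available cores; the original .gz is only
  # removed once the .zst has been written successfully.
  gzip -d < "$file" | dos2unix | zstd -T0 > "${file%.gz}.zst" &&
    rm -- "$file"
done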
gzip is compressed data, which means you'll have to uncompress it to get the data you want to modify, modify it, and recompress it. You may be able to do this without a tempfile by running all the commands in a pipeline, but there's no way around the decompression/recompression issue. – Shadur-don't-feed-the-AI Sep 15 '22 at 08:58

The ls part is redundant since *.fastq.gz would already glob to the full list, but in this case the optimization from removing it would be minimal. 600 gig of data is 600 gig of data, and because gzip functions as a stream there's very little you might be able to accomplish with multiple processes. – Shadur-don't-feed-the-AI Sep 15 '22 at 09:15