Don't discount the easy way: it may be fast enough for your purpose. With AVFS, you can access the archive as a directory:
cd ~/.avfs/path/to/original.tar.gz\#
pax -w -s '/^.*\.jpg$//' . | gzip >/path/to/filtered.tar.gz # POSIX
tar -czf /path/to/filtered.tar.gz -s '/^.*\.jpg$//' . # BSD
tar -czf /path/to/filtered.tar.gz --exclude='*.jpg' . # GNU
With more primitive tools, first extract everything except the .jpg files, then create a new archive:
mkdir tmpdir && cd tmpdir
<../original.tar.gz gzip -d | pax -r -pe -s '/^.*\.jpg$//'
pax -w . | gzip >../filtered.tar.gz
cd .. && rm -rf tmpdir
If your tar has --exclude:
mkdir tmpdir && cd tmpdir
tar -xzf ../original.tar.gz --exclude='*.jpg'
tar -czf ../filtered.tar.gz .
cd .. && rm -rf tmpdir
This may however mangle file ownership and modes if you don't run it as root. For best results, use a temporary directory on a fast filesystem — tmpfs if you have one that's large enough.
Support for archivers acting as a pass-through (i.e. reading an archive and writing an archive) tends to be limited. GNU tar can delete members from an archive with the --delete operation (“The --delete option has been reported to work properly when tar acts as a filter from stdin to stdout.”), and that's probably your best option.
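As a rough illustration of that filter mode, here is a minimal sketch that drives GNU tar from Python. It assumes GNU tar is installed as tar and that gzip is on the PATH; the function names are hypothetical, and the --wildcards option is used so that '*.jpg' is treated as a pattern rather than a literal member name.

```python
import subprocess

def gnu_tar_delete_command(pattern):
    """Build a GNU tar command line that deletes matching members (assumes GNU tar)."""
    # '-f -' makes tar read the archive from stdin and write the result to stdout.
    return ['tar', '--delete', '--wildcards', '-f', '-', pattern]

def filter_archive(src_path, dst_path, pattern='*.jpg'):
    """Run: gzip -d <src | tar --delete ... | gzip >dst. Return True if all steps succeed."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        gunzip = subprocess.Popen(['gzip', '-d'], stdin=src, stdout=subprocess.PIPE)
        tar = subprocess.Popen(gnu_tar_delete_command(pattern),
                               stdin=gunzip.stdout, stdout=subprocess.PIPE)
        gzip = subprocess.Popen(['gzip'], stdin=tar.stdout, stdout=dst)
        # Close our copies of the pipe ends so each stage sees EOF or a broken pipe.
        gunzip.stdout.close()
        tar.stdout.close()
        codes = [proc.wait() for proc in (gzip, tar, gunzip)]
    return all(code == 0 for code in codes)
```

Calling filter_archive('original.tar.gz', 'filtered.tar.gz') would then never unpack anything to disk; the archive only flows through the three processes.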
You can make powerful archive filters in a few lines of Python. Its tarfile library can read from and write to non-seekable streams, and you can use arbitrary Python code to filter, rename, modify…
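#!/usr/bin/env python3
import re, sys, tarfile

# Stream modes ('r|*', 'w|gz') work on pipes; tar data is binary, so use .buffer.
source = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
dest = tarfile.open(fileobj=sys.stdout.buffer, mode='w|gz')
for member in source:
    # Drop regular files named *.jpg; copy every other member unchanged.
    if not (member.isreg() and re.match(r'.*\.jpg\Z', member.name)):
        sys.stderr.write(member.name + '\n')
        # Only regular files carry data; extractfile() raises on links in stream mode.
        data = source.extractfile(member) if member.isreg() else None
        dest.addfile(member, data)
dest.close()
source.close()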
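To illustrate the "rename" part as well, the same loop can be wrapped in a function. This is only a sketch: filter_tar and its skip and rename parameters are hypothetical names, not part of tarfile. One detail worth noting is that a member read from a pax-format archive may carry a stored 'path' header, which takes priority over member.name on output, so the sketch drops it when renaming.

```python
import io
import re
import tarfile

def filter_tar(source_fileobj, dest_fileobj, skip=r'.*\.jpg\Z', rename=None):
    """Copy a tar stream, dropping regular files matching `skip`.

    `rename`, if given, is a function mapping each kept member's name to a new name.
    """
    with tarfile.open(fileobj=source_fileobj, mode='r|*') as source, \
         tarfile.open(fileobj=dest_fileobj, mode='w|gz') as dest:
        for member in source:
            if member.isreg() and re.match(skip, member.name):
                continue  # filtered out
            # In stream mode the data must be consumed before moving to the next member.
            data = source.extractfile(member) if member.isreg() else None
            if rename is not None:
                member.name = rename(member.name)
                # A stored pax 'path' header would otherwise override the new name.
                member.pax_headers.pop('path', None)
            dest.addfile(member, data)
```

For command-line use it could be called as filter_tar(sys.stdin.buffer, sys.stdout.buffer, rename=lambda name: 'backup/' + name), after importing sys.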