16

Assume I have a gzip-compressed tarball compressedArchive.tgz (100+ files, totaling 5+ GB).

What would be the fastest way to remove all entries matching a given filename pattern, for example prefix*.jpg, and then store the remainder in a gzipped tarball again?

Whether the old archive is replaced or a new one is created is not important; whichever is fastest.

Aksel Willgert

6 Answers

16

With GNU tar, you can do:

pigz -d < file.tgz |
  tar --delete --wildcards -f - '*/prefix*.jpg' |
  pigz > newfile.tgz

With bsdtar:

pigz -d < file.tgz |
  bsdtar -cf - --exclude='*/prefix*.jpg' @- |
  pigz > newfile.tgz

(pigz being the multi-threaded version of gzip).

You could overwrite the file in place like this:

{ pigz -d < file.tgz |
    tar --delete --wildcards -f - '*/prefix*.jpg' |
    pigz &&
    perl -e 'truncate STDOUT, tell STDOUT'
} 1<> file.tgz

But that's quite risky, especially if the result ends up being less compressed than the original file (in which case, the second pigz may end up overwriting areas of the file which the first one has not read yet).
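Since the question says a new archive is acceptable, a lower-risk alternative is to write the result to a temporary file in the same directory and rename it over the original only once it is complete. The following is a minimal sketch of that idea in Python, not a drop-in replacement for the pipeline above; `replace_atomically` and `write_new` are hypothetical names chosen for illustration, and the approach assumes there is enough free space for a second copy of the archive.

```python
import os
import shutil
import tempfile


def replace_atomically(path, write_new):
    """Call write_new(fileobj) on a temporary file, then rename it over path.

    The original file is left untouched if write_new raises.
    """
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_new(tmp)
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)              # discard the partial result
        raise
```

Here `write_new` would be whatever produces the filtered archive, for example a function that runs a tarfile-based filter and writes to the file object it is given. If it fails halfway, the old archive is still intact.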

  • thanks for the answer, upvoted. will run benchmark next week to see which one performs better for my archive and system and accept that. – Aksel Willgert Jun 21 '13 at 19:34
10

Don't discount the easy way: it may be fast enough for your purpose. With avfs to access the archive as a directory:

cd ~/.avfs/path/to/original.tar.gz\#
pax -w -s '/^.*\.jpg$//' . | gzip >/path/to/filtered.tar.gz      # POSIX
tar -czf /path/to/filtered.tar.gz -s '/^.*\.jpg$//' .            # BSD
tar -czf /path/to/filtered.tar.gz --exclude='*.jpg' .            # GNU

With more primitive tools, first extract the files excluding the .jpg files, then create a new archive.

mkdir tmpdir && cd tmpdir
<../original.tar.gz gzip -d | pax -r -pe -s '/^.*\.jpg$//'
pax -w . | gzip >../filtered.tar.gz
cd .. && rm -rf tmpdir

If your tar has --exclude:

mkdir tmpdir && cd tmpdir
tar -xzf ../original.tar.gz --exclude='*.jpg'
tar -czf ../filtered.tar.gz .
cd .. && rm -rf tmpdir

This may however mangle file ownership and modes if you don't run it as root. For best results, use a temporary directory on a fast filesystem — tmpfs if you have one that's large enough.

Support for archivers acting as a pass-through (i.e. reading an archive and writing an archive) tends to be limited. GNU tar can delete members from an archive with the --delete operation (“The --delete option has been reported to work properly when tar acts as a filter from stdin to stdout.”), and that's probably your best option.

You can make powerful archive filters in a few lines of Python. Its tarfile library can read and write from non-seekable streams, and you can use arbitrary code in Python to filter, rename, modify…

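#!/usr/bin/env python3
import re, sys, tarfile
# Use the binary streams underlying stdin/stdout (required on Python 3).
source = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
dest = tarfile.open(fileobj=sys.stdout.buffer, mode='w|gz')
for member in source:
    # Skip regular files whose name ends in .jpg; copy everything else.
    if not (member.isreg() and re.match(r'.*\.jpg\Z', member.name)):
        sys.stderr.write(member.name + '\n')
        # Only regular files carry data; links cannot be opened in stream mode.
        data = source.extractfile(member) if member.isreg() else None
        dest.addfile(member, data)
dest.close()
source.close()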
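The same streaming approach can be adapted to the question's prefix*.jpg pattern. The sketch below is one possible variant, assuming shell-style matching against each member's base name is what is wanted; `filter_tar_stream` is a hypothetical helper name, not part of tarfile. It only requests a data stream for regular files, because tarfile raises an error when asked to open a symlink or hard link as a file object in stream mode.

```python
import fnmatch
import posixpath
import tarfile


def filter_tar_stream(src, dst, pattern):
    """Copy a (possibly compressed) tar stream from src to dst as gzip.

    Regular files whose base name matches the shell-style pattern are
    dropped. Returns the list of dropped member names.
    """
    dropped = []
    with tarfile.open(fileobj=src, mode="r|*") as source, \
            tarfile.open(fileobj=dst, mode="w|gz") as dest:
        for member in source:
            base = posixpath.basename(member.name)
            if member.isreg() and fnmatch.fnmatchcase(base, pattern):
                dropped.append(member.name)
                continue
            # Directories, symlinks and hard links have no data to copy.
            data = source.extractfile(member) if member.isreg() else None
            dest.addfile(member, data)
    return dropped
```

Used as a pipeline filter, it could be called as `filter_tar_stream(sys.stdin.buffer, sys.stdout.buffer, 'prefix*.jpg')` after `import sys`. Because the member headers are copied as-is, ownership and modes recorded in the archive are carried over without extracting anything to disk.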
  • It would also mangle uid/usernames if run as root unless it is done on a machine that has the same uid <=> username mapping as the one where the tar file was initially created. ACLs, extended attributes may be affected as well. With tar, you may want to add the p option. – Stéphane Chazelas Mar 19 '19 at 16:15
2

With the tar that comes with Mac OS X, you could do this:

tar -czf b.tgz --exclude '*.jpg' @a.tgz
mv b.tgz a.tgz
Jake
1

I use:

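tar -xvf myLarge.gz --exclude '*prefix*' | tar -czvf myLarge.new.gz --no-recursion -T - && mv myLarge.new.gz myLarge.gz

This will:

  1. Extract all files except those whose names contain "prefix", printing the name of each extracted entry
  2. (-T -) Pipe those names to a second tar, which compresses them into a new archive that then replaces myLarge.gz (writing straight to myLarge.gz could truncate it before it has been read)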
Cyborg
1

To do this, you probably have to extract all the contents of the .tgz file into a local directory, delete the files you do not want, and then recompress the .tgz.

It takes a while and you need sufficient free disk space, but to the best of my knowledge there is no other way to do it.

Given that you already have some path like /tmpdir/withalotofspace that has sufficient free space (check it using df -h /tmpdir/withalotofspace), you can do something like this:

$ cd /tmpdir/withalotofspace
$ tar -xvzf /path/to/compressedArchive.tgz
$ find /tmpdir/withalotofspace/ -type f -iname '*.jpg' -delete
$ tar -cvzf /path/to/purgedcompressedArchive.tgz .
DavAlPi
0

I like the answer by @Gilles, but it can be simplified further. After decompressing the archive, for example with gunzip foo.tgz, the file becomes foo.tar, and entries can be removed with tar -f foo.tar --delete file|directory. Recompress afterwards with gzip foo.tar. Below is an example of removing a directory from a tar file.

    phablet@ubuntu-phablet:~/Downloads$ tar -cvf moo.tar moo1/
    moo1/
    moo1/moo2/
    moo1/moo2/moo3/
    moo1/moo2/moo3/moo4/
    moo1/moo2/moo3/moo4/moo5/
    phablet@ubuntu-phablet:~/Downloads$ tar -tf moo.tar 
    moo1/
    moo1/moo2/
    moo1/moo2/moo3/
    moo1/moo2/moo3/moo4/
    moo1/moo2/moo3/moo4/moo5/
    phablet@ubuntu-phablet:~/Downloads$ tar -f moo.tar --delete "moo1/moo2/moo3"
    phablet@ubuntu-phablet:~/Downloads$ tar -tf moo.tar 
    moo1/
    moo1/moo2/

Specific file types can be found with tar -tf foo.tar | grep -Ei '\.jpg$'.
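For previewing what would be removed from a script rather than the shell, the same listing can be sketched with Python's tarfile module. This is an illustrative helper only; `list_matching` is a hypothetical name, and it assumes a case-insensitive suffix match is what is wanted, like the -i flag above.

```python
import tarfile


def list_matching(archive_path, suffix=".jpg"):
    """Return names of regular files in the archive ending with suffix.

    The comparison is case-insensitive. Compressed archives are detected
    automatically by the 'r:*' mode.
    """
    suffix = suffix.lower()
    with tarfile.open(archive_path, mode="r:*") as archive:
        return [m.name for m in archive
                if m.isreg() and m.name.lower().endswith(suffix)]
```

For example, `list_matching('foo.tar')` returns the .jpg entries, which can be reviewed before passing them to tar --delete.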