29

I have a large file foo.tar.xz that contains a lot (say 200000) of files. I figured out that this archive contains some (around 5000) files I don't want. I don't have sufficient disk space to decompress the whole thing onto my disk; additionally, I fear attributes / rights might get lost if I do so. I have enough space to host two copies of the compressed archive though. Is there a tool to remove some of the files from the archive (specified with a regex on the filename) on-the-fly, i.e. without unpacking the archive into individual files?

FUZxxl

4 Answers

33

GNU tar has a --delete option that works with archives too nowadays.

Use it like this, for example:

tar -vf yourArchive.tar --delete your/path/to/delete

Beware: It will most likely not work on any kind of magnetic tape medium. But tar has no problems working in a pipe, so you can just use a temporary tar file and overwrite the tape with that afterwards. It also won't work on compressed files, so you would need to uncompress the file.

Also, the operation will be rather slow in any case, due to the (by design) packed linear nature of tar archives.

  • 1
    It does exist, but it doesn't work with files where random access is not possible (e.g. compressed archives), and that is my use case. – FUZxxl Aug 31 '16 at 12:18
  • 1
    The other problem is that I cannot specify a pattern to delete. Note my comment from 2013 where I already address the shortcomings of gtar --delete. – FUZxxl Aug 31 '16 at 12:19
  • 6
    @FUZxxl -T works with --delete, and --wildcards allows you to use patterns rather than filenames, so create a temporary file containing the patterns and use unxz < file.tar.xz | tar --wildcards --delete -T patternfile | xz > file2.tar.xz. It won't do a full regex (if you need that, just use tar -t and build up a list of filenames to delete), just filename matching patterns. – Random832 Aug 31 '16 at 14:00
  • 1
    BEWARE: this command might corrupt your tar file. Unfortunately, it destroyed mine and I was dumb enough to not create a backup copy. I'm not sure what the reason is, but in my case, it started creating thousands of duplicates for every file. I had to SIGTERM the process because the archive grew 10x from the original size, but the data has been already lost by that moment. – noomorph Apr 10 '22 at 10:49
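The pattern-file approach from the comments above can be sketched as follows. This is a minimal illustration, not a tested recipe for any particular archive: it assumes GNU tar, the demo file names and `patterns.txt` are hypothetical, and the fallback to gzip exists only so the demo still runs where xz is not installed. The original archive is never modified; the result goes to a new file.

```shell
#!/bin/sh
# Sketch: drop members matching glob patterns from a compressed tarball
# without unpacking it to disk. Assumes GNU tar.
set -eu

# Use xz as in the question; fall back to gzip only if xz is missing.
comp=xz
command -v xz >/dev/null 2>&1 || comp=gzip

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Build a small demo archive (hypothetical file names).
mkdir -p data
echo keep > data/a.txt
echo drop > data/a.bak
echo drop > data/b.bak
tar -cf - data | "$comp" > old.tar."$comp"

# One shell-style pattern per line; these are globs, not regexes.
printf '%s\n' 'data/*.bak' > patterns.txt

# Decompress, delete matching members in the stream, recompress.
# "-f -" makes tar read the archive from stdin and write the result to stdout.
"$comp" -dc old.tar."$comp" \
  | tar -f - --delete --wildcards -T patterns.txt \
  | "$comp" > new.tar."$comp"

# List what survived in the new archive.
remaining=$("$comp" -dc new.tar."$comp" | tar -tf -)
printf '%s\n' "$remaining"
```

Patterns have to match member names exactly as `tar -t` prints them, so an archive created from `.` needs patterns that start with `./`.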
19

(Edited, as I misunderstood the question, which has since been edited as well.)

The best you can do is to decompress, delete, and recompress the entire archive.

unxz < foobar-old.tar.xz | tar --delete foo/bar | xz > foobar-new.tar.xz

It's not possible to delete files from a tar directly.

tar is a stream format, originally intended for tape drives, which do not handle random seeks well. In theory, on a disk filesystem it would be possible to punch a hole or rewrite the remainder of the file, but with compression the point is moot, as most if not all compression methods depend heavily on content that occurred earlier in the file. To do this in place you would need very detailed knowledge of both the compression method and the tar file format. That is complex enough that no one would bother with it. It's cheaper to just keep the files around and ignore them.

If you need this functionality, tar is probably not what you want.
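The pipe in this answer deletes a single path. For many files selected by a regular expression, one possible sketch is to make two streaming passes: first list the member names and filter them with `grep -E`, then feed that list to `--delete` with `-T`. It assumes GNU tar; the file names, the regex, and `delete-list.txt` are hypothetical, and gzip is used only as a fallback where xz is not installed.

```shell
#!/bin/sh
# Sketch: two streaming passes over the archive, nothing unpacked to disk.
# Assumes GNU tar.
set -eu

# Use xz as in the answer; fall back to gzip only if xz is missing.
comp=xz
command -v xz >/dev/null 2>&1 || comp=gzip

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Build a small demo archive (hypothetical file names).
mkdir -p data/sub
echo keep > data/a.txt
echo drop > data/a.bak
echo drop > data/sub/b.bak1
tar -cf - data | "$comp" > old.tar."$comp"

# Pass 1: list member names and keep those matching an extended regex.
"$comp" -dc old.tar."$comp" | tar -tf - | grep -E 'data/.*\.bak' > delete-list.txt

# Pass 2: delete exactly the listed members while recompressing.
"$comp" -dc old.tar."$comp" \
  | tar -f - --delete -T delete-list.txt \
  | "$comp" > new.tar."$comp"

# List what survived in the new archive.
remaining=$("$comp" -dc new.tar."$comp" | tar -tf -)
printf '%s\n' "$remaining"
```

Because the delete list is taken from the archive's own listing, every name passed to `--delete` is known to exist. The cost is that the archive is decompressed twice.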

frostschutz
  • Those files make up 35% of the archive's size. The restrictions you point out seemingly only apply if I rewrite the file in place, not if I modify it out-of-place, which I can do (I have enough space to save the packed archive twice). Is there such a tool? – FUZxxl Mar 21 '13 at 20:57
  • I may have misunderstood your question then. If you ARE willing to unpack the tar after all, and repack it, (just without actually creating the tarred files - i.e., a direct tar to tar pipe), it may be possible. – frostschutz Mar 21 '13 at 21:01
  • Yeah, I can do that. It's just that the files have uids/gids/attributes that I need to preserve. Also, I do not have enough disk space to save the unpacked representation. I have enough space to save two packed archives though. – FUZxxl Mar 21 '13 at 21:05
  • unxz < foobar-old.tar.xz | tar --delete foo/bar | xz > foobar-new.tar.xz – frostschutz Mar 21 '13 at 21:11
  • Does this scale? I have 5000 files which I can describe using the regex *data/.*.bak*. AFAIK, --delete only recognizes individual files. – FUZxxl Mar 21 '13 at 21:13
  • It never scales because you're still extracting and repacking the entire thing. You need a different file format for a more direct deletion ability. Naturally you want the tar --delete to delete all the files you no longer want in one go. It can delete single files or entire directories and maybe even patterns - it should operate on the same file lists the original tar command does, so there should be no issue. It will still take a long time if it's a large tar. – frostschutz Mar 21 '13 at 21:16
  • 1
    That's no problem at all. If I can do this in one pass, the time won't be too long. I can't imagine any archive format that allows for fast deletion while actually releasing storage. – FUZxxl Mar 21 '13 at 21:20
  • 1
    --wildcards helps... I had to include ./ at the start of the pattern, though... – Gert van den Berg Mar 05 '18 at 12:59
  • Not reasonable for large files. – M. Rostami Nov 02 '20 at 09:37
0

As stated in the most upvoted answer, GNU tar implements a --delete option that seems to be the solution for this.

But, citing noomorph's comment:

BEWARE: this command might corrupt your tar file. Unfortunately, it destroyed mine and I was dumb enough to not create a backup copy. I'm not sure what the reason is, but in my case, it started creating thousands of duplicates for every file. I had to SIGTERM the process because the archive grew 10x from the original size, but the data has been already lost by that moment.

This can be reproduced at least with tar version 1.30; it does not happen in version 1.34. It affects both the armhf and i386 architectures.

If you try to delete a file that does not exist inside the tar file, duplicates start to appear and the whole file can become corrupted.

If upgrading tar is not possible, a workaround is to list all the files in the tar file (--list) and check that the file exists before attempting the removal with --delete.
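The workaround can be sketched roughly as below. `safe_delete` is a hypothetical helper name and the archive contents are made up for the demo; the sketch assumes GNU tar and an uncompressed archive, and it is best run against a copy of the archive rather than the only one.

```shell
#!/bin/sh
# Sketch: only call --delete for members that --list actually reports.
# Assumes GNU tar and an uncompressed archive.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Demo archive with hypothetical member names.
echo one > foo
echo two > bar
tar -cf archive.tar foo bar

# safe_delete ARCHIVE MEMBER: delete MEMBER only if it is listed in ARCHIVE.
safe_delete() {
    archive=$1
    member=$2
    # grep -Fxq: fixed string, whole-line match, no output.
    if tar -tf "$archive" | grep -Fxq -- "$member"; then
        tar -f "$archive" --delete "$member"
    else
        echo "skipping $member: not in $archive" >&2
        return 1
    fi
}

safe_delete archive.tar foo
safe_delete archive.tar missing || true   # skipped, archive left untouched

# List what is left in the archive.
listing=$(tar -tf archive.tar)
printf '%s\n' "$listing"
```

Listing the archive before every deletion is slow for large archives, so for many members it may be cheaper to list once and filter the whole delete list against that listing.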

-7

According to the manual, you can pass a list of filenames to tar to only extract those. For example:

$ tar --file archive.tar --list
foo
bar
baz

$ tar --file archive.tar --extract foo
Michael Mrozek
  • I don't see how --extract helps me. Could you elaborate? Please keep in mind that I cannot unpack the archive (or substantial parts of it) to disk. – FUZxxl Mar 21 '13 at 21:07
  • 2
    Please do not just post links: this is a wiki--add sufficient content for it to be unnecessary for people to leave the page to understand your answer. – jasonwryan Mar 21 '13 at 22:02