2

I'm interested in securely removing a git repository in a reasonable amount of time.

But it takes quite a while to do so. Here, I have a small test repo where the .git folder is < 5MiB.

$ du -ac ~/tmp/.git | tail -1
4772    total
$ find ~/tmp/.git -type f | wc -l
991

Using shred's default options, this takes quite a long time. In the commands below I use --force to change permissions where needed and --zero to overwrite with zeros after shredding. The default shredding method is to overwrite with random data three times (-n3).

I also want to remove the files afterwards. According to man shred, --remove=wipesync (the default, when --remove is used) only operates on directories, but this seems to slow me down even when I operate only on files. Compare (each time I reinitialized the git repo):

$ time find ~/tmp/.git -type f | xargs shred --force --zero --remove=wipesync
real    8m18.626s
user    0m0.097s
sys     0m1.113s

$ time find ~/tmp/.git -type f | xargs shred --force --zero --remove=wipe
real    0m45.224s
user    0m0.057s
sys     0m0.473s


$ time find ~/tmp/.git -type f | xargs shred --force --zero -n1 --remove=wipe
real    0m33.605s
user    0m0.030s
sys     0m0.110s

Is there a better way to do it?


EDIT: Yes, encryption is the key. I'm now just adding two more benchmarks using -n0.

$ time find ~/tmp/.git -type f | xargs shred --force --zero -n0 --remove=wipe
real    0m32.907s
user    0m0.020s
sys     0m0.333s

Using 64 parallel shreds:

$ time find ~/tmp/.git -type f | parallel -j64 shred --force --zero -n0 --remove=wipe
real    0m3.257s
user    0m1.067s
sys     0m1.043s
Sebastian
  • If the problem is really with the amount of files, a git repository can be repacked using git-repack, which collects loose object files into pack files. – ptman Jul 09 '14 at 09:06
  • are you sure? I just ran git repack (without any options) in this test repo and the number of files in the .git folder increased from 684 to 688. – Sebastian Jul 09 '14 at 09:42
  • I usually use git repack -a -d -g and looking at the documentation reveals that -a repacks everything into a new big pack, -d deletes old unneeded packs, and -f recomputes deltas instead of reusing deltas from old packs. – ptman Jul 09 '14 at 09:59
  • yes, please correct the spell error in the command (-g instead of -f). This reduces the file count to 42. A coincidence? ;-) man git-repack says: Packs are used to reduce the load on mirror systems, backup engines, disk storage, etc.. Why do you use it? – Sebastian Jul 09 '14 at 10:04

3 Answers

5

Forget about shred: it spends a lot of time doing useless things and misses the essential point.

shred wipes files by overwriting them with random data in multiple passes (a “Gutmann wipe”), because with the disk technologies of 20–30 years ago and some expensive laboratory equipment, it was possible (at least in theory) to recover overwritten data. This is no longer the case with modern disk technologies: overwriting just once with zeroes is just as good, but the idea of multiple random passes stuck around well after it had become obsolete. See https://security.stackexchange.com/questions/10464/why-is-writing-zeros-or-random-data-over-a-hard-drive-multiple-times-better-th

On the other hand, shred utterly fails at wiping sensitive information, because it only wipes the data in the files that it is told to erase. Any data that was stored in previously deleted files may still be recoverable by accessing the disk directly instead of via the filesystem. Data from a git tree may not be very easy to reconstruct, but nevertheless this is a realistic threat.

To be able to quickly wipe some data, encrypt it. You can use ecryptfs (home directory encryption), or encfs (encryption of a directory tree), or dm-crypt (whole-partition encryption), or any other method. To wipe the data, just wipe the key.
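To make this concrete, here is a minimal sketch using dm-crypt in plain mode (no on-disk header), keyed from a small random key file; all paths, names and sizes below are just examples, not a recommendation of specific parameters. Once the 64-byte key file has been destroyed, everything inside the container is unreadable, no matter how many copies git scattered around in it.

# Minimal sketch: throwaway dm-crypt container in plain mode (example paths/names).
truncate --size 100M ~/repo.img
dd if=/dev/urandom of=~/repo.key bs=64 count=1
LOOPDEV=$(sudo losetup --show -f ~/repo.img)
sudo cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 512 \
    --key-file ~/repo.key "$LOOPDEV" repo_crypt
sudo mkfs.ext4 -q /dev/mapper/repo_crypt
mkdir -p ~/repo
sudo mount /dev/mapper/repo_crypt ~/repo
sudo chown "$USER" ~/repo
# ... clone and work on the repository somewhere under ~/repo ...

# "Wiping" now means destroying the key, not overwriting the data:
sudo umount ~/repo
sudo cryptsetup close repo_crypt
sudo losetup -d "$LOOPDEV"
shred --zero -u ~/repo.key    # a single 64-byte file, gone in an instant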

See also How can I be sure that a directory or file is actually deleted?

2

I think that since there are so many small files in a Git repository, it's going to take a long time for shred to delete them and overwrite the old data. I would suggest doing something a little different and using tmpfs to store the data in RAM. Then, when you are done, you simply unmount it, and you don't have to worry about the data ending up on your physical storage (as long as the tmpfs pages are never swapped out to an unencrypted swap device).

bash $ mkdir $REPO_NAME
bash $ sudo mount -o uid=$YOUR_USERNAME,gid=$YOUR_GROUP_NAME,size=100m \
> -t tmpfs tmpfs $REPO_NAME
bash $ git clone git://$GIT_SERVER/$REPO_PATH/${REPO_NAME}.git

And when you are done:

bash $ sudo umount $REPO_NAME
bash $ rmdir $REPO_NAME

Another option, which will persist through reboots and power failures, is to create a disk image file with a filesystem on it. When you are done, shred the image file. That should take less time than running shred on all the small files in the repo. You can save this as two shell scripts. (You will need to edit them to make them work with your actual repo.)

Note: We are using reiserfs because it doesn't create a lost+found directory automatically like the ext filesystems do. You can use btrfs or any other filesystem you see fit, but the syntax for creating it or mounting it may not be exactly the same.

create_repo.sh

#!/bin/bash
# Set IMAGE_FILE, REPO_NAME, GIT_SERVER, REPO_PATH, YOUR_USERNAME and
# YOUR_GROUP_NAME before running.
truncate --size 100M "$IMAGE_FILE"        # sparse 100 MiB image file
/sbin/mkfs.reiserfs -fq "$IMAGE_FILE"     # no lost+found, so the mount point stays empty
mkdir -p "$REPO_NAME"                     # mount point
sudo mount -t reiserfs -o loop "$IMAGE_FILE" "$REPO_NAME"
sudo chown "$YOUR_USERNAME:$YOUR_GROUP_NAME" "$REPO_NAME"
git clone "git://$GIT_SERVER/$REPO_PATH/${REPO_NAME}.git"

shred_repo.sh

#!/bin/bash
# Uses the same variables as create_repo.sh.
sudo umount "$REPO_NAME"    # flush and detach the loop-mounted image
rmdir "$REPO_NAME"          # remove the now-empty mount point
shred "$IMAGE_FILE"         # overwrite the single image file
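For concreteness, the variables used by the two scripts could be exported like this before running them; all values below are hypothetical placeholders:

# Hypothetical example values; adjust to your own server, paths and repository.
export IMAGE_FILE=~/repo.img
export REPO_NAME=myrepo
export GIT_SERVER=git.example.com
export REPO_PATH=projects
export YOUR_USERNAME=$(id -un)
export YOUR_GROUP_NAME=$(id -gn)

./create_repo.sh    # create the image, mount it and clone the repo
# ... work on the repository ...
./shred_repo.sh     # unmount the image and overwrite it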
hololeap
  • good idea, but then the local repository doesn't survive reboots/ power outages, right? – Sebastian Jul 07 '14 at 10:02
  • Right. The computer would have to be on the whole time. I will edit my answer to add another option that will persist. – hololeap Jul 07 '14 at 10:05
  • Calling it a swap file just confuses things, it's simply a disk image. – Pavel Šimerda Jul 08 '14 at 04:52
  • Also it would be better to use bs=1M count=1 seek=99 to avoid having to actually write the whole file. – Pavel Šimerda Jul 08 '14 at 04:54
  • And I just found an even better way, truncate --size 100M filename, also using just coreutils. – Pavel Šimerda Jul 08 '14 at 05:08
  • And there's fallocate, see http://stackoverflow.com/questions/257844/quickly-create-a-large-file-on-a-linux-system/11779492#11779492 – Pavel Šimerda Jul 08 '14 at 05:15
  • Yeah, it was late when I wrote that. I'm not really sure why I called it swapfile... I'll edit it so that it uses a different name and the truncate method. (fallocate failed on my machine with Operation not supported) – hololeap Jul 08 '14 at 05:42
2

Make sure to avoid unnecessary multiple overwrite passes with shred. E.g. use shred -n 1 without anything else.

The problem with secure file deletion (with git and in general) is that every time you edit, clone, switch branches, etc., a new file (or set of files) is created, possibly in a different physical location. Thus an unknown number of copies leaks into the free space of the filesystem.

Even the filesystem does not know where those copies may be located, so you cannot overwrite them directly, regardless of which tool or method you use. Instead you have to overwrite the entire free space (which the filesystem may not let you do), or the entire device, to be really sure.
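If you do want to scrub the free space of a mounted filesystem, one crude sketch is to fill it with a large junk file and delete it again; the path is an example, and this does not reach reserved blocks, journal areas or sectors already remapped by the drive:

# Fill the free space on the filesystem that held the repo, then remove the filler.
dd if=/dev/zero of=/path/to/filesystem/filler bs=1M    # ends with "No space left on device"
sync                                                   # make sure the data actually hits the disk
rm /path/to/filesystem/filler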

Overwriting an entire disk takes a long time (esp. if you have to copy the files you want to keep back and forth in the process), so the question is how important security vs. speed really is to you here.


The fastest way is probably to just use rm instead. Of course, that does not overwrite anything.

However, if you are already using an SSD, the discard mount option or fstrim makes it possible to throw away most of the free space quickly, and there is no obvious way to get discarded data back.
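On such a setup something along these lines is usually all it takes; the mount point is an example, use whatever filesystem the repository lives on:

rm -rf ~/tmp/.git        # plain, fast deletion
sudo fstrim -v /home     # discard all unused blocks on that filesystem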

For home use, this should be a sufficient level of security while keeping things practical. For stronger security needs, use encryption.

shred -n 1 is great for overwriting entire disks, in that it writes large amounts of (pseudo)random data quickly. It is fast enough to saturate full disk speeds, even on an SSD. As such there is no downside compared to writing zeroes.
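For a whole device this boils down to a single, destructive command; the device name is an example, so double-check it before running anything like this:

sudo shred -v -n 1 /dev/sdX    # one pass of pseudorandom data over the entire device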

The downside of zeroes is that the storage may decide to mark the blocks as free, or compress them, instead of actually writing anything. Something to consider when you no longer have true control over your storage (because it is sufficiently advanced to be considered a black box, or lives in a virtualization environment).

frostschutz