
I have a 500 GByte disk with a single XFS file system on it (EDIT: the OS is on another disk). On this disk I have backup data in the form of multiple hard-linked copies of the original data. After each new backup I delete the directory containing the data of the oldest backup. The corresponding rm process sometimes does not terminate (and consumes a lot of CPU). Killing it (-9) does not help, only rebooting the system does.

I tried running xfs_repair on that volume. However, it seems I do not have enough RAM for that (the machine has 4 GByte RAM and only supports 32bit).

The location of the machine makes it very hard for me to physically touch the hard disk.

How can I repair my file system and/or make rm terminate?
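For reference, the backup scheme described above (hard-linked copies, with the oldest rotated out via rm -rf) can be sketched as follows; all paths are hypothetical:

```shell
#!/bin/sh
# Hypothetical sketch of the rotation scheme described in the question: each
# new backup is a hard-linked copy of the previous one, and the oldest
# generation is deleted. Paths are invented for illustration.
set -e
demo=/tmp/backup-demo
mkdir -p "$demo/src"
echo "payload" > "$demo/src/file"

cp -a  "$demo/src"      "$demo/backup.1"   # first backup: a real copy
cp -al "$demo/backup.1" "$demo/backup.2"   # next backup: hard links only

stat -c %h "$demo/backup.2/file"           # prints 2: two links, one inode

rm -rf "$demo/backup.1"                    # rotate out the oldest backup
stat -c %h "$demo/backup.2/file"           # prints 1: data still intact

rm -rf "$demo"
```

Deleting the oldest tree only decrements link counts, so a hanging rm points at filesystem trouble rather than at the amount of data being removed.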

EDIT: I ran xfs_repair -v -t 1 /dev/disk/xxx with xfs_repair version 3.1.7. The output was:

Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 1
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 2
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 3
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 4
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 5
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 6
failed to create prefetch thread: Resource temporarily unavailable
        - agno = 7

fatal error -- calloc failed in dir_hash_init
C-Otto
  • What you are doing is not backup, it is revision control. I would recommend using a revision control tool in the future. – ctrl-alt-delor Dec 15 '13 at 12:45
  • What was the output of xfs_repair -v -t 1? Have you read this FAQ? Is your OS on this same filesystem, or is it isolated? – bsd Dec 15 '13 at 14:00
  • richard, it is a backup. Whenever I delete a file by accident, I can recover it. In addition to the backup I have access to previous versions of files and, more importantly, to complete snapshots of my filesystem (what help is a previous version of a config file if the corresponding version of the tool is not available?). I am happy with this solution and do not intend to switch my setup to something else. – C-Otto Dec 15 '13 at 14:37
  • bdowning: I do not have that output available, sorry. Reproducing it takes a long while. I have read the FAQ. The OS is isolated from the disk. – C-Otto Dec 15 '13 at 14:39
  • 1
    how much swap space do you have allocated? My server has a single 10TB xfs filesystem. I have a 60GB SSD for my OS with about 4-6GB for the OS and all of the rest is configured as swap specifically so I can run xfs_check, xfs_repair and xfs_fsr. – bsd Dec 15 '13 at 18:40
  • I just added 20 GByte of swap, before that I did not have swap. However, I am working on a 32bit system, which might cause problems (I already have 4 GByte of RAM). – C-Otto Dec 15 '13 at 18:43
  • Even though you're running a 32bit system you can temporarily boot off a live cd, like sysrescuecd in 64bit mode; activate your 20GB swap and run all xfs checks and repairs from the sysrescuecd instead of the installed OS. Once repaired (if necessary) back to old OS. – bsd Dec 15 '13 at 18:57
  • I read your comment that you don't want to change your strategy, but you may wish to consider creating a volume group with multiple logical volumes (if you can carefully estimate sizes needed) and multiple xfs filesystems.

    Then when it comes time, you could simply eliminate the old logical volume in its entirety. This would save xfs from having to manage its metadata after your large delete.

    I used to keep my data in several separate logical volumes but eventually merged them into one for reasons particular to my data.

    – bsd Dec 15 '13 at 19:08
  • I do not have physical access. And the CPU does not support 64bit. – C-Otto Dec 16 '13 at 18:23
  • I solved the problem using a (long) process of moving the files to a temporary disk and back again (using EXT4 and LVM). – C-Otto Jan 13 '14 at 14:47
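The per-backup logical volume approach bsd suggests in the comments might look like the following; the volume group name, sizes, and mount points are all invented:

```shell
# Hypothetical sketch of bsd's suggestion: one logical volume (and one small
# XFS filesystem) per backup generation, so that expiring a backup is a single
# lvremove instead of a huge rm -rf. Names and sizes are made up.
lvcreate -L 20G -n backup_2013_12_15 vg_backups
mkfs.xfs /dev/vg_backups/backup_2013_12_15
mount /dev/vg_backups/backup_2013_12_15 /srv/backups/2013-12-15

# ... write the new backup here ...

# Expiring the oldest generation: no per-file deletes, no XFS metadata churn.
umount /srv/backups/2013-11-15
lvremove -f /dev/vg_backups/backup_2013_11_15
```

The trade-off, as noted in the comments, is that you must estimate the size of each generation in advance.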

2 Answers


Did you try running strace on the rm process to see what it's doing? When deleting lots and lots of files, XFS can be painfully slow. I once foolishly used ccache on XFS, and it was far faster to move all the other files away, reformat, and move them back than to rm -r the millions of ccache files. Still, it would eventually have terminated had I let it run its course.
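A quick way to check whether the rm is making any progress at all (commands assume Linux and root; finding the process with pgrep is just one option):

```shell
# Attach to the running rm and watch its system calls. If it is spinning
# inside the kernel, strace may show nothing at all, so also check where the
# process is blocked in kernel space via /proc.
strace -f -tt -p "$(pgrep -x rm)"       # live syscall trace of rm

# If strace shows no syscalls, inspect the kernel side:
cat "/proc/$(pgrep -x rm)/wchan"; echo  # kernel function it is waiting in
cat "/proc/$(pgrep -x rm)/stack"        # full kernel stack (needs root)
```

A non-empty wchan or a stack stuck deep in XFS code would support the filesystem-corruption theory rather than a merely slow delete.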

As for xfs_repair, I never noticed it using a lot of memory, but all my machines have plenty of memory, so...

You could add swap (if that helps). Alternatively, you could export the block device (using NBD through an OpenVPN or SSH tunnel) to a machine that has more RAM available, although I am not sure whether that would be faster or slower than transferring an image of the entire filesystem (possibly using xfsdump). It depends on how much data xfs_repair has to read/write during the process.
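Adding swap without repartitioning could look like this; the file location and size are arbitrary, and note that on a 32-bit kernel a single process is still limited to roughly 3 GB of address space, so swap alone may not be enough for xfs_repair:

```shell
# Hypothetical: create a 20 GiB swap file on a filesystem with free space.
# On 32-bit systems this does not lift the per-process address-space limit,
# so xfs_repair may still run out of memory.
dd if=/dev/zero of=/var/swapfile bs=1M count=20480   # 20 GiB of zeros
chmod 600 /var/swapfile
mkswap /var/swapfile
swapon /var/swapfile
swapon -s          # verify the new swap area is active
```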

frostschutz
  • I will run strace the next time. However, as the problem only appears once in a while on the same kind of data, I don't think this is a normal (slow) delete operation. Adding swap might help, I'll try. – C-Otto Dec 15 '13 at 14:41
  • strace does not show anything. – C-Otto Dec 15 '13 at 14:45
  • I guess swap won't help on this system, as it is a 32bit system. – C-Otto Dec 15 '13 at 14:59
  • "strace does not show anything" If rm is doing absolutely nothing at all, then you would appear to have a problem that is unrelated to the fact that you are running XFS. Does rm even get invoked? – user Dec 16 '13 at 20:08
  • Michael, I ran strace on the PID of the rm process (which, as I already mentioned, consumes quite a lot of CPU). – C-Otto Dec 18 '13 at 08:18

From my days with a 256 MB SGI Indigo 2 I remember that XFS checks could be a memory hog. We had to use XFS for the 2 GB+ files we had. In case of problems we backed the data up to an external (SCSI) drive and restored it after reformatting the filesystem that was giving the problems.

Of course this only works if you have the drive capacity (500 GB USB 3.0 drives cost less than €50) and your system is not on the same partition (that is not 100% clear from your question).
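The dump/reformat/restore cycle described above can be done with the XFS tools themselves; the device names and mount points below are placeholders:

```shell
# Hypothetical: copy everything off, recreate the filesystem, copy it back.
# /dev/sdX1 is the damaged XFS volume mounted at /data; /mnt/ext is the
# temporary external drive.
xfsdump -f /mnt/ext/data.dump /data      # dump the whole filesystem to a file
umount /data
mkfs.xfs -f /dev/sdX1                    # recreate a fresh XFS filesystem
mount /dev/sdX1 /data
xfsrestore -f /mnt/ext/data.dump /data   # restore the dump
```

xfsdump/xfsrestore preserve hard links, which matters for a hard-link-based backup scheme like the asker's; a plain cp -r would explode each linked tree into full copies.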

Zelda