9

It is possible to remove trailing bytes from a file without writing to a new file (> newfile) and moving it back (mv newfile file). That is done with truncate:

truncate -s -1 file

It is possible to remove leading bytes, but only by writing to a new file and moving it back into place, which changes the inode (for versions of tail that support -c +N). For example, to remove the first byte:

tail -c +2 file > newfile ; mv newfile file

So: How to do that without moving files around?
Ideally, like truncate, only a few bytes would need to be changed even for very big files.

note: sed -i will change the file's inode, so even though it may be useful, it is not an answer to this question IMO.

  • You can create a new file in whatever way you wish, and then overwrite the original with cat new >orig (instead of mv as in your example). This leaves the original name with its original inode. Or do you want to avoid creating an intermediate file? – Kusalananda Nov 30 '19 at 17:31
  • @Kusalananda Ideally, yes, I would like to avoid creating new files (using more disk space; duplicating a big file takes time; etc.), like truncate does. In other words: isn't there a truncate-like utility for the beginning of a file? If that is not possible (which I suspect, but am asking to confirm), then yes, some file(s) need to be created, and your solution is perfectly valid and useful. – Nov 30 '19 at 17:40
  • And just to be clear, you are not asking about zeroing out bytes at the start of the file (which dd would be able to do), but to actually "move the start of the file". – Kusalananda Nov 30 '19 at 17:42
  • @Kusalananda Yes, that is correct: "move the start of a file to some byte count". Thanks. – Nov 30 '19 at 17:45
  • The issue is that, even if you have a 1 GB file and just want to remove the first byte, the other (1 GB − 1) bytes all have to be physically moved. You might do that by using something like dd to pick out the separate parts of each block and reassemble them differently, but that would take two processes for every block. It is more efficient to copy the whole file. – Paul_Pedant Nov 30 '19 at 18:22
  • @Isaac, just to clarify, you mean to remove a part of the data from the beginning and to shift the rest back "down"? As in removing the first three characters of abcdefgh to get defgh? – ilkkachu Nov 30 '19 at 19:39
  • @ilkkachu If those strings represent the final file contents "as viewed by the user" (not what the file system might be storing on disk), then yes, that is what I mean. ... Isn't that clear when I ask for something similar to truncate? If a file contains abcdefgh and you do truncate -s -3, you get abcde as the final file, don't you? I am not able to discern whether this comment is being helpful (making matters clearer) or whether I am missing something. Am I? – Nov 30 '19 at 21:14
  • Still not clear about a couple of things. (a) What constitutes a "big file" for the purposes of this question? (b) How will you determine the number of leading bytes to be deleted from the file? [My test knows the size of the file it used as padding, but in the general case you will need to do something.] Maybe (if it is text) head or sed some number of lines, and wc -c them to pass on to the unpadding routine. – Paul_Pedant Dec 08 '19 at 00:21

9 Answers

6

With ksh93:

tail -c+2 < file 1<>; file

(where <>; is a ksh93-specific variant of the standard <> operator that truncates the file at the end if the command being redirected was successful).

That would remove the first byte (by writing the rest of the file over itself and truncating at the end).
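A quick sanity check of that behaviour (assuming ksh93 t+ or newer; see the comments below):

printf abcdefgh > file
ls -i file                  # note the inode number
tail -c+2 < file 1<>; file
cat file                    # bcdefgh
ls -i file                  # same inode as before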

The same can be done in sh with:

{
  tail -c+2 < file &&
    perl -e 'truncate STDOUT, tell STDOUT'
} 1<> file

Note that it would unsparse sparse files (you could still re-dig holes afterwards with fallocate -d though).

Upon read/write errors, tail would likely bail out leaving the file partly overwritten (so for instance, abcdefgh could end up as bcddefgh if it fails after rewriting bcd). You could adapt the above so that it reports the writing offset upon error, so you know how to recover the data. Still with ksh93:

unset -v offset
{ tail -c+2 < file || false >#((offset=CUR)); } 1<>; file

After which if $offset is set, it contains the amount of data that was successfully written.

On Linux (since 3.15), on ext4 and XFS file systems, one can collapse a range of bytes whose size and offset are a multiple of the filesystem block size, with the fallocate() system call or the fallocate utility.

So for instance

fallocate -c -l 8192 file

Would remove the first 8192 bytes of the file (assuming a filesystem whose block size is a divisor of 8192) without having to rewrite the rest of the file. But that's of no use if you want to remove a section whose size is not a multiple of the FS block size.
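To see both cases on an ext4 filesystem with 4096-byte blocks (a sketch; util-linux fallocate assumed):

dd if=/dev/zero of=file bs=4096 count=4    # 16 KiB test file
ls -il file                                # note the inode and size
fallocate --collapse-range --offset 0 --length 8192 file
ls -il file                                # same inode, size now 8192
fallocate --collapse-range --offset 0 --length 100 file
# the last call fails with EINVAL: 100 is not a multiple of the block size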

  • So, the best solution possible in your opinion is to move (read and write) all the file contents down n bytes, correct? That takes some time, but if that is the best possible solution, that is the best possible answer. Is the copy process sensitive to errors (could the file end up corrupted/irrecoverable)? In one word: is it reasonably safe? (Not asking about another process attacking it, just whether normal disk operations would end in a sane state.) – Dec 03 '19 at 23:59
  • @Isaac, see the edit about safety (and a note about sparse files). I don't know of any approach other than FALLOC_FL_COLLAPSE_RANGE and read+write; it's still possible that others exist, or you could consider other approaches, like a virtual file that is a view of a section of the original... – Stéphane Chazelas Dec 04 '19 at 00:06
  • @stephane The ksh93 solution above appears not to work in either ksh93 or bash. Trivial point -- the ";" separating the 1<> from the file it refers to. More significantly, it moves all the data forward but fails to truncate the same length from the end of the file, so if you remove N bytes at the front, the last N bytes are duplicated and the file length is unchanged. – Paul_Pedant Dec 10 '19 at 23:32
  • @Paul_Pedant, the <>; operator is a ksh93 extension. It's like the Bourne/standard <> operator but truncates at the end if the command was successful. You might have hit that bug or you may be using a version of ksh93 prior to t+ (2010). – Stéphane Chazelas Dec 11 '19 at 06:12
1

Depends on what you mean by "very big files". What is your limit?

You could read the whole thing into memory (as an awk string), and write back the substring to the original file. At some stage, awk would hold both the original and the substring at the same time, but it's a viable solution for maybe 0.5 GB. awk will do about 80 MB per second on my laptop.
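A sketch of that idea (GNU awk assumed; the whole file must fit in memory, and awk strings cannot hold NUL bytes, so text files only). gawk does not open FILENAME for writing until the first print to it, by which point the single record has already been read, so the file is rewritten in place and keeps its inode:

n=3   # number of leading bytes to drop
LC_ALL=C awk -v n="$n" '
  BEGIN { RS = "^$" }                            # slurp the file as one record
  { printf "%s", substr($0, n + 1) > FILENAME }  # write the substring back
' file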

Easy in C because you can just move the start pointer for the write.

Paul_Pedant
  • 8,679
  • Truncate is lightning fast. Create a 2 GiB file: dd if=/dev/zero of=test bs=1M count=2k should take some time (~20 seconds on your laptop). Then remove the trailing 1 GiB of the file with truncate: time truncate -s 1G test. It takes only 0.1 seconds; same inode, file not read. The question is: is there an equivalent for the leading part of a file? – Nov 30 '19 at 19:28
  • Truncate is fast because it does not need to move any data at all. It sets the new file length in the inode, then cuts the blocks beyond the end of the file size out of the inode structure and adds them to the free list. It does not even need to clear any excess bytes at the end of the last block. That does not work at the front of the file. It's like taking some coaches off a train -- easy at the back end, very clumsy to remove the ones attached to the locomotive. – Paul_Pedant Nov 30 '19 at 23:47
  • So, then, your answer is: "there is no tool like truncate for the front of a file", correct? – Dec 01 '19 at 00:07
  • Correct. I looked at fallocate(), and it has many restrictions on the file systems it will handle and the block sizes it will deal with. Basically, it seems able to convert a file into a sparse file by removing whole blocks, but not able to change the sequential location of any data. – Paul_Pedant Dec 01 '19 at 23:26
  • I am rooting for fallocate. I use it for rough edits to MPEG-TS videos on a daily basis. It can deallocate and allocate space within a file without rewriting the entire file. The "restrictions" come down to using a common file-system and a kernel from the last 8 years. I'd expect that to be possible. – Hermann Dec 06 '19 at 23:23
  • If block-wise insertions and deletions are not acceptable, you are out of luck: there is no general-purpose file-system which addresses individual bytes. – Hermann Dec 06 '19 at 23:29
1

This is reasonably easy in C: it uses the same inode and no work file. It just needs care to get right. You probably need a preliminary query to find the block size of the device (for example 4096), although some arbitrary power of 2 might be sufficient (for example 64 KiB).

Visualise the data flow as a caterpillar: stretching forwards, having the data work its way into the new location.

Open the file read/write, and do everything with read/write system calls, to avoid possible buffering issues in the FILE* routines.

The byte count to be removed from the front of the file (N) is a number of complete blocks plus some spare bytes (either or both of those components can be zero).

Seek to N, read X * 4096 bytes where X is chosen to be large (for efficiency) but not silly-large. Maybe a 4 MB buffer would be a sweet spot.

Seek to 0, and write that buffer out into the complete number of blocks it should occupy. As far as I can see on paper, that can never wrap over itself -- the next unread byte cannot be in an earlier block.

Rinse and repeat (stepping up by 4MB on both seeks) until you run out of file. Deal with any short block properly.

That leaves you with an extra copy of the last N bytes, which you can truncate with a system call.

Performance should be fine. The writes are block-aligned. In theory each read takes two block accesses because of the overlap, but a contiguous read avoids that (e.g. a 4 MB read touches 1025 blocks instead of 1024).

I thought this could be done in a script with the dd command, but the block-size option in dd applies to both the seek and the read, so it becomes wildly inefficient.
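GNU dd can decouple them, though: with iflag=skip_bytes the skip is counted in bytes while bs stays large. A sketch of that variant (GNU coreutils assumed; the trailing truncate removes the N bytes that the copy leaves duplicated at the end):

N=9    # leading bytes to remove
dd if=file of=file bs=4M iflag=skip_bytes skip="$N" conv=notrunc
truncate -s "-$N" file

The read position always stays N bytes ahead of the write position, so, as with the caterpillar above, nothing is overwritten before it has been read.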

Test method: Get a 100 MB file of random data, cksum it. Then append it to a smaller padding file of N bytes. Run the code, cksum, and prove the file is now identical to what you appended. Time it. Test with various values of N, including 0, less than one block, an exact number of blocks, several blocks plus a bit, and the whole file.

Do I have to write and test the code to get the bounty?

Paul_Pedant
  • 8,679
0

Here are some ideas (mostly theory); this does not address the inode part (the easy part) of the question.

I would think it is possible to remove whole blocks from the start (but I know of no tools to do it). However it would be unusual to need to remove an exact integer number of blocks, so it would not be worth implementing (unless there is a compelling use-case).

Then there could be a way, in the file-system, to make a file start part-way through a block. (I wonder about the tail-merging stuff, and whether it could be used, but that would complicate the file system and would need a file-system hack.) An alternative would be a FUSE overlay file-system that ignores the first part of a file. This would not remove the start, but would give the appearance of removing it. Combining it with the idea in paragraph 1 (if that can be made to work) would get almost all of the storage back.

A kernel call to remove blocks from the start/middle/end of a file is fallocate. But it is only implemented for Linux kernel 3.15+ on ext4 (only for extent-based files) and XFS. It can also punch holes to make a file sparse (no block limitations, but the reported size does not change). -- thanks to @Isaac for this.

There is also a tool (usable from the shell), also called fallocate, that exposes these operations.
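For instance, either of these acts on the first 4096-byte block without rewriting the rest of the file (a sketch; util-linux fallocate assumed):

fallocate --punch-hole --offset 0 --length 4096 file
# the block becomes a hole: reads return zeros, reported size unchanged

fallocate --collapse-range --offset 0 --length 4096 file
# the data after the block shifts down and the size shrinks
# (ext4/XFS on Linux 3.15+ only)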

  • Having a file start "some bytes ahead of 0 (zero)" breaks alignment. Having to read two file blocks to return one block's worth of data shifted by "some bytes" makes each read inefficient (slow); that's why that kind of (shifted) file storage has not been implemented (AFAICT), and therefore a function to cut some bytes at the start of a file is simply "not possible". Probably a solution to this issue would be regarded as "not a bug, won't fix" by FS developers. I don't have much hope, just seeking confirmation(s), thanks. – Nov 30 '19 at 19:35
  • @Isaac that is why I was thinking that an overlay file-system could do it. It would apply the offset and pass it through. If the application did not care about file alignment, then there would be only a tiny overhead (of adding the offset, plus any other overlay-fs overhead). If the application does care about blocks, and is accessing them randomly, then there will be a ×2 overhead. – ctrl-alt-delor Nov 30 '19 at 21:44
0

You can do this using sponge from moreutils. Here's an example using tail to remove the first 3 bytes:

tail -c +4 file | sponge file
Arjuna
  • 31
  • sponge does write to an additional/temporary/new file and moves it back at the end of the write process. That requires additional (temporary) file storage. But yes, this gets to the stated end goal. – Dec 03 '19 at 23:54
  • @Isaac, it misses the same-inode requirement if file is on the same FS as ${TMPDIR:-/tmp}, where in effect it works like your tail -c +4 file > newfile ; mv newfile file – Stéphane Chazelas Dec 04 '19 at 00:12
0

You could do this with printf and sed, although it's highly dangerous to edit files on the fly without using a temporary copy, and I cannot recommend enough that you read this great article before doing it.

Also, keep in mind this might fail if the file is too big and you don't have enough memory, since the command substitution holds the whole file in the shell's memory.

printf '%s\n' "$(sed '1d' test.txt)" > test.txt

That said, this should work and modifies the file on the fly without changing the inode. (Note that sed '1d' removes the first line rather than a given byte count, and that the command substitution strips trailing newlines, of which printf adds exactly one back.)

0

I have what seems to be a reliable in-situ update for this problem. I have a test harness for it in bash (48 lines), and a C program (140 lines). My tests are on a laptop with a 5400 rpm drive, so nothing sparkling.

I can use cat to put a 7 MB file in front of a 7 GB file (takes 3m31s), and I can rewrite the 7 GB back into the front of the file and shrink the end (3m48s) using a 64 MB buffer. So I think I'm disk-limited in both cases. Just a cksum to check for file corruption takes 1m30s.

I will tidy the code up a little, and post it on Sunday.

This test script explains the test method.

#! /bin/bash

function Usage { expand -t 4 <<'[][]'
#.. Test package for xFront C program:

    Usage: ./xFrontTest padFile mainFile
[][]
}

#.. Test package for xFront C program: ./xFrontTest padFile mainFile

#.. Strategy:
#.. Makes the binary every time in case we forget after an edit.
#.. Makes a test file by appending the two named files.
#.. Finds the size to remove off the front by sizing padFile.
#.. Finds the cksum of mainFile for verification purposes.
#.. Calls the test with appropriate args.
#.. Checks the result file checksum matches the original.

#### Script Body Starts Here.

    [[ -f "${1}" && -f "${2}" ]] || { Usage; exit 2; }

    cc -o xFront xFront.c || { echo 1>&2 "Failed to compile xFront"; exit 3; }

    WORK="./FixedFile"
    echo 1>&2 "Making: ${WORK}"
    time cat "${1}" "${2}" > "${WORK}"

    ls -ilrt --time-style=full-iso "${1}" "${2}" "${WORK}"

    X_SIZE="$( wc -c < "${1}" )"
    echo ./xFront "${X_SIZE}" "${WORK}"
    time ./xFront "${X_SIZE}" "${WORK}"

    ls -ilrt --time-style=full-iso "${1}" "${2}" "${WORK}"

    M_CKSUM="$( cksum < "${2}" )"
    W_CKSUM="$( cksum < "${WORK}" )"

    echo 1>&2 "Old checksum: ${M_CKSUM}"
    echo 1>&2 "New checksum: ${W_CKSUM}"

    [[ "${M_CKSUM}" == "${W_CKSUM}" ]] || { echo 1>&2 "CKSUM FAIL"; exit 1; }

    echo 1>&2 "CKSUM SUCCESS"
    exit 0

This is the output from the test, using files from my backup system.

Paul--) time ./xFrontTest Sudoku.tar Users.tar
Making: ./FixedFile

real    3m31.474s
user    0m0.124s
sys 0m32.996s
6302103 -rw-r--r-- 1 paul paul 6913402880 2019-12-07 21:51:38.362431068 +0000 Users.tar
6302076 -rw-r--r-- 1 paul paul    7004160 2019-12-07 21:51:38.598424089 +0000 Sudoku.tar
6302102 -rw-r--r-- 1 paul paul 6920407040 2019-12-07 22:10:24.553044325 +0000 ./FixedFile
./xFront 7004160 ./FixedFile
skEnd = 6920407040
Used 104 cycles of 67108864 buffer.

real    3m48.606s
user    0m0.012s
sys 0m27.344s
6302103 -rw-r--r-- 1 paul paul 6913402880 2019-12-07 21:51:38.362431068 +0000 Users.tar
6302076 -rw-r--r-- 1 paul paul    7004160 2019-12-07 21:51:38.598424089 +0000 Sudoku.tar
6302102 -rw-r--r-- 1 paul paul 6913402880 2019-12-07 22:14:16.838170978 +0000 ./FixedFile
Old checksum: 738098870 6913402880
New checksum: 738098870 6913402880
CKSUM SUCCESS

real    10m39.589s
user    1m32.496s
sys 1m18.960s
Paul_Pedant
  • 8,679
0

The code. It is functionally complete, although the error messages could be clearer. I found by experiment that a 64 MB buffer has no advantage over a 1 MB one.

//
// xFront.c: Remove Leading bytes from a large file efficiently.

#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <sys/stat.h>
#include <fcntl.h>

#define Hint(TX) fprintf (stderr, "%s\n%s\n", Usage, TX)

static char *Usage =

  "\nUsage: xFront size filename"
  "\n   size      number of bytes to be removed from the beginning of the file"
  "\n   filename  name of the file to be modified.";

// #### Structure definition for file management.

#define MAX_BUF (1 * 1024 * 1024)

typedef struct {

    size_t  szBuf;
    void *  buf;
    char *  fn;
    int     status;
    int     fd;
    int     cycles;
    off_t   r_offset;
    off_t   w_offset;
} FileManager_t;

static FileManager_t fixFM = {

    MAX_BUF,
    NULL,
    NULL,
    0,
};

static FileManager_t *pFM = & fixFM;


// #### File copying function.

void xFront (FileManager_t *pFM)

{
off_t r_offs; ssize_t r_size;
off_t w_offs; ssize_t w_size;

    while (1) {
        r_offs = lseek (pFM->fd, pFM->r_offset, SEEK_SET);
        if (r_offs < 0) { pFM->status = 1; return; }
        r_size = read (pFM->fd, pFM->buf, pFM->szBuf);
        if (r_size < 0) { pFM->status = 2; return; }
        if (r_size == 0) { pFM->status = 0; return; }
        pFM->r_offset += r_size;

        w_offs = lseek (pFM->fd, pFM->w_offset, SEEK_SET);
        if (w_offs < 0) { pFM->status = 3; return; }
        w_size = write (pFM->fd, pFM->buf, r_size);
        if (w_size != r_size) { pFM->status = 4; return; }
        pFM->w_offset += w_size;

        pFM->cycles++;
    }
    return;
}

// #### Entry function.

int main (int argc, char **argv, char **envp)

{
long int xSize;
off_t skEnd;
char *e;

    if (argc != 3) { Hint ("Missing args"); return (2); }

    // Validate the count to be removed from the front of the file.
    xSize = strtol (*(argv + 1), &e, 10);
    if (*e != '\0' || xSize < 0) { Hint ("Bad size"); return (2); }

    // Open the file for update, and check.
    pFM->fn = *(argv + 2);
    pFM->fd = open (pFM->fn, O_RDWR);
    if (pFM->fd < 0) { perror (pFM->fn); return (2); }

    // Find the file length, and maybe use a short buffer.
    skEnd = lseek (pFM->fd, (off_t) 0, SEEK_END);
    fprintf (stderr, "skEnd = %ld\n", skEnd);
    if (xSize >= skEnd) { close (pFM->fd); Hint ("Size too big"); return (2); }
    if (pFM->szBuf > skEnd) pFM->szBuf = skEnd;

    pFM->buf = malloc (pFM->szBuf);
    if (pFM->buf == NULL) {
        fprintf (stderr, "Cannot allocate %ld for file buffer\n", pFM->szBuf);
        return (1);
    }
    // Update the file in-situ.
    pFM->cycles = 0;
    pFM->r_offset = xSize;
    pFM->w_offset = 0L;
    xFront (pFM);
    fprintf (stderr, "Used %d cycles of %ld buffer.\n", pFM->cycles, pFM->szBuf);

    // We have scuffed up the file somehow.
    if (pFM->status != 0) {
        fprintf (stderr, "Copying failed with status %d\n", pFM->status);
    }

    // Trim the duplicate data from the end of the file.
    if (pFM->status == 0) {
        pFM->status = ftruncate (pFM->fd, (off_t) (skEnd - xSize));
        if (pFM->status < 0)  perror (pFM->fn);
    }

    // Tidy up.
    if (pFM->buf != NULL) free (pFM->buf);
    pFM->status = close (pFM->fd);
    if (pFM->status < 0) { perror (pFM->fn); return (2); }

    return (0);
}
Paul_Pedant
  • 8,679
  • This code could be modified with very little change to insert space at the front of a file. It could take an arg which was either the number of null bytes to be inserted, or a file (which would then determine the size required). The changes would be simply: adjust the two offsets to the higher end of the file; decrement them by the sizes of the reads and writes; exit the loop when the read offset becomes zero, and adjust the last cycle; then clear the space at the front of the file, or copy in the new file. – Paul_Pedant Dec 08 '19 at 11:35
0

For example, to remove the first 9 bytes from the beginning of a file, keeping the same inode, you can do:

dd if=file of=file bs=1 skip=9 conv=notrunc
truncate -s -9 file

You may adjust bs and skip so that the process is more efficient, depending on the actual values (skip is counted in blocks of bs bytes, so bs must evenly divide the number of bytes you want to remove).

It will skip some blocks (of bs bytes each) on input and then start copying the file onto itself, block by block; the read position always stays ahead of the write position, so nothing is overwritten before it has been read. Because conv=notrunc leaves the file length unchanged, the final truncate removes the last 9 bytes, which would otherwise remain duplicated at the end.

  • dd with a block size of 1 byte causes as many read and write system calls as the original file size minus the skipped portion. This is awfully inefficient. Do you happen to know a command line tool using faster copy_file_range system calls? – Kai Burghardt Dec 21 '23 at 22:30