84

Today I had to remove the first 1131 bytes from an 800MB mixed text / binary file, a filtered subversion dump I'm hacking for a new repository. What's the best way to do this?

To start with I tried

dd bs=1 skip=1131 if=filtered.dump of=trimmed.dump

but after the skip this copies the remainder of the file a byte at a time, i.e. very slowly. In the end I worked out that I needed to prepend 405 bytes to round the offset up to three blocks of 512 (1131 + 405 = 3 × 512), which I could then skip

dd if=/dev/zero of=405zeros bs=1 count=405
cat 405zeros filtered.dump | dd bs=512 skip=3 of=trimmed.dump

which completed fairly quickly, but surely there's a simpler / better way? Is there another tool I've forgotten about?

Alexis Wilke
  • 2,857
Rup
  • 943
  • dd is the right tool for the job - looks like you came up with a nice, elegant solution to your problem. – Justin Ethier Feb 03 '11 at 19:50
  • I don't think it's really possible without copying unless filesystem provides direct support for removing bytes from start. – Anssi Dec 03 '20 at 08:16
  • @Anssi Thanks - no, I wasn't trying to do this in-place without copying, I was just looking for an efficient way to do this at all. – Rup Dec 03 '20 at 09:21

8 Answers

81

You can switch bs and skip options:

dd bs=1131 skip=1 if=filtered.dump of=trimmed.dump

This way the operation can benefit from a greater block size.

Otherwise, you could try tail (although it may not be safe to use with binary files):

tail -c +1132 filtered.dump >trimmed.dump

Finally, you may use 3 dd instances to write something like this:

dd if=filtered.dump bs=512k | { dd bs=1131 count=1 of=/dev/null; dd bs=512k of=trimmed.dump; }

where the first dd streams filtered.dump to its standard output; the second one just reads 1131 bytes and throws them away; then, the last one reads the remaining bytes of filtered.dump from its standard input and writes them to trimmed.dump.

marco
  • 967
  • 8
    Thanks! I didn't know that piped input carried over to a second process like that - that's very neat. I can't believe I didn't think of bs=1131 skip=1 though :-/ – Rup Feb 04 '11 at 00:35
  • 2
    Most modern implementations of shell utilities work correctly with binary files (i.e. they have no trouble with null characters and won't insert an extra newline at the end of the file). Certainly GNU and *BSD implementations are safe. – Gilles 'SO- stop being evil' Feb 05 '11 at 00:04
  • 1
    What does "not safe to use it with binary files" mean? – Him Oct 29 '19 at 13:56
  • 1
    @Scott It means that some implementations of tail (especially legacy systems) tend to mangle binary data. But on Linux that's definitely not the case. – CoolKoon Apr 09 '20 at 18:10
38

Not sure when skip_bytes was added, but to skip the first 11 bytes you can do:

# echo {123456789}-abcdefgh- | 
                              dd bs=4096 skip=11 iflag=skip_bytes
-abcdefgh-
0+1 records in
0+1 records out
11 bytes (11 B) copied, 6.963e-05 s, 158 kB/s

Here iflag=skip_bytes tells dd to interpret the value of the skip option as bytes instead of blocks, which makes this straightforward.
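
Applied to the question's file it would look something like this (just a sketch; the bs=4096 block size is an arbitrary choice, not something from the question):

# skip is counted in bytes here because of iflag=skip_bytes; bs only affects copy speed
dd if=filtered.dump of=trimmed.dump bs=4096 skip=1131 iflag=skip_bytes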

16

You can use a sub-shell and two dd calls like this:

$ ( dd bs=1131 count=1 of=dev_null && dd bs=4K of=out.mp3 ) < 100827_MR029_LobbyControl.mp3
1+0 records in
1+0 records out
1131 bytes (1.1 kB) copied, 7.9691e-05 s, 14.2 MB/s
22433+1 records in
22433+1 records out
91886130 bytes (92 MB) copied, 0.329823 s, 279 MB/s
$ ls -l *
-rw------- 1 max users 91887261 2011-02-03 22:59 100827_MR029_LobbyControl.mp3
-rw-r--r-- 1 max users     1131 2011-02-03 23:04 dev_null
-rw-r--r-- 1 max users 91886130 2011-02-03 23:04 out.mp3
$ cat dev_null out.mp3 > orig
$ cmp 100827_MR029_LobbyControl.mp3 orig
maxschlepzig
  • 57,532
  • 1
    Thanks - I didn't know piped input continued to a second process like that, I guess that's the sub shell? I'll definitely remember that! I've given Marco the tick because he got here first but +1 and thanks for the answer! – Rup Feb 04 '11 at 00:36
  • 1
    @Rup, yes, the sub-shell - created via the parentheses - provides a stdin file descriptor and both dd calls successively consume the input from it. Yeah - Marco beat me by 29 seconds :) – maxschlepzig Feb 04 '11 at 08:23
12

If the filesystem and the Linux kernel support it, you could try fallocate to make the change in place: in the best case there is no data I/O at all:

$ fallocate <magic> -o 0 -l 1131 inplace.dump

where <magic> depends on the filesystem, Linux version, and the file type (FALLOC_FL_COLLAPSE_RANGE or FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE could be used internally).
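
As a rough sketch of the collapse-range variant with util-linux's fallocate (the flag choice and the alignment caveat are assumptions on my part, and it only works on filesystems that support collapsing):

# collapse-range generally requires offset and length to be multiples of the
# filesystem block size, so an unaligned 1131-byte request may well be rejected
fallocate --collapse-range --offset 0 --length 1131 inplace.dump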

jfs
  • 894
  • 1
    This is my preferred method, but running this in a container has its issues. https://stackoverflow.com/questions/31155591/fallocate-failing-inside-my-docker-container – KernelCurry Feb 02 '18 at 19:08
4

You can use count=0 - then the skip is a simple lseek() whenever possible.

Like this:

{  dd bs=1131 skip=1 count=0; cat; } <filtered.dump >trimmed.dump

dd will lseek() the input file descriptor to a 1131-byte offset, and then cat will simply copy whatever remains to the output.

mikeserv
  • 58,310
3

Yet another way to remove leading bytes from a file (without using dd at all) is to convert it to hex with xxd (each byte becomes two hex characters, hence the 1131*2 below) and strip the leading characters with sed or tail, respectively.

bytes=$((1131*2))

xxd -p -c 256 filtered.dump | tr -d '\n' | sed "s/^.\{0,${bytes}\}//" | xxd -r -p > trimmed.dump

bytes=$((bytes + 1)) 
xxd -p -c 256 filtered.dump | tr -d '\n' | tail -c +${bytes} | xxd -r -p > trimmed.dump
wop
  • 39
  • That's neat, but I think I prefer just working with the file in binary rather than converting it to and from hex. – Rup Apr 15 '13 at 14:56
3

Since GNU dd (coreutils) 9.1, you can directly use the B suffix to skip bytes rather than blocks, which replaces the older iflag=skip_bytes approach.

From the man page:

skip=N (or iseek=N) skip N ibs-sized input blocks

N and BYTES may be followed by the following multiplicative suffixes: c=1, w=2, b=512, kB=1000, K=1024, MB=1000*1000, M=1024*1024, xM=M, GB=1000*1000*1000, G=1024*1024*1024, and so on for T, P, E, Z, Y. Binary prefixes can be used, too: KiB=K, MiB=M, and so on. If N ends in B, it counts bytes not blocks.
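
For the question's file that might look like this (a sketch assuming GNU coreutils 9.1 or newer; bs=64K is an arbitrary block size):

# the trailing B makes skip count bytes rather than blocks
dd if=filtered.dump of=trimmed.dump bs=64K skip=1131B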

AdminBee
  • 22,803
sify
  • 141
2

@maxschlepzig asks for a one-liner. Here is one in Perl. It takes two arguments: the starting byte and the length. The input file must be given by '<' and the output will be on stdout:

perl -e 'sysseek(STDIN,shift,0) || die; $left = shift;
     while($read = sysread(STDIN,$buf, ($left > 32768 ? 32768 : $left))){
        $left -= $read; syswrite(STDOUT,$buf);
     }' 12345678901 19876543212 < bigfile > outfile

If length is larger than the file, the rest of the file will be copied.

On my system this delivers 3.5 GB/s.
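
For the question's case, a sketch of the same one-liner skipping the first 1131 bytes (the length argument is just an arbitrarily large value so the whole remainder gets copied):

# 1131 = bytes to skip; the oversized length means "copy everything after that"
perl -e 'sysseek(STDIN,shift,0) || die; $left = shift;
     while($read = sysread(STDIN,$buf, ($left > 32768 ? 32768 : $left))){
        $left -= $read; syswrite(STDOUT,$buf);
     }' 1131 999999999999 < filtered.dump > trimmed.dump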

Ole Tange
  • 35,514
  • I think his one-line challenge was to get you to prove that the scripting language solution was better than his one-line shell solution though. And I prefer his: it's shorter and clearer to me. If yours performs better it's because you're using a larger block size than him which is easily upped in his version too. – Rup Apr 01 '14 at 10:06
  • @Rup Alas, but no. You seem to forget that dd does not guarantee a full read. Try: yes | dd bs=1024k count=10 | wc (see http://unix.stackexchange.com/questions/17295/when-is-dd-suitable-for-copying-data-or-when-are-read-and-write-partial) – Ole Tange Apr 01 '14 at 14:56
  • Also my solution will not read the bytes you do not need (which could be several terabytes long). – Ole Tange Apr 01 '14 at 14:59