9

How can I copy an arbitrary part of a binary file with reasonable speed?

Apparently, dd with bs=1 is very slow, while setting bs to another value makes it impossible to copy arbitrary parts.

Is that wrong? Is it possible to do this with dd? If not, what is the right tool?

For example, this command,

dd if="$img" of=tail.bin bs=2147483648 skip=1 status=progress

copies an incorrect tail.

And this command,

dd if="$img" of=tail.bin bs=1 skip=2147483648 status=progress

is very slow.

Dims
  • I have an example program in my answer to that question that does what you want - you specify an input and an output file as well as an offset into the input file and a length and it just goes off and copies that part with as few context switches as possible, leaving optimal transfer sizes up to the kernel – Marcus Müller Mar 27 '23 at 21:09
  • Incorrect tail? GNU strace dd if=file of=/tmp/tail.bin bs=2147483648 skip=1 appears to do the right thing, first lseek(0, 2147483648, SEEK_CUR) on the input, then read(0, "\205..."..., 2147483648) = 2147479552 and write(1, buf, read_size), then another read(0, buf, 2147483648) gets to the end of the file, returning much shorter instead of slightly shorter. After printing a partial-read warning, it writes out that partial block, then read(0, buf, bs) returns 0. So it ends and prints "0+2 records in\n0+2 records out\n". – Peter Cordes Mar 29 '23 at 03:03
  • So it works for me, unless the offset is a prime number larger than the amount of memory it can use for a buffer for the rest of the file. GNU dd's skip_bytes is of course even better. Or for the whole tail, tail -c +offset. But if you want to write to the middle of a file with conv=notrunc, you want dd. – Peter Cordes Mar 29 '23 at 03:06
  • cmp --ignore-initial=2147483648:0 src tail.bin confirms tail.bin is identical. Oh, but apparently if dd is interrupted during a read system call, it can lose data? – Peter Cordes Mar 29 '23 at 04:24
  • @PeterCordes, if it reads a partial block, it should write a partial block, so the resulting byte stream should be the same. That's not the case if 1) you use count=N, since it limits the total number of input blocks it reads, counting partial ones; or 2) you use conv=sync, when it fills the partial blocks with NULs; or 3) the device you write to is blocksize sensitive (where I suppose sync could help in some cases). But having to think about any of this should work as evidence that dd usually isn't the right tool for generic byte-stream copying :) – ilkkachu Mar 29 '23 at 08:05
  • @ilkkachu: Ah, that makes sense. I was only testing with the command in the question, which copies the whole tail, not a fixed-length chunk from the middle. That would indeed have failed since bs was so large I got a short read. – Peter Cordes Mar 29 '23 at 15:11
  • How about spinning up nginx and curl-ing the necessary part from localhost with a Range header? ;-) – Boldewyn Mar 30 '23 at 08:04

3 Answers

21

GNU dd supports count_bytes, seek_bytes, skip_bytes flags. This allows you to use a performant blocksize choice with arbitrary offsets and sizes.

GNU dd since version 9.1 does this by default if you specify a byte unit.

Quoting from coreutils/NEWS:

dd now counts bytes instead of blocks if a block count ends in "B". For example, 'dd count=100KiB' now copies 100 KiB of data, not 102,400 blocks of data. The flags count_bytes, skip_bytes and seek_bytes are therefore obsolescent and are no longer documented, though they still work.
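As a rough sketch of the byte-offset flags described above, assuming GNU dd: the helper name copy_bytes and the 1 MiB block size are illustrative choices, and iflag=fullblock is an extra precaution added here (not part of the answer) so that a short read cannot cut the copy short.

```shell
# copy_bytes IN OUT OFFSET LENGTH
# Hypothetical helper name; requires GNU dd (these iflag values are GNU extensions).
copy_bytes() {
    # skip_bytes/count_bytes make skip= and count= byte counts instead of
    # block counts, so bs= can stay large regardless of OFFSET and LENGTH.
    # fullblock makes dd keep reading until each block is full.
    dd if="$1" of="$2" bs=1M \
       iflag=skip_bytes,count_bytes,fullblock \
       skip="$3" count="$4" 2>/dev/null
}
```

For example, copy_bytes "$img" part.bin 2147483648 1000000 would copy one million bytes starting at the 2 GiB mark.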

Another option is to create a loop device with the desired offset and size limits. That works with a dd (or any other program) that does not support arbitrary offsets by itself.

Alternatively you could also consider copying only the partial first/last block with bs=1 and the middle segment with your desired larger blocksize.
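The three-part copy from the last paragraph could look roughly like this. It is only a sketch: the helper name copy_range and the default block size are made up for illustration, and it assumes a regular input file, where dd can seek and reads of a moderate block size come back full.

```shell
# copy_range IN OUT OFFSET LENGTH [BS]
# Hypothetical helper; uses only plain dd operands (bs, skip, count).
# The function body is a subshell, so its variables do not leak out.
copy_range() (
    in=$1 out=$2 offset=$3 length=$4 bs=${5:-65536}
    end=$((offset + length))
    # Bytes from OFFSET up to the next block boundary, capped at LENGTH.
    head_len=$(( (bs - offset % bs) % bs ))
    if [ "$head_len" -gt "$length" ]; then head_len=$length; fi
    mid_start=$((offset + head_len))
    # Whole blocks that fit between the aligned start and the end.
    mid_blocks=$(( (end - mid_start) / bs ))
    tail_start=$((mid_start + mid_blocks * bs))
    tail_len=$((end - tail_start))
    {
        if [ "$head_len" -gt 0 ]; then      # unaligned start, byte by byte
            dd if="$in" bs=1 skip="$offset" count="$head_len"
        fi
        if [ "$mid_blocks" -gt 0 ]; then    # aligned middle, large blocks
            dd if="$in" bs="$bs" skip=$((mid_start / bs)) count="$mid_blocks"
        fi
        if [ "$tail_len" -gt 0 ]; then      # leftover end, byte by byte
            dd if="$in" bs=1 skip="$tail_start" count="$tail_len"
        fi
    } 2>/dev/null > "$out"
)
```

At most 2 × (BS − 1) bytes go through the slow bs=1 path, so the overhead stays small for large ranges.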

frostschutz
  • The last paragraph, while a working suggestion, is unnecessarily complex to code. – iBug Mar 29 '23 at 06:02
8

Try this:

tail -c +"$FROM" file.dat | head -c "$LENGTH" > file1.dat

Just assign FROM and LENGTH as you need. Note that the numbering for tail is one-based and e.g. tail -c +4 means to start at the fourth byte, that is to skip the first three bytes.

As an example:

$ printf 'abcdefghijklmnopqrstuvwxyz\n' > test.txt
$ tail -c +4 test.txt | head -c 6; echo
defghi
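Since tail counts from one, a zero-based offset (the way dd's skip counts) needs one added to it. A small sketch of that conversion, with slice as a made-up helper name and assuming a head that supports -c:

```shell
# slice FILE OFFSET LENGTH
# Hypothetical helper; OFFSET is zero-based, i.e. the number of bytes to skip.
slice() {
    # tail -c +N starts output at the Nth byte, counting from 1, hence the +1.
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}
```

With the test file above, slice test.txt 3 6 skips three bytes and prints the next six.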
White Owl
  • Are parameters here zero based and TO exclusive and FROM inclusive? – Dims Mar 27 '23 at 19:32
  • @Dims reading man head and man tail you can see that head -c 5 would give you the first five bytes of the source, and piping to tail -c +3 (or indeed head -c 3) would give you the first three bytes of that – Chris Davies Mar 27 '23 at 19:40
  • @roaima You got it wrong with the tail. use -c +NUM to output starting with byte NUM. It will give you the last characters, starting with 3. – UncleCarl Mar 28 '23 at 16:20
  • If $FROM can be large, it's probably more efficient to do tail -c +$FROM file.dat | head -c $LENGTH. That way tail can seek past the first $FROM bytes and start reading from there (and stop when it gets a SIGPIPE after head stops reading more data). – Ilmari Karonen Mar 28 '23 at 22:30
  • @IlmariKaronen, yes, that's a good point. At least GNU tail does seek (no surprise), Busybox didn't. – ilkkachu Mar 29 '23 at 10:26
0

If you want the whole tail of the file starting from some position, one way would be to seek a file descriptor to the correct position and just copy with cat. Relatively simple to do with Perl and a shell redirection:

$ printf 'abcdefghijklmnopqrstuvwxyz\n' > test.txt
$ { perl -e 'sysseek STDIN, 23, 0'; cat; } < test.txt
xyz

or with the offset passed through the environment instead, so that it isn't mixed into the code:

$ { N=23 perl -e 'sysseek STDIN, $ENV{N}, 0'; cat; } < test.txt
xyz

A subshell (perl; cat) works too. Note that this is not a pipeline, just perl and cat using the same open file description. Also note the lack of error checking above: if the seek fails, it simply doesn't happen, and cat copies from wherever the file offset already was.

To copy a set number of bytes only, pipe to head -c instead of cat:

$ start=3 count=6
$ { N=$start perl -e 'sysseek STDIN, $ENV{N}, 0'; head -c $count; } < test.txt; echo
defghi
ilkkachu