125

If I have a large file and need to split it into 100 megabyte chunks I will do

split -b 100m myImage.iso

That usually gives me something like

xaa
xab
xac
xad

And to get them back together I have been using

cat x* > myImage.iso

Seems like there should be a more efficient way than reading through each line of code in a group of files with cat and redirecting the output to a new file. Like a way of just opening two files, removing the EOF marker from the first one, and connecting them - without having to go through all the contents.

Windows/DOS has a copy command for binary files. The help mentions that this command was designed to be able to combine multiple files. It works with this syntax: (/b is for binary mode)

copy /b file1 + file2 + file3 outputfile

Is there something similar or a better way to join large files on Linux than cat?

Update

It seems that cat is in fact the right and best way to join files. Glad to know I was using the right command all along :) Thanks everyone for your feedback.

cwd
  • 45,389
  • 1
    Why do you think 'cat x* > myImage.iso' is 'more efficient' than 'copy /b file1 + file2 + file3 outputfile'? – symcbean Nov 15 '11 at 12:25
  • 30
    Side note: Better not use cat x*, because the order of files depends on your locale settings. Better start typing cat x, then press Esc and then * - you'll see the expanded order of files and can rearrange. – rozcietrzewiacz Nov 15 '11 at 12:33
  • 26
    Instead of cat x* you could consider shell brace expansion, cat xa{a..g} which expands the specified sequence to cat xaa xab xac xad xae xaf xag – Peter.O Nov 15 '11 at 12:57
  • @symcbean - I actually was thinking that a command like copy (on windows) seemed like a more efficient method than cat, partly because the help for copy mentions that it can be used this way. I knew that cat would work to join files, and it works quickly with small files, but I was trying to ask if there was a better way to join files - especially very large files. – cwd Nov 15 '11 at 14:21
  • 3
    @rozcietrzewiacz - can you give an example of how I would adjust my locale setting that would break cat x* ? Would the new locale setting not also affect split so that if split and cat x* were used on the same system they would always work? – cwd Nov 15 '11 at 14:29
  • 3
    "opening two files, removing the EOF marker from the first one, and connecting them - without having to go through all the contents."... sounds like you need to invent a new filesystem in order to do what you want – JoelFan Nov 15 '11 at 16:14
  • 1
    @JoelFan - or just acquire a deeper understanding of the capabilities of the existing file system. – cwd Nov 15 '11 at 17:23
  • 1
    copy /b … outputfile does exactly what cat … >outputfile does. The /b flag tells copy not to mess up the data, and the syntax of copy is weird, but under the hood they do the same job. – Gilles 'SO- stop being evil' Nov 15 '11 at 23:31
  • 1
    @Gilles - thanks, that makes me feel better. The whole point of the question was just to make sure I'm doing this the 'right' way - and from the response it seems very apparent that cat is in fact the best way. – cwd Nov 15 '11 at 23:44
  • @rozcietrzewiacz: I think the split command constructs its output file names in a manner that isn't susceptible to locale-specific reordering. (Though I suppose you could create a customized locale in which the 26 lowercase Latin letters aren't in their usual order.) – Keith Thompson Nov 16 '11 at 02:00
  • 8
    @cwd: Looking at split.c in GNU Coreutils, the suffixes are constructed from a fixed array of characters: static char const *suffix_alphabet = "abcdefghijklmnopqrstuvwxyz";. The suffix wouldn't be affected by the locale. (But I don't think any sane locale would reorder the lowercase letters; even EBCDIC maintains their standard order.) – Keith Thompson Nov 16 '11 at 02:04
  • @Keith & cwd: Sorry, I overlooked the first prompt. In case of files produced with split, I agree with Keith. I was referring to a general habit of concatenating files. And, more broadly, feeding a list of files to a command. – rozcietrzewiacz Nov 16 '11 at 08:06
  • @Davide notes: "Tip: to be sure that no errors occurred when splitting and joining, calculate a hash of the source (before splitting) and compare it with the file resulting from the merge; if the two hashes match, you can be sure the procedure produced no errors. So when giving out split files, always give the hash as well." – drs Jan 21 '15 at 15:36
  • 3
    @Peter.O you can nest brace expansion cat x{{a..j}{a..z},k{a..f}} > myImage.iso. That will expand from xaa to xkf. – Madacol Mar 01 '20 at 22:46

6 Answers

79

That's just what cat was made for. Since it is one of the oldest GNU tools, I think it's very unlikely that any other tool does that faster/better. And it's not piping - it's only redirecting output.
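
For example, a join with the piece names spelled out explicitly (assuming the default split names xaa..xad), so the result doesn't depend on glob expansion order:

cat xaa xab xac xad > myImage.iso

or, with brace expansion doing the spelling-out for you:

cat xa{a..d} > myImage.iso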

  • The cat x, then press Esc trick you mentioned is neat.. I've been looking for something like that, thanks... good comment and good answer – Peter.O Nov 15 '11 at 12:46
  • 2
    You're welcome :) Also, when you have that list of files on the command line, you can use Ctrl+W to cut out a word and then Ctrl+Y to paste it. – rozcietrzewiacz Nov 15 '11 at 12:50
  • cat means "concatenate" – JoelFan Nov 15 '11 at 16:12
  • 7
    .. and "catenate" derrives from a Latin word "catena" which means "a chain".. concatenating is joining up the links of a chain. ... (and a bit further off-topic, a catenary curve also derrives from "catena". It is the way a chain hangs) – Peter.O Nov 15 '11 at 17:03
26

Under the hood

There is no more efficient way than copying the first file, then copying the second file after it, and so on. Both DOS copy and cat do that.

Each file is stored independently of other files on the disk. Almost every filesystem designed to store data on a disk-like device operates by blocks. Here's a highly simplified presentation of what happens: the disk is divided into blocks of, say, 1kB, and for each file the operating system stores the list of blocks that make it up. Most files aren't an integer number of blocks long, so the last block is only partially occupied. In practice, filesystems have many optimizations, such as sharing the last partial block between several files or storing “blocks 46798 to 47913” rather than “block 46798, block 46799, …”.

When the operating system needs to create a new file, it looks for free blocks. The blocks don't have to be consecutive: if only blocks 4, 5, 98 and 178 are free, you can still store a 4kB file. Using blocks rather than going down to the byte level helps make finding free blocks for a new or growing file considerably faster, and reduces the problems due to fragmentation when you create or grow and delete or shrink a lot of files (leaving an increasing number of holes).

You could support partial blocks in mid-file, but that would add considerable complexity, particularly when accessing files non-sequentially: to jump to the 10340th byte, you could no longer jump to the 100th byte of the 11th block, you'd have to check the length of every intervening block.
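
To make that arithmetic concrete (a toy example with the fixed 1 kB blocks assumed above):

echo $(( 10340 / 1024 )) $(( 10340 % 1024 ))    # prints: 10 100

i.e. skip 10 full blocks, then take the 100th byte of the 11th block. With variable-length blocks there is no such formula; you would have to walk the block list.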

Given the use of blocks, you can't just join two files, because in general the first file ends in mid-block. Sure, you could have a special case, but only if you want to delete both files when concatenating. That would be highly specific handling for a rare operation. Such special handling doesn't exist in isolation, because on a typical filesystem many files are being accessed at the same time. So if you want to add an optimization, you need to think carefully: what happens if some other process is reading one of the files involved? What happens if someone tries to concatenate A and B while someone is concatenating A and C? And so on. All in all, this rare optimization would be a huge burden.

In short, you can't make joining files more efficient without making major sacrifices elsewhere. It's not worth it.

On splitting and joining

split and cat are simple ways of splitting and joining files. split takes care of producing files named in alphabetical order, so that cat * works for joining.

A downside of cat for joining is that it is not robust against common failure modes. If one of the files is truncated or missing, cat will not complain; you'll just get damaged output.
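
One common way to guard against that (a suggestion only, not something split or cat do for you; it assumes a checksum tool such as sha256sum and uses a hypothetical checksum file name) is to record a hash before splitting and verify it after joining:

sha256sum myImage.iso > myImage.iso.sha256   # before splitting
split -b 100m myImage.iso
# ... transfer or store the pieces ...
cat x?? > myImage.iso
sha256sum -c myImage.iso.sha256              # fails loudly if a piece was lost or truncated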

There are compression utilities that produce multipart archives, such as zipsplit and rar -v. They aren't very unixy, because they compress and pack (assemble multiple files into one) in addition to splitting (and conversely unpack and uncompress in addition to joining). But they are useful in that they verify that you have all the parts, and that the parts are complete.

10

Seems like there should be a more efficient way than piping all of the contents through the system's stdin / stdout

Except that's not really what's happening. The shell is connecting the stdout of cat directly to the open file, which means that "going through stdout" is the same as writing to disk.
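
If you want to see that for yourself, one way (assuming strace is installed) is to trace the system calls, here with a hypothetical output name joined.iso:

strace -f -e trace=openat,read,write sh -c 'cat xaa xab > joined.iso'

The shell opens joined.iso and installs it as cat's standard output; cat itself only read()s each input file and write()s to file descriptor 1, so nothing is "displayed" or copied an extra time along the way.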

  • I was just imagining using cat to display several gigabytes of code in the console, then having it captured and put into a file. That's the mental image I have for what must be happening when I use cat and redirect the output that I can't see. It just seemed like if there was a way you could open two files, connect them, and then close them it would be more efficient than running through all of the lines of code with cat. Thanks for letting me know about the direct connection. – cwd Nov 15 '11 at 14:16
  • @cwd It would be possible to design a filesystem where you could join two files that way, but that would complicate the design of the filesystem immensely. You'd optimize for that one operation at the cost of making a lot of common tasks more complicated and slower. – Gilles 'SO- stop being evil' Nov 15 '11 at 23:29
  • @Gilles - it'd be interesting to know more about the low level details. To me, reading all of the sectors off the hard disk for several files and then dumping them back into other unused sectors on the disk seems inefficient. And I think large files must be stored across multiple blocks of free sectors at times because there may not always be enough blocks side by side to store them. Therefore theoretically you could join files into one by removing the EOF marker and pointing to the group of sectors at the start of the next file. *nix is powerful so I wondered if there was a better way than cat. – cwd Nov 15 '11 at 23:53
  • @cwd There's no “EOF marker”. No sane modern filesystem works like that, because it prevents some characters from occurring in files (or else requires complex encodings). But even if there was an EOF marker, most of the time, you would not have the right file after it. – Gilles 'SO- stop being evil' Nov 15 '11 at 23:59
  • I meant the concept of the EOF marker and not an actual EOF marker. Otherwise if you look at the bits and bytes of a file on the hard drive, how do you know where it ends? Do you specify the length of the file at the start of it? I'm talking about a really low level thing. Is that what you are also referring to? – cwd Nov 16 '11 at 00:04
6

File Splitting

Split By Size

If you want to split a big file into smaller files and choose the name and size of the output files, this is the way.

split -b 500M videos/BigVideoFile.avi SmallFile.

This splits the big file into smaller parts of 500 MB each, with the part files named using the prefix SmallFile. (note that you need the trailing dot after the prefix). The result should be new files like this:

SmallFile.aa SmallFile.ab SmallFile.ac SmallFile.ad SmallFile.ae SmallFile.af
SmallFile.ag SmallFile.ah SmallFile.ai SmallFile.aj SmallFile.ak
...

Split By Number Of Lines

This way you'll split a text file into smaller files of at most 50 lines each.

split -l 50 text_to_split.txt

The result should be something like this:

xaa xab xac ...

Split By Bytes

Split into smaller files of a custom size specified in bytes:

split -b 2048 BigFile.mp4

The result should be similar to the result from Split By Number Of Lines.

File Joining

You can join files in two ways. The first one is:

cat SmallFile.* > OutputBigVideoFile.avi

or with:

cat SmallFile.?? > OutputBigVideoFile.avi

Note: When joining, the small (part) files must not be damaged, and they should all be in the same directory.
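
As a quick sanity check (optional, and assuming you know what the original looked like), you can compare the joined file's size against the sum of its parts:

wc -c SmallFile.??               # the "total" line is the expected size
wc -c OutputBigVideoFile.avi     # should match that total

or, better, compare a checksum taken before splitting with one taken after joining.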

Nole
  • 181
  • 1
  • 2
3

I once had exactly this problem: I wanted to join some files, but did not have enough disk space to hold them twice over.

So I wrote a bunch of programs:

  • one to "suck up" a file by reading it, sending it to stdout and, if finished, removing it
  • and one to buffer data "on the fly".

This enabled me to do something like

partto sourcefile | mybuffer 128M >>cumufile

and thus remove the source file while 128M of data were still unwritten. A little dangerous, but if the data are not that precious, or exist somewhere else as well, it is feasible.

If needed, I can provide the source.
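
For illustration only, here is a minimal sketch of the "suck up" idea (a hypothetical reimplementation, not the author's actual program, and it carries the same risk: each source file is deleted as soon as it has been read):

#!/bin/sh
# suckup.sh - stream each given file to stdout, then delete it once it has been read in full
for f in "$@"; do
    cat -- "$f" && rm -- "$f"
done

The buffering side could be played by an existing tool such as mbuffer, if it is available, e.g. something like ./suckup.sh part.* | mbuffer -m 128M >> cumufile.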

glglgl
  • 1,210
1

Technically speaking, this is a way of getting at the whole joined file without writing a second full copy of its contents to disk, which could be useful for huge files or when there is little space left:

$ mkfifo myImage.iso
$ cat xa{a..g} > myImage.iso &

And then use myImage.iso, for example

$ md5sum myImage.iso

Though of course myImage.iso is a special file (named pipe) and not a regular file, so this may be of use or not depending on what you're trying to do.
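
One caveat: a named pipe can only be read once per writer, so the background cat has to be restarted for each consumer. A rough sketch, assuming the default split names xaa..xag:

$ mkfifo myImage.iso
$ cat xa{a..g} > myImage.iso &     # the writer blocks until a reader opens the pipe
$ sha256sum myImage.iso            # first consumer drains the pipe
$ cat xa{a..g} > myImage.iso &     # restart the writer before the next read
$ wc -c < myImage.iso              # second consumer, e.g. checking the total size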

golimar
  • 417