4

I'd like to do the following using a shell script (for simplicity, I use the same data for INPUT; in the real case, the data changes with the loop variable jj):

#!/bin/sh
for jj in `seq 100`; do
    cat INPUT.file >> OUTPUT.file
done

However, this is very inefficient, since the files are opened and closed on every iteration of the loop. When INPUT.file is large, this code will be very slow. So I am wondering if there is a way to have a buffer, something like a pre-allocated variable in C.

Mohammad
  • 688
  • I found out I can do "Creating a ram disk on Linux" (http://www.jamescoyle.net/how-to/943-create-a-ram-disk-in-linux). Unfortunately, I don't have root privileges. Is this the only way to speed up this operation? – Xiaopeng Huang Sep 16 '15 at 16:25
  • You still haven't said what you intend to achieve with this. – muru Sep 16 '15 at 16:26
  • Thanks for your attention. What I intended to do is essentially inserting different data into different places (~4000 places) of a large file. I use shell with sed/awk. I use a loop to generate the data and insert it into the large file. So the code above is what I came up with, and it is not efficient. – Xiaopeng Huang Sep 16 '15 at 16:37
  • 3
    You do realise this loop runs only twice? – Chris Davies Sep 17 '15 at 07:02
  • for k in blah foo; do cat "$k" >> bar; done (or the prefix one) -> cat blah foo >> bar. Cat is used to Concatenate FILE(s). – Mingye Wang Sep 17 '15 at 18:35
  • Actually the loop runs 100 times since seq 100 generates numbers from 1 to 100. (grave accent (`) performs a command substitution). – Xiaopeng Huang Sep 22 '15 at 15:55

3 Answers

6

Thanks to Stéphane Chazelas's answer to "Why there is such a difference in execution time of echo and cat?", muru's answer can be improved a little by calling cat only once (however, this "little" may become a lot for big data and numerous loop iterations; on my system, this script takes ~75% of the time the loop-based script takes):

#!/bin/sh
yes INPUT.file | head -100 | xargs cat >> OUTPUT.file
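If, as mentioned in the comments, each iteration actually uses a different input file, the same idea still applies: generate the list of file names and hand them all to a single cat invocation. A minimal sketch, assuming hypothetical per-iteration files named INPUT.1.file through INPUT.100.file (and file names without whitespace, since xargs splits on it):

#!/bin/sh
# Hypothetical file names; adjust to the real naming scheme.
for jj in `seq 100`; do
    printf 'INPUT.%s.file\n' "$jj"
done | xargs cat >> OUTPUT.file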
Mohammad
  • 688
4

Consider redirecting the loop itself:

#!/bin/sh
for jj in `seq 100`; do
    cat INPUT.file
done >> OUTPUT.file
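The point is that OUTPUT.file is opened just once, for the whole loop, rather than once per iteration. An equivalent sketch using a command group makes that single redirection explicit:

#!/bin/sh
# The group is redirected as a whole, so OUTPUT.file is opened only once.
{
    for jj in `seq 100`; do
        cat INPUT.file
    done
} >> OUTPUT.file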
muru
  • 72,889
1

If speed is your main concern, then you may find that cat isn't fast enough at this task. You may want to write the constituent files to the output in parallel.

I knocked up a quick version of a parallel cat with the following caveats:

  1. all the input files must be regular files (so we know the size in advance).
  2. do not write to or truncate the input files while fcat is running.
  3. the output file mustn't already exist (to prevent accidents, and also to avoid wasting time reading what we're about to overwrite).

Obviously, this is a quick proof of concept, so it could be made more robust, but here's the idea:

fcat.c:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>


struct in_fd {
    int fd;
    int err;
    off_t start;
    struct stat s;
};

int main(int argc, char**argv)
{
    char *outfile = argv[--argc];

    if (argc < 2) {
        fprintf(stderr, "Usage: %s INFILE... OUTFILE\n", argv[0]);
        return 1;
    }

    struct in_fd *infiles = calloc(argc, sizeof *infiles);

    /* Open and stat all the input files in parallel, recording any per-file error */
#pragma omp parallel for
    for (int i = 1;  i < argc;  ++i) {
        struct in_fd *const input = infiles + i;
        char const *const filename = argv[i];
        input->err = 0;
        if ((input->fd = open(filename, O_RDONLY)) < 0) {
            perror(filename);
            input->err = errno;
            continue;
        }
        if (fstat(input->fd, &input->s)) {
            perror(filename);
            input->err = errno;
            continue;
        }
        if (!S_ISREG(input->s.st_mode)) {
            fprintf(stderr, "%s: not a regular file\n", filename);
            input->err = EINVAL;
            continue;
        }
    }

    /* Assign each input its starting offset in the output and compute the total size */
    off_t total = 0;
    for (int i = 1;  i < argc;  ++i) {
        if (infiles[i].err)
            return EXIT_FAILURE;
        infiles[i].start = total;
        total += infiles[i].s.st_size;
    }

    /* O_EXCL: refuse to run if the output file already exists */
    int out_fd = open(outfile, O_RDWR | O_CREAT | O_EXCL, 0666);
    if (out_fd < 0) {
        perror(outfile);
        return 1;
    }

    /* Extend the output to its final size so the whole file can be mapped */
    if (ftruncate(out_fd, total)) {
        perror(outfile);
        return 1;
    }

    /* On Linux, you might wish to add MAP_HUGETLB */
    char *out_mem = mmap(NULL, total, PROT_WRITE, MAP_SHARED, out_fd, 0);
    if (out_mem == MAP_FAILED) {
        perror(outfile);
        return 1;
    }

    /* Copy each input into its own slice of the mapped output, in parallel */
#pragma omp parallel for
    for (int i = 1;  i < argc;  ++i) {
        struct in_fd *const input = infiles + i;
        char *p = out_mem + input->start;
        char *end = p + input->s.st_size;
        input->err = 0;
        while (p < end) {
            ssize_t r = read(input->fd, p, end-p);
            if (r < 0) {
                if (errno != EINTR) {
                    perror(argv[i]);
                    input->err = errno;
                    break;
                }
            } else {
                p += r;
            }
        }
        close(input->fd);
    }


    if (munmap(out_mem, total)) {
        perror(outfile);
    }

    for (int i = 1;  i < argc;  ++i) {
        if (infiles[i].err) {
            unlink(outfile);
            return EXIT_FAILURE;
        }
    }

    return EXIT_SUCCESS;
}

Makefile:

CFLAGS += -Wall -Wextra
CFLAGS += -std=c99 -D_GNU_SOURCE
CFLAGS += -g -O2
CFLAGS += -fopenmp

all: fcat
.PHONY:all
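
For completeness, a minimal build-and-run sketch (assuming GCC with OpenMP support; the input names are hypothetical, and fcat refuses to run if OUTPUT.file already exists):

make
rm -f OUTPUT.file combined.check
# Optionally pin the OpenMP thread count, e.g. OMP_NUM_THREADS=12
./fcat part1.dat part2.dat part3.dat OUTPUT.file
cat part1.dat part2.dat part3.dat > combined.check
cmp OUTPUT.file combined.check   # the two outputs should be byte-identical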

My timing results with 12 threads show elapsed times of 0.2 seconds compared with 2.3 seconds for cat (median of three runs each, with hot cache, on 48 files totalling 138M).

Toby Speight
  • 8,678