One of our scripts writes the contents of two files to a third file, and the command below takes 4 minutes. File 1 holds a header record and file 2 holds 4 GB of data. In file 3 the header record should be at the top, followed by the contents of file 2.

Is there a better way to achieve this in less than 4 minutes?

cat file1 file2 > file3

Thanks Raghu

Chris Davies
Raghu
  • Are you running that command in a loop, or are you writing a script that needs to be done in a particular amount of time and four minutes is too long? The way I see it is that if you had time to measure the time, you had time to wait for it to finish, and now it's done. – Kusalananda Mar 07 '21 at 12:37

2 Answers

0

In a couple of local tests, the sed h file >> destination command appeared to be about 66% faster than cat. You would have to rewrite your script a little to add a second command, because sed only takes a single file argument, but either way it still came out faster.

EDIT: The tests used 4 GB files of random text and Unicode characters; times were measured with the time command.
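A minimal sketch of the two-command variant described above, for anyone who wants to time it on their own data. The helper name concat_header_data is hypothetical and used only for illustration; the sed h invocation is the one quoted in this answer. Whether it beats a single cat will depend on the system, so measure before changing the script.

```shell
#!/bin/sh
# Sketch only: concat_header_data is a hypothetical helper name, not from the answer.
# Usage: concat_header_data HEADER_FILE DATA_FILE DEST_FILE
concat_header_data() {
    # First command: start the destination with the header (overwrites old content).
    cat "$1" > "$3" || return 1
    # Second command: append the data with the sed invocation quoted in the answer.
    # "h" copies each line to the hold space; sed's default print emits it unchanged.
    sed h "$2" >> "$3"
}
```

With the question's file names this would be called as concat_header_data file1 file2 file3, optionally prefixed with time to compare against the original cat.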

0

My old laptop with an HDD is about twice as fast as your timings.

I suspect you may be running the cat from BusyBox, not the optimised stand-alone cat.
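One way to check which cat is actually being run is to resolve the command to the file behind it. This is a rough sketch: resolve_tool is a hypothetical helper name, and it assumes readlink -f is available (it is in GNU coreutils and in BusyBox). On BusyBox systems the applets are usually symlinks, so the resolved path tends to end at the busybox binary.

```shell
#!/bin/sh
# resolve_tool is a hypothetical helper name used only for this sketch.
# It prints the file that actually runs when the named command is invoked.
resolve_tool() {
    tool_path=$(command -v "$1") || return 1
    # Follow symlinks; on BusyBox systems this typically ends at the busybox binary.
    readlink -f "$tool_path"
}

resolve_tool cat
```

GNU cat also identifies itself when run as cat --version, which is a quick second check.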

I checked timings for four commands, and they all came out in the same ballpark (within 10%). I used GNU cat, sed, awk and dd. I cleared caches before each test (in another window, under sudo) with:
echo 3 > /proc/sys/vm/drop_caches

sed does (incidentally) process multiple input files.

$ time cat Timer1 Timer2 > Timer3

real 1m57.536s
user 0m0.072s
sys 0m20.456s
$
$ time sed -e '1n' Timer1 Timer2 > Timer3

real 1m54.450s
user 0m15.924s
sys 0m23.420s
$
$ time awk 1 Timer1 Timer2 > Timer3

real 2m0.080s
user 0m21.752s
sys 0m21.444s
$
$ time { cat Timer1 > Timer3
> dd status=none conv=notrunc oflag=append bs=100M if=Timer2 of=Timer3
> }

real 2m9.426s
user 0m0.012s
sys 0m18.260s
$
$ ls -lh Timer?
-rw-r--r-- 1 paul paul 17 Mar 7 11:01 Timer1
-rw-r--r-- 1 paul paul 3.7G Mar 7 11:03 Timer2
-rw-r--r-- 1 paul paul 3.7G Mar 7 11:50 Timer3
$
$ ls -l Timer?
-rw-r--r-- 1 paul paul 17 Mar 7 11:01 Timer1
-rw-r--r-- 1 paul paul 3942530050 Mar 7 11:03 Timer2
-rw-r--r-- 1 paul paul 3942530067 Mar 7 12:06 Timer3

This suggests the timings are dominated by I/O performance, and the command used is much less significant. (Using a shell read loop would still not be a great idea.)
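To put a number on that, the output size and elapsed time can be turned into an effective write rate. The sketch below uses a hypothetical helper, throughput_mib, and plugs in the size of Timer3 and the real time of the cat run above. It counts only the bytes written; the same amount is also being read from the same disk.

```shell
#!/bin/sh
# throughput_mib is a hypothetical helper: bytes written and elapsed seconds in, MiB/s out.
throughput_mib() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.2f\n", bytes / secs / (1024 * 1024) }'
}

# Figures from the cat run above: size of Timer3 and 1m57.536s of real time.
throughput_mib 3942530067 117.536
```

That works out to roughly 32 MiB/s written, a rate that points at the disk rather than at any of the four tools.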

Notably, though, cat and dd use far less user time than the edit tools.

Paul_Pedant