8

I want to monitor a single directory (not recursively) for newly created files and append the contents of those files to one big file as they are written.

The number of files being written is huge and could reach as many as 50,000.

I am monitoring the directory with inotifywait like this:

inotifywait -m -e create ~/folder | awk '($2=="CREATE"){print $3}' > ~/output.file

This stores the names of newly created files in ~/output.file, which I then process with a for loop:

for FILE in `cat ~/output.file` 
do
    cat $FILE >> ~/test.out
done

This works fine when files are created in ~/folder at a rate of about 1 file per second.

But the real workload is much heavier: files are created at a very high rate, around 500 files per minute or even more.

After the process completed, I counted the files in ~/folder, and the count does not match the inotifywait output. The difference is about 10–15 files and varies between runs.

Also, the loop

for FILE in `cat ~/output.file`
do
    cat $FILE >> ~/test.out
done

doesn't process all the files in ~/output.file as they are being written.

Can anyone suggest an elegant solution to this problem?

2 Answers

6

There is no need to post-process the output: use the inotifywait options --format and --outfile.
If I run:

inotifywait -m --format '%f' -e create /home/don/folder/ --outfile /home/don/output.file

then open another tab, cd to ~/folder and run:

time seq -w 00001 50000 | parallel touch {}

real 1m44.841s
user 3m22.042s
sys 1m34.001s

(so I get much more than 500 files per minute) everything works fine and output.file contains all the 50000 file names that I just created.
Once the process has finished writing the files to disk you can append their content to your test.out (assuming you are always in ~/folder):

xargs < /home/don/output.file cat >> test.out

Or use read if you want to process files as they are created. So, while in ~/folder you could run:

inotifywait -m --format '%f' -e create ~/folder | while IFS= read -r file; do cat -- "$file" >> ~/test.out; done
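
As a hedged sketch of the streaming variant above, the loop body can be pulled into a small function that reads one file name per line from standard input and appends each file to the output. The function name append_listed is illustrative, and the use of close_write in the trailing comment is an assumption of this sketch (so that each file is complete before it is read); neither comes from the answer itself.

```shell
# Hypothetical helper (the name is illustrative): read one file name per
# line from stdin and append each file's contents to the output file.
# Usage: append_listed DIR OUTFILE
append_listed() {
    dir=$1
    out=$2
    while IFS= read -r name; do
        # Skip names that have vanished or are not regular files.
        [ -f "$dir/$name" ] || continue
        cat -- "$dir/$name" >> "$out"
    done
}

# Assumed wiring to the monitor (not run here):
#   inotifywait -m -q --format '%f' -e close_write ~/folder |
#       append_listed ~/folder ~/test.out
```

Because the directory is passed explicitly, this version does not depend on the current working directory, and IFS= read -r keeps names with spaces or backslashes intact. File names containing newlines would still break it.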
don_crissti
  • I want to do things in parallel to save time: creating the small files and appending them as they are created. So awk filters the created files out of the full list that inotify generates. – rohitkulky May 26 '13 at 17:01
  • Hey don, this works just fine! I had come across this earlier, but could not get things to work somehow. Thanks! :) – rohitkulky May 26 '13 at 18:09
  • You can put this comment in the answer for clarity, others' sake! :) – rohitkulky May 26 '13 at 18:10
  • Sorry to bring this up late. The script above works fine, as I said, but once file creation in the directory is over, inotifywait keeps running indefinitely, so I have to kill the process manually. Is there an elegant way to handle this? The --timeout option waits only for the first event and then exits. Thanks! – rohitkulky Jun 10 '13 at 09:01
  • @rohitvk - You cannot use monitor and timeout together with the current version, you'll have to install the git version. Answer updated. – don_crissti Jun 10 '13 at 12:38
0

One thing you could do is write a small program that moves each file out of the directory to another one once it has been processed, then restarts the scan of the directory when it is done. If no files are there, sleep for a reasonable amount of time before re-scanning, and keep doing this for as long as files are being generated (the process generating the files seems to run for only up to 100 minutes or so).
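
A minimal sketch of this move-after-processing idea, under some assumptions: the function name drain_dir, the directory names in the trailing comment and the 5-second pause are all illustrative, and every file found is assumed to be completely written by the time it is scanned.

```shell
# Hypothetical helper: append every regular file in SRC to OUTFILE, then
# move it into DONEDIR so that the next scan no longer sees it.
# Usage: drain_dir SRC DONEDIR OUTFILE
drain_dir() {
    src=$1
    done_dir=$2
    out=$3
    mkdir -p "$done_dir"
    for f in "$src"/*; do
        # Skips non-files and the literal pattern left by an unmatched glob.
        [ -f "$f" ] || continue
        # Move the file only if appending it succeeded.
        cat -- "$f" >> "$out" && mv -- "$f" "$done_dir"/
    done
}

# Assumed driver loop (not run here); the pause length is arbitrary:
#   while :; do drain_dir ~/folder ~/processed ~/test.out; sleep 5; done
```

The shell glob skips dot files and expands in sorted name order, so files are appended alphabetically rather than in creation order. A file that is still being written when the scan reaches it could be moved half-finished, which is why the completeness assumption matters.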

If you cannot move the files out of the directory, another approach is to start with a date-time stamp DTS somewhere in the past. Find all the files newer than DTS, process them, and update DTS whenever a file's timestamp is newer than DTS. Repeat this in a loop, as with the solution above. If the granularity of your timestamps prevents two files from having the same one, you can just look for files newer than DTS. If not, you have to look for files not older than DTS, keep a list of the files whose timestamp equals the DTS you are going to use on the next run, and filter those out on that run.
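
The timestamp approach can be sketched with find, using a marker file whose modification time plays the role of DTS. This is only an illustration under stated assumptions: the function name append_newer and the marker path are hypothetical, -maxdepth is a GNU/BSD find extension, files are assumed complete when scanned, and timestamps are assumed fine-grained enough that no file gets exactly the marker's timestamp after a scan has passed it. Without that last assumption, the list-and-filter step described above is still needed.

```shell
# Hypothetical helper: append files in DIR modified after the marker's
# timestamp (the DTS), then advance the marker for the next run.
# Usage: append_newer DIR MARKER OUTFILE   (MARKER must live outside DIR)
append_newer() {
    dir=$1
    marker=$2
    out=$3
    # First run: start from a timestamp well in the past.
    [ -e "$marker" ] || touch -t 200001010000 "$marker"
    # Fix the upper bound before scanning; later files wait for the next run.
    touch "$marker.next"
    # Select files with old DTS < mtime <= new DTS, in this directory only.
    find "$dir" -maxdepth 1 -type f -newer "$marker" ! -newer "$marker.next" \
        -exec cat -- {} + >> "$out"
    # The rename keeps the timestamp, so the upper bound becomes the new DTS.
    mv -- "$marker.next" "$marker"
}
```

Calling this repeatedly, with a sleep in between, processes each file once as long as its modification time does not change after it has been picked up. Files are appended in whatever order find returns them, not in chronological order.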

Anthon