
I have a text file with two columns and more than 300,000 rows. The format is as below:

Filename1.txt Num1
Filename2.txt Num2
Filename3.txt Num3

I want to copy all the filenames for which the corresponding Numx is greater than 50 and less than 200 into a different file.

Once I copy those file names into a different file, I want to copy all of those files into a different folder.

How do I do that?

Innocent

3 Answers


If you want, you can do the comparison and the copying in one step with awk:

awk '$2>50 && $2<200 {system("cp -- "$1" /path/to/destination/")}' file.txt

This assumes you want to copy the files into /path/to/destination/; change that path to suit your needs.

  • $2>50 && $2<200 does the required comparison

  • if a line matches, the cp operation ({system("cp -- "$1" /path/to/destination/")}) is executed via awk's system() function
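
With more than 300,000 rows it can be worth doing a dry run before copying anything. A minimal sketch (using print instead of system(), with the same placeholder destination) simply prints the cp commands that would be executed:

awk '$2>50 && $2<200 {print "cp --", $1, "/path/to/destination/"}' file.txt

Once that output looks right, switch back to the system() form above.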

heemayl

Let's consider this test file:

$ cat file
Filename1.txt 49
Filename2.txt 72
Filename3.txt 189
Filename4.txt 203

To select only those files for which the second column is greater than or equal to 50 and also less than or equal to 200:

$ awk '$2>=50 && $2<=200 { print $1}' file
Filename2.txt
Filename3.txt

To put those file names in a new file at some path:

awk '$2>=50 && $2<=200 { print $1}' file >/path/to/newfile

Copying the selected files

Assuming that the numbers are integers, try:

while read fname num; do [ "$num" -ge 50 ] && [ "$num" -le 200 ] && cp -- "$fname" /some/path/ ; done <file

Or, for those who prefer their code spread over multiple lines:

while read fname num
do
   [ "$num" -ge 50 ] && [ "$num" -le 200 ] && cp -- "$fname" /some/path/
done <file
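
If you have already written the selected names to /path/to/newfile as in the previous step, a similar loop (a sketch; the paths are the same placeholders as above) can copy from that list instead:

while read -r fname
do
   cp -- "$fname" /some/path/
done </path/to/newfile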
John1024

The question's tags suggest there is interest in an answer that uses regular expressions. The question also indicates that the input data file is large, so I assume that performance is a consideration.

I also assume that, since the input file contains one filename per line, there are no (pathological) filenames containing newline characters.

The other answers effectively spawn a cp process for every file, which causes an unnecessary performance hit. Instead, we can use xargs to call cp with as many filenames as will fit on each command line.

sed -rn 's/ (5[1-9]|[6-9].|1..)$//p' input.txt | tr '\n' '\0' | xargs -0 cp -t /destdir

The sed command uses a regular expression to match numbers in the open interval (50, 200), i.e. strictly greater than 50 and strictly less than 200. Using regular expressions for numerical inequalities is not always the most elegant approach, but in this case the required expression is fairly straightforward.
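
As a quick sanity check (borrowing the sample rows from the earlier answer, and assuming GNU sed), the expression keeps exactly the rows whose numbers lie strictly between 50 and 200:

$ printf '%s\n' 'Filename1.txt 49' 'Filename2.txt 72' 'Filename3.txt 189' 'Filename4.txt 203' | sed -rn 's/ (5[1-9]|[6-9].|1..)$//p'
Filename2.txt
Filename3.txt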

We are assuming that the filenames contain no newlines, but they may contain other unhelpful characters, such as spaces. xargs will handle this correctly if given \0-delimited data, so we use tr to convert all newlines to null characters.
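
To convince yourself that a filename containing a space survives the null-delimited hand-off as a single argument, here is a small check (a sketch; the filename is hypothetical and a standalone printf stands in for cp):

$ printf 'my file.txt\n' | tr '\n' '\0' | xargs -0 printf '[%s]\n'
[my file.txt]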

The above assumes the GNU versions of sed and xargs. If you instead have the BSD versions (e.g. on OS X), the command is slightly different:

sed -En 's/ (5[1-9]|[6-9].|1..)$//p' input.txt | tr '\n' '\0' | xargs -0 -J {} cp {} /destdir

These commands spawn exactly one copy each of sed, tr and xargs. There will likely be more than one invocation of cp, but each one will copy multiple files: xargs tries to fill up each cp command line to achieve efficient utilisation. This should provide a significant performance improvement over the other answers when the input data is large.