
I have a text file with three tab-separated columns, and I read the third column line by line to find all files in a directory whose names contain that string. Since the file has up to 1000 entries, my attempt to solve it with "find" is not suitable because it takes too much time.

while read f; do
    var1=`echo "$f" | cut -f1`
    var2=`echo "$f" | cut -f2`
    var3=`echo "$f" | cut -f3`
    echo "\n ID1 = $var1 \n ID2 = $var2 \n\n Path:"
    find //myDirectory/ -type f -name *$var3* -not -path '*/zz_masters/*' -exec ls -Sd {} +
    echo "\n----------------------"
done >> /SearchList.txt < /ResultList.txt

As you can see, one folder is excluded and the results are sorted by size because some files are in different resolutions.

Searchlist.txt:

a1 a    1 x1    Trappist
b2 b    2 y2    Mars
c3 c    3 z3    Pegasi

Result:

/myDirectory/

ID1 = a1 a ID2 = 1 x1

Path: /myDirectory/xx/Trappist-1.png /myDirectory/xx/Trappist-2.png


ID1 = b2 b ID2 = 2 y2

Path: /myDirectory/yy/Mars-1.jpg


ID1 = c3 c ID2 = 3 z3

Path: /myDirectory/xx/51PegasiB.tif


In the hope that it works faster, I tried it with Perl. I am new to Perl, but my results are disappointing and I am stuck in the script: it ends up in a loop. That's where I'm at:

perl find.pl /myDirectory/ /SearchList.txt /ResultList.txt

#!/usr/bin/perl -w
use strict;
use warnings;
use File::Find;

open (IN, "$ARGV[1]") or die;
open(my $fh_out, '>', "$ARGV[2]");

my @files;

print $fh_out "$ARGV[0]\n";

while (my $line = <IN>) {
    chomp $line;
    my @columns = split(/\t/, $line);

    find(sub {
        push @files, $File::Find::name if /$columns[2]/;

I think print has to be inside sub but each search result shows separately and is still slow:

        print $fh_out "\n\n----------------------------\n",
            "ID1: $columns[0]\nID2: $columns[1]\nSearchstring: $columns[2]\nPath:\n",
            "$File::Find::name\n" if /$columns[2]/;

}, $ARGV[0]);

Outside the sub it displays the search results together, but it is also slow and loops :(

print $fh_out "\n\n----------------------------\n ID1: $columns[0]\nID2: $columns[1] Searchstring: $columns[2]\n\nPath:\n", join "\n", @files;

}

close IN;
close $fh_out;

exit;

Will Perl possibly not give the speed increase I want, and if not, what alternatives would there be?

spazek

2 Answers


A code review of your bash code:

  • read can pick out the words for you
  • echo "\n" won't print a newline
  • use $(...) instead of `...` - ref
  • use proper indentation
  • Be more careful with your redirection symbols
while read -r var1 var2 var3 rest; do
    printf "\n ID1 = %s \n ID2 = %s \n\n Path:\n" "$var1" "$var2"
    find //myDirectory/ -type f -name "*$var3*" -not -path '*/zz_masters/*' -exec ls -Sd {} +
    # ........................ quoted ^.......^
    printf "\n----------------------\n"; 
done < /SearchList.txt > /ResultList.txt
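To see the first bullets concretely, here is a small self-contained sketch (the sample line is made up for illustration) showing printf versus echo, and read splitting the tab-separated fields itself:

```shell
# printf interprets \n portably; with echo "\n" the result depends on the shell
printf 'ID1 = %s\n' "a1 a"

# read -r with several variable names splits a tab-separated line on its own,
# so no per-field `cut` subshells are needed
printf 'a1 a\t1 x1\tTrappist\n' |
while IFS=$'\t' read -r var1 var2 var3 rest; do
    printf '%s | %s | %s\n' "$var1" "$var2" "$var3"
done
```

Avoiding the three `cut` subshells per input line is itself a noticeable speedup over 1000 entries.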

However the way to speed this up is to only run find once:

id1=()
id2=()
substrings=()
names=( -false )
declare -A paths=()

while read -r var1 var2 var3 rest; do
    id1+=( "$var1" )
    id2+=( "$var2" )
    substrings+=( "$var3" )
    names+=( -o -name "*$var3*" )
done < /SearchList.txt

# feed the loop via process substitution so paths[] survives the loop
while read -d '' -r size name; do
    for s in "${substrings[@]}"; do
        if [[ $name == *"$s"* ]]; then
            paths[$s]+="$name"$'\n'
            break
        fi
    done
done < <(find /myDirectory/ -type f \( "${names[@]}" \) \
             -not -path '*/zz_masters/*' -printf '%s %p\0' |
         sort -znr)

fmt="\n ID1 = %s \n ID2 = %s \n\n Path:\n%s\n----------------------\n"

for idx in "${!id1[@]}"; do
    printf "$fmt" "${id1[idx]}" "${id2[idx]}" "${paths[${substrings[idx]}]}"
done > /ResultList.txt
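A side note on collecting results into paths[], since it is a common trap: in bash, every stage of a pipeline runs in a subshell, so variables assigned inside a pipeline-fed while loop vanish when the loop ends (zsh behaves differently for the last pipeline stage). A minimal sketch with a made-up counter:

```shell
count=0
printf 'a\nb\n' | while read -r line; do
    count=$((count + 1))
done
# in bash, count is still 0 here: the loop ran in a subshell

count=0
while read -r line; do
    count=$((count + 1))
done <<'EOF'
a
b
EOF
# count is 2 here: a redirection keeps the loop in the current shell
echo "$count"
```

This is why the loop that fills paths[] must be fed by a redirection (e.g. process substitution) rather than sitting at the end of a pipeline.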

glenn jackman
  • I edited my post because it was not clear that the columns can contain spaces and it is not just png. Your improvements and precise explanations of my script made some things clear to me. The second script of yours I have not really understood yet. There are problems with declare -A in bash (I'm on a Mac); with /bin/zsh it should work, though. However, I get an error message: find: -printf: unknown primary or operator -: line 28: *Trappist*: syntax error: operand expected (error token is "*Trappist*") – spazek Jan 18 '21 at 23:58
  • echo "\n" outputs two newlines in many echo implementations, including the builtin echo of bash in some builds/environments. That's a UNIX (POSIX+XSI) requirement, so it would be the case for the /bin/sh of macOS (which AFAIK is still bash). – Stéphane Chazelas Jan 19 '21 at 15:30

You could try this if your file names don't contain tabs or newlines:

find . -type f -print |
awk '
    NR==FNR {
        name2ids[$3][1] = $1
        name2ids[$3][2] = $2
        next
    }
    {
        for (name in name2ids) {
            if ( index($NF,name) ) {
                matches[name][$0]
            }
        }
    }
    END {
        for (name in name2ids) {
            print "ID1 =", name2ids[name][1]
            print "ID2 =", name2ids[name][2]
            print "\nPath:"
            if (name in matches) {
                for (file in matches[name]) {
                    print file
                }
            }
        }
    }
' FS='\t' SearchList.txt FS='/' -

The above uses GNU awk for arrays of arrays, here is a POSIX version (untested):

find . -type f -print |
awk '
    NR==FNR {
        name2ids[$3] = $1 RS $2
        next
    }
    {
        for (name in name2ids) {
            if ( index($NF,name) ) {
                matches[name] = (name in matches ? matches[name] RS : "") $0
            }
        }
    }
    END {
        for (name in name2ids) {
            split(name2ids[name],ids,RS)
            print "ID1 =", ids[1]
            print "ID2 =", ids[2]
            print "\nPath:"
            split(matches[name],files,RS)
            for (idx in files) {
                print files[idx]
            }
        }
    }
' FS='\t' SearchList.txt FS='/' -
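The trailing `FS='\t' SearchList.txt FS='/' -` arguments deserve a note: awk evaluates a `var=value` operand just before reading the file that follows it, so each input gets its own field separator — tabs for SearchList.txt, slashes for the find output, which makes $NF the bare file name. A minimal sketch with made-up inputs:

```shell
printf 'a1\t1 x1\tTrappist\n' > search.tmp
printf '/myDirectory/xx/Trappist-1.png\n/myDirectory/yy/Mars-1.jpg\n' |
awk '
    NR==FNR { want[$3]; next }      # file 1: FS is tab, $3 is the search string
    {
        for (w in want)             # file 2: FS is /, $NF is the file name
            if (index($NF, w)) print "match:", $0
    }
' FS='\t' search.tmp FS='/' -
rm -f search.tmp
```

Only the Trappist line is printed, because the search string is compared against the final slash-separated field rather than the whole path.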
Ed Morton
  • Dear drive-by downvoter - do you have any particular reason for the downvote that you'd care to share? – Ed Morton Jan 18 '21 at 23:16
  • That's really fast! Now I just have to figure out how to write that into the file along with the associated IDs. – spazek Jan 19 '21 at 00:15
  • Your script returns the search results as a list (it doesn't matter that the order is not the same as in the search text). I need the results in the form as described above in groups (ID1 \nId2\nPath\n matching search results) . Since I am still far from understanding how exactly the script works, I cannot set the necessary print commands correctly. It would be great if you could also write a few words of explanation (would be much appreciated). Why e.g. at the end the FS='/' ? – spazek Jan 19 '21 at 13:14
  • I updated it so it will now produce output grouped by the search strings and will also now output the IDs from SearchList. I'm setting FS to / when reading the find output to separate the directory path into fields so that $NF contains the file name. – Ed Morton Jan 19 '21 at 14:01
  • I'm honestly not sure which parts of it need any explanation, it seems really clear to me (but of course I'm biased) - is there anything in particular that you have a question about? – Ed Morton Jan 19 '21 at 14:07
  • I should have added that the script should run on various machines, so GNU awk is not really ideal here. The script works really great though! Thanks a lot! – spazek Jan 19 '21 at 22:36
  • I posted a POSIX version, give it a try and let me know of any issues. – Ed Morton Jan 19 '21 at 23:11
  • After I removed the FS='/' at the end, it runs fine. There are slashes in the IDs from time to time... – spazek Jan 19 '21 at 23:44
  • Ah, right, I forgot the split() in the END would be using the final FS value, I've fixed that now. The tool can't work without FS='/' though as that's separating the directory path from the file name that you want to compare against your search strings. If you remove FS='/' then you'll be searching the whole path to the file for each string instead of just the file name. It might give you the output you expect given the input you have today, but it'd fail with different input some other time. – Ed Morton Jan 20 '21 at 00:51