Complete 10000 paths in a .txt document and check if files exist... with awk?

Question

I want to read all the files in my photo library and check whether they really exist. My AppleScript knowledge is good enough to do this, but the library is large and AppleScript is definitely not suited to the job: 10,000 files take 20 minutes. So I decided to do the most important parts of the script with shell scripts. I am quite inexperienced in the Unix world, though, and had to put myself through a two-day crash course on the internet. I have now reached the point where I would like to ask for your help!

Here are my experiments:

I'll embed it all in an AppleScript. Since a lot of files have to be edited, I thought it would be better to save them in temporary text files between the steps. In the first step, the database is read out. It'll only take a second:

Path | Name | ID | Reference | External hard disk name

2018/03/27/20180327-122110/TVTower.JPG|TVTower|hA3CRRfPSS6FXqk7IDobLw|0|
Projekte/BCT 2017/BCT Fotos GPS/BCT_GPS_001.JPG|BCT_A_GPS_001|hyvsQgiaR4e3ou7XIZ%Gjg|1|Media
Leo/Carmina Burana/Leo UdK/IMG_0626.JPG|IMG_0626|j7342DtGSmag7YVLN1Nzhg|1|Logic
Users/spazek/Desktop/WeTransfer/Bild 2.png|Bild 2|Sa7rckZiSd2bIiRVO0JidA|1|macOS

In the next step, the missing path parts are added:

/Users/spazek/Pictures/Fotos Library.photoslibrary/Masters/2018/03/27/20180327-122110/TVTower.JPG|TVTower|hA3CRRfPSS6FXqk7IDobLw|0|
/Volumes/Logic/Projekte/BCT 2017/BCT Fotos GPS/BCT_GPS_001.JPG|BCT_A_GPS_001|hyvsQgiaR4e3ou7XIZ%Gjg|1|Media
/Volumes/Logic/Leo/Carmina Burana/Leo UdK/IMG_0626.JPG|IMG_0626|j7342DtGSmag7YVLN1Nzhg|1|Logic
/Users/spazek/Desktop/WeTransfer/Bild 2.png|Bild 2|Sa7rckZiSd2bIiRVO0JidA|1|macOS

It takes 2:30 minutes with my solution for 10,000 files on my Mac. The running AppleScript seems to be at the limit of overload! Running in the Terminal.app, I see in the header of the window that there is always a jump between awk and bash... I guess there's something wrong.

In the next step I want to check the paths to see if they exist. Since it is similar to the previous script, it also takes a little longer. The last step writes missing files to a text file.

.

sqlite3  -separator $'|' /Users/spazek/Desktop/xsystx/systphotos.db 'select RKMaster.imagePath, RKMaster.name, RKMaster.uuid, RKMaster.fileIsReference, ( select RKVolume.name from RKVolume where RKVolume.modelId  = RKMaster.volumeId) from RKMaster' > /Users/spazek/Desktop/filelist1.txt

.

while read f; do
    var1=`echo "$f" | awk -F[=\|] '{print $1}'`;
    var2=`echo "$f" | awk -F[=\|] '{print $2}'` ;
    var3=`echo "$f" | awk -F[=\|] '{print $3}'` ;
    var4=`echo "$f" | awk -F[=\|] '{print $4}'` ;
    var5=`echo "$f" | awk -F[=\|] '{print $5}'` ;
    if  [ "$var4" == 0 ] ; then
        echo /Users/spazek/Pictures/Fotos Library.photoslibrary/Masters/"${f}" ;
    else
        if [ "$var5" == "macOS" ]; then
            echo /"${f}" ;
        else
            echo /Volumes/"$var5"/"${f}";
        fi;
    fi >> /Users/spazek/Desktop/filelist2.txt;
done < /Users/spazek/Desktop/filelist1.txt

.

while read f; do
    var1=`echo "$f" | awk -F[=\|] '{print $1}'`;
    var3=`echo "$f" | awk -F[=\|] '{print $3}'` ;
    test -f "$var1" || echo "$var1|$var3" >> /Users/spazek/Desktop/filelist3.txt;
done < /Users/spazek/Desktop/filelist2.txt

.

while read f; do
    var1=`echo "$f" | awk -F[=\|] '{print $1}'`;
    var2=`echo "$f" | awk -F[=\|] '{print $2}'` ;
    test -f "$var1" || echo "Name = $var2 \n Path = $var1 \n";
done > ~/Desktop/Photos_MissingItems.txt < /Users/spazek/Desktop/filelist3.txt

I would be very happy about help or suggestions to improve the scripts

So you simply need to read the paths and check if the file exists, right? If it doesn't exist you want to put it into a new file? — jesse_b, Mar 31 '18 at 22:25
Why not have awk use an array to contain the path prefixes, and then print them out before the line? — Ignacio Vazquez-Abrams, Mar 31 '18 at 22:26
Or just put them as another table in the SQLite database and perform a join? — Ignacio Vazquez-Abrams, Mar 31 '18 at 22:27
Doing this with shell and awk is NOT going to be any faster if you run awk repeatedly in a loop like that, and will probably be much slower - there's a lot of overhead every time you run awk, and shell interpreters are extremely slow at processing text files with read. In short, your code is probably the slowest possible way to do it. You WILL see huge performance increases if you do the whole job in awk. or perl. or python. or any other scripting language if you use it to read in the entire input file itself, rather than pipe it in one line at a time in a shell loop. — cas, Apr 01 '18 at 05:06
See Why is using a shell loop to process text considered bad practice? — cas, Apr 01 '18 at 05:07
@Jesse_b That's right! Read paths and check if they exist. To complete the paths: if the reference is "0", then the images are located in the media library and "/path to the library/" must be added. If the reference is "1", they are outside the media library. Then there is also a fifth column with the name of the volume. If it is the name of the startup disc, then the path is already complete (except for a missing "/")... otherwise, prefix "/volume/name of the volume/". — spazek, Apr 01 '18 at 11:13
@IgnacioVazquez-Abrams I don't really understand the first answer, but the second answer is a clever idea! — spazek, Apr 01 '18 at 11:19
@cas Ah! That's probably the heart of my problem. I will read the link and try to understand. — spazek, Apr 01 '18 at 11:27
If I use "cut" instead of "awk" the speed doubles. But my syntax is probably still too complicated: while read f; do var1=`echo "$f" | cut -d"|" -f1`; test -f "$var1" || echo "$f" | cut -d"|" -f1,2,3 ; done >> /Users/spazek/Desktop/filelist3.txt < /Users/spazek/Desktop/filelist2.txt — spazek, Apr 01 '18 at 15:11

cas · Accepted Answer · 2018-04-09T01:37:53.540

If you have GNU awk version 4 or later installed, it has the ability to load external modules that provide functionality not present in standard awk or even GNU-enhanced awk. It comes with a set of modules, including one called filefuncs. The filefuncs module includes an awk wrapper to the system stat function which can be used to get information about files (including whether they exist or not).

The following awk script loads the filefuncs module, reads each input line, checks the 5th column to decide what path to prepend to each input filename, and checks whether the file exists. If it does, it prints the full path and filename to stdout. If it doesn't, it prints a warning message to stderr.

The paths associative array (AKA a "hash" or "hashed array") and the default pre-pended path are my best guesses about what you intend. Adjust as required. It matches the data in your provided samples (even the obvious mistake with Media -> /Volumes/Logic), not what you said in one of your comments. If your comment is accurate, then the code can be simplified.

#!/usr/bin/awk -f

# this will only work with GNU awk >= version 4.0
@load "filefuncs"

BEGIN {
  FS=OFS="|";
  paths["default"] = "/Users/spazek/Pictures/Fotos Library.photoslibrary/Masters/";
  paths["Logic"] = "/Volumes/Logic/";
  paths["Media"] = "/Volumes/Logic/";
  paths["macOS"] = "/";
}

{ if ($5 in paths) {
    filename = paths[$5] $1;
  } else { # $5 not known in paths array, use a default
    filename = paths["default"] $1;
  }

  # try to stat the file. get the return code in variable 'rc' and error
  # string (if any) in 'error'.
  rc=stat(filename,fstat);
  error=ERRNO;   # oddly, ERRNO is a string, not a number.

  if (rc == -1) {  # stat() returns -1 on failure, e.g. "No such file or directory"
    # print warning to stderr and skip to next input line
    print filename ": " error > "/dev/stderr"
    next;
  };

  # filename exists, do something with filename.
  print filename, $2, $3, $4, $5;
}

Save this as, e.g. ./exists.awk, make it executable with chmod +x (same as you would with a shell script) and run it like this:

./exists.awk /Users/spazek/Desktop/filelist1.txt

or pipe sqlite3 directly into it:

sqlite3  -separator $'|' /Users/spazek/Desktop/xsystx/systphotos.db \
'select RKMaster.imagePath, RKMaster.name, RKMaster.uuid, RKMaster.fileIsReference, ( select RKVolume.name from RKVolume where RKVolume.modelId  = RKMaster.volumeId) from RKMaster' \
  | ./exists.awk
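
For readers without gawk 4, the path-completion step can at least run in a single awk process rather than one per field per line. The following is a minimal sketch in portable shell and awk, not part of the script above: complete_paths and its lib variable are hypothetical names, and the prefix rules are the ones described in the asker's comment (reference 0 means inside the library, volume macOS means the startup disk, anything else lives under /Volumes). Those rules differ from the paths array above for Media. The sketch only rewrites the first column; checking for existence still needs a separate step.

```shell
#!/bin/sh
# Hypothetical helper: complete the path in column 1 of each '|'-separated line.
#   $1    = Masters directory of the photo library (assumed, no trailing slash)
#   stdin = lines as produced by the sqlite3 query
# Note: awk -v interprets backslash escapes, so $1 should not contain backslashes.
complete_paths() {
    awk -F'|' -v OFS='|' -v lib="$1" '
        # reference flag 0: the file is stored inside the library
        $4 == 0       { $1 = lib "/" $1; print; next }
        # startup disk: the path only lacks its leading slash
        $5 == "macOS" { $1 = "/" $1; print; next }
        # any other volume is assumed to be mounted under /Volumes
                      { $1 = "/Volumes/" $5 "/" $1; print }
    '
}
```

It could be used in place of the second step of the question, for example by piping the sqlite3 output into complete_paths "/Users/spazek/Pictures/Fotos Library.photoslibrary/Masters" and redirecting the result to filelist2.txt.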

I don't know what version of awk comes with Mac OS these days. I suspect it's either a BSD awk or some ancient version of GNU awk from before the Free Software Foundation switched to the GPLv3 license. That license change is also why Macs are stuck on the ancient bash v3 rather than the current bash version 4: it's not because Apple can't upgrade bash, it's because they won't. Use brew if you need a later version of GNU bash or awk.

Anyway, if you don't have GNU awk >= v4.0 installed, you can do the same thing with any version of perl.

The following perl script doesn't use any non-standard perl modules or features, and doesn't even need to use perl's built-in stat() function because perl has operators similar to those in sh for testing whether a file exists. We'll use the -e operator here which tests for a file's existence, same as in sh:

#!/usr/bin/perl

use strict;

# declare %paths hash
my %paths = (
  "default" => "/Users/spazek/Pictures/Fotos Library.photoslibrary/Masters/",
  "Media"   => "/Volumes/Logic/",
  "Logic"   => "/Volumes/Logic/",
  "macOS"   => "/",
);

# main loop, read in each line of input and process it.
while(<>) {
  chomp; # strip trailing linefeed from end-of-line
  my $filename='';  # declare $filename to belong to this scope

  # split input on "|" characters
  my ($path,$name,$id,$reference,$diskname) = split /\|/;

  if (defined($paths{$diskname})) {
    $filename = $paths{$diskname} . $path;
  } else {  # diskname not known in %paths hash, use a default
    $filename = $paths{"default"} . $path;
  }

  if (! -e $filename) {
    # print warning to stderr and skip to next input line
    warn "$filename: No such file or directory\n";
    next;
  };

  # filename exists, do something with filename.
  print join('|', $filename, $name, $id, $reference, $diskname), "\n";
}

Again, save it as exists.pl and make it executable with chmod +x. Run as:

./exists.pl /Users/spazek/Desktop/filelist1.txt

Either of these two scripts will be hundreds or thousands of times faster than a shell script using a while read or similar loop.

dave_thompson_085 · Answer 2 · 2018-04-08T09:23:42.890

I concur that gawk4 or perl -- or python -- is a better solution to this problem. However, for future reference and edification, it is possible to make your shell script better, or at least less bad.

First and most important, you don't need to run either awk or cut many many times to split the fields; as long as your fields are separated by a single character, which they are, shell read can do that for you. I'm not sure why you specified the delimiter to awk as [=\|] meaning either equal-sign or vert-rule-aka-pipe, when your data is from a sqlite3 command that uses only vert-rule and never equal-sign. Thus you want to start with something like:

 while IFS='=|' read var1 var2 var3 var4 var5; do ... done <filelist1
 # change IFS='|' if you don't actually need to split on equal-sign 

 # could skip the first temp file, if you don't need it for anything else,
 # with either a pipeline (any shell):
 sqlite3 ... 'select ...' | while IFS.. read ...; do ... done
 # or process substitution (only bash and some others):
 while IFS.. read ...; do ... done < <(sqlite3 ... 'select ...')

It's probably best to add the -r option on read; your example data didn't contain any backslash, but if the actual data ever did it would get corrupted without -r. The pipeline approach is a little more portable but in general a little riskier because it may not work when one needs to set var(s) or make other shell change(s) like cd within the loop that persist after the loop -- but you don't.
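
As a small illustration of the two points above (splitting with IFS and protecting backslashes with -r), here is a hedged sketch. split_line is a hypothetical helper name, and the field names are only my reading of the column list in the question.

```shell
#!/bin/sh
# Hypothetical helper: split each '|'-separated line from stdin using only
# the shell's own read, with no awk or cut per line.
split_line() {
    # IFS='|' applies to this read only; -r keeps backslashes literal
    while IFS='|' read -r path name id reference volume; do
        printf 'path=%s name=%s volume=%s\n' "$path" "$name" "$volume"
    done
}
```

Because IFS contains only the vertical bar, spaces inside a path or name stay part of the field, and an empty fifth column simply leaves volume empty.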

Second, you don't need multiple passes and (so many) intermediate files if you merge the logic:

while IFS.. read -r var1 var2 var3 var4 var5; do 
    if  [ "$var4" == 0 ]; then var1="/Users/spazek/Pictures/Fotos Library.photoslibrary/Masters/$var1"
    elif [ "$var5" == "macOS" ]; then var1="/$var1"
    else var1="/Volumes/$var5/$var1"; fi
    test -f "$var1" || printf 'Name = %s\n Path = %s\n\n' "$var3" "$var1"
done >~/Desktop/MissingPhotos.txt <filelist1 
# or options to avoid filelist1 per above

Finally I would suggest using more meaningful variable names like path name id instead of var1 etc, but that only matters to humans reading the script, such as yourself a few months from now; the computer doesn't care. You can choose lowercase variable names freely for shell variables; by convention environment variables (i.e. shell variables that are exported to programs, and child shells) are uppercase, but then you have to be a little careful not to conflict with some special vars/envvars built-in to the shell or standardized systemwide.
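
With descriptive names, the merged loop might look like the sketch below. It is an illustration under assumptions rather than a drop-in replacement: find_missing is a hypothetical helper, the library directory and the volumes root are passed as parameters (which the original scripts hard-code), and the prefix rules follow the asker's comment.

```shell
#!/bin/sh
# Hypothetical helper: print "full path|id" for every listed file that is missing.
#   $1    = Masters directory of the photo library (assumed, no trailing slash)
#   $2    = directory where external volumes are mounted, e.g. /Volumes
#   stdin = '|'-separated lines: path|name|id|reference|volume
find_missing() {
    library=$1
    volumes=$2
    while IFS='|' read -r path name id reference volume; do
        if [ "$reference" = 0 ]; then
            full="$library/$path"           # stored inside the library
        elif [ "$volume" = "macOS" ]; then
            full="/$path"                   # startup disk: add the leading slash
        else
            full="$volumes/$volume/$path"   # external volume
        fi
        # report only the files that do not exist as regular files
        [ -f "$full" ] || printf '%s|%s\n' "$full" "$id"
    done
}
```

Taking the two directories as arguments also makes the loop easy to try against a scratch directory before pointing it at the real library.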

Since my reputation is < 15, I can't mark the answer as helpful. But it definitely is! — spazek, May 02 '18 at 13:15
