
Below are two example file listings. I need to compare two files that each contain a list of file paths, matching on the file name only: the last "X" characters of each record, to the right of the last "/".

If the file NAME is not found I need the entire row sent to a third file as output.

These are file listings; there might be three files in the second listing and two thousand in the first.
ONE:
1 /home/dev/share/Datafiles/cases.dbf
2 /home/dev/share/Datafiles/cells.csv
3 /home/dev/share/Datafiles/clusters.db
4 /home/dev/share/Datafiles/competition.csv
5 /home/dev/share/Datafiles/coplot.csv
6 /home/dev/share/Datafiles/daphnia.csv
7 /home/dev/share/Datafiles/das.txt
8 /home/dev/share/Datafiles/deaths.sas7bdat
9 /home/dev/share/Datafiles/decay.csv
10 /home/dev/share/Datafiles/example.db
11 /home/dev/share/Datafiles/fertyield.lst
12 /home/dev/share/Datafiles/fisher.csv

TWO:
1 /test/kitchen/cooks/transfer/cases.dbf
2 /test/kitchen/cooks/transfer/cells.csv
3 /test/kitchen/cooks/transfer/clusters.db
4 /test/kitchen/cooks/transfer/coplot.csv
5 /test/kitchen/cooks/transfer/das.txt
6 /test/kitchen/cooks/transfer/deaths.sas7bdat
7 /test/kitchen/cooks/transfer/decay.csv
8 /test/kitchen/cooks/transfer/example.db
9 /test/kitchen/cooks/transfer/fertyield.lst
10 /test/kitchen/cooks/transfer/fisher.csv

Two files that exist in listing ONE are not found in listing TWO: "competition.csv" (#4) and "daphnia.csv" (#6).

Sorting the files does not work: file paths can be very short or very long, and multiple copies of files can be found in multiple directories.

comm, diff, and cmp produced unsatisfactory results, as I'm only looking at the last "X" characters (the file name and extension) at the RIGHT of each row.
(In Microsoft Excel I would simply extract everything to the right of the last "/", row by row, save it to another list, and VLOOKUP that list against the first list.)

But this is not a Microsoft installation.

Could a script read the contents of list (file) two with awk, search through list (file) one, and write the non-matching rows to file three?

Also parsing out the directory names with sed and leaving only two lists of file names has been difficult - don't know what paths I'd be replacing as they would differ every time. I played around with cut, but the start of the file name could be anywhere from column 10 to column 150. My intuition is there has to be a way to isolate all characters to the right of that last "/" in the file path.

Then again, I could be wrong.

1 Answer


Using grep:

grep -F -x -v -f <(grep -o '[^/]*$' file2) <(grep -o '[^/]*$' file1) > file3

The two inner grep commands return the file name part of each line (everything after the last /). The outer grep uses the output of the first inner grep as its pattern file (-f) and the output of the second inner grep as its input.

In other words, it prints the file names from file1 that do not match (-v) any file name from file2. The output is redirected to file3. Option -F matches fixed strings instead of regular expressions, and -x matches whole lines only.

Content of file3:

$ cat file3
competition.csv
daphnia.csv
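
To see what the inner commands feed to the outer grep, the extraction step can be run on its own. This is only a minimal sketch: sample.txt and names.txt are hypothetical file names, and the two sample rows are taken from the listings in the question.

```shell
# Build a small sample listing (hypothetical file name, rows from the question)
printf '%s\n' \
  '1 /home/dev/share/Datafiles/cases.dbf' \
  '5 /test/kitchen/cooks/transfer/das.txt' > sample.txt

# -o prints only the matched part; '[^/]*$' matches the run of
# non-slash characters that ends the line, i.e. the file name
grep -o '[^/]*$' sample.txt > names.txt
cat names.txt
```

This should print cases.dbf and das.txt, one per line. The length of the path in front of the name does not matter, since the pattern is anchored to the end of the line.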
Freddy
  • There is a God. – jumboshrimps Nov 18 '19 at 02:40
  • First, I used it on two almost identical sorted file listings, and that worked. Then I reverse sorted the listings and shazam!!! found the two files I had deleted from the list. Is there any chance at all I could also get the entire row, along with the name of the missing files? That would just be too much. i.e. /home/dev/share/Datafiles/competition.csv THANX!!! – jumboshrimps Nov 18 '19 at 02:45
  • You could remove the -x and omit the second inner grep, so that the outer grep reads the full rows of file1. This also makes the whole command easier to read: grep -F -v -f <(grep -o '[^/]*$' file2) file1 – Freddy Nov 18 '19 at 02:51
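
One caveat with dropping -x: the names from the second list are then matched as substrings anywhere in a row, so a hypothetical name such as ells.csv would also suppress the row for cells.csv. If that matters, an exact comparison on the last "/"-separated field can be sketched with awk. This is not part of the original answer; demo1.txt, demo2.txt and demo3.txt are hypothetical file names standing in for file1, file2 and file3.

```shell
# Sample listings (hypothetical file names, rows taken from the question)
printf '%s\n' \
  '2 /home/dev/share/Datafiles/cells.csv' \
  '4 /home/dev/share/Datafiles/competition.csv' \
  '6 /home/dev/share/Datafiles/daphnia.csv' \
  '7 /home/dev/share/Datafiles/das.txt' > demo1.txt
printf '%s\n' \
  '2 /test/kitchen/cooks/transfer/cells.csv' \
  '5 /test/kitchen/cooks/transfer/das.txt' > demo2.txt

# -F/ splits each row on "/", so $NF is the file name.
# First file (NR == FNR): remember every file name from the second listing.
# Second file: print the whole row if its file name was never seen.
awk -F/ 'NR == FNR { seen[$NF] = 1; next } !($NF in seen)' demo2.txt demo1.txt > demo3.txt
cat demo3.txt
```

With these samples, demo3.txt should contain the full rows for competition.csv and daphnia.csv. Note that the NR == FNR idiom assumes the first file is not empty; with an empty demo2.txt nothing would be printed.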